Text analysis: topic modeling

MACS 30500
University of Chicago

November 9, 2016

Topic modeling

  • Keywords
  • Links
  • Themes
  • Probabilistic topic models
    • Latent Dirichlet allocation

Food and animals

  1. I ate a banana and spinach smoothie for breakfast.
  2. I like to eat broccoli and bananas.
  3. Chinchillas and kittens are cute.
  4. My sister adopted a kitten yesterday.
  5. Look at this cute hamster munching on a piece of broccoli.

LDA document structure

  • Decide on the number of words \(N\) the document will have
  • Generate each word in the document:
    • Pick a topic (according to the document's mixture of topics)
    • Generate the word (by drawing from that topic's distribution over words)
  • LDA backtracks from this generative process: given the documents we actually observe, it infers the topics and topic mixtures most likely to have produced them

Food and animals

  • Decide that \(D\) will be 1/2 about food and 1/2 about cute animals.
  • Pick 5 to be the number of words in \(D\).
  • Pick the first word to come from the food topic
  • Pick the second word to come from the cute animals topic
  • Pick the third word to come from the cute animals topic
  • Pick the fourth word to come from the food topic
  • Pick the fifth word to come from the food topic
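
The same story as a base-R simulation; the word lists and the 50/50 mixture come from the example above, everything else (seed, object names) is illustrative:

set.seed(123)

# two topics, each a small bag of words from the example sentences
topics <- list(
  food    = c("banana", "spinach", "smoothie", "broccoli", "eat"),
  animals = c("chinchilla", "kitten", "cute", "hamster", "munching")
)

# D is 1/2 about food and 1/2 about cute animals
topic_mix <- c(food = 0.5, animals = 0.5)

# generate a 5-word document: pick a topic, then draw a word from it
n_words <- 5
doc <- character(n_words)
for (i in seq_len(n_words)) {
  t <- sample(names(topic_mix), 1, prob = topic_mix)
  doc[i] <- sample(topics[[t]], 1)
}
doc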

How does LDA learn?

  • Randomly assign each word in each document to one of \(K\) topics
  • For each document \(d\):
    • Go through each word \(w\) in \(d\)
      • For each topic \(t\), compute two things:
        1. \(p(t | d)\): the proportion of words in \(d\) currently assigned to topic \(t\)
        2. \(p(w | t)\): the proportion of assignments to topic \(t\), across all documents, that come from word \(w\)
      • Reassign \(w\) to a new topic \(t\) with probability proportional to \(p(t|d) \times p(w|t)\)
  • Rinse and repeat until the assignments stabilize

  • Estimate from LDA
    1. The topic mixture of each document
    2. The words associated with each topic
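
A bare-bones base-R sketch of this resampling loop; the function name gibbs_lda, the uniform add-one smoothing (symmetric Dirichlet priors), and the toy corpus are illustrative assumptions, not any particular package's implementation:

gibbs_lda <- function(docs, K, V, iters = 50) {
  # randomly assign each word in each document to one of K topics
  z <- lapply(docs, function(d) sample.int(K, length(d), replace = TRUE))
  for (iter in seq_len(iters)) {
    for (d in seq_along(docs)) {
      for (i in seq_along(docs[[d]])) {
        w <- docs[[d]][i]
        z[[d]][i] <- NA_integer_          # drop w's current assignment
        zs <- unlist(z)
        ws <- unlist(docs)
        # p(t | d): words in d currently assigned to each topic
        p_td <- tabulate(z[[d]], K) + 1
        # p(w | t): share of each topic's assignments that are word w
        p_wt <- (tabulate(zs[ws == w], K) + 1) / (tabulate(zs, K) + V)
        # reassign w with probability proportional to p(t|d) * p(w|t)
        z[[d]][i] <- sample.int(K, 1, prob = p_td * p_wt)
      }
    }
  }
  z
}

# toy corpus: documents as integer word IDs from a vocabulary of size 6
docs <- list(c(1, 2, 1, 3), c(4, 5, 4, 6), c(1, 3, 5, 4))
gibbs_lda(docs, K = 2, V = 6)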

LDA with a known topic structure

  • Great Expectations by Charles Dickens
  • The War of the Worlds by H.G. Wells
  • Twenty Thousand Leagues Under the Sea by Jules Verne
  • Pride and Prejudice by Jane Austen

topicmodels
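
A hedged sketch of how the chapter-level document-term matrix printed below might be built with gutenbergr and tidytext; the chapter-heading regex and the exact preprocessing steps are assumptions, not the only possible choices:

library(dplyr)
library(tidyr)
library(stringr)
library(tidytext)
library(gutenbergr)
library(topicmodels)

titles <- c("Great Expectations", "Pride and Prejudice",
            "The War of the Worlds",
            "Twenty Thousand Leagues under the Sea")

# download the four novels from Project Gutenberg
books <- gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title")

# split each book into chapters and treat every chapter as a "document"
chapters_dtm <- books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(document, word) %>%
  cast_dtm(document, word, n)

chapters_dtm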

## <<DocumentTermMatrix (documents: 193, terms: 18215)>>
## Non-/sparse entries: 104721/3410774
## Sparsity           : 97%
## Maximal term length: 19
## Weighting          : term frequency (tf)

Terms associated with each topic
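
A sketch of fitting the model and inspecting each topic's most probable words, assuming the chapters_dtm built above; the choice of five terms and the seed are illustrative:

# fit a four-topic model, hoping to recover one topic per book
chapters_lda <- LDA(chapters_dtm, k = 4, control = list(seed = 1234))

# beta: per-topic word probabilities
chapter_topics <- tidy(chapters_lda, matrix = "beta")

# plot each topic's five most probable words
library(ggplot2)
chapter_topics %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta)) +
  geom_col() +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()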

Per-document classification
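
A sketch of extracting per-chapter topic probabilities and classifying each chapter by its dominant topic, assuming the chapters_lda model from above (chapter_classifications is reused on the following slides):

# gamma: per-document (here, per-chapter) topic probabilities
chapters_gamma <- tidy(chapters_lda, matrix = "gamma") %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE)

# classify each chapter by its single most probable topic
chapter_classifications <- chapters_gamma %>%
  group_by(title, chapter) %>%
  top_n(1, gamma) %>%
  ungroup()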

Consensus topic
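
The table below pairs each topic with its consensus book. A sketch of how it might be computed from chapter_classifications:

# pair each book with the topic claiming most of its chapters,
# yielding a topic -> consensus-book mapping
book_topics <- chapter_classifications %>%
  count(title, topic) %>%
  group_by(title) %>%
  top_n(1, n) %>%
  ungroup() %>%
  transmute(consensus = title, topic)

book_topics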

## # A tibble: 4 × 2
##                               consensus topic
##                                   <chr> <int>
## 1                    Great Expectations     4
## 2                   Pride and Prejudice     1
## 3                 The War of the Worlds     3
## 4 Twenty Thousand Leagues under the Sea     2

Mis-identification

# tally how each book's chapters ended up being classified
chapter_classifications %>%
  inner_join(book_topics, by = "topic") %>%
  count(title, consensus) %>%
  knitr::kable()
|title                                 |consensus                             |  n|
|:-------------------------------------|:-------------------------------------|--:|
|Great Expectations                    |Great Expectations                    | 57|
|Great Expectations                    |Pride and Prejudice                   |  1|
|Great Expectations                    |The War of the Worlds                 |  1|
|Pride and Prejudice                   |Pride and Prejudice                   | 61|
|The War of the Worlds                 |The War of the Worlds                 | 27|
|Twenty Thousand Leagues under the Sea |Twenty Thousand Leagues under the Sea | 46|

Incorrectly classified words
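
This word-level confusion matrix counts, for every word token, the book it actually came from (rows) against the consensus book of the topic it was assigned to (columns). A sketch of how it might be built with tidytext's augment(), assuming the objects from the previous slides:

# augment() attaches the assigned topic to each (document, term, count) row
assignments <- augment(chapters_lda, data = chapters_dtm) %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE) %>%
  inner_join(book_topics, by = c(".topic" = "topic"))

# cross-tabulate true book against consensus book, weighting by word count
assignments %>%
  count(title, consensus, wt = count) %>%
  spread(consensus, n, fill = 0)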

|title                                 | Great Expectations| Pride and Prejudice| The War of the Worlds| Twenty Thousand Leagues under the Sea|
|:-------------------------------------|------------------:|-------------------:|---------------------:|-------------------------------------:|
|Great Expectations                    |              49770|                3876|                  1845|                                    77|
|Pride and Prejudice                   |                  1|               37229|                     7|                                     5|
|The War of the Worlds                 |                  0|                   0|                 22561|                                     7|
|Twenty Thousand Leagues under the Sea |                  0|                   5|                     0|                                 39629|

Most commonly mistaken words
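
A sketch of pulling out the offending words, assuming the assignments frame from the previous slide:

# words whose assigned topic's consensus book differs from their true book
wrong_words <- assignments %>%
  filter(title != consensus) %>%
  count(title, consensus, term, wt = count) %>%
  arrange(desc(n))

wrong_words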

## # A tibble: 3,500 × 4
##                 title             consensus     term     n
##                 <chr>                 <chr>    <chr> <dbl>
## 1  Great Expectations   Pride and Prejudice     love    44
## 2  Great Expectations   Pride and Prejudice sergeant    37
## 3  Great Expectations   Pride and Prejudice     lady    32
## 4  Great Expectations   Pride and Prejudice     miss    26
## 5  Great Expectations The War of the Worlds     boat    25
## 6  Great Expectations   Pride and Prejudice   father    19
## 7  Great Expectations The War of the Worlds    water    19
## 8  Great Expectations   Pride and Prejudice     baby    18
## 9  Great Expectations   Pride and Prejudice  flopson    18
## 10 Great Expectations   Pride and Prejudice   family    16
## # ... with 3,490 more rows

LDA with an unknown topic structure

Associated Press articles
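
The AP corpus ships with the topicmodels package as a ready-made document-term matrix:

library(topicmodels)

# 2,246 Associated Press articles as a document-term matrix
data("AssociatedPress")
AssociatedPress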

## <<DocumentTermMatrix (documents: 2246, terms: 10134)>>
## Non-/sparse entries: 259208/22501756
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)

Perplexity

  • A statistical measure of how well a probability model predicts a sample
  • Compares the word distributions implied by the fitted topics to the actual distribution of words in the documents
  • Lower perplexity means the model describes the documents better
  • Perplexity for the 12-topic LDA model: 2301.8138725
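
topicmodels provides a perplexity() function for fitted models. A sketch comparing several candidate numbers of topics; the grid of k values and the seed are illustrative assumptions:

n_topics <- c(2, 4, 12, 20, 50)

# refit the model for each candidate k (slow for large k)
ap_models <- lapply(n_topics, function(k) {
  LDA(AssociatedPress, k = k, control = list(seed = 1234))
})

# lower perplexity = better predictive fit
data.frame(k = n_topics, perplexity = sapply(ap_models, perplexity))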

Topics from \(k=100\)

Acknowledgments