What is the best way to explain topic modeling to a layman?

Answer by Annalyn Ng:

(Tutorial entry taken from: Annalyzing Life | Data Analytics Tutorials & Experiments for Layman)

Suppose you have the following set of sentences:

  • I eat fish and vegetables.
  • Fish are pets.
  • My kitten eats fish.

Latent Dirichlet allocation (LDA) is a technique that automatically discovers topics that these documents contain.

Given the above sentences, LDA might classify the words “eat”, “fish” and “vegetables” in the first sentence, and “eats” and “fish” in the third, under Topic F, which we might label as “food”. Similarly, “fish” and “pets” in the second sentence and “kitten” in the third might be classified under a separate Topic P, which we might label as “pets”. LDA defines each topic as a bag of words, and you have to label the topics as you deem fit.

Defining topics at the word level brings two benefits:

1) We can infer the content spread of each sentence by a word count:

  • Sentence 1: 100% Topic F
  • Sentence 2: 100% Topic P
  • Sentence 3: 33% Topic P and 67% Topic F

2) We can derive the proportions that each word constitutes in given topics. For example, Topic F might comprise words in the following proportions: 40% eat, 40% fish, 20% vegetables, …
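Both benefits come down to simple counting, which can be sketched in a few lines of Python. The word-to-topic assignments below are hypothetical: they are inferred from the percentages quoted above, function words are left out, and “eats” is reduced to “eat” (an assumption made here so that the two forms count as one word).

```python
from collections import Counter

# Hypothetical assignments: (word, topic) pairs for each sentence.
# "F" stands for Topic F (food) and "P" for Topic P (pets).
sentences = [
    [("eat", "F"), ("fish", "F"), ("vegetables", "F")],
    [("fish", "P"), ("pets", "P")],
    [("kitten", "P"), ("eat", "F"), ("fish", "F")],
]

def topic_mix(sentence):
    """Share of each topic among the assigned words of one sentence."""
    counts = Counter(topic for _, topic in sentence)
    total = sum(counts.values())
    return {topic: count / total for topic, count in counts.items()}

def word_mix(sentences, topic):
    """Share of each word among all words assigned to one topic."""
    counts = Counter(w for s in sentences for w, t in s if t == topic)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

mixes = [topic_mix(s) for s in sentences]   # benefit 1: topics per sentence
food_words = word_mix(sentences, "F")       # benefit 2: words per topic
```

With these assumed assignments, the third sentence comes out as one third Topic P and two thirds Topic F, and Topic F is made up of “eat” and “fish” at 40% each and “vegetables” at 20%.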

LDA achieves the above results in 3 steps.

To illustrate these steps, imagine that you are now discovering topics in documents instead of sentences. Imagine you have 2 documents with the following words:

Step 1

You tell the algorithm how many topics you think there are. You can either use an informed estimate (e.g. results from a previous analysis), or simply use trial and error. In trying different estimates, you may pick the one that generates topics at your desired level of interpretability, or the one that fits the data best statistically (e.g. the highest log-likelihood). In our example above, the number of topics might be inferred just by eyeballing the documents.

Step 2

The algorithm will assign every word to a temporary topic. Topic assignments are temporary as they will be updated in Step 3. Temporary topics are assigned to each word in a semi-random manner (according to a Dirichlet distribution, to be exact). This also means that if a word appears twice, each occurrence may be assigned to a different topic. Note that in analyzing actual documents, function words (e.g. “the”, “and”, “my”) are removed and not assigned to any topic.
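Here is a minimal sketch of this step in Python. It is a simplification: topics are drawn uniformly at random rather than from a Dirichlet distribution, and the function name `initialise`, the tiny stop-word list and the fixed seed are all assumptions made for illustration.

```python
import random

# A tiny, illustrative list of function words to drop (an assumption,
# not a complete stop-word list).
STOP_WORDS = {"i", "and", "are", "my", "the"}

def initialise(documents, num_topics, seed=0):
    """Give every remaining word a temporary topic number, at random."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    assignments = []
    for doc in documents:
        words = [w for w in doc.lower().replace(".", "").split()
                 if w not in STOP_WORDS]
        # Each occurrence is drawn separately, so the same word can
        # receive different topics in different places.
        assignments.append([(w, rng.randrange(num_topics)) for w in words])
    return assignments

documents = ["I eat fish and vegetables.", "Fish are pets.", "My kitten eats fish."]
initial = initialise(documents, num_topics=2)
```

Each entry of `initial` is a list of (word, topic number) pairs for one document, ready to be revised in Step 3.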

Step 3 (iterative)

The algorithm will check and update topic assignments, looping through each word in every document. For each word, its topic assignment is updated based on two criteria:

  • How prevalent is that word across topics?
  • How prevalent are topics in the document?

To understand how these two criteria work, imagine that we are now checking the topic assignment for the word “fish” in Doc Y:

  • How prevalent is that word across topics? Since “fish” words across both documents comprise nearly half of the remaining Topic F words but 0% of the remaining Topic P words, a “fish” word picked at random would more likely be about Topic F.
  • How prevalent are topics in the document? Since the remaining words in Doc Y are assigned to Topic F and Topic P in a 50-50 ratio, the “fish” word being checked seems equally likely to be about either topic.

Weighing conclusions from the two criteria, we would assign the “fish” word of Doc Y to Topic F. Doc Y might then be a document on what to feed kittens.
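One simple way to weigh the two criteria is to multiply them, as in the sketch below. The counts are hypothetical, chosen only to agree with the description above (“fish” makes up 3 of 7 remaining Topic F words and none of the Topic P words, and the rest of Doc Y is split 50-50). The function name `topic_weights` is also just for illustration.

```python
# Hypothetical counts, with the "fish" word being checked left out.
topic_word_counts = {
    "F": {"fish": 3, "eat": 2, "vegetables": 2},
    "P": {"kitten": 1, "pets": 1},
}
doc_y_topic_counts = {"F": 2, "P": 2}

def topic_weights(word, topic_word_counts, doc_topic_counts):
    """Weight of each topic = (share of word in topic) x (share of topic in doc)."""
    doc_total = sum(doc_topic_counts.values())
    weights = {}
    for topic, word_counts in topic_word_counts.items():
        word_share = word_counts.get(word, 0) / sum(word_counts.values())
        topic_share = doc_topic_counts[topic] / doc_total
        weights[topic] = word_share * topic_share
    return weights

weights = topic_weights("fish", topic_word_counts, doc_y_topic_counts)
best_topic = max(weights, key=weights.get)
```

Here the second criterion is a tie, so the first one decides and `best_topic` comes out as Topic F. Practical implementations usually add a small constant to every count so that no topic ever gets a weight of exactly zero.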

The process of checking topic assignment is repeated for each word in every document, cycling through the entire collection of documents multiple times. This iterative updating is the key feature of LDA that generates a final solution with coherent topics.
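For readers who like to see the loop itself, here is a rough sketch using collapsed Gibbs sampling, one common way of carrying out this kind of repeated updating. It is not necessarily what any particular LDA package does. The function name `gibbs_lda`, the smoothing constants `alpha` and `beta`, the number of sweeps and the seed are all assumptions chosen for illustration.

```python
import random

def gibbs_lda(docs, num_topics, sweeps=200, alpha=0.1, beta=0.01, seed=0):
    """Repeatedly re-draw the topic of every word in every document."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Step 2: temporary topics, drawn uniformly at random (a simplification).
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # Running counts: topics per document, words per topic, words per topic in total.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            doc_topic[d][t] += 1
            topic_word[t][w] = topic_word[t].get(w, 0) + 1
            topic_total[t] += 1

    # Step 3: cycle through the whole collection many times.
    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Set the current word aside so it does not vote for itself.
                t = z[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # (word's share of topic) x (topic's presence in document),
                # each smoothed so that no weight is exactly zero.
                weights = [
                    (topic_word[k].get(w, 0) + beta)
                    / (topic_total[k] + beta * vocab_size)
                    * (doc_topic[d][k] + alpha)
                    for k in range(num_topics)
                ]
                # Draw the new topic at random, in proportion to the weights.
                t = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] = topic_word[t].get(w, 0) + 1
                topic_total[t] += 1
    return z, doc_topic

docs = [["eat", "fish", "vegetables"], ["fish", "pets"], ["kitten", "eat", "fish"]]
assignments, doc_topic = gibbs_lda(docs, num_topics=2)
```

Note that the new topic is drawn at random in proportion to the weights rather than always taking the largest one, so different seeds can give different results, especially on a collection this small.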

(Credits to Edwin Chen for the sentence-problem definition approach.)

For an example application of topic modeling on news articles, see Automated Biography of a Nation.

For more tutorials, visit my site: Annalyzing Life | Data Analytics Tutorials & Experiments for Layman
