
Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a generative probabilistic model that represents documents as mixtures of latent topics drawn from Dirichlet distributions, providing the foundational framework for modern topic modeling.

P(w,z,θ,φ|α,β) = Π_k Dir(φ_k|β) · Π_d [Dir(θ_d|α) · Π_n P(z_{dn}|θ_d) · P(w_{dn}|φ_{z_{dn}})]

Latent Dirichlet Allocation, introduced by David Blei, Andrew Ng, and Michael Jordan in 2003, is the most widely used topic model in computational linguistics and machine learning. LDA treats each document as a bag of words generated by a mixture of latent topics, where each topic is a probability distribution over the vocabulary and each document exhibits a characteristic blend of topics. The use of Dirichlet priors over both topic proportions and word distributions gives the model desirable smoothing properties and enables principled Bayesian inference. LDA has become a standard tool for exploratory text analysis across disciplines.

Generative Model

LDA Generative Process
Hyperparameters: α (Dirichlet prior on topic proportions), β (Dirichlet prior on word distributions), K (number of topics)

For each topic k = 1, …, K:
    φ_k ~ Dirichlet(β) — word distribution for topic k

For each document d = 1, …, D:
    θ_d ~ Dirichlet(α) — topic proportions for document d
    For each word position n = 1, …, N_d:
        z_{d,n} ~ Multinomial(θ_d) — topic assignment
        w_{d,n} ~ Multinomial(φ_{z_{d,n}}) — observed word

The generative story of LDA proceeds as follows. First, K topic-word distributions are drawn from a Dirichlet prior with parameter β. Then, for each document, a topic proportion vector is drawn from a Dirichlet prior with parameter α. Each word in the document is generated by first sampling a topic from the document's topic proportions and then sampling a word from that topic's word distribution. The observed words are the only visible variables; the topics, topic assignments, and proportions are all latent and must be inferred from the data. A small α encourages sparse topic mixtures (documents address few topics), while a small β encourages sparse word distributions (topics use distinctive vocabularies).
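The generative process above can be simulated directly. The sketch below draws a small synthetic corpus with NumPy; the toy dimensions (K, V, D, N_d) and the random seed are arbitrary choices for illustration, not values from any standard benchmark:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions, chosen for illustration only.
K, V, D, N_d = 3, 20, 5, 50       # topics, vocabulary size, documents, words/doc
alpha, beta = 0.1, 0.01           # sparse Dirichlet priors

# For each topic k: phi_k ~ Dirichlet(beta), a distribution over the vocabulary.
phi = rng.dirichlet(np.full(V, beta), size=K)   # shape (K, V)

docs = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))  # topic proportions for document d
    z = rng.choice(K, size=N_d, p=theta_d)      # z_{d,n} ~ Multinomial(theta_d)
    words = [int(rng.choice(V, p=phi[k])) for k in z]  # w_{d,n} ~ Multinomial(phi_z)
    docs.append(words)
```

With the small α used here, most sampled documents concentrate their mass on one or two topics, which is exactly the sparsity the priors are meant to induce.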

Inference Algorithms

Since exact posterior inference in LDA is intractable, several approximate inference methods have been developed. The original paper used variational expectation-maximization, which approximates the posterior with a factorized distribution and iteratively optimizes a lower bound on the log evidence. Collapsed Gibbs sampling (Griffiths and Steyvers, 2004) integrates out the topic and word distributions analytically and samples only the topic assignment variables, providing a simpler implementation that is often preferred in practice. The conditional distribution for resampling each topic assignment has a closed form: the probability that word w in document d is assigned to topic k is proportional to (n_{k,w} + β) / (n_k + Vβ) · (n_{d,k} + α), where n_{k,w} is the number of times word w is assigned to topic k, n_k is the total number of words assigned to topic k, n_{d,k} is the number of words in document d assigned to topic k, V is the vocabulary size, and all counts exclude the current assignment.
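A collapsed Gibbs sampler can be sketched in a few dozen lines. This is an illustrative implementation of the sampling scheme described above, not the reference code of Griffiths and Steyvers; the function name and count-array names are our own:

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampler for LDA (illustrative sketch).

    docs: list of lists of word ids in [0, V).
    Returns topic assignments z and the count matrices n_dk, n_kw.
    """
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))   # words in document d assigned to topic k
    n_kw = np.zeros((K, V))           # times word w is assigned to topic k
    n_k = np.zeros(K)                 # total words assigned to topic k
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # random init
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                # Remove the current assignment from the counts.
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # Closed-form conditional: the word's smoothed frequency in
                # each topic times the topic's smoothed frequency in the doc.
                p = (n_kw[:, w] + beta) / (n_k + V * beta) * (n_dk[d] + alpha)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return z, n_dk, n_kw
```

After burn-in, the smoothed counts give point estimates of the latent distributions: φ̂_{k,w} ∝ n_{k,w} + β and θ̂_{d,k} ∝ n_{d,k} + α.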

Choosing the Number of Topics

Selecting the number of topics K is a fundamental model selection problem. Approaches include held-out perplexity (selecting K that minimizes perplexity on unseen documents), Bayesian model comparison using harmonic mean estimators or variational bounds, and coherence-based selection (choosing K that maximizes average topic coherence). The Hierarchical Dirichlet Process (Teh et al., 2006) avoids the problem entirely by placing a nonparametric prior over the number of topics, allowing K to be inferred from data. In practice, domain knowledge and the intended application often guide the choice of K as much as formal criteria.
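Held-out perplexity, the first criterion above, can be sketched as follows. The fitted parameters `theta` and `phi` could come from any inference procedure; their names and shapes are our assumptions for illustration:

```python
import numpy as np

def perplexity(docs, theta, phi):
    """Held-out perplexity of docs (lists of word ids) under a fitted model.

    theta: (D, K) per-document topic proportions; phi: (K, V) topic-word
    distributions. Lower is better; compare across candidate values of K.
    """
    log_lik, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        p_w = theta[d] @ phi              # mixture distribution over words
        log_lik += float(np.log(p_w[doc]).sum())
        n_tokens += len(doc)
    return float(np.exp(-log_lik / n_tokens))
```

As a sanity check, a completely uninformative model (uniform θ and φ) yields a perplexity equal to the vocabulary size V.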

Variations and Legacy

LDA has spawned a vast family of extensions. The Correlated Topic Model (Blei and Lafferty, 2007) replaces the Dirichlet prior with a logistic normal distribution to capture correlations between topics. Supervised LDA (McAuliffe and Blei, 2008) incorporates document labels to learn topics that are predictive of response variables. Dynamic Topic Models track topic evolution over time using state-space models. Online LDA (Hoffman et al., 2010) enables efficient inference on streaming data by processing documents in mini-batches. The Structural Topic Model (Roberts et al., 2014), popular in political science, incorporates document-level covariates into both topic prevalence and content.

LDA's influence extends well beyond its original text modeling application. The same generative framework has been applied to image analysis (modeling images as mixtures of visual "topics"), genomics (discovering motif patterns in DNA sequences), social network analysis (discovering communities), and recommendation systems (modeling user preferences as mixtures of latent factors). While neural topic models and pre-trained embeddings have offered competitive alternatives, LDA remains widely used due to its interpretability, sound probabilistic foundations, and the availability of efficient, well-tested implementations across multiple programming languages and platforms.



References

  1. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
  2. Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1), 5228–5235. doi:10.1073/pnas.0307752101
  3. Hoffman, M. D., Blei, D. M., & Bach, F. (2010). Online learning for latent Dirichlet allocation. Advances in Neural Information Processing Systems, 23, 856–864.
  4. Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84. doi:10.1145/2133806.2133826
