Latent Dirichlet Allocation (Blei, Ng, Jordan, 2003) is the canonical probabilistic topic model. It fixes the overfitting of its predecessor pLSA by placing Dirichlet priors on both the document-topic and topic-word distributions. Every modern topic model still gets compared to LDA as a baseline.

For the broader picture — when LDA makes sense versus other methods — see the Topic Modeling overview.

Generative process

A generative process is a recipe for making synthetic data. LDA posits that each document in the corpus was produced by sampling from a specific hierarchy of distributions. Inference inverts the recipe: given the observed words, recover the most likely distributions that generated them.

Dirichlet distribution

A distribution over distributions on the (K−1)-dimensional simplex. A sample from Dir(α) is a probability vector θ = (θ_1, …, θ_K) with Σ_i θ_i = 1. The concentration parameter α controls sparsity: small α (< 1) makes most of the mass sit on a few components; large α (> 1) produces nearly uniform vectors.
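The sparsity effect of α is easy to check numerically. A minimal standard-library sketch that draws symmetric Dirichlet samples by normalizing independent Gamma draws (sizes and seeds here are illustrative):

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) in k dimensions
    by normalizing independent Gamma(alpha, 1) draws."""
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(0)
k = 10

# Small alpha: most of the probability mass lands on a few components.
sparse = sample_dirichlet(0.05, k, rng)
# Large alpha: the vector is close to uniform (each entry near 1/k).
flat = sample_dirichlet(50.0, k, rng)

print(max(sparse), max(flat))
```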

Generative story:

  1. For each topic k = 1, …, K: draw a word distribution φ_k ~ Dirichlet(β)
  2. For each document d:
    • Draw a topic distribution θ_d ~ Dirichlet(α)
    • For each word position n:
      • Sample a topic z_d,n ~ Categorical(θ_d)
      • Sample a word w_d,n ~ Categorical(φ_z_d,n)
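The full generative story can be simulated in a few lines. A self-contained sketch with a toy vocabulary — all sizes and hyperparameter values here are illustrative, not from the source:

```python
import random

rng = random.Random(42)

def dirichlet(alpha_vec):
    """Sample from Dirichlet(alpha_vec) by normalizing Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha_vec]
    s = sum(g)
    return [x / s for x in g]

def categorical(probs):
    """Sample an index from a probability vector."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

V, K, D, N = 8, 3, 5, 20   # vocab size, topics, documents, words/doc
alpha, beta = 0.1, 0.01    # concentration hyperparameters

# Step 1: one word distribution phi_k per topic.
phi = [dirichlet([beta] * V) for _ in range(K)]

docs = []
for d in range(D):
    theta = dirichlet([alpha] * K)      # step 2a: topic mixture for doc d
    words = []
    for n in range(N):
        z = categorical(theta)          # step 2b-i: topic for position n
        w = categorical(phi[z])         # step 2b-ii: word from that topic
        words.append(w)
    docs.append(words)

print(docs[0])
```

Inference, described below, runs this recipe backwards: given `docs`, recover plausible `phi` and `theta`.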

Dependency structure:

graph LR
    a(("α")) --> t["θ_d<br/>(doc-topic)"]
    t --> z["z_d,n<br/>(topic)"]
    z --> w["w_d,n<br/>(word)"]
    p["φ_k<br/>(topic-word)"] --> w
    b(("β")) --> p
    classDef prior fill:#f5f5f5,stroke:#999,stroke-dasharray:3 3
    classDef latent fill:#eef,stroke:#447
    classDef observed fill:#efe,stroke:#474
    class a,b prior
    class t,z,p latent
    class w observed

Formal plate notation (with M documents, K topics, and N words per document) is the standard way to draw this in the literature — see the Wikipedia LDA page for the canonical figure.

Inference

Inference recovers the hidden θ_d and φ_k from the observed words. Three common approaches:

  • Collapsed Gibbs sampling — analytically integrates out θ and φ, then iteratively reassigns each word’s topic by sampling from the conditional posterior. Used in Mallet, and in Gensim via its Mallet wrapper. More accurate, but slow on large corpora.
  • Variational Bayes — approximates the posterior with a tractable family and optimizes the Evidence Lower Bound (ELBO). Used in scikit-learn. Faster, but the approximation can bias results.
  • Online Variational Bayes (Hoffman et al., 2010) — processes documents in mini-batches, enabling streaming updates. The default for LdaModel in recent Gensim versions.

Mallet is often preferred when users want collapsed Gibbs sampling and are willing to trade speed for topic quality.
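To make the collapsed Gibbs update concrete, here is a minimal pure-Python sampler on a toy corpus — an illustrative sketch of the algorithm, not Mallet’s or Gensim’s implementation, with hand-picked hyperparameters:

```python
import random

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.

    docs: list of documents, each a list of integer word ids in [0, V).
    Only the count tables that survive integrating out theta and phi remain:
      n_dk[d][k] = words in doc d assigned to topic k
      n_kw[k][w] = times word w is assigned to topic k
      n_k[k]     = total words assigned to topic k
    """
    rng = random.Random(seed)
    n_dk = [[0] * K for _ in docs]
    n_kw = [[0] * V for _ in range(K)]
    n_k = [0] * K
    z = []  # current topic assignment of every token

    # Random initialization.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(K)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                # Remove this token from the counts...
                n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
                # ...then resample its topic from the conditional posterior.
                weights = [
                    (n_dk[d][j] + alpha) * (n_kw[j][w] + beta) / (n_k[j] + V * beta)
                    for j in range(K)
                ]
                total = sum(weights)
                r, acc, k = rng.random() * total, 0.0, K - 1
                for j, wt in enumerate(weights):
                    acc += wt
                    if r < acc:
                        k = j
                        break
                z[d][n] = k
                n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
    return n_dk, n_kw

# Toy corpus: two word "themes" over a 6-word vocabulary.
docs = [[0, 1, 2, 0, 1], [1, 0, 2, 2, 0], [3, 4, 5, 3, 4], [4, 5, 3, 5, 4]]
n_dk, n_kw = gibbs_lda(docs, V=6, K=2)
```

The count tables can then be smoothed and normalized to estimate θ_d and φ_k.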

Hyperparameters

| Parameter | Meaning | Default | Effect |
|---|---|---|---|
| K (num_topics) | number of topics | must specify | too few → merged themes; too many → duplicated topics |
| α (alpha) | document-topic concentration | library-dependent (often 1/K) | low (0.01–0.1) → each document concentrates on a few topics |
| β (beta / eta / topic_word_prior) | topic-word concentration | library-dependent (often 1/K) | low → each topic concentrates on a few characteristic words |

For short texts (tweets, product titles), the default α produces near-uniform topic distributions and renders the model useless. Manually set α below the default and consider letting Gensim auto-tune it (alpha='auto').

Why counts instead of TF-IDF

LDA is a probabilistic generative model over words sampled from multinomial distributions, which means the math requires integer counts. TF-IDF values are real-valued weights, so they don’t fit the generative story. Using TF-IDF with LDA technically works but typically gives worse topics than raw counts.

Contrast with NMF, which is a matrix factorization objective with no distributional assumptions, so TF-IDF works well there.
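A small illustration of the difference — toy documents, with the simplest count × log(N/df) TF-IDF variant (other weightings exist):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]

# Raw counts: non-negative integers -- valid draws from a multinomial,
# which is exactly what LDA's generative story assumes.
counts = [Counter(doc) for doc in docs]

# TF-IDF: real-valued weights -- fine for NMF, but they no longer mean
# "number of times word w was sampled", so they break LDA's story.
df = Counter(w for doc in docs for w in set(doc))
N = len(docs)
tfidf = [
    {w: c * math.log(N / df[w]) for w, c in doc.items()}
    for doc in counts
]

print(counts[0]["the"])           # 2: an integer count
print(round(tfidf[0]["cat"], 3))  # 0.693: a real-valued weight (= log 2)
```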

Inductive inference on new documents

LDA uses folding-in: freeze the topic-word distributions φ learned during training, then run variational inference (or a few Gibbs sweeps) on the new document alone to infer its topic distribution θ_new. Contrast with BERTopic, where assignment is a single nearest-centroid lookup after the embedding is computed.
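A sketch of folding-in via Gibbs sweeps, with a hand-made frozen φ (the topic-word values and hyperparameters are illustrative, not from a trained model):

```python
import random

def fold_in(new_doc, phi, alpha=0.1, sweeps=20, seed=0):
    """Infer theta for one unseen document with phi frozen:
    a few Gibbs sweeps over this document's tokens only."""
    rng = random.Random(seed)
    K = len(phi)
    n_k = [0] * K   # topic counts within this document
    z = []
    for w in new_doc:
        k = rng.randrange(K)
        z.append(k)
        n_k[k] += 1
    for _ in range(sweeps):
        for n, w in enumerate(new_doc):
            k = z[n]
            n_k[k] -= 1
            # phi is fixed, so the conditional involves only this doc's counts.
            weights = [(n_k[j] + alpha) * phi[j][w] for j in range(K)]
            total = sum(weights)
            r, acc, k = rng.random() * total, 0.0, K - 1
            for j, wt in enumerate(weights):
                acc += wt
                if r < acc:
                    k = j
                    break
            z[n] = k
            n_k[k] += 1
    total = len(new_doc) + K * alpha
    return [(n_k[k] + alpha) / total for k in range(K)]

# Frozen topic-word distributions from "training" (hand-made here):
phi = [[0.30, 0.30, 0.30, 0.03, 0.03, 0.04],   # topic 0 favors words 0-2
       [0.03, 0.03, 0.04, 0.30, 0.30, 0.30]]   # topic 1 favors words 3-5
theta = fold_in([0, 1, 2, 0, 1], phi)
print(theta)  # heavily weighted toward topic 0
```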

Guided and supervised variants

Real-world projects often have partial domain knowledge — you know one topic should be about “payments” and another about “refunds”, and you want the model to respect that.

  • Seeded/Guided LDA — pre-set a few words per topic as anchors. The model is biased toward producing topics that contain those anchor words, while still discovering the remaining topics freely. Python library: GuidedLDA.
  • CorEx (Correlation Explanation) — not strictly LDA, but an information-theoretic alternative that accepts anchor words per topic and is often easier to steer than Seeded LDA. Library: corextopic.
  • Labeled LDA / SLDA — supervised variants that condition topics on document labels. Useful when you have partial labels and want topic discovery within each labeled group.

If you mainly want interpretable themes biased by a seed list, CorEx is frequently the best starting point; if you want a probabilistic LDA-style model with anchors, use Seeded LDA.
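One common mechanism behind seeding is inflating the topic-word prior pseudo-counts for the anchor words, which biases — but does not force — those topics toward them. A sketch of building such an asymmetric prior (vocabulary and anchors here are hypothetical; Gensim’s LdaModel, for instance, accepts a full K × V array for eta in this shape):

```python
# Hypothetical vocabulary and anchor words for a support-ticket corpus.
vocab = ["card", "charge", "refund", "return", "ship", "delay"]
anchors = {0: ["card", "charge"],     # topic 0 should lean "payments"
           1: ["refund", "return"]}   # topic 1 should lean "refunds"

K, V = 3, len(vocab)
base_beta, boost = 0.01, 1.0

# One Dirichlet prior row per topic; seed words get extra pseudo-counts.
eta = [[base_beta] * V for _ in range(K)]
for k, words in anchors.items():
    for w in words:
        eta[k][vocab.index(w)] += boost

for row in eta:
    print(row)
```

Topic 2 keeps a flat prior, so it remains free to pick up whatever theme the anchored topics leave behind.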

Advantages and disadvantages

Advantages: theoretically grounded, interpretable per-document topic mixtures, well-studied, cheap to train and serve compared to embedding-based methods, mature library support.

Disadvantages: must pre-specify K, bag-of-words ignores word order, results vary across runs (Gibbs sampling is stochastic — fix the random seed for reproducibility), sensitive to preprocessing, struggles on short texts, cannot capture semantic similarity (for LDA, “car” and “automobile” are just two unrelated tokens).

When LDA still earns its keep

  • You want explicit probabilistic per-document topic mixtures (not hard cluster assignments).
  • Your documents are medium-length and well-formed (news articles, papers, reports).
  • You need a model that’s cheap to train and serve, with predictable inference cost per document.
  • You have an established pipeline and changing tools would cost more than improving LDA.
  • You want topic distributions as features for a downstream classifier — LDA’s θ_d vectors are natural inputs.

When LDA fails, the next steps are: NMF (for cleaner topics on short/noisy text), BERTopic (for semantic similarity), or ETM (for large vocabularies with rare words).