LDA

Latent Dirichlet Allocation (Blei, Ng, Jordan, 2003) is the canonical probabilistic topic model. It addresses the overfitting of its predecessor pLSA by placing Dirichlet priors on both the document-topic and topic-word distributions. Most modern topic models are still compared against LDA as a baseline.

For the broader picture (when LDA makes sense versus other methods), see Topic Modeling.

Generative process

A generative process is a recipe for making synthetic data. LDA posits that each document in the corpus was produced by sampling from a specific hierarchy of distributions. Inference inverts the recipe: given the observed words, recover the most likely distributions that generated them.

Dirichlet distribution

A distribution over probability vectors: its support is the simplex of $K$-component vectors that are non-negative and sum to 1. A sample from $Dir(α)$ is a probability vector $(θ_{1}, \dots, θ_{K})$ with $\sum_{k} θ_{k} = 1$. The concentration parameter $α$ controls sparsity: small $α$ (< 1) makes most of the mass sit on a few components; large $α$ (> 1) produces nearly uniform vectors; $α = 1$ is the uniform distribution over the simplex.
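
The effect of the concentration parameter can be checked empirically. The sketch below is a minimal illustration using only the Python standard library; the helper names (`sample_dirichlet`, `mean_largest_component`) and the chosen values of $α$ and $K$ are assumptions made for this example. It draws symmetric Dirichlet samples by normalizing independent Gamma draws and reports how large the biggest component is on average.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k components."""
    # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [g / total for g in draws]

def mean_largest_component(alpha, k=10, n_samples=2000, seed=0):
    """Average size of the largest component: a rough measure of peakedness."""
    rng = random.Random(seed)
    peaks = [max(sample_dirichlet(alpha, k, rng)) for _ in range(n_samples)]
    return sum(peaks) / n_samples

sparse_peak = mean_largest_component(alpha=0.1)   # small alpha: mass on few components
dense_peak = mean_largest_component(alpha=10.0)   # large alpha: close to uniform (1/k each)
print(f"alpha=0.1  -> mean largest component {sparse_peak:.2f}")
print(f"alpha=10.0 -> mean largest component {dense_peak:.2f}")
```

With a small $α$ the largest component should take most of the mass, while with a large $α$ it should stay close to $1/K$.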

Generative story:

For each topic $k = 1, \dots, K$: draw a word distribution $ϕ_{k} \sim Dir(β)$
For each document $d$:
- Draw a topic distribution $θ_{d} \sim Dir(α)$
- For each word position $n = 1, \dots, N_{d}$:
  - Sample a topic $z_{d, n} \sim Multinomial(θ_{d})$
  - Sample a word $w_{d, n} \sim Multinomial(ϕ_{z_{d, n}})$
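
The generative story above can be run forward to produce synthetic documents. The following is a minimal sketch using only the standard library; the toy vocabulary, the function name `generate_corpus`, and all parameter values are hypothetical and chosen only for illustration.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Symmetric Dirichlet sample via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [g / total for g in draws]

def generate_corpus(n_docs, doc_len, vocab, n_topics, alpha, beta, seed=0):
    """Sample documents by following the LDA generative story step by step."""
    rng = random.Random(seed)
    # For each topic k: draw a word distribution phi_k ~ Dir(beta).
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(n_topics)]
    docs, thetas = [], []
    for _ in range(n_docs):
        # For each document d: draw a topic distribution theta_d ~ Dir(alpha).
        theta = sample_dirichlet(alpha, n_topics, rng)
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]  # topic for this position
            w = rng.choices(vocab, weights=phi[z])[0]           # word drawn from that topic
            words.append(w)
        docs.append(words)
        thetas.append(theta)
    return docs, thetas, phi

toy_vocab = ["bank", "rate", "loan", "goal", "team", "match"]  # hypothetical vocabulary
docs, thetas, phi = generate_corpus(
    n_docs=3, doc_len=8, vocab=toy_vocab, n_topics=2, alpha=0.5, beta=0.5
)
for words in docs:
    print(" ".join(words))
```

Inference runs in the opposite direction: only `docs` is observed, and `thetas` and `phi` have to be estimated.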

Dependency structure:

graph LR
    a(("α")) --> t["θ_d<br/>(doc-topic)"]
    t --> z["z_d,n<br/>(topic)"]
    z --> w["w_d,n<br/>(word)"]
    p["φ_k<br/>(topic-word)"] --> w
    b(("β")) --> p
    classDef prior fill:#f5f5f5,stroke:#999,stroke-dasharray:3 3
    classDef latent fill:#eef,stroke:#447
    classDef observed fill:#efe,stroke:#474
    class a,b prior
    class t,z,p latent
    class w observed

Formal plate notation (with $D$ documents, $K$ topics, $N_{d}$ words per document) is the standard way to draw this in the literature; see the Wikipedia LDA page for the canonical figure.

Inference

Inference recovers the hidden $θ_{d}$ and $ϕ_{k}$ from the observed words. Three common approaches:

Collapsed Gibbs sampling: analytically integrates out $θ$ and $ϕ$, then iteratively reassigns each word’s topic by sampling from its conditional posterior given all other assignments. Used in Mallet. Often more accurate, but slow on large corpora.
Variational Bayes: approximates the posterior with a tractable family and optimizes the Evidence Lower Bound (ELBO). Used in scikit-learn (LatentDirichletAllocation). Faster, but the approximation can bias results.
Online Variational Bayes (Hoffman et al., 2010): processes documents in mini-batches, enabling streaming updates. This is the algorithm behind Gensim’s LdaModel.

Mallet is often preferred when users want collapsed Gibbs sampling and are willing to trade speed for topic quality.
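
To make the collapsed Gibbs update concrete, here is a toy sampler in plain Python. It is a sketch under stated assumptions, not how Mallet implements it: documents are lists of integer word ids, the priors are symmetric, and the function name, defaults, and toy corpus are hypothetical. Each word's topic is resampled with probability proportional to $(n_{d,k} + α)(n_{k,w} + β) / (n_{k} + Vβ)$, where the counts exclude the word being resampled.

```python
import random

def collapsed_gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
                        n_sweeps=200, seed=0):
    """Toy collapsed Gibbs sampler; docs is a list of lists of word ids."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(n_topics)]  # word counts per topic
    n_k = [0] * n_topics                                # total words per topic

    def update(d, w, k, delta):
        n_dk[d][k] += delta
        n_kw[k][w] += delta
        n_k[k] += delta

    # Random initial topic assignment for every word position.
    z = []
    for d, doc in enumerate(docs):
        z.append([rng.randrange(n_topics) for _ in doc])
        for w, k in zip(doc, z[d]):
            update(d, w, k, +1)

    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                update(d, w, z[d][n], -1)  # remove the current assignment
                # Conditional p(z = k | all other assignments), up to a constant.
                weights = [
                    (n_dk[d][k] + alpha) * (n_kw[k][w] + beta)
                    / (n_k[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z[d][n] = rng.choices(range(n_topics), weights=weights)[0]
                update(d, w, z[d][n], +1)

    # Point estimates from the final counts (a single sample, no averaging).
    theta = [[(c + alpha) / (len(doc) + n_topics * alpha) for c in n_dk[d]]
             for d, doc in enumerate(docs)]
    phi = [[(c + beta) / (n_k[k] + vocab_size * beta) for c in n_kw[k]]
           for k in range(n_topics)]
    return theta, phi

# Hypothetical toy corpus: word ids 0-2 and 3-5 form two obvious themes.
toy_docs = [[0, 1, 2, 0, 1, 2], [3, 4, 5, 3, 4, 5], [0, 0, 1, 2, 2, 1], [5, 3, 4, 4, 5, 3]]
theta, phi = collapsed_gibbs_lda(toy_docs, n_topics=2, vocab_size=6)
for d, row in enumerate(theta):
    print(d, [round(p, 2) for p in row])
```

Production samplers add burn-in, averaging over several samples, and sparse data structures; this version keeps only the core update.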

Hyperparameters

Parameter	Meaning	Typical default	Effect
$K$	number of topics	must specify	too few → merged themes; too many → duplicated topics
$α$	document-topic concentration	$1/K$	low (0.01–0.1) → each document concentrates on a few topics
$β$ (also written $η$; eta in Gensim, topic_word_prior in scikit-learn)	topic-word concentration	library-dependent ($1/K$ in Gensim and scikit-learn)	low → each topic concentrates on a few characteristic words

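For short texts (tweets, product titles), each document supplies only a handful of word counts, so the prior $α$ carries a lot of weight and the default can yield diffuse, uninformative topic distributions. Set $α$ below the default manually, or let Gensim learn it from the data (alpha='auto', which LdaModel supports but LdaMulticore does not).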
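
A small worked example shows why $α$ matters so much for short documents. Given topic counts $n_{d,k}$ for a document of length $N_{d}$, the posterior mean of $θ_{d}$ under a symmetric prior is $(n_{d,k} + α) / (N_{d} + Kα)$. The sketch below uses a hypothetical 5-word document with $K = 20$ and assumes all five words were assigned to the same topic.

```python
def posterior_mean_theta(topic_counts, alpha):
    """Posterior mean of theta_d under a symmetric Dirichlet(alpha) prior."""
    k = len(topic_counts)
    n = sum(topic_counts)
    return [(c + alpha) / (n + k * alpha) for c in topic_counts]

# Hypothetical short document: 5 words, all assigned to topic 0, K = 20.
short_doc_counts = [5] + [0] * 19

for alpha in (1.0, 0.05):
    theta = posterior_mean_theta(short_doc_counts, alpha)
    print(f"alpha={alpha}: top topic {theta[0]:.2f}, each other topic {theta[1]:.3f}")
# alpha=1.0: top topic 0.24, each other topic 0.040
# alpha=0.05: top topic 0.84, each other topic 0.008
```

With $α = 1$ the prior contributes 20 pseudo-counts against 5 observed words, so the estimate stays close to uniform; with $α = 0.05$ the observed words dominate.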

Why counts instead of TF-IDF

LDA is a probabilistic generative model over words sampled from multinomial distributions, which means the math requires integer counts. TF-IDF values are real-valued weights, so they don’t fit the generative story. Using TF-IDF with LDA technically works, but typically gives worse topics than raw counts.

Contrast with NMF, which is a matrix factorization objective with no distributional assumptions, so TF-IDF works well there.
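
The difference between the two representations is easy to see on a toy corpus. The sketch below is only illustrative: the documents are made up, and the TF-IDF variant used (raw term frequency times $\log(N / df)$, no smoothing or normalization) is an assumption, since libraries typically apply smoothed variants.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized.
toy_docs = [
    "bank raised rates bank".split(),
    "team won match".split(),
    "bank approved loan".split(),
]

def bow_counts(doc):
    """Bag-of-words representation: integer counts, as LDA's multinomial expects."""
    return Counter(doc)

def tfidf(doc, corpus):
    """Unsmoothed TF-IDF: term count times log(N / document frequency)."""
    counts = Counter(doc)
    n_docs = len(corpus)
    weights = {}
    for term, tf in counts.items():
        df = sum(1 for other in corpus if term in other)
        weights[term] = tf * math.log(n_docs / df)
    return weights

counts = bow_counts(toy_docs[0])
weights = tfidf(toy_docs[0], toy_docs)
print(dict(counts))   # integer counts: "this word was drawn n times"
print(weights)        # real-valued weights with no such interpretation
```

The counts can be read as the number of draws from a topic's word distribution; the TF-IDF weights cannot, which is why they fit a factorization objective such as NMF better than LDA's generative story.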

Inductive inference on new documents

LDA handles unseen documents by folding-in: freeze the topic-word distributions $ϕ_{k}$ learned during training, then run variational inference (or a few Gibbs sweeps) on the new document alone to infer its topic distribution $θ_{new}$. Contrast with BERTopic, where assignment is a single nearest-centroid lookup after the embedding is computed.
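
The Gibbs flavour of folding-in can be sketched in a few lines. This is a minimal illustration, not a library implementation: `frozen_phi` is a hypothetical 2-topic, 6-word model, documents are lists of word ids, and the function name and defaults are assumptions. Because $ϕ$ is frozen, the conditional for each word reduces to $(n_{k} + α) \cdot ϕ_{k,w}$.

```python
import random

def fold_in(doc_word_ids, phi, alpha=0.1, n_sweeps=50, seed=0):
    """Infer theta for one new document with the topic-word matrix phi held fixed."""
    rng = random.Random(seed)
    n_topics = len(phi)
    # Random initial topic for each word position, plus per-topic counts.
    z = [rng.randrange(n_topics) for _ in doc_word_ids]
    n_k = [0] * n_topics
    for k in z:
        n_k[k] += 1
    for _ in range(n_sweeps):
        for n, w in enumerate(doc_word_ids):
            n_k[z[n]] -= 1  # remove the current assignment
            # phi is frozen, so only the document-topic counts change.
            weights = [(n_k[k] + alpha) * phi[k][w] for k in range(n_topics)]
            z[n] = rng.choices(range(n_topics), weights=weights)[0]
            n_k[z[n]] += 1
    total = len(doc_word_ids) + n_topics * alpha
    return [(c + alpha) / total for c in n_k]

# Hypothetical trained model: topic 0 favours word ids 0-2, topic 1 favours 3-5.
frozen_phi = [
    [0.30, 0.30, 0.30, 0.03, 0.03, 0.04],
    [0.03, 0.03, 0.04, 0.30, 0.30, 0.30],
]
theta_new = fold_in([0, 1, 2, 0, 1], frozen_phi)
print([round(p, 2) for p in theta_new])
```

The cost per document depends only on its length, the number of topics, and the number of sweeps, which is what makes inference cost predictable.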

Guided and supervised variants

Real-world projects often carry partial domain knowledge: one topic should be about “payments” and another about “refunds”, and the model should respect that.

Seeded/Guided LDA: pre-set a few words per topic as anchors. The model is biased toward producing topics that contain those anchor words, while still discovering the remaining topics freely. Python library: GuidedLDA.
CorEx (Correlation Explanation): not strictly LDA, but an information-theoretic alternative that accepts anchor words per topic and is often easier to steer than Seeded LDA. Library: corextopic.
Labeled LDA / SLDA (Supervised LDA): supervised variants that condition topics on document labels. Useful with partial labels when topic discovery should stay within each labeled group.

For interpretable themes biased by a seed list, CorEx is frequently the best starting point; for a probabilistic LDA-style model with anchors, Seeded LDA is the more direct choice.
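
One common way to approximate seeding in a plain LDA implementation is an asymmetric topic-word prior: seed words get a larger prior weight in their designated topic. The sketch below only builds such a prior matrix; it is not how the GuidedLDA library works internally, and the function name, vocabulary, seed lists, and the `base`/`boost` values are all hypothetical.

```python
def build_seeded_eta(vocab, seed_words, n_topics, base=0.01, boost=1.0):
    """Build a (n_topics x vocab_size) prior that favours seed words in their topic."""
    word_to_id = {w: i for i, w in enumerate(vocab)}
    eta = [[base] * len(vocab) for _ in range(n_topics)]
    for topic_id, words in seed_words.items():
        for w in words:
            if w in word_to_id:  # silently skip seeds missing from the vocabulary
                eta[topic_id][word_to_id[w]] = boost
    return eta

# Hypothetical vocabulary and anchors: topic 0 ~ payments, topic 1 ~ refunds.
vocab = ["payment", "card", "invoice", "refund", "return", "shipping"]
seed_words = {0: ["payment", "card"], 1: ["refund", "return"]}
eta = build_seeded_eta(vocab, seed_words, n_topics=3)
for row in eta:
    print(row)
```

Topics without seeds (topic 2 here) keep the symmetric base prior and remain free to pick up other themes. A matrix of this shape can be passed as `eta` to Gensim's LdaModel, which accepts a per-topic, per-word prior.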

Advantages and disadvantages

Advantages: theoretically grounded, interpretable per-document topic mixtures, well-studied, cheap to train and serve compared to embedding-based methods, mature library support.

Disadvantages: must pre-specify $K$, bag-of-words ignores word order, results vary across runs (both Gibbs sampling and variational inference start from a random initialization, so the random seed needs to be fixed for reproducibility), sensitive to preprocessing, struggles on short texts, cannot capture semantic similarity (for LDA, “car” and “automobile” are just two unrelated tokens).

When LDA is still appropriate

The use case calls for explicit probabilistic per-document topic mixtures rather than hard cluster assignments.
Documents are medium-length and well-formed (news articles, papers, reports).
A model that is cheap to train and serve with a predictable inference cost per document is the priority.
An established pipeline already runs on LDA, and changing tools would cost more than improving the existing model.
Topic distributions are needed as features for a downstream classifier: LDA’s $θ_{d}$ vectors are natural inputs.

When LDA falls short, reasonable next steps are: NMF (for cleaner topics on short/noisy text), BERTopic (for semantic similarity), or ETM, the Embedded Topic Model (for large vocabularies with rare words).

Code example (gensim, including folding-in for new documents)

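from gensim import corpora
from gensim.models import LdaMulticore
from gensim.parsing.preprocessing import preprocess_string

# train_corpus is assumed to be an iterable of raw document strings supplied by the caller.
# LdaMulticore uses multiprocessing: on platforms that spawn workers (Windows, macOS),
# run this code under `if __name__ == "__main__":`.

# --- Train ---
# preprocess_string lowercases, strips punctuation and stopwords, and stems tokens
docs = [preprocess_string(d) for d in train_corpus]
dictionary = corpora.Dictionary(docs)
# Drop tokens that appear in fewer than 5 documents or in more than 80% of documents
dictionary.filter_extremes(no_below=5, no_above=0.8)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaMulticore(
    corpus=corpus,
    id2word=dictionary,
    num_topics=20,
    alpha='symmetric',  # alpha='auto' is only available in LdaModel
    eta='auto',         # learn an asymmetric topic-word prior from the corpus
    passes=10,
    random_state=42,    # fix the seed for reproducible topics
)

# Top words per topic (tokens are stems because of preprocess_string)
for i, topic in lda.print_topics(num_words=10):
    print(f"Topic {i}: {topic}")

# --- Folding-in: infer topics for a NEW document without retraining ---
new_doc = preprocess_string("The central bank raised rates again.")
new_bow = dictionary.doc2bow(new_doc)  # tokens missing from the dictionary are ignored
# list of (topic_id, probability); minimum_probability=0.0 avoids dropping low-probability topics
new_topics = lda.get_document_topics(new_bow, minimum_probability=0.0)
print(new_topics)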
