Topic modeling is an unsupervised technique for discovering abstract themes in a document collection, where a document is whatever unit of text you are analyzing (article, review, tweet, paragraph, support ticket). Each theme (topic) is expressed through a distribution over words (classical methods) or a cluster in embedding space labeled by representative words (embedding-based methods).
The key contrast with classification: classification uses predefined labels, whereas topic modeling discovers them from scratch. As a result, you get structure without supervision, but you have to decide afterward whether it’s the one you wanted.
The field evolved from matrix factorization (LSA, NMF) through probabilistic generative models (pLSA, LDA) to modern neural and embedding-based approaches (BERTopic, ETM). See Topic Modeling Methods for the full list of approaches; LDA and BERTopic have separate notes due to their significance.
Key assumptions
- Documents are mixtures of topics (classical and neural methods), or each document belongs to one primary cluster (embedding-based methods with hard assignment)
- Topics are distributions over words (classical/neural) or clusters in embedding space labeled by representative words (BERTopic, Top2Vec)
- Bag-of-words for classical methods; embedding-based methods inherit contextual understanding from their encoder
When topic modeling is not the right tool
- You already have labels. Use supervised classification, or zero-shot classification with an LLM.
- You want stable, reproducible labels. Most topic models drift between runs and across retraining; downstream consumers hate that.
- You can afford an LLM to label everything. If the corpus fits in a context window (or you can pay per document), prompt the LLM for theme summaries directly. Alternatively, use an LLM to generate synthetic labels on a subset and train a cheap supervised classifier; this can work better than raw topics.
- Your corpus is small (fewer than a few thousand documents). Embed with a sentence transformer and do clustering.
- You mainly need semantic similarity, not interpretable themes. Use embeddings and retrieval.
- Very short texts (tweets, queries, chat snippets). Classical topic models will struggle due to sparsity. Embedding-based methods or specialized short-text models (Biterm Topic Model) handle this better.
When topic modeling is still the right tool
- You need interpretability or want to explore the data.
- You need cheap and/or fast inference. LDA’s folding-in and BERTopic’s nearest-centroid assignment cost milliseconds per document; per-document LLM calls cost orders of magnitude more.
- You want to create features.
- You want to track theme prevalence over time. Dynamic topic models or BERTopic's `.topics_over_time()` turn a corpus-plus-timestamps into a time series of themes.
Evaluation
Coherence metrics
Coherence measures how semantically related the top-N words per topic are.
UMass Coherence — based on document co-occurrence of top word pairs from the training corpus:

$$C_{\text{UMass}} = \frac{2}{N(N-1)} \sum_{i=2}^{N} \sum_{j=1}^{i-1} \log \frac{D(w_i, w_j) + 1}{D(w_j)}$$

where $D(w_i, w_j)$ is the count of documents containing both words and $D(w_j)$ the count containing $w_j$ alone. Scores are typically negative; less negative is better.
NPMI (Normalized Pointwise Mutual Information) — normalizes PMI to $[-1, 1]$:

$$\text{NPMI}(w_i, w_j) = \frac{\text{PMI}(w_i, w_j)}{-\log P(w_i, w_j)}, \qquad \text{PMI}(w_i, w_j) = \log \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)}$$

NPMI showed the best overall correlation with human judgment in classical topic model evaluations. Typical “good” values: 0.05–0.25+. Hoyle et al. (2021) showed that automated coherence metrics correlate poorly with human judgment for neural topic models — treat them as a rough sanity check, not a leaderboard.
C_v — a composite measure combining NPMI with cosine similarity of sliding-window word vectors. Ranges over $[0, 1]$. Gensim's default coherence metric.
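The counting behind UMass and NPMI is simple enough to sketch in pure Python. The snippet below computes both for a single topic from document-level co-occurrence counts (sliding-window estimation, as used by C_v, is omitted); the toy corpus is hypothetical.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """UMass coherence for one topic from document co-occurrence counts."""
    doc_sets = [set(d) for d in docs]
    def d(*words):  # number of documents containing all the given words
        return sum(all(w in s for w in words) for s in doc_sets)
    n = len(top_words)
    total = sum(
        math.log((d(wi, wj) + 1) / d(wj))
        for i, wi in enumerate(top_words)
        for wj in top_words[:i]
    )
    return 2.0 * total / (n * (n - 1))

def npmi_coherence(top_words, docs):
    """Mean NPMI over all top-word pairs, with add-epsilon smoothing."""
    doc_sets = [set(d) for d in docs]
    m, eps = len(doc_sets), 1e-12
    def p(*words):  # empirical probability of co-occurrence in a document
        return sum(all(w in s for w in words) for s in doc_sets) / m
    scores = []
    for wi, wj in combinations(top_words, 2):
        pij = p(wi, wj) + eps
        pmi = math.log(pij / (p(wi) * p(wj) + eps))
        scores.append(pmi / -math.log(pij))
    return sum(scores) / len(scores)

docs = [["cat", "dog", "pet"], ["cat", "dog"], ["stock", "market"],
        ["stock", "market", "trade"], ["cat", "pet"]]
print(npmi_coherence(["cat", "dog", "pet"], docs))  # positive: words co-occur
print(npmi_coherence(["cat", "market"], docs))      # near -1: never co-occur
```

In practice you would use a library implementation (e.g. gensim's `CoherenceModel`); this sketch only makes the formulas concrete.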
Perplexity
For probabilistic models, perplexity measures how well the model predicts held-out documents:

$$\text{perplexity}(D_{\text{test}}) = \exp\left(-\frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d}\right)$$

where $N_d$ is the number of tokens in document $d$.
Lower is better, but perplexity and human judgment are negatively correlated — models that optimize perplexity tend to produce less interpretable topics (Chang et al., 2009).
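As a worked check of the formula: perplexity is the exponentiated negative average per-token log-likelihood. The held-out log-likelihoods below are hypothetical numbers, chosen only to show the arithmetic.

```python
import math

# Hypothetical held-out log-likelihoods (nats) and token counts per document.
log_liks = [-420.0, -310.5, -88.2]
n_tokens = [100, 75, 20]

perplexity = math.exp(-sum(log_liks) / sum(n_tokens))
print(round(perplexity, 2))
```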
Topic diversity
- Topic Uniqueness (TU): fraction of top-N words that appear in only one topic. TU = 1 means zero word overlap across topics.
- Proportion of unique words: percentage of distinct words across all topics' top-N lists.
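Both diversity measures reduce to set arithmetic over the topics' top-N word lists; a minimal sketch with hypothetical topics:

```python
from collections import Counter

def topic_uniqueness(topics):
    """Fraction of top-N words that appear in exactly one topic (TU)."""
    counts = Counter(w for topic in topics for w in topic)
    total = sum(len(t) for t in topics)
    return sum(1 for t in topics for w in t if counts[w] == 1) / total

def proportion_unique_words(topics):
    """Distinct words across all top-N lists, as a fraction of total slots."""
    all_words = [w for topic in topics for w in topic]
    return len(set(all_words)) / len(all_words)

topics = [["price", "refund", "billing"],
          ["ship", "delivery", "delay"],
          ["refund", "return", "policy"]]
print(topic_uniqueness(topics))         # 7 of 9 words appear in only one topic
print(proportion_unique_words(topics))  # 8 distinct words over 9 slots
```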
Manual inspection
Automated metrics catch errors but may approve topics that a human reader would reject. Things to check:
- Read the top 10–20 words of every topic.
- Read three to five representative documents per topic.
- Ask whether two topics should be merged or whether a topic is actually two.
- Try to assign a short human-readable label; if you can’t, the topic is probably noise.
Human evaluation
- Word intrusion: insert a random word into a topic’s top words; humans identify the intruder. Higher accuracy means a more coherent topic.
- Topic intrusion: show a document with its assigned topics plus one random topic; humans identify the misfit.
- Direct rating: judges rate topic quality on a Likert scale for coherence, usefulness, and interpretability.
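Word-intrusion trials can be assembled automatically: the intruder should come from a different topic's top words, so that a coherent topic makes it easy to spot. A sketch, with hypothetical topic word lists:

```python
import random

def make_intrusion_trial(topic_words, other_topics, rng):
    """Build one word-intrusion trial: top words plus one out-of-topic intruder."""
    intruder_pool = [w for t in other_topics for w in t if w not in topic_words]
    intruder = rng.choice(intruder_pool)
    shown = topic_words[:5] + [intruder]
    rng.shuffle(shown)  # hide the intruder's position
    return shown, intruder

rng = random.Random(0)
topic = ["refund", "billing", "charge", "invoice", "payment"]
others = [["shipping", "delivery", "courier"], ["login", "password", "account"]]
shown, intruder = make_intrusion_trial(topic, others, rng)
print(shown, "->", intruder)
```

Annotator accuracy at picking `intruder` out of `shown` is then the topic's word-intrusion score.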
Visualization
- pyLDAvis — the standard interactive tool for classical models (LDA). Projects topics into 2D via Jensen-Shannon divergence + Principal Coordinate Analysis (`js_PCoA`, the default `mds` option) and compares within-topic term frequencies against corpus-wide frequencies. See the pyLDAvis demo for an example screenshot.
- BERTopic built-in visualizations — UMAP projections of the document space overlaid with cluster boundaries, intertopic distance maps, topic hierarchies, and topics-over-time. See the BERTopic visualization gallery.

Choosing the number of topics
For methods that require the number of topics $K$ up front (LDA, NMF, ProdLDA):
- Run multiple values of $K$, plot coherence vs. $K$, and look for a peak or elbow.
- Coherence plots often plateau rather than peak — there are usually several reasonable values depending on desired granularity.
- Inspect topics manually at candidate values. Repeated keywords across topics signal too many; merged themes signal too few.
- Compare candidate values on a train/validation split, but rely mainly on coherence plus manual inspection.
- Starting range: try a coarse grid, e.g. $K \in \{10, 20, 50, 100\}$, then refine around the best region.
BERTopic and Top2Vec sidestep this via HDBSCAN, where `min_cluster_size` indirectly controls granularity. Hierarchical merging allows post-hoc reduction.
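The sweep-and-inspect loop can be scripted; a sketch using scikit-learn's LDA with mean UMass coherence computed directly from the document-term matrix (the corpus and candidate $K$ values are toy placeholders):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["cat dog pet animal", "dog pet vet animal", "cat pet vet",
        "stock market trade price", "market price invest trade",
        "stock invest trade"] * 5  # tiny toy corpus, repeated for stability

vec = CountVectorizer()
X = vec.fit_transform(docs)

def mean_umass(topic_term, X, top_n=4):
    """Mean UMass coherence over topics, from binary doc-term co-occurrence."""
    B = (X > 0).astype(int)
    co = (B.T @ B).toarray()  # co[i, j] = number of docs containing terms i and j
    scores = []
    for row in topic_term:
        top = np.argsort(row)[::-1][:top_n]
        pairs = [np.log((co[i, j] + 1) / co[j, j])
                 for a, i in enumerate(top) for j in top[:a]]
        scores.append(np.mean(pairs))
    return float(np.mean(scores))

results = {}
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    results[k] = mean_umass(lda.components_, X)
    print(k, round(results[k], 3))
```

On a real corpus, pair the resulting curve with manual inspection of topics at the candidate values rather than trusting the peak alone.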
Preprocessing
Classical methods (LDA, NMF, LSA):
- Tokenize, lowercase, remove punctuation.
- Remove stopwords — use domain-specific lists, not just generic ones (e.g., “patient” in medical corpora; “thanks”, “regards”, “http” in email; company boilerplate).
- Lemmatize (optional; improves interpretability).
- Filter rare terms and overly frequent ones.
- Consider bigrams/trigrams for multi-word concepts.
- NMF and LSA work best with TF-IDF weighting; LDA uses raw counts.
Embedding-based methods (BERTopic, Top2Vec):
- Minimal preprocessing — the transformer handles semantics.
- Do not remove stopwords or lemmatize before embedding.
- Clean up URLs, HTML tags, and special characters.
- Very short documents (<10 words) produce poor embeddings — aggregate or filter.
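The light cleanup for embedding-based methods amounts to a couple of regexes plus a length filter; a sketch (the patterns and the 10-word cutoff are illustrative defaults, not BERTopic requirements):

```python
import html
import re

def clean_for_embedding(text, min_words=10):
    """Strip URLs, HTML tags, and stray special characters; keep prose intact."""
    text = html.unescape(text)
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"[^\w\s.,!?'\"-]", " ", text)        # special characters
    text = re.sub(r"\s+", " ", text).strip()
    # Very short documents embed poorly: filter (or aggregate upstream).
    return text if len(text.split()) >= min_words else None

doc = "<p>See https://example.com &amp; tell me what you think about the new billing page!</p>"
print(clean_for_embedding(doc, min_words=5))
```

Note that stopwords and inflections are deliberately left in place, per the list above.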
Scale
Rough thresholds for what you can run on one machine:
- Mallet LDA — single-machine, Java. Handles millions of documents if you have RAM and patience; best topic quality for classical LDA.
- Gensim LDA — streams from disk; handles millions of documents with constant memory via online variational Bayes.
- scikit-learn LDA — loads the corpus into memory; practical ceiling around 100k documents before OOM on typical machines.
- BERTopic — embedding compute is the main cost (GPU-hours per million documents with `all-MiniLM-L6-v2`). UMAP memory grows roughly quadratically; it fails around 1–2M documents without `low_memory=True` or approximate nearest neighbors. HDBSCAN handles the million-document range if the UMAP output fits.
- NMF — very fast for the matrix sizes it handles; typically memory-bounded by the document-term matrix.
Incremental and online updates
- Gensim — online VB is the default training mode; `.update(new_corpus)` folds in additional documents without retraining. Hyperparameters can be re-tuned or held fixed.
- BERTopic — `.partial_fit()` enables streaming updates with a subset of documents at a time; `merge_models()` combines independently trained models. See the BERTopic online topic modeling docs.
- River — a streaming-ML library with a `river.feature_extraction` module that supports incremental TF-IDF and can feed NMF.
- scikit-learn — `MiniBatchNMF` supports `partial_fit` for mini-batch updates.
Topic drift and alignment
When retraining on a newer corpus, the topic IDs produced by the new model do not correspond to the old ones. There are multiple ways to deal with it:
- Topic alignment via the Hungarian algorithm. Compute a similarity matrix between old and new topic-word distributions (or embedding centroids), then solve the optimal assignment with `scipy.optimize.linear_sum_assignment`.
- BERTopic `merge_models` — merges topics from models trained on disjoint data; can also be used as a rolling-update mechanism where old topics are preserved.
- Manual curation — human review of new topics against the previous ones.
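A sketch of Hungarian-algorithm alignment, matching new topics to old ones by cosine similarity of their topic-word vectors (the two tiny matrices are hypothetical):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_topics(old, new):
    """Map each new topic ID to its best-matching old topic ID."""
    old = old / np.linalg.norm(old, axis=1, keepdims=True)
    new = new / np.linalg.norm(new, axis=1, keepdims=True)
    sim = new @ old.T                               # cosine similarity matrix
    new_idx, old_idx = linear_sum_assignment(-sim)  # maximize total similarity
    return dict(zip(new_idx.tolist(), old_idx.tolist()))

old = np.array([[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]])  # old topic-word dists
new = np.array([[0.1, 0.1, 0.8], [0.8, 0.2, 0.0]])  # new model, permuted IDs
print(align_topics(old, new))  # {0: 1, 1: 0}: new topic 0 matches old topic 1
```

The same function works on embedding centroids; for differently sized models, pad the smaller side or keep only matches above a similarity threshold.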
Multilingual corpora
- BERTopic with a multilingual sentence encoder — `paraphrase-multilingual-MiniLM-L12-v2` or `distiluse-base-multilingual-cased-v2` handles 50+ languages; clusters documents across languages if you want unified topics.
- ZeroShotTM / CombinedTM — designed for cross-lingual transfer. Train on one language, infer topics on another without parallel corpora.
- Per-language classical pipelines — for LDA/NMF on mixed-language corpora, segment by language first, then train separate models per language. Topics will not be aligned across languages by default.
What the output actually looks like
- Top words per topic (typically 10–20) — the canonical interpretable label.
- Representative documents per topic — 3–5 exemplars that make the topic concrete for humans.
- Document-topic matrix — either probabilistic ($\theta_d$, the per-document topic distribution from LDA) or a single cluster ID (BERTopic default) with an optional soft distribution via `.approximate_distribution()`.
- Topic prevalence over time — if the corpus has timestamps, aggregate topic assignments per time bin. Useful for trend reports.
- Curated human-readable labels — after manual inspection, replace “Topic 3” with “Refund and Billing Complaints”. This is what gets handed to analysts and PMs.
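Aggregating topic prevalence per time bin needs nothing beyond a group-by; a sketch with hypothetical (month, topic) assignments:

```python
from collections import Counter, defaultdict

# Hypothetical (time bin, topic ID) assignments from any topic model.
assignments = [("2024-01", 0), ("2024-01", 0), ("2024-01", 1),
               ("2024-02", 1), ("2024-02", 1), ("2024-02", 0)]

counts = defaultdict(Counter)
for month, topic in assignments:
    counts[month][topic] += 1

# Normalize to per-bin shares for trend reporting.
prevalence = {m: {t: n / sum(c.values()) for t, n in c.items()}
              for m, c in counts.items()}
print(prevalence)
```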
Post-processing
- Merge near-duplicate topics — coherence plots don’t catch topics that say the same thing with different top words. Check pairwise cosine similarity on topic-word vectors or topic centroids.
- Drop junk topics — boilerplate topics (“thanks, please, regards”), formatting artifacts, or topics that match no documents meaningfully. BERTopic’s topic -1 (outliers) is the extreme case.
- Rename topics — generic c-TF-IDF labels rarely survive user feedback. Rewrite as short noun phrases that an analyst would recognize.
- Inspect outliers — the noise cluster can be the most interesting one. Documents that don’t fit may be early signals of emerging themes.
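Flagging near-duplicate topics reduces to pairwise cosine similarity over topic-word vectors; pairs above a threshold become merge candidates (the 0.8 cutoff and toy matrix are illustrative):

```python
import numpy as np

def merge_candidates(topic_word, threshold=0.8):
    """Return (i, j, similarity) for topic pairs above a cosine threshold."""
    t = topic_word / np.linalg.norm(topic_word, axis=1, keepdims=True)
    sim = t @ t.T
    k = len(sim)
    return [(i, j, float(sim[i, j]))
            for i in range(k) for j in range(i + 1, k)
            if sim[i, j] >= threshold]

topics = np.array([[0.50, 0.40, 0.1, 0.0],
                   [0.45, 0.45, 0.1, 0.0],   # near-duplicate of topic 0
                   [0.00, 0.10, 0.2, 0.7]])
print(merge_candidates(topics))  # only the (0, 1) pair is flagged
```

The same check works on BERTopic centroids; review flagged pairs manually before merging, since high lexical overlap is a signal, not a verdict.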