A survey of the main topic modeling methods, ordered roughly by historical development (matrix factorization → probabilistic generative models → neural → embedding-based). For guidance on when to use topic modeling and how to evaluate results, see Topic Modeling.
Latent Semantic Analysis (LSA)
Deerwester et al., 1990. Applies truncated singular value decomposition to the term-document matrix:

$$X \approx U_K \Sigma_K V_K^\top$$

where U captures term-topic relationships, Σ contains singular values, V captures document-topic relationships, and K is the number of retained dimensions.
Truncating to the top K singular values forces synonyms to cluster together and partially resolves polysemy by projecting terms and documents into a shared latent space.
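A minimal sketch of this with scikit-learn's TruncatedSVD on TF-IDF input (the toy document list is illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the bank approved the loan", "the river bank flooded",
        "interest rates at the bank rose", "fish swim near the river bank"]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)                        # document-term matrix

lsa = TruncatedSVD(n_components=2, random_state=0)   # K = 2 latent dimensions
doc_topics = lsa.fit_transform(X)                    # documents in the latent space
term_topics = lsa.components_.T                      # terms in the same latent space
```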
Polysemy
Polysemy is when a single word carries multiple meanings (bank = river bank / financial institution; apple = fruit / company). Bag-of-words methods treat every occurrence of a word as the same token, so they cannot distinguish these senses from context. Embedding-based methods handle polysemy naturally because contextual encoders produce different vectors for the same word in different contexts.
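A small illustration of the contextual-encoder point, using the Hugging Face transformers library (the model choice and helper function are illustrative, not from any particular paper):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    """Return the contextual embedding of `word`'s first occurrence."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]       # (seq_len, hidden_dim)
    idx = (enc.input_ids[0] == tok.convert_tokens_to_ids(word)).nonzero()[0, 0]
    return hidden[idx]

v1 = word_vector("she deposited cash at the bank", "bank")
v2 = word_vector("they picnicked on the river bank", "bank")
# Same token, different vectors: similarity drops for different senses.
print(torch.cosine_similarity(v1, v2, dim=0))
```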
Advantages: deterministic, fast, captures co-occurrence patterns.
Disadvantages: topic dimensions can have negative weights (hard to interpret as probabilities), no probabilistic foundation, requires specifying K.
Probabilistic LSA (pLSA)
Hofmann, 1999. Adds a probabilistic interpretation to LSA:
$$P(w \mid d) = \sum_{k=1}^{K} P(w \mid z_k)\, P(z_k \mid d)$$
where $z_k$ is a latent topic. Trained via Expectation-Maximization. Historically, pLSA bridges LSA and LDA: it introduced the documents-as-topic-mixtures idea, but it lacks priors on the document-topic distributions, so the number of parameters grows linearly with corpus size and the model cannot assign probabilities to unseen documents.
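A toy numpy sketch of the EM updates, assuming N is a document-term count matrix (readable, not optimized; it materializes the full responsibility tensor):

```python
import numpy as np

def plsa(N, K, iters=50, seed=0):
    """EM for pLSA on a D x W count matrix N. Returns P(w|z), P(z|d)."""
    rng = np.random.default_rng(seed)
    D, W = N.shape
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)  # P(w|z)
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)  # P(z|d)
    for _ in range(iters):
        # E-step: responsibilities P(z|d,w) proportional to P(w|z) P(z|d)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]          # (D, K, W)
        resp = joint / joint.sum(axis=1, keepdims=True).clip(1e-12)
        # M-step: reestimate both distributions from expected counts
        weighted = N[:, None, :] * resp                        # n(d,w) * P(z|d,w)
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_w_z, p_z_d
```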
Latent Dirichlet Allocation (LDA)
Blei, Ng, Jordan, 2003. The canonical probabilistic topic model. Fixes pLSA's overfitting by adding Dirichlet priors to both document-topic and topic-word distributions. Each document's topic distribution $\theta_d$ and each topic's word distribution $\phi_k$ are inferred from observed words via Gibbs sampling or variational inference.
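A short sketch with gensim's variational implementation, where the Dirichlet priors appear as explicit hyperparameters (tokenized_docs is an assumed list of token lists; alpha='auto' lets gensim learn an asymmetric document-topic prior):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# tokenized_docs: list of token lists, assumed preprocessed upstream
dictionary = Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(toks) for toks in tokenized_docs]

lda = LdaModel(
    corpus, id2word=dictionary, num_topics=10,
    alpha="auto", eta="auto",   # learn the Dirichlet priors from the data
    passes=10, random_state=0,
)
lda.print_topics(num_topics=5, num_words=8)
```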
Non-negative Matrix Factorization (NMF)
Lee & Seung, 1999. Factorizes a document-term matrix V into two non-negative matrices:
$$V_{m \times n} \approx W_{m \times K} H_{K \times n}, \qquad \min_{W, H \ge 0} \lVert V - WH \rVert_F^2$$
where W captures document-topic weights and H captures topic-word weights. This matches sklearn’s convention (components_ is the topic-word matrix H); the original Lee & Seung paper uses the transposed convention (V as term-document, W as term-topic, H as topic-document). The non-negativity constraint produces naturally sparse, additive, parts-based representations.
NMF typically uses TF-IDF input (unlike LDA, which needs integer counts for its multinomial generative assumption). TF-IDF down-weights generic high-frequency words, which tends to produce cleaner topics.
Scikit-learn solver tip: use solver='cd' (coordinate descent) with the default Frobenius loss for speed. Switch to solver='mu' (multiplicative update) if optimizing for Kullback-Leibler divergence (beta_loss='kullback-leibler'), which produces more LDA-like probabilistic topics.
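A configuration sketch of the KL variant (hyperparameter values are illustrative):

```python
from sklearn.decomposition import NMF

# KL-divergence loss requires the multiplicative-update solver; 'nndsvda'
# fills in the exact zeros of 'nndsvd', which would stall multiplicative updates.
nmf_kl = NMF(n_components=10, solver='mu', beta_loss='kullback-leibler',
             init='nndsvda', max_iter=400, random_state=0)
```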
Advantages: fast, deterministic given initialization, clean sparse topics, works well with TF-IDF.
Disadvantages: requires specifying K, no probabilistic interpretation, results depend on initialization.
Neural Topic Models
ProdLDA
Srivastava & Sutton, 2017. The first effective neural variational inference approach for LDA. Uses a VAE architecture: an encoder maps bag-of-words input to topic proportions, a decoder reconstructs the document.
The key innovation is approximating the Dirichlet prior with a logistic normal distribution, avoiding the “component collapsing” problem where earlier neural approaches produced degenerate topics. Trained by maximizing the ELBO. Much faster inference than Gibbs sampling, and handles new documents without retraining.
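A condensed PyTorch sketch of the logistic-normal reparameterization (a simplification for illustration, not the full Srivastava & Sutton architecture; batch normalization and dropout are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProdLDASketch(nn.Module):
    """Logistic-normal VAE in the spirit of ProdLDA (details simplified)."""
    def __init__(self, vocab_size, n_topics, hidden=200):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(vocab_size, hidden), nn.Softplus())
        self.mu = nn.Linear(hidden, n_topics)
        self.logvar = nn.Linear(hidden, n_topics)
        self.beta = nn.Linear(n_topics, vocab_size, bias=False)  # topic-word logits

    def forward(self, bow):
        h = self.enc(bow)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        theta = F.softmax(z, dim=-1)          # logistic-normal topic proportions
        log_probs = F.log_softmax(self.beta(theta), dim=-1)   # mix in logit space
        return log_probs, mu, logvar

# ELBO: reconstruction term -(bow * log_probs).sum(-1) plus the closed-form
# KL between the posterior and the logistic-normal approximation of the prior.
```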
Embedded Topic Model (ETM)
Dieng, Ruiz, Blei, 2020. Combines topic modeling with word embeddings. Each word and each topic live in the same embedding space; the probability of word w under topic k is:
$$P(w \mid k) \propto \exp(e_w^\top t_k)$$
where $e_w$ is the word embedding and $t_k$ is the topic embedding. Can use pre-trained embeddings (Word2Vec, GloVe) or learn them jointly. Handles large vocabularies and rare words better than standard LDA because semantically similar words share embedding structure.
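A numpy sketch of this topic-word construction (the embedding matrices are assumed given):

```python
import numpy as np

def topic_word_dist(word_emb, topic_emb):
    """P(w|k) ∝ exp(e_w · t_k): softmax over the vocabulary for each topic.

    word_emb:  (V, d) word embeddings
    topic_emb: (K, d) topic embeddings
    """
    logits = topic_emb @ word_emb.T               # (K, V) inner products
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)       # each row is a distribution
```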
Contextualized Topic Models (CTM / CombinedTM)
Bianchi et al., 2021. Extends ProdLDA by feeding sentence-transformer embeddings into the encoder alongside (or instead of) bag-of-words representations. Two variants:
CombinedTM — BoW + sentence embeddings. Typically best coherence.
ZeroShotTM — sentence embeddings only. Enables cross-lingual topic modeling: train on English, infer topics on German text without parallel corpora.
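A hedged usage sketch following the contextualized-topic-models README conventions (raw_docs and preprocessed_docs are assumed inputs; class and argument names may differ across package versions):

```python
from contextualized_topic_models.models.ctm import ZeroShotTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

# raw_docs: untouched text for the sentence encoder;
# preprocessed_docs: cleaned text for the bag-of-words vocabulary
tp = TopicModelDataPreparation("paraphrase-multilingual-mpnet-base-v2")
train = tp.fit(text_for_contextual=raw_docs, text_for_bow=preprocessed_docs)

ctm = ZeroShotTM(bow_size=len(tp.vocab), contextual_size=768, n_components=50)
ctm.fit(train)
ctm.get_topic_lists(10)  # top-10 words per topic
```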
Embedding-based Methods
BERTopic
Grootendorst, 2022. A modular pipeline — embed documents with sentence-transformers, reduce dimensionality with UMAP, cluster with HDBSCAN, and label clusters with class-based TF-IDF. Automatically determines the number of topics and supports dynamic, hierarchical, guided, and online modes.
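A sketch of the modular construction, with each pipeline stage passed in explicitly (component settings are illustrative; docs is an assumed list of strings):

```python
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

topic_model = BERTopic(
    embedding_model="all-MiniLM-L6-v2",                       # sentence-transformers
    umap_model=UMAP(n_neighbors=15, n_components=5, metric="cosine"),
    hdbscan_model=HDBSCAN(min_cluster_size=15, prediction_data=True),
    vectorizer_model=CountVectorizer(stop_words="english"),   # feeds c-TF-IDF
)
topics, probs = topic_model.fit_transform(docs)
```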
Top2Vec
Angelov, 2020. Jointly embeds documents and words, applies UMAP + HDBSCAN to find dense clusters. Topic vectors are cluster centroids; topic words are the nearest words in embedding space. Similar philosophy to BERTopic, predates it, and uses embedding distance rather than c-TF-IDF for topic representation.
LLM-assisted Topic Discovery
A practical hybrid that works well: run BERTopic for clustering, then use an LLM as the representation model to label each cluster. This gives the scalability of embedding-based clustering with human-readable LLM-generated labels, without paying per-document LLM costs.
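A hedged sketch of this hybrid using BERTopic's documented OpenAI representation wrapper (argument names have shifted across bertopic versions, and the model name is an illustrative choice):

```python
import openai
from bertopic import BERTopic
from bertopic.representation import OpenAI

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
representation_model = OpenAI(client, model="gpt-4o-mini", chat=True)

# Clustering stays embedding-based; the LLM only labels each cluster,
# so LLM cost scales with the number of topics, not documents.
topic_model = BERTopic(representation_model=representation_model)
topics, probs = topic_model.fit_transform(docs)  # docs: list of strings (assumed)
```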
Fully LLM-based approaches such as TopicGPT use an LLM to generate topic labels directly from document samples, optionally in a hierarchical refinement loop. Labels are immediately readable, and you can specify granularity in natural language, but the cost scales with corpus size, and results are non-deterministic.
Short-text variants
Biterm Topic Model (BTM)
Yan et al., 2013. Designed for very short texts (tweets, queries, titles) where classical LDA struggles with within-document sparsity. Instead of modeling documents, BTM models unordered word pairs (biterms) extracted from the corpus as a whole and treats the entire corpus as a single implicit mixture of topics. Each biterm is generated by sampling a topic from the corpus-level topic distribution, then sampling both words independently from that topic’s word distribution.
Trades per-document topic proportions for better topic quality on short text. Available in the original C++ implementation and Python ports (bitermplus, biterm).
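A hedged usage sketch based on the bitermplus README (texts is an assumed list of short strings; the helper names here are recalled from the package docs rather than verified, so check them before use):

```python
import bitermplus as btm

# Build the vocabulary, vectorize documents, and extract biterms
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
biterms = btm.get_biterms(docs_vec)

# T topics; M context-window size for biterm extraction
model = btm.BTM(X, vocabulary, T=8, M=20, alpha=50/8, beta=0.01)
model.fit(biterms, iterations=20)
p_zd = model.transform(docs_vec)  # per-document proportions, inferred post hoc
```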
Other Notable Approaches
BigARTM
Vorontsov & Potapenko, 2015. Extends pLSA with additive regularization — multiple regularizers (sparsity, decorrelation, hierarchy, label supervision) combine into a single objective:
$$L(\Phi, \Theta) + \sum_i \tau_i R_i(\Phi, \Theta) \to \max$$
This flexibility allows simultaneous optimization for topic sparsity, inter-topic distinctness, and incorporation of side information.
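A hedged sketch with the Python artm API (dictionary and batch_vectorizer are assumed to be prepared beforehand via artm.BatchVectorizer; the tau weights are illustrative):

```python
import artm

model = artm.ARTM(num_topics=20, dictionary=dictionary)

# Each regularizer adds a weighted term tau_i * R_i to the objective
model.regularizers.add(artm.SmoothSparsePhiRegularizer(name="sparse_phi", tau=-0.1))
model.regularizers.add(artm.SmoothSparseThetaRegularizer(name="sparse_theta", tau=-0.5))
model.regularizers.add(artm.DecorrelatorPhiRegularizer(name="decorrelator_phi", tau=1e5))

model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)
```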
Dynamic Topic Models
Blei & Lafferty, 2006. Extends LDA for temporally ordered corpora. Topic distributions evolve over time via Gaussian noise:
$$\alpha_t \mid \alpha_{t-1} \sim \mathcal{N}(\alpha_{t-1}, \sigma^2 I)$$
Captures how topics change meaning over time. Available in Gensim (DtmModel) and tomotopy. BERTopic’s .topics_over_time() provides a simpler alternative by recalculating c-TF-IDF per time bin, though it is not a true generative model.
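A sketch of the BERTopic route (timestamps is an assumed list aligned one-to-one with docs; datetimes or strings both work):

```python
# Fit the model as usual, then recompute c-TF-IDF per time bin
topics_over_time = topic_model.topics_over_time(docs, timestamps, nr_bins=20)
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)
```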
Hierarchical and nonparametric variants
Hierarchical LDA (hLDA). Discovers a tree-structured topic hierarchy using the nested Chinese Restaurant Process. The number of topics per level grows with the data rather than being specified as K, though implementations typically fix the tree depth up front (see the tomotopy sketch after this list).
Correlated Topic Model (CTM). Replaces the Dirichlet prior with a logistic normal, allowing topics to be correlated (e.g., “genetics” and “biology” co-occur more often than “genetics” and “cooking”).
Structural Topic Model (STM). Incorporates document-level metadata (date, source, author) as covariates that affect topic prevalence and content. Useful when topics are not independent of document attributes.
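The hLDA sketch referenced above, using tomotopy (tokenized_docs is an assumed list of token lists; depth and iteration counts are illustrative):

```python
import tomotopy as tp

mdl = tp.HLDAModel(depth=3)          # tree depth is fixed; topics per level grow
for tokens in tokenized_docs:
    mdl.add_doc(tokens)
mdl.train(1000)

# Print the surviving topics, indented by their level in the tree
for k in range(mdl.k):
    if mdl.is_live_topic(k):
        words = [w for w, _ in mdl.get_topic_words(k, top_n=5)]
        print("  " * mdl.level(k), words)
```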
When to Use What
| Scenario | Recommended |
| --- | --- |
| Short texts (tweets, reviews) | BERTopic, Top2Vec |
| Long documents (papers, articles) | LDA, NMF, BERTopic |
| Mixed-membership (document in multiple topics) | LDA, NMF, CTM |
| Automatic number of topics | BERTopic, Top2Vec, HDP |
| Resource-constrained / fast results | NMF (fastest), LDA |
| State-of-the-art coherence | BERTopic |
| Temporal topic evolution | BERTopic .topics_over_time(), DTM |
| Cross-lingual topics | ZeroShotTM / CombinedTM |
| Large vocabulary with rare words | ETM |
| Hierarchical topic structure | hLDA, BERTopic (hierarchical mode) |
| Maximum flexibility via regularization | BigARTM |
| Metadata-aware topics (date, author, source) | STM |
| Feature engineering for a downstream classifier | LDA (probabilistic $\theta_d$), NMF |
| LLM-quality labels at scale | BERTopic with LLM representation model |
| One-shot theme summary, small corpus | Prompt an LLM directly |
In current practice, BERTopic is the default starting point for new projects. LDA and NMF remain relevant for resource-constrained settings, mixed-membership requirements, and established pipelines.
Tools and Libraries
| Library | Methods | Notes |
| --- | --- | --- |
| Gensim | LDA, LSA/LSI, HDP, LdaMallet wrapper | Memory-efficient streaming; best for classical methods |
| scikit-learn | NMF, LDA, TruncatedSVD (LSA) | Tight integration with ML pipelines; in-memory ceiling |
| BERTopic | BERTopic (dynamic, hierarchical, guided, online) | Modular; supports LLM-based representation tuning |
| Top2Vec | Top2Vec | Simple API; automatic K |
| tomotopy | LDA, HDP, CTM, DTM, SLDA, LLDA | Very fast C++ backend; many LDA variants |
| OCTIS | LDA, NMF, ETM, ProdLDA, CTM, and more | Benchmarking framework with hyperparameter optimization |
| contextualized-topic-models | CombinedTM, ZeroShotTM | Cross-lingual support |
| Mallet | LDA (Gibbs sampling) | Often produces the best LDA results; Gensim has a wrapper |
Code example (scikit-learn NMF & LDA, BERTopic)
```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

data, _ = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'),
                             return_X_y=True)
docs = data[:2000]
n_topics = 10

# --- NMF with TF-IDF ---
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
X_tfidf = tfidf.fit_transform(docs)
nmf = NMF(n_components=n_topics, random_state=1, init='nndsvd').fit(X_tfidf)

# --- LDA with raw counts ---
tf = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
X_tf = tf.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(X_tf)

def show_topics(model, feature_names, n_words=10):
    for i, topic in enumerate(model.components_):
        words = [feature_names[j] for j in topic.argsort()[:-n_words - 1:-1]]
        print(f"Topic {i}: {', '.join(words)}")

show_topics(nmf, tfidf.get_feature_names_out())
show_topics(lda, tf.get_feature_names_out())

# --- BERTopic ---
from bertopic import BERTopic

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
topic_model.get_topic_info()
```
For a BERTopic example with LLM-based labels and soft assignment, see BERTopic. For a gensim LDA example with folding-in for new documents, see LDA.