A survey of the main topic modeling methods, ordered roughly by historical development (matrix factorization → probabilistic generative models → neural → embedding-based). For information on when to use topic modeling, how to evaluate results, and other practical considerations, see Topic Modeling.

Classical Methods

Latent Semantic Analysis (LSA / LSI)

Deerwester et al., 1990. The earliest approach, based on dimensionality reduction via Singular Value Decomposition (SVD).

Given a TF-IDF weighted term-document matrix $A$, compute the truncated SVD:

$$A \approx U_k \Sigma_k V_k^\top$$

where $U_k$ captures term-topic relationships, $\Sigma_k$ contains the singular values, $V_k$ captures document-topic relationships, and $k$ is the number of retained dimensions.

Truncating to the top $k$ singular values forces synonyms to cluster together and partially resolves polysemy by projecting terms and documents into a shared latent space.

Polysemy

Polysemy is when a single word carries multiple meanings (bank = river bank / financial institution; apple = fruit / company). Bag-of-words methods treat every occurrence of a word as the same token, so they cannot distinguish these senses from context. Embedding-based methods handle polysemy naturally because contextual encoders produce different vectors for the same word in different contexts.

Advantages: deterministic, fast, captures co-occurrence patterns.
Disadvantages: topic dimensions can have negative weights (hard to interpret as probabilities), no probabilistic foundation, requires specifying $k$ in advance.
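A minimal LSA sketch with scikit-learn's TruncatedSVD (the corpus is a toy example for illustration):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "dogs and cats make good pets",
    "stock markets fell sharply today",
    "investors sold their bank shares",
]

tfidf = TfidfVectorizer()
A = tfidf.fit_transform(docs)             # (n_docs, n_terms) TF-IDF matrix

svd = TruncatedSVD(n_components=2, random_state=0)
doc_topic = svd.fit_transform(A)          # (n_docs, k) document coordinates
term_topic = svd.components_.T            # (n_terms, k) term coordinates
```

Note that sklearn works on a document-term (not term-document) matrix, so `components_` holds the term side of the factorization.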

Probabilistic LSA (pLSA)

Hofmann, 1999. Adds a probabilistic interpretation to LSA:

$$P(w \mid d) = \sum_{z} P(z \mid d)\, P(w \mid z)$$

where $z$ is a latent topic. Trained via Expectation-Maximization. pLSA bridges LSA and LDA historically — it introduced documents-as-topic-mixtures, but lacks priors on the document-topic distributions, so the number of parameters grows linearly with corpus size, and the model cannot assign probabilities to unseen documents.
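The EM updates can be sketched in a few lines of NumPy (illustrative, not optimized — the dense (D, K, V) responsibility tensor limits this to tiny corpora):

```python
import numpy as np

def plsa(N, K, iters=50, seed=0):
    """pLSA via EM. N: (D, V) document-word count matrix.
    Returns P(z|d) of shape (D, K) and P(w|z) of shape (K, V)."""
    rng = np.random.default_rng(seed)
    D, V = N.shape
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(1, keepdims=True)
    p_w_z = rng.random((K, V)); p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(iters):
        # E-step: responsibilities P(z | d, w), shape (D, K, V)
        r = p_z_d[:, :, None] * p_w_z[None, :, :]
        r /= r.sum(1, keepdims=True) + 1e-12
        # M-step: re-estimate from expected counts
        nr = N[:, None, :] * r
        p_w_z = nr.sum(0); p_w_z /= p_w_z.sum(1, keepdims=True)
        p_z_d = nr.sum(2); p_z_d /= p_z_d.sum(1, keepdims=True)
    return p_z_d, p_w_z
```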

Latent Dirichlet Allocation (LDA)

Blei, Ng, Jordan, 2003. The canonical probabilistic topic model. Fixes pLSA’s overfitting by adding Dirichlet priors to both document-topic and topic-word distributions. Each document’s topic distribution and each topic’s word distribution are inferred from observed words via Gibbs sampling or variational inference.
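A minimal sklearn sketch (toy corpus; note the count input, which LDA's multinomial likelihood requires):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "genes dna genome sequencing",
    "dna mutation gene expression",
    "stocks market trading shares",
    "market investors trading funds",
]

# LDA expects integer counts, not TF-IDF weights
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)   # per-document topic proportions; rows sum to 1
```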

See LDA for more information.

Non-negative Matrix Factorization (NMF)

Lee & Seung, 1999. Factorizes a document-term matrix $V$ into two non-negative matrices:

$$V \approx W H$$

where $W$ (documents × topics) captures document-topic weights and $H$ (topics × terms) captures topic-word weights. This matches sklearn’s convention (components_ is the topic-word matrix $H$); the original Lee & Seung paper uses the transposed convention ($V$ as term-document, $W$ as term-topic, $H$ as topic-document). The non-negativity constraint produces naturally sparse, additive, parts-based representations.

NMF typically uses TF-IDF input (unlike LDA, which needs integer counts for its multinomial generative assumption). TF-IDF down-weights generic high-frequency words, which tends to produce cleaner topics.

Scikit-learn solver tip: use solver='cd' (coordinate descent) with the default Frobenius loss for speed. Switch to solver='mu' (multiplicative update) if optimizing for Kullback-Leibler divergence (beta_loss='kullback-leibler'), which produces more LDA-like probabilistic topics.
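The solver tip above, as a runnable sketch (toy corpus for illustration):

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "genes dna genome sequencing",
    "dna mutation gene expression",
    "stocks market trading shares",
    "market investors trading funds",
]
X = TfidfVectorizer().fit_transform(docs)

# Fast default: coordinate descent with Frobenius loss
nmf = NMF(n_components=2, solver="cd", beta_loss="frobenius",
          init="nndsvd", random_state=0)
W = nmf.fit_transform(X)   # (n_docs, k) document-topic weights
H = nmf.components_        # (k, n_terms) topic-word weights

# KL variant for more LDA-like topics:
# NMF(n_components=2, solver="mu", beta_loss="kullback-leibler", random_state=0)
```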

Advantages: fast, deterministic given initialization, clean sparse topics, works well with TF-IDF.
Disadvantages: requires specifying $k$, no probabilistic interpretation, results depend on initialization.

Neural Topic Models

ProdLDA

Srivastava & Sutton, 2017. The first effective neural variational inference approach for LDA. Uses a VAE architecture: an encoder maps bag-of-words input to topic proportions, a decoder reconstructs the document.

The key innovation is approximating the Dirichlet prior with a logistic normal distribution, avoiding the “component collapsing” problem where earlier neural approaches produced degenerate topics. Trained by maximizing the ELBO. Much faster inference than Gibbs sampling, and handles new documents without retraining.

Embedded Topic Model (ETM)

Dieng, Ruiz, Blei, 2020. Combines topic modeling with word embeddings. Each word and each topic live in the same embedding space; the probability of word $w$ under topic $k$ is:

$$p(w \mid z = k) \propto \exp(\rho_w^\top \alpha_k)$$

where $\rho_w$ is the word embedding and $\alpha_k$ is the topic embedding. Can use pre-trained embeddings (Word2Vec, GloVe) or learn them jointly. Handles large vocabularies and rare words better than standard LDA because semantically similar words share embedding structure.
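The topic-word distribution is a softmax of embedding inner products over the vocabulary; a NumPy sketch (random vectors stand in for learned embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, d = 1000, 10, 50            # vocab size, topics, embedding dim
rho = rng.normal(size=(V, d))     # word embeddings (one row per word)
alpha = rng.normal(size=(K, d))   # topic embeddings (one row per topic)

logits = rho @ alpha.T            # (V, K) inner products
logits -= logits.max(axis=0)      # subtract column max for numerical stability
beta = np.exp(logits)
beta /= beta.sum(axis=0)          # beta[:, k] = p(w | topic k)
```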

Contextualized Topic Models (CTM / CombinedTM)

Bianchi et al., 2021. Extends ProdLDA by feeding sentence-transformer embeddings into the encoder alongside (or instead of) bag-of-words representations. Two variants:

  • CombinedTM — BoW + sentence embeddings. Typically best coherence.
  • ZeroShotTM — sentence embeddings only. Enables cross-lingual topic modeling: train on English, infer topics on German text without parallel corpora.

Embedding-based Methods

BERTopic

Grootendorst, 2022. A modular pipeline — embed documents with sentence-transformers, reduce dimensionality with UMAP, cluster with HDBSCAN, and label clusters with class-based TF-IDF. Automatically determines the number of topics and supports dynamic, hierarchical, guided, and online modes.

See BERTopic for more details.

Top2Vec

Angelov, 2020. Jointly embeds documents and words, applies UMAP + HDBSCAN to find dense clusters. Topic vectors are cluster centroids; topic words are the nearest words in embedding space. Similar philosophy to BERTopic, predates it, and uses embedding distance rather than c-TF-IDF for topic representation.

LLM-assisted Topic Discovery

A practical hybrid that works well: run BERTopic for clustering, then use an LLM as the representation model to label each cluster. This gives the scalability of embedding-based clustering with human-readable LLM-generated labels, without paying per-document LLM costs.

Fully LLM-based approaches (TopicGPT) use an LLM to generate topic labels directly from document samples, optionally in a hierarchical refinement loop. Labels are immediately readable, and you can specify granularity in natural language, but the cost scales with corpus size, and results are non-deterministic.

Short-text variants

Biterm Topic Model (BTM)

Yan et al., 2013. Designed for very short texts (tweets, queries, titles) where classical LDA struggles with within-document sparsity. Instead of modeling documents, BTM models unordered word pairs (biterms) extracted from the corpus as a whole and treats the entire corpus as a single implicit mixture of topics. Each biterm is generated by sampling a topic from the corpus-level topic distribution, then sampling both words independently from that topic’s word distribution.

Trades per-document topic proportions for better topic quality on short text. Available in the original C++ implementation and Python ports (bitermplus, biterm).
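Biterm extraction can be sketched as follows (the sliding-window size is a hypothetical choice; on very short texts the whole document is effectively one window):

```python
def biterms(tokens, window=15):
    """Extract unordered word pairs co-occurring within a sliding window."""
    pairs = []
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            pairs.append(tuple(sorted((w, v))))
    return pairs

biterms("apple unveils new iphone".split())
```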

Other Notable Approaches

BigARTM

Vorontsov & Potapenko, 2015. Extends pLSA with additive regularization — multiple regularizers (sparsity, decorrelation, hierarchy, label supervision) combine into a single objective:

$$\sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t} \phi_{wt}\, \theta_{td} \;+\; \sum_{i} \tau_i R_i(\Phi, \Theta) \;\to\; \max_{\Phi, \Theta}$$
This flexibility allows simultaneous optimization for topic sparsity, inter-topic distinctness, and incorporation of side information.

Dynamic Topic Models

Blei & Lafferty, 2006. Extends LDA for temporally ordered corpora. Topic distributions evolve over time via Gaussian noise on the natural parameters:

$$\beta_{t,k} \mid \beta_{t-1,k} \sim \mathcal{N}(\beta_{t-1,k},\, \sigma^2 I)$$
Captures how topics change meaning over time. Available in Gensim (DtmModel) and tomotopy. BERTopic’s .topics_over_time() provides a simpler alternative by recalculating c-TF-IDF per time bin, though it is not a true generative model.

Hierarchical and nonparametric variants

  • Hierarchical LDA (hLDA). Discovers a tree-structured topic hierarchy using the nested Chinese Restaurant Process. Does not require specifying the number of topics $K$ or the tree structure in advance.
  • Hierarchical Dirichlet Process (HDP). Nonparametric extension of LDA that infers $K$ from data. Available in Gensim and tomotopy.
  • Correlated Topic Model (CTM). Replaces the Dirichlet prior with a logistic normal, allowing topics to be correlated (e.g., “genetics” and “biology” co-occur more often than “genetics” and “cooking”).
  • Structural Topic Model (STM) — incorporates document-level metadata (date, source, author) as covariates that affect topic prevalence and content. Useful when topics are not independent of document attributes.

When to Use What

| Scenario | Recommended |
|---|---|
| Short texts (tweets, reviews) | BERTopic, Top2Vec |
| Long documents (papers, articles) | LDA, NMF, BERTopic |
| Mixed-membership (document in multiple topics) | LDA, NMF, CTM |
| Automatic number of topics | BERTopic, Top2Vec, HDP |
| Resource-constrained / fast results | NMF (fastest), LDA |
| State-of-the-art coherence | BERTopic |
| Temporal topic evolution | BERTopic .topics_over_time(), DTM |
| Cross-lingual topics | ZeroShotTM / CombinedTM |
| Large vocabulary with rare words | ETM |
| Hierarchical topic structure | hLDA, BERTopic (hierarchical mode) |
| Maximum flexibility via regularization | BigARTM |
| Metadata-aware topics (date, author, source) | STM |
| Feature engineering for a downstream classifier | LDA (probabilistic topic proportions), NMF |
| LLM-quality labels at scale | BERTopic with LLM representation model |
| One-shot theme summary, small corpus | Prompt an LLM directly |

In current practice, BERTopic is the default starting point for new projects. LDA and NMF remain relevant for resource-constrained settings, mixed-membership requirements, and established pipelines.

Tools and Libraries

| Library | Methods | Notes |
|---|---|---|
| Gensim | LDA, LSA/LSI, HDP, LdaMallet wrapper | Memory-efficient streaming; best for classical methods |
| scikit-learn | NMF, LDA, TruncatedSVD (LSA) | Tight integration with ML pipelines; in-memory ceiling |
| BERTopic | BERTopic (dynamic, hierarchical, guided, online) | Modular; supports LLM-based representation tuning |
| Top2Vec | Top2Vec | Simple API; automatic number of topics |
| tomotopy | LDA, HDP, CTM, DTM, SLDA, LLDA | Very fast C++ backend; many LDA variants |
| OCTIS | LDA, NMF, ETM, ProdLDA, CTM, and more | Benchmarking framework with hyperparameter optimization |
| contextualized-topic-models | CombinedTM, ZeroShotTM | Cross-lingual support |
| Mallet | LDA (Gibbs sampling) | Often produces the best LDA results; Gensim has a wrapper |

For a BERTopic example with LLM-based labels and soft assignment, see BERTopic. For a gensim LDA example with folding-in for new documents, see LDA.