A survey of the main topic modeling methods, ordered roughly by historical development (matrix factorization → probabilistic generative models → neural → embedding-based). For guidance on when to use topic modeling and how to evaluate results, see Topic Modeling.
Latent Semantic Analysis (LSA)
Deerwester et al., 1990. Applies truncated singular value decomposition to the term-document matrix:

$$X \approx U_K \Sigma_K V_K^\top$$

where U captures term-topic relationships, Σ contains singular values, V captures document-topic relationships, and K is the number of retained dimensions.
Truncating to the top K singular values forces synonyms to cluster together and partially resolves polysemy by projecting terms and documents into a shared latent space.
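A minimal sketch of this with scikit-learn's TruncatedSVD on TF-IDF input (the toy document list is illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the bank approved the loan", "the river bank flooded",
        "interest rates at the bank rose", "fish swim near the river bank"]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)                        # document-term matrix

lsa = TruncatedSVD(n_components=2, random_state=0)   # K = 2 latent dimensions
doc_topics = lsa.fit_transform(X)                    # documents in the latent space
term_topics = lsa.components_.T                      # terms in the same latent space
```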
Polysemy
Polysemy is when a single word carries multiple meanings (bank = river bank / financial institution; apple = fruit / company). Bag-of-words methods treat every occurrence of a word as the same token, so they cannot distinguish these senses from context. Embedding-based methods handle polysemy naturally because contextual encoders produce different vectors for the same word in different contexts.
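A small illustration of the contextual-encoder point, using the Hugging Face transformers library (the model choice and helper function are illustrative, not from any particular paper):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    """Return the contextual embedding of `word`'s first occurrence."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]       # (seq_len, hidden_dim)
    idx = (enc.input_ids[0] == tok.convert_tokens_to_ids(word)).nonzero()[0, 0]
    return hidden[idx]

v1 = word_vector("she deposited cash at the bank", "bank")
v2 = word_vector("they picnicked on the river bank", "bank")
# Same token, different vectors: similarity drops for different senses.
print(torch.cosine_similarity(v1, v2, dim=0))
```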
Advantages: deterministic, fast, captures co-occurrence patterns.
Disadvantages: topic dimensions can have negative weights (hard to interpret as probabilities), no probabilistic foundation, requires specifying K.
Probabilistic LSA (pLSA)
Hofmann, 1999. Adds a probabilistic interpretation to LSA:
$$P(w \mid d) = \sum_{k=1}^{K} P(w \mid z_k)\, P(z_k \mid d)$$
where $z_k$ is a latent topic. Trained via Expectation-Maximization. Historically, pLSA bridges LSA and LDA: it introduced the documents-as-topic-mixtures idea, but it lacks priors on the document-topic distributions, so the number of parameters grows linearly with corpus size and the model cannot assign probabilities to unseen documents.
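A toy numpy sketch of the EM updates, assuming N is a document-term count matrix (readable, not optimized; it materializes the full responsibility tensor):

```python
import numpy as np

def plsa(N, K, iters=50, seed=0):
    """EM for pLSA on a D x W count matrix N. Returns P(w|z), P(z|d)."""
    rng = np.random.default_rng(seed)
    D, W = N.shape
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)  # P(w|z)
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)  # P(z|d)
    for _ in range(iters):
        # E-step: responsibilities P(z|d,w) proportional to P(w|z) P(z|d)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]          # (D, K, W)
        resp = joint / joint.sum(axis=1, keepdims=True).clip(1e-12)
        # M-step: reestimate both distributions from expected counts
        weighted = N[:, None, :] * resp                        # n(d,w) * P(z|d,w)
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_w_z, p_z_d
```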
Latent Dirichlet Allocation (LDA)
Blei, Ng, Jordan, 2003. The canonical probabilistic topic model. Fixes pLSA's overfitting by adding Dirichlet priors to both document-topic and topic-word distributions. Each document's topic distribution $\theta_d$ and each topic's word distribution $\phi_k$ are inferred from observed words via Gibbs sampling or variational inference.
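A short sketch with gensim's variational implementation, where the Dirichlet priors appear as explicit hyperparameters (tokenized_docs is an assumed list of token lists; alpha='auto' lets gensim learn an asymmetric document-topic prior):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# tokenized_docs: list of token lists, assumed preprocessed upstream
dictionary = Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(toks) for toks in tokenized_docs]

lda = LdaModel(
    corpus, id2word=dictionary, num_topics=10,
    alpha="auto", eta="auto",   # learn the Dirichlet priors from the data
    passes=10, random_state=0,
)
lda.print_topics(num_topics=5, num_words=8)
```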
Non-negative Matrix Factorization (NMF)
Lee & Seung, 1999. Factorizes a document-term matrix V into two non-negative matrices:
$$V_{m \times n} \approx W_{m \times K} H_{K \times n}, \qquad \min_{W, H \ge 0} \lVert V - WH \rVert_F^2$$
where W captures document-topic weights and H captures topic-word weights. This matches sklearn’s convention (components_ is the topic-word matrix H); the original Lee & Seung paper uses the transposed convention (V as term-document, W as term-topic, H as topic-document). The non-negativity constraint produces naturally sparse, additive, parts-based representations.
NMF typically uses TF-IDF input (unlike LDA, which needs integer counts for its multinomial generative assumption). TF-IDF down-weights generic high-frequency words, which tends to produce cleaner topics.
Scikit-learn solver tip: use solver='cd' (coordinate descent) with the default Frobenius loss for speed. Switch to solver='mu' (multiplicative update) if optimizing for Kullback-Leibler divergence (beta_loss='kullback-leibler'), which produces more LDA-like probabilistic topics.
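A configuration sketch of the KL variant (hyperparameter values are illustrative):

```python
from sklearn.decomposition import NMF

# KL-divergence loss requires the multiplicative-update solver; 'nndsvda'
# fills in the exact zeros of 'nndsvd', which would stall multiplicative updates.
nmf_kl = NMF(n_components=10, solver='mu', beta_loss='kullback-leibler',
             init='nndsvda', max_iter=400, random_state=0)
```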
Advantages: fast, deterministic given initialization, clean sparse topics, works well with TF-IDF.
Disadvantages: requires specifying K, no probabilistic interpretation, results depend on initialization.
Neural Topic Models
ProdLDA
Srivastava & Sutton, 2017. The first effective neural variational inference approach for LDA. Uses a VAE architecture: an encoder maps bag-of-words input to topic proportions, a decoder reconstructs the document.
The key innovation is approximating the Dirichlet prior with a logistic normal distribution, avoiding the “component collapsing” problem where earlier neural approaches produced degenerate topics. Trained by maximizing the ELBO. Much faster inference than Gibbs sampling, and handles new documents without retraining.
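A condensed PyTorch sketch of the logistic-normal reparameterization (a simplification for illustration, not the full Srivastava & Sutton architecture; batch normalization and dropout are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProdLDASketch(nn.Module):
    """Logistic-normal VAE in the spirit of ProdLDA (details simplified)."""
    def __init__(self, vocab_size, n_topics, hidden=200):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(vocab_size, hidden), nn.Softplus())
        self.mu = nn.Linear(hidden, n_topics)
        self.logvar = nn.Linear(hidden, n_topics)
        self.beta = nn.Linear(n_topics, vocab_size, bias=False)  # topic-word logits

    def forward(self, bow):
        h = self.enc(bow)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        theta = F.softmax(z, dim=-1)          # logistic-normal topic proportions
        log_probs = F.log_softmax(self.beta(theta), dim=-1)   # mix in logit space
        return log_probs, mu, logvar

# ELBO: reconstruction term -(bow * log_probs).sum(-1) plus the closed-form
# KL between the posterior and the logistic-normal approximation of the prior.
```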
Embedded Topic Model (ETM)
Dieng, Ruiz, Blei, 2020. Combines topic modeling with word embeddings. Each word and each topic live in the same embedding space; the probability of word w under topic k is:
$$P(w \mid k) \propto \exp(e_w^\top t_k)$$
where $e_w$ is the word embedding and $t_k$ is the topic embedding. Can use pre-trained embeddings (Word2Vec, GloVe) or learn them jointly. Handles large vocabularies and rare words better than standard LDA because semantically similar words share embedding structure.
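A numpy sketch of this topic-word construction (the embedding matrices are assumed given):

```python
import numpy as np

def topic_word_dist(word_emb, topic_emb):
    """P(w|k) ∝ exp(e_w · t_k): softmax over the vocabulary for each topic.

    word_emb:  (V, d) word embeddings
    topic_emb: (K, d) topic embeddings
    """
    logits = topic_emb @ word_emb.T               # (K, V) inner products
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)       # each row is a distribution
```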
Contextualized Topic Models (CTM / CombinedTM)
Bianchi et al., 2021. Extends ProdLDA by feeding sentence-transformer embeddings into the encoder alongside (or instead of) bag-of-words representations. Two variants:
CombinedTM — BoW + sentence embeddings. Typically best coherence.
ZeroShotTM — sentence embeddings only. Enables cross-lingual topic modeling: train on English, infer topics on German text without parallel corpora.
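A hedged usage sketch following the contextualized-topic-models README conventions (raw_docs and preprocessed_docs are assumed inputs; class and argument names may differ across package versions):

```python
from contextualized_topic_models.models.ctm import ZeroShotTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

# raw_docs: untouched text for the sentence encoder;
# preprocessed_docs: cleaned text for the bag-of-words vocabulary
tp = TopicModelDataPreparation("paraphrase-multilingual-mpnet-base-v2")
train = tp.fit(text_for_contextual=raw_docs, text_for_bow=preprocessed_docs)

ctm = ZeroShotTM(bow_size=len(tp.vocab), contextual_size=768, n_components=50)
ctm.fit(train)
ctm.get_topic_lists(10)  # top-10 words per topic
```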
Embedding-based Methods
BERTopic
Grootendorst, 2022. A modular pipeline — embed documents with sentence-transformers, reduce dimensionality with UMAP, cluster with HDBSCAN, and label clusters with class-based TF-IDF. Automatically determines the number of topics and supports dynamic, hierarchical, guided, and online modes.
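A sketch of the modular construction, with each pipeline stage passed in explicitly (component settings are illustrative; docs is an assumed list of strings):

```python
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

topic_model = BERTopic(
    embedding_model="all-MiniLM-L6-v2",                       # sentence-transformers
    umap_model=UMAP(n_neighbors=15, n_components=5, metric="cosine"),
    hdbscan_model=HDBSCAN(min_cluster_size=15, prediction_data=True),
    vectorizer_model=CountVectorizer(stop_words="english"),   # feeds c-TF-IDF
)
topics, probs = topic_model.fit_transform(docs)
```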
Top2Vec
Angelov, 2020. Jointly embeds documents and words, applies UMAP + HDBSCAN to find dense clusters. Topic vectors are cluster centroids; topic words are the nearest words in embedding space. Similar philosophy to BERTopic, predates it, and uses embedding distance rather than c-TF-IDF for topic representation.
LLM-assisted Topic Discovery
A practical hybrid that works well: run BERTopic for clustering, then use an LLM as the representation model to label each cluster. This gives the scalability of embedding-based clustering with human-readable LLM-generated labels, without paying per-document LLM costs.
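A hedged sketch of this hybrid using BERTopic's documented OpenAI representation wrapper (argument names have shifted across bertopic versions, and the model name is an illustrative choice):

```python
import openai
from bertopic import BERTopic
from bertopic.representation import OpenAI

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
representation_model = OpenAI(client, model="gpt-4o-mini", chat=True)

# Clustering stays embedding-based; the LLM only labels each cluster,
# so LLM cost scales with the number of topics, not documents.
topic_model = BERTopic(representation_model=representation_model)
topics, probs = topic_model.fit_transform(docs)  # docs: list of strings (assumed)
```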
Fully LLM-based approaches such as TopicGPT use an LLM to generate topic labels directly from document samples, optionally in a hierarchical refinement loop. Labels are immediately readable, and you can specify granularity in natural language, but the cost scales with corpus size, and results are non-deterministic.
Short-text variants
Biterm Topic Model (BTM)
Yan et al., 2013. Designed for very short texts (tweets, queries, titles) where classical LDA struggles with within-document sparsity. Instead of modeling documents, BTM models unordered word pairs (biterms) extracted from the corpus as a whole and treats the entire corpus as a single implicit mixture of topics. Each biterm is generated by sampling a topic from the corpus-level topic distribution, then sampling both words independently from that topic’s word distribution.
Trades per-document topic proportions for better topic quality on short text. Available in the original C++ implementation and Python ports (bitermplus, biterm).
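A hedged usage sketch based on the bitermplus README (texts is an assumed list of short strings; the helper names here are recalled from the package docs rather than verified, so check them before use):

```python
import bitermplus as btm

# Build the vocabulary, vectorize documents, and extract biterms
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
biterms = btm.get_biterms(docs_vec)

# T topics; M context-window size for biterm extraction
model = btm.BTM(X, vocabulary, T=8, M=20, alpha=50/8, beta=0.01)
model.fit(biterms, iterations=20)
p_zd = model.transform(docs_vec)  # per-document proportions, inferred post hoc
```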
Other Notable Approaches
BigARTM
Vorontsov & Potapenko, 2015. Extends pLSA with additive regularization — multiple regularizers (sparsity, decorrelation, hierarchy, label supervision) combine into a single objective:
$$L(\Phi, \Theta) + \sum_i \tau_i R_i(\Phi, \Theta) \to \max$$
This flexibility allows simultaneous optimization for topic sparsity, inter-topic distinctness, and incorporation of side information.
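A hedged sketch with the Python artm API (dictionary and batch_vectorizer are assumed to be prepared beforehand via artm.BatchVectorizer; the tau weights are illustrative):

```python
import artm

model = artm.ARTM(num_topics=20, dictionary=dictionary)

# Each regularizer adds a weighted term tau_i * R_i to the objective
model.regularizers.add(artm.SmoothSparsePhiRegularizer(name="sparse_phi", tau=-0.1))
model.regularizers.add(artm.SmoothSparseThetaRegularizer(name="sparse_theta", tau=-0.5))
model.regularizers.add(artm.DecorrelatorPhiRegularizer(name="decorrelator_phi", tau=1e5))

model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)
```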
Dynamic Topic Models
Blei & Lafferty, 2006. Extends LDA for temporally ordered corpora. Topic distributions evolve over time via Gaussian noise:
$$\alpha_t \mid \alpha_{t-1} \sim \mathcal{N}(\alpha_{t-1}, \sigma^2 I)$$
Captures how topics change meaning over time. Available in Gensim (DtmModel) and tomotopy. BERTopic’s .topics_over_time() provides a simpler alternative by recalculating c-TF-IDF per time bin, though it is not a true generative model.
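A sketch of the BERTopic route (timestamps is an assumed list aligned one-to-one with docs; datetimes or strings both work):

```python
# Fit the model as usual, then recompute c-TF-IDF per time bin
topics_over_time = topic_model.topics_over_time(docs, timestamps, nr_bins=20)
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)
```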
Hierarchical and nonparametric variants
Hierarchical LDA (hLDA). Discovers a tree-structured topic hierarchy using the nested Chinese Restaurant Process. The number of topics per level grows with the data rather than being specified as K, though implementations typically fix the tree depth up front (see the tomotopy sketch after this list).
Correlated Topic Model (CTM). Replaces the Dirichlet prior with a logistic normal, allowing topics to be correlated (e.g., “genetics” and “biology” co-occur more often than “genetics” and “cooking”).
Structural Topic Model (STM). Incorporates document-level metadata (date, source, author) as covariates that affect topic prevalence and content. Useful when topics are not independent of document attributes.
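The hLDA sketch referenced above, using tomotopy (tokenized_docs is an assumed list of token lists; depth and iteration counts are illustrative):

```python
import tomotopy as tp

mdl = tp.HLDAModel(depth=3)          # tree depth is fixed; topics per level grow
for tokens in tokenized_docs:
    mdl.add_doc(tokens)
mdl.train(1000)

# Print the surviving topics, indented by their level in the tree
for k in range(mdl.k):
    if mdl.is_live_topic(k):
        words = [w for w, _ in mdl.get_topic_words(k, top_n=5)]
        print("  " * mdl.level(k), words)
```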
When to Use What
| Scenario | Recommended |
| --- | --- |
| Short texts (tweets, reviews) | BERTopic, Top2Vec |
| Long documents (papers, articles) | LDA, NMF, BERTopic |
| Mixed-membership (document in multiple topics) | LDA, NMF, CTM |
| Automatic number of topics | BERTopic, Top2Vec, HDP |
| Resource-constrained / fast results | NMF (fastest), LDA |
| State-of-the-art coherence | BERTopic |
| Temporal topic evolution | BERTopic .topics_over_time(), DTM |
| Cross-lingual topics | ZeroShotTM / CombinedTM |
| Large vocabulary with rare words | ETM |
| Hierarchical topic structure | hLDA, BERTopic (hierarchical mode) |
| Maximum flexibility via regularization | BigARTM |
| Metadata-aware topics (date, author, source) | STM |
| Feature engineering for a downstream classifier | LDA (probabilistic $\theta_d$), NMF |
| LLM-quality labels at scale | BERTopic with LLM representation model |
| One-shot theme summary, small corpus | Prompt an LLM directly |
In current practice, BERTopic is the default starting point for new projects. LDA and NMF remain relevant for resource-constrained settings, mixed-membership requirements, and established pipelines.
Tools and Libraries
| Library | Methods | Notes |
| --- | --- | --- |
| Gensim | LDA, LSA/LSI, HDP, LdaMallet wrapper | Memory-efficient streaming; best for classical methods |
| scikit-learn | NMF, LDA, TruncatedSVD (LSA) | Tight integration with ML pipelines; in-memory ceiling |
| BERTopic | BERTopic (dynamic, hierarchical, guided, online) | Modular; supports LLM-based representation tuning |
| Top2Vec | Top2Vec | Simple API; automatic K |
| tomotopy | LDA, HDP, CTM, DTM, SLDA, LLDA | Very fast C++ backend; many LDA variants |
| OCTIS | LDA, NMF, ETM, ProdLDA, CTM, and more | Benchmarking framework with hyperparameter optimization |
| contextualized-topic-models | CombinedTM, ZeroShotTM | Cross-lingual support |
| Mallet | LDA (Gibbs sampling) | Often produces the best LDA results; Gensim has a wrapper |
Code example (scikit-learn NMF & LDA, BERTopic)
```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

data, _ = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'),
                             return_X_y=True)
docs = data[:2000]
n_topics = 10

# --- NMF with TF-IDF ---
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
X_tfidf = tfidf.fit_transform(docs)
nmf = NMF(n_components=n_topics, random_state=1, init='nndsvd').fit(X_tfidf)

# --- LDA with raw counts ---
tf = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
X_tf = tf.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(X_tf)

def show_topics(model, feature_names, n_words=10):
    for i, topic in enumerate(model.components_):
        words = [feature_names[j] for j in topic.argsort()[:-n_words - 1:-1]]
        print(f"Topic {i}: {', '.join(words)}")

show_topics(nmf, tfidf.get_feature_names_out())
show_topics(lda, tf.get_feature_names_out())

# --- BERTopic ---
from bertopic import BERTopic

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
topic_model.get_topic_info()
```
For a BERTopic example with LLM-based labels and soft assignment, see BERTopic. For a gensim LDA example with folding-in for new documents, see LDA.