Chapter 5: Text Clustering and Topic Modeling
Introduction
Where Chapter 4 covered supervised classification, this chapter explores unsupervised techniques: grouping documents by meaning when you have no labels. We build a three-step text clustering pipeline (embed → reduce dimensions → cluster) on 44,949 NLP arXiv abstracts, then transition to topic modeling with BERTopic, a modular framework that combines transformer embeddings, dimensionality reduction, density-based clustering, c-TF-IDF, and (optionally) generative LLMs to produce interpretable topic labels.
Topic modeling means giving meaning to clusters by extracting representative keywords or labels.
Beyond exploration, clustering helps with outlier detection, labeling speedup, and finding mislabeled data.
Section 1: The arXiv NLP Dataset
44,949 abstracts from the cs.CL (Computation and Language) section between 1991 and 2024.
from datasets import load_dataset
dataset = load_dataset("maartengr/arxiv_nlp")["train"]
abstracts = dataset["Abstracts"]
titles = dataset["Titles"]
Section 2: A Common Pipeline for Text Clustering
The pipeline has three steps. First, embed documents with a sentence-embedding model. Second, reduce dimensionality of embeddings. Third, cluster the reduced embeddings.
2.1 Step 1 — Embed
Pick an embedding model optimized for semantic similarity (use the MTEB leaderboard). The book uses thenlper/gte-small, which is newer, faster, and stronger on clustering tasks than all-mpnet-base-v2.
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("thenlper/gte-small")
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)
embeddings.shape # (44949, 384)
2.2 Step 2 — Reduce Dimensionality
The volume of a high-dimensional space grows exponentially with its dimensionality, and clustering algorithms struggle there because distances become less informative (the curse of dimensionality).
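The distance-concentration effect is easy to demonstrate: as dimensionality grows, the farthest and nearest neighbors of a point become almost equidistant. A small numpy sketch on synthetic data (not the arXiv embeddings):

```python
import numpy as np

rng = np.random.default_rng(42)
ratios = {}
for dim in (2, 10, 100, 1000):
    points = rng.random((500, dim))  # 500 random points in the unit hypercube
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    # In high dimensions, max and min distances are nearly equal
    ratios[dim] = dists.max() / dists.min()
    print(f"dim={dim:4d}  farthest/nearest distance ratio = {ratios[dim]:.2f}")
```

The ratio collapses toward 1 as the dimension grows, which is why we reduce dimensionality before clustering.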
PCA versus UMAP: UMAP handles nonlinear structure better and is the default. Information loss is unavoidable, so there's a tradeoff between dimension reduction and signal preservation.
from umap import UMAP
umap_model = UMAP(
    n_components=5, min_dist=0.0, metric='cosine', random_state=42
)
reduced_embeddings = umap_model.fit_transform(embeddings)
| Parameter | Why |
|---|---|
| n_components=5 | 5–10 dimensions typically capture global structure |
| min_dist=0.0 | Tighter, denser clusters |
| metric='cosine' | Cosine suits high-dimensional embeddings where Euclidean breaks down |
| random_state=42 | Reproducible runs (disables parallelism, so slower) |
2.3 Step 3 — Cluster
When the number of clusters is unknown and the data is noisy, density-based clustering beats k-means.
HDBSCAN (Hierarchical DBSCAN) doesn't need you to specify the number of clusters, and it detects outliers by labeling them -1 and not forcing them into any cluster.
from hdbscan import HDBSCAN
hdbscan_model = HDBSCAN(
    min_cluster_size=50, metric="euclidean", cluster_selection_method="eom"
).fit(reduced_embeddings)
clusters = hdbscan_model.labels_
len(set(clusters)) # 156 distinct labels (including -1 for outliers)
A lower min_cluster_size produces more (smaller) clusters.
2.4 Inspect a Cluster
import numpy as np
for index in np.where(clusters == 0)[0][:3]:
    print(abstracts[index][:300] + "...\n")
Cluster 0 is about sign-language translation: every abstract mentions ASL, sign language phonology, or translation.
2.5 Visualize in 2D
Re-run UMAP with n_components=2 for plotting (separate from the 5-D space used for clustering):
import pandas as pd
import matplotlib.pyplot as plt
reduced_2d = UMAP(n_components=2, min_dist=0.0, metric="cosine", random_state=42).fit_transform(embeddings)
df = pd.DataFrame(reduced_2d, columns=["x", "y"])
df["title"] = titles
df["cluster"] = [str(c) for c in clusters]
clusters_df = df.loc[df.cluster != "-1", :]
outliers_df = df.loc[df.cluster == "-1", :]
plt.scatter(outliers_df.x, outliers_df.y, alpha=0.05, s=2, c="grey")
plt.scatter(clusters_df.x, clusters_df.y,
            c=clusters_df.cluster.astype(int),
            alpha=0.6, s=2, cmap="tab20b")
plt.axis("off")
2D plots are approximations. They distort relative distances. Always inspect actual cluster contents for ground truth.
Section 3: From Text Clustering to Topic Modeling
Topic modeling is the step that automatically describes each cluster.
3.1 The Classical Approach (LDA)
Latent Dirichlet Allocation assumes each topic is a probability distribution over the vocabulary.
LDA uses bag-of-words, so it has no semantic context. Modern transformer-embedding-based clustering captures meaning that LDA can't.
Section 4: BERTopic — A Modular Topic Modeling Framework
BERTopic has two big steps. The first is clustering, which is the same pipeline we just built:
The second is representing topics, with a clever twist on bag-of-words.
4.1 c-TF-IDF: Bag-of-Words at the Cluster Level
Standard bag-of-words counts word frequency per document:
For topics we want word frequency per cluster. That's c-TF (class TF):
But stop words like "the" and "of" dominate raw counts. To fix that, multiply by an IDF weight that penalizes words appearing in many clusters:
The result is c-TF-IDF: words that distinguish a cluster get high weight, and words common to all clusters get low weight.
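The idea fits in a few lines of numpy. This is a simplified sketch of the weighting, tf(x, c) * log(1 + A / f(x)) with A the average word count per class and f(x) the word's frequency across all classes; BERTopic's actual implementation differs in details (sparse matrices, normalization):

```python
import numpy as np

def c_tf_idf(class_docs):
    """class_docs: one whitespace-tokenized string per cluster (all its docs concatenated).
    Returns a (n_classes, n_words) weight matrix and the vocabulary."""
    vocab = sorted({word for doc in class_docs for word in doc.split()})
    index = {word: i for i, word in enumerate(vocab)}
    tf = np.zeros((len(class_docs), len(vocab)))
    for c, doc in enumerate(class_docs):
        for word in doc.split():
            tf[c, index[word]] += 1
    f = tf.sum(axis=0)                # frequency of each word across all classes
    avg = tf.sum() / len(class_docs)  # average number of words per class
    return tf * np.log(1 + avg / f), vocab

weights, vocab = c_tf_idf([
    "the the speech asr speech",
    "the the medical clinical medical",
])
# "the" appears in both classes, so its IDF multiplier is small;
# "speech" and "medical" are class-specific and score high
```

In the toy example, "the" and "speech" have the same in-class count, yet "speech" outscores "the" purely because it is concentrated in one class.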
4.2 The Full BERTopic Pipeline
The two halves are independent, so you can swap any block. Don't want outliers? Use k-means. Want different embeddings? Use any sentence-transformer.
This Lego-block design supports many variants: guided / (semi-)supervised topic modeling; hierarchical, dynamic, multimodal, multi-aspect; online / incremental; and zero-shot.
4.3 Running BERTopic on arXiv
from bertopic import BERTopic
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    verbose=True
).fit(abstracts, embeddings)
topic_model.get_topic_info()
| Topic | Count | Name |
|---|---|---|
| -1 | 14,520 | outliers |
| 0 | 2,290 | speech / asr / recognition / end |
| 1 | 1,403 | medical / clinical / biomedical / patient |
| 2 | 1,156 | sentiment / aspect / analysis / reviews |
| 3 | 986 | translation / nmt / machine / neural |
| ... | ... | ... |
| 151 | 54 | prompt / prompts / optimization / prompting |
| 153 | 53 | counseling / mental / health / therapy |
| 154 | 50 | backdoor / attacks / attack / triggers |
Topic -1 is HDBSCAN's outlier bucket. Use reduce_outliers() (or k-means) to force assignment.
4.4 Topic Search
topic_model.find_topics("topic modeling")
# ([22, -1, 1, 47, 32], [0.95, 0.91, 0.91, 0.91, 0.91])
topic_model.get_topic(22)
# [('topic', 0.066), ('topics', 0.035), ('lda', 0.016),
# ('latent', 0.013), ('document', 0.013), ('dirichlet', 0.010), ...]
# Verify by checking which topic the BERTopic paper landed in:
topic_model.topics_[titles.index("BERTopic: Neural topic modeling with a class-based TF-IDF procedure")]
# 22
4.5 Visualizations
topic_model.visualize_documents(titles, reduced_embeddings=reduced_2d, width=1200, hide_annotations=True)
topic_model.visualize_barchart()
topic_model.visualize_heatmap(n_clusters=30)
topic_model.visualize_hierarchy()
Section 5: Adding a Special Lego Block — Reranking Topic Representations
c-TF-IDF is fast but ignores semantics. We can rerank the initial keywords using more powerful (slower) models.
The reranker block plugs in on top of c-TF-IDF. It only runs once per topic (not once per document), so even slow models stay practical.
from copy import deepcopy
original_topics = deepcopy(topic_model.topic_representations_)
def topic_differences(model, original_topics, nr_topics=5):
    df = pd.DataFrame(columns=["Topic", "Original", "Updated"])
    for topic in range(nr_topics):
        og_words = " | ".join(list(zip(*original_topics[topic]))[0][:5])
        new_words = " | ".join(list(zip(*model.get_topic(topic)))[0][:5])
        df.loc[len(df)] = [topic, og_words, new_words]
    return df
5.1 KeyBERTInspired
KeyBERT extracts keywords by comparing word and document embeddings via cosine similarity. KeyBERTInspired in BERTopic does the same on the per-cluster level.
from bertopic.representation import KeyBERTInspired
representation_model = KeyBERTInspired()
topic_model.update_topics(abstracts, representation_model=representation_model)
topic_differences(topic_model, original_topics)
| Topic | Original | KeyBERTInspired |
|---|---|---|
| 0 | speech, asr, recognition, end, acoustic | speech, encoder, phonetic, language, trans... |
| 1 | medical, clinical, biomedical, patient, nlp | ehr, clinical, biomedical, language, ... |
| 2 | sentiment, aspect, analysis, reviews, ... | aspect, sentiment, sentiments, ... |
| 3 | translation, nmt, machine, neural, bleu | translation, translating, translate, trans... |
Tradeoff: embedding-based reranking reorders keywords for semantic clarity but can drop domain abbreviations (nmt, bleu) that were highly informative to experts.
5.2 Maximal Marginal Relevance (MMR)
Keywords like summaries, summary, and summarization in the same topic are redundant. MMR picks keywords that are diverse but still relevant. It iteratively chooses the next best keyword by balancing similarity to the topic with dissimilarity from already-chosen keywords (controlled by the diversity parameter).
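The selection rule can be sketched from scratch. A minimal MMR over unit-normalized embeddings (the mmr function, toy words, and toy vectors are illustrative, not BERTopic's implementation):

```python
import numpy as np

def mmr(doc_emb, word_embs, words, top_n=2, diversity=0.2):
    """Greedily pick keywords that are relevant to the topic (doc_emb) but
    dissimilar to keywords already chosen. Embeddings are unit-normalized,
    so dot products are cosine similarities."""
    relevance = word_embs @ doc_emb
    word_sim = word_embs @ word_embs.T
    selected = [int(np.argmax(relevance))]  # start with the most relevant word
    candidates = [i for i in range(len(words)) if i != selected[0]]
    while len(selected) < top_n and candidates:
        redundancy = word_sim[np.ix_(candidates, selected)].max(axis=1)
        scores = (1 - diversity) * relevance[candidates] - diversity * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return [words[i] for i in selected]

words = ["summary", "summaries", "translation"]
doc = np.array([1.0, 0.0])                             # topic embedding
embs = np.array([[1.0, 0.0], [0.995, 0.0999], [0.6, 0.8]])

print(mmr(doc, embs, words, diversity=0.0))  # ['summary', 'summaries']
print(mmr(doc, embs, words, diversity=0.6))  # ['summary', 'translation']
```

With diversity=0.0 the near-duplicate "summaries" survives; raising diversity trades some relevance for a less redundant keyword set.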
from bertopic.representation import MaximalMarginalRelevance
representation_model = MaximalMarginalRelevance(diversity=0.2)
topic_model.update_topics(abstracts, representation_model=representation_model)
topic_differences(topic_model, original_topics)
| Topic | Original | MMR |
|---|---|---|
| 4 | summarization, summaries, summary, abstract | summarization, document, extractive, rouge |
MMR drops near-duplicates. Only one "summary"-flavored word survives.
Section 6: Generative LLMs as Topic Labelers
Don't ask an LLM to label every document (millions of calls). Ask it to generate one label per topic (hundreds of calls).
The prompt receives [DOCUMENTS], the 4 most representative documents (highest c-TF-IDF cosine to topic), and [KEYWORDS], the current topic keywords from c-TF-IDF or reranker.
6.1 Flan-T5
from transformers import pipeline
from bertopic.representation import TextGeneration
prompt = """I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: '[KEYWORDS]'.
Based on the documents and keywords, what is this topic about?"""
generator = pipeline("text2text-generation", model="google/flan-t5-small")
representation_model = TextGeneration(
    generator, prompt=prompt, doc_length=50, tokenizer="whitespace"
)
topic_model.update_topics(abstracts, representation_model=representation_model)
topic_differences(topic_model, original_topics)
| Topic | Flan-T5 label |
|---|---|
| 0 | Speech-to-description |
| 1 | Science/Tech |
| 2 | Review |
| 3 | Attention-based neural machine translation |
| 4 | Summarization |
Some good ("Summarization", "Attention-based neural machine translation"), some too generic ("Science/Tech").
6.2 OpenAI GPT-3.5
import openai
from bertopic.representation import OpenAI
prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]
Based on the information above, extract a short topic label in the following format:
topic: <short topic label>
"""
client = openai.OpenAI(api_key="YOUR_KEY_HERE")
representation_model = OpenAI(
    client, model="gpt-3.5-turbo", exponential_backoff=True, chat=True, prompt=prompt
)
topic_model.update_topics(abstracts, representation_model=representation_model)
| Topic | GPT-3.5 label |
|---|---|
| 0 | Leveraging External Data for Improving Low-Resource Speech Recognition |
| 1 | Improved Representation Learning for Biomedical NLP |
| 2 | Advancements in Aspect-Based Sentiment Analysis |
| 3 | Neural Machine Translation Enhancements |
| 4 | Document Summarization Techniques |
Far more informative, and we're not even using GPT-4.
Best practice: keep multiple representations side by side. KeyBERTInspired + MMR + GPT-3.5 give complementary perspectives on the same topic. BERTopic supports this natively via the aspects parameter.
6.3 Final Visualization
fig = topic_model.visualize_document_datamap(
    titles,
    topics=list(range(20)),
    reduced_embeddings=reduced_2d,
    width=1200,
    label_font_size=11,
    label_wrap_width=20,
    use_medoids=True,
)
Summary
- The standard text clustering pipeline is embed → reduce → cluster (sentence-transformers + UMAP + HDBSCAN). Sentence embeddings give semantic reach beyond bag-of-words.
- HDBSCAN is preferred over k-means: no need to specify cluster count, and outliers (label -1) aren't forced into clusters.
- 2D plots are useful but lossy. Always inspect cluster contents directly.
- Topic modeling produces automated labels for clusters. Classical LDA uses bag-of-words. Modern approaches build on transformer embeddings.
- BERTopic is a modular framework: cluster pipeline → c-TF-IDF (cluster-level term-frequency × IDF) → optional reranker representations.
- c-TF-IDF is the secret sauce: it distinguishes words that characterize a cluster from words common across all clusters.
- The pipeline's two halves (clustering, topic representation) are independent. Swap any block, and re-rank topics without re-clustering.
- KeyBERTInspired uses semantic similarity to clean up keywords. MMR removes redundancy with a diversity parameter.
- Generative LLMs (Flan-T5, GPT-3.5) can label topics with concise human-readable summaries, applied per topic and not per document, so cost stays bounded.
- Multiple representations (keyword-based, generated label, MMR-diversified) can coexist and give different views of the same topic.