Hands-On Large Language Models by Jay Alammar & Maarten Grootendorst
Chapter 5: Text Clustering and Topic Modeling

Introduction

Where Chapter 4 covered supervised classification, this chapter explores unsupervised techniques: grouping documents by meaning when you have no labels. We build a three-step text clustering pipeline (embed → reduce dimensions → cluster) on 44,949 NLP arXiv abstracts, then transition to topic modeling with BERTopic, a modular framework that combines transformer embeddings, dimensionality reduction, density-based clustering, c-TF-IDF, and (optionally) generative LLMs to produce interpretable topic labels.

Topic modeling means giving meaning to clusters by extracting representative keywords or labels.

Beyond exploration, clustering helps with outlier detection, labeling speedup, and finding mislabeled data.


Section 1: The arXiv NLP Dataset

The dataset contains 44,949 abstracts from arXiv's cs.CL (Computation and Language) section, spanning 1991 to 2024.

from datasets import load_dataset

dataset = load_dataset("maartengr/arxiv_nlp")["train"]
abstracts = dataset["Abstracts"]
titles = dataset["Titles"]

Section 2: A Common Pipeline for Text Clustering

The pipeline has three steps. First, embed documents with a sentence-embedding model. Second, reduce dimensionality of embeddings. Third, cluster the reduced embeddings.

2.1 Step 1 — Embed

Pick an embedding model optimized for semantic similarity (use the MTEB leaderboard). The book uses thenlper/gte-small, which is newer, faster, and stronger on clustering tasks than all-mpnet-base-v2.

from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("thenlper/gte-small")
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)
embeddings.shape  # (44949, 384)

2.2 Step 2 — Reduce Dimensionality

Clustering algorithms struggle in high-dimensional spaces: as dimensionality grows, the number of possible configurations grows exponentially and distance measures lose meaning (the curse of dimensionality).

PCA versus UMAP: UMAP handles nonlinear structure better and is the default. Information loss is unavoidable, so there's a tradeoff between dimension reduction and signal preservation.
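For comparison, a linear PCA baseline is a one-liner (a sketch using scikit-learn; not part of the book's pipeline):

from sklearn.decomposition import PCA

# Linear baseline: project the 384-dim embeddings down to 5 dimensions
reduced_pca = PCA(n_components=5).fit_transform(embeddings)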

from umap import UMAP

umap_model = UMAP(
    n_components=5, min_dist=0.0, metric='cosine', random_state=42
)
reduced_embeddings = umap_model.fit_transform(embeddings)
Parameter         Why
n_components=5    5–10 dimensions typically capture global structure
min_dist=0.0      Tighter clusters
metric='cosine'   Euclidean distance breaks down in high dimensions
random_state=42   Reproducible (but disables parallelism, so slower)

2.3 Step 3 — Cluster

When the number of clusters is unknown and the data is noisy, density-based clustering beats k-means.

HDBSCAN (Hierarchical DBSCAN) doesn't need you to specify the number of clusters, and it detects outliers by labeling them -1 and not forcing them into any cluster.

from hdbscan import HDBSCAN

hdbscan_model = HDBSCAN(
    min_cluster_size=50, metric="euclidean", cluster_selection_method="eom"
).fit(reduced_embeddings)
clusters = hdbscan_model.labels_
len(set(clusters))  # 156

A lower min_cluster_size produces more (smaller) clusters.
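To see this effect, a quick sweep (an illustrative sketch; exact counts will vary):

# How the cluster count responds to min_cluster_size (excluding the -1 outlier label)
for size in (25, 50, 100):
    labels = HDBSCAN(min_cluster_size=size, metric="euclidean",
                     cluster_selection_method="eom").fit_predict(reduced_embeddings)
    print(size, len(set(labels)) - (1 if -1 in labels else 0))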

2.4 Inspect a Cluster

import numpy as np

for index in np.where(clusters == 0)[0][:3]:
    print(abstracts[index][:300] + "...\n")

Cluster 0 is about sign-language translation: every abstract mentions ASL, sign language phonology, or translation.
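To survey the remaining clusters, it helps to rank them by size first (a small NumPy helper sketch):

import numpy as np

# Largest clusters first, ignoring the -1 outlier label
unique, counts = np.unique(clusters[clusters != -1], return_counts=True)
for cluster_id, count in sorted(zip(unique, counts), key=lambda pair: -pair[1])[:5]:
    print(f"Cluster {cluster_id}: {count} documents")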

2.5 Visualize in 2D

Re-run UMAP with n_components=2 for plotting (separate from the 5-D space used for clustering):

import pandas as pd
import matplotlib.pyplot as plt

reduced_2d = UMAP(n_components=2, min_dist=0.0, metric="cosine", random_state=42).fit_transform(embeddings)
df = pd.DataFrame(reduced_2d, columns=["x", "y"])
df["title"] = titles
df["cluster"] = [str(c) for c in clusters]

clusters_df = df.loc[df.cluster != "-1", :]
outliers_df = df.loc[df.cluster == "-1", :]

plt.scatter(outliers_df.x, outliers_df.y, alpha=0.05, s=2, c="grey")
plt.scatter(clusters_df.x, clusters_df.y,
            c=clusters_df.cluster.astype(int),
            alpha=0.6, s=2, cmap="tab20b")
plt.axis("off")
plt.show()

2D plots are approximations. They distort relative distances. Always inspect actual cluster contents for ground truth.


Section 3: From Text Clustering to Topic Modeling

Topic modeling is the step that automatically describes each cluster.

3.1 The Classical Approach (LDA)

Latent Dirichlet Allocation assumes each topic is a probability distribution over the vocabulary.

LDA uses bag-of-words, so it has no semantic context. Modern transformer-embedding-based clustering captures meaning that LDA can't.
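For reference, a classical LDA baseline looks like this (a scikit-learn sketch, not from the book; note the bag-of-words input):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Bag-of-words input: word counts only, no word order and no semantics
vectorizer = CountVectorizer(stop_words="english", max_features=5000)
bow = vectorizer.fit_transform(abstracts)

lda = LatentDirichletAllocation(n_components=10, random_state=42).fit(bow)
words = vectorizer.get_feature_names_out()
for idx, weights in enumerate(lda.components_):
    # Highest-weight words approximate the topic's identity
    top_words = [words[i] for i in weights.argsort()[-5:][::-1]]
    print(f"Topic {idx}: {top_words}")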


Section 4: BERTopic — A Modular Topic Modeling Framework

BERTopic has two big steps. The first is clustering, which is the same pipeline we just built (embed → reduce → cluster).

The second is representing topics, with a clever twist on bag-of-words.

4.1 c-TF-IDF: Bag-of-Words at the Cluster Level

Standard bag-of-words counts word frequency per document.

For topics we want word frequency per cluster: treat all documents in a cluster as one long document and count words there. That's c-TF (class-based term frequency).

But stop words like "the" and "of" dominate raw counts. To fix that, multiply by an IDF weight that penalizes words appearing in many clusters.

The result is c-TF-IDF: words that distinguish a cluster get high weight, and words common to all clusters get low weight.
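Written out (this is the weighting from the BERTopic paper, with tf(x, c) the frequency of word x in cluster c, f_x its frequency across all clusters, and A the average number of words per cluster):

W(x, c) = tf(x, c) × log(1 + A / f_x)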

4.2 The Full BERTopic Pipeline

The two halves are independent, so you can swap any block. Don't want outliers? Use k-means. Want different embeddings? Use any sentence-transformer.
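For example, a minimal sketch of the k-means swap (BERTopic's hdbscan_model slot accepts other clusterers, including scikit-learn's KMeans):

from sklearn.cluster import KMeans
from bertopic import BERTopic

# Swapping the clustering block: k-means gives a fixed cluster count and no outliers
cluster_model = KMeans(n_clusters=50, random_state=42)
topic_model_km = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=cluster_model,
)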

This Lego-block design supports many variants: guided / (semi-)supervised topic modeling; hierarchical, dynamic, multimodal, multi-aspect; online / incremental; and zero-shot.

4.3 Running BERTopic on arXiv

from bertopic import BERTopic

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    verbose=True
).fit(abstracts, embeddings)

topic_model.get_topic_info()
Topic   Count    Name
-1      14,520   outliers
0       2,290    speech / asr / recognition / end
1       1,403    medical / clinical / biomedical / patient
2       1,156    sentiment / aspect / analysis / reviews
3       986      translation / nmt / machine / neural
...
151     54       prompt / prompts / optimization / prompting
153     53       counseling / mental / health / therapy
154     50       backdoor / attacks / attack / triggers

Topic -1 is HDBSCAN's outlier bucket. Use reduce_outliers() (or k-means) to force assignment.
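A minimal sketch of that reassignment (reduce_outliers and update_topics are part of the BERTopic API; the default strategy redistributes outliers via topic distributions):

# Reassign the -1 outliers to their closest topics, then refresh the representations
new_topics = topic_model.reduce_outliers(abstracts, topic_model.topics_)
topic_model.update_topics(abstracts, topics=new_topics)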

topic_model.find_topics("topic modeling")
# ([22, -1, 1, 47, 32], [0.95, 0.91, 0.91, 0.91, 0.91])

topic_model.get_topic(22)
# [('topic', 0.066), ('topics', 0.035), ('lda', 0.016),
#  ('latent', 0.013), ('document', 0.013), ('dirichlet', 0.010), ...]

# Verify by checking which topic the BERTopic paper landed in:
topic_model.topics_[titles.index("BERTopic: Neural topic modeling with a class-based TF-IDF procedure")]
# 22
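To sanity-check a topic beyond its keywords, you can also pull the documents BERTopic considers most representative (get_representative_docs is part of the BERTopic API):

# The stored representative abstracts for the topic-modeling topic
topic_model.get_representative_docs(22)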

4.4 Visualizations

topic_model.visualize_documents(titles, reduced_embeddings=reduced_2d, width=1200, hide_annotations=True)
topic_model.visualize_barchart()
topic_model.visualize_heatmap(n_clusters=30)
topic_model.visualize_hierarchy()

Section 5: Adding a Special Lego Block — Reranking Topic Representations

c-TF-IDF is fast but ignores semantics. We can rerank the initial keywords using more powerful (slower) models.

The reranker block plugs in on top of c-TF-IDF. It only runs once per topic (not once per document), so even slow models stay practical.

from copy import deepcopy
original_topics = deepcopy(topic_model.topic_representations_)

def topic_differences(model, original_topics, nr_topics=5):
    """Show the top-5 keywords per topic before and after updating representations."""
    df = pd.DataFrame(columns=["Topic", "Original", "Updated"])
    for topic in range(nr_topics):
        # Each topic is a list of (word, weight) tuples; keep the first 5 words
        og_words  = " | ".join(list(zip(*original_topics[topic]))[0][:5])
        new_words = " | ".join(list(zip(*model.get_topic(topic)))[0][:5])
        df.loc[len(df)] = [topic, og_words, new_words]
    return df

5.1 KeyBERTInspired

KeyBERT extracts keywords by comparing word and document embeddings via cosine similarity. KeyBERTInspired in BERTopic applies the same idea at the cluster level.

from bertopic.representation import KeyBERTInspired

representation_model = KeyBERTInspired()
topic_model.update_topics(abstracts, representation_model=representation_model)
topic_differences(topic_model, original_topics)
Topic   Original                                      KeyBERTInspired
0       speech | asr | recognition | end | acoustic   speech | encoder | phonetic | language | trans...
1       medical | clinical | biomedical | patient     nlp | ehr | clinical | biomedical | language
2       sentiment | aspect | analysis | reviews       aspect | sentiment | sentiments | ...
3       translation | nmt | machine | neural | bleu   translation | translating | translate | trans...

Tradeoff: embeddings reorder for semantic clarity but can drop domain abbreviations (nmt, bleu) that were highly informative to experts.

5.2 Maximal Marginal Relevance (MMR)

Keywords like summaries, summary, and summarization in the same topic are redundant. MMR picks keywords that are diverse but still relevant: it iteratively chooses the next best keyword by balancing similarity to the topic against dissimilarity from already-chosen keywords (controlled by a diversity parameter), as sketched below.
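The selection rule in a few lines (a from-scratch sketch of MMR, not BERTopic's internal implementation; assumes L2-normalized embeddings):

import numpy as np

def mmr(topic_emb, word_embs, words, top_n=5, diversity=0.2):
    # Cosine similarities (dot products of normalized vectors)
    word_topic_sim = word_embs @ topic_emb   # relevance of each word to the topic
    word_word_sim = word_embs @ word_embs.T  # redundancy between candidate words
    chosen = [int(np.argmax(word_topic_sim))]
    while len(chosen) < top_n:
        candidates = [i for i in range(len(words)) if i not in chosen]
        # Relevance minus worst-case redundancy with already-chosen keywords
        scores = [(1 - diversity) * word_topic_sim[i]
                  - diversity * word_word_sim[i, chosen].max()
                  for i in candidates]
        chosen.append(candidates[int(np.argmax(scores))])
    return [words[i] for i in chosen]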

from bertopic.representation import MaximalMarginalRelevance

representation_model = MaximalMarginalRelevance(diversity=0.2)
topic_model.update_topics(abstracts, representation_model=representation_model)
topic_differences(topic_model, original_topics)
Topic   Original                                         MMR
4       summarization | summaries | summary | abstract   summarization | document | extractive | rouge

MMR drops near-duplicates. Only one "summary"-flavored word survives.


Section 6: Generative LLMs as Topic Labelers

Don't ask an LLM to label every document (millions of calls). Ask it to generate one label per topic (hundreds of calls).

The prompt receives [DOCUMENTS], the four most representative documents per topic (those whose c-TF-IDF vectors are most cosine-similar to the topic's), and [KEYWORDS], the current topic keywords (from c-TF-IDF or a reranker).

6.1 Flan-T5

from transformers import pipeline
from bertopic.representation import TextGeneration

prompt = """I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: '[KEYWORDS]'.
Based on the documents and keywords, what is this topic about?"""

generator = pipeline("text2text-generation", model="google/flan-t5-small")
representation_model = TextGeneration(
    generator, prompt=prompt, doc_length=50, tokenizer="whitespace"
)
topic_model.update_topics(abstracts, representation_model=representation_model)
topic_differences(topic_model, original_topics)
Topic   Flan-T5 label
0       Speech-to-description
1       Science/Tech
2       Review
3       Attention-based neural machine translation
4       Summarization

Some good ("Summarization", "Attention-based neural machine translation"), some too generic ("Science/Tech").

6.2 OpenAI GPT-3.5

import openai
from bertopic.representation import OpenAI

prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]
Based on the information above, extract a short topic label in the following format:
topic: <short topic label>
"""

client = openai.OpenAI(api_key="YOUR_KEY_HERE")
representation_model = OpenAI(
    client, model="gpt-3.5-turbo", exponential_backoff=True, chat=True, prompt=prompt,
)
topic_model.update_topics(abstracts, representation_model=representation_model)
Topic   GPT-3.5 label
0       Leveraging External Data for Improving Low-Resource Speech Recognition
1       Improved Representation Learning for Biomedical NLP
2       Advancements in Aspect-Based Sentiment Analysis
3       Neural Machine Translation Enhancements
4       Document Summarization Techniques

Far more informative, and we're not even using GPT-4.

Best practice: keep multiple representations side by side. KeyBERTInspired + MMR + GPT-3.5 give complementary perspectives on the same topic. BERTopic supports this natively through multi-aspect topic modeling: pass a dict of representation models, one per named aspect.
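A minimal sketch, assuming update_topics accepts the same dict form as the constructor's representation_model (the aspect names "KeyBERT" and "MMR" are labels we choose):

# Multiple representations of the same topics, keyed by aspect name
representation_models = {
    "KeyBERT": KeyBERTInspired(),
    "MMR": MaximalMarginalRelevance(diversity=0.2),
}
topic_model.update_topics(abstracts, representation_model=representation_models)
topic_model.get_topic(0, full=True)  # dict mapping aspect name -> keyword list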

6.3 Final Visualization

fig = topic_model.visualize_document_datamap(
    titles,
    topics=list(range(20)),
    reduced_embeddings=reduced_2d,
    width=1200,
    label_font_size=11,
    label_wrap_width=20,
    use_medoids=True,
)

Summary

  • The standard text clustering pipeline is embed → reduce → cluster (sentence-transformers + UMAP + HDBSCAN). Sentence embeddings give semantic reach beyond bag-of-words.
  • HDBSCAN is preferred over k-means: no need to specify cluster count, and outliers (label -1) aren't forced into clusters.
  • 2D plots are useful but lossy. Always inspect cluster contents directly.
  • Topic modeling produces automated labels for clusters. Classical LDA uses bag-of-words. Modern approaches build on transformer embeddings.
  • BERTopic is a modular framework: cluster pipeline → c-TF-IDF (cluster-level term-frequency × IDF) → optional reranker representations.
  • c-TF-IDF is the secret sauce: it distinguishes words that characterize a cluster from words common across all clusters.
  • The pipeline's two halves (clustering, topic representation) are independent. Swap any block, and re-rank topics without re-clustering.
  • KeyBERTInspired uses semantic similarity to clean up keywords. MMR removes redundancy with a diversity parameter.
  • Generative LLMs (Flan-T5, GPT-3.5) can label topics with concise human-readable summaries, applied per topic and not per document, so cost stays bounded.
  • Multiple representations (keyword-based, generated label, MMR-diversified) can coexist and give different views of the same topic.