Hands-On Large Language Models by Jay Alammar & Maarten Grootendorst

Chapter 4: Text Classification

Introduction

Classification (assigning a label to input text) is one of NLP's oldest and most useful tasks: sentiment, intent detection, entity extraction, language ID. This chapter is the entry point to using pretrained models rather than training from scratch. We compare four different ways to classify the Rotten Tomatoes movie-review sentiment dataset: a task-specific RoBERTa model, an embedding model plus logistic regression, zero-shot classification with embeddings, and generative models (Flan-T5 and ChatGPT). Throughout, we keep models frozen and lean on pretraining.

Don't skip the baseline: TF-IDF + logistic regression is a strong starting point. Always benchmark the LLM approach against it.
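As a minimal sketch of that baseline (assuming the rotten_tomatoes splits loaded in Section 1 below), it fits in a few lines of scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Bag-of-words baseline: TF-IDF features feeding a linear classifier
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(data["train"]["text"], data["train"]["label"])
baseline_pred = baseline.predict(data["test"]["text"])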


Section 1: The Sentiment of Movie Reviews

We use the rotten_tomatoes dataset on Hugging Face: 5,331 positive plus 5,331 negative short reviews, split into train, validation, and test.

from datasets import load_dataset

data = load_dataset("rotten_tomatoes")
# DatasetDict: train(8530), validation(1066), test(1066)
# Features: 'text', 'label' (1 = positive, 0 = negative)

Two examples:

Text                                                         Label
"the rock is destined to be the 21st century's new conan…"   1 (positive)
"things really get weird, though not particularly scary…"    0 (negative)

Throughout the chapter we train on train and report weighted F1 on test.


Section 2: Classification with Representation Models

Two flavors are worth knowing. A task-specific model is a BERT-family model already fine-tuned for sentiment. An embedding model is a general-purpose embedder, on top of which you train a tiny classifier.

In this chapter we keep both fully frozen and just consume their outputs.

2.1 Model Selection

There are over 60K classification models and 8K+ embedding models on HF Hub. Filtering by language, architecture, size, and performance is critical.

Solid BERT-family baselines include BERT-base (uncased), RoBERTa-base, DistilBERT-base (uncased), DeBERTa-base, bert-tiny, and ALBERT-base v2.

For embeddings, the MTEB leaderboard is the standard benchmark, but inference speed matters as much as accuracy in production. The book uses sentence-transformers/all-mpnet-base-v2.


Section 3: Using a Task-Specific Model

We use cardiffnlp/twitter-roberta-base-sentiment-latest: a RoBERTa fine-tuned on tweets (not movie reviews) for sentiment. Out-of-domain on purpose, to see how it generalizes.

from transformers import pipeline

model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"
pipe = pipeline(model=model_path, tokenizer=model_path,
                return_all_scores=True, device="cuda:0")

The pipeline first tokenizes the text, then feeds the token IDs through the model.

Subword tokenization means even unseen words can be represented by composing known pieces.

Run inference and pick the higher of negative_score or positive_score:

import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "text")), total=len(data["test"])):
    negative_score = output[0]["score"]   # index 0 = negative
    positive_score = output[2]["score"]   # index 2 = positive (index 1, neutral, is ignored)
    y_pred.append(np.argmax([negative_score, positive_score]))

Evaluation helper:

from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    print(classification_report(y_true, y_pred,
                                target_names=["Negative Review", "Positive Review"]))
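The same call pattern is used for every approach in the chapter:

evaluate_performance(data["test"]["label"], y_pred)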
Class            Precision  Recall  F1
Negative Review  0.76       0.88    0.81
Positive Review  0.86       0.72    0.78
Weighted avg     0.81       0.80    0.80

3.1 Reading the Classification Report

Four metrics derive from the confusion matrix. Precision asks: of the items the model labeled positive, how many really are positive? Recall asks: of all truly positive items, how many did the model find? Accuracy is the fraction of all predictions that are correct. F1 is the harmonic mean of precision and recall, so it is high only when both are.
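As a minimal sketch of those definitions (the counts below are illustrative, not taken from the report above):

# Metrics from a binary confusion matrix: TP, FP, FN, TN
def confusion_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)                   # of predicted positives, how many are right
    recall    = tp / (tp + fn)                   # of true positives, how many were found
    accuracy  = (tp + tn) / (tp + fp + fn + tn)  # fraction of all predictions correct
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, accuracy, f1

confusion_metrics(tp=80, fp=20, fn=10, tn=90)  # illustrative counts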

We report weighted F1, which stays meaningful when classes are imbalanced.

0.80 F1 from a tweet-tuned RoBERTa applied to movie reviews — already strong without any in-domain training.


Section 4: Classification via Embeddings (Supervised)

What if no fine-tuned model exists for your task? Generate embeddings and train a tiny classifier on them.

Step 1: encode train and test text into 768-dim embeddings. The embedder stays frozen.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
train_embeddings = model.encode(data["train"]["text"], show_progress_bar=True)
test_embeddings  = model.encode(data["test"]["text"],  show_progress_bar=True)

train_embeddings.shape  # (8530, 768)

Step 2: train logistic regression on top, which runs on CPU in seconds.

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=42)
clf.fit(train_embeddings, data["train"]["label"])
y_pred = clf.predict(test_embeddings)
evaluate_performance(data["test"]["label"], y_pred)
Class            Precision  Recall  F1
Negative Review  0.85       0.86    0.85
Positive Review  0.86       0.85    0.85
Weighted avg     0.85       0.85    0.85

Better than the task-specific tweet-trained model: general-purpose embeddings plus a tiny supervised head beat a specialized but out-of-domain BERT-family model.

If you don't want a GPU dependency, you can call Cohere or OpenAI embedding APIs and run the entire pipeline on CPU.
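A minimal sketch using OpenAI's embeddings endpoint (the model name is one current option; in practice you would batch the 8,530 training texts to respect request-size limits):

import openai

client = openai.OpenAI(api_key="YOUR_KEY_HERE")

def embed(texts, model="text-embedding-3-small"):
    # One API call per batch; each input comes back as a float vector
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]

train_embeddings = embed(data["train"]["text"][:100])  # small batch for illustration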


Section 5: Zero-Shot Classification with Embeddings

Labeling is expensive. Zero-shot classification lets you skip labeled data entirely. You describe the labels in natural language and let embeddings do the work.

The trick is to write a sentence describing each label, embed it, and compare to document embeddings via cosine similarity.

label_embeddings = model.encode(["A negative review", "A positive review"])

Cosine similarity is the cosine of the angle between two vectors: dot(a, b) / (|a| * |b|).
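As a one-function check of that formula (np was imported earlier):

def cosine(a, b):
    # cos(theta) = dot(a, b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))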

For each document, pick the label with the highest cosine similarity:

from sklearn.metrics.pairwise import cosine_similarity

sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)
evaluate_performance(data["test"]["label"], y_pred)
Class            Precision  Recall  F1
Negative Review  0.78       0.77    0.78
Positive Review  0.77       0.79    0.78
Weighted avg     0.78       0.78    0.78

0.78 with zero labeled data, from just two label descriptions. Try "A very negative/positive movie review" for a small extra bump (sketched below); more concrete descriptions move the label embeddings toward the correct semantic neighborhood.
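The variant is a one-line change; whether it helps depends on the embedder, so treat it as an experiment rather than a guarantee:

label_embeddings = model.encode(["A very negative movie review",
                                 "A very positive movie review"])
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)
evaluate_performance(data["test"]["label"], y_pred)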

NLI-based zero-shot classifiers also work, but the embedding approach showcases how versatile general-purpose embeddings are.


Section 6: Classification with Generative Models

Generative models map input text to output text: they emit words, not class IDs. To use them for classification we write instruction prompts and parse the output back into labels.

A bare review with no instruction confuses the model; we need to add context in the form of an instruction.

6.1 Flan-T5 (Encoder-Decoder)

T5 (the Text-to-Text Transfer Transformer) is an encoder-decoder model; the base variant stacks 12 encoder blocks and 12 decoder blocks. Everything (translation, summarization, classification) is cast as text-in / text-out.

T5 pretrains via span-corruption masked language modeling: mask multi-token spans (not single tokens) and predict them.
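For example (the canonical illustration from the T5 paper):

Original:  Thank you for inviting me to your party last week.
Input:     Thank you <X> me to your party <Y> week.
Target:    <X> for inviting <Y> last <Z>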

For fine-tuning, every task is converted into a text instruction and the model is trained on all of them at once.
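For example, translation becomes the input "translate English to German: That is good." with the target "Das ist gut."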

Flan-T5 ("Scaling instruction-finetuned language models") extended this to over 1,000 instruction-formatted tasks, getting much closer to GPT-style instruction following.

pipe = pipeline("text2text-generation", model="google/flan-t5-small", device="cuda:0")

prompt = "Is the following sentence positive or negative? "
data = data.map(lambda example: {"t5": prompt + example['text']})

y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "t5")), total=len(data["test"])):
    text = output[0]["generated_text"]
    y_pred.append(0 if text == "negative" else 1)

Note that we have to map the textual output back to an integer label ourselves.

Class            Precision  Recall  F1
Negative Review  0.83       0.85    0.84
Positive Review  0.85       0.83    0.84
Weighted avg     0.84       0.84    0.84

Even flan-t5-small reaches 0.84. Try flan-t5-large or flan-t5-xl for further gains.

6.2 ChatGPT (GPT-3.5-turbo)

Closed-source, decoder-only. OpenAI's training pipeline added two big steps. Instruction tuning is supervised fine-tuning on hand-written instruction/response pairs.

Preference tuning (RLHF) has humans rank multiple model outputs from best to worst, and that preference signal trains the final model.

Preference data conveys nuance (good vs. better) that instruction data doesn't. Chapter 12 covers both.

import openai

client = openai.OpenAI(api_key="YOUR_KEY_HERE")

def chatgpt_generation(prompt, document, model="gpt-3.5-turbo-0125"):
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user",   "content": prompt.replace("[DOCUMENT]", document)},
    ]
    return client.chat.completions.create(
        messages=messages, model=model, temperature=0
    ).choices[0].message.content

prompt = """Predict whether the following document is a positive or negative movie review:

[DOCUMENT]

If it is positive return 1 and if it is negative return 0. Do not give any other answers.
"""

predictions = [chatgpt_generation(prompt, doc) for doc in tqdm(data["test"]["text"])]
y_pred = [int(pred) for pred in predictions]
evaluate_performance(data["test"]["label"], y_pred)
Class            Precision  Recall  F1
Negative Review  0.87       0.97    0.92
Positive Review  0.96       0.86    0.91
Weighted avg     0.92       0.91    0.91

0.91 F1 — but with closed models, you don't know what's in the training data. The model may already have seen Rotten Tomatoes. Caveat the metric. Chapter 12 covers more rigorous evaluation.

Practical tips: track API spend (classifying the full test set cost roughly 3¢ on gpt-3.5-turbo-0125), and handle rate limits with exponential backoff, as sketched below.
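A minimal retry sketch (the delay schedule and retry count are illustrative, not prescriptive):

import random
import time

def with_backoff(call, max_retries=6):
    for attempt in range(max_retries):
        try:
            return call()
        except openai.RateLimitError:
            # Exponential backoff with jitter, capped at 60 seconds
            time.sleep(min(2 ** attempt + random.random(), 60))
    raise RuntimeError("still rate limited after retries")

prediction = with_backoff(lambda: chatgpt_generation(prompt, data["test"]["text"][0]))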


Section 7: Side-by-Side Results

Approach                                Labeled data needed  Compute                   F1
TF-IDF + LR (recommended baseline)      Yes                  CPU                       —
Task-specific RoBERTa (out-of-domain)   None at inference    GPU                       0.80
Embedding + LogReg (supervised)         Yes                  GPU embed → CPU classify  0.85
Embedding + cosine (zero-shot)          None                 GPU/CPU                   0.78
Flan-T5-small (text2text)               None                 GPU                       0.84
ChatGPT (GPT-3.5-turbo)                 None                 API                       0.91

The same dataset, four very different paths.


Summary

  • Pretrained models make text classification accessible without training your own. Pick from task-specific (already fine-tuned), embedding-based, or generative approaches.
  • A task-specific model (e.g., a sentiment-tuned RoBERTa) is the most direct: tokenize → forward pass → argmax over class scores.
  • An embedding model plus a lightweight classifier (logistic regression on sentence embeddings) is often the best speed-quality tradeoff. It beats a domain-mismatched task-specific model on this dataset.
  • Zero-shot classification via embeddings needs no labeled data. Describe the labels in natural language, embed them, and pick by cosine similarity. Surprisingly competitive (0.78 F1 here).
  • Generative models classify by prompt: write an instruction, then parse the model's text output back into a label. Flan-T5 gives strong results with purely open-source tooling. ChatGPT goes higher (0.91), but you can't audit the training data.
  • Always benchmark against a TF-IDF + logistic regression baseline. Modern doesn't always mean better for narrow classification tasks.
  • Evaluate with precision, recall, F1, and accuracy drawn from the confusion matrix. Report weighted F1 when classes might be imbalanced.