Hands-On Large Language Models by Jay Alammar & Maarten Grootendorst

Chapter 10: Creating Text Embedding Models

Introduction

Embedding models power classification, clustering, semantic search, RAG, and even chatbot memory. They are the connective tissue of modern NLP. This chapter opens Part III of the book ("Training and Fine-Tuning") by going under the hood. What is an embedding model trying to learn? What is contrastive learning? How does SBERT train a fast bi-encoder? And how do we train, fine-tune, and even train without labels on real data using the sentence-transformers framework?


Section 1: What Embedding Models Do

An embedding model converts unstructured text into numerical vectors that capture meaning.

The goal is for semantically similar inputs to land near each other in vector space, and dissimilar ones to spread apart.

But "similar" depends on the task. For sentiment analysis, we want sentiment-similar reviews to cluster, even if they discuss completely different products.

So we steer the model with example pairs. Feed it semantically similar pairs to train semantic similarity. Feed it sentiment-similar pairs to train sentiment similarity.


Section 2: Contrastive Learning

Contrastive learning is the dominant technique for training embedding models. The idea is to show the model examples of similar and dissimilar pairs simultaneously.

Contrastive explanations make the intuition concrete. A robber asked "Why did you rob a bank?" answers "because that's where the money is." Factually correct, but it misses the point. The intent was "Why rob (P) instead of obey the law (Q)?" Models learn what something is faster when shown what it is not.

A "tail, nose, four legs" being doesn't pin down "dog" versus "cat." Contrasting dog vs cat examples teaches the distinguishing features.

word2vec was already contrastive. Skip-gram + negative sampling (Chapter 2) is the original contrastive recipe in NLP.


Section 3: SBERT — Sentence-BERT (Bi-Encoder)

3.1 The Cross-Encoder Bottleneck

A cross-encoder feeds both sentences into the model together (separated by a [SEP] token) and outputs a single similarity score.

Quality is great, but for 10,000 sentences you need n·(n-1)/2 ≈ 50 million forward passes to find the most similar pair. And cross-encoders don't produce reusable embeddings, since they only score (query, candidate) pairs.
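As a rough sketch of why this blows up: one forward pass scores exactly one pair. The checkpoint below (cross-encoder/stsb-roberta-base, a public pretrained cross-encoder) is an assumption for illustration and is not used elsewhere in the chapter.

from sentence_transformers.cross_encoder import CrossEncoder

# One forward pass scores exactly one sentence pair
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")
score = cross_encoder.predict([("How old are you?", "What is your age?")])

# Finding the most similar pair among n = 10,000 sentences needs every combination
n = 10_000
print(n * (n - 1) // 2)   # 49,995,000 forward passes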

A naive fix (averaging BERT's output layer or using [CLS]) is actually worse than averaging GloVe word vectors.

3.2 The Bi-Encoder (Siamese) Solution

SBERT uses two identical (weight-shared) BERT models, a Siamese architecture. Each sentence goes through one branch, mean-pooling produces a fixed-size embedding, and similarity is computed between those embeddings.

Because the weights are shared, in practice you run a single model twice. SBERT (a bi-encoder) is fast and emits reusable embeddings; cross-encoders score more accurately but produce none.
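A minimal sketch of the bi-encoder path, assuming the pretrained all-MiniLM-L6-v2 model that appears later in the chapter: each sentence is embedded once, and every pairwise similarity afterwards is a cheap vector operation on the cached embeddings.

from sentence_transformers import SentenceTransformer, util

# One forward pass per sentence; the embeddings can be cached and reused
bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = ["He drove to work.", "She cycled to the office.", "The cat slept all day."]
embeddings = bi_encoder.encode(sentences)

# All pairwise similarities in a single matrix operation, no extra model calls
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)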


Section 4: Generating Contrastive Examples — NLI Datasets

Natural Language Inference (NLI) datasets contain (premise, hypothesis, label) triples where the label is entailment (positive), neutral, or contradiction (negative).

Entailment is a positive pair (high similarity). Contradiction is a negative pair. Perfect for contrastive learning.

The chapter uses MNLI (Multi-Genre NLI) from GLUE, with 392K pairs. We sample 50K for fast iteration:

from datasets import load_dataset

train_dataset = load_dataset("glue", "mnli", split="train").select(range(50_000))
train_dataset = train_dataset.remove_columns("idx")
# 0 = entailment, 1 = neutral, 2 = contradiction

train_dataset[2]
# {'premise': 'One of our number will carry out your instructions minutely.',
#  'hypothesis': 'A member of my team will execute your orders with immense precision.',
#  'label': 0}   # entailment

Section 5: Training an Embedding Model from Scratch

5.1 Setup

from sentence_transformers import SentenceTransformer, losses

embedding_model = SentenceTransformer('bert-base-uncased')

train_loss = losses.SoftmaxLoss(
    model=embedding_model,
    sentence_embedding_dimension=embedding_model.get_sentence_embedding_dimension(),
    num_labels=3,
)

5.2 Evaluator — STSB (Semantic Textual Similarity Benchmark)

from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

val_sts = load_dataset("glue", "stsb", split="validation")
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts["sentence1"],
    sentences2=val_sts["sentence2"],
    scores=[s/5 for s in val_sts["label"]],   # normalize 0–5 → 0–1
    main_similarity="cosine",
)

5.3 Training Arguments

from sentence_transformers.training_args import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="base_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
)
Argument                    Why
num_train_epochs            Passes through the data; increase for better quality
per_device_*_batch_size     Bigger = faster but more memory; MNR loss benefits a lot from bigger batches
warmup_steps                LR linearly ramps from 0 to the target over this many steps
fp16                        Mixed precision (16-bit); saves memory and speeds training

5.4 Train

from sentence_transformers.trainer import SentenceTransformerTrainer

trainer = SentenceTransformerTrainer(
    model=embedding_model, args=args,
    train_dataset=train_dataset, loss=train_loss, evaluator=evaluator,
)
trainer.train()

evaluator(embedding_model)
# pearson_cosine: 0.59  ← softmax loss baseline

Section 6: Evaluation — MTEB

GLUE's STSB is just one task. MTEB (Massive Text Embedding Benchmark) evaluates across 8 tasks, 58 datasets, and 112 languages, plus inference time.

from mteb import MTEB

evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(embedding_model)

Full MTEB takes hours, so the chapter uses STSB throughout for quick iteration.

Restart your notebook between training runs to clear VRAM.


Section 7: Loss Functions

7.1 Cosine Similarity Loss

For pairs labeled with a similarity score in [0, 1], the loss minimizes the squared difference between the labeled score and the cosine similarity of the two embeddings.

Map MNLI labels to similarity:

from datasets import Dataset

# entailment (0) → 1, neutral (1) and contradiction (2) → 0
mapping = {2: 0, 1: 0, 0: 1}
train_dataset = Dataset.from_dict({
    "sentence1": train_dataset["premise"],
    "sentence2": train_dataset["hypothesis"],
    "label": [float(mapping[l]) for l in train_dataset["label"]],
})

train_loss = losses.CosineSimilarityLoss(model=embedding_model)

After training: pearson_cosine of 0.72, a big jump from softmax's 0.59.

7.2 Multiple Negatives Ranking (MNR) Loss

Also called InfoNCE or NT-Xent loss, MNR is the modern default. Inputs are positive pairs (or (anchor, positive, negative) triplets). Within each batch, every other anchor's positive is treated as a negative for the current anchor (in-batch negatives).
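A minimal sketch of the computation, written directly in PyTorch rather than through the library: every anchor is scored against every positive in the batch, and cross-entropy pushes the matching (diagonal) pair to win. The scale factor of 20 is an assumption chosen for illustration; the library's MultipleNegativesRankingLoss has more options.

import torch
import torch.nn.functional as F

def mnr_loss(anchors: torch.Tensor, positives: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    """In-batch negatives: each anchor's positive is every other anchor's negative."""
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    scores = anchors @ positives.T * scale                        # (batch, batch) cosine similarities
    labels = torch.arange(scores.size(0), device=scores.device)   # the match sits on the diagonal
    return F.cross_entropy(scores, labels)

# Toy usage with random stand-in embeddings
loss = mnr_loss(torch.randn(8, 384), torch.randn(8, 384))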

Build the triplets: anchor (premise), positive (entailed hypothesis), random hypothesis as a soft negative.

import random
mnli = load_dataset("glue", "mnli", split="train").select(range(50_000))
mnli = mnli.filter(lambda x: x["label"] == 0)   # entailment only

soft_negatives = mnli["hypothesis"]
random.shuffle(soft_negatives)

train_dataset = Dataset.from_dict({
    "anchor":   mnli["premise"],
    "positive": mnli["hypothesis"],
    "negative": soft_negatives,
})

train_loss = losses.MultipleNegativesRankingLoss(model=embedding_model)

After training: 0.80. MNR loss > cosine > softmax.

Bigger batches help MNR loss. The harder the in-batch negative pool, the more the model has to learn to discriminate.

7.3 Easy / Semi-Hard / Hard Negatives

Easy negatives come from random sampling, usually unrelated to question and answer. Semi-hard negatives are the top-k from a pretrained embedding similarity search: related topic but not the right answer. Hard negatives are manually labeled or generated. For "How many people live in Amsterdam?" a hard negative might be "More than a million people live in Utrecht, which is more than Amsterdam."

Hard negatives push the model to learn finer distinctions and generally produce the strongest models.
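A hedged sketch of mining semi-hard negatives with a pretrained bi-encoder and sentence-transformers' semantic_search utility. The question and answer lists here are toy examples, not data from the chapter.

from sentence_transformers import SentenceTransformer, util

questions = [
    "How many people live in Amsterdam?",
    "What is the capital of France?",
    "Who wrote the novel 1984?",
]
answers = [
    "Almost a million people live in Amsterdam.",
    "Paris is the capital of France.",
    "George Orwell wrote the novel 1984.",
]

miner = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
q_emb = miner.encode(questions, convert_to_tensor=True)
a_emb = miner.encode(answers, convert_to_tensor=True)

# For each question, retrieve the top-k most similar answers in the pool
hits = util.semantic_search(q_emb, a_emb, top_k=3)

semi_hard_negatives = []
for i, candidates in enumerate(hits):
    # Keep the highest-ranked candidate that is not the gold answer:
    # topically close, but wrong
    negative_id = next(c["corpus_id"] for c in candidates if c["corpus_id"] != i)
    semi_hard_negatives.append(answers[negative_id])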


Section 8: Fine-Tuning a Pretrained Embedding Model

Training from scratch is expensive. The easier path is to start with a sentence-transformers model and fine-tune. all-MiniLM-L6-v2 is a small, fast, strong starting point.

embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
train_loss = losses.MultipleNegativesRankingLoss(model=embedding_model)
# ... same trainer as before

After training: 0.85, the best score yet. Caveat: all-MiniLM-L6-v2 was already trained on the full MNLI dataset, so this is a slight cheat. But it shows the pattern of fine-tuning on your own data.

Domain mismatch? Run masked language modeling on a base BERT first to adapt it to your domain, then fine-tune as an embedding model. Chapter 11 covers MLM domain adaptation.
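A hedged sketch of that first step using Hugging Face transformers (Chapter 11 covers it properly). Here domain_corpus is an assumed, already-tokenized datasets.Dataset of in-domain text, not something built in this chapter.

from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Randomly mask 15% of tokens; the model learns to reconstruct them
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain_adapted_bert", num_train_epochs=1),
    train_dataset=domain_corpus,   # assumed: tokenized in-domain text
    data_collator=collator,
)
trainer.train()
# The saved checkpoint can then be wrapped in SentenceTransformer and fine-tuned as above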


Section 9: Augmented SBERT — Bootstrapping with Limited Data

Real datasets often have only a few thousand labeled pairs. Augmented SBERT uses a slow but accurate cross-encoder to label more pairs, then trains a fast bi-encoder on the combined dataset.

The recipe has four steps. First, train a cross-encoder on the small gold dataset (real labels). Second, generate new sentence pairs (or sample from a larger unlabeled pool). Third, use the cross-encoder to label those pairs to produce a silver dataset. Fourth, train the bi-encoder on gold + silver.

# Step 1: train cross-encoder on small gold set (10K labeled pairs)
import numpy as np
import pandas as pd
from sentence_transformers.cross_encoder import CrossEncoder

cross_encoder = CrossEncoder("bert-base-uncased", num_labels=2)
cross_encoder.fit(train_dataloader=gold_dataloader, epochs=1, ...)

# Step 2-3: label 40K new pairs with the cross-encoder
output = cross_encoder.predict(pairs, apply_softmax=True)
silver = pd.DataFrame({"sentence1": ..., "sentence2": ..., "label": np.argmax(output, axis=1)})

# Step 4: combine gold + silver, train bi-encoder
data = pd.concat([gold, silver]).drop_duplicates(...)
embedding_model = SentenceTransformer("bert-base-uncased")
train_loss = losses.CosineSimilarityLoss(model=embedding_model)
# ... train as before

The result: 0.71 on STSB, nearly matching the 0.72 the cosine-loss model got with the full dataset, but using only 20% of the labels.

Picking pairs to label matters. Random pair generation skews dissimilar. Better: embed all sentences with a pretrained model, retrieve top-k similar candidates per anchor — that biases toward pairs that are likely interesting.


Section 10: Unsupervised — TSDAE

When you have zero labels, options include SimCSE, Contrastive Tension, GPL, and TSDAE (Transformer-based Sequential Denoising Auto-Encoder).

10.1 TSDAE Idea

Damage a sentence (delete random words). Pass the noisy version through an encoder plus pooling to get a sentence embedding. A decoder reconstructs the original sentence from that embedding. The better the embedding, the better the reconstruction.

It's like masked language modeling but at the sentence level: reconstruct the whole thing, not just masked words.

10.2 Implementation

import nltk; nltk.download("punkt")
from sentence_transformers.datasets import DenoisingAutoEncoderDataset
from sentence_transformers import models, SentenceTransformer, losses

mnli = load_dataset("glue", "mnli", split="train").select(range(25_000))
flat_sentences = mnli["premise"] + mnli["hypothesis"]
# Each item pairs a noised (word-deleted) sentence with its original; these pairs become the train set
damaged_data = DenoisingAutoEncoderDataset(list(set(flat_sentences)))

# Use [CLS] pooling (preserves position info — TSDAE paper finding)
word_embedding_model = models.Transformer("bert-base-uncased")
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), "cls")
embedding_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

train_loss = losses.DenoisingAutoEncoderLoss(embedding_model, tie_encoder_decoder=True)
train_loss.decoder = train_loss.decoder.to("cuda")

# Train (smaller batch; the denoising decoder makes training memory-heavy)
args = SentenceTransformerTrainingArguments(per_device_train_batch_size=16, ...)
trainer = SentenceTransformerTrainer(...)
trainer.train()

Inspecting damaged pairs:

{'damaged_sentence': 'Grim jaws are.',
 'original_sentence': 'Grim faces and hardened jaws are not people-friendly.'}

The result: 0.70 on STSB. Impressive given zero labels.

10.3 Domain Adaptation

Unsupervised methods are usually beaten by supervised ones, but they shine for domain adaptation: mapping an existing embedding model into a new domain (medical, legal, financial) where you have unlabeled in-domain text but no labeled pairs.

Adaptive pretraining recipe: first run TSDAE (or MLM) on your unlabeled in-domain corpus, then fine-tune the resulting model on a labeled set. Even an out-of-domain labeled set works because step 1 already adapted the model to your domain.

You can combine TSDAE pretraining with MNR-loss fine-tuning or Augmented SBERT for the strongest result.
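A minimal sketch of that two-step recipe. The "tsdae_model" path is an assumption (the output directory saved from the TSDAE run above), and the labeled pairs are assumed to be prepared as in Section 7.2.

from sentence_transformers import SentenceTransformer, losses

# Step 1 already happened: TSDAE pretraining on unlabeled in-domain text (Section 10.2),
# saved to "tsdae_model" (assumed path)
domain_model = SentenceTransformer("tsdae_model")

# Step 2: supervised fine-tuning with MNR loss on labeled (anchor, positive) pairs,
# in- or out-of-domain
train_loss = losses.MultipleNegativesRankingLoss(model=domain_model)
# ... then reuse the SentenceTransformerTrainer setup from Section 5.4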


Section 11: Score Comparison

Setup                                     Loss                   Data                        STSB pearson_cosine
BERT-base from scratch                    Softmax                50K MNLI                    0.59
BERT-base from scratch                    Cosine similarity      50K MNLI                    0.72
BERT-base from scratch                    MNR                    16.8K entailment triplets   0.80
MiniLM-L6-v2 fine-tune                    MNR                    50K MNLI                    0.85
Augmented SBERT (gold 10K + silver 40K)   Cosine similarity      mixed                       0.71
TSDAE (unsupervised)                      DenoisingAutoEncoder   50K unlabeled               0.70

Loss function and data quality matter more than throwing tokens at the model.


Summary

  • An embedding model projects text into a vector space where "near" means "similar in the way you trained it" (semantic, sentiment, intent, and so on).
  • Contrastive learning is the dominant training paradigm. Feed the model similar and dissimilar pairs so it learns what makes things alike and different.
  • Cross-encoders are accurate but slow and don't produce reusable embeddings. SBERT (a bi-encoder) uses Siamese BERTs to produce embeddings fast, making it the standard for retrieval and classification.
  • NLI datasets like MNLI translate cleanly into contrastive training: entailment is a positive pair, contradiction is a negative pair.
  • Loss functions matter a lot. On the same base BERT, softmax (0.59) → cosine similarity (0.72) → MNR / InfoNCE (0.80).
  • MNR loss uses in-batch negatives. Bigger batches mean a harder learning task and better embeddings.
  • Negative quality matters more than quantity: easy < semi-hard < hard (manually-labeled or generated).
  • Fine-tune a pretrained sentence-transformers model (e.g., all-MiniLM-L6-v2) instead of training from scratch. Usually fewer steps to a much better model.
  • Augmented SBERT bootstraps with a cross-encoder labeling extra pairs, useful when labeled data is small.
  • TSDAE trains unsupervised by reconstructing damaged sentences. Great for domain adaptation when labeled data doesn't exist.
  • Evaluate via STSB for quick iteration. Use MTEB for breadth across 58 datasets and 112 languages.