Hands-On Large Language Models by Jay Alammar & Maarten Grootendorst

Chapter 2: Tokens and Embeddings

Introduction

Tokens and embeddings are the two central concepts behind using LLMs. Tokens are the small chunks of text the model actually sees as input and produces as output. Embeddings are the numeric vectors the model computes on. Both are shaped by decisions made before the model ever runs: the tokenizer's algorithm, parameters, and training data on one side, and the model's learned token embedding matrix on the other. This chapter walks through tokenization methods (BPE, WordPiece, SentencePiece, byte-level), compares real-world tokenizers from BERT, GPT-2, GPT-4, StarCoder2, Galactica, and Phi-3, then moves to token and text embeddings, and ends with word2vec, contrastive training, and embedding-based song recommendations.


Section 1: LLM Tokenization

LLMs generate text one token at a time, and they consume their input as tokens too. Before any prompt reaches the model, a tokenizer breaks it into tokens.

A tokenizer assigns a unique integer ID to each token. The model only ever sees IDs.

1.1 Loading and Tokenizing with Phi-3

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda", torch_dtype="auto", trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|>"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

generation_output = model.generate(input_ids=input_ids, max_new_tokens=20)
print(tokenizer.decode(generation_output[0]))

The model receives input_ids (a tensor of integers), not the raw string:

tensor([[ 1, 14350, 385, 4876, 27746, 5281, 304, 19235, 363, 278, 25305, 293,
        16423, 292, 286, 728, 481, 29889, 12027, 7420, 920, 372, 9559, 29889, 32001]])
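To see exactly how the prompt was split, each ID can be decoded on its own with a short loop over the tensor above:

for id in input_ids[0]:
    print(tokenizer.decode(id))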

Decoding each ID:

<s>  Write  an  email  apolog  izing  to  Sarah  for  the  trag  ic  garden  ing
m  ish  ap  .  Exp  lain  how  it  happened  .  <|assistant|>

A few things to notice. ID 1 (<s>) is a special beginning-of-text token. Whole-word tokens like Write, an, and email coexist with word parts like apolog, izing, trag, and ic. Punctuation tokens stand alone. There is no explicit space token: tokens have an invisible "this attaches to the previous token" marker, and tokens without that marker are assumed to have a leading space.

The output side is symmetric. The tokenizer's decode turns generated IDs back into text. The model emitted 3323 ('Sub') and 622 ('ject'), which combine to Subject, then 29901 (':').
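To look only at what the model produced, one option is to slice off the prompt tokens before decoding (a small sketch reusing the variables from the code above):

# Keep only the newly generated IDs, then decode them one at a time
new_token_ids = generation_output[0][len(input_ids[0]):]
for id in new_token_ids:
    print(id.item(), tokenizer.decode(id))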

1.2 What Determines Tokenizer Behavior?

Three design choices govern a tokenizer's behavior. The first is method, like byte pair encoding (BPE) for GPT or WordPiece for BERT. The second is parameters, including vocabulary size, special tokens, and capitalization handling. The third is training data: the same method and parameters trained on different data will produce a different tokenizer (English versus multilingual versus code).

Tokenizers operate on both ends of the model: they encode the input prompt into token IDs and decode the generated IDs back into text.

1.3 Word vs. Subword vs. Character vs. Byte Tokens

Granularity | Pros | Cons
Word (word2vec era) | Intuitive | Can't handle OOV words; bloated vocab (apology/apologize/apologetic)
Subword (modern default) | Handles OOV by composing pieces; expressive vocab (apolog + -y/-ize/-ist) | Slightly more model effort
Character | Robust to any new word | Very long sequences eat context length; harder modeling
Byte (CANINE, ByT5) | Truly tokenization-free; competitive in multilingual | Even longer sequences

Subword tokens average around 3 chars per token, so a 1024-context model fits roughly 3x more text than character-level. Some subword tokenizers like GPT-2 and RoBERTa keep bytes as a fallback. That doesn't make them byte-level tokenizers. Bytes are only used when nothing else matches.


Section 2: Comparing Trained LLM Tokenizers

A side-by-side test on this string:

English and CAPITALIZATION
🎡鸟
show_tokens False None elif == >= else: two tabs:" " Three tabs: "   "
12.0*50=600

This stresses capitalization, non-English (emoji and Chinese), code keywords and whitespace, and digits.
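A small helper makes the comparison easy to reproduce. This is a plain-text stand-in for the chapter's show_tokens helper (the original also colors each token):

from transformers import AutoTokenizer

def show_tokens(sentence, tokenizer_name):
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids
    # Print each token between brackets so subword splits and whitespace are visible
    print(" ".join(f"[{tokenizer.decode(t)}]" for t in token_ids))

text = "English and CAPITALIZATION\n12.0*50=600"  # abridged version of the test string above
show_tokens(text, "bert-base-uncased")
show_tokens(text, "gpt2")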

Tokenizer | Method | Vocab | Notable
BERT base (uncased), 2018 | WordPiece | 30,522 | Lowercases everything; ## prefix marks subword continuation; emoji/Chinese → [UNK]; newlines lost
BERT base (cased), 2018 | WordPiece | 28,996 | Keeps case, but CAPITALIZATION becomes 8 tokens (CA ##PI ##TA ##L ##I ##Z ##AT ##ION)
GPT-2, 2019 | BPE | 50,257 | Preserves newlines and case; emoji split into multiple byte tokens that decode back; whitespace tokens exist
Flan-T5, 2022 | SentencePiece (BPE/unigram) | 32,100 | No whitespace tokens (bad for code); emoji/Chinese → <unk>
GPT-4, 2023 | BPE | ~100,000 | Single token for runs of up to 83 spaces; elif is one token; CAPITALIZATION is just 2 tokens
StarCoder2, 2024 | BPE | 49,152 | Code-specialized; each digit is its own token (so 600 → 6 0 0), giving better number representation; <filename>, <reponame>, <gh_stars>, and fill-in-the-middle tokens
Galactica | BPE | 50,000 | Science-specialized; [START_REF]/[END_REF] for citations, <work> for chain-of-thought, single tokens for runs of tabs
Phi-3 / Llama 2 | BPE | 32,000 | Phi-3 reuses the Llama 2 tokenizer and adds chat tokens (<|user|>, <|assistant|>, <|system|>)

Why the digit-per-token trick matters: GPT-2 represents 870 as one token but 871 as 8 + 71. That asymmetry confuses the model on math. StarCoder2/Galactica's per-digit tokens fix this.
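A quick way to check this asymmetry yourself, using the standard gpt2 tokenizer from transformers (the exact splits depend on the vocabulary):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
for number in ["870", "871"]:
    ids = tok(number).input_ids
    print(number, "->", [tok.decode(i) for i in ids])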

2.1 Tokenizer Properties: Three Groups of Design Choices

Method covers the algorithm itself: BPE, WordPiece, SentencePiece, and so on, each with its own approach to selecting the best vocabulary.

Parameters include vocabulary size (30K and 50K are common, with 100K+ trending), special tokens (beginning- and end-of-text, padding, unknown, CLS, MASK, plus domain tokens like Galactica's <work> and [START_REF]), and capitalization (lowercase everything, or keep case at the cost of vocab space).

The domain of the training data also matters. Code-focused tokenizers handle indentation as single tokens, science models add citation tokens, and multilingual tokenizers need balanced scripts. For Python code, a text-tuned tokenizer wastes tokens on every space of indentation, while a code-tuned one collapses runs of spaces into single tokens.
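A rough way to see the difference is to count tokens for the same indented snippet under a text-oriented and a code-oriented tokenizer. The model names here are illustrative, and exact counts depend on the tokenizer versions:

from transformers import AutoTokenizer

code = "def fib(n):\n    if n < 2:\n        return n\n    return fib(n - 1) + fib(n - 2)"
for name in ["gpt2", "bigcode/starcoder2-3b"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, len(tok(code).input_ids), "tokens")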


Section 3: Token Embeddings

Once tokens are IDs, the next problem is finding the best numeric representation so the model can compute on them. That's exactly what embeddings are.

3.1 The Embedding Matrix

A pretrained language model holds an embedding vector for each token in its tokenizer's vocabulary. That collection is the model's embedding matrix. Vectors start randomly initialized, and training assigns them useful values.
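For the Phi-3 model loaded in Section 1, the matrix can be inspected directly; get_input_embeddings is the standard transformers accessor for it:

embeddings = model.get_input_embeddings()
print(embeddings.weight.shape)  # torch.Size([vocab_size, embedding_dim])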

Because the model is trained against its tokenizer's vocabulary, you can't swap a different tokenizer onto an already-trained model without retraining.

Side note on RAG: as model coherence improved, users started trusting LLMs as fact engines. They aren't reliable search engines on their own. Retrieval-augmented generation (RAG) combines search with LLMs to ground outputs (Chapter 8).

3.2 Static vs. Contextualized Token Embeddings

Word2vec gives a static embedding: bank always maps to the same vector. LLMs produce contextualized embeddings, where the embedding of a token depends on the surrounding sequence.

These contextualized vectors power named-entity recognition, extractive summarization, and even drive image-generation systems like DALL·E, Midjourney, and Stable Diffusion.

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall")

tokens = tokenizer('Hello world', return_tensors='pt')
output = model(**tokens)[0]
output.shape   # torch.Size([1, 4, 384])

DeBERTa v3 is small, efficient, and one of the best token-embedding models at the time of writing. The output shape [1, 4, 384] reads as 1 batch, 4 tokens ([CLS], Hello, world, [SEP]), 384-dim vectors.
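To confirm which four tokens those are, decode each input ID (reusing the tokens dict from the code above):

for id in tokens['input_ids'][0]:
    print(tokenizer.decode(id))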

The transformation from token ID to static embedding is the very first step inside a language model.


Section 4: Text Embeddings (Sentences and Documents)

Sometimes you need a single vector for an entire sentence, paragraph, or document. Text embedding models produce that.

A common cheap approach is to average all token embeddings, but high-quality models are trained specifically for the task. The sentence-transformers package wraps these:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
vector = model.encode("Best movie ever!")
vector.shape  # (768,)
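A typical use is comparing two texts by cosine similarity; util.cos_sim ships with sentence-transformers (the sentences below are illustrative):

from sentence_transformers import util

# model is the SentenceTransformer loaded above
embeddings = model.encode(["Best movie ever!", "I loved this film."])
print(util.cos_sim(embeddings[0], embeddings[1]))  # higher score = closer in meaning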

These vectors power categorization, semantic search, and RAG (Part II).


Section 5: Word Embeddings Beyond LLMs

Embeddings are useful in any domain where you can define meaningful "neighbors": recommender engines, robotics, multimodal models. We'll revisit contrastive training in Chapter 10.

5.1 Using Pretrained word2vec / GloVe

import gensim.downloader as api
model = api.load("glove-wiki-gigaword-50")  # 66MB GloVe, 50-dim
model.most_similar([model['king']], topn=11)

Returns: prince (0.82), queen (0.78), ii (0.77), emperor (0.77), son (0.77), uncle (0.76), kingdom (0.75), throne (0.75), brother (0.75), ruler (0.74). Distance in this space encodes semantic relatedness.
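The same vectors support the classic analogy test; gensim's most_similar accepts positive and negative word lists (a quick sketch with the GloVe vectors loaded above):

# king - man + woman: the nearest remaining neighbor is typically "queen"
model.most_similar(positive=['king', 'woman'], negative=['man'], topn=3)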

5.2 The word2vec Algorithm

Word2vec is trained on a sliding window over text. Take the sentence "Thou shalt not make a machine in the likeness of a human mind". With window size 2, the central word and its 2 left and 2 right neighbors generate positive training pairs, as in the sketch below.
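A minimal sketch of that pair generation in plain Python (variable names are illustrative):

sentence = "Thou shalt not make a machine in the likeness of a human mind".lower().split()
window = 2

positive_pairs = []
for i, word in enumerate(sentence):
    # Pair the central word with up to `window` neighbors on each side
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if i != j:
            positive_pairs.append((word, sentence[j]))

print(positive_pairs[:4])
# [('thou', 'shalt'), ('thou', 'not'), ('shalt', 'thou'), ('shalt', 'not')]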

The task is simple: given a (word, neighbor) pair, predict 1 if they tend to appear together, 0 otherwise.

There's a problem, though. If all training labels are 1, the model can cheat by always outputting 1. Negative sampling fixes this by randomly pairing words that aren't neighbors and labeling them 0.

This is noise-contrastive estimation under the hood. The two big ideas of word2vec are skip-gram (pick neighbors from the sliding window) and negative sampling (random non-neighbors as 0-labeled examples).

Then build a vocab_size × embedding_dim matrix of randomly initialized vectors.

Training looks like this: each example takes two embeddings, predicts neighbor-or-not, and updates the embeddings via gradient descent.
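A bare-bones numpy sketch of that update (illustrative only; real word2vec keeps two embedding matrices, one for center words and one for neighbors, and makes many passes over the data):

import numpy as np

vocab_size, embedding_dim = 10_000, 50
rng = np.random.default_rng(0)
word_emb = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))     # center-word vectors
context_emb = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))  # neighbor vectors

def train_step(word_id, context_id, label, lr=0.025):
    w = word_emb[word_id].copy()
    c = context_emb[context_id].copy()
    score = 1 / (1 + np.exp(-(w @ c)))  # predicted probability that the pair are neighbors
    error = score - label
    word_emb[word_id] -= lr * error * c        # gradient step on the center-word vector
    context_emb[context_id] -= lr * error * w  # gradient step on the neighbor vector

train_step(word_id=3, context_id=7, label=1)   # positive (neighbor) pair
train_step(word_id=3, context_id=42, label=0)  # negative-sampled pair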

This two-vectors-and-predict-relation pattern is one of ML's most powerful templates. It shows up in sentence embeddings and retrieval (Chapter 10), and in cross-modal alignment (image + caption) for image generation (Chapter 9).


Section 6: Embeddings for Recommendation Systems

6.1 Songs-as-Tokens, Playlists-as-Sentences

If you treat each song as a token and each playlist as a sentence, you can run word2vec verbatim and get song embeddings that capture co-occurrence in human-curated playlists.

Recommendations for Michael Jackson's "Billie Jean":

Song | Artist
Kiss | Prince & The Revolution
Wanna Be Startin' Somethin' | Michael Jackson
The Way You Make Me Feel | Michael Jackson
Holiday | Madonna
Don't Stop 'Til You Get Enough | Michael Jackson

For 2Pac's "California Love":

Song | Artist
If I Ruled the World | Nas
I'll Be Missing You | Puff Daddy & The Family
Hate It or Love It | The Game
Hypnotize | The Notorious B.I.G.
Drop It Like It's Hot | Snoop Dogg

6.2 Building It

import pandas as pd
from urllib import request
from gensim.models import Word2Vec

# Load playlist + song-metadata files
data = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/train.txt')
lines = data.read().decode("utf-8").split('\n')[2:]
playlists = [s.rstrip().split() for s in lines if len(s.split()) > 1]

songs_file = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt')
songs = [s.rstrip().split('\t') for s in songs_file.read().decode("utf-8").split('\n')]
songs_df = pd.DataFrame(data=songs, columns=['id', 'title', 'artist']).set_index('id')

# Train Word2Vec β€” songs as tokens
model = Word2Vec(playlists, vector_size=32, window=20, negative=50, min_count=1, workers=4)

# Recommend like word similarity
model.wv.most_similar(positive='2172')

Song 2172 is Metallica's "Fade To Black". Its neighbors are Van Halen, Dio, Guns N' Roses, and Judas Priest, which is to say all heavy metal and hard rock. The same algorithm that learned king ≈ queen learns Metallica ≈ Van Halen purely from playlist co-occurrence.
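To map neighbor IDs back to titles and artists, here is a small helper in the spirit of the chapter's recommendation function (a sketch; it assumes the rows of song_hash.txt appear in id order, which holds for this dataset):

def print_recommendations(song_id, topn=5):
    # most_similar returns (song_id, score) pairs; keep just the ids
    neighbor_ids = [sid for sid, _ in model.wv.most_similar(positive=str(song_id), topn=topn)]
    # Look up title and artist by position, since row i of songs_df corresponds to id i
    return songs_df.iloc[[int(sid) for sid in neighbor_ids]]

print_recommendations(2172)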


Summary

  • Tokenization turns text into integer IDs the model can consume. The inverse (decode) turns generated IDs back into text.
  • Tokenizer behavior is dictated by method (BPE, WordPiece, SentencePiece, byte-level), parameters (vocab size, special tokens, capitalization), and training data (text vs. code vs. science vs. multilingual).
  • Subword tokenization is the modern default. It handles OOV words by composing parts and gives a more expressive vocabulary than word-level. Character and byte tokenization are robust but eat context length.
  • Modern tokenizers add domain-specific tokens. GPT-4 packs runs of whitespace into single tokens for code. StarCoder2 splits numbers into per-digit tokens for math. Galactica adds citation and reasoning tokens. Phi-3 reuses the Llama 2 tokenizer and adds chat-role tokens.
  • A model holds an embedding matrix: one vector per vocabulary token. The token-ID-to-embedding lookup is the model's first internal step.
  • LLMs produce contextualized token embeddings. The same word gets different vectors in different contexts. These power NER, summarization, classification, and even multimodal generation.
  • Text embeddings collapse a sentence or document into one vector, used for clustering, search, and RAG.
  • word2vec trains via skip-gram + negative sampling, the predict-if-related template. The same template generalizes to song recommendations (playlists as sentences) and beyond.