Chapter 2: Tokens and Embeddings
Introduction
Tokens and embeddings are the two central concepts behind using LLMs. Tokens are the small chunks of text the model actually sees as input and produces as output. Embeddings are the numeric vectors the model computes on. Both are fixed before inference ever runs: tokens by the tokenizer's algorithm, parameters, and training data; embeddings by the model's learned vocabulary embedding matrix. This chapter walks through tokenization methods (BPE, WordPiece, SentencePiece, byte-level), compares real-world tokenizers from BERT, GPT-2, Flan-T5, GPT-4, StarCoder2, Galactica, and Phi-3, then moves to token and text embeddings, and ends with word2vec, contrastive training, and embedding-based song recommendations.
Section 1: LLM Tokenization
LLMs generate one token at a time, and they also consume tokens. Before any prompt reaches the model, a tokenizer breaks it into tokens.
A tokenizer assigns a unique integer ID to each token. The model only ever sees IDs.
1.1 Loading and Tokenizing with Phi-3
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and its matching tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Tokenize the prompt into integer IDs and move them to the GPU
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|>"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# Generate up to 20 new tokens and decode them back into text
generation_output = model.generate(input_ids=input_ids, max_new_tokens=20)
print(tokenizer.decode(generation_output[0]))
The model receives input_ids (a tensor of integers), not the raw string:
tensor([[ 1, 14350, 385, 4876, 27746, 5281, 304, 19235, 363, 278, 25305, 293,
16423, 292, 286, 728, 481, 29889, 12027, 7420, 920, 372, 9559, 29889, 32001]])
Decoding each ID:
<s> Write an email apolog izing to Sarah for the trag ic garden ing
m ish ap . Exp lain how it happened . <|assistant|>
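That token-by-token view comes from decoding each ID individually with the tokenizer loaded above:

for token_id in input_ids[0]:
    print(tokenizer.decode(token_id))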
A few things to notice. ID 1 (<s>) is a special beginning-of-text token. Whole-word tokens like Write, an, and email coexist with word parts like apolog, izing, trag, and ic. Punctuation tokens stand alone. There is no explicit space token: word-initial tokens carry a hidden marker meaning "preceded by a space", while tokens without it attach directly to the previous token.
The output side is symmetric. The tokenizer's decode turns generated IDs back into text. The model emitted 3323 ('Sub') and 622 ('ject'), which combine to Subject, then 29901 (':').
1.2 What Determines Tokenizer Behavior?
Three design choices govern a tokenizer's behavior. The first is method, like byte pair encoding (BPE) for GPT or WordPiece for BERT. The second is parameters, including vocabulary size, special tokens, and capitalization handling. The third is training data: the same method and parameters trained on different data will produce a different tokenizer (English versus multilingual versus code).
Tokenizers operate on both ends of the model: they encode the input prompt into IDs on the way in, and decode the generated IDs back into text on the way out.
1.3 Word vs. Subword vs. Character vs. Byte Tokens
| Granularity | Pros | Cons |
|---|---|---|
| Word (word2vec era) | Intuitive | Can't handle OOV words; bloated vocab (apology/apologize/apologetic) |
| Subword (modern default) | Handles OOV by composing pieces; expressive vocab (apolog + -y/-ize/-ist) | Slightly more model effort |
| Character | Robust to any new word | Very long sequences eat context length; harder modeling |
| Byte (CANINE, ByT5) | Truly tokenization-free; competitive in multilingual | Even longer sequences |
Subword tokens average around 3 chars per token, so a 1024-context model fits roughly 3x more text than character-level. Some subword tokenizers like GPT-2 and RoBERTa keep bytes as a fallback. That doesn't make them byte-level tokenizers. Bytes are only used when nothing else matches.
Section 2: Comparing Trained LLM Tokenizers
A side-by-side test on this string:
English and CAPITALIZATION
🎵 鸟
show_tokens False None elif == >= else: two tabs:"\t\t" Three tabs: "\t\t\t"
12.0*50=600
This stresses capitalization, non-English (emoji and Chinese), code keywords and whitespace, and digits.
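The rows below can be reproduced with a small helper along these lines (a minimal sketch; tab characters are written as \t escapes):

from transformers import AutoTokenizer

text = """
English and CAPITALIZATION
🎵 鸟
show_tokens False None elif == >= else: two tabs:"\t\t" Three tabs: "\t\t\t"
12.0*50=600
"""

def show_tokens(sentence, tokenizer_name):
    # Print each token separately so the splits are visible
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids
    print(tokenizer_name, "->", " | ".join(tokenizer.decode(t) for t in token_ids))

show_tokens(text, "bert-base-uncased")
show_tokens(text, "gpt2")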
| Tokenizer | Method | Vocab | Notable |
|---|---|---|---|
| BERT base (uncased), 2018 | WordPiece | 30,522 | Lowercases everything; ## prefix marks subword continuation; emoji/Chinese → [UNK]; newlines lost |
| BERT base (cased), 2018 | WordPiece | 28,996 | Keeps case but CAPITALIZATION becomes 8 tokens (CA ##PI ##TA ##L ##I ##Z ##AT ##ION) |
| GPT-2, 2019 | BPE | 50,257 | Preserves newlines and case; emoji split into multiple bytes that decode back; whitespace tokens exist |
| Flan-T5, 2022 | SentencePiece (BPE/unigram) | 32,100 | No whitespace tokens (bad for code); emoji/Chinese → <unk> |
| GPT-4, 2023 | BPE | ~100,000 | Single token for runs of up to 83 spaces; elif is one token; CAPITALIZATION is just 2 tokens |
| StarCoder2, 2024 | BPE | 49,152 | Code-specialized; each digit is its own token (so 600 → 6 0 0) for better number representation; <filename> / <reponame> / <gh_stars> / FIM (fill-in-the-middle) tokens |
| Galactica | BPE | 50,000 | Science-specialized; [START_REF]/[END_REF] for citations, <work> for chain-of-thought, single tokens for tab runs |
| Phi-3 / Llama 2 | BPE | 32,000 | Reuses Llama 2 tokenizer + chat tokens (<|user|>, <|assistant|>, <|system|>) |
Why the digit-per-token trick matters: GPT-2 represents 870 as one token but 871 as 8 + 71. That asymmetry confuses the model on math. StarCoder2's and Galactica's per-digit tokens fix this.
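You can check number splits yourself with any BPE tokenizer library; a quick sketch using tiktoken's GPT-2 encoding (assumes tiktoken is installed):

import tiktoken

enc = tiktoken.get_encoding("gpt2")
for n in ["870", "871"]:
    ids = enc.encode(n)
    print(n, "->", [enc.decode([i]) for i in ids])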
2.1 Tokenizer Properties: Three Groups of Design Choices
Method covers the algorithm itself: BPE, WordPiece, SentencePiece, and so on, each with its own approach to selecting the best vocabulary.
Parameters include vocabulary size (30K and 50K are common, with 100K+ trending), special tokens (beginning- and end-of-text, padding, unknown, CLS, MASK, plus domain tokens like Galactica's <work> and [START_REF]), and capitalization (lowercase everything, or keep case at the cost of vocab space).
The domain of the training data also matters. Code-focused tokenizers handle indentation as single tokens, science models add citation tokens, and multilingual tokenizers need balanced scripts. For Python code, a text-tuned tokenizer wastes tokens on every space of indentation, while a code-tuned one collapses runs of spaces into single tokens.
Section 3: Token Embeddings
Once tokens are IDs, the next problem is finding the best numeric representation so the model can compute on them. That's exactly what embeddings are.
3.1 The Embedding Matrix
A pretrained language model holds an embedding vector for each token in its tokenizer's vocabulary. That collection is the model's embedding matrix. Vectors start randomly initialized, and training assigns them useful values.
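You can inspect that matrix directly; a quick look using the Phi-3 model loaded in Section 1 (get_input_embeddings is the standard transformers accessor):

# One row per vocabulary token; the width is the model's hidden dimension
embedding_layer = model.get_input_embeddings()
print(embedding_layer.weight.shape)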
Because the pretrained model and tokenizer co-evolve, you can't swap a different tokenizer onto an already-trained model without retraining: the rows of the embedding matrix are tied to that tokenizer's token IDs.
Side note on RAG: as model coherence improved, users started trusting LLMs as fact engines. They aren't reliable search engines on their own. Retrieval-augmented generation (RAG) combines search with LLMs to ground outputs (Chapter 8).
3.2 Static vs. Contextualized Token Embeddings
Word2vec gives a static embedding: bank always maps to the same vector. LLMs produce contextualized embeddings, where the embedding of a token depends on the surrounding sequence.
These contextualized vectors power named-entity recognition, extractive summarization, and even drive image-generation systems like DALL·E, Midjourney, and Stable Diffusion.
from transformers import AutoModel, AutoTokenizer

# Load a tokenizer and a contextualized-embedding model
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall")

# Tokenize the text and run it through the model
tokens = tokenizer('Hello world', return_tensors='pt')
output = model(**tokens)[0]
output.shape  # torch.Size([1, 4, 384])
DeBERTa v3 is small, efficient, and one of the best token-embedding models at the time of writing. The output shape [1, 4, 384] reads as 1 batch, 4 tokens ([CLS], Hello, world, [SEP]), 384-dim vectors.
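To confirm which token each of the four vectors belongs to, decode the IDs alongside the output rows:

# Pair each input token with its contextualized vector
for token_id, vector in zip(tokens['input_ids'][0], output[0]):
    print(tokenizer.decode(int(token_id)), tuple(vector.shape))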
The lookup from token ID to static embedding is the very first step inside a language model.
Section 4: Text Embeddings (Sentences and Documents)
Sometimes you need a single vector for an entire sentence, paragraph, or document. Text embedding models produce that.
A common cheap approach is to average all token embeddings, but high-quality models are trained specifically for the task. The sentence-transformers package wraps these:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
vector = model.encode("Best movie ever!")
vector.shape # (768,)
These vectors power categorization, semantic search, and RAG (Part II).
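Semantic similarity then reduces to comparing those vectors, usually by cosine similarity. A minimal sketch using sentence-transformers' util.cos_sim (the example sentences are illustrative):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
embeddings = model.encode([
    "Best movie ever!",
    "I loved this film.",
    "The stock market fell.",
])

# Cosine similarity of the first text against the other two
print(util.cos_sim(embeddings[0], embeddings[1]))  # higher: similar meaning
print(util.cos_sim(embeddings[0], embeddings[2]))  # lower: unrelated topic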
Section 5: Word Embeddings Beyond LLMs
Embeddings are useful in any domain where you can define meaningful "neighbors": recommender engines, robotics, multimodal models. We'll revisit contrastive training in Chapter 10.
5.1 Using Pretrained word2vec / GloVe
import gensim.downloader as api
model = api.load("glove-wiki-gigaword-50") # 66MB GloVe, 50-dim
model.most_similar([model['king']], topn=11)
Returns, after king itself: prince (0.82), queen (0.78), ii (0.77), emperor (0.77), son (0.77), uncle (0.76), kingdom (0.75), throne (0.75), brother (0.75), ruler (0.74). Distance in this space encodes semantic relatedness.
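The same space supports rough arithmetic on meanings; the classic analogy check, run against the model loaded above:

# king - man + woman lands near queen
model.most_similar(positive=['king', 'woman'], negative=['man'], topn=3)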
5.2 The word2vec Algorithm
Word2vec is trained on a sliding window over text. Take the sentence "Thou shalt not make a machine in the likeness of a human mind". With window size 2, the central word and its 2 left and 2 right neighbors generate positive training pairs, as in the sketch below.
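A minimal sketch of pair generation, assuming a plain whitespace split:

# Slide a window over the sentence and emit (center, neighbor) pairs
sentence = "Thou shalt not make a machine in the likeness of a human mind"
words = sentence.lower().split()
window = 2

pairs = []
for i, center in enumerate(words):
    for j in range(max(0, i - window), min(len(words), i + window + 1)):
        if j != i:
            pairs.append((center, words[j]))

print(pairs[:4])
# [('thou', 'shalt'), ('thou', 'not'), ('shalt', 'thou'), ('shalt', 'not')]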
The task is simple: given a (word, neighbor) pair, predict 1 if they tend to appear together, 0 otherwise.
There's a problem, though. If all training labels are 1, the model can cheat by always outputting 1. Negative sampling fixes this by randomly pairing words that aren't neighbors and labeling them 0.
This is noise-contrastive estimation under the hood. The two big ideas of word2vec are skip-gram (pick neighbors from the sliding window) and negative sampling (random non-neighbors as 0-labeled examples).
Then build a vocab_size × embedding_dim matrix of randomly initialized vectors.
Training looks like this: each example takes two embeddings, predicts neighbor-or-not, and updates the embeddings via gradient descent.
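A toy sketch of that update, assuming the standard two-matrix formulation (one matrix for center words, one for context words; the sizes are illustrative):

import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10_000, 50
word_emb = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))
ctx_emb = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(word_id, neighbor_id, label, lr=0.025):
    # Score the pair by the dot product of its two embeddings
    w, c = word_emb[word_id], ctx_emb[neighbor_id]
    pred = sigmoid(w @ c)
    # Logistic-loss gradient; nudge both vectors toward the label
    grad = pred - label
    word_emb[word_id] = w - lr * grad * c
    ctx_emb[neighbor_id] = c - lr * grad * w

train_step(42, 1337, label=1)  # positive pair from the window
train_step(42, 905, label=0)   # negative pair from random sampling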
This two-vectors-and-predict-relation pattern is one of ML's most powerful templates. It shows up in sentence embeddings and retrieval (Chapter 10), and in cross-modal alignment (image + caption) for image generation (Chapter 9).
Section 6: Embeddings for Recommendation Systems
6.1 Songs-as-Tokens, Playlists-as-Sentences
If you treat each song as a token and each playlist as a sentence, you can run word2vec verbatim and get song embeddings that capture co-occurrence in human-curated playlists.
Recommendations for Michael Jackson's "Billie Jean":
| Song | Artist |
|---|---|
| Kiss | Prince & The Revolution |
| Wanna Be Startin' Somethin' | Michael Jackson |
| The Way You Make Me Feel | Michael Jackson |
| Holiday | Madonna |
| Don't Stop 'Til You Get Enough | Michael Jackson |
For 2Pac's "California Love":
| Song | Artist |
|---|---|
| If I Ruled the World | Nas |
| I'll Be Missing You | Puff Daddy & The Family |
| Hate It or Love It | The Game |
| Hypnotize | The Notorious B.I.G. |
| Drop It Like It's Hot | Snoop Dogg |
6.2 Building It
import pandas as pd
from urllib import request
from gensim.models import Word2Vec
# Load playlist + song-metadata files
data = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/train.txt')
lines = data.read().decode("utf-8").split('\n')[2:]
playlists = [s.rstrip().split() for s in lines if len(s.split()) > 1]
songs_file = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt')
songs = [s.rstrip().split('\t') for s in songs_file.read().decode("utf-8").split('\n')]
songs_df = pd.DataFrame(data=songs, columns=['id', 'title', 'artist']).set_index('id')
# Train Word2Vec β songs as tokens
model = Word2Vec(playlists, vector_size=32, window=20, negative=50, min_count=1, workers=4)
# Recommend like word similarity
model.wv.most_similar(positive='2172')
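To map recommended IDs back to titles, a small helper (a sketch; it assumes the IDs in song_hash.txt line up exactly with the playlist IDs):

def print_recommendations(song_id, topn=5):
    # most_similar returns (song_id, similarity) pairs; look up their metadata
    similar = model.wv.most_similar(positive=str(song_id), topn=topn)
    return songs_df.loc[[sid for sid, _ in similar]]

print_recommendations(2172)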
Song 2172 is Metallica's "Fade To Black". Its neighbors are Van Halen, Dio, Guns N' Roses, and Judas Priest, which is to say all heavy metal and hard rock. The same algorithm that learned king β queen learns Metallica β Van Halen purely from playlist co-occurrence.
Summary
- Tokenization turns text into integer IDs the model can consume. The inverse (decode) turns generated IDs back into text.
- Tokenizer behavior is dictated by method (BPE, WordPiece, SentencePiece, byte-level), parameters (vocab size, special tokens, capitalization), and training data (text vs. code vs. science vs. multilingual).
- Subword tokenization is the modern default. It handles OOV words by composing parts and gives a more expressive vocabulary than word-level. Character and byte tokenization are robust but eat context length.
- Modern tokenizers add domain-specific tokens. GPT-4 packs whitespace into single tokens for code. StarCoder2 splits digits per-token for math. Galactica adds citation and reasoning tokens. Phi-3 and Llama-2 add chat-role tokens.
- A model holds an embedding matrix: one vector per vocabulary token. The token-ID β embedding lookup is the model's first internal step.
- LLMs produce contextualized token embeddings. The same word gets different vectors in different contexts. These power NER, summarization, classification, and even multimodal generation.
- Text embeddings collapse a sentence or document into one vector, used for clustering, search, and RAG.
- word2vec trains via skip-gram + negative sampling, the predict-if-related template. The same template generalizes to song recommendations (playlists as sentences) and beyond.