Hands-On Large Language Models by Jay Alammar & Maarten Grootendorst

Chapter 3: Looking Inside Large Language Models

Introduction

With tokens and embeddings out of the way, this chapter dives into the mechanics of a Transformer LLM. We focus on decoder-only generative models: how they consume tokens, run a forward pass, sample next tokens, and what's actually inside a Transformer block (attention plus feedforward). We then survey recent architectural improvements: sparse attention, multi-query and grouped-query attention, Flash Attention, pre-normalization, RMSNorm, SwiGLU, and rotary position embeddings (RoPE).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load Phi-3-mini and its tokenizer onto the GPU
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda", torch_dtype="auto", trust_remote_code=True,
)

# Wrap both in a text-generation pipeline: greedy decoding, up to 50 new tokens
generator = pipeline(
    "text-generation", model=model, tokenizer=tokenizer,
    return_full_text=False, max_new_tokens=50, do_sample=False,
)

Section 1: Inputs, Outputs, and the Generation Loop

A Transformer LLM is, externally, just text-in / text-out.

But internally it does not generate the whole response at once. It generates one token at a time, with each token requiring one forward pass through the network.

After each pass, the new token is appended to the prompt and the longer prompt is re-fed to the model.

Models that consume their own earlier predictions to make later predictions are called autoregressive. That's how text-generation LLMs differ from representation models like BERT, which run a single non-autoregressive pass.

prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."
output = generator(prompt)
print(output[0]['generated_text'])
# Subject: My Sincere Apologies for the Gardening Mishap
# Dear Sarah, I hope this message finds you well. I am writing to express my deep
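
To make the loop concrete, here is a hand-rolled sketch of the same idea: run a forward pass, greedily pick the top token, append it to the input, and repeat. (It re-processes the whole prompt every step; the KV cache covered later avoids that.)

# Minimal autoregressive loop (sketch, no KV cache)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
for _ in range(10):
    with torch.no_grad():
        logits = model(input_ids).logits                 # [1, seq_len, vocab_size]
    next_id = logits[0, -1].argmax()                     # greedy: highest-probability token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)  # append and go again
print(tokenizer.decode(input_ids[0]))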

Section 2: The Components of the Forward Pass

There are three big pieces: tokenizer → stack of Transformer blocks → LM head.

The tokenizer maps text to token IDs. The model has an embedding matrix with one vector per vocab token. For a 50K vocab the matrix has 50K rows.
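
You can inspect both pieces directly on the loaded model; a quick sketch (the exact sizes show up in the layer listing below):

ids = tokenizer("Hello world", return_tensors="pt").input_ids.to("cuda")
print(ids)                                    # token IDs (plus any special tokens the tokenizer adds)
print(model.model.embed_tokens.weight.shape)  # [vocab_size, hidden_dim] = [32064, 3072]
print(model.model.embed_tokens(ids).shape)    # [1, num_tokens, 3072]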

The forward pass flows through the stack: each Transformer block in turn, then the LM head, which outputs a probability score for every token in the vocabulary.

The LM head is one of several possible "heads" you can attach to a Transformer stack. Others include sequence-classification and token-classification heads.

2.1 What Phi-3's Layers Look Like

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (qkv_proj): Linear(3072, 9216)
          (o_proj):  Linear(3072, 3072)
          (rotary_emb): Phi3RotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(3072, 16384)
          (down_proj):   Linear(8192, 3072)
          (activation_fn): SiLU()
        )
        (input_layernorm):         Phi3RMSNorm()
        (post_attention_layernorm): Phi3RMSNorm()
      )
    )
    (norm): Phi3RMSNorm()
  )
  (lm_head): Linear(3072, 32064, bias=False)
)
  • 32,064 token vocab, each token has a 3,072-dim embedding.
  • 32 stacked decoder layers, each holding self-attention + MLP + RMSNorm + dropout.
  • LM head: 3072 → 32064 (one logit per vocab token).
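
As a rough sanity check on those sizes, you can count parameters per component (a small sketch):

total = sum(p.numel() for p in model.parameters())
embed = model.model.embed_tokens.weight.numel()   # 32064 * 3072 embedding weights
head  = model.lm_head.weight.numel()              # 3072 * 32064 LM head weights
print(f"total: {total/1e9:.2f}B, embeddings: {embed/1e6:.0f}M, lm_head: {head/1e6:.0f}M")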

Section 3: Decoding — Choosing the Next Token

After the forward pass, the LM head produces a probability over all vocab tokens. Picking which one to emit is the decoding strategy.

Greedy decoding always picks the highest-probability token. Setting temperature=0 does this. It's deterministic but tends to feel stale and repetitive.

A better default is to sample, giving every token a chance proportional to its probability: if "Dear" has a 40% probability, it's picked 40% of the time. Other sampling parameters (top-k, top-p, temperature) are covered in Chapter 6.

prompt = "The capital of France is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

model_output = model.model(input_ids)        # before lm_head
lm_head_output = model.lm_head(model_output[0])  # shape [1, 6, 32064]

token_id = lm_head_output[0, -1].argmax(-1)  # highest-prob token at last position
tokenizer.decode(token_id)                   # → "Paris"

lm_head_output[0, -1] selects the last token position of the single batch item, which is where the next-token prediction lives.
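
To sample instead of taking the argmax, turn the same logits into probabilities and draw from them; a minimal sketch (temperature, top-k, and top-p are left to Chapter 6):

probs = torch.softmax(lm_head_output[0, -1].float(), dim=-1)  # probability over the 32,064-token vocab
sampled_id = torch.multinomial(probs, num_samples=1)          # pick proportionally to probability
tokenizer.decode(sampled_id)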


Section 4: Parallel Token Streams and Context Size

Each input token flows through its own stream of computation, with attention letting streams interact. The number of parallel streams equals the model's context length.

Each stream takes a vector in (token embedding plus positional info), gives a vector out, same dimensionality (the model dimension, e.g., 3,072 for Phi-3).

For text generation, only the output of the last stream is sent to the LM head to predict the next token. So why bother computing the others? Because each Transformer block's attention layer needs the earlier streams' intermediate vectors to compute the last stream's output.

You can verify shapes:

model_output[0].shape    # torch.Size([1, 6, 3072])
lm_head_output.shape     # torch.Size([1, 6, 32064])

[batch=1, tokens=6, dim=3072]: six tokens of input, each becoming a 3072-d vector after the Transformer stack.


Section 5: KV Cache — Skipping Redundant Work

When generating token N+1, tokens 1..N have already been processed in the previous step. Recomputing them every step is wasteful. The keys and values (KV) cache stores their attention key and value vectors so only the new last stream is computed each iteration.

Hugging Face Transformers enables it by default. To compare:

%%timeit -n 1
generation_output = model.generate(input_ids=input_ids, max_new_tokens=100, use_cache=True)
# T4 GPU: ~4.5 s

%%timeit -n 1
generation_output = model.generate(input_ids=input_ids, max_new_tokens=100, use_cache=False)
# T4 GPU: ~21.8 s

Around a 5x speedup. Even 4.5 seconds feels long when you're staring at a screen, which is why APIs stream tokens as they're generated rather than waiting for the full completion.
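
Streaming is easy to try locally with the TextStreamer utility in Transformers, which prints each token as soon as it is decoded (a small sketch):

from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True)   # print tokens as they are generated
_ = model.generate(input_ids=input_ids, max_new_tokens=100, streamer=streamer)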


Section 6: Inside the Transformer Block

A Transformer LLM is a stack of Transformer blocks. The original 2017 paper used 6, while larger LLMs use 100 or more.

Each block has two sequential parts. The attention layer pulls in relevant info from other tokens. The feedforward layer (FFN or MLP) is where most of the model's stored knowledge and computation lives.

6.1 The Feedforward Layer — Memorization and Interpolation

Feed "The Shawshank" into a base LLM and you'd expect "Redemption" as the most likely next word. The FFNs across all layers are where the model memorizes training-data patterns and does the interpolation that lets it generalize beyond exact memorized patterns.

Note: a chat-tuned LLM like GPT-4 won't just say "Redemption". It will explain the movie. That's because instruction tuning and RLHF have shifted its behavior away from raw next-token completion.
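
You can peek at what the loaded Phi-3 model actually ranks highest after "The Shawshank" (a quick sketch; since Phi-3 is instruction-tuned, its top candidates may not match a pure base model's):

ids = tokenizer("The Shawshank", return_tensors="pt").input_ids.to("cuda")
with torch.no_grad():
    logits = model(ids).logits[0, -1]                  # logits for the next token
top = torch.topk(logits, k=5)
print([tokenizer.decode(t) for t in top.indices])      # top-5 candidate continuations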

6.2 The Attention Layer — Context

Memorization alone isn't enough. Given "The dog chased the squirrel because it", the model has to work out whether "it" refers to the dog or the squirrel before it can predict what comes next. Attention is what pulls that information from earlier tokens into the current token's representation.


Section 7: How Attention Works

A simplified view of self-attention: at each position, take the input vector and produce an output vector that has been enriched with relevant info from earlier positions.

There are two steps. Relevance scoring asks how much each previous token matters to the current position. Information combination then takes a weighted aggregate of previous tokens by their scores.

7.1 Multiple Heads

Doing attention once captures one kind of pattern. Running it many times in parallel (multi-head attention) lets each head specialize on a different relationship.

7.2 Queries, Keys, Values — the Three Projection Matrices

Each attention head has three trained projection matrices: query, key, value. Inputs are multiplied by these to produce three corresponding matrices for the sequence.

The current position's row sits at the bottom of each matrix, with previous positions stacked above it.

Step 1, relevance scoring: multiply the current position's query by the keys matrix. Softmax the result so the scores sum to 1.

Step 2, combine: multiply each token's value vector by its score and sum. That's the attention output for the current position.

In a generative Transformer we're processing one position at a time, so attention is "concerned only with this one position and how info from previous positions flows into it."
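
A minimal sketch of those two steps for a single position and a single head in PyTorch (toy dimensions; the random matrices stand in for trained projections):

import torch

d = 8                                    # toy model dimension
x = torch.randn(5, d)                    # 5 positions, one vector each
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))   # query / key / value projections

q = x[-1] @ Wq                           # query for the current (last) position
K, V = x @ Wk, x @ Wv                    # keys and values for all positions

scores = torch.softmax(q @ K.T / d**0.5, dim=-1)   # step 1: relevance scoring
output = scores @ V                                # step 2: weighted sum of value vectors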


Section 8: Recent Improvements to the Transformer

The 2017 architecture is largely intact, but several enhancements are now standard.

8.1 More Efficient Attention

Attention is O(n²) in sequence length, which makes it the most expensive piece of the model. There are several ways to attack this.

Local / Sparse Attention

Limit attention to a window of nearby tokens.

GPT-3 interleaves full-attention blocks with sparse-attention blocks; using sparse attention everywhere would degrade quality too much.

The mask diagrams show, for each row (current token), which columns (previous tokens) it can attend to. Decoder-only models can only see previous tokens (autoregressive), while BERT, an encoder, attends in both directions.
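
A tiny sketch of the two mask patterns (1 = may attend); the window size here is just an illustrative choice:

import torch

n, window = 6, 3
causal = torch.tril(torch.ones(n, n))    # decoder-style: each token sees itself and all earlier tokens
local = causal * (torch.arange(n)[:, None] - torch.arange(n)[None, :] < window)  # only the last `window` tokens
print(causal)
print(local)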

Multi-Query and Grouped-Query Attention

Variant | Q matrices | K, V matrices | Tradeoff
Multi-head (original) | 1 per head | 1 per head | Best quality, biggest memory
Multi-query | 1 per head | shared across all heads | Fastest inference, smallest cache, but a quality cost at scale
Grouped-query (Llama 2/3) | 1 per head | 1 per group of heads | Sweet spot: near multi-head quality, much faster inference
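
The main practical win is KV-cache size. A back-of-the-envelope sketch, using roughly Llama-2-70B-scale numbers (80 layers, 64 query heads, 8 KV-head groups, head dimension 128, fp16):

layers, heads, head_dim, bytes_fp16 = 80, 64, 128, 2
kv_groups = 8                            # grouped-query: 8 KV heads instead of 64
seq_len = 4096

def kv_cache_bytes(kv_heads):
    # 2 tensors (K and V) per layer, one head_dim vector per KV head per position
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_fp16

print(kv_cache_bytes(heads) / 2**30, "GiB with multi-head KV")      # ~10 GiB
print(kv_cache_bytes(kv_groups) / 2**30, "GiB with grouped-query")  # ~1.25 GiB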

Flash Attention

A re-implementation of attention that's IO-aware. It carefully orchestrates moves between GPU SRAM (fast, small) and HBM (slow, big) to skip materializing the full attention matrix. The result is significant speedups for both training and inference.
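
In Transformers you can opt into Flash Attention at load time, assuming the flash-attn package is installed and the GPU supports it:

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda", torch_dtype="auto", trust_remote_code=True,
    attn_implementation="flash_attention_2",   # use the Flash Attention kernel
)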

8.2 The Modern Transformer Block

The original Transformer block had residual connections, layer norm, attention, and an FFN.

2024-era variants like Llama 3 change a handful of things:

Aspect | Original | Modern
Norm position | Post-attention, post-FFN | Pre-attention, pre-FFN (easier training)
Norm type | LayerNorm | RMSNorm (simpler, faster)
Activation | ReLU | SwiGLU (a gated linear unit variant)
Attention | Multi-head | Grouped-query + RoPE
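
To make the table concrete, here is a compact sketch of a pre-norm block with RMSNorm and a SwiGLU feedforward, with attention reduced to a single unprojected head via PyTorch's scaled_dot_product_attention (toy dimensions, illustration only):

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps
    def forward(self, x):
        # no mean subtraction, no bias: just rescale by the root mean square
        return self.weight * x / torch.sqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)
    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))   # gated activation

class Block(nn.Module):
    def __init__(self, dim=64, hidden=256):
        super().__init__()
        self.norm1, self.norm2 = RMSNorm(dim), RMSNorm(dim)
        self.ffn = SwiGLU(dim, hidden)
    def forward(self, x):
        # pre-norm: normalize before each sublayer, add the residual after
        a = self.norm1(x)
        x = x + F.scaled_dot_product_attention(a, a, a, is_causal=True)
        return x + self.ffn(self.norm2(x))

out = Block()(torch.randn(1, 5, 64))   # [batch, seq, dim] in, same shape out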

8.3 Rotary Position Embeddings (RoPE)

Position info is essential: "the dog chased the squirrel" is not the same sentence as "the squirrel chased the dog". The original Transformer used absolute positional embeddings (token 1, token 2, and so on), either static (sinusoidal functions) or learned during training.

Why are absolute positions awkward at scale? During training, short documents are packed together into one context to avoid wasting compute on padding.

If document #2 starts at position 50 of the packed context, telling the model "this is position 50" misleads it. There's no relevant context before that within the document.

RoPE encodes positions by rotating the embedding vectors in their space, capturing both absolute and relative position. Critically, RoPE is applied inside attention (right before relevance scoring) on the queries and keys, not at the very start of the forward pass.
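
A minimal sketch of the rotation itself: pair up the dimensions of a query or key vector and rotate each pair by an angle proportional to the token's position (the 10000-based frequency schedule follows the common RoPE formulation):

import torch

def rope(x, position, base=10000):
    # x: vector of even length d; rotate consecutive pairs (x0,x1), (x2,x3), ...
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2) / d)   # one frequency per pair
    angles = position * freqs
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = torch.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(8)
print(rope(q, position=0))   # position 0: no rotation, vector unchanged
print(rope(q, position=5))   # same vector, rotated according to its position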

8.4 Other Directions

Transformer research has spread far beyond text into vision (Vision Transformers), robotics (RT-X), and time series. Many tweaks (better activations, position encodings, attention variants) are surveyed in "A Survey of Transformers."


Summary

  • A Transformer LLM generates one token per forward pass. The new token is appended to the prompt and the longer prompt is re-fed (autoregression).
  • Three top-level pieces: tokenizer → stack of Transformer blocks → LM head. The LM head outputs a probability over the vocab.
  • Decoding picks an actual token from that distribution. Greedy is always argmax (temperature=0); sampling adds randomness.
  • Tokens flow through parallel computation streams, one per position. Stream count equals context length. For generation only the last stream's output goes into the LM head, but the others are needed inside attention.
  • The KV cache stores keys and values from earlier tokens so each new token only requires fresh computation for itself, around 5x faster generation in practice.
  • A Transformer block has two parts: attention (incorporate context from other tokens) and a feedforward network (where most knowledge and computation lives).
  • Attention works in two steps. First, score each previous token's relevance via Q·K, softmax-normalized. Second, sum value vectors weighted by those scores. Multi-head attention runs many of these in parallel.
  • Efficient attention variants include sparse and local attention, multi-query and grouped-query attention (share K,V across heads or groups), and Flash Attention (a GPU-memory-aware kernel).
  • Modern blocks use pre-normalization, RMSNorm, SwiGLU activations, and rotary position embeddings (RoPE) which are applied in the attention step, not at the input.
  • All these tweaks are why a 2024 Llama 3 trains and runs significantly better than a literal 2017 Transformer of the same parameter count.