AI Engineering by Chip Huyen

Chapter 2: Understanding Foundation Models

Introduction

You don't need to know how to build a foundation model to use one, but a working mental model helps when you're choosing, adapting, or debugging it. The differences between models trace back to four design choices: training data, model architecture and size, post-training (alignment), and sampling. This chapter walks through each, finishing with the most underrated topic in modern AI: how sampling produces the probabilistic behavior, hallucinations included, that these models are known for.


Section 1: Training Data

A model is only as good as the data it was trained on. No Vietnamese in the data means no Vietnamese translation. The default approach of "use what we have, not what we want" (read: scrape Common Crawl) means models do well on tasks the web is full of, not necessarily your task.

  • Common Crawl crawls 2-3B web pages a month. Used by GPT-3, Gemini, and most others.
  • C4 (Colossal Clean Crawled Corpus) is Google's cleaned subset of Common Crawl.
  • Quality is uneven. Clickbait, misinformation, propaganda. For GPT-2, OpenAI kept only pages linked from Reddit posts with at least 3 upvotes. Better, but not great.
  • A small high-quality dataset can beat a huge low-quality one. Gunasekar et al. (2023) trained a 1.3B-parameter model on 7B tokens of high-quality coding data that beat much larger models.

1.1 Multilingual Models

English dominates Common Crawl at roughly 45.88%, which is 8× the next language (Russian, 5.97%). A lot of widely-spoken languages are low-resource:

Language       % world pop.   % in Common Crawl   Under-representation ratio
Punjabi        1.41%          0.0061%             231.56
Swahili        0.89%          0.0077%             115.26
Urdu           2.89%          0.0274%             105.38
Bengali        3.40%          0.0930%             36.56
English        18.15%         45.88%              0.40

Why Translation Doesn't Solve It

Three reasons:

  • Quality: translations lose information. Vietnamese pronouns encode the speaker's relationship to the listener; English flattens all of that to I/you.
  • Behavior: models behave differently in different languages. NewsGuard found ChatGPT-3.5 produced false claims more readily in simplified and traditional Chinese than in English.
  • Cost and latency: tokenization is wildly inefficient in some languages. Median token count for the same content on the MASSIVE dataset: English 7, Hindi 32, Burmese 72. Burmese costs roughly 10× more in API tokens.

Notable non-English models include ChatGLM and YAYI (Chinese), CroissantLLM (French), PhoGPT (Vietnamese), and Jais (Arabic).

1.2 Domain-Specific Models

General models like the GPT and Llama families do fine on domains in their training data, but not on specialized domains where the data is private or rare. Drug discovery, cancer screening, protein structures, fMRI scans.

Famous domain-specific models:

  • AlphaFold (DeepMind) for proteins, trained on around 100K known structures.
  • NVIDIA BioNeMo for biomolecular data and drug discovery.
  • Google Med-PaLM2 for medical Q&A.

Section 2: Modeling

Before training, you have to pick an architecture and a size. A 7B model is a very different beast to deploy than a 175B one, and transformer optimizations don't carry over to other architectures.

2.1 Model Architecture

The transformer architecture (Vaswani et al., 2017), built around the attention mechanism, has dominated language modeling since 2017.

Seq2Seq to Transformer

The 2014 seq2seq model used RNNs for the encoder and decoder. Two problems:

  1. The decoder generates outputs from only the final hidden state of the input. It's like writing answers using only a book's summary.
  2. RNNs are sequential. You can't parallelize over input tokens.

The transformer fixed both. The attention mechanism lets the model weigh any input token at every output step, like flipping back to any page in the book. And inputs run in parallel.

Transformer-based language model inference runs in two phases:

  • Prefill: process input tokens in parallel, build the K/V intermediate state for the first output token.
  • Decode: generate output tokens one at a time.

Attention Mechanism

Three vectors:

  • Query (Q) is the current decoder state. The person looking for info.
  • Key (K) represents a previous token. The page number.
  • Value (V) is the actual content of a previous token. The page's content.

To compute attention, dot-product Q with each K, normalize with softmax, then weighted-sum the V vectors:

Attention(Q, K, V) = softmax(QKᵀ / √d) V

These vectors come from the input x via three weight matrices: Q = xW_Q, K = xW_K, V = xW_V. For Llama 2-7B, the hidden dim is 4096, and with 32 attention heads each Q/K/V splits into 32 vectors of dim 128.

Longer sequences mean more K/V vectors to compute and store. That's why long context is hard for transformers.
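To make the mechanics concrete, here's a minimal single-head sketch in NumPy. The toy dimensions are illustrative (the per-head dim of 128 matches Llama 2-7B; the random weights stand in for learned ones):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QKᵀ / √d) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # one score per (query, key) pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of value vectors

# Toy setup: 4 tokens, per-head dim 128.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 128))
W_Q, W_K, W_V = (rng.normal(size=(128, 128)) for _ in range(3))
out = attention(x @ W_Q, x @ W_K, x @ W_V)           # shape: (4, 128)
```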

Transformer Block

A transformer model is a stack of transformer blocks, each containing:

  • An attention module with Q, K, V, and an output projection matrix.
  • An MLP module, which is just feedforward layers separated by nonlinear activations like ReLU or GELU. The activation function isn't doing anything fancy. Its job is to break linearity. ReLU(x) = max(0, x).

Then before and after the stack:

  • An embedding module with token and positional embedding matrices.
  • An output layer (also called unembedding or model head) that maps hidden vectors back to token probabilities.

Llama Dimensions

Model          # blocks   Model dim   FFN dim   Vocab   Context
Llama 2-7B     32         4,096       11,008    32K     4K
Llama 2-13B    40         5,120       13,824    32K     4K
Llama 2-70B    80         8,192       22,016    32K     4K
Llama 3-8B     32         4,096       14,336    128K    128K
Llama 3-70B    80         8,192       28,672    128K    128K
Llama 3-405B   126        16,384      53,248    128K    128K

Increasing context length costs you memory but doesn't change the parameter count.

Other Architectures

The transformer has held on since 2017, but a few alternatives are getting traction:

  • RWKV (2023) is RNN-based but parallelizable for training. In theory there's no context-length limit, but real-world long-context performance isn't guaranteed.
  • State Space Models (SSMs) look promising for long-range memory.
    • S4 made SSMs computationally efficient.
    • H3 adds an attention-like mechanism.
    • Mamba scales linearly with sequence length, hits 3B params, beats transformers of equal size, and matches transformers 2× its size on language modeling.
    • Jamba is a transformer-Mamba hybrid. 52B total / 12B active params, small memory footprint, up to 256K context.

2.2 Model Size

Three numbers describe a model's scale:

  1. Number of parameters, a proxy for learning capacity.
  2. Number of training tokens, a proxy for how much it learned.
  3. Number of FLOPs, a proxy for what training cost.

Parameters

More params usually means more capacity and better performance. Llama 3-8B (2024) outperforms Llama 2-70B (2023) on MMLU because newer-generation training is more efficient. For memory, 7B params at 2 bytes each (16 bits) is roughly 14GB of GPU memory just to do inference.
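The arithmetic is worth making routine. A back-of-envelope helper (a sketch: real serving also needs memory for activations and the KV cache, so treat this as a floor):

```python
def inference_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Floor for holding the weights alone; activations and KV cache add more."""
    return n_params * bytes_per_param / 1e9

print(inference_memory_gb(7e9, 2))  # 16-bit weights: 14.0 GB
print(inference_memory_gb(7e9, 4))  # 32-bit weights: 28.0 GB
```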

Sparse models can carry many zero-valued params. A 7B model that's 90% sparse only has 700M non-zero params. Mixture of Experts (MoE) (Shazeer et al., 2017) splits parameters across "experts" and only activates a subset per token. Mixtral 8x7B has 8 experts of around 7B params each, totaling 46.7B with sharing, but only 2 experts (around 12.9B params) are active per token. Cost and speed look like a 12.9B model.

Tokens

Dataset size is best measured in tokens, not samples or words:

  • Llama 1: 1.4T tokens. Llama 2: 2T. Llama 3: 15T.
  • RedPajama-v2: 30T tokens, around 450M books, 5,400× Wikipedia.
  • Number of training tokens = dataset size × epochs. Modern LLMs typically train for one epoch.

FLOPs and Compute

A FLOP is more standardized than counting GPUs, CPUs, or TPUs:

  • FLOP is a single floating point operation. The compute requirement for a task.
  • FLOP/s (FLOPS) is a peak machine performance number.
  • FLOP/s-day is the number of FLOPs a machine running at 1 FLOP/s completes in a day, i.e., 86,400 FLOPs. OpenAI reports compute in FLOP/s-days to disambiguate requirements from hardware speed.

Some examples:

  • PaLM-2: 10²² FLOPs.
  • GPT-3-175B: 3.14 × 10²³ FLOPs.

A single H100 running at peak delivers about 5.2 × 10¹⁸ FLOPs/day, so 256 of them need about 236 days to train GPT-3-175B. Utilization is the fraction of peak you actually hit; 50% is fine, 70%+ is great. At 70% utilization (roughly 337 days) and $2/hour per GPU, GPT-3-175B costs around $4M to train.
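The arithmetic behind those numbers, as a quick sanity check:

```python
TOTAL_FLOPS = 3.14e23    # GPT-3-175B training compute
PEAK_PER_GPU = 5.2e18    # FLOPs/day for one H100 at peak
N_GPUS, UTILIZATION, PRICE = 256, 0.70, 2.00  # price in $/GPU-hour

days_at_peak = TOTAL_FLOPS / (N_GPUS * PEAK_PER_GPU)   # ~236 days
actual_days = days_at_peak / UTILIZATION               # ~337 days
cost = actual_days * 24 * N_GPUS * PRICE               # ~$4.1M
print(f"{days_at_peak:.0f} days at peak, ${cost / 1e6:.1f}M at 70% utilization")
```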

Inverse Scaling

Bigger isn't always better. Anthropic found that more alignment training pushed models further from human preference. They expressed stronger political and religious views, and a desire to not be shut down. The 2023 Inverse Scaling Prize found larger LMs sometimes do worse on memorization and tasks with strong priors.

Scaling Law (Chinchilla)

For compute-optimal training, training tokens should be around 20× the model size. Double the params, double the tokens.

Three goals for training data: quantity, quality, diversity. The cost to reach a given accuracy keeps falling (training to 93% ImageNet accuracy cost half as much in 2021 as in 2019), but the cost of further improvement stays high: going from 90% to 95% costs much more than going from 85% to 90%. The last-mile problem again.

Scaling Extrapolation / Hyperparameter Transfer

You can only train a large model once. Scaling extrapolation studies hyperparameters on small models and extrapolates up. Microsoft and OpenAI (2022) showed transfer from 40M to 6.7B. Emergent abilities, which only appear at scale, hurt extrapolation accuracy.

Scaling Bottlenecks

Two visible walls:

  1. Training data. Internet text is running out; training datasets are growing faster than new data is created. C4 lost around 28% of its critical sources to ToS and crawl restrictions between 2023 and 2024 (45% of C4 is now restricted). The next fuel is proprietary data: books, contracts, medical records, genomes. OpenAI signed deals with Axel Springer and the AP.
  2. Electricity. Data centers consume 1-2% of global electricity, projected at 4-20% by 2030. Compute can grow at most around 50× before power shortages drive costs up.

Section 3: Post-Training

Pre-training optimizes for token-level quality (next-token prediction). Users care about response quality. A pre-trained model has two issues:

  1. It optimizes for completion, not conversation.
  2. It can spit out racist, sexist, or wrong outputs because the scrape was indiscriminate.

Post-training fixes both. Two phases:

  1. Supervised finetuning (SFT), where you finetune on instruction data.
  2. Preference finetuning, where you align outputs with human preference (RLHF, DPO, RLAIF).

Pre-training is like reading to acquire knowledge. Post-training is learning how to use it.

InstructGPT spent 98% of compute on pre-training and 2% on post-training.

3.1 Supervised Finetuning (SFT)

A pre-trained model treats "How to make pizza" as text to complete. Possible completions:

  1. Add context: "for a family of six?"
  2. Add a follow-up: "What ingredients do I need?"
  3. Give instructions (the one you actually want).

SFT trains the model on demonstration data, also called behavior cloning. You give it (prompt, response) pairs covering the tasks you care about.

Labelers need to be skilled. Around 90% of InstructGPT labelers had a college degree, and over a third had a master's. Each (prompt, response) pair takes up to 30 minutes. 13,000 pairs at $10 each is $130,000, before overhead. Cheaper alternatives include LAION volunteers (10,000 conversations), DeepMind's heuristics for dialogue-formatted text (used for Gopher), and AI-generated data (covered in Chapter 8).

3.2 Preference Finetuning

Demonstration teaches the model how to converse, not what conversations to have. Preference finetuning aligns it with human preference. The first algorithm to make this work was RLHF (Reinforcement Learning from Human Feedback):

  1. Train a reward model that scores responses.
  2. Optimize the foundation model to maximize the reward model's scores.

Newer alternatives:

  • DPO (Direct Preference Optimization). Meta moved from RLHF (Llama 2) to DPO (Llama 3) to cut complexity.
  • RLAIF (RL from AI Feedback). Possibly used by Claude.

Reward Model

Pointwise scoring is unreliable because labelers give different scores. The easier task is comparison data: (prompt, winning_response, losing_response).

Manual comparison takes 3-5 min per pair (LMSYS), at $3.50 per comparison for Llama-2 versus $25 to write a fresh response. Inter-labeler agreement at OpenAI: around 73%.

The loss for the reward model r_θ:

loss = -E_x [ log σ(r_θ(x, y_w) - r_θ(x, y_l)) ]
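As a minimal sketch of that loss (in PyTorch, with made-up scores standing in for actual reward-model outputs), training rewards the model for ranking the winning response above the losing one:

```python
import torch
import torch.nn.functional as F

def reward_loss(r_winning: torch.Tensor, r_losing: torch.Tensor) -> torch.Tensor:
    """-E[log σ(r(x, y_w) - r(x, y_l))], averaged over a batch of pairs."""
    return -F.logsigmoid(r_winning - r_losing).mean()

# Toy scores the reward model assigned to 3 (winning, losing) response pairs.
r_w = torch.tensor([1.2, 0.3, 2.0])
r_l = torch.tensor([0.4, 0.9, 1.5])
print(reward_loss(r_w, r_l))  # shrinks as winners outscore losers by more
```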

The reward model can be as strong as the foundation model, but a weaker model can still judge a stronger one: judging is easier than generating.

Finetuning with the Reward Model

Use Proximal Policy Optimization (PPO) to update the SFT model so its outputs maximize reward-model scores. Some companies (Stitch Fix, Grab) skip RL entirely: they generate N outputs and pick the one the reward model scores highest. That's the best-of-N strategy.
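A best-of-N sketch, where `generate` and `score` are hypothetical stand-ins for your model and your reward model:

```python
def best_of_n(prompt: str, generate, score, n: int = 8) -> str:
    """Sample n candidate responses; return the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: score(prompt, response))
```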


Section 4: Sampling

A model produces an output by sampling from a probability distribution. This is one of the most underrated concepts in AI. It explains hallucinations, inconsistencies, and creative outputs.

4.1 Sampling Fundamentals

To generate the next token, a language model:

  1. Computes a logit vector with one logit per vocab token.
  2. Applies softmax to get a probability distribution.
  3. Samples from that distribution.

p_i = softmax(x_i) = exp(x_i) / Σ_j exp(x_j)

Greedy sampling always picks the highest probability. Fine for classification, terrible for language models. They'd always pick "red" and never "the color of a still lake".

4.2 Sampling Strategies

Temperature

A constant T divides each logit before softmax. Higher T flattens the distribution and makes the model more creative. Lower T makes it more deterministic.

  • T=1 (default): for logits [1, 2], softmax = [0.27, 0.73].
  • T=0.5: [0.12, 0.88]. More confident.
  • T=0 is effectively argmax. Dividing logits by zero isn't valid, so implementations just pick the largest logit.
  • Common range is 0-2. A reasonable creative default is 0.7.
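A quick check of those numbers in plain Python (subtracting the max before exponentiating is the standard trick to avoid overflow):

```python
import math

def softmax_with_temperature(logits, T):
    scaled = [x / T for x in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax_with_temperature([1, 2], T=1.0))  # [0.269, 0.731]
print(softmax_with_temperature([1, 2], T=0.5))  # [0.119, 0.881]
```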

Logprobs

Logprobs (log probabilities) are how you avoid underflow when probabilities for a vocab of 100K tokens get too small to represent.

Useful for classification, app evaluation, and debugging. Many providers limit or hide logprobs because models are easier to replicate when you have them.

Top-k

Keep only the top k logits, softmax over those, and sample. Cuts softmax compute. Typical k is 50-500. Smaller k makes outputs more predictable.

Top-p (Nucleus)

Sum probabilities in descending order, stop when the cumulative sum hits p. Sample from that set. The candidate set adapts to context. Common p is 0.9-0.95.

It doesn't reduce softmax compute but keeps outputs context-appropriate. Min-p is a related strategy: minimum probability a token has to clear to be considered.
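A sketch combining both filters, operating on an already-softmaxed distribution for simplicity (in practice, top-k filters logits before the softmax, which is where its compute savings come from):

```python
import numpy as np

def top_k_top_p_filter(probs: np.ndarray, k: int = 50, p: float = 0.9) -> np.ndarray:
    """Zero out tokens outside the top-k and the top-p nucleus, then renormalize."""
    order = np.argsort(probs)[::-1]      # token ids, most to least probable
    keep = np.zeros_like(probs, dtype=bool)
    cumulative = 0.0
    for rank, token in enumerate(order):
        if rank >= k or cumulative >= p:
            break                        # nucleus is full: stop keeping tokens
        keep[token] = True
        cumulative += probs[token]
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.15, 0.04, 0.01])
print(top_k_top_p_filter(probs, k=3, p=0.9))  # keeps the top 3 (0.95 total), renormalized
```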

Stopping Conditions

A fixed max-token cap is simple but cuts off mid-sentence. Stop tokens (like end-of-sequence) keep latency and cost down, but if they fire too early they break structured formats, leaving JSON without its closing brackets.

4.3 Test Time Compute

Generate multiple responses for one query and pick the best. Better quality, more cost.

Selection Methods

  1. Highest-probability output. Product of token probabilities. Use average logprob to avoid biasing toward short sequences. OpenAI's best_of does this.
  2. Reward model or verifier scoring. Used by Stitch Fix, Grab, Nextdoor. OpenAI showed verifiers gave performance equivalent to a 30× model size increase.
  3. Application-specific heuristics. For text-to-SQL, pick the shortest valid query.
  4. Most common output (self-consistency). Pick the modal answer. Google used this for Gemini's MMLU evaluation, sampling 32 outputs per question.
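A minimal sketch of self-consistency (strategy 4), where `generate` and `extract_answer` are hypothetical stand-ins for your model and an answer parser:

```python
from collections import Counter

def self_consistency(prompt: str, generate, extract_answer, n: int = 32) -> str:
    """Sample n responses and return the most common final answer (majority vote)."""
    answers = [extract_answer(generate(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```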

Scaling Test Time Compute

DeepMind found that scaling test time compute can be more efficient than scaling parameters. OpenAI (2021) saw performance improve up to 400 samples, then degrade as adversarial outputs fooled the verifier. Stanford's "Monkey Business" (2024) saw solved problems grow log-linearly with samples, from 1 all the way to 10,000.

Test time compute also helps latency. Generate many in parallel and return the first valid one.

A model is robust if outputs don't change much when inputs vary slightly. The less robust, the more you benefit from sampling multiple outputs.

4.4 Structured Outputs

Two scenarios where structure matters:

  1. Tasks that require structured outputs, like semantic parsing (text-to-SQL, text-to-regex) and classification.
  2. Tasks whose outputs feed into downstream apps. Emails returned as {"title": ..., "body": ...}. Critical for agentic workflows (Chapter 6).

Frameworks: guidance, outlines, instructor, llama.cpp. OpenAI was first with JSON mode, which guarantees valid JSON but says nothing about content correctness.

Five approaches, in order of increasing intervention:

  1. Prompting. Simplest. Anywhere from 0 to 90+% valid outputs depending on the model. Add an AI-as-a-judge validation pass for higher accuracy at the cost of latency and money.
  2. Post-processing. Write scripts to fix common errors. LinkedIn's defensive YAML parser pushed valid YAML from 90% to 99.99%. They picked YAML over JSON because it's less verbose, so fewer tokens.
  3. Test time compute. Keep sampling until you get a valid output.
  4. Constrained sampling. At each token, filter logits to only those allowed by the grammar (see the sketch after this list). You need a grammar per format (JSON, YAML, regex, CSV), so it's less generalizable. Can add latency.
  5. Finetuning. Most reliable. For classification, attach a classifier head to the base model. Also called feature-based transfer.
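A sketch of the core constrained-sampling step, assuming something upstream (a grammar engine) has already computed which token IDs are legal next:

```python
import numpy as np

def constrained_next_token(logits: np.ndarray, allowed_ids: set[int]) -> int:
    """Mask every token the grammar forbids, then sample from what's left."""
    masked = np.full_like(logits, -np.inf)
    idx = list(allowed_ids)
    masked[idx] = logits[idx]                    # only legal tokens keep a logit
    exps = np.exp(masked - masked.max())         # masked tokens get probability 0
    probs = exps / exps.sum()
    return int(np.random.default_rng().choice(len(logits), p=probs))

# E.g., right after '{' in JSON, a grammar might only allow '"' or '}'.
logits = np.array([2.0, -1.0, 0.5, 3.0])
print(constrained_next_token(logits, allowed_ids={1, 2}))  # always returns 1 or 2
```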

4.5 The Probabilistic Nature of AI

Ask a friend the same question twice and you'll get the same answer. Ask AI twice and you'll get a probability-weighted distribution of answers. Probabilistic is not deterministic. This produces two failure modes: inconsistency and hallucination.

Inconsistency

Two scenarios:

  1. Same input gives different outputs.
  2. Slightly different input gives drastically different outputs. Capitalization can be enough.

To mitigate the first, you can cache answers and fix the sampling variables (temperature, top-p, top-k, and the seed that initializes the RNG), but even with all of those pinned, hardware differences can still change outputs. The "slightly different input" problem needs careful prompting and memory systems.
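For example, with OpenAI's Python client you can pin all of these in one call (a sketch: the model name is an assumption, parameter names vary by provider, and even a pinned seed is only best-effort):

```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: any chat model works here
    messages=[{"role": "user", "content": "Classify the sentiment: 'Great shoes!'"}],
    temperature=0,        # most deterministic sampling
    top_p=1,
    seed=42,              # best-effort reproducibility, not a hard guarantee
)
print(response.choices[0].message.content)
```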

Hallucination

The model generates facts that aren't grounded in training data. There are two leading hypotheses:

  1. Self-delusion (Ortega et al., DeepMind, 2021). The model can't tell the difference between data given to it and data it generated itself. Once it produces an off-track sentence, it conditions on that sentence as if it were fact.

Mitigations: use RL to differentiate user-provided observations from model actions, and SFT with both factual and counterfactual signals.

  2. Mismatched internal knowledge (Leo Gao, OpenAI). During SFT, the model learns to mimic responses written by labelers. If the labelers know things the model doesn't, you're effectively training the model to make stuff up.

    John Schulman's mitigations: verification (force the model to retrieve sources for each response) and a better reward function that explicitly punishes made-up content.

Interestingly, InstructGPT showed RLHF made hallucinations worse than SFT alone. But it improved everything else, and labelers preferred RLHF overall.

For practical prompt-side mitigations: ask for short responses (fewer tokens in which to fabricate) and explicitly instruct "if you're unsure, say 'I don't know'". Detection is hard. Chapter 4 goes deeper.


Summary

  • A model's training data drives its capabilities and biases. English dominates Common Crawl. Low-resource languages produce worse, slower, more expensive models. Specialized domains need curated data.
  • The transformer architecture with attention still rules language modeling. Inference has two phases: parallel prefill and sequential decode. Mamba, Jamba, and RWKV are real alternatives that target long-context limits.
  • Model size breaks down into parameters, training tokens, and FLOPs. The Chinchilla scaling law says training tokens should be roughly 20× the parameter count for compute-optimal training. The two visible bottlenecks are training data and electricity.
  • Post-training is SFT (behavior cloning on demonstration data) plus preference finetuning (RLHF, DPO, RLAIF) to align with human preference.
  • Sampling (temperature, top-k, top-p, logprobs, test time compute, structured outputs) is what makes AI probabilistic. It enables creativity and causes inconsistency and hallucination. Two leading hallucination hypotheses: self-delusion and mismatched internal knowledge.