Chapter 3: Evaluation Methodology
Introduction
The more AI gets used, the more chances there are for catastrophic failure. Without solid evaluation, the risks can outweigh the benefits. There's no shortage of cautionary tales: a chatbot encouraging suicide, lawyers submitting AI-hallucinated case citations, Air Canada being held liable when its bot misled a customer. As Greg Brockman put it, "evals are surprisingly often all you need." Yet evaluation is the part of the AI stack with the least tooling, the least investment, and the least rigor. This chapter covers the methods used to evaluate open-ended foundation models (language modeling metrics, exact evaluation, AI as a judge, comparative ranking) and where each one falls short.
Section 1: Challenges of Evaluating Foundation Models
Why is foundation-model evaluation harder than evaluating traditional ML?
- Smarter models are harder to evaluate. Most people can grade a first grader's math. Few can grade PhD-level math. Bad summaries are obvious. Spotting good summaries requires reading the source.
- Open-ended outputs break ground-truth comparison. Classification has a fixed label set. Open-ended generation has infinite valid answers.
- Most foundation models are black boxes. Architecture, training data, training process: increasingly hidden. With nothing else to look at, you only get to judge outputs.
- Public benchmarks saturate fast. GLUE (2018) became SuperGLUE (2019). NaturalInstructions (2021) became Super-NaturalInstructions (2022). MMLU (2020) became MMLU-Pro (2024).
- Scope expands for general-purpose models. Evaluation has to cover known tasks and discover capabilities you didn't know existed (some past human ability).
A lot of teams "eyeball" results or do "vibe checks". In a16z's survey of 70 enterprise decision makers, 6 said they picked their model based on word of mouth. If you want to iterate, you need something more systematic.
Section 2: Understanding Language Modeling Metrics
Most foundation models are language model based, and LM perplexity correlates with downstream performance. Four closely related metrics show up over and over: cross entropy, perplexity, bits-per-character (BPC), and bits-per-byte (BPB).
2.1 Entropy
Entropy measures how much information a token carries on average. The flip side: how predictable the language is.
- 2-token language (upper / lower): entropy = 1 bit.
- 4-token language (UL/UR/LL/LR): entropy = 2 bits. More info per token but more bits to represent.
Lower entropy means the language is more predictable.
2.2 Cross Entropy
A model's cross entropy on a dataset measures how hard it is for the model to predict the next token in that dataset. With:
P is the true distribution of the training data. Q is the distribution learned by the model.
H(P, Q) = H(P) + D_KL(P || Q)
Cross entropy is not symmetric. H(P, Q) ≠ H(Q, P). If the model perfectly learns the data, KL divergence goes to 0 and cross entropy equals data entropy.
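As a concrete illustration (not from the book), here is a minimal Python sketch computing entropy, cross entropy, and their KL-divergence gap for a toy 4-token language:

```python
import math

def entropy(p):
    """H(P): average bits per token under the true distribution P."""
    return -sum(pi * math.log2(pi) for pi in p.values() if pi > 0)

def cross_entropy(p, q):
    """H(P, Q): average bits the model Q spends encoding data drawn from P."""
    return -sum(pi * math.log2(q[tok]) for tok, pi in p.items() if pi > 0)

# Toy 4-token language: true distribution P vs. a model's learned distribution Q.
P = {"UL": 0.25, "UR": 0.25, "LL": 0.25, "LR": 0.25}
Q = {"UL": 0.40, "UR": 0.30, "LL": 0.20, "LR": 0.10}

h_p = entropy(P)             # 2.0 bits
h_pq = cross_entropy(P, Q)   # > 2.0 bits, since Q != P
print(f"H(P)={h_p:.3f}  H(P,Q)={h_pq:.3f}  KL={h_pq - h_p:.3f}")
```

If Q matched P exactly, the KL term would be 0 and cross entropy would equal the data's entropy.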
2.3 Bits-per-Character and Bits-per-Byte
- BPC is bits per token divided by average characters per token. 6 bits/token at 2 chars/token gives BPC = 3.
- BPB is bits per token divided by average bytes per token. Standardizes across encodings (ASCII = 7 bits/char, UTF-8 = 8-32 bits/char).
If BPB = 3.43, the model can compress the original 8-bit bytes down to 3.43 bits. Less than half the size.
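The conversions are simple arithmetic. A quick sketch, assuming you already know the model's cross entropy in bits per token and your tokenizer's average characters and bytes per token (the numbers below are illustrative):

```python
def bits_per_character(bits_per_token: float, chars_per_token: float) -> float:
    # BPC = bits/token divided by average characters/token.
    return bits_per_token / chars_per_token

def bits_per_byte(bits_per_token: float, bytes_per_token: float) -> float:
    # BPB = bits/token divided by average bytes/token.
    return bits_per_token / bytes_per_token

print(bits_per_character(6.0, 2.0))   # 3.0, matching the example above
print(bits_per_byte(6.86, 2.0))       # ~3.43: compresses 8-bit bytes to ~3.43 bits
```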
2.4 Perplexity
Perplexity (PPL) is the exponential of cross entropy:
PPL(P, Q) = 2^H(P, Q) (in bits)
PPL(P, Q) = e^H(P, Q) (in nats)
The intuition: PPL is the number of options the model thinks it's choosing between when predicting the next token. PPL=4 means picking among 4 equally probable options.
Interpretation
- More structured data means lower PPL. HTML is more predictable than prose.
- Bigger vocabulary means higher PPL. War and Peace > children's book.
- Longer context means lower PPL. Modern models compute PPL over 500 to 10,000+ previous tokens.
OpenAI's GPT-2 report shows larger models give consistently lower PPL across datasets. One catch: post-training (SFT, RLHF) usually raises PPL, because teaching task completion can hurt next-token prediction. Quantization can move PPL in surprising directions too.
Use Cases
- Detecting data contamination. Low PPL on a benchmark probably means it leaked into training.
- Deduplication. Only add new training data if it has high PPL.
- Detecting abnormal text or gibberish. Extreme PPL flags weirdness.
Computing Perplexity
PPL = ( ∏ 1/P(x_i | x_1, ..., x_{i-1}) )^(1/n)
You need per-token probabilities (or logprobs). Many commercial APIs don't expose them.
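A minimal sketch of the computation, assuming you can get per-token logprobs (natural log) from the model; the exponentiation base just has to match the log base:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity from per-token log probabilities (natural log).

    PPL = exp(-(1/n) * sum(log P(x_i | x_1..x_{i-1}))),
    i.e. the exponential of the average cross entropy per token.
    """
    n = len(token_logprobs)
    avg_neg_logprob = -sum(token_logprobs) / n
    return math.exp(avg_neg_logprob)

# A model assigning probability 0.25 to each of 4 tokens has PPL = 4.
print(perplexity([math.log(0.25)] * 4))  # 4.0
```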
Section 3: Exact Evaluation
Exact evaluation gives unambiguous judgments (multiple-choice correctness). Subjective evaluation depends on the grader (essay scoring). Two exact approaches: functional correctness and similarity to reference data.
3.1 Functional Correctness
Does the system actually do the thing you wanted it to do?
For code generation, run the code against unit tests (also called execution accuracy). Used by HumanEval, MBPP, Spider, BIRD-SQL, WikiSQL. Each problem ships with test cases (assert statements). The score is pass@k: for each problem, generate k samples, mark it solved if any sample passes all tests, then compute the fraction solved. pass@1 < pass@3 < pass@10 in expectation. For game bots like Tetris, the score is the score. Optimization tasks like AI scheduling for energy savings are also measurable.
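A sketch of the simple version described above: generate k samples per problem and count a problem as solved if any sample passes its tests. The `generate` and `run_tests` functions are hypothetical stand-ins for your sampling and test-execution code.

```python
def pass_at_k(problems, generate, run_tests, k: int) -> float:
    """Fraction of problems with at least one passing sample out of k.

    generate(problem) returns one candidate solution (hypothetical);
    run_tests(problem, solution) returns True if all assert-style tests pass.
    """
    solved = 0
    for problem in problems:
        if any(run_tests(problem, generate(problem)) for _ in range(k)):
            solved += 1
    return solved / len(problems)
```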
The hard part is that AI often only does part of the solution, and evaluating the intermediate steps can be harder than evaluating the final outcome.
3.2 Similarity Measurements Against Reference Data
Each evaluation example is a pair (input, reference responses); metrics that need these references are reference-based, while metrics that don't are reference-free. Generated responses that look more similar to the references score higher. Four ways to measure similarity:
- Ask an evaluator (human or AI).
- Exact match, which is binary and works for short factual answers.
- Lexical similarity, a sliding scale of token overlap.
- Semantic similarity, a sliding scale of embedding-based meaning.
Beyond evaluation, similarity also shows up in retrieval/search, ranking, clustering, anomaly detection, and dedup.
Exact Match
Works for short, exact-answer queries:
- "What's 2 + 3?" →
5 - "Who was the first woman to win a Nobel Prize?"
Variations include "contains the reference response". Watch out: for "What year was Anne Frank born?", an answer of "September 12, 1929" contains 1929 but gets the month wrong (she was born June 12, 1929). For translation and other open-ended tasks, exact match is mostly useless.
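A minimal sketch of exact match and the looser "contains" variant, illustrating the pitfall noted above:

```python
def exact_match(generated: str, reference: str) -> bool:
    return generated.strip().lower() == reference.strip().lower()

def contains_match(generated: str, reference: str) -> bool:
    # Looser variant: the reference appears anywhere in the generated answer.
    return reference.strip().lower() in generated.lower()

print(exact_match("5", "5"))  # True
# The "contains" variant can pass answers that are partially wrong:
print(contains_match("Anne Frank was born on September 12, 1929", "1929"))  # True, even though the month is wrong
```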
Lexical Similarity
How much do two texts overlap? Two flavors:
- Approximate string matching (fuzzy matching) counts edit operations: deletion, insertion, substitution, sometimes transposition. "brad" → "bad" (1 edit), "bad" → "bard" (1 edit), "bad" → "bed" (1 edit).
- N-gram similarity is overlap of token sequences. "My cats scare the mice" has bigrams: "my cats", "cats scare", "scare the", "the mice".
Common metrics: BLEU, ROUGE, METEOR, TER, CIDEr. Used by WMT, COCO Captions, GEMv2.
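A minimal sketch of both flavors, edit distance and bigram overlap. This is for intuition only, not an implementation of BLEU or any other named metric:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def bigram_overlap(candidate: str, reference: str) -> float:
    """Fraction of the reference's bigrams that also appear in the candidate."""
    def bigrams(text):
        toks = text.lower().split()
        return set(zip(toks, toks[1:]))
    ref_bi = bigrams(reference)
    return len(bigrams(candidate) & ref_bi) / len(ref_bi) if ref_bi else 0.0

print(levenshtein("brad", "bad"))  # 1
print(bigram_overlap("My cats scare the mice", "The cats scare the mice"))  # 0.75
```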
The downsides:
- You need a comprehensive reference set, and correct answers can be missing (Adept's Fuyu hit this).
- References can themselves be wrong (WMT 2023 reported a lot of bad reference translations).
- Higher BLEU isn't always better. OpenAI found BLEU scores were similar for correct and incorrect HumanEval solutions.
Semantic Similarity
Lexical similarity misses meaning. "What's up?" and "How are you?" share no words but mean the same thing. "Let's eat, grandma" and "Let's eat grandma" share almost everything but mean very different things.
Convert text to an embedding (vector), then compare with cosine similarity:
cos_sim(A, B) = (A · B) / (||A|| · ||B||)
- 1 = identical, 0 = orthogonal, -1 = opposite.
- Metrics: BERTScore, MoverScore.
- You don't need an exhaustive reference set, but the quality leans hard on the embedding model. Compute can be nontrivial.
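A minimal cosine-similarity sketch with numpy; the vectors here are placeholders standing in for real embeddings produced by an embedding model:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos_sim(A, B) = (A . B) / (||A|| * ||B||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder vectors standing in for embeddings of two sentences.
emb_a = np.array([0.11, 0.02, 0.54])
emb_b = np.array([0.10, 0.05, 0.50])
print(cosine_similarity(emb_a, emb_b))  # close to 1: the vectors point the same way
```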
Section 4: Introduction to Embedding
An embedding is a vector that tries to capture the meaning of input data. "the cat sits on a mat" might map to [0.11, 0.02, 0.54]. Real embeddings are 100-10,000 dimensions.
4.1 Embedding Models and Sizes
| Model | Embedding size |
|---|---|
| BERT base / large | 768 / 1024 |
| OpenAI CLIP | image=512, text=512 |
| OpenAI text-embedding-3-small / large | 1536 / 3072 |
| Cohere Embed v3 (english) | 1024 / 384 (light) |
Specialized embedding models include BERT, CLIP, and Sentence Transformers.
4.2 Quality of Embeddings
A good embedding algorithm puts more-similar texts closer together (by cosine similarity). For example, "the cat sits on a mat" should be closer to "the dog plays on the grass" than to "AI research is super fun".
The benchmark to know: MTEB (Massive Text Embedding Benchmark).
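A quick sanity check of the claim above, sketched with the sentence-transformers library; the model name `all-MiniLM-L6-v2` is an illustrative choice, and any embedding model could be swapped in:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small open-source embedding model
sentences = [
    "the cat sits on a mat",
    "the dog plays on the grass",
    "AI research is super fun",
]
embeddings = model.encode(sentences)

# A good embedding model should score the first pair as more similar.
print(util.cos_sim(embeddings[0], embeddings[1]))  # cat / dog sentences
print(util.cos_sim(embeddings[0], embeddings[2]))  # cat / AI-research sentences
```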
4.3 Joint / Multimodal Embeddings
- CLIP does joint text-image embeddings.
- ULIP adds 3D point clouds on top of text and images.
- ImageBind covers six modalities (images, text, audio, depth, thermal, and IMU data).
A multimodal embedding space lets text-based image search and cross-modality similarity work.
Section 5: AI as a Judge
Using AI to evaluate AI is AI as a judge (or LLM as a judge). The model doing the judging is the AI judge. As of 2023-24, this is one of the most common evaluation methods in production. LangChain's State of AI 2023 says 58% of evaluations on their platform are by AI judges.
5.1 Why AI as a Judge?
- Fast, cheap, easy versus paying humans.
- Reference-free, so it works in production.
- Can judge any criterion: correctness, toxicity, hallucination, role consistency, you name it.
- Strong correlation with humans. GPT-4 versus humans on MT-Bench: 85% agreement (humans agree with each other 81%). AlpacaEval correlates 0.98 with LMSYS Chatbot Arena.
- Explainable. Judges can give rationale.
5.2 How to Use AI as a Judge
Three usage patterns:
- Quality of a single response:
  "Given the following question and answer, evaluate how good the answer is for the question. Score from 1 (very bad) to 5 (very good). Question: [QUESTION] Answer: [ANSWER] Score:"
- Compare to a reference:
  "Given the following question, reference answer, and generated answer, evaluate whether the generated answer is the same as the reference answer. Output True or False."
- Compare two responses (for preference data, test-time compute, or ranking):
  "Given the following question and two answers, evaluate which answer is better. Output A or B."
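A minimal sketch of the single-response pattern, assuming the OpenAI Python SDK; the model name, prompt wording, and sampling settings are illustrative, and together they are what define this particular judge:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """Given the following question and answer, evaluate how good \
the answer is for the question. Score from 1 (very bad) to 5 (very good). \
Output only the score.

Question: {question}
Answer: {answer}
Score:"""

def judge_response(question: str, answer: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,        # low temperature for more consistent scores
        seed=42,              # pin the seed where the API supports it
    )
    return response.choices[0].message.content

print(judge_response("What's 2 + 3?", "The answer is 6."))
```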
Built-in Criteria (varies by tool)
| Tool | Built-in criteria |
|---|---|
| Azure AI Studio | Groundedness, relevance, coherence, fluency, similarity |
| MLflow.metrics | Faithfulness, relevance |
| LangChain Criteria Evaluation | Conciseness, relevance, correctness, coherence, harmfulness, maliciousness, helpfulness, controversiality, misogyny, insensitivity, criminality |
| Ragas | Faithfulness, answer relevance |
Criteria don't carry across tools. Azure's "relevance" isn't MLflow's "relevance".
Prompting an AI Judge
The prompt has to be clear about:
- Task, what to evaluate.
- Criteria, with detailed instructions.
- Scoring system:
- Classification (good/bad, relevant/irrelevant/neutral) works best.
- Discrete numerical (1-5) is better than continuous, and narrower ranges work better than wide ones.
- Continuous numerical (0-1).
Including examples in the prompt helps.
An AI judge isn't just a model. It's a system: model + prompt + sampling parameters. Change any one of those and you've got a different judge.
5.3 Limitations of AI as a Judge
Inconsistency
Same input, different scores. Same prompt run twice, different scores. To mitigate, set sampling variables (temperature, seed, top-p, top-k). Including examples in the prompt pushed GPT-4 consistency from 65% to 77.5% (Zheng et al., 2023), but examples cost tokens, sometimes quadrupling GPT-4 spend. And remember that high consistency isn't accuracy. The judge can be consistently wrong.
Criteria Ambiguity
| Tool | Faithfulness scoring |
|---|---|
| MLflow | 1-5 |
| Ragas | 0/1 |
| LlamaIndex | YES/NO |
Scores aren't comparable. Worse, if the judge changes (model or prompt), historical comparisons go out the window. Don't trust an AI judge whose model and prompt you can't see.
Increased Cost and Latency
Judging adds API calls. Multiple criteria multiply costs. To soften this, use weaker judges, spot-check a subset, or run judges asynchronously. In-line judges add latency for production guardrails.
Biases
- Self-bias. Models favor their own outputs. GPT-4 gives itself +10% win rate. Claude-v1 gives itself +25%.
- First-position bias. Pairwise comparisons favor whatever option comes first. Mitigate by repeating with the order swapped.
- Recency bias (humans). Favor the last option seen. Opposite of first-position bias.
- Verbosity bias. Longer answers win even when they're wrong. Wu and Aji (2023): GPT-4 and Claude-1 prefer 100-word incorrect responses to 50-word correct ones. Saito et al. found that with large length differences, the judge almost always picks the longer one.
5.4 What Models Can Act as Judges?
Three configurations: stronger, weaker, or the same as the judged model.
Stronger judge is the natural pick. Use a cheap model to generate, and have GPT-4 evaluate a sample.
Self-evaluation / self-critique / self-ask carries a self-bias risk but works for sanity checks. You can prompt the model to revise its own answer:
- Prompt: "What's 10+3?" → First: "30" → Self-critique: "Is this correct?" → Final: "No, the correct answer is 13."
Weaker judge can work if judging is easier than generating ("anyone can judge a song"). Zheng et al. (2023) found stronger judges correlate better with human preference for general-purpose tasks.
Specialized Judges
- Reward model. Takes (prompt, response) and outputs a score. Used in RLHF for years. Cappy (Google) is a 360M-parameter model that outputs a score from 0 to 1.
- Reference-based judge. Takes (generated response, reference response) and outputs a similarity or quality score. BLEURT (Sellam et al., 2020) scores roughly -2.5 to 1.0 (a great example of how arbitrary score ranges are). Prometheus takes (prompt, response, reference, rubric) and outputs a score from 1 to 5.
- Preference model. Takes (prompt, response 1, response 2) and outputs which response is preferred. Examples: PandaLM, JudgeLM.
Section 6: Ranking Models with Comparative Evaluation
Often you don't care about absolute scores. You just want a ranking. Two approaches:
- Pointwise evaluation. Score each model independently, then rank by score.
- Comparative evaluation. Head-to-head, rank by win rate.
For subjective quality, comparative is easier. It's easier to say which song is better than to score one.
6.1 The Approach
Anthropic pioneered comparative ranking of models in 2021, and it now powers LMSYS Chatbot Arena.
- For each request, two or more models respond and an evaluator picks the winner. Ties allowed.
- Each comparison is a match.
- Win rate of A over B is the fraction of A vs. B matches A wins.
- A ranking is correct if higher-ranked models tend to beat lower-ranked ones.
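A small sketch of computing pairwise win rates from match records; the (model_a, model_b, winner) tuple format is an assumption for illustration:

```python
from collections import defaultdict

# Each match: (model_a, model_b, winner), where winner is "a", "b", or "tie".
matches = [
    ("model-x", "model-y", "a"),
    ("model-x", "model-y", "b"),
    ("model-x", "model-y", "a"),
    ("model-y", "model-z", "a"),
]

wins = defaultdict(int)
totals = defaultdict(int)
for model_a, model_b, winner in matches:
    pair = tuple(sorted([model_a, model_b]))
    totals[pair] += 1
    if winner != "tie":
        wins[(model_a if winner == "a" else model_b, pair)] += 1

for pair, n in totals.items():
    first, second = pair
    print(f"{first} beats {second} in {wins[(first, pair)] / n:.0%} of their matches")
```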
Not all questions should be answered by preference. Factual questions need correctness ("Is there a link between cell phone radiation and brain tumors?"). Preference voting only works when the voters know the subject.
Comparative evaluation isn't A/B testing. A/B testing shows one output at a time per user. Comparative shows multiple side by side.
Rating Algorithms
Computed from comparative signals: Elo, Bradley-Terry, TrueSkill (lifted from sports and games). LMSYS started with Elo and switched to Bradley-Terry because Elo is sensitive to evaluator and prompt order.
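A minimal Elo update sketch (not LMSYS's actual implementation), showing how a stream of match outcomes turns into ratings and why order sensitivity matters:

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """One Elo update. score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

ratings = {"model-x": 1000.0, "model-y": 1000.0}
# The same matches processed in a different order give different final ratings,
# which is one reason to prefer Bradley-Terry for leaderboards.
for winner, loser in [("model-x", "model-y"), ("model-x", "model-y"), ("model-y", "model-x")]:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser], 1.0)
print(ratings)
```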
6.2 Challenges of Comparative Evaluation
Scalability Bottlenecks
The number of pairs grows quadratically. LMSYS in Jan 2024: 57 models, 244,000 comparisons, only around 153 per pair. The transitivity assumption (if A beats B and B beats C, then A beats C) is used to skip comparisons, but human preference is not necessarily transitive. New models have to be evaluated against existing ones, which shifts the ranking. Private models are hard to evaluate without standing up a private leaderboard. Smart matching algorithms that prioritize the matches that reduce ranking uncertainty help.
Lack of Standardization / Quality Control
LMSYS-style crowdsourcing captures broad signals but has issues:
- No fact-checking enforcement, so users may prefer fluent-but-wrong responses.
- Different users prefer different tones, which pollutes the ranking.
- Test prompts don't reflect production usage. Of 33,000 LMSYS prompts in 2023, 180 were "hello" or "hi" (0.55%), and brainteasers showed up dozens of times.
- Public leaderboards rarely model RAG or sophisticated context.
- Ways out: predefined prompts, filtered hard prompts, trained evaluators (Scale's private leaderboard), or in-product comparison (still noisy).
From Comparative to Absolute Performance
Knowing B beats A doesn't tell you whether B is good in absolute terms, how much better B is, or whether a 1% win-rate change matters for your application. For cost-benefit decisions you still need absolute metrics.
6.3 The Future of Comparative Evaluation
Even with the limits, comparative evaluation has staying power. It's easier than absolute scoring as models get past human ability. It captures human preference directly, which is the quality we actually care about. It doesn't saturate the way benchmarks do. And it's hard to game (no "training on the test set" trick).
Comparative evaluation complements benchmarks (offline) and A/B testing (online).
Summary
- Foundation models are harder to evaluate than traditional ML because of intelligence, open-endedness, opacity, fast-saturating benchmarks, and expanding scope.
- Language modeling metrics (entropy, cross entropy, BPC, BPB, perplexity) are useful proxies. Lower PPL means better next-token prediction. Useful for contamination detection, dedup, and anomaly flagging. Watch out: post-training and quantization can move PPL.
- Exact evaluation: functional correctness (pass@k, code execution) and similarity to reference (exact match, lexical/n-gram, semantic via embeddings).
- Embeddings capture meaning. Joint multimodal embedding spaces (CLIP, ULIP, ImageBind) extend to images, audio, and 3D.
- AI as a judge is fast, flexible, cheap, and well-correlated with humans, but it's inconsistent, criteria are ambiguous, it has biases (self, position, verbosity), and it adds cost and latency. Use specialized judges (reward, reference-based, preference) where you can.
- Comparative evaluation ranks models by head-to-head win rates using algorithms from sports (Elo, Bradley-Terry, TrueSkill). Powerful, but with scalability issues, no standardization, and a gap between relative and absolute performance.
- Combine multiple methods. The next chapter takes these and builds a real evaluation pipeline.