AI Engineering by Chip Huyen

Chapter 4: Evaluate AI Systems

Introduction

A model is only useful if it works for your use case. Chapter 3 covered evaluation methods. This chapter is about putting them to work for your application. Three parts:

  1. Evaluation criteria. Domain-specific capability, generation capability, instruction-following, cost & latency, and how to measure each.
  2. Model selection. Build vs. buy (host vs. API), how to read public benchmarks and leaderboards, how to deal with contamination.
  3. Designing an evaluation pipeline. Turn vs. task evaluation, scoring rubrics, tying to business metrics, slicing data, evaluating the pipeline itself.

Section 1: Evaluation Criteria

"Which is worse, an application that has never been deployed, or one deployed but no one knows whether it's working?" Most engineers vote for the second.

Evaluation-driven development is the AI analogue of test-driven development: define how you'll evaluate before you build. The popular enterprise AI use cases are popular precisely because they come with obvious evaluation criteria:

  • Recommender systems: purchase or engagement lift, validated with A/B testing.
  • Fraud detection: money saved.
  • Coding: functional correctness.
  • Classification: close-ended, easy to grade.

Criteria fall into four buckets: domain-specific capability, generation capability, instruction-following capability, and cost & latency.

1.1 Domain-Specific Capability

A model can't do what its training data didn't prepare it for. Evaluate with domain-specific benchmarks (public or private).

Coding

Use functional correctness (execution accuracy). For SQL, efficiency matters too: BIRD-SQL compares the generated query's runtime against the ground-truth query's runtime. Code readability is hard to measure automatically, so AI judges are the usual fallback.
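As a concrete sketch of execution-based scoring (roughly what pass@1 measures): run each generated solution against its test cases and count the fraction that exits cleanly. The `problems` data and the lack of sandboxing are simplifications for illustration; real harnesses isolate untrusted code.

```python
import subprocess
import tempfile

def passes_tests(solution_code: str, test_code: str, timeout_s: float = 5.0) -> bool:
    """Run a candidate solution plus its asserts in a subprocess; pass = exit code 0.
    NOTE: no sandboxing here -- real harnesses isolate untrusted code far more carefully."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# Hypothetical benchmark items: (model-generated code, associated test asserts).
problems = [
    ("def add(a, b):\n    return a + b", "assert add(2, 3) == 5"),
]
score = sum(passes_tests(sol, tests) for sol, tests in problems) / len(problems)
print(f"execution accuracy: {score:.2%}")
```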

Non-Coding (Multiple Choice)

Most public benchmarks default to multiple-choice questions (MCQs) because they're easier to verify, reproduce, and compare to a random baseline. In April 2024, 75% of tasks in EleutherAI's lm-evaluation-harness were MCQs (MMLU, AGIEval, ARC-C).

Example MMLU question:

Question: One of the reasons that the government discourages and regulates monopolies is that
(A) Producer surplus is lost and consumer surplus is gained.
(B) Monopoly prices ensure productive efficiency but cost society allocative efficiency.
(C) Monopoly firms do not engage in significant research and development.
(D) Consumer surplus is lost with higher prices and lower levels of output.
Label: (D)

Metrics: accuracy, F1, precision, recall, plus point systems for questions with multiple correct options.

The drawback is that MCQ performance can shift with trivial changes like an extra space or a "Choices:" prefix. MCQs test discrimination, not generation.
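A minimal scoring sketch for a graded MCQ run, assuming scikit-learn is available; the answer labels below are made up.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical graded run: gold answer letters vs. the option the model picked.
gold = ["D", "A", "C", "B", "D", "A"]
pred = ["D", "A", "B", "B", "D", "C"]

print("accuracy:", accuracy_score(gold, pred))
# Macro-F1 weights every answer option equally, useful when options are imbalanced.
print("macro F1:", round(f1_score(gold, pred, average="macro"), 3))
# Sanity check against the random-guess baseline: 25% accuracy for 4-option questions.
```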

1.2 Generation Capability

NLG (natural language generation) metrics from the 2010s have been adapted:

  • Fluency: grammatical correctness, naturalness. Mostly solved at this point.
  • Coherence: logical structure.
  • Faithfulness (translation), relevance (summarization).

The modern emphasis is on factual consistency and safety.

Factual Consistency

Two settings:

  • Local. Output is consistent with a given context. Used in summarization, customer support, RAG.
  • Global. Output is consistent with open-world knowledge. Chatbots, fact-checking.

Verifying without context is much harder. You have to search for and pick reliable sources, and you'll often run into the absence-of-evidence problem: failing to find a source doesn't prove a claim false.

Models hallucinate more on (1) niche knowledge and (2) queries about things that don't exist (e.g., "What did X say about Y?" when X never said anything about Y).

The methods:

  • AI as a judge. GPT-3.5/GPT-4 outperform earlier methods. GPT-judge predicts human truthfulness with 90-96% accuracy on TruthfulQA.
  • Self-verification (SelfCheckGPT). Generate N additional responses and check consistency. Effective but expensive.
  • Knowledge-augmented verification. DeepMind's SAFE: decompose response, make statements self-contained, search Google, check consistency.
  • Textual entailment (NLI). Classify a (premise, hypothesis) pair as entailment, contradiction, or neutral. Specialized scorer: DeBERTa-v3-base-mnli-fever-anli (184M params, trained on 764K annotated pairs).
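A sketch of the textual-entailment route with the Hugging Face transformers library. The exact hub id for the DeBERTa-v3 checkpoint is an assumption (MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli is used here), and the label order is read from the model config rather than hard-coded.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

# Premise = the given context; hypothesis = a statement extracted from the model's output.
premise = "The invoice was paid on March 3 and the account is now in good standing."
hypothesis = "The customer's account has an outstanding balance."

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(dim=-1)[0]

# Read the label order from the config instead of assuming it.
for idx, p in enumerate(probs):
    print(model.config.id2label[idx], round(p.item(), 3))
```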

Safety

Categories of unsafe content:

  1. Inappropriate language. Profanity, explicit content.
  2. Harmful tutorials. "How to rob a bank."
  3. Hate speech. Racism, sexism, homophobia.
  4. Violence. Threats, graphic detail.
  5. Stereotypes. For example, always generating female names when asked for a nurse's name.
  6. Political and religious bias. Feng et al. (2023) found GPT-4 leans left-libertarian and Llama leans authoritarian.

Tools: Facebook's hate speech classifier, the Skolkovo Institute's toxicity classifier, and Perspective API. Benchmarks: RealToxicityPrompts (100K naturally occurring prompts that tend to elicit toxic completions) and BOLD.

1.3 Instruction-Following Capability

How well does the model do what you asked? More powerful models (GPT-4 > GPT-3.5, Claude-v2 > v1) tend to be better at this.

If you tell a model to output POSITIVE/NEGATIVE/NEUTRAL and it returns HAPPY/ANGRY, it has the domain capability but fails instruction-following.

Benchmarks

IFEval (Google, Zhou et al., 2023) covers 25 types of instructions you can verify automatically:

  • Keywords: include / forbidden / frequency / letter frequency
  • Language: response in {language}
  • Length: N paragraphs / words / sentences
  • Detectable content: postscript, placeholders
  • Detectable format: bullets, title, sections, JSON
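These instruction types are attractive precisely because a few lines of code can verify them. A sketch of IFEval-style checkers; the example response and constraints are made up.

```python
import json

def contains_keyword(response: str, keyword: str) -> bool:
    return keyword.lower() in response.lower()

def within_word_limit(response: str, max_words: int) -> bool:
    return len(response.split()) <= max_words

def has_n_bullets(response: str, n: int) -> bool:
    return sum(line.lstrip().startswith(("-", "*", "•")) for line in response.splitlines()) == n

def is_valid_json(response: str) -> bool:
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

response = '{"summary": "Refund issued", "sentiment": "positive"}'
checks = {
    "valid JSON": is_valid_json(response),
    "mentions refund": contains_keyword(response, "refund"),
    "under 50 words": within_word_limit(response, 50),
}
print(checks)  # each constraint is verified programmatically, no judge needed
```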

INFOBench (Qin et al., 2024) goes broader: format, content, linguistic, and style. Each instruction breaks down into yes/no questions:

  • "Make a questionnaire to help hotel guests write reviews" gives:
    1. Is it a questionnaire?
    2. Is it for hotel guests?
    3. Is it helpful for writing reviews?

GPT-4 turned out to be a reasonable, cost-effective evaluator (better than Mechanical Turk annotators).

Build your own instruction-following benchmark from the actual instructions your application uses.

Roleplaying

Roleplaying means asking a model to take on a persona. Two purposes:

  1. A character users interact with (gaming, AI companions).
  2. A prompt-engineering trick to improve outputs.

Roleplaying is the 8th most common instruction type in LMSYS's million-conversation analysis. Benchmarks: RoleLLM, CharacterEval. Evaluate both style and knowledge, including negative knowledge: a model roleplaying Jackie Chan shouldn't speak Vietnamese, because Jackie Chan doesn't. Game NPCs need to avoid spoilers.

1.4 Cost and Latency

Pareto optimization: trade off model quality, latency, and cost. Be explicit about the dimensions you can't compromise on.

Latency metrics (see the timing sketch below):

  • TTFT: time to first token.
  • TPOT: time per output token.
  • Also useful: time between tokens and total time per query.
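A minimal timing sketch: wrap any streaming token iterator and record when the first token arrives and how quickly the rest follow. The `stream` argument is a placeholder for whatever streaming client your stack uses.

```python
import time

def measure_latency(stream):
    """Measure TTFT and TPOT for any iterable that yields output tokens."""
    start = time.perf_counter()
    first_token_time = None
    n_tokens = 0
    for _token in stream:
        now = time.perf_counter()
        if first_token_time is None:
            first_token_time = now
        n_tokens += 1
    end = time.perf_counter()

    if first_token_time is None:  # no tokens came back
        return {"ttft_s": None, "tpot_s": None, "total_s": end - start, "tokens": 0}
    ttft = first_token_time - start
    # TPOT averages over the tokens that follow the first one.
    tpot = (end - first_token_time) / max(n_tokens - 1, 1)
    return {"ttft_s": ttft, "tpot_s": tpot, "total_s": end - start, "tokens": n_tokens}
```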

The levers:

  • Prompts that ask for concise responses.
  • Stopping conditions.
  • Model choice.

When evaluating latency, separate must-have from nice-to-have. Nobody refuses lower latency, but high latency is usually an annoyance, not a deal-breaker.

For cost, model APIs charge by input + output tokens. Self-hosting gives you fixed compute costs, and cost-per-token drops as scale grows. The 7B and 65B parameter sizes exist because they max out 24GB and 80GB GPUs respectively. Re-evaluate cost as scale changes.
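A back-of-the-envelope sketch of the crossover: API spend grows linearly with token volume, while self-hosting is roughly a fixed monthly cost until you need more GPUs. Every number below is an illustrative assumption, not a quote.

```python
# Illustrative assumptions, not real prices.
API_COST_PER_1M_TOKENS = 15.00       # blended input + output, $/1M tokens
SELFHOST_FIXED_MONTHLY = 20_000.00   # GPUs + engineering time, $/month

def monthly_cost_api(tokens_per_month: float) -> float:
    return tokens_per_month / 1_000_000 * API_COST_PER_1M_TOKENS

def monthly_cost_selfhost(tokens_per_month: float) -> float:
    return SELFHOST_FIXED_MONTHLY    # roughly flat until capacity runs out

for tokens in (10e6, 100e6, 1e9, 10e9):
    print(f"{tokens/1e6:>8,.0f}M tokens/mo | "
          f"API ${monthly_cost_api(tokens):>10,.0f} | "
          f"self-host ${monthly_cost_selfhost(tokens):>10,.0f}")
```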

Example Criteria

Criteria | Metric | Benchmark | Hard | Ideal
--- | --- | --- | --- | ---
Cost | Cost per output token | X | < $30 / 1M | < $15 / 1M
Scale | TPM (tokens/min) | X | > 1M | > 1M
Latency | TTFT (P90) | Internal user dataset | < 200 ms | < 100 ms
Latency | Total query time (P90) | Internal user dataset | < 1 min | < 30 s
Overall quality | Elo score | Chatbot Arena | > 1200 | > 1250
Code generation | pass@1 | HumanEval | > 90% | > 95%
Factual consistency | Internal GPT-based metric | Internal hallucination dataset | > 0.8 | > 0.9

Section 2: Model Selection

2.1 Model Selection Workflow

Two kinds of attributes:

  • Hard attributes are impossible or impractical to change: license, training data, model size, your privacy and control policies.
  • Soft attributes are improvable: accuracy, toxicity, factual consistency.

Four iterative steps:

  1. Filter by hard attributes.
  2. Use public information (benchmarks, leaderboards) to shrink the candidate set.
  3. Run your own evaluation pipeline to pick the best for your app.
  4. Continually monitor in production.

2.2 Model Build vs. Buy

For most teams, "build" means hosting an open source model and "buy" means hitting a commercial API.

Open Source Terminology

  • Open source (colloquial): the weights are downloadable.
  • Open weight: the weights are public but the training data isn't.
  • Open model: both the weights and the training data are public.

License questions to always ask:

  • Is commercial use allowed? Meta's first Llama wasn't.
  • Are there restrictions on commercial use? Llama 2/3 require a special license for apps with more than 700M MAU.
  • Can you use the model's outputs to train other models? This matters a lot for distillation and synthetic data. Mistral changed its license to allow it. Llama still doesn't.

Seven Decision Axes

Axis | Model API | Self-hosting
--- | --- | ---
Data privacy | Data must be sent to the vendor | Data stays in-house
Data lineage / IP | Vendor contracts can protect you | Inspect training data (in theory)
Performance | Best-of-breed proprietary models | Gap is closing (e.g., on MMLU); often good enough
Functionality | Scaling, function calling, structured outputs handled for you | Logprobs, intermediate outputs, full finetuning
Cost | Pay per token; can balloon | Pay for engineering and compute; cheaper at scale
Control / transparency | Rate limits, opaque model updates | Freeze the model, inspect changes
On-device | Not possible | Possible

A few things worth flagging:

  • Samsung 2023. Employees leaked proprietary code by pasting it into ChatGPT; the company banned ChatGPT internally shortly after.
  • Zoom 2023. Backlash for changing ToS to use customer data for AI training.
  • Memorization risk. StarCoder memorized 8% of training data.
  • Open source models trail in performance because incentives keep the best models proprietary. Open developers also lack the user-feedback loop.
  • Convai finetuned open source models for 3D AI characters because commercial models kept saying "As an AI model, I don't have physical abilities".

2.3 Navigating Public Benchmarks

Google's BIG-bench has 214 benchmarks. EleutherAI's lm-evaluation-harness supports 400+. OpenAI's evals supports around 500.

Public Leaderboards

Practical limits force leaderboards to pick small subsets:

  • HELM Lite excluded MS MARCO because it's too expensive.
  • Hugging Face excluded HumanEval because it's too compute-intensive.

Hugging Face Open LLM Leaderboard, late 2023, averaged 6 benchmarks:

  1. ARC-C for grade-school science.
  2. MMLU for 57 subjects.
  3. HellaSwag for sentence and scene completion.
  4. TruthfulQA for truthful response generation.
  5. WinoGrande for pronoun resolution.
  6. GSM-8K for grade-school math.

HELM picked 10 benchmarks, only 2 of which overlap with HF. Benchmark correlations matter:

 | ARC-C | HellaSwag | MMLU | TruthfulQA | WinoGrande | GSM-8K
--- | --- | --- | --- | --- | --- | ---
ARC-C | 1.000 | 0.481 | 0.867 | 0.481 | 0.886 | 0.744
HellaSwag | 0.481 | 1.000 | 0.611 | 0.481 | 0.484 | 0.355
MMLU | 0.867 | 0.611 | 1.000 | 0.551 | 0.901 | 0.794
TruthfulQA | 0.481 | 0.423 | 0.551 | 1.000 | 0.455 | 0.501
WinoGrande | 0.886 | 0.484 | 0.901 | 0.455 | 1.000 | 0.798
GSM-8K | 0.744 | 0.355 | 0.794 | 0.501 | 0.798 | 1.000

ARC-C, MMLU, and WinoGrande are highly correlated because they're all reasoning. TruthfulQA stands apart, since improving reasoning doesn't always help truthfulness.

In June 2024, HF refreshed the leaderboard with harder benchmarks: MATH (level-5 problems), MMLU-Pro, GPQA (graduate-level Q&A), MuSR, BBH, and IFEval.

For aggregation, HF averages scores (which assumes scoring scales are comparable, and they aren't really), while HELM uses mean win rate: the fraction of times a model beats another, averaged across scenarios.
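A small sketch contrasting the two aggregation schemes on made-up scores: a plain average assumes the benchmark scales are comparable, while mean win rate only asks how often each model beats the others per benchmark.

```python
# Made-up benchmark scores; rows are models, columns are benchmarks on different scales.
scores = {
    "model_a": {"mmlu": 70.0, "gsm8k": 40.0, "truthfulqa": 55.0},
    "model_b": {"mmlu": 68.0, "gsm8k": 55.0, "truthfulqa": 50.0},
    "model_c": {"mmlu": 60.0, "gsm8k": 42.0, "truthfulqa": 60.0},
}

def simple_average(model: str) -> float:
    vals = scores[model].values()
    return sum(vals) / len(vals)

def mean_win_rate(model: str) -> float:
    """Fraction of head-to-head comparisons the model wins, averaged across benchmarks."""
    others = [m for m in scores if m != model]
    per_benchmark = []
    for bench in scores[model]:
        wins = sum(scores[model][bench] > scores[o][bench] for o in others)
        per_benchmark.append(wins / len(others))
    return sum(per_benchmark) / len(per_benchmark)

for m in scores:
    print(m, "avg:", round(simple_average(m), 2), "mean win rate:", round(mean_win_rate(m), 3))
```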

Why OpenAI's Models Seem to Get Worse

A Stanford/UC Berkeley study (Chen et al., 2023) documented real behavior shifts across versions. The same model update can degrade some apps and improve others: Voiceflow saw a 10% drop on intent classification when migrating from GPT-3.5-turbo-0301 to -1106, while GoDaddy improved on the same migration. Best-overall isn't best-for-you.

2.4 Data Contamination

Also known as data leakage, training on the test set, or just cheating. Rylan Schaeffer's 2023 satirical paper "Pretraining on the Test Set Is All You Need" trained a 1M-param model exclusively on benchmarks and got near-perfect scores, beating much larger models.

How it happens:

  • Unintentional. Internet scraping pulls in public benchmarks.
  • Indirect. The same source provides both training and benchmark data (e.g., math textbooks).
  • Intentional and justified. Train on benchmark data to actually improve user-facing performance.

How to detect it:

  • N-gram overlap is accurate but expensive, and it needs access to the training data (see the sketch below).
  • Perplexity. Low PPL on a benchmark suggests it leaked. Cheaper but less accurate.
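A toy sketch of the n-gram overlap idea: flag a benchmark example if any of its n-grams appears verbatim in the training corpus. The corpus, examples, and the choice of n = 8 are illustrative; published analyses typically use longer n-grams.

```python
def ngrams(text: str, n: int) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(example: str, train_ngrams: set, n: int) -> bool:
    """Flag the example if any of its n-grams appears verbatim in the training corpus."""
    return not ngrams(example, n).isdisjoint(train_ngrams)

N = 8  # toy value; real analyses often use longer n-grams

training_docs = [
    "one of the reasons that the government discourages and regulates monopolies is that",
]
benchmark_examples = [
    "one of the reasons that the government discourages and regulates monopolies is that ...",
    "which planet in the solar system has the most moons",
]

train_ngrams = set().union(*(ngrams(doc, N) for doc in training_docs))
for ex in benchmark_examples:
    print(is_contaminated(ex, train_ngrams, N), "|", ex[:60])
```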

OpenAI found 13 benchmarks for which at least 40% of the examples appeared in GPT-3's training data.

Mitigations: report performance on clean vs. whole benchmark, keep part of the benchmark private, and have leaderboards spot outliers via standard deviation.


Section 3: Design Your Evaluation Pipeline

Step 1: Evaluate All Components in a System

Each component contributes to the end-to-end output. Take "extract a person's current employer from a resume PDF":

  1. PDF to text (evaluate with similarity to ground-truth text).
  2. Text to current employer (evaluate accuracy given correct text).

Without component-level evaluation, you can't tell where the system broke.

Turn-based vs. task-based:

  • Turn: quality of each output (may include multiple steps).
  • Task: did the system actually complete the user goal? In how many turns?

The twenty_questions benchmark in BIG-bench is a clean task-based example:

Bob: Is the concept an animal?
Alice: No.
Bob: Is the concept a plant?
Alice: Yes.
...
Bob: Is it an apple? [correct]

Step 2: Create an Evaluation Guideline

The hard part isn't deciding whether an output is good. It's deciding what good even means. LinkedIn learned this the hard way: a "correct" response can still be a bad one. For a Job Assessment, "You are a terrible fit" may be technically correct but unhelpful. A good response explains the gap and suggests how to close it.

Define Criteria

LangChain's State of AI 2023 says users average 2.3 different criteria per app. A customer support chatbot might use:

  1. Relevance.
  2. Factual consistency.
  3. Safety.

Create Scoring Rubrics with Examples

Pick a system: binary (0/1), 1-5, [-1, 0, 1] for entailment, whatever fits. Validate it with humans (yourself, coworkers). If a human can't follow the rubric, neither can AI. Bonus: the guideline can be reused later for finetuning data annotation.
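For illustration, here is a rubric as it might be packaged for human annotators or an AI judge; the criterion, scale, and anchor examples are invented.

```python
# Invented example: a relevance rubric on a 1-3 scale with anchor examples.
RELEVANCE_RUBRIC = """
Score the RESPONSE for relevance to the QUESTION.
3 - directly answers the question
2 - partially answers, or adds significant off-topic content
1 - does not address the question

Example (score 3):
QUESTION: How do I reset my password?
RESPONSE: Go to Settings > Security > Reset password, then follow the emailed link.

Example (score 1):
QUESTION: How do I reset my password?
RESPONSE: Our premium plan includes 24/7 support.
""".strip()

def judge_prompt(question: str, response: str) -> str:
    """Build the prompt shown to a human annotator or sent to an AI judge."""
    return f"{RELEVANCE_RUBRIC}\n\nQUESTION: {question}\nRESPONSE: {response}\nScore (1-3):"
```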

Tie Evaluation Metrics to Business Metrics

Map metrics to outcomes:

  • 80% factual consistency means automating 30% of customer support.
  • 90% means 50%.
  • 98% means 90%.

Define a usefulness threshold (must hit at least 50% to be useful at all, for example).

Step 3: Define Methods and Data

Select Methods

Mix and match. A cheap classifier on 100% of data plus an expensive AI judge on 1% gets you manageable cost with confidence. Use logprobs when available to measure model confidence (essential for classification, perplexity, fluency). And don't dismiss human evaluation. LinkedIn manually evaluates up to 500 conversations a day.
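A sketch of that mixed setup: a cheap check runs on every response, and a small random sample is routed to an expensive judge. Both scoring functions are placeholders for whatever classifier and AI judge you actually use.

```python
import random

def cheap_check(response: str) -> bool:
    """Placeholder for a fast heuristic or small classifier run on 100% of traffic."""
    return len(response.strip()) > 0

def expensive_judge(query: str, response: str) -> float:
    """Placeholder for an AI-judge call; returns a quality score in [0, 1]."""
    return 1.0  # stub so the sketch runs end to end

def evaluate_batch(pairs, judge_fraction=0.01, seed=42):
    rng = random.Random(seed)
    cheap_scores = [cheap_check(resp) for _, resp in pairs]
    sample = rng.sample(pairs, max(1, int(len(pairs) * judge_fraction)))
    judge_scores = [expensive_judge(q, resp) for q, resp in sample]
    return {
        "cheap_pass_rate": sum(cheap_scores) / len(cheap_scores),
        "judge_mean_score": sum(judge_scores) / len(judge_scores),
        "judged_sample_size": len(judge_scores),
    }

print(evaluate_batch([("example query", "example response")] * 1000))
```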

Annotate Evaluation Data

Use real production data when you can. Use natural labels if they exist. Slice your data for finer-grained understanding:

  • Avoid bias against minority groups.
  • Debug subset failures (long inputs, specific topics).
  • Avoid Simpson's paradox, where A beats B on every subgroup but loses overall:
 | Group 1 | Group 2 | Overall
--- | --- | --- | ---
Model A | 93% (81/87) | 73% (192/263) | 78% (273/350)
Model B | 87% (234/270) | 69% (55/80) | 83% (289/350)

Keep multiple evaluation sets: production-distribution, frequent-failures, frequent-user-mistakes (typos!), out-of-scope.

How Much Data?

Use bootstrapping. Draw N samples with replacement from your set, evaluate, repeat. If results vary wildly, you need more data.
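A sketch of that bootstrap check; the per-example scores are made up. If the bootstrap spread is wide relative to the differences you care about, the evaluation set is too small.

```python
import random
import statistics

def bootstrap_spread(per_example_scores, n_draws=1000, seed=0):
    """Resample with replacement and report the mean and spread of the metric."""
    rng = random.Random(seed)
    k = len(per_example_scores)
    means = [
        statistics.mean(rng.choices(per_example_scores, k=k))
        for _ in range(n_draws)
    ]
    return statistics.mean(means), statistics.stdev(means)

# Hypothetical per-example correctness (1 = pass, 0 = fail) on a 100-item eval set.
scores = [1] * 78 + [0] * 22
mean, spread = bootstrap_spread(scores)
print(f"accuracy ≈ {mean:.3f} ± {spread:.3f} (bootstrap std)")
```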

OpenAI's guidance on the sample size needed to detect a given score difference at 95% confidence:

Score difference | Sample size needed
--- | ---
30% | ~10
10% | ~100
3% | ~1,000
1% | ~10,000

Rule of thumb: a 3× decrease in detectable difference needs 10× more samples.

EleutherAI's median benchmark size is 1,000 examples (avg 2,159). The Inverse Scaling Prize floor is 300, with 1,000+ preferred.

Evaluate Your Evaluation Pipeline

Ask:

  • Right signals? Do better responses get higher scores? Do better metrics correlate with better business outcomes?
  • Reliability? Run twice. Same result? Set AI judge temperature to 0.
  • Metric correlation? Drop redundant metrics, and investigate fully uncorrelated ones (signal or noise?).
  • Cost & latency? Don't skip evaluation to save latency. That's a risky trade.

Iterate

Track everything: evaluation data, rubric, prompts, sampling configs. Otherwise you can't tell if a metric change reflects the application or the evaluation.


Summary

  • Evaluation is the biggest blocker to AI adoption. Evaluation-driven development means defining criteria before building.
  • Criteria fall into four buckets: domain-specific capability, generation capability (factual consistency, safety), instruction-following, and cost/latency.
  • Model selection is really just building your own private leaderboard. Filter by hard attributes (license, privacy, on-device), use public info to narrow, run your own pipeline.
  • Build vs. buy breaks down across 7 axes: data privacy, lineage, performance, functionality, cost, control, on-device. The same use case can flip over time.
  • Public benchmarks help filter bad models but won't pick best-for-you. Watch out for contamination, saturation, and opaque selection or aggregation.
  • Designing an evaluation pipeline means per-component, per-turn, and per-task evaluation, with clear rubrics and examples, tied to business metrics, sliced data, pipeline reliability checks, and careful iteration with experiment tracking.