Chapter 9: Inference Optimization
Introduction
Models keep changing, but making them faster and cheaper never stops mattering. A next-day stock predictor that takes two days to run is useless. An expensive model has bad ROI. Inference optimization lives at three layers (model, hardware, service) and is genuinely interdisciplinary work. Model researchers, application developers, system and compiler engineers, hardware architects, and data center operators all touch it. This chapter focuses on model-level and service-level optimization, with an overview of accelerators.
Section 1: Understanding Inference Optimization
1.1 Inference Overview
The component that runs inference is the inference server. The broader inference service also receives, routes, and sometimes preprocesses requests.
OpenAI and Google APIs are inference services. Self-hosting means you own all of this.
1.2 Computational Bottlenecks
Compute-bound workloads bottleneck on computation (password decryption, for example). Memory bandwidth-bound workloads bottleneck on data transfer between memory and processors.
Terminology note. AI engineers often say "memory-bound" when they mean capacity. System engineers use it for bandwidth. The Roofline paper uses memory-bound for bandwidth-bound.
Transformer LLM Inference Profile
- Prefill processes input tokens in parallel. Compute-bound.
- Decode generates one token at a time and has to load large matrices. Memory bandwidth-bound.
The two phases have different profiles, which is why they're often decoupled in production. Image generation (Stable Diffusion) is compute-bound; autoregressive LLM inference is dominated by memory bandwidth-bound decoding.
1.3 Online and Batch Inference APIs
| | Online | Batch |
|---|---|---|
| Optimizes for | Latency | Cost |
| Turnaround | Seconds | Hours |
| Pricing | Standard | ~50% discount (Gemini, OpenAI) |
Use online for chatbots and code generation. Use batch for synthetic data generation, periodic reports, customer onboarding doc processing, model migration reprocessing, and knowledge base reindexing.
Streaming mode returns each token as it's generated. Better TTFT, but harder to score the response before showing it.
Foundation-model batch isn't classical ML batch. Classical ML "batch" precomputes for predictable inputs (recommender systems). Foundation models can't precompute open-ended user prompts.
1.4 Inference Performance Metrics
Latency
- TTFT (Time To First Token) is the duration of prefill. Depends on input length.
- TPOT (Time Per Output Token) is time per generated token. Around 120ms (6-8 tokens/s) is enough for human reading.
- TBT (Time Between Tokens) / ITL (Inter-Token Latency) is the same idea.
Total latency = TTFT + TPOT × (number of output tokens)
Reducing TTFT at the cost of higher TPOT is possible by shifting compute from decoding to prefilling. UX tradeoff: instant first token plus a slower stream vs. a slight delay plus a faster stream.
Time to publish is when the user first sees a token. For CoT or agentic queries, the model-internal first token isn't the user-visible first token.
Don't average. Use percentiles (p50, p90, p95, p99). One outlier can skew everything.
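A minimal sketch of both points, with made-up numbers:

```python
import numpy as np

# Hypothetical per-request measurements; values are illustrative only.
ttft = np.array([0.21, 0.18, 0.25, 1.90, 0.22])  # seconds; note the one outlier
tpot = 0.12                                       # 120 ms per output token
n_out = 200                                       # output length in tokens

total_latency = ttft + tpot * n_out               # total = TTFT + TPOT x n_out
print(np.percentile(ttft, [50, 90, 99]))          # report percentiles, not means
```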
Throughput
- TPS (tokens/second) across all users and requests. Treat input and output throughput separately because they have different bottlenecks.
- RPS / RPM (requests per second/minute).
- Direct cost: $2/h with 100 tok/s = $5.56 per 1M output tokens.
Latency/throughput trade-off: batching boosts throughput but increases latency. LinkedIn: 2-3× throughput is common if you sacrifice TTFT/TPOT.
Goodput is requests/second satisfying SLOs. If you can serve 100 RPS but only 30 satisfy your SLO (TTFT ≤ 200ms, TPOT ≤ 100ms), goodput is 30 RPS.
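The cost and goodput arithmetic, spelled out (SLO thresholds and request data are illustrative):

```python
# Direct cost: instance price divided by tokens served.
cost_per_hour = 2.00
tokens_per_s = 100
print(cost_per_hour / (tokens_per_s * 3600) * 1e6)  # 5.56 -> $5.56 per 1M tokens

# Goodput: only requests meeting BOTH SLOs count.
requests = [
    {"ttft": 0.15, "tpot": 0.09},   # meets both SLOs
    {"ttft": 0.30, "tpot": 0.08},   # TTFT too slow
    {"ttft": 0.18, "tpot": 0.12},   # TPOT too slow
]
good = sum(r["ttft"] <= 0.2 and r["tpot"] <= 0.1 for r in requests)
print(f"goodput: {good}/{len(requests)} requests meet the SLO")
```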
Utilization
- NVIDIA GPU utilization (nvidia-smi) reports the percentage of time the GPU is "active". Misleading: a GPU doing 1 op/s when it could do 100 still reports 100% utilization.
- MFU (Model FLOP/s Utilization) is observed throughput / theoretical peak FLOP/s.
- MBU (Model Bandwidth Utilization) is observed bandwidth used / theoretical peak.
Bandwidth used = parameter count × bytes/param × tokens/s
MBU = bandwidth used / theoretical bandwidth
A 7B FP16 model at 100 tok/s moves 7B params × 2 bytes × 100/s = 1.4 TB/s. On an A100-80GB (2 TB/s) that's MBU = 70%.
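The same arithmetic as code; all numbers come from the example above:

```python
params = 7e9                 # 7B parameters
bytes_per_param = 2          # FP16
tokens_per_s = 100

bandwidth_used = params * bytes_per_param * tokens_per_s  # 1.4e12 B/s = 1.4 TB/s
peak_bandwidth = 2e12                                     # A100-80GB: ~2 TB/s
print(bandwidth_used / peak_bandwidth)                    # 0.7 -> MBU = 70%
```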
Compute-bound workloads have high MFU and low MBU. Bandwidth-bound workloads have low MFU and high MBU. Training MFU is higher than inference MFU. Training >50% MFU is generally good. Examples:
| Model | Params | Chips | MFU |
|---|---|---|---|
| GPT-3 | 175B | V100 | 21.3% |
| Gopher | 280B | 4096 TPU v3 | 32.5% |
| Megatron-Turing NLG | 530B | 2240 A100 | 30.2% |
| PaLM | 540B | 6144 TPU v4 | 46.2% |
The goal isn't max utilization. It's lowest cost and latency for your workload.
1.5 AI Accelerators
Why Accelerators
GPUs make AI possible. AlexNet (2012) famously trained on GPUs and popularized them for neural network training.
- CPU: a few powerful cores for sequential workloads; high-end consumer chips top out around 64 cores.
- GPU: thousands of small cores, parallel. Matrix multiplication is more than 90% of NN FLOPs.
Major accelerators: NVIDIA GPUs, AMD GPUs, Google TPU, Intel Habana Gaudi, Graphcore IPU, Groq LPU, Cerebras WSE.
Many AI workloads are inference, not training. Inference can exceed training cost in commonly used systems and accounts for up to 90% of ML costs in deployed AI systems (Desislavov et al., 2023). Specialized inference chips: Apple Neural Engine, AWS Inferentia, MTIA, Google Edge TPU, NVIDIA Jetson Xavier.
Computational Capabilities
NVIDIA H100 SXM example:
| Precision | TFLOP/s |
|---|---|
| TF32 Tensor Core | 989 |
| BFLOAT16 Tensor Core | 1,979 |
| FP16 Tensor Core | 1,979 |
| FP8 Tensor Core | 3,958 |
Memory Hierarchy
| Level | Speed | Size |
|---|---|---|
| CPU memory (DRAM) | 25-50 GB/s | 16GB-1TB+ |
| GPU HBM | 256 GB/s - 1.5 TB/s+ | 24-80 GB |
| GPU on-chip SRAM (L1/L2 caches) | 10+ TB/s | <40 MB |
GPU programming stacks: CUDA (NVIDIA), Triton (OpenAI), ROCm (AMD).
Power Consumption
H100 has 80 billion transistors. Running at peak for a year is around 7,000 kWh (a US household uses around 10,000 kWh). Electricity is a real bottleneck. Data center siting is now a geopolitical and power-grid problem. Specs include maximum power draw and TDP (thermal design power).
Selecting Accelerators
Three questions:
- Can the hardware run your workloads?
- How long does it take?
- How much does it cost?
Compute-bound workloads optimize for FLOP/s. Memory-bound workloads optimize for bandwidth and memory.
Section 2: Model Optimization
Archery analogy: model-level is better arrows. Hardware-level is a stronger archer. Service-level is a better shooting process.
Three transformer characteristics make inference resource-intensive:
- Model size invites compression.
- Autoregressive decoding is the sequential bottleneck.
- Attention mechanism has quadratic compute and a large KV cache.
2.1 Model Compression
- Quantization (Chapter 7) is the most effective. Reduce bits per value.
- Distillation (Chapter 8) trains a small student to mimic a large teacher.
- Pruning removes nodes (architecture change) or zeros out unimportant params (sparsity). Frankle & Carbin (the lottery ticket hypothesis): pruning can remove more than 90% of parameters without hurting accuracy. In practice it's less common because it's harder to do well and not all hardware exploits sparsity.
2.2 Overcoming Autoregressive Decoding
Output tokens cost 2-4× as much as input tokens. Anyscale: for latency purposes, one output token costs roughly as much as 100 input tokens.
Speculative Decoding
A faster draft model generates K tokens. The target model verifies them in parallel. Accept the longest agreed prefix. The target model adds one more.
Why it works:
- Verification is parallelizable (cheap). Generation is sequential.
- Many tokens are easy. A weak draft gets them right.
- Decoding has spare FLOPs (memory-bandwidth-bound) for free verification.
DeepMind: 4B-param draft for Chinchilla-70B got 8× faster token generation, halved overall latency, no quality loss. Implemented in vLLM, TensorRT-LLM, llama.cpp.
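A minimal greedy sketch of the control flow. Real implementations (vLLM, TensorRT-LLM) use rejection sampling to preserve the target model's output distribution; draft_next and target_parallel are hypothetical wrappers around the two models:

```python
def speculative_step(draft_next, target_parallel, tokens, k=4):
    # Draft model proposes k tokens, one at a time (cheap but sequential).
    proposed = []
    for _ in range(k):
        proposed.append(draft_next(tokens + proposed))

    # Target model scores the whole extended sequence in ONE forward pass;
    # preds[i] is the target's greedy choice for the token after position i.
    preds = target_parallel(tokens + proposed)

    # Accept the longest prefix on which draft and target agree.
    n = len(tokens)
    accepted = []
    for i, tok in enumerate(proposed):
        if preds[n + i - 1] != tok:
            break
        accepted.append(tok)

    # The target always contributes one more token: a correction at the
    # first mismatch, or a bonus token after a fully accepted draft.
    accepted.append(preds[n + len(accepted) - 1])
    return tokens + accepted
```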
Inference with Reference
Like speculative decoding, but the draft tokens are copied from the input. Useful when input and output overlap (RAG, code editing, multi-turn). 2× speedup in those cases. No extra model needed.
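A sketch of the drafting step, reusing the verify-in-parallel loop above; the function name and parameters are invented for illustration. In the spirit of prompt-lookup techniques, it proposes the tokens that followed the most recent earlier occurrence of the current n-gram tail:

```python
def reference_draft(tokens, ngram=3, k=8):
    # Find the most recent earlier occurrence of the last `ngram` tokens
    # and propose the k tokens that followed it. No draft model needed;
    # the target model still verifies the proposal in parallel.
    tail = tokens[-ngram:]
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == tail:
            return tokens[start + ngram:start + ngram + k]
    return []  # no overlap found: fall back to normal decoding
```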
Parallel Decoding
Generate multiple future tokens simultaneously, then verify.
- Lookahead decoding (Fu et al., 2024). The same decoder generates parallel token guesses via Jacobi iteration, verifies them, and regenerates failed ones.
- Medusa (Cai et al., 2024). Extra decoding heads predict tokens at fixed future offsets. NVIDIA: Medusa boosted Llama 3.1 token generation 1.9× on H200.
2.3 Attention Mechanism Optimization
KV Cache
Generating token t+1 needs K/V for all previous tokens. Cache them.
- KV cache grows linearly with sequence length and batch size.
- For a 500B+ model with multi-head attention, batch 512, context 2048, you're looking at a 3TB KV cache (3× model weights).
KV cache size = 2 × B × S × L × H × M
where B is batch, S is seq length, L is layers, H is model dim, M is bytes/value.
Llama 2-13B example: 40 layers × 5,120 dim × batch 32 × seq 2,048 × 2 bytes × 2 = 54 GB.
KV cache is inference-only (not training).
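The formula as a helper function, a sketch using the standard multi-head attention dimensions from above:

```python
def kv_cache_bytes(batch, seq_len, layers, hidden_dim, bytes_per_value=2):
    # 2x for storing both K and V at every layer and position.
    return 2 * batch * seq_len * layers * hidden_dim * bytes_per_value

# The Llama 2-13B example above:
gb = kv_cache_bytes(batch=32, seq_len=2048, layers=40, hidden_dim=5120) / 1e9
print(f"{gb:.0f} GB")  # ~54 GB
```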
Three Categories of Attention Optimization
1. Redesigning attention (architecture changes, only at training/finetuning):
- Local windowed attention attends to a fixed window. Window 1K vs. 10K context is 10× KV cache reduction. Often interleaved with global attention.
- Cross-layer attention shares K/V across adjacent layers.
- Multi-query attention shares K/V across query heads.
- Grouped-query attention puts heads in groups and shares within group.
Character.AI's average conversation is 180 messages. Multi-query plus interleaved local/global plus cross-layer gives a 20× KV cache reduction.
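These reduction factors multiply, which is how stacked changes reach figures like Character.AI's 20×. A back-of-envelope sketch (head counts and window sizes are illustrative):

```python
def gqa_kv_reduction(query_heads, kv_groups):
    # K/V are stored once per group instead of once per query head.
    return query_heads / kv_groups

def local_window_kv_reduction(context_len, window):
    # Local layers only keep K/V for the window, not the full context.
    return context_len / window

print(gqa_kv_reduction(32, 1))                    # 32x in the multi-query limit
print(gqa_kv_reduction(32, 8))                    # 4x with 8 KV groups
print(local_window_kv_reduction(10_000, 1_000))   # 10x, the example above
```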
2. KV cache management:
- PagedAttention (vLLM) uses non-contiguous blocks for less fragmentation and flexible sharing.
- KV cache quantization, adaptive compression, selective KV cache.
3. Attention computation kernels:
FlashAttention (Dao et al., 2022) fuses the attention operations into a single kernel and tiles the computation so the full attention matrix never materializes in HBM. FlashAttention-3 targets the H100 (Shah et al., 2024).
2.4 Kernels and Compilers
A kernel is hardware-optimized code. Common AI ops (matmul, attention, convolution) have specialized kernels per chip.
Kernel-writing techniques:
- Vectorization processes contiguous elements simultaneously.
- Parallelization splits work into independent chunks across cores.
- Loop tiling orders data accesses to match the hardware memory layout and cache (hardware-dependent).
- Operator fusion combines ops into one pass for less memory I/O.
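A minimal illustration of fusion with torch.compile (not a benchmark): in eager mode the add and the GELU run as separate kernels with separate trips through GPU memory; TorchInductor can fuse such pointwise chains into one kernel.

```python
import torch

def bias_gelu(x, bias):
    # Eager mode: two kernels, two round-trips through memory.
    return torch.nn.functional.gelu(x + bias)

# Compiled: TorchInductor can emit one fused kernel that reads x once.
fused_bias_gelu = torch.compile(bias_gelu)
```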
A compiler lowers model code to hardware-specific code. Compilers: Apache TVM, MLIR, torch.compile, XLA, TensorRT.
PyTorch Inference Optimization Case Study
Stack of optimizations on Llama-7B (A100 80GB):
- torch.compile for faster kernels.
- Quantize weights to INT8.
- Quantize weights to INT4.
- Add speculative decoding.
Section 3: Inference Service Optimization
Service-level optimizations don't change the model. Just how it's served.
3.1 Batching
| Strategy | Behavior | Tradeoff |
|---|---|---|
| Static | Wait until batch is full | High latency for early requests |
| Dynamic | Wait until batch full OR time window | Compute waste if window expires with empty seats |
| Continuous (in-flight) | Add new request when one completes | Best. Short responses don't wait for long ones |
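A minimal dynamic-batching loop, written as hypothetical server glue code (names and thresholds invented). It dispatches when the batch fills or the window expires, whichever comes first; continuous batching would instead admit new requests as individual sequences in the running batch finish.

```python
import time
from queue import Queue, Empty

def dynamic_batches(requests: Queue, max_batch=8, max_wait_s=0.05):
    while True:
        batch, deadline = [], time.monotonic() + max_wait_s
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break  # window expired: ship a partial batch
            try:
                batch.append(requests.get(timeout=remaining))
            except Empty:
                break
        if batch:
            yield batch  # one batched forward pass through the model
```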
3.2 Decoupling Prefill and Decode
Prefill (compute-bound) and decode (memory-bandwidth-bound) compete on the same machine. Disaggregate them onto different instances.
DistServe and the "Inference Without Interference" paper showed significant volume increases at the same SLOs. Communication overhead between prefill and decode instances is fine on modern interconnects (NVLink).
Prefill:Decode ratio depends on workload:
- Long inputs and TTFT priority means 2:1 to 4:1 prefill instances.
- Short inputs and TPOT priority means 1:2 to 1:1.
3.3 Prompt Caching
Cache overlapping prompt segments: system prompts, long documents, prior conversation. Big savings when the system prompt is long and many calls reuse it.
Anthropic prompt caching: up to 90% cost reduction, up to 75% latency reduction.
| Use case | Latency w/o caching | Latency w/ caching | Cost reduction |
|---|---|---|---|
| Chat with a book (100K tokens) | 11.5s | 2.4s (-79%) | -90% |
| Many-shot prompting (10K) | 1.6s | 1.1s (-31%) | -86% |
| Multi-turn convo (10 turns) | ~10s | ~2.5s (-75%) | -53% |
Google Gemini caches at a 75% input discount and charges $1/1M tokens/hour for storage.
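Provider-side caches store the computed KV state for the shared prefix, with mechanics opaque to clients. A conceptual sketch of the lookup structure only; everything here is invented for illustration:

```python
import hashlib

class PrefixCache:
    def __init__(self):
        self._store = {}

    def get_or_compute(self, prefix: str, compute_kv):
        # Key on the stable prefix (system prompt, long document, ...).
        key = hashlib.sha256(prefix.encode()).hexdigest()
        if key not in self._store:
            self._store[key] = compute_kv(prefix)  # full-price prefill, once
        return self._store[key]  # later calls skip prefill for the prefix
```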
3.4 Parallelism
Replica Parallelism
Spin up multiple model copies. The most straightforward approach. It's a bin-packing problem when fitting models of different sizes onto chips of different memory capacities.
Model Parallelism
Split a single model across machines.
Tensor parallelism (intra-operator) splits tensors in an operator across devices. Most common for inference. Reduces latency and enables large models. Communication overhead.
Pipeline parallelism puts different stages on different machines. Micro-batches flow through. Adds latency per request (more communication). Common in training, less in inference.
Context parallelism splits the input sequence across machines. Sequence parallelism splits operators (attention on machine 1, FFN on machine 2).
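A toy illustration of the tensor-parallel split, with a single process standing in for two devices: shard a weight matrix column-wise, compute partial outputs, and concatenate (an all-gather in a real system).

```python
import torch

torch.manual_seed(0)
x = torch.randn(1, 512)        # activations for one token
W = torch.randn(512, 2048)     # full weight matrix of one linear layer

W0, W1 = W.chunk(2, dim=1)     # column shards; each lives on its own device
y = torch.cat([x @ W0, x @ W1], dim=1)   # all-gather of partial outputs

assert torch.allclose(y, x @ W, atol=1e-4)
```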
Summary
- Inference cost and latency define a model's usability. AI integration depends on cheap and fast inference.
- Latency breaks into TTFT (prefill) plus TPOT × output length (decode). Use percentiles, not averages.
- Throughput is directly tied to cost. There's a latency/throughput tradeoff. Goodput is SLO-satisfying RPS.
- Utilization: NVIDIA GPU util is misleading. MFU and MBU are better.
- Hardware: GPUs (NVIDIA, AMD), TPUs, specialized inference chips. Memory hierarchy is CPU DRAM to GPU HBM to on-chip SRAM. Power matters. H100 burns around 7,000 kWh/year.
- Model-level optimization: compression (quantization, distillation, pruning), autoregressive decoding bypasses (speculative decoding, inference with reference, parallel decoding), attention optimization (redesigned attention, KV cache management, kernels like FlashAttention).
- Service-level optimization: batching (static / dynamic / continuous), prefill-decode decoupling, prompt caching (massive savings for long shared prefixes), parallelism (replica / tensor / pipeline / context / sequence).
- The most impactful techniques across use cases: quantization, tensor parallelism, replica parallelism, attention mechanism optimization.