AI Engineering by Chip Huyen

Chapter 7: Finetuning

Introduction

Finetuning means adapting a model to a task by updating its weights. Prompt-based methods don't touch the weights. Most teams reach for finetuning when they need better instruction-following (especially output formats) or domain-specific knowledge. It's more powerful than prompting and also more expensive: data, hardware, ML expertise. This chapter covers when to finetune, why it eats memory, how PEFT and LoRA work, how to merge models, and which hyperparameters matter.


Section 1: Finetuning Overview

Finetuning is transfer learning: knowledge from a base task transfers to a related one (Bozinovski & Fulgosi, 1976). It improves sample efficiency. Training a legal QA model from scratch needs millions of examples. Finetuning a strong base might need a few hundred.

The InstructGPT view: finetuning unlocks capabilities the model already has but that are hard to access through prompting alone.

Different forms of finetuning:

  • Continued pre-training. Self-supervised finetuning on domain text (legal docs before legal Q&A pairs).
  • Supervised finetuning (SFT). (input, output) pairs.
  • Preference finetuning. (instruction, winning response, losing response) triples. Typically uses RL.
  • Long-context finetuning. Modifies positional embeddings. Harder. Can degrade short-context performance.
  • Infilling finetuning. Fill-in-the-blank. Useful for code editing and debugging. Works even on autoregressive bases.

Feature-based transfer is different from finetuning. You extract features (embeddings) from a model and feed them to another model. Common in computer vision (ImageNet to object detection).


Section 2: When to Finetune

2.1 Reasons to Finetune

  • Improve task-specific quality. A less common SQL dialect, for example.
  • Structured outputs like JSON or YAML.
  • Bias mitigation. Garimella et al. (2022): finetuning BERT on text written by women reduces gender bias; finetuning on text by African authors reduces racial bias.
  • Distillation. Finetune a small model on a large model's outputs. The resulting smaller model is cheaper, faster, and easier to finetune.

Grammarly's example: a finetuned Flan-T5 outperformed a 60× larger GPT-3 variant on text editing using only 82,000 (instruction, output) pairs.

2.2 Reasons Not to Finetune

  • Finetuning for one task can degrade other tasks (alignment tax).
  • High up-front investment in data, ML knowledge, and training infrastructure.
  • Maintenance. Base models improve faster than you can re-finetune. When do you swap?
  • Most "prompting doesn't work" complaints turn out to be unsystematic prompt experiments.
  • Watch out for "domain-specific models always beat general":
    • BloombergGPT (50B params, $1.3-2.6M to train, March 2023) was significantly outperformed by GPT-4-0314 (zero-shot) on financial benchmarks:
Model                    FiQA sentiment (F1)   ConvFinQA (acc)
GPT-4-0314 (zero-shot)   87.15                 76.48
BloombergGPT             75.07                 43.41

One often-cited reason to finetune, shortening prompts to cut token costs, is less compelling now that prompt caching exists.

2.3 Finetuning vs. RAG

Finetuning is for form. RAG is for facts.

  • Information-based failures mean RAG. The model lacks info or has outdated info.
  • Behavior-based failures mean finetuning. Outputs are factually fine but irrelevant, malformatted, or wrong style.

Ovadia et al. (2024): for current-events Q&A, base + RAG beats finetuning. Sometimes base + RAG even beats finetuned + RAG:

Model        Base    Base+RAG   FT-reg   FT-par   FT-reg+RAG   FT-par+RAG
Mistral-7B   0.481   0.875      0.504    0.588    0.810        0.830

You can combine them. In the same study, RAG plus finetuning improved MMLU over RAG alone only 43% of the time; in the remaining 57% it didn't.

Example Workflow

  1. Prompting first (with versioning).
  2. Add examples (1-50).
  3. Add RAG with simple term-based retrieval.
  4. If still failing:
     a. Information-based problems get advanced RAG (embedding-based).
     b. Behavior-based problems get finetuning.
  5. Combine RAG and finetuning.

Section 3: Memory Bottlenecks

Why finetuning is memory-intensive, and why so many techniques fight for memory efficiency.

Key Takeaways

  1. Memory is the bottleneck when working with foundation models, for both inference and training, and training takes far more memory than inference.
  2. Memory drivers: # parameters, # trainable params, numerical representation.
  3. More trainable params means more memory. PEFT reduces trainable params.
  4. Quantization reduces bits-per-value. 13B params × 4 bytes (FP32) = 52GB drops to 26GB at 2 bytes.
  5. Inference often uses 16/8/4 bits.
  6. Training is more sensitive and typically uses mixed precision.

3.1 Backpropagation and Trainable Parameters

  • Forward pass. Compute output from input.
  • Backward pass. Compute loss, compute gradients per trainable param, update via optimizer.

Optimizer state per trainable param:

  • SGD: 0 values.
  • Momentum: 1 value.
  • Adam: 2 values (most common for transformers).

Each trainable param needs gradient + optimizer states. More trainable params, more memory.
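
To make the bookkeeping concrete, here is a minimal sketch of one Adam update in plain Python/NumPy. The function name and defaults are illustrative; the point is that m and v are the two extra values Adam stores per trainable parameter.

    import numpy as np

    def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # m and v are the two optimizer states Adam keeps per trainable parameter.
        m = beta1 * m + (1 - beta1) * grad        # state 1: running mean of gradients
        v = beta2 * v + (1 - beta2) * grad**2     # state 2: running mean of squared gradients
        m_hat = m / (1 - beta1**t)                # bias correction for early steps
        v_hat = v / (1 - beta2**t)
        param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
        return param, m, v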

3.2 Memory Math

Inference

Memory = N × M × 1.2

where N is the number of params and M is bytes per param. The 1.2× covers activations and KV cache.

  • 13B params × 2 bytes = 26GB → 31.2GB total.
  • 70B params × 2 bytes = 140GB just for weights.
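
As a sketch, the formula translates directly into a back-of-the-envelope helper (the function name is mine; counting params in billions yields GB directly):

    def inference_memory_gb(n_params_billions, bytes_per_param, overhead=1.2):
        # Memory = N x M x 1.2; the 1.2 covers activations and KV cache.
        return n_params_billions * bytes_per_param * overhead

    print(inference_memory_gb(13, 2))   # 31.2 (GB) for a 13B model at 16-bit
    print(13 * 2)                       # 26 (GB) weights only
    print(70 * 2)                       # 140 (GB) weights only for a 70B model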

Training

Training memory = weights + activations + gradients + optimizer states

For 13B params with Adam at 2 bytes:

  • Gradients + optimizer states = 13B × 3 values (1 gradient + 2 Adam states) × 2 bytes = 78GB.
  • If only 1B trainable: 1B × 3 × 2 = 6GB.
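
The same arithmetic for the training-only overhead, as a hedged sketch (again, the helper name is mine):

    def training_extra_memory_gb(trainable_params_billions, bytes_per_value=2,
                                 optimizer_values_per_param=2):
        # 1 gradient + N optimizer values per trainable parameter.
        # Adam keeps 2 optimizer values, so 3 values total per parameter.
        values_per_param = 1 + optimizer_values_per_param
        return trainable_params_billions * values_per_param * bytes_per_value

    print(training_extra_memory_gb(13))   # 78.0 GB when all 13B params are trainable
    print(training_extra_memory_gb(1))    # 6.0 GB when only 1B params are trainable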

Gradient checkpointing / activation recomputation trades compute for memory. Recompute activations instead of storing them.
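
In PyTorch, gradient checkpointing is exposed via torch.utils.checkpoint; a minimal sketch, assuming a recent PyTorch version:

    import torch
    from torch.utils.checkpoint import checkpoint

    block = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024),
    )
    x = torch.randn(8, 1024, requires_grad=True)

    # Activations inside `block` are not stored during the forward pass;
    # they are recomputed during backward, trading compute for memory.
    y = checkpoint(block, x, use_reentrant=False)
    y.sum().backward()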

3.3 Numerical Representations

Format        Bits    Notes
FP64          64      Default in NumPy/Pandas. Rare in DL
FP32          32      Single precision
FP16          16      Half precision
BF16          16      Google for TPUs. More range, less precision than FP16
TF32          19      NVIDIA for GPUs
INT8 / INT4   8 / 4   Increasingly popular

FP16 vs BF16. Same total bits but different range/precision split. Llama 2 weights were in BF16. Many teams loaded them in FP16 and silently degraded the model.
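
The range difference is easy to see: FP16's largest finite value is 65,504, while BF16 keeps roughly FP32's range. A small sketch:

    import torch

    x = torch.tensor([70000.0])       # fits in BF16's range, exceeds FP16's
    print(x.to(torch.float16))        # inf -- silent overflow
    print(x.to(torch.bfloat16))       # finite, just coarsely rounded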

3.4 Quantization

Strictly, quantization means converting to an integer format. In practice, the term covers any precision reduction.

What and When

  • What. Weights are most quantized. Activation quantization is harder.
  • When. PTQ (post-training quantization) is most common. Run a fully trained model in lower precision.
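
A minimal sketch of one common PTQ scheme, absmax quantization to INT8. This is one simple recipe among many; real systems use per-channel scales and smarter calibration:

    import numpy as np

    def quantize_int8(w):
        # Absmax PTQ: map floats onto the integer grid [-127, 127].
        scale = np.abs(w).max() / 127.0
        q = np.round(w / scale).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(4, 4).astype(np.float32)
    q, scale = quantize_int8(w)
    print(np.abs(w - dequantize(q, scale)).max())  # small rounding error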

Inference Quantization

  • 32-bit to 16-bit to 8-bit to 4-bit progression.
  • Apple ships iPhone models at around 3.5 bits-per-weight (mix of 2-bit and 4-bit).
  • NVIDIA Blackwell supports 4-bit float inference.
  • BitNet b1.58 (Microsoft, 2024): 1.58 bits per parameter, comparable to 16-bit Llama 2 at up to 3.9B params.
Model               Size   Avg benchmark
Llama 700M          700M   45.5
BitNet b1.58 700M   700M   44.3
Llama 3B            3B     49.7
BitNet b1.58 3B     3B     50.2
BitNet b1.58 3.9B   3.9B   51.2

Reduced precision also speeds up computation: fewer bits mean less data to move and faster arithmetic.

Training Quantization

  • QAT (Quantization-Aware Training). Simulates low-precision behavior during training so the model produces good outputs at inference.
  • Direct lower-precision training. Character.AI trained entirely in INT8.
  • Mixed precision. Keep weights in higher precision, gradients and activations in lower. AMP (automatic mixed precision) in many ML frameworks.
  • LLM-QAT puts weights and activations in 4 bits and embeddings in 16.
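
A minimal mixed-precision training loop with PyTorch AMP. This assumes a recent PyTorch and a CUDA GPU; the toy model and loss are illustrative:

    import torch

    model = torch.nn.Linear(1024, 1024).cuda()
    opt = torch.optim.Adam(model.parameters())
    scaler = torch.cuda.amp.GradScaler()      # rescales the loss to avoid FP16 underflow

    for _ in range(10):
        x = torch.randn(32, 1024, device="cuda")
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = model(x).square().mean()   # forward runs in FP16 where safe
        opt.zero_grad()
        scaler.scale(loss).backward()         # gradients flow through the scaled loss
        scaler.step(opt)                      # unscales, then updates the weights
        scaler.update()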

Section 4: Finetuning Techniques

4.1 Parameter-Efficient Finetuning (PEFT)

Full finetuning updates all weights. For a 7B model in FP16 + Adam:

  • Weights: 14GB.
  • Gradients + optimizer: 7B × 3 × 2 = 42GB.
  • Total at least 56GB (excluding activations). That exceeds most consumer GPUs (12-48GB).

Partial finetuning updates some layers. Saves memory but needs around 25% of params to match full finetuning on GLUE (Houlsby et al., 2019).

Adapter-based PEFT (Houlsby et al., 2019) inserts small adapter modules and updates only those. On GLUE, adapters came within 0.4% of full-finetuning performance using only 3% of the trainable params. Downside: the extra layers add inference latency.

4.2 PEFT Techniques

Two buckets:

  • Adapter-based (additive). Add new params. Examples: LoRA, BitFit, IA3, LongLoRA.
  • Soft-prompt-based. Add trainable continuous "prompt" vectors. Not human-readable, unlike hard prompts. Variants: prefix-tuning, P-Tuning, prompt tuning, differing by where the prompts are inserted.

LoRA dominates by a wide margin.

4.3 LoRA

Low-Rank Adaptation (Hu et al., 2021). Add params that merge back into the original layers. No inference latency overhead.

For weight matrix W (n × m):

  1. Pick rank r. Build A (n × r) and B (r × m). Their product W_AB has the same shape as W.
  2. Use W' = W + (α/r) × W_AB.
  3. Finetune only A and B. Keep W frozen.
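
A minimal LoRA layer sketch in PyTorch, following the shapes above (A is n × r, B is r × m; class name and defaults are mine). Per the LoRA paper, the output-side matrix starts at zero so that W' = W before training:

    import math
    import torch

    class LoRALinear(torch.nn.Module):
        # Wraps a frozen linear layer W and trains only A (n x r) and B (r x m).
        def __init__(self, linear: torch.nn.Linear, r: int = 8, alpha: int = 16):
            super().__init__()
            self.base = linear
            for p in self.base.parameters():
                p.requires_grad_(False)                        # keep W frozen
            n, m = linear.out_features, linear.in_features
            self.A = torch.nn.Parameter(torch.zeros(n, r))     # zero init: W' == W at start
            self.B = torch.nn.Parameter(torch.randn(r, m) / math.sqrt(r))
            self.scale = alpha / r                             # the alpha/r factor

        def forward(self, x):
            # base(x) + (alpha/r) * x @ (A @ B)^T, without materializing A @ B
            return self.base(x) + self.scale * (x @ self.B.T @ self.A.T)

    layer = LoRALinear(torch.nn.Linear(4096, 4096), r=8, alpha=16)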

Why does it work? LLMs have low intrinsic dimension. Pre-training implicitly compresses representations. Larger pre-trained models have lower intrinsic dimension, making them easier to adapt with few params and few examples.

For GPT-3, LoRA matches or beats full finetuning with 0.0027% of trainable params (4.7M).

LoRA Configuration

LoRA is typically applied to attention matrices: query (Wq), key (Wk), value (Wv), output projection (Wo). With an 18M trainable-param budget on GPT-3-175B:

Weight type      Rank   WikiSQL   MultiNLI
Wq               8      70.4      91.0
Wv               8      73.0      91.0
Wq, Wv           4      73.7      91.3
Wq, Wk, Wv, Wo   2      73.7      91.7

The best setup: apply LoRA to all four matrices with rank 2. If you can afford only two, use query and value.

Empirically, applying LoRA to feedforward layers also helps (Databricks). Within a fixed memory budget, though, the attention matrices give the most benefit.

Rank. r between 4 and 64 usually suffices. Higher r doesn't always help and can overfit. Some workloads benefit from r=256.

α. Ratio α:r typically 1:8 to 8:1. Smaller r means larger α. Larger r means smaller α.

Serving LoRA Adapters

Two options:

  1. Merge A,B into W before serving. No extra latency. Best for one model.
  2. Keep W, A, B separate. Extra latency per inference, but storage savings are massive for multi-LoRA serving.

Storage example (4096×4096 W, rank 8, 100 customers):

  • Option 1: 100 × 16.8M = 1.68B params.
  • Option 2: 1 × 16.8M + 100 × 65,536 = 23.3M params.

Option 2 also enables fast task switching (load only A, B).
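
A sketch of both options (helper names are mine). Option 1 folds the adapter into W once; the storage arithmetic below reproduces the numbers above:

    import torch

    def merge_lora(W, A, B, alpha, r):
        # Option 1: fold the adapter into the base weight before serving.
        # W: (n, m), A: (n, r), B: (r, m). No extra inference latency afterward.
        return W + (alpha / r) * (A @ B)

    # Storage math: 4096 x 4096 W, rank 8, 100 customers.
    n = m = 4096; r = 8; customers = 100
    option1 = customers * n * m                      # 1,677,721,600 ~ 1.68B params
    option2 = n * m + customers * (n * r + r * m)    # 23,330,816 ~ 23.3M params
    print(option1, option2)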

Apple uses multi-LoRA on a 3B-param base model for different iPhone features, with quantization to keep it on-device.

Quantized LoRA

LoRA adapters are tiny. The big memory wins come from quantizing the base model:

Model           Weight memory (16-bit)   LoRA params (r=2)   Adapter (16-bit)
Llama 2 (13B)   26GB                     3.28M               6.55MB
GPT-3 (175B)    350GB                    18.87M              37.7MB

QLoRA (Dettmers et al., 2023) stores weights in 4-bit NF4, dequantizes to BF16 for the forward and backward pass, and adds paged optimizers that auto-swap GPU↔CPU. Result: finetune a 65B-param model on a single 48GB GPU.

Model         Size   Elo (May 2023)
GPT-4         -      1348
Guanaco 65B   41GB   1022
ChatGPT       -      966
Guanaco 13B   10GB   916
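
With Hugging Face transformers plus bitsandbytes, a QLoRA-style base-model load is commonly configured as below. This is a sketch: the model id is illustrative, and you would still attach LoRA adapters (e.g., via the peft library) on top:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # store weights as 4-bit NormalFloat
        bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 for fwd/bwd passes
        bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    )

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-13b-hf",            # illustrative model id
        quantization_config=bnb_config,
    )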

Tradeoff: NF4 dequantization adds overhead and can slow down training.

Other quantized LoRA variants: QA-LoRA, ModuLoRA, IR-QLoRA.

4.4 Model Merging and Multi-Task Finetuning

Model merging combines multiple models into one. Use cases:

  • Performance, by combining strengths.
  • Memory savings, since one model serves multiple tasks.
  • Multi-task finetuning. Finetune separately per task in parallel, then merge.
  • On-device deployment.
  • Federated learning. Devices train locally, merge centrally.

Merging parallel finetunes avoids the catastrophic forgetting that sequential finetuning can cause.

Ensembling vs. Merging

Ensembling combines outputs. Merging combines parameters. Ensembling has higher inference cost (multiple model calls).

Three Approaches

Summing

Linear combination:

Merge(A, B) = (w_A × A + w_B × B) / (w_A + w_B), where w_A and w_B are weighting coefficients.

Model soups (Wortsman et al., 2022) average finetuned models for accuracy gains with no inference cost.

Task vectors (also called delta parameters) are finetuned model minus base model. Enables task arithmetic: add to combine, subtract to remove unwanted capabilities.
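
Both ideas fit in a few lines over state dicts; a hedged sketch (function names are mine, and real merges require identical architectures):

    import torch

    def model_soup(state_dicts):
        # Uniform soup: average each weight tensor across finetuned models.
        return {k: torch.stack([sd[k] for sd in state_dicts]).mean(dim=0)
                for k in state_dicts[0]}

    def task_vector(finetuned, base):
        # Delta parameters: finetuned minus base.
        return {k: finetuned[k] - base[k] for k in base}

    def apply_task_vector(base, tv, coeff=1.0):
        # Task arithmetic: coeff > 0 adds a capability, coeff < 0 removes one.
        return {k: base[k] + coeff * tv[k] for k in base}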

Spherical Linear Interpolation (SLERP) interpolates along the shortest path on a sphere. A factor in [0,1] controls position.

SLERP is defined for two vectors at a time; to merge more than two models, apply it sequentially.
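
A sketch of SLERP over two flattened weight vectors (t is the factor in [0, 1]; falls back to linear interpolation when the vectors are nearly parallel):

    import torch

    def slerp(v0, v1, t, eps=1e-8):
        # Interpolate along the arc between v0 and v1 on a sphere.
        v0n = v0 / (v0.norm() + eps)
        v1n = v1 / (v1.norm() + eps)
        theta = torch.arccos(torch.clamp(v0n @ v1n, -1.0, 1.0))
        if theta.abs() < 1e-6:                  # nearly parallel: plain lerp
            return (1 - t) * v0 + t * v1
        s = torch.sin(theta)
        return (torch.sin((1 - t) * theta) / s) * v0 + (torch.sin(t * theta) / s) * v1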

Pruning redundant task-specific parameters. Most changes introduced by finetuning are minor and redundant.

TIES (Yadav et al., 2023) and DARE (Yu et al., 2023) prune task vectors before merging. Improves quality, especially when merging many models.

Layer Stacking (Frankenmerging / Passthrough)

Take different layers from different models. Often needs further finetuning.

  • Goliath-120B uses 72 of 80 layers from each of two finetuned Llama 2-70B (Xwin + Euryale).

Sparse Upcycling converts a pre-trained model into MoE by duplicating layers and adding a router.

Together AI's Mixture-of-Agents combined six weaker open source models to match GPT-4o on some benchmarks.

Model Upscaling

Depthwise scaling (Kim et al., 2023): make a copy, sum some layers, stack the rest, train.

  • SOLAR 10.7B: a 32-layer model grown to 48 layers (32 × 2 − 16 summed layers), yielding 10.7B params.

Concatenation

Stack adapters side by side; the merged adapter's rank is the sum of the component ranks. Generally not recommended: no memory savings.

4.5 Finetuning Tactics

Choosing a Base Model

Two paths:

Progression path:

  1. Test code with the cheapest, fastest model.
  2. Validate data with a mid-size model.
  3. Push performance with the best model.
  4. Map price/performance with all models.

Distillation path:

  1. Strongest model plus a small dataset.
  2. Use the finetuned model to generate more training data.
  3. Train a cheaper model on the new dataset.

Choosing a Method

  • LoRA before full finetuning.
  • Small dataset (hundreds of examples) means PEFT > full.
  • Multi-LoRA simplifies serving many task variants.

Frameworks

  • APIs are quick but limited (provider-allowed base models, fewer knobs).
  • Frameworks: LLaMA-Factory, unsloth, Hugging Face PEFT, Axolotl, LitGPT.
  • Distributed training: DeepSpeed, PyTorch Distributed, ColossalAI.

Key Hyperparameters

  • Learning rate. Typical range 1e-7 to 1e-3. Common pattern: pre-training final LR times a constant in [0.1, 1]. Watch the loss curve. Fluctuating = LR too high. Flat-and-slow = LR too low. Schedules vary LR over time.
  • Batch size. Too small (<8) is unstable. Larger is more stable and memory-bound. Gradient accumulation updates after N batches when memory limits batch size.
  • Number of epochs. Millions of examples means 1-2 epochs. Thousands means 4-10. Compare training and validation loss for over- and underfitting.
  • Prompt loss weight. For instruction finetuning, the fraction of loss from prompt tokens vs. response tokens. Default 10%, so the model learns mostly from responses.
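
Pulling the ranges above into one illustrative starting configuration (values are assumptions to tune against your own loss curves, not recommendations from the book):

    finetune_config = {
        "learning_rate": 2e-5,              # within the typical 1e-7 to 1e-3 range
        "lr_scheduler": "cosine",           # schedules vary the LR over time
        "batch_size": 32,                   # below 8 tends to be unstable
        "gradient_accumulation_steps": 4,   # effective batch = 32 * 4 when memory-bound
        "num_epochs": 4,                    # thousands of examples -> 4-10 epochs
        "prompt_loss_weight": 0.1,          # learn mostly from response tokens
    }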

Summary

  • Finetuning updates model weights. Use it for form and behavior, not for facts (which is RAG's job).
  • Start with prompting, then add examples, then add RAG, then finetune for residual behavioral problems.
  • Memory is the limiting factor, driven by # params, # trainable params, and numerical representation.
  • Quantization (PTQ, QAT, mixed precision) reduces bits-per-value. BitNet b1.58 demonstrates viable 1.58-bit LLMs.
  • PEFT reduces trainable params. LoRA decomposes weight matrices into low-rank A·B updates that merge back, with no inference latency penalty, big memory savings, and modular adapters for multi-task serving. QLoRA combines 4-bit NF4 and paged optimizers to finetune 65B models on a single 48GB GPU.
  • Model merging combines multiple models: summing (linear combination, SLERP, with optional pruning), layer stacking (frankenmerging, sparse upcycling, depthwise scaling), or concatenation. Enables multi-task finetuning, on-device deployment, and federated learning.
  • Hyperparameters matter: learning rate, batch size, number of epochs, prompt loss weight. Use the loss curves to diagnose.
