Hands-On Large Language Models by Jay Alammar & Maarten Grootendorst

Chapter 12: Fine-Tuning Generation Models

Introduction

The book closes with the most production-relevant skill: fine-tuning a generative LLM. The modern recipe has three steps. Pretraining (covered earlier) builds the base model. Supervised fine-tuning (SFT) teaches instruction-following. Preference tuning (RLHF, DPO, ORPO) aligns outputs with human preferences. We dig into PEFT methods (adapters, LoRA, QLoRA), walk through hands-on instruction tuning of TinyLlama with QLoRA, survey evaluation (perplexity, benchmarks, leaderboards, LLM-as-judge, Chatbot Arena), and finish with DPO preference tuning.


Section 1: The Three LLM Training Steps

Step 1 is pretraining (language modeling): predict the next token over massive unlabeled text. The result is a base or foundation model. It knows language but doesn't follow instructions well.

Step 2 is supervised fine-tuning (SFT). Same loss (next-token prediction), but conditioned on user instructions paired with desired outputs. The result is an instruction model or chat model.

Step 3 is preference tuning. Align the model to humans' preferred answers using reward signals. Methods include PPO, DPO, and ORPO. This step distills which answer is preferred when both are valid.

A base model, when prompted with a question, often continues with more questions instead of answering it.
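As a quick illustration (a sketch, not the book's demo; exact outputs will vary from run to run), compare the base checkpoint used later in this chapter with its chat-tuned sibling on the same question:

from transformers import pipeline

question = "What is the capital of France?"

# Base model: trained only on next-token prediction over raw text.
base = pipeline("text-generation",
                model="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")
print(base(question, max_new_tokens=40)[0]["generated_text"])
# Often rambles or produces more questions instead of an answer.

# Chat model: the same architecture after SFT and preference tuning.
chat = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
prompt = "<|user|>\n" + question + "</s>\n<|assistant|>\n"
print(chat(prompt, max_new_tokens=40)[0]["generated_text"])
# Typically gives a direct answer.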


Section 2: Supervised Fine-Tuning (SFT)

2.1 Full Fine-Tuning

Same idea as pretraining, but with smaller labeled instruction data and updates to all parameters.

Training data consists of (instruction, response) pairs.

The model still does next-token prediction, but now on the response portion conditioned on the instruction.
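A minimal sketch of how a single pair could become a training example. Masking the prompt tokens with -100 so the loss only covers the response is a common SFT convention, assumed here rather than taken from the chapter:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

instruction = "<|user|>\nExplain what a large language model is.</s>\n<|assistant|>\n"
response = "A large language model is a neural network trained to predict the next token.</s>"

prompt_ids = tokenizer(instruction, add_special_tokens=False)["input_ids"]
response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]

input_ids = prompt_ids + response_ids
# -100 tells the cross-entropy loss to ignore a position, so only response tokens are learned
labels = [-100] * len(prompt_ids) + response_ids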

The drawbacks of full fine-tuning are slow training, high cost, big memory footprint, and difficulty sharing variants.


Section 3: Parameter-Efficient Fine-Tuning (PEFT)

PEFT methods update a tiny fraction of parameters and keep the base model frozen.

3.1 Adapters

Insert small trainable modules after the attention layer and the feedforward layer in each Transformer block, and freeze everything else.

Each "adapter" is a collection of these small modules across all blocks. You can swap adapters in and out for different tasks (medical NER, sentiment, and so on) on the same frozen base:

The original 2019 adapters paper (Houlsby et al.) showed adapters reach within 0.4% of full fine-tuning performance on GLUE while updating only ~3.6% of BERT's parameters.
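A minimal PyTorch sketch of one such bottleneck module in the spirit of that paper (sizes and placement are illustrative):

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Down-project, nonlinearity, up-project, plus a residual connection."""
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# Inserted after the attention and feedforward sublayers of every block;
# only these adapter weights are trained, the rest of the model stays frozen.
adapter = Adapter(hidden_size=768)
out = adapter(torch.randn(1, 16, 768))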

3.2 LoRA — Low-Rank Adaptation

LoRA doesn't add new layers. It learns a low-rank delta to existing weight matrices.

A 10x10 weight matrix has 100 parameters. We can approximate it with two smaller matrices, say 10x2 and 2x10.

That's 40 parameters instead of 100, 2.5x fewer. For a real LLM the savings are far more dramatic:

GPT-3 has 12,288 × 12,288 weight matrices (~150M params per matrix) in each of its 96 Transformer blocks. LoRA at rank 2 needs only 12,288 × 2 + 2 × 12,288 ≈ 49K parameters per matrix; even at rank 8 it stays under 200K.

During training, only the small matrices update. After training, the low-rank delta is added back into the original weights.
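A toy numpy version of the 10x10 example (the alpha/rank scaling on the delta follows the LoRA paper's convention; the values themselves are made up):

import numpy as np

rank, alpha = 2, 4
W = np.random.randn(10, 10)           # frozen original weights: 100 parameters
B = np.random.randn(10, rank) * 0.1   # pretend these are the trained LoRA factors
A = np.random.randn(rank, 10) * 0.1
# B and A hold 10*2 + 2*10 = 40 trainable values; the delta B @ A has rank 2.

# During training only A and B receive gradients; the layer effectively computes
#   h = x @ (W + (alpha / rank) * (B @ A))
# After training the delta is merged once, so inference costs nothing extra:
W_merged = W + (alpha / rank) * (B @ A)
print(W_merged.shape)   # (10, 10), same shape as before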

Why does it work? Because LLMs have low intrinsic dimensionality. Even huge matrices are well-approximated by low-rank ones for fine-tuning purposes.

You also choose which layers to target. Common picks are the Q and V projection matrices in attention.

3.3 QLoRA — Quantized LoRA

LoRA fine-tunes small matrices, but the base model still sits in memory at full precision. QLoRA quantizes the base model to 4-bit, dropping VRAM dramatically.

Naive halving of precision (32-bit → 16-bit) loses information.

A direct mapping to 4-bit collapses many distinct weights into the same value.

QLoRA instead uses blockwise quantization, which applies local quantization scales per block to preserve more precision.

It also uses a distribution-aware scheme that exploits the fact that neural network weights are roughly normally distributed in [-1, 1].

The result is NF4 (4-bit normal float) with double quantization and paged optimizers, which fits a 33B model on a single 24GB GPU at training time.
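A toy sketch of blockwise absmax quantization to a 4-bit signed grid, just to show why per-block scales keep more detail than one global scale (real NF4 additionally uses a codebook shaped like a normal distribution):

import numpy as np

def quantize_blockwise(weights, block_size=64):
    """Quantize to the signed range -7..7 with one absmax scale per block."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)   # one scale per block
    q = np.round(blocks / scales * 7).astype(np.int8)
    return q, scales

def dequantize_blockwise(q, scales):
    return (q.astype(np.float32) / 7) * scales

w = np.random.randn(1024).astype(np.float32) * 0.02
q, scales = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, scales).reshape(-1)
print("max abs error:", np.abs(w - w_hat).max())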


Section 4: Instruction Tuning with QLoRA — Hands-On

We fine-tune TinyLlama-1.1B (a base model, not chat) on a subset of UltraChat to teach it to follow instructions.

4.1 Templating the Data

TinyLlama's chat template uses <|user|>, <|assistant|>, and </s>:

from transformers import AutoTokenizer
from datasets import load_dataset

template_tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

def format_prompt(example):
    chat = example["messages"]
    return {"text": template_tokenizer.apply_chat_template(chat, tokenize=False)}

dataset = (
    load_dataset("HuggingFaceH4/ultrachat_200k", split="test_sft")
    .shuffle(seed=42).select(range(3_000))
)
dataset = dataset.map(format_prompt)
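A quick sanity check that the template was applied (the exact conversation you see depends on which samples land in the 3,000-example subset):

# One templated conversation; the <|user|>/<|assistant|> markers should be visible
print(dataset[0]["text"][:500])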

4.2 4-bit Quantized Model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", quantization_config=bnb_config,
)
model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = "<PAD>"
tokenizer.padding_side = "left"

VRAM dropped from ~4GB to ~1GB just for loading.
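If you want to verify that yourself, transformers models expose get_memory_footprint() (the exact number varies a bit by version and hardware):

print(f"{model.get_memory_footprint() / 1e9:.2f} GB")   # roughly 1 GB for the 4-bit model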

4.3 LoRA Configuration

from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

peft_config = LoraConfig(
    lora_alpha=32,    # scaling for the LoRA delta
    lora_dropout=0.1,
    r=64,             # rank — typical 4–64; bigger = more capacity, less compression
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["k_proj", "gate_proj", "v_proj", "up_proj", "q_proj", "o_proj", "down_proj"],
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

Key knobs:

  • r: rank of the compressed matrices; bigger means more representational power, less compression
  • lora_alpha: strength of the LoRA delta (rule of thumb: 2 * r)
  • target_modules: which projection matrices to apply LoRA to
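To see how small the trainable fraction actually is, PEFT models expose print_trainable_parameters() (exact counts depend on the model and the target_modules list):

# Prints trainable parameters, total parameters, and the trainable percentage
model.print_trainable_parameters()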

4.4 Training

from transformers import TrainingArguments
from trl import SFTTrainer

training_arguments = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    num_train_epochs=1,
    logging_steps=10,
    fp16=True,
    gradient_checkpointing=True,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    max_seq_length=512,
    peft_config=peft_config,    # remove for full fine-tuning
)

trainer.train()
trainer.model.save_pretrained("TinyLlama-1.1B-qlora")

Drop quantization_config and peft_config to switch from QLoRA to full fine-tuning.

4.5 Merge Adapter Back

from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    "TinyLlama-1.1B-qlora",
    low_cpu_mem_usage=True,
    device_map="auto",
)
merged_model = model.merge_and_unload()

Test it:

from transformers import pipeline

prompt = """<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
"""

pipe = pipeline(task="text-generation", model=merged_model, tokenizer=tokenizer)
print(pipe(prompt)[0]["generated_text"])
# "Large Language Models (LLMs) are artificial intelligence (AI) models..."

The base model couldn't do this. TinyLlama now follows instructions.


Section 5: Evaluating Generative Models

There's no single right metric. The chapter walks through five categories.

5.1 Word-Level Metrics

Perplexity asks, given an input, how surprised is the model by the next token? Lower perplexity means the model assigns higher probability to the text it sees, i.e. it is less surprised.

Other word-level metrics include ROUGE (summarization), BLEU (translation), and BERTScore (semantic similarity). All have the same flaw: they don't measure correctness, fluency, or creativity.
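A minimal sketch of computing perplexity as the exponential of the average next-token loss (model and text are only for illustration):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name)

text = "Large language models are trained to predict the next token."
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy over next-token predictions
    loss = lm(**inputs, labels=inputs["input_ids"]).loss

print("perplexity:", torch.exp(loss).item())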

5.2 Public Benchmarks

  • MMLU: 57 tasks spanning classification, QA, and sentiment
  • GLUE: a broad range of language-understanding tasks
  • TruthfulQA: truthfulness of generated text
  • GSM8K: grade-school math word problems
  • HellaSwag: common-sense inference (multiple choice)
  • HumanEval: code-generation correctness (164 problems)

Caveats: models can overfit to public benchmarks, benchmarks are broad and may miss your use case, and running them takes hours.

5.3 Leaderboards

The Open LLM Leaderboard aggregates HellaSwag, MMLU, TruthfulQA, GSM8K, and others. Useful but susceptible to leaderboard-overfitting.

5.4 LLM-as-a-Judge

Have one LLM score another's output. In pairwise comparison, Model A and Model B answer the same prompt and a third LLM picks the better response. Good for open-ended evaluation. As LLMs improve, so does the judge.
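A sketch of what a pairwise judging prompt might look like (the wording is illustrative, not a template from the book):

judge_prompt = """You are an impartial judge. Given a user question and two answers,
decide which answer is more helpful, accurate, and clear.

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Reply with exactly one letter: A or B."""

# Fill in the placeholders with judge_prompt.format(...) and send it to a strong LLM.
# Swapping the order of A and B and averaging the verdicts helps control for position bias.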

5.5 Human Evaluation — The Gold Standard

Chatbot Arena lets the community vote on anonymous head-to-head LLM responses. Wins and losses feed an Elo rating system (the chess approach). 800K+ human votes.
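For reference, the standard Elo update that this kind of rating is built on (the generic chess formula, not Chatbot Arena's exact pipeline):

def elo_update(rating_a, rating_b, a_won, k=32.0):
    """Update two Elo ratings after a single head-to-head comparison."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

print(elo_update(1200, 1200, a_won=True))   # the winner gains what the loser drops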

Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." Don't optimize purely for one benchmark — you'll find a model that aces it but is bad at everything else. The authors recommend: evaluate for your intended use case, and trust your own qualitative tests.


Section 6: Preference Tuning / Alignment / RLHF

After SFT, the model follows instructions but might not give high-quality answers. Preference tuning aligns it to humans' preferred answers.

A human (or model) scores the generation.

The model is then updated to encourage high-scored generations and discourage low-scored ones.

6.1 Reward Models — Automating Evaluation

Asking a human for every example is expensive. Train a reward model instead:

Take a copy of the instruction-tuned LLM and replace its language modeling head with a scalar quality head.

Given a prompt plus a generation, it outputs a single score.

6.2 Training the Reward Model

Preference datasets are (prompt, chosen, rejected) triples. Both candidates may be reasonable, but one is preferred.

The easy way to collect them is to have the LLM generate two answers per prompt and have humans pick the better one.

The reward model is trained so that score(chosen) > score(rejected).
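In code, that objective is usually the pairwise ranking loss from the InstructGPT line of work; a schematic version, with the scores standing in for the reward model's scalar-head outputs:

import torch
import torch.nn.functional as F

# Scalar scores the reward model assigns to each completion (illustrative values)
score_chosen = torch.tensor([1.3, 0.2, 2.1])
score_rejected = torch.tensor([0.4, 0.5, 1.9])

# Pairwise ranking loss: pushes score(chosen) above score(rejected)
loss = -F.logsigmoid(score_chosen - score_rejected).mean()
print(loss)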

6.3 Three Stages of Preference Tuning

The full pipeline has three stages: collect preference data, train a reward model on it, then fine-tune the LLM against the reward signal. You can also stack multiple reward models; Llama 2 trained two, one for helpfulness and one for safety.

6.4 PPO

Proximal Policy Optimization is the classic RL fine-tuner, the algorithm behind the original ChatGPT. It updates the LLM to maximize reward without drifting too far from the reference (instruction-tuned) model.

PPO trains two models simultaneously (LLM plus reward model) and is complex.

6.5 DPO — No Reward Model

Direct Preference Optimization sidesteps the reward model entirely. Use the LLM itself as the reference, comparing log-probabilities of chosen versus rejected generations between a frozen reference model and the trainable model.

DPO is more stable and more accurate than PPO in practice. It's the chapter's choice.
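Schematically, the DPO loss rewards the trainable model for raising the probability of the chosen completion, relative to the frozen reference, more than it raises the rejected one. A sketch using per-sequence summed log-probabilities (illustrative numbers; beta plays the same role as the beta passed to DPOTrainer in the next section):

import torch
import torch.nn.functional as F

beta = 0.1

# Summed log-probabilities of each full completion under the trainable (policy)
# and frozen (reference) models; illustrative values
policy_chosen, policy_rejected = torch.tensor([-42.0]), torch.tensor([-55.0])
ref_chosen, ref_rejected = torch.tensor([-44.0]), torch.tensor([-50.0])

# Implicit "rewards" are log-probability shifts relative to the reference
chosen_reward = beta * (policy_chosen - ref_chosen)
rejected_reward = beta * (policy_rejected - ref_rejected)

loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()
print(loss)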


Section 7: Preference Tuning with DPO — Hands-On

7.1 Format the Data

from datasets import load_dataset

def format_prompt(example):
    system = "<|system|>\n" + example["system"] + "</s>\n"
    prompt = "<|user|>\n" + example["input"] + "</s>\n<|assistant|>\n"
    chosen = example["chosen"] + "</s>\n"
    rejected = example["rejected"] + "</s>\n"
    return {"prompt": system + prompt, "chosen": chosen, "rejected": rejected}

dpo_dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")
dpo_dataset = dpo_dataset.filter(
    lambda r: r["status"] != "tie" and r["chosen_score"] >= 8 and not r["in_gsm8k_train"]
)
dpo_dataset = dpo_dataset.map(format_prompt, remove_columns=dpo_dataset.column_names)

7.2 Load + Quantize + Apply LoRA

from peft import AutoPeftModelForCausalLM, LoraConfig, prepare_model_for_kbit_training, get_peft_model
from transformers import BitsAndBytesConfig, AutoTokenizer

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16", bnb_4bit_use_double_quant=True,
)

# Start from the SFT-tuned model from section 4
model = AutoPeftModelForCausalLM.from_pretrained(
    "TinyLlama-1.1B-qlora", low_cpu_mem_usage=True,
    device_map="auto", quantization_config=bnb_config,
)
merged_model = model.merge_and_unload()

# The base model's tokenizer (what model_name pointed to in Section 4)
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T", trust_remote_code=True)
tokenizer.pad_token = "<PAD>"
tokenizer.padding_side = "left"

peft_config = LoraConfig(
    lora_alpha=32, lora_dropout=0.1, r=64, bias="none", task_type="CAUSAL_LM",
    target_modules=["k_proj", "gate_proj", "v_proj", "up_proj", "q_proj", "o_proj", "down_proj"],
)
model = prepare_model_for_kbit_training(merged_model)
model = get_peft_model(model, peft_config)

7.3 Train with DPO

from trl import DPOConfig, DPOTrainer

training_arguments = DPOConfig(
    output_dir="./results",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    max_steps=200,
    logging_steps=10,
    fp16=True,
    gradient_checkpointing=True,
    warmup_ratio=0.1,    # ramp LR to target during first 10% of steps
)

dpo_trainer = DPOTrainer(
    model, args=training_arguments,
    train_dataset=dpo_dataset, tokenizer=tokenizer,
    peft_config=peft_config, beta=0.1,
    max_prompt_length=512, max_length=512,
)
dpo_trainer.train()
dpo_trainer.model.save_pretrained("TinyLlama-1.1B-dpo-qlora")

7.4 Stack Adapters

from peft import PeftModel

# First merge SFT adapter into base
model = AutoPeftModelForCausalLM.from_pretrained(
    "TinyLlama-1.1B-qlora", low_cpu_mem_usage=True, device_map="auto",
)
sft_model = model.merge_and_unload()

# Then layer DPO adapter on top and merge
dpo_model = PeftModel.from_pretrained(sft_model, "TinyLlama-1.1B-dpo-qlora", device_map="auto")
dpo_model = dpo_model.merge_and_unload()

ORPO (Odds Ratio Preference Optimization) is a newer method that combines SFT + DPO into a single training loop — simpler than two-stage SFT-then-DPO and works with QLoRA.
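Recent trl releases ship an ORPOTrainer if you want to try that route; a rough sketch only, since argument names shift between trl versions:

from trl import ORPOConfig, ORPOTrainer

orpo_args = ORPOConfig(
    output_dir="./results",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=8e-6,
    max_steps=200,
)

# Trains directly on (prompt, chosen, rejected) data; no separate SFT stage needed
orpo_trainer = ORPOTrainer(
    model=model,
    args=orpo_args,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
orpo_trainer.train()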


Summary

  • The modern recipe is pretraining → SFT (instruction tuning) → preference tuning. Each stage uses next-token prediction as the underlying objective but with different data and conditioning.
  • PEFT methods cut fine-tuning cost dramatically. Adapters insert small trainable modules between Transformer sublayers. LoRA approximates weight-update matrices with low-rank factors and merges them back. QLoRA quantizes the base model to 4-bit (NF4 plus blockwise plus double quantization) so fine-tuning fits on consumer GPUs.
  • TinyLlama plus QLoRA worked example: 4-bit quantize the base model, attach LoRA with r=64, lora_alpha=32, train with SFTTrainer, merge adapter back. Loaded in ~1GB of VRAM versus ~4GB unquantized.
  • Evaluating generative models is messy. Use a mix: word-level metrics like perplexity, public benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA), leaderboards, LLM-as-a-judge, and human evaluation (Chatbot Arena's Elo system). Goodhart's Law warns against single-metric optimization.
  • Preference tuning aligns model behavior with what humans prefer. Three stages: collect (prompt, chosen, rejected) data, train a reward model, fine-tune the LLM against the reward signal.
  • PPO is the classic RL approach (ChatGPT used it) but trains two models and is unstable. DPO uses the LLM as its own reference, comparing log-prob shifts on chosen versus rejected. Simpler and more stable.
  • A full pipeline: pretrained TinyLlama → SFT with QLoRA → DPO with QLoRA → merge adapters in sequence. ORPO simplifies this into a single training loop.