Chapter 1: Introduction
GenAI prototypes are easy. Production GenAI is hard. The model hallucinates, gives different answers to identical prompts, and has odd gaps that trace back to how it was trained. The book's premise is that the recurring fixes for these problems have crystallized into 32 reusable design patterns, and the rest of the chapters cover those patterns. Chapter 1 is a vocabulary chapter (prompts, agents, logits, sampling, in-context learning, post-training) so the rest of the book has somewhere to land.
GenAI Design Patterns
The "design patterns" framing comes from architect Christopher Alexander's A Pattern Language (1977) and was carried into software by the Gang of Four book Design Patterns: Elements of Reusable Object-Oriented Software. A pattern is just a recurring problem with a known good solution and a name people agree on.
In the GenAI world the recurring problem is that nobody trains a model from scratch anymore. You start with a foundational model (GPT-4, Gemini, Claude, Llama, DeepSeek, Qwen, Mistral) that was trained on a generic dataset, and you build your application on top of it. Chip Huyen calls this AI engineering in her book of the same name, and the people who do it are AI engineers.
The kinds of problems you run into when you do this are predictable. The model's output style doesn't match what you want. It doesn't know your enterprise data. It can't do the thing you need it to do. Each pattern in the book is laid out the same way: a problem statement, a solution, an end-to-end example, a considerations section about alternatives and trade-offs, and references. Many of the patterns boil down to "decompose the task and hand pieces of it to agents", software components that drive a foundational model to plan, call tools, recover from errors, and check their own work. Applications built by stitching agents together are called agentic.
Building on Foundational Models
You almost always reach a foundational model through an API. Either the vendor's own API, or an LLM-agnostic framework that lets you point at any vendor.
Prompt and Context
A prompt goes in, a response comes out. Either side can be multimodal: text, images, video, audio.
The simplest prompt is just an instruction:
Create a pencil sketch in the style of Degas depicting a family of four playing a board game
Add context, domain background or a role, and the model's behavior changes:
You are an expert marketer who is very familiar with the book market in university
towns in Germany.
The Covenant of Water is a novel that tells the story of three generations of an
Orthodox Saint Thomas Christian family in Kerala.
Write a one-paragraph blurb introducing the book to readers at a bookstore in
Göttingen, drawing local connections.
Using the Model Provider's API
Python, Go, Java, and TypeScript all have client libraries. The book sticks to Python.
import anthropic

client = anthropic.Anthropic(
    # defaults to os.environ.get("ANTHROPIC_API_KEY")
    api_key="YOUR_ANTHROPIC_API_KEY",
)
completion = client.messages.create(
    model="claude-3-7-sonnet-latest",
    max_tokens=1024,  # required by the Messages API
    system="You are an expert Python programmer.",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Write code to find the median value of a list of integers."}
            ]
        }
    ]
)
print(completion.content[0].text)
The thing to notice is that the prompt has two pieces. The system prompt is set by the developer and shapes overall behavior. The user prompt is dynamic and carries the actual task. This split shows up everywhere later in the book.
Using an LLM-Agnostic Framework
Frameworks like PydanticAI, LangChain, DSPy, and Hugging Face wrap the vendor APIs so you can swap providers with a string change.
from pydantic_ai import Agent

agent = Agent('anthropic:claude-3-7-sonnet-latest',
              system_prompt="You are an expert Python programmer.")
result = agent.run_sync(
    "Write code to find the median value of a list of integers.")
print(result.data)
Switching to OpenAI, Google, or Groq is one identifier change: openai:gpt-4o-mini, google-vertex:gemini-2.0-flash, groq:llama3-70b-8192.
Running Your Model Locally
For an open-weights model like Llama 3, Ollama is the path of least resistance.
ollama run llama3.2
Ollama exposes the model behind the OpenAI API, so you can point any OpenAI-compatible client at localhost:
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider

model = OpenAIModel(
    model_name='llama3.2',
    provider=OpenAIProvider(base_url='http://localhost:11434/v1')
)
How Foundational Models Are Created
You're a consumer of trained models, not a producer of them. But knowing the stages helps you read the rest of the book.
DeepSeek is the example because they published more detail than anyone else. Pretraining comes first: train a base LLM on a huge corpus to predict the next token. DeepSeek used 14.8 trillion tokens. Shakespeare's complete works are about 1.2M tokens, so the dataset is "12 million Shakespeares" worth of text. Modern LLMs work on tokens (short character sequences), not whole words, which is how they handle proper names and out-of-vocabulary terms. This is also why people sometimes shorthand LLMs as "next-token predictors", but that's only the first stage.
Supervised fine-tuning (SFT) comes next. The model is trained on (prompt, ideal response) pairs written by experts. Cohere uses physicians, financial analysts, and accountants. DeepSeek-V3 came out of this stage as a mixture of experts (MoE) model: 671B total parameters but only 37B activated per token.
The third stage is reinforcement learning on human preferences. Show raters two outputs, ask which they prefer, train the model toward the preferred one. That's RLHF, also called preference optimization.
DeepSeek-R1 sits on top of V3 and went through five more stages. A cold start fine-tuning round on a few thousand data points, pure RL without any preliminary SFT (which produced DeepSeek-R1-Zero and showed that reasoning ability can be incentivized purely through RL), rejection sampling to keep the best examples from the RL run, another SFT round mixing synthetic and supervised data on writing and factual QA, and a final RL pass across diverse prompts. To make R1 cheaper to run, the team also released distilled versions on top of Qwen and Llama at 1.5B, 7B, 8B, and 14B parameters.
The Landscape of Foundational Models
Academic benchmarks saturate quickly and can be gamed. The most accepted ranking is pairwise blind tests on LMArena.
Both axes on the leaderboard are logarithmic: Elo rating on the y-axis and cost on the x-axis. A 400-point Elo gap corresponds to about 10:1 odds that the higher-ranked model wins a pairwise comparison.
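The 10:1 figure falls straight out of the standard Elo expected-score formula; a quick numerical check (this is generic Elo math, not anything specific to LMArena):

```python
# Expected win probability under the Elo model for a rating gap d.
def elo_win_prob(d: float) -> float:
    return 1.0 / (1.0 + 10 ** (-d / 400))

p = elo_win_prob(400)
print(round(p, 3))           # 0.909
print(round(p / (1 - p)))    # odds ratio: 10
```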
| Category | Examples | Key Traits |
|---|---|---|
| Frontier | GPT-5, Gemini 2.5 Pro, Claude Opus | Highest performance, costly, can't run locally; reasoning + multimodal + context up to 2M tokens |
| Distilled frontier | Gemini Flash, Claude Sonnet, GPT-4o-mini | ~20× cheaper than full frontier; great for high-volume tasks |
| Open-weight | Llama, Mistral, DeepSeek, Qwen, Falcon | Public weights → transparency, custom fine-tuning; require hosting expertise (Together.ai, hyperscaler endpoints help) |
| Locally hostable | Llama 8B, Gemma 2B (NVIDIA NIM) | Run on consumer hardware; complete privacy, no per-call cost; reduced capability |
For air-gapped enterprise deployments, frontier vendors offer on-premises options: Gemini on Google Distributed Cloud, OpenAI on Azure on-prem.
Agentic AI
In computer science an agent is a software entity that acts on a user's behalf. PydanticAI's Agent class wraps an LLM with a role and instructions, and that's enough to call the LLM your agent.
agent = Agent(
    f"anthropic:{MODEL_ID}",
    system_prompt="You are an inventory manager who orders just in time.",
    ...
)
Autonomy
The thing that distinguishes an AI agent from any other software entity is autonomy. You give the agent a goal and let the LLM, the agent's "brain", figure out the rest.
The inventory example makes this concrete:
from dataclasses import dataclass

@dataclass
class InventoryItem:
    name: str
    quantity_on_hand: int
    weekly_quantity_sold_past_n_weeks: list[int]
    weeks_to_deliver: int

items = [
    InventoryItem("itemA", 300, [50, 70, 80, 100], 2),
    InventoryItem("itemB", 100, [70, 80, 90, 70], 2),
    InventoryItem("itemC", 200, [80, 70, 90, 80], 1)
]
result = agent.run_sync(f"""
Identify which of these items need to be reordered this week.

**Items**
{items}
""")
The agent comes back with something like:
itemB
quantity_to_order=300 reason_to_reorder='Current stock (100) is insufficient
to cover projected demand over delivery time...'
You never wrote the reorder logic. That's the whole point. Conventional software needs the rules spelled out. Agents work out the rules from the goal.
Characteristics of Agents
Beyond autonomy, the rest of the agent vocabulary is about how that autonomy plays out. Goal orientation means the agent is working toward an objective ("just in time" in the inventory example), not just reacting to the prompt. Planning and reasoning is the agent breaking the work into steps on its own: weekly sales range, forecast over delivery window, required inventory, reorder count. Perception and action is how the agent reaches outside itself, pulling data from databases or web searches and acting back on the world by, say, placing a reorder via API. That last bit is Tool Calling (Pattern 21, Chapter 7). And adaptability and learning is the agent checking and correcting its own output, which shows up as Reflection (Pattern 18, Chapter 6) and Self-Check (Pattern 31, Chapter 9).
Right now, fully agentic behavior is more aspiration than reality. Nondeterminism, hallucinations, and other failure modes get in the way. A lot of the patterns in this book exist precisely because the gap between "agent in theory" and "agent in production" is wide.
Fine-Grained Control
Foundational models expose a handful of low-level knobs that change generation behavior. Knowing what they do can save you a pattern.
Logits
The model's last layer doesn't pick a token, it produces a distribution over the vocabulary. The raw scores are called logits, unnormalized real-valued numbers that say how much the model "wants" each token.
If you always picked the highest-logit token, called greedy sampling, the output would be repetitive and dull. Real models sample stochastically. Logits become probabilities through softmax:
P(token_i) = e^(logit_i) / Σ_j e^(logit_j)
Softmax accentuates peaks and dampens tails, but only when the input is already peaked. A flat distribution stays mostly flat after softmax.
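Both behaviors are easy to verify numerically. A minimal softmax over toy logit vectors (the logit values here are made up for illustration):

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability;
    # this doesn't change the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A peaked logit vector: softmax sharpens the peak further.
peaked = softmax([5.0, 1.0, 0.0])
print([round(p, 3) for p in peaked])   # [0.976, 0.018, 0.007]

# A nearly flat logit vector stays nearly flat.
flat = softmax([1.0, 0.9, 1.1])
print([round(p, 3) for p in flat])     # [0.332, 0.301, 0.367]
```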
Editing logits before softmax is the basis for Logits Masking (Pattern 1, Chapter 2).
Temperature
Temperature T is a scalar that scales logits before softmax:
P(token_i) = e^(logit_i / T) / Σ_j e^(logit_j / T)
T = 0 is greedy sampling. Higher T flattens the distribution and gives you more varied, more creative output. As with softmax, the effect is muted on already-flat distributions.
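The flattening effect is easy to see by dividing the same logits by different temperatures before applying softmax (a sketch using the standard formula; the logit values are made up):

```python
import math

def softmax_with_temperature(logits, T):
    # Scale logits by 1/T, then apply a numerically stable softmax.
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 1.0, 0.0]
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
# Low T concentrates mass on the top token (approaching greedy);
# high T spreads mass across the alternatives.
```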
PydanticAI:
agent = Agent('anthropic:claude-3-7-sonnet-latest',
              model_settings={"temperature": 0.5},
              system_prompt="Complete the sentence.")
Direct Anthropic API:
completion = client.messages.create(
    model="claude-3-7-sonnet-latest",
    system="Complete the sentence.",
    temperature=0.5,
    messages=[...]
)
If you complete "The trade war caused" at low, medium, and high temperatures, the higher T runs visibly bring in more topics and more varied phrasing. RAG and LLM-as-Judge (Chapters 3 and 6) usually want T = 0.
Top-K Sampling
Top-K caps token selection at the k highest-probability tokens, lopping off the long tail. Useful when you crank temperature up and want to keep the model from going completely off the rails.
Continuing "The spaceship" with K=1 tracks closely to common sci-fi phrasing. K=10 is similar. K=100 starts surfacing more varied vocabulary.
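Mechanically, top-K drops everything outside the k most likely tokens and renormalizes before sampling. A sketch, assuming you already have a probability distribution over the vocabulary (the probabilities below are made up):

```python
import random

def top_k_sample(probs, k, rng=random):
    # Keep the indices of the k highest-probability tokens.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize the surviving probabilities so they sum to 1.
    total = sum(probs[i] for i in top)
    renormed = [(i, probs[i] / total) for i in top]
    # Sample from the renormalized distribution.
    r = rng.random()
    cum = 0.0
    for i, p in renormed:
        cum += p
        if r <= cum:
            return i
    return renormed[-1][0]

probs = [0.5, 0.2, 0.15, 0.1, 0.05]
# With k=1 this degenerates to greedy sampling: always token 0.
print(top_k_sample(probs, k=1))  # 0
```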
Nucleus Sampling (Top-P)
Nucleus or top-P sampling picks the smallest set of tokens whose cumulative probability crosses a threshold p.
The advantage over top-K is that the cutoff adapts to the model's confidence. Confident model, few tokens considered. Uncertain model, many tokens considered. Output reads more naturally than a fixed top-K.
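The adaptive cutoff is visible in a few lines: select tokens in descending probability order until their cumulative mass reaches p (the two distributions below are made up to contrast a confident and an uncertain model):

```python
def nucleus_candidates(probs, p):
    # Smallest set of token indices, taken in descending probability
    # order, whose cumulative probability reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cum = [], 0.0
    for i in order:
        chosen.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return chosen

confident = [0.9, 0.05, 0.03, 0.02]
uncertain = [0.3, 0.25, 0.2, 0.15, 0.1]
print(nucleus_candidates(confident, 0.9))  # [0]: one token suffices
print(nucleus_candidates(uncertain, 0.9))  # [0, 1, 2, 3]: many tokens needed
```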
Beam Search
Sampling methods commit to one token at a time. Beam search instead is a deterministic search that keeps multiple candidate sequences alive in parallel and looks for the most likely overall sequence.
Around beam search there's a small zoo of penalties. Frequency penalty rises as a word repeats. Presence penalty kicks in once when a word recurs and pushes the model toward vocabulary diversity. Minimum and maximum length penalties force the output above or below a length. Length normalization penalty divides scores by some function of length so the search doesn't bias toward short sequences. Beam search width is the number of parallel candidates you keep.
OpenAI and Gemini support repetition penalties; Anthropic doesn't. Hugging Face Transformers supports length penalties and beam width, but hosted models generally don't expose them. Beam search shows up again in patterns in Chapters 2 and 6.
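A toy beam search makes the "most likely overall sequence" point concrete. The transition table below stands in for a real model's next-token distribution and is entirely made up:

```python
import math

# Toy next-token model: maps a prefix (tuple of tokens) to {token: probability}.
TRANSITIONS = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.5, "dog": 0.5},
    ("a",): {"cat": 0.9, "dog": 0.1},
    ("the", "cat"): {"<eos>": 1.0},
    ("the", "dog"): {"<eos>": 1.0},
    ("a", "cat"): {"<eos>": 1.0},
    ("a", "dog"): {"<eos>": 1.0},
}

def beam_search(width=2, max_len=3):
    # Each beam entry is (cumulative log-probability, token tuple).
    beams = [(0.0, ())]
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq and seq[-1] == "<eos>":
                candidates.append((logp, seq))  # finished sequence survives as-is
                continue
            for tok, p in TRANSITIONS[seq].items():
                candidates.append((logp + math.log(p), seq + (tok,)))
        # Keep only the `width` most likely sequences.
        beams = sorted(candidates, reverse=True)[:width]
    return beams[0]

logp, seq = beam_search(width=2)
print(seq)  # ('a', 'cat', '<eos>')
```

Greedy decoding would commit to "the" (0.6) and end at sequence probability 0.3; keeping two beams recovers "a cat", which is more likely overall (0.4 × 0.9 = 0.36).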
In-Context Learning
In traditional ML you adapt a model by retraining it. With LLMs you can adapt the model just by changing the prompt. No weight updates. This is in-context learning, and it's the property that makes a lot of the rest of the book possible.
Zero-Shot Learning
Zero-shot is the model doing a task with no examples, relying on what it learned during pretraining:
Analyze the use of light in Claude Monet's Impression, Sunrise and explain how
it exemplifies impressionist techniques.
You get a coherent essay back even though the model has never seen this exact prompt.
Few-Shot Learning
Few-shot drops a small number of examples into the prompt. The examples set the structure and the expected output format. Because the examples live in the prompt context, this is also called in-context learning, and it's the foundation of context engineering.
agent = Agent(MODEL_ID,
              system_prompt="""You are an expert on art history. I will describe
              a painting. You should identify it.
              """)
result = agent.run_sync("""
Example:
Description: shows two small rowboats in the foreground and a red Sun.
Answer:
Painting: Impression, Sunrise
Artist: Claude Monet
Year: 1872
Significance: Gave the Impressionist movement its name; captured the fleeting
effects of light and atmosphere, with loose brushstrokes.

Description: The painting shows a group of people eating at a table under an
outside tent. The men are wearing boating hats.
""")
The model returns a similarly structured answer for Luncheon of the Boating Party by Renoir.
When In-Context Learning Falls Short
In-context learning only works when the base model already has the knowledge and capability you need. A pile of examples eats context-window tokens and slows inference. And LLMs can fail to generalize complex problems from a handful of examples. When any of those bite, post-training is the next lever.
Post-Training
Post-training changes the weights of a pretrained foundational model to adapt it to a new task or domain. The post-trained model lives at a separate endpoint from the base.
Post-Training Methods
The methods overlap and you can mix them.
Continued pretraining (CPT) is just more pretraining. You keep training the base model on new text, like industry jargon. You need full weights and the architecture, it's expensive, and you'll have to redo the SFT and RL stages on top. Bloomberg did this on financial data in March 2023. Within months a regular foundational model outperformed their domain model. Almost nobody has gone this route since.
Supervised fine-tuning (SFT) trains on (prompt, response) pairs. With instruction-style prompts ("improve this management plan…") and varied data you get better instruction-following. Single-task SFT often makes the model forget prior tasks. Multi-task SFT can generalize.
Parameter-efficient fine-tuning (PeFT) is what makes fine-tuning practical at frontier-model scale. LoRA (Low-Rank Adaptation) freezes the original weights and trains small low-rank "adapter" matrices on top. The math is brutal in your favor: LoRA cuts trainable parameters by up to 10,000× and GPU memory by up to 3×, adds no inference latency, and often matches full fine-tuning. QLoRA quantizes all the weights, slower training but smaller and faster inference.
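The parameter arithmetic behind LoRA is easy to verify. For one d×d weight matrix W, LoRA freezes W and trains two small matrices B (d×r) and A (r×d), using W + BA at inference. The sizes below are hypothetical but typical:

```python
d, r = 4096, 16           # hypothetical hidden size and LoRA rank

full = d * d              # trainable parameters if you fine-tune the matrix itself
lora = d * r + r * d      # trainable parameters in the two adapter matrices

print(full, lora, full // lora)  # 16777216 131072 128
```

That is a 128× reduction for a single matrix at rank 16; the up-to-10,000× figure in the LoRA paper comes from adapting only a few matrices of a 175B-parameter model.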
Preference tuning gives the model two outputs and a preference label. Human labels plus RL give you RLHF. DPO (Direct Preference Optimization) is more efficient than RLHF. DeepSeek introduced GRPO (Group Relative Policy Optimization), which scores multiple responses relative to the group's average reward.
What kind of post-training you can do is dictated by the data you have:
| Dataset structure | Post-training method |
|---|---|
| Pure text completions | Unsupervised CPT (new vocabulary/associations) |
| Input → ideal output pairs | SFT / instruction tuning |
| Two responses + preference label | Preference tuning (RLHF/DPO/GRPO) |
Patterns built on post-training include Content Optimization (5), Adapter Tuning (15), and Prompt Optimization (20). As of June 2025 only open-weight models support all forms; for hosted models, check the provider's docs.
Fine-Tuning a Frontier Model
OpenAI, Anthropic, AWS, and Google Cloud all have streamlined SFT on frontier models. Upload (input, output) pairs as JSON-line, kick off a job, get back an endpoint for an adapter-tuned model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the JSON-lines training data
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune"
)
# Start the fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo"
)
# Poll for completion
job_status = client.fine_tuning.jobs.retrieve(job.id)
if job_status.status == 'succeeded':
    print(f"Model ID: {job_status.fine_tuned_model}")
The resulting model ID looks like ft:<BASE MODEL>-0125:<ORG_NAME>::<JOB ID>, and you call it like the base model:
completion = client.chat.completions.create(
    model=job_status.fine_tuned_model,
    messages=messages
)
You'll typically need at least 100 pairs. A few thousand is better.
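Each line of the JSONL file is one training example. For OpenAI chat-model fine-tuning, a line holds a single JSON object with a messages list; the two examples below are illustrative stand-ins for real training data:

```python
import json

examples = [
    {"messages": [
        {"role": "system", "content": "You are an expert Python programmer."},
        {"role": "user", "content": "Reverse a string."},
        {"role": "assistant", "content": "s[::-1]"},
    ]},
    {"messages": [
        {"role": "system", "content": "You are an expert Python programmer."},
        {"role": "user", "content": "Sum a list of integers."},
        {"role": "assistant", "content": "sum(values)"},
    ]},
]

# One JSON object per line: the JSON-lines format the upload expects.
with open("training_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```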
Fine-Tuning an Open-Weight Model
Unsloth.ai wraps the messy parts of fine-tuning Gemma, Llama, and similar open-weight LLMs locally or via a managed service.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer

max_seq_length = 2048

# Load a 4-bit quantized Llama 3.1 8B base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None,
)
# Attach LoRA adapters to the attention and projection layers
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj",
                    "o_proj", "gate_proj"],
    use_gradient_checkpointing="unsloth"
)

dataset = load_dataset("...", split="train")
dataset = dataset.map(apply_template, batched=True)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
)
trainer.train()

model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
model.push_to_hub_merged("...", tokenizer, save_method="merged_16bit")
r=16 is the LoRA rank. Adapters land on the attention (Q, K, V) and projection layers.
Considerations for Fine-Tuning
Fine-tuning gives you a model tuned to your domain at the cost of a lot of operational overhead. Before pulling the trigger, weigh the costs.
The data requirement is real. You need 100+ samples, often thousands, collected ahead of time. If you don't have them yet, run with in-context learning, gather data in production, then fine-tune later.

Catastrophic forgetting is the failure mode where overfitting on new examples wipes out the model's broad world knowledge. The mitigations are small datasets, few epochs, and a learning rate around 1e-5 (where pretraining stopped).

On top of that, every fine-tuned model has to be evaluated for bias and regressions before it ships, redone every time the base model updates, and tracked carefully for training and validation lineage. Cost-wise, hosted providers charge a higher per-token rate on fine-tuned models because of the hosting overhead. For open-weight models you save on inference and pay GPU costs (a few dollars to a few hundred) during the fine-tuning itself.
The Organization of the Rest of the Book
The remaining nine chapters cover 32 design patterns:
| Chapter | Theme | Patterns |
|---|---|---|
| 2 | Controlling Content Style | Logits Masking (1), Grammar (2), Style Transfer (3), Reverse Neutralization (4), Content Optimization (5) |
| 3 | Adding Knowledge: Bass | Basic RAG (6), Semantic Indexing (7), Indexing at Scale (8) |
| 4 | Adding Knowledge: Syncopation | Index-Aware Retrieval (9), Node Postprocessing (10), Trustworthy Generation (11), Deep Search (12) |
| 5 | Extending Model Capabilities | Chain of Thought (13), Tree of Thoughts (14), Adapter Tuning (15), Evol-Instruct (16) |
| 6 | Improving Reliability | LLM-as-Judge (17), Reflection (18), Dependency Injection (19), Prompt Optimization (20) |
| 7 | Enabling Agents to Take Action | Tool Calling (21), Code Execution (22), Multiagent Collaboration (23) |
| 8 | Addressing Constraints | Small Language Model (24), Prompt Caching (25), Inference Optimization (26), Degradation Testing (27), Long-Term Memory (28) |
| 9 | Setting Safeguards | Template Generation (29), Assembled Reformat (30), Self-Check (31), Guardrails (32) |
| 10 | Composable Agentic Workflows | End-to-end agentic application combining the patterns |
Every pattern follows the same skeleton: Problem, Solution, Example, Considerations, References.
Summary
The throughline of the chapter is that GenAI engineering happens above the model, not inside it. You pick a foundational model, frontier, distilled, open-weight, or locally hostable, and shape its behavior through prompts, sampling knobs (logits, temperature, top-K, top-P, beam search), in-context learning, or post-training (CPT, SFT, LoRA/QLoRA, RLHF/DPO/GRPO). Agents are the architectural unit you build with: an LLM brain plus autonomy, planning, tool use, and self-correction. The rest of the book is patterns for making all of that work in production.
Next chapter
Controlling Content Style