Hands-On Large Language Models by Jay Alammar & Maarten Grootendorst

Chapter 6: Prompt Engineering

Introduction

Prompt engineering is how we steer a generative LLM toward useful output. This chapter covers picking and loading a generative model (Phi-3-mini), the sampling parameters that control creativity (temperature, top_p, top_k), the components of a well-engineered prompt (persona, instruction, context, format, audience, tone, data), in-context learning (zero/one/few-shot), prompt chaining, and advanced reasoning techniques (chain-of-thought, self-consistency, tree-of-thought). It ends with output verification: providing examples and using grammar-constrained sampling to force structured output (e.g., valid JSON).


Section 1: Using Text Generation Models

1.1 Choosing a Model

Foundation models come in many sizes from many vendors.

Rule of thumb: start small. Phi-3-mini (3.8B params, ≤8GB VRAM) is enough to learn the techniques. Scaling up later is easier than scaling down.

1.2 Loading Phi-3

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load Phi-3-mini and its tokenizer onto the GPU
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda", torch_dtype="auto", trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Text-generation pipeline; do_sample=False means greedy (deterministic) decoding
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer,
                return_full_text=False, max_new_tokens=500, do_sample=False)

1.3 Chat Templates Under the Hood

messages = [{"role": "user", "content": "Create a funny joke about chickens."}]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)
<s><|user|>
Create a funny joke about chickens.<|end|>
<|assistant|>

The tokens <|user|>, <|assistant|>, and <|end|> come from the tokenizer. The model was trained on this exact format, so it recognizes who is speaking and where to stop generating.


Section 2: Controlling the Output

When generating each token, the model assigns a probability to every token in the vocab.

do_sample=False always picks the highest-probability token (deterministic / greedy). Set do_sample=True to use the parameters below.

2.1 Temperature

Controls how much weight is given to less-probable tokens.

A low temperature (e.g., 0.2) gives predictable, focused output. A high temperature (e.g., 0.8) gives diverse, creative output. A temperature of 0 corresponds to greedy decoding (the same as do_sample=False).

output = pipe(messages, do_sample=True, temperature=1)

2.2 top_p (Nucleus Sampling)

Restricts candidates to the smallest set of tokens whose cumulative probability is at least top_p. So top_p=0.1 only allows the most probable tokens, while top_p=1 allows all of them.

2.3 top_k

Restricts candidates to the k most probable tokens (top_k=100 keeps only the 100 most likely candidates).
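
Both knobs plug into the same pipeline call and only take effect when sampling is enabled. A minimal sketch (the specific values are illustrative):

# Combine temperature with nucleus (top_p) and top-k filtering
output = pipe(messages, do_sample=True, temperature=0.7, top_p=0.9, top_k=50)
print(output[0]["generated_text"])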

2.4 Picking Values

Use case           temperature   top_p   Why
Brainstorming      High          High    Maximum diversity, surprising ideas
Email generation   Low           Low     Predictable, conservative output
Creative writing   High          Low     Creative within a small coherent pool
Translation        Low           High    Deterministic but lexically varied
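
One way to turn the table into reusable settings (a sketch; the exact numbers are illustrative, not prescribed by the book):

# Illustrative sampling presets for the use cases above
presets = {
    "brainstorming":    {"do_sample": True, "temperature": 1.0, "top_p": 1.0},
    "email":            {"do_sample": True, "temperature": 0.2, "top_p": 0.2},
    "creative_writing": {"do_sample": True, "temperature": 1.0, "top_p": 0.3},
    "translation":      {"do_sample": True, "temperature": 0.2, "top_p": 0.95},
}
output = pipe(messages, **presets["email"])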

Section 3: The Basic Ingredients of a Prompt

An LLM is fundamentally a next-token predictor.

A bare prompt completes text but doesn't follow tasks. Add an instruction plus data:
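
For example, a sentiment-classification instruction followed by the text to classify (a sketch built around the Text: marker discussed below; the review itself is illustrative):

prompt = """Classify the text into negative or positive.
Text: I absolutely loved this movie!"""
output = pipe([{"role": "user", "content": prompt}])
print(output[0]["generated_text"])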

Add an output indicator to constrain format:
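
Continuing the sketch, a trailing Sentiment: marker tells the model to complete with just the label:

prompt = """Classify the text into negative or positive.
Text: I absolutely loved this movie!
Sentiment:"""
output = pipe([{"role": "user", "content": prompt}])
# Expected completion: a single label such as "Positive"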

The model wasn't explicitly trained on Text: and Sentiment: markers but generalizes from the volume of similar instructions in its data.


Section 4: Instruction-Based Prompting

4.1 Cross-Cutting Tips

Specificity wins. "Write a description in less than two sentences with a formal tone" beats "Write a description." Guard against hallucination by saying "If you don't know, say 'I don't know'." Order also matters. LLMs forget the middle of long prompts (the lost-in-the-middle effect), so place instructions at the beginning (primacy) or end (recency).


Section 5: Advanced Prompt Engineering

5.1 Common Components of a Complex Prompt

Component     Purpose
Persona       "You are an expert in astrophysics"
Instruction   The actual task, as specific as possible
Context       Why the task matters / background
Format        JSON? Bullets? Prose? Pin this down for automation
Audience      "Explain like I'm 5" / domain expert / busy researcher
Tone          Formal / casual / technical
Data          The input the instruction operates on

5.2 Iterative Prompt Building

persona = "You are an expert in Large Language models. You excel at breaking down complex papers into digestible summaries.\n"
instruction = "Summarize the key findings of the paper provided.\n"
context = "Your summary should extract the most crucial points...\n"
data_format = "Create a bullet-point summary that outlines the method. Follow this up with a concise paragraph...\n"
audience = "The summary is designed for busy researchers...\n"
tone = "The tone should be professional and clear.\n"
text = "MY TEXT TO SUMMARIZE"
data = f"Text to summarize: {text}"

query = persona + instruction + context + data_format + audience + tone + data
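
The composed prompt is then sent through the pipeline like any other user message (a minimal sketch):

messages = [{"role": "user", "content": query}]
summary = pipe(messages)[0]["generated_text"]
print(summary)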

Some teams add emotional stimuli ("This is very important for my career") or role-play framing — empirically these can shift outputs. Test with your own model.

5.3 In-Context Learning (Zero / One / Few Shot)

Show the model what you want via examples instead of describing it.

one_shot_prompt = [
    {"role": "user", "content": "A 'Gigamuru' is a type of Japanese musical instrument. An example of a sentence that uses the word Gigamuru is:"},
    {"role": "assistant", "content": "I have a Gigamuru that my uncle gave me as a gift. I love to play it at home."},
    {"role": "user", "content": "To 'screeg' something is to swing a sword at it. An example of a sentence that uses the word screeg is:"},
]
outputs = pipe(one_shot_prompt)
# "During the intense duel, the knight skillfully screeged his opponent's shield, forcing him to defend himself."

The user/assistant separation matters. Without it the model can't tell its own examples from the user's request.

5.4 Chain Prompting — Break the Problem Across Calls

For complex tasks, break the work into multiple LLM calls and pipe outputs forward.

# Step 1: name + slogan
product_prompt = [{"role": "user", "content": "Create a name and slogan for a chatbot that leverages LLMs."}]
product_description = pipe(product_prompt)[0]["generated_text"]

# Step 2: sales pitch using the result of step 1
sales_prompt = [{"role": "user", "content": f"Generate a very short sales pitch for the following product: '{product_description}'"}]
sales_pitch = pipe(sales_prompt)[0]["generated_text"]

Each step can use different max_new_tokens, temperature, and so on. Other patterns are worth knowing. Response validation chains a second call to double-check the first. Parallel prompts generate alternatives in parallel, then merge in a final pass. Story writing flows outline → characters → beats → dialogue.
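
For instance, a response-validation step might look like this (a sketch, not code from the book; the wording of the check prompt is an assumption):

# Step 3: ask the model to double-check the pitch it just wrote
check_prompt = [{"role": "user", "content":
    f"Does the following sales pitch mention both a product name and a slogan? "
    f"Answer yes or no, then explain briefly: '{sales_pitch}'"}]
validation = pipe(check_prompt)[0]["generated_text"]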


Section 6: Reasoning with Generative Models

What looks like reasoning in LLMs is largely pattern matching and memorization. We can prompt them to mimic System 2 thinking (slow, conscious, deliberate) instead of System 1 (fast, intuitive).

6.1 Chain-of-Thought (CoT)

Show examples that include reasoning steps before the answer, and the model learns to do the same.

cot_prompt = [
    {"role": "user", "content": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?"},
    {"role": "assistant", "content": "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11."},
    {"role": "user", "content": "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?"},
]
# → "The cafeteria started with 23 apples. They used 20 apples, so 23-20=3. Then bought 6 more, so 3+6=9. The answer is 9."

Why it works: each reasoning token itself becomes context for the next. Reasoning gives the model more compute distributed over the problem.

6.2 Zero-Shot CoT

Skip the example. Just append "Let's think step-by-step".

zeroshot_cot_prompt = [
    {"role": "user", "content": "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have? Let's think step-by-step."}
]

Variants like "Take a deep breath and think step-by-step" also work.

6.3 Self-Consistency — Sample Multiple Reasoning Paths

Run the same CoT prompt several times with temperature > 0, then majority-vote the final answers.

The tradeoff is that this is n-times slower, but it's more robust to lucky or unlucky token sampling.
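
A minimal sketch of sample-and-vote, reusing the zero-shot CoT prompt above (the answer extraction is deliberately crude; real outputs may need more careful parsing):

import re
from collections import Counter

answers = []
for _ in range(5):  # sample five independent reasoning paths
    out = pipe(zeroshot_cot_prompt, do_sample=True, temperature=0.7)[0]["generated_text"]
    numbers = re.findall(r"\d+", out)
    if numbers:
        answers.append(numbers[-1])  # treat the last number as the final answer

final_answer, votes = Counter(answers).most_common(1)[0]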

6.4 Tree-of-Thought (ToT)

For multi-step problems, branch into several candidate thoughts at each step, score them, prune the weak ones, and continue.

ToT is expensive in model calls. A simpler single-prompt approximation asks the model to roleplay a panel of experts who reason step-by-step in parallel:

zeroshot_tot_prompt = [
    {"role": "user", "content": "Imagine three different experts are answering this question. All experts will write down 1 step of their thinking, then share it with the group. Then all experts will go on to the next step, etc. If any expert realizes they're wrong at any point then they leave. The question is 'The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?' Make sure to discuss the results."}
]

Output: three "experts" debate, agree on each step, and converge on the answer (9).


Section 7: Output Verification

There are several reasons to constrain output for production. Structured output must be parseable JSON or similar. Valid output must be one of an enumerated set of values. Ethical output avoids profanity, PII, and biased or stereotyping content. Accurate output is factual and avoids hallucinations.

You have three control levers. Examples (few-shot) are the simplest. Grammar-constrained sampling restricts the next-token distribution to a grammar. Fine-tuning (Chapter 12) is the heavyweight option.

7.1 Providing Examples

Zero-shot can fail at structure. Asking Phi-3 for "a character profile in JSON format" without an example produced truncated, malformed JSON. With a one-shot template:

one_shot_template = """Create a short character profile for an RPG game. Make sure to only use this format:
{
  "description": "A SHORT DESCRIPTION",
  "name": "THE CHARACTER'S NAME",
  "armor": "ONE PIECE OF ARMOR",
  "weapon": "ONE OR MORE WEAPONS"
}
"""

Output:

{
  "description": "A cunning rogue with a mysterious past, skilled in stealth and deception.",
  "name": "Lysandra Shadowstep",
  "armor": "Leather Cloak of the Night",
  "weapon": "Dagger of Whispers, Throwing Knives"
}

Few-shot guides format but doesn't enforce it. Some models follow instructions much better than others.

7.2 Grammar — Constrained Sampling

Packages like Guidance, Guardrails, and LMQL enforce output constraints. Two flavors are common.

Validation after generation: the model proposes, and the model (or rules) validates.

Hybrid: generate only the unknown parts and insert known structure ourselves.

Grammar-constrained sampling: at each token step, restrict the candidate set to grammar-allowed tokens. For sentiment classification, only positive, neutral, or negative can be emitted.
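
As a concrete sketch (not from the book), llama-cpp-python also accepts hand-written GBNF grammars; restricting a sentiment classifier to those three labels could look like this, assuming llm is a Phi-3 model loaded as in the next subsection:

from llama_cpp import LlamaGrammar

# GBNF grammar that only permits one of three labels
sentiment_grammar = LlamaGrammar.from_string('root ::= "positive" | "neutral" | "negative"')

label = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Classify the sentiment of: 'I loved this movie!'"}],
    grammar=sentiment_grammar,
    temperature=0,
)['choices'][0]['message']['content']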

7.3 Practical: JSON Grammar with llama-cpp-python

llama-cpp-python supports JSON grammar natively via response_format. Load Phi-3 in GGUF format (used for compressed and quantized models, see Chapter 12):

from llama_cpp.llama import Llama

llm = Llama.from_pretrained(
    repo_id="microsoft/Phi-3-mini-4k-instruct-gguf",
    filename="*fp16.gguf",
    n_gpu_layers=-1,
    n_ctx=2048,
    verbose=False,
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Create a warrior for an RPG in JSON format."}],
    response_format={"type": "json_object"},
    temperature=0,
)['choices'][0]['message']["content"]

import json
print(json.dumps(json.loads(output), indent=4))

Output is guaranteed parseable JSON:

{
    "name": "Eldrin Stormbringer",
    "class": "Warrior",
    "level": 10,
    "attributes": {"strength": 18, "dexterity": 12, "constitution": 16, "intelligence": 9, "wisdom": 14, "charisma": 10},
    "skills": {...},
    "equipment": [{"name": "Ironclad Armor", ...}, {"name": "Steel Greatsword", ...}],
    "background": "Eldrin grew up in a small village..."
}

Free up VRAM before loading a second model: del model, tokenizer, pipe; gc.collect(); torch.cuda.empty_cache().
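
As a runnable snippet (assuming the objects from Section 1.2 are still in scope):

import gc

del model, tokenizer, pipe   # drop references to the transformers objects
gc.collect()                 # release Python-side memory
torch.cuda.empty_cache()     # return cached GPU memory to the driver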


Summary

  • A generative LLM is a next-token predictor steered through three levers: prompt (what we say), sampling parameters (temperature, top_p, top_k, do_sample), and chat template (the model-specific role tokens like <|user|>).
  • High temperature and top_p produce creative output. Low values produce predictable output. Pick combinations to fit the task.
  • A good prompt is modular: persona, instruction, context, format, audience, tone, data. Compose, iterate, test.
  • Place key instructions at the start or end of the prompt. Middles get forgotten.
  • In-context learning (zero-, one-, or few-shot) shows examples instead of describing the task and often beats verbal instructions.
  • Chain prompting breaks complex tasks across multiple calls so each step can be tuned independently.
  • For reasoning: chain-of-thought ("Let's think step-by-step") makes the model show its work. Self-consistency sample-and-vote reduces noise. Tree-of-thought explores branching reasoning paths (or fakes it via expert role-play).
  • For production, control output via few-shot examples, grammar-constrained sampling (libraries like Guidance, Guardrails, and LMQL, or built-in JSON grammar in llama-cpp-python), or fine-tuning (Chapter 12).