Chapter 2: Controlling Content Style
Introduction
Foundational models produce stochastic, model-specific output by default. Even when you fix the prompt and pin the length, two models will hand back two different voices, and the same model on two different calls will too. Ask six providers "What's a good side dish for pierogi? Answer in a single sentence" and you get six different answers:
| Model | Provider | Answer |
|---|---|---|
| GPT-4 | OpenAI | A great side dish for pierogi is sautéed onions with butter and a sprinkle of crispy bacon bits. |
| Claude 3.5 Sonnet | Anthropic | A tangy sauerkraut or caramelized onions complement pierogies perfectly by adding contrasting acidity or sweetness… |
| Gemini 2.0 Flash | Google | Sautéed onions and mushrooms are a classic and delicious side dish for pierogi. |
| Llama 3.2 70B | Meta | A traditional and delicious side dish for pierogi is fried onions and sour cream… |
| DeepSeek-R1 | DeepSeek | A tangy sauerkraut salad or caramelized roasted carrots with dill make excellent, flavorful sides for pierogi. |
| Mistral Small 24B | Mistral AI | A classic side dish for pierogi is coleslaw, especially when garlic and herbs are added. |
Prompt engineering by itself is brittle. The five patterns in this chapter give you progressively stronger control over style, from hard rules at sampling time (Logits Masking), to formal-syntax constraints (Grammar), to example-based imitation (Style Transfer), to a generate-then-restyle workflow when you only have desired-style examples (Reverse Neutralization), to preference tuning when you can't even articulate what makes content "good" (Content Optimization).
Pattern 1 — Logits Masking
Logits Masking enforces style rules by intercepting the sampling stage and zeroing out the probability of any continuation that breaks them.
Problem
Style rules come from a lot of places. Branding teams want Item A described with sporty, performant words and not Item B's spacious, luxurious ones. Accuracy rules ban repeating an invoice ID and amount in the body of a payment letter because only the canonical location is validated. Compliance rules ban competitor names B, C, D when discussing a case study from Customer A. Stylebook rules pin you to The Chicago Manual of Style or APA citation conventions.
The naive answer is the try-and-try-again antipattern: generate, evaluate, regenerate if it fails, repeat.
This works only when most responses pass on the first try. The latency math says that if p% succeed, the expected number of attempts is 100/p.
| Success rate | Avg attempts | 99th-percentile attempts |
|---|---|---|
| 90% | 1.1 | 2 |
| 30% | 3.3 | 13 |
The exponential backoff people add between attempts makes the tail worse. Try-and-try-again is acceptable when the success rate is already very high, and not otherwise.
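The table's numbers follow directly from the geometric distribution; a quick sanity check:

```python
import math

def expected_attempts(p):
    # Geometric distribution: mean number of tries until the first success.
    return 1.0 / p

def p99_attempts(p):
    # Smallest n such that the chance of n straight failures drops below 1%.
    # The small epsilon guards against floating-point edge cases.
    return math.ceil(math.log(0.01) / math.log(1.0 - p) - 1e-9)

print(round(expected_attempts(0.9), 1), p99_attempts(0.9))  # 1.1 2
print(round(expected_attempts(0.3), 1), p99_attempts(0.3))  # 3.3 13
```

At a 30% success rate, one request in a hundred needs 13 tries, and that's before any backoff delay is added on top.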
Solution
Logits Masking intervenes inside the sampling loop. At each step, look at the candidate continuations, set the logits of any rule-breaking ones to -inf, and continue as long as at least one valid candidate remains. If everything gets masked or you've revisited a dead end, back up one step and retry. After a max number of retries, refuse the request.
The net effect is that nonconforming branches get pruned out of beam search.
The solid boxes are sequence selection, which is enough for simple cases. The hatched boxes are sequence regeneration, needed when masking wipes out everything.
Implementation with the Transformers library
Hugging Face Transformers exposes a LogitsProcessor hook:
from transformers import LogitsProcessor

class MyRulesLogitsProcessor(LogitsProcessor):
    def __init__(self, tokenizer, rules):
        self.tokenizer = tokenizer  # used to decode candidate sequences in __call__
        self.rules = rules          # the style rules to enforce at each step
Pass it to a text-generation pipeline:
from transformers import pipeline
pipe = pipeline(task="text-generation", model=MODEL_ID)
rules_processor = MyRulesLogitsProcessor(pipe.tokenizer, rules)
results = pipe(input_message,
max_new_tokens=256,
do_sample=True,
temperature=0.8,
num_beams=10,
logits_processor=[rules_processor])
The interesting work happens in __call__. Inputs are token IDs (so decode them first), and you set the logits of invalid sequences to -inf, or -10000 since torch sometimes balks at -inf:
def __call__(self, input_ids, input_logits):
    output_logits = input_logits.clone()
    for idx, input_id in enumerate(input_ids):
        seq = self.tokenizer.decode(input_id)      # token IDs -> text
        if not self.apply_rules(seq, self.rules):
            output_logits[idx] = -np.inf           # mask this beam entirely
    return output_logits
The standard pipeline doesn't backtrack. To regenerate, drive pipe.model.generate() 16 tokens at a time, append accepted text to the prompt, and reinvoke:
input_ids = pipe.tokenizer(
input_prompt + '\n'.join(text_so_far),
return_tensors="pt").to("cuda")
results = pipe.model.generate(
**input_ids,
max_new_tokens=16,
num_beams=10,
output_scores=True,
)
A stop string, either specified in the prompt or coming from the example context, ends generation.
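Putting the pieces together, the accept-or-backtrack control flow can be sketched in framework-agnostic form; `generate_chunk` and `is_valid` are hypothetical stand-ins for the `pipe.model.generate()` call and your rule evaluator:

```python
def regenerate(prompt, generate_chunk, is_valid, stop="</END>", max_retries=5):
    """Chunked generation with backtracking: accept a chunk only if the
    text so far still satisfies the rules; otherwise retry, and after too
    many failures pop the last accepted chunk and resume from there."""
    accepted = []
    retries = 0
    while True:
        text_so_far = "".join(accepted)
        chunk = generate_chunk(prompt + text_so_far)   # e.g. 16 new tokens
        if is_valid(text_so_far + chunk):
            accepted.append(chunk)
            retries = 0
            if stop in chunk:                          # stop string ends generation
                return "".join(accepted).split(stop)[0]
        else:
            retries += 1
            if retries > max_retries:
                if not accepted:
                    return None                        # refuse the request
                accepted.pop()                         # back up one step
                retries = 0
```

With a nonzero temperature each retry explores a different branch, which is what makes popping and regenerating worthwhile.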
Example: Sequence Selection (Branding)
You're a marketer for nutritional supplements. Your e-commerce site bans award winning, quality, growth, and perfect, and rewards positive SEO terms like whey and whey protein. Zero-shot prompting fails because the model goes straight for the banned terms.
With Logits Masking you write an evaluator that scores by counting positive vs. negative terms, then keep only the highest-scoring continuations:
def evaluate(descr, positives, negatives):
descr = descr.lower()
num_positive = np.sum([1 for p in positives if p in descr])
num_negative = np.sum([1 for n in negatives if n in descr])
return int(num_positive - num_negative)
class BrandLogitsProcessor(LogitsProcessor):
def __call__(self, input_ids, input_logits):
output_logits = input_logits.clone()
num_matches = [evaluate(self.tokenizer.decode(seq),
self.positives, self.negatives)
for seq in input_ids]
max_matches = np.max(num_matches)
for idx in range(len(input_ids)):
if num_matches[idx] != max_matches:
output_logits[idx] = -10000
return output_logits
Result: a description that lands on whey, whey protein, nutrients, and premium (a workaround for the banned quality).
Example: Sequence Regeneration (Acrostic Poetry)
Generate a children's-book acrostic where the first letters spell out an adjective for an animal. For a tiger that might be BOLD, SWIFT, or FIERCE. One-shot Llama 3.2 fails on this. Given a rabbit example it tries to spell POWER and produces PORE.
The pattern walks through it in pieces. First, ask the model for adjectives that fit "As ___ as a {animal}" and get back something like ['wild', 'agile', 'regal', 'swift', 'fierce']. Then ask for a phrase about the animal that starts with the chosen letter, initialize the poem with that line, and call pipe.model.generate(... num_beams=10) for the next chunk. Apply Logits Masking to keep only beams whose first letters match the next character of the acrostic word, zeroing out the rest. Greedy decoding picks the highest-probability survivor. If everything gets masked, pop the last line and try again. If you exhaust the start, reinitialize with a different adjective.
Sample output for tiger:
Boldly, the brave tiger stalks its prey
Owning the forest with its might,
Lurking in the shadows, waiting to pounce,
Daring to be the king of the night
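The beam-filtering step of that walk can be sketched at the string level; this is a simplified stand-in for the logits processor, with the `(text, score)` beams as assumed inputs:

```python
def next_acrostic_line(beams, required_letter):
    """Beam-level analogue of the logits mask: keep only beams whose first
    letter matches the next character of the acrostic word, then return the
    highest-scoring survivor. `beams` is a list of (text, score) pairs;
    returning None tells the caller to pop the last line and retry."""
    survivors = [(text, score) for text, score in beams
                 if text.strip()[:1].upper() == required_letter.upper()]
    if not survivors:
        return None
    return max(survivors, key=lambda pair: pair[1])[0]
```

Greedy selection among the survivors matches the description above; a production version would zero out the losing beams' logits instead of discarding them post hoc.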
Considerations
Sequence selection is enough as long as masking always leaves at least one valid continuation. Sequence regeneration is the fallback when masking can wipe out every candidate.
The alternatives are weaker on enforcement but easier to set up. Few-shot prompting (which is essentially Style Transfer) and prompt engineering are cheaper but give you no guarantee. A bigger, more instruction-following model is slower and more expensive. Try-and-try-again is fine when p is high (99th-percentile attempts is just 2 at p = 0.9). Reflection (Pattern 18) feeds error messages back to the model and reduces retries. Grammar (Pattern 2) is the right answer when the rules can be expressed in a standard form, because then the provider can do Logits Masking server-side.
A useful extension is autocomplete. Hold the entire document in context and use Logits Masking to provide search-style autocomplete grounded in your documents. Cache the document so the input tokens don't keep accumulating cost.
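A word-level toy version conveys the autocomplete idea; real implementations mask at the token level, but the principle of only allowing continuations that appear in the grounding document is the same:

```python
def doc_autocomplete(prefix, document, max_words=3):
    """Word-level sketch of document-grounded autocomplete: each next word
    is 'masked' to words that actually follow the current last word
    somewhere in the document, so suggestions never leave the source text."""
    words = document.split()
    completion = []
    for _ in range(max_words):
        tail = (prefix.split() + completion)[-1]
        candidates = [words[i + 1] for i, w in enumerate(words[:-1]) if w == tail]
        if not candidates:
            break
        # greedy decoding: pick the most frequent continuation
        completion.append(max(set(candidates), key=candidates.count))
    return " ".join(completion)
```
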
The big caveat is access to logits. As of June 2025, OpenAI exposes logprobs broadly, Gemini Flash supports responseLogprobs (Pro doesn't), Llama is the most permissive but requires self-hosting, and Anthropic's Claude doesn't expose logits at all. So this pattern restricts your model choice. Each decoding step also requires a round trip between client and model, which adds latency and makes this mostly viable for locally hosted or colocated models. And if no candidate sequence meets the rules, you may have to refuse. The AI engineer's job is to provide enough context in the prompt that this is rare.
In RL the same idea is called invalid action masking (Vinyals et al. 2019, on StarCraft II; theoretical justification by Huang and Ontañón 2020). Self-Check (Pattern 31) is another pattern that reads logits.
Pattern 2 — Grammar
Grammar enforces style rules that can be expressed as a context-free metasyntax, typically when output has to fit a specific data schema or standard format.
Problem
You want LLM output in a fixed format: a comma-separated list, a JSON document, a syntactically valid SQL statement. The naive approach is to state the format in the prompt and provide examples. That's brittle, breaks across model versions, is stochastic, and forces every consumer to defend against malformed responses.
Solution
The model framework restricts the next token to ones the grammar allows by zeroing out logits of disallowed tokens. In other words, the framework does Logits Masking on your behalf.
There are three ways to set this up.
Option 1: Grammar-constrained logits processor
Provide a formal grammar in Backus–Naur form (BNF). Almost every formal format and programming language has a published BNF.
grammar_str = """
timestamp_literal ::=
    { t 'yyyy-mm-dd hh:mi:ss' } | date_literal time_literal
date_literal ::=
    { d 'yyyy-mm-dd' }
    | mm-dd-yyyy | mm/dd/yyyy | mm-dd-yy | mm/dd/yy | yyyy-mm-dd
    | yyyy/mm/dd | dd-mon-yyyy | dd/mon/yyyy | dd-mon-yy | dd/mon/yy
time_literal ::=
    { t 'hh:mi:ss' } | hh:mi:ss[:mls]
"""
grammar = IncrementalGrammarConstraint(grammar_str,
"timestamp_literal",
pipe.tokenizer)
grammar_processor = GrammarConstrainedLogitsProcessor(grammar)
results = pipe(input_message, max_new_tokens=256, do_sample=False,
logits_processor=[grammar_processor])
You name the root element when constructing the constraint, so a single big spec like full SQL can be reused to extract any sub-rule.
Option 2: Standard data format
Most providers support JSON mode, which is a flag away:
response = client.chat.completions.create(
model=MODEL_ID,
messages=input_message,
response_format={"type": "json_object"}
)
The prompt has to explicitly request JSON so the model emits the right tokens. LangChain's XML parser is not an example of Grammar. It relies on instruction-following with no compliance guarantee.
Option 3: User-specified schema (structured outputs)
Specify the exact JSON shape via JSON Schema or a Python dataclass / Pydantic model:
@dataclass
class LineItem:
    description: str
    quantity: int
    amount: float

@dataclass
class Receipt:
    items: list[LineItem]
    total_amount: float
response = client.models.generate_content(
model='gemini-2.0-flash',
contents=["Parse the receipt contained in the image", image],
config={
'response_mime_type': 'application/json',
'response_schema': Receipt,
},
)
import json

def to_receipt(d):
    # object_hook fires for every nested dict, so dispatch on the keys
    return LineItem(**d) if "description" in d else Receipt(**d)

data_obj = json.loads(response.text, object_hook=to_receipt)
The provider applies Logits Masking server-side by translating the schema to rules.
Examples
Arithmetic expressions
You're building elementary-school software and you want the model to output the expression (number_of_dozens × number_per_dozen = number_of_eggs), not just the answer (36). Pure prompting fails. With grammar:
root ::= (expr "=" ws term "\n")+
expr ::= term ([-+*/] term)*
term ::= ident | num | "(" ws expr ")" ws
ident ::= [a-z] [a-z0-9_]* ws
num ::= [0-9]+ ws
ws ::= [ \t\n]*
For "Bill has 3 apples and 2 oranges. Mae has 2 apples and 4 oranges. How many apples do Bill and Mae have in total?" you get back bill_apples + mae_apples = total_apples followed by 3 + 2 = 5.
Asking "do Bill and Mae have more apples than oranges?" is more interesting. The right answer needs >, which the grammar disallows. The model is forced into 3 + 2 = 5 and 2 + 4 = 6, which confirms that the constraint actually constrains.
Pipe separator extraction
Extract author, title, and year separated by |:
record ::= author separator title separator year
author ::= [a-zA-Z ]* | unk
title ::= [a-zA-Z ]* | unk
year ::= [1-2][0-9][0-9][0-9] | unk
unk ::= "NULL"
separator ::= "|"
With this grammar, "Love in the Time of Cholera (Spanish: El amor…) by Gabriel García Márquez published in 1985" yields:
Gabriel Garcia Marquez | Love in the Time of Cholera |1985
The accents get stripped because the grammar only allows ASCII letters. For a passage with conflicting dates, the model emits NULL for the year.
JSON mode
def parse_book_info(paragraph):
system_prompt = """
You will be given a short paragraph about a book.
Extract the author, title, and publication year of the book.
Return the result as JSON with the keys author, title, and year.
If any piece of information is not found, fill the spot with NULL
"""
response = client.chat.completions.create(
model=MODEL_ID,
messages=[{"role": "developer", "content": system_prompt},
{"role": "user", "content": paragraph}],
response_format={"type": "json_object"}
)
return response.choices[0].message.content
JSON mode constrains shape but not values. Accents and non-English titles may or may not appear, so guard at the parser.
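A minimal guard at the parser might look like this; `safe_parse_book_info` is a hypothetical helper, and the NULL convention follows the prompt above:

```python
import json

def safe_parse_book_info(raw):
    """Guard JSON-mode output: the shape is constrained, the values are not,
    so validate the keys and normalize the "NULL" sentinel the prompt asks for."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None    # shouldn't happen in JSON mode, but guard anyway
    result = {}
    for key in ("author", "title", "year"):
        value = data.get(key)
        result[key] = None if value in (None, "NULL") else value
    return result
```
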
Structured outputs via dataclass
from enum import Enum

class CurrencyEnum(str, Enum):
    USD = 'USD'
    UKP = 'UKP'
    INR = 'INR'
    EUR = 'EUR'
@dataclass
class Invoice:
purpose: str
amount: float
currency: CurrencyEnum = CurrencyEnum.USD
agent = Agent(model, result_type=Invoice, system_prompt=system_prompt)
response = agent.run_sync("Requesting reimbursement for taxi ride to airport. I paid $32.30.")
# Invoice(purpose='taxi ride to airport', amount=32.3,
# currency=<CurrencyEnum.USD: 'USD'>)
You're guaranteed to get back an Invoice. Don't beg an LLM for compliance. Instead of "Please answer YES or NO, in all caps", write:
from typing import Literal
agent = Agent(model, result_type=Literal["YES", "NO"])
Considerations
BNF vs. Pydantic
| Aspect | BNF | Pydantic / dataclass |
|---|---|---|
| Ease of use | Harder | Just a class |
| Latency | Client-side via Logits Masking | Server-side, fewer round-trips |
| Model support | Needs logprob access | Universal |
| Flexibility | More expressive (e.g., credit-card validation rules) | Limited beyond Enum |
Example BNF for US credit cards, which is almost impossible to express in pure Pydantic:
<credit_card_number> ::= <visa_number> | <mc_number> | <amex_number>
<visa_number> ::= "4" <digit>{12,15}
<mc_number> ::= ("51".."55" <digit>{14}) | ("2221".."2720" <digit>{12})
<amex_number> ::= "34" <digit>{13} | "37" <digit>{13}
<digit> ::= "0" | "1" | ... | "9"
JavaScript developers can use Zod in place of Pydantic with Ollama or OpenAI.
When to choose Logits Masking instead
Reach for Logits Masking when the rules are pure logic rather than representation, or when the BNF gets too complex to debug. Pure logic shows up when the rule depends on the content (mask competitor in launch content but not in in-market content), when the rules come from a database or rules engine and vary by client, when masking depends on user input (autocomplete is the classic example), or when checking the rule requires invoking an external tool.
Failure modes
Three to watch for. Endless whitespace, where the grammar's whitespace token is always allowed and the model loops on it. Increased refusals, where nested fields or long structures push the model into corners with no valid continuation. Inaccurate results from an overly restrictive grammar. The fix for that last one is to provide an escape hatch: currency_rate: float | Literal["Unknown"].
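The escape hatch can be expressed in type form; `FxQuote` is a hypothetical schema, and a validation library such as Pydantic would enforce the union at parse time (a bare dataclass only documents it):

```python
from dataclasses import dataclass
from typing import Literal, Union

@dataclass
class FxQuote:
    pair: str
    # Escape hatch: allow an explicit "Unknown" instead of forcing the
    # model to invent a float when no rate appears in the source text.
    currency_rate: Union[float, Literal["Unknown"]]
```
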
Grammar is sometimes called structured outputs or constrained decoding, but be careful: not every "structured outputs" implementation actually uses constrained decoding. LangGraph (June 2025) does an extra postprocessing LLM call instead, which is wasteful and less reliable. Anthropic's Claude (July 2025) similarly doesn't appear to use constrained decoding, and recommends prefilling the response with the start of the desired format.
Pattern 3 — Style Transfer
Style Transfer teaches a model to convert content from a readily available form into a desired style by showing it example input/output pairs.
Problem
The pattern fits when you have content you want to restyle, when the style is hard to articulate as rules but easy to recognize, and when you have at least a handful of hand-converted examples. Common situations: academic papers becoming blog posts, generic content getting brand-specific styling, LinkedIn posts getting reformatted for X or BlueSky or Instagram, technical docs being retargeted at beginner / intermediate / expert audiences, or rough notes being polished into professional emails.
A zero-shot prompt to convert notes to a professional email returns generic, overly formal text with arbitrary placeholders ([Your Name] one time, [Name] the next), which breaks any downstream substitution.
Solution
Two approaches.
Option 1: Few-shot learning
Add 1 to 10 input/output pairs to the prompt:
def generate_text(input_text):
    in_context_examples = [{
        "input_text": "The movie was fantastic!",
        "output_text": "The cinematography was exceptional, with masterful "
                       "use of light and shadow to convey emotional depth.",
    }, ...]
    prompt = "Convert the following text into the following style:\n\n"
    for ex in in_context_examples:
        prompt += f"\nInput: {ex['input_text']}\nOutput: {ex['output_text']}\n"
    prompt += f"\nInput: {input_text}\nOutput:\n"
    return pipe(prompt, max_new_tokens=256)  # assumes a text-generation pipeline
Option 2: Model fine-tuning
Fine-tune on roughly 100 to 1000+ pairs. The win is higher fidelity (you can do harder restyling like paper-to-brochure with extensive vocabulary remapping) and faster, cheaper inference because the prompt drops to the bare minimum (no examples). The cost is data curation and governance, recurring training costs every time the desired style changes, training expertise to pick a learning rate that doesn't trigger catastrophic forgetting, and ops expertise to host the resulting model.
OpenAI fine-tuning:
training_file = client.files.create(
file=open("fine_tuning_dataset.jsonl", "rb"),
purpose="fine-tune"
)
job = client.fine_tuning.jobs.create(
training_file=training_file.id,
model="gpt-3.5-turbo"
)
import time

while True:
    job_status = client.fine_tuning.jobs.retrieve(job.id)
    if job_status.status in ['succeeded', 'failed']:
        break
    time.sleep(30)  # poll the job status instead of busy-waiting
completion = client.chat.completions.create(
model=job_status.fine_tuned_model,
messages=messages
)
OpenAI hosts the fine-tuned model, which takes the operational burden off you.
Example: Style Transfer in Images
You can transfer Caspar David Friedrich's Wanderer Above the Sea of Fog style to a Star Wars subject.
The trick is Stable Diffusion + ControlNet with a depth map as the control image, which preserves the spatial layout (foreground figure, distant horizon):
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
controlnet=controlnet,
torch_dtype=torch.float16,
use_safetensors=True
)
depth_estimator = pipeline("depth-estimation")
depth_image = depth_estimator(image)["depth"]
wanderer_depth_map = ...unsqueeze(0).half().to("cuda")
prompt = "Star Wars' Darth Vader with a red light saber"
output = pipe(
prompt,
image=wanderer_image,
control_image=wanderer_depth_map,
...
).images[0]
The output reuses the original's pose and perspective (the wanderer's cane lands as a lightsaber in the same position) while the diffusion model sneaks in some creative flourishes (spaceships on the horizon). Adjusting the relative weights controls how much creativity is allowed.
Considerations
Style Transfer is much simpler than Logits Masking or Grammar but offers no strict enforcement. Fine-tuning has a higher success rate but is still implicit. Use Logits Masking or Grammar when conformance is not optional.
A few things drive success rate. Bigger models generalize from few examples better, at the cost of more expense and latency. Context limits are real: too many examples eat the window, dilute the message, and start contradicting each other. And inference speed degrades as prompts grow because the attention pass takes longer, so for real-time UX a fine-tuned shorter prompt usually wins.
If you keep hitting context-window limits, consider context engineering, picking only the best examples, or Adapter Tuning (Pattern 15, Chapter 5).
Style transfer of images was introduced by Gatys, Ecker, and Bethge (2015) and reviewed by Jing et al. (2018). Text style transfer with LLMs is in Reif et al. (2021).
Pattern 4 — Reverse Neutralization
Reverse Neutralization generates content in a desired style when all you have are examples of the desired style, with no input/output pairs.
Problem
You want a complaint letter to Lufthansa in your personal style, but you've never written one before. Or in legalese specific to Tamil Nadu, but your firm has no Lufthansa-related notices yet. Zero-shot is out. Style Transfer is out because there are no input/output pairs to learn from, only desired-output examples.
Solution
Insert an intermediate neutral form that the foundational model can produce easily.
Building the fine-tuned model (training time)
Take your styled examples and ask the foundational model to rephrase them into a neutral form ("professional emails between executives" or "freshman college reading level"). Then flip the dataset: the neutral version is the input, the original styled version is the target output. Fine-tune the base model on this reversed dataset.
Inference
At inference time, the foundational model generates content in the neutral form from a generic prompt, and the fine-tuned model converts the neutral content into the desired style.
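The two-stage inference is a pair of calls; in this sketch, `base_model` and `styled_model` are hypothetical callables wrapping your chat-completion requests to the foundational and fine-tuned models:

```python
def generate_styled(prompt, base_model, styled_model):
    """Stage 1: the foundational model drafts content in the neutral form.
    Stage 2: the fine-tuned model converts the neutral draft into the
    desired style, mirroring the reversed training dataset."""
    neutral = base_model(
        f"Write the following in a plain, neutral style: {prompt}")
    return styled_model(
        f"Convert the neutralized email into personalized email:\n{neutral}")
```
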
Example 1: Generating Legal Text
Step one is to neutralize an existing legal notice:
messages = [
{'role': 'system',
'content': "You are a helpful assistant who will convert the given text "
"into text that is understandable by a freshman college student."},
{'role': 'user',
'content': "Neutralize the tone and style from the following legal text "
"and express it for a nonlegal audience: 'The plaintiff hereby "
"moves for summary judgment pursuant to Rule 56(c)…'"}
]
Output:
The person suing is asking the court to decide the case without a full trial.
They say that the other person broke the contract listed in Exhibit A, and they
should get money to make up for it.
The model strips the legalese (plaintiff becomes person, material breach becomes broke the contract) while keeping the meaning. Reverse this for the fine-tuning data: input is plain English, output is legalese.
Example 2: Personal Style
Use highly stylized personal emails, complete with emoji and exclamation points.
Step one, neutralize:
Subject: Welcome to the Team
Body: Hi Emily,
I would like to extend a warm welcome to you as a new member of the Customer
Success team. We are looking forward to your contributions...
Step two, build the dataset by flipping inputs and outputs into JSONL:
{"messages": [
{"role": "system",
"content": "You are a helpful assistant converting the neutralized email into personalized email."},
{"role": "user", "content": "<neutral version>"},
{"role": "assistant", "content": "<original styled version with emojis>"}
]}
100 to 1000 examples is the typical range. The example uses 200.
Step three, fine-tune via OpenAI's managed service or any framework from Chapter 1 / Pattern 16.
Step four, run inference. The foundational model writes a neutral letter to Gretl about a presentation, and the fine-tuned model rewrites it in the personal style:
Subject: 🎉 Exciting Opportunity: Unleash Your Marketing Magic at the 2026
FIFA World Cup! ⚽
Hi Gretl!
I hope this message finds you in fantastic spirits! 🌟 I am absolutely thrilled
to invite you to present...
Considerations
The neutral form has to be repeatable. Different LLMs interpret "neutral" differently, so use cosine similarity between embeddings of the original and the neutral text as a sanity check. Watch for loss of information when style is intertwined with content, and watch for over-neutralization that strips the text of clarity entirely. Experiment with multiple neutral forms before picking one.
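The cosine-similarity sanity check is a few lines; the 0.8 threshold below is an illustrative assumption to tune on your own data, and the embeddings can come from any embedding model:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def neutralization_ok(original_embedding, neutral_embedding, threshold=0.8):
    # The neutral rewrite should stay semantically close to the original;
    # a low score signals loss of information during neutralization.
    return cosine_similarity(original_embedding, neutral_embedding) >= threshold
```
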
On the dataset side, use varied, high-quality input examples that cover the application's full scope. Apply NLP techniques like topic modeling to verify diversity at both the styled and neutral stages. Skewed distributions cause fine-tuning failures.
The neutralization step on its own is useful beyond this pattern. It supports privacy and bias removal (strip identifiable style) and content standardization (uniform tone across many authors). The pattern is analogous to back translation (Beddiar, Jahan, Oussalah 2021; Edunov et al. 2018) used in machine translation to expand datasets.
Pattern 5 — Content Optimization
Content Optimization uses preference tuning to produce optimal content even when you can't articulate what makes content "good."
Problem
Traditional A/B testing only works when you have a hypothesis about what differentiates the two styles. Without a hypothesis, you hit three walls: indistinguishable sets, because there is nothing to vary; indeterminate tests, because no statistically significant signal emerges; and unusable results, because even if you spot a difference, you can't reproduce the winning style.
Solution
Reframe each wall. Instead of distinguishable sets, compare two pieces at a time and define winners pairwise. Instead of statistically significant tests, drop the per-test threshold and just collect lots of pairs. Instead of changing the prompt, change the model: tune the LLM via DPO so it produces content like the winners.
Four steps: generate pairs, pick a winner, create a preference dataset, preference-tune the LLM.
Step 1: Generate pairs from the same prompt
Three sub-techniques. Repeated generation uses the same prompt with nonzero temperature, top-K > 1, and no caching. Mistral-7B-Instruct on "Where does the term 'knee-jerk reaction' come from?" gives back two distinctly different styles, one layperson-targeted and one assuming anatomy knowledge. Changing generation settings randomizes temperature or top-P:
paired_content = []
for iter in range(2):
response = pipe(input_message,
temperature=random.uniform(0.2, 0.9))
paired_content.append(response[0]['generated_text'][-1]['content'])
Prompt rewriting asks the LLM to rephrase the prompt without changing intent (more concise, more verbose, add a follow-on question). Adding a follow-on question in particular tends to yield substantively different responses.
Step 2: Compare and pick a winner
Three variants here. Human labeling shows two pieces to an expert, takes a majority vote across a panel, or shows them as drafts and watches which users select. Using an evaluator runs an industry rubric (the 4Ps and 3Cs for marketing, for example) or runs both pieces against a programmatic check (for SQL generation, run both queries against an in-memory database and prefer the correct one, breaking ties on conciseness then speed). Or use LLM-as-judge: hand both pieces to a frontier model that already knows the framework and ask which is better.
Example: Gemini 2.0 Flash rated some Amazon marketing copy on the 4Ps + 3Cs rubric and gave a 6, with the reasoning "…could be improved by including more information about the price and availability of the devices."
The third variant uses real downstream behavior. Direct measurement splits your audience and measures clicks, signatures, or read time. Matching prompts pairs semantically similar user queries and compares resolution time.
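The programmatic-evaluator variant for SQL can be sketched with sqlite3's in-memory database; the helper names and the conciseness tie-break here are assumptions, and speed tie-breaking is omitted for brevity:

```python
import sqlite3

def pick_winner(query_a, query_b, setup_sql, expected_rows):
    """Run both candidate queries against an in-memory database, prefer
    the one returning the expected result, and break ties on conciseness."""
    def run(query):
        conn = sqlite3.connect(":memory:")
        try:
            conn.executescript(setup_sql)
            return conn.execute(query).fetchall()
        except sqlite3.Error:
            return None      # malformed SQL loses automatically
        finally:
            conn.close()

    ok_a = run(query_a) == expected_rows
    ok_b = run(query_b) == expected_rows
    if ok_a != ok_b:
        return query_a if ok_a else query_b
    return min(query_a, query_b, key=len)   # tie: prefer the shorter query
```
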
Step 2 is the most important step of the four. The choice of variant matters a lot. The choice of reward function matters even more, and you have to verify that it agrees with intuition. Watch out for metric gaming: optimizing for engagement time can quietly reward hard-to-read content. Distinguish between the objective (the true business goal, which is often unmeasurable) and the metric (a proxy for the objective). Always interpret metrics in terms of the objective, not the other way around.
Step 3: Create the preference dataset
{
"prompt": "Where does the term \"knee-jerk reaction\" come from?",
"chosen": "The term 'knee-jerk reaction' refers to an immediate, often unreflective response to a stimulus...",
"rejected": "The term 'knee-jerk reaction' comes from the medical reflex test where the knee jerks up..."
}
Optionally split into train/eval if you're doing early stopping.
Step 4: Preference tuning with DPO
Use Direct Preference Optimization (Rafailov et al. 2023) via the TRL package. DPO is much faster than RLHF.
MODEL_ID = "Qwen/Qwen2-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
training_args = DPOConfig(output_dir="Qwen-DPO")
trainer = DPOTrainer(model=model, args=training_args,
processing_class=tokenizer,
train_dataset=train_dataset)
trainer.train()
trainer.save_model(training_args.output_dir)
Example: Classified Ads
The goal is a small Qwen2 0.5B model that writes good neighborhood-marketplace classified ads. Zero-shot fails: Qwen2 produces generic manufacturer-style copy ("…fits beginners or seasoned riders alike") when prompted to sell a 3-year-old used bike for $300.
Step one, generate two ads at random temperatures.
Step two, LLM-as-judge with Gemini 2.0 Flash, evaluating against five criteria (clarity, audience, brevity, contact info, truthfulness):
@dataclass
class AdsComparison:
ad_a_is_better_than_ad_b: bool
reasoning: str
Gemini's verdict on the bike pair: ad_b wins because it includes the price and targets adults. Both ads falsely claim a lifetime warranty, so neither is great, but the comparison still produces useful gradients.
Step three, assemble preferences:
def create_preference_example(item, price):
ad1 = create_classified_ad(item, price)
ad2 = create_classified_ad(item, price)
score = score_ad(ad1, ad2)
pe = {"prompt": SYSTEM_PROMPT + f"Write an ad to sell a {item} priced at {price}"}
if score.ad_a_is_better_than_ad_b:
pe['chosen'] = ad1; pe['rejected'] = ad2
else:
pe['chosen'] = ad2; pe['rejected'] = ad1
pe['score_reason'] = score.reasoning
return pe
Step four, load JSONL and run DPO. Roughly 3 minutes on 8 vCPU + L4 GPU for 100 examples.
Inference with the tuned model gives clean, persuasive output:
Pachinko, the classic tale of a man's obsession with gambling and his love for
a woman he meets while playing a pachinko game. A rare edition priced at $5.
For more information or to arrange pickup, please contact [Your Name] at
[Your Phone Number]. Thank you!
Considerations
Choosing variants
| Content type | Pick winner via |
|---|---|
| Open-ended user-facing | Human labeling |
| Behavior-driving content | Outcome metric (with frustration weighting) |
| Consumed by automation | Evaluator (does the code compile? run fast?) |
Use what you already have: defined metrics, rubrics, UI choice presentation, query logs.
In-distribution requirement
The LLM you preference-tune has to be capable of producing the chosen content in its original distribution. The easy way is to use the same model for content generation and tuning, like the example does with Qwen2-0.5B. The hard way is to generate content with a bigger model, then SFT-tune the smaller target before doing DPO. The Qwen2 ads weren't good, but they were ads, which was enough.
Extension to images
Use DiffusionDPO to preference-tune image diffusion models.
Generate two images per prompt, evaluate by downstream behavior (article click-through, say), and run the DPO training script:
accelerate launch --mixed_precision="fp16" train.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$DATASET_NAME \
--output_dir="tmp-sd15"
Continuous improvement
DPO needs few high-quality examples. Quality beats quantity. Improve by upgrading the evaluator (this matters most), by diversifying prompts (log production prompts and feedback, add subpar-response prompts to the training set, topic-model to find outliers, route bug intake into the training set), and by iterating: once the model is trained, repeat steps 1 through 4 with the new model. Saturation eventually plateaus, but you can push to the model size's ceiling.
DeepSeek-R1 (early 2025) used iterative DPO on synthetic data for easily verifiable problems, which is the same "aha moment" idea.
The required ingredients are a fast, high-quality evaluator, systematic prompt collection, and training until saturation.
Preference tuning was originally RLHF (Christiano et al. 2017; Ouyang et al. 2022 for LLMs). DPO is from Rafailov et al. 2023.
Summary
| Pattern | Problem | Solution | When to use |
|---|---|---|---|
| Logits Masking (1) | Generated text must follow style rules for brand / accuracy / compliance | Intercept sampling and zero out non-conforming continuations | Banning brand words, dynamic rules, autocomplete grounded in docs |
| Grammar (2) | Output must conform to a format / data schema | BNF or schema/dataclass passed to the model framework, which masks logits | SQL timestamps, JSON extraction, structured outputs |
| Style Transfer (3) | Convert content to a tone/style that's hard to express as rules but easy to demonstrate | Few-shot examples or fine-tuning on input/output pairs | Brand rewrites, paper to blog, social-media reformatting, image style transfer |
| Reverse Neutralization (4) | Generate content in a style when only desired-style examples exist (no input/output pairs) | Generate neutral form via foundational model; fine-tune to convert neutral to desired style | Local legalese, personal communication style |
| Content Optimization (5) | Optimize for "best" style without knowing what factors matter | Generate pairs, evaluator picks winner, preference dataset, DPO | Ad copy, marketing content, anywhere outcome metrics exist |
The five patterns trade off enforcement against ease of use. Logits Masking gives strict dynamic enforcement but needs logit access, which often forces locally hosted models. Grammar gives a universal structured-output approach via Pydantic or dataclass, with BNF as the escape hatch when validation logic is needed. Style Transfer is the simplest to set up and the weakest at enforcement. Reverse Neutralization solves the no-input-pair problem via an intermediate neutral form. Content Optimization sidesteps the "what makes good?" question entirely by tuning the model on outcomes.