
Chapter 2: Controlling Content Style

Introduction

Foundational models produce stochastic, model-specific output by default. Even when you fix the prompt and pin the length, two models will hand back two different voices, and the same model on two different calls will too. Ask six providers "What's a good side dish for pierogi? Answer in a single sentence" and you get six different answers:

Model | Provider | Answer
GPT-4 | OpenAI | A great side dish for pierogi is sautéed onions with butter and a sprinkle of crispy bacon bits.
Claude 3.5 Sonnet | Anthropic | A tangy sauerkraut or caramelized onions complement pierogies perfectly by adding contrasting acidity or sweetness…
Gemini 2.0 Flash | Google | Sautéed onions and mushrooms are a classic and delicious side dish for pierogi.
Llama 3.2 70B | Meta | A traditional and delicious side dish for pierogi is fried onions and sour cream…
DeepSeek-R1 | DeepSeek | A tangy sauerkraut salad or caramelized roasted carrots with dill make excellent, flavorful sides for pierogi.
Mistral Small 24B | Mistral AI | A classic side dish for pierogi is coleslaw, especially when garlic and herbs are added.

Prompt engineering by itself is brittle. The five patterns in this chapter trade enforcement strength against ease of use: hard rules at sampling time (Logits Masking), formal-syntax constraints (Grammar), example-based imitation (Style Transfer), a generate-then-restyle workflow when you only have desired-style examples (Reverse Neutralization), and preference tuning when you can't even articulate what makes content "good" (Content Optimization).


Pattern 1 — Logits Masking

Logits Masking enforces style rules by intercepting the sampling stage and zeroing out the probability of any continuation that breaks them.

Problem

Style rules come from a lot of places. Branding teams want Item A described with sporty, performant words and not Item B's spacious, luxurious ones. Accuracy rules ban repeating an invoice ID and amount in the body of a payment letter because only the canonical location is validated. Compliance rules ban competitor names B, C, D when discussing a case study from Customer A. Stylebook rules pin you to The Chicago Manual of Style or APA citation conventions.

The naive answer is the try-and-try-again antipattern: generate, evaluate, regenerate if it fails, repeat.

This works only when most responses pass on the first try. The latency math says that if p% succeed, the expected number of attempts is 100/p.

Success rate | Avg attempts | 99th-percentile attempts
90% | 1.1 | 2
30% | 3.3 | 13

The exponential backoff people add between attempts makes the tail worse. Try-and-try-again is acceptable when the success rate is already very high, and not otherwise.
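Both columns of the table follow from the geometric distribution; a minimal sketch to reproduce them (the function name is illustrative):

import math

def retry_stats(success_rate, percentile=0.99):
    # Expected attempts of a geometric distribution is 1/p;
    # the tail is the smallest n with 1 - (1-p)^n >= percentile.
    avg = 1 / success_rate
    tail = math.ceil(math.log(1 - percentile) / math.log(1 - success_rate))
    return avg, tail

print(retry_stats(0.9))   # (1.11..., 2)
print(retry_stats(0.3))   # (3.33..., 13)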

Solution

Logits Masking intervenes inside the sampling loop. At each step, look at the candidate continuations, set the logits of any rule-breaking ones to -inf, and continue as long as at least one valid candidate remains. If everything gets masked or you've revisited a dead end, back up one step and retry. After a max number of retries, refuse the request.

The net effect is that nonconforming branches get pruned out of beam search.

The simpler mode is sequence selection, which is enough for simple cases. The fallback is sequence regeneration, needed when masking wipes out everything.

Implementation with the Transformers library

Hugging Face Transformers exposes a LogitsProcessor hook:

from transformers import LogitsProcessor

class MyRulesLogitsProcessor(LogitsProcessor):
    def __init__(self, tokenizer, rules):
        self.tokenizer = tokenizer
        self.rules = rules

Pass it to a text-generation pipeline:

from transformers import pipeline
pipe = pipeline(task="text-generation", model=MODEL_ID)

rules_processor = MyRulesLogitsProcessor(pipe.tokenizer, rules)
results = pipe(input_message,
               max_new_tokens=256,
               do_sample=True,
               temperature=0.8,
               num_beams=10,
               logits_processor=[rules_processor])

The interesting work happens in __call__. Inputs are token IDs (so decode them first), and you set the logits of invalid sequences to -inf, or -10000 since torch sometimes balks at -inf:

import numpy as np

def __call__(self, input_ids, input_logits):
    output_logits = input_logits.clone()
    for idx, input_id in enumerate(input_ids):
        # Decode the candidate sequence and mask out any rule-breaker.
        seq = self.tokenizer.decode(input_id)
        if not self.apply_rules(seq, self.rules):
            output_logits[idx] = -np.inf
    return output_logits

The standard pipeline doesn't backtrack. To regenerate, drive pipe.model.generate() 16 tokens at a time, append accepted text to the prompt, and reinvoke:

input_ids = pipe.tokenizer(
    input_prompt + '\n'.join(text_so_far),
    return_tensors="pt").to("cuda")

results = pipe.model.generate(
    **input_ids,
    max_new_tokens=16,
    num_beams=10,
    output_scores=True,
)

A stop string, either specified in the prompt or coming from the example context, ends generation.
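Put together, the regeneration loop looks roughly like this; rules_ok, STOP_STRING, and MAX_RETRIES are placeholder names, not library API:

MAX_RETRIES = 10       # placeholder retry budget before refusing the request
STOP_STRING = "\n\n"   # placeholder stop string

text_so_far, retries = [], 0
while retries < MAX_RETRIES:
    input_ids = pipe.tokenizer(
        input_prompt + '\n'.join(text_so_far),
        return_tensors="pt").to("cuda")
    results = pipe.model.generate(**input_ids,
                                  max_new_tokens=16, num_beams=10)
    # Decode only the newly generated tokens.
    chunk = pipe.tokenizer.decode(
        results[0][input_ids["input_ids"].shape[1]:],
        skip_special_tokens=True)
    if rules_ok(chunk):              # hypothetical rule check
        text_so_far.append(chunk)
        if STOP_STRING in chunk:
            break
    else:
        if text_so_far:
            text_so_far.pop()        # back up one step before retrying
        retries += 1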

Example: Sequence Selection (Branding)

You're a marketer for nutritional supplements. Your e-commerce site bans award winning, quality, growth, and perfect, and rewards positive SEO terms like whey and whey protein. Zero-shot prompting fails because the model goes straight for the banned terms.

With Logits Masking you write an evaluator that scores by counting positive vs. negative terms, then keep only the highest-scoring continuations:

def evaluate(descr, positives, negatives):
    descr = descr.lower()
    num_positive = np.sum([1 for p in positives if p in descr])
    num_negative = np.sum([1 for n in negatives if n in descr])
    return int(num_positive - num_negative)

class BrandLogitsProcessor(LogitsProcessor):
    def __init__(self, tokenizer, positives, negatives):
        self.tokenizer = tokenizer
        self.positives = positives
        self.negatives = negatives

    def __call__(self, input_ids, input_logits):
        output_logits = input_logits.clone()
        # Score every beam; keep only the highest-scoring ones.
        num_matches = [evaluate(self.tokenizer.decode(seq),
                                self.positives, self.negatives)
                       for seq in input_ids]
        max_matches = np.max(num_matches)
        for idx in range(len(input_ids)):
            if num_matches[idx] != max_matches:
                output_logits[idx] = -10000
        return output_logits

Result: a description that lands on whey, whey protein, nutrients, and premium (a workaround for the banned quality).

Example: Sequence Regeneration (Acrostic Poetry)

Generate a children's-book acrostic where the first letters spell out an adjective for an animal. For a tiger that might be BOLD, SWIFT, or FIERCE. One-shot Llama 3.2 fails on this. Given a rabbit example it tries to spell POWER and produces PORE.

The pattern walks through it in pieces. First, ask the model for adjectives that fit "As ___ as a {animal}" and get back something like ['wild', 'agile', 'regal', 'swift', 'fierce']. Then ask for a phrase about the animal that starts with the chosen letter, initialize the poem with that line, and call pipe.model.generate(... num_beams=10) for the next chunk. Apply Logits Masking to keep only beams whose first letters match the next character of the acrostic word, zeroing out the rest. Greedy decoding picks the highest-probability survivor. If everything gets masked, pop the last line and try again. If you exhaust the start, reinitialize with a different adjective.
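A minimal sketch of that masking step, assuming the setup above; the prompt_len bookkeeping is simplified for illustration:

class AcrosticLogitsProcessor(LogitsProcessor):
    """Prune beams whose next line doesn't start with the required letter."""
    def __init__(self, tokenizer, required_letter, prompt_len):
        self.tokenizer = tokenizer
        self.required_letter = required_letter.upper()
        self.prompt_len = prompt_len  # characters of prompt plus poem so far

    def __call__(self, input_ids, input_logits):
        output_logits = input_logits.clone()
        for idx, seq_ids in enumerate(input_ids):
            new_text = self.tokenizer.decode(seq_ids)[self.prompt_len:].lstrip()
            if new_text and not new_text.upper().startswith(self.required_letter):
                output_logits[idx] = -10000  # prune this beam
        return output_logits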

Sample output for tiger:

Boldly, the brave tiger stalks its prey
Owning the forest with its might,
Lurking in the shadows, waiting to pounce,
Daring to be the king of the night

Considerations

Sequence selection is enough as long as masking always leaves at least one valid continuation. Sequence regeneration is the fallback when masking can wipe out every candidate.

The alternatives are weaker on enforcement but easier to set up. Few-shot prompting (which is essentially Style Transfer) and prompt engineering are cheaper but give you no guarantee. A bigger, more instruction-following model is slower and more expensive. Try-and-try-again is fine when p is high (99th-percentile attempts is just 2 at p = 0.9). Reflection (Pattern 18) feeds error messages back to the model and reduces retries. Grammar (Pattern 2) is the right answer when the rules can be expressed in a standard form, because then the provider can do Logits Masking server-side.

A useful extension is autocomplete. Hold the entire document in context and use Logits Masking to provide search-style autocomplete grounded in your documents. Cache the document so the input tokens don't keep accumulating cost.

The big caveat is access to logits. As of June 2025, OpenAI exposes logprobs broadly, Gemini Flash supports responseLogprobs (Pro doesn't), Llama is the most permissive but requires self-hosting, and Anthropic's Claude doesn't expose logits at all. So this pattern restricts your model choice. Each beam also requires a round-trip between client and model; the added latency makes this mostly viable for locally hosted or colocated models. And if no candidate sequence meets the rules, you may have to refuse. The AI engineer's job is to provide enough context in the prompt that refusals stay rare.

In RL the same idea is called invalid action masking (Vinyals et al. 2019, on StarCraft II; theoretical justification by Huang and Ontañón 2020). Self-Check (Pattern 31) is another pattern that reads logits.


Pattern 2 — Grammar

Grammar enforces style rules that can be expressed as a context-free metasyntax, typically when output has to fit a specific data schema or standard format.

Problem

You want LLM output in a fixed format: a comma-separated list, a JSON document, a syntactically valid SQL statement. The naive approach is to state the format in the prompt and provide examples. That's brittle, breaks across model versions, is stochastic, and forces every consumer to defend against malformed responses.

Solution

The model framework restricts the next token to ones the grammar allows by zeroing out logits of disallowed tokens. In other words, the framework does Logits Masking on your behalf.

There are three ways to set this up.

Option 1: Grammar-constrained logits processor

Provide a formal grammar in Backus–Naur form (BNF). Almost every formal format and programming language has a published BNF.

grammar_str = """
timestamp_literal ::=
    { t 'yyyy-mm-dd hh:mi:ss' } | 'date_literal time_literal'

date_literal ::=
    { d 'yyyy-mm-dd' }
    | mm-dd-yyyy | mm/dd/yyyy | mm-dd-yy | mm/dd/yy | yyyy-mm-dd
    | yyyy/mm/dd | dd-mon-yyyy | dd/mon/yyyy | dd-mon-yy | dd/mon/yy

time_literal ::=
    { t 'hh:mi:ss' } | hh:mi:ss[:mls]
"""

# Assuming the transformers-cfg package:
from transformers_cfg.grammar_utils import IncrementalGrammarConstraint
from transformers_cfg.generation.logits_process import GrammarConstrainedLogitsProcessor

grammar = IncrementalGrammarConstraint(grammar_str,
                                       "timestamp_literal",
                                       pipe.tokenizer)
grammar_processor = GrammarConstrainedLogitsProcessor(grammar)

results = pipe(input_message, max_new_tokens=256, do_sample=False,
               logits_processor=[grammar_processor])

You name the root element when constructing the constraint, so a single big spec like full SQL can be reused to extract any sub-rule.
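For example, reusing the same spec to extract only a date is just a matter of naming a different root:

date_grammar = IncrementalGrammarConstraint(grammar_str,
                                            "date_literal",  # different root, same spec
                                            pipe.tokenizer)
date_processor = GrammarConstrainedLogitsProcessor(date_grammar)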

Option 2: Standard data format

Most providers support JSON mode, which is a flag away:

response = client.chat.completions.create(
    model=MODEL_ID,
    messages=input_message,
    response_format={"type": "json_object"}
)

The prompt has to explicitly request JSON so the model emits the right tokens. LangChain's XML parser is not an example of Grammar. It relies on instruction-following with no compliance guarantee.

Option 3: User-specified schema (structured outputs)

Specify the exact JSON shape via JSON Schema or a Python dataclass / Pydantic model:

from dataclasses import dataclass

@dataclass
class LineItem:
    description: str
    quantity: int
    amount: float

@dataclass
class Receipt:
    items: list[LineItem]
    total_amount: float

response = client.models.generate_content(
    model='gemini-2.0-flash',
    contents=[f"Parse the receipt contained in the image", image],
    config={
        'response_mime_type': 'application/json',
        'response_schema': Receipt,
    },
)

import json

def as_receipt(d):
    # object_hook fires for every nested JSON object, so dispatch on keys
    return LineItem(**d) if "description" in d else Receipt(**d)

data_obj = json.loads(response.text, object_hook=as_receipt)

The provider applies Logits Masking server-side by translating the schema to rules.

Examples

Arithmetic expressions

You're building elementary-school software and you want the model to output the expression (number_of_dozens × number_per_dozen = number_of_eggs), not just the answer (36). Pure prompting fails. With grammar:

root  ::= (expr "=" ws term "\n")+
expr  ::= term ([-+*/] term)*
term  ::= ident | num | "(" ws expr ")" ws
ident ::= [a-z] [a-z0-9_]* ws
num   ::= [0-9]+ ws
ws    ::= [ \t\n]*

For "Bill has 3 apples and 2 oranges. Mae has 2 apples and 4 oranges. How many apples do Bill and Mae have in total?" you get back bill_apples + mae_apples = total_apples followed by 3 + 2 = 5.

Asking "do Bill and Mae have more apples than oranges?" is more interesting. The right answer needs >, which the grammar disallows. The model is forced into 3 + 2 = 5 and 2 + 4 = 6, which confirms that the constraint actually constrains.

Pipe separator extraction

Extract author, title, and year separated by |:

record    ::= author separator title separator year
author    ::= [a-zA-Z ]* | unk
title     ::= [a-zA-Z ]* | unk
year      ::= [1-2][0-9][0-9][0-9] | unk
unk       ::= "NULL"
separator ::= "|"

With this grammar, "Love in the Time of Cholera (Spanish: El amor…) by Gabriel García Márquez published in 1985" yields:

Gabriel Garcia Marquez | Love in the Time of Cholera |1985

The accents get stripped because the grammar only allows ASCII letters. For a passage with conflicting dates, the model emits NULL for the year.

JSON mode

def parse_book_info(paragraph):
    system_prompt = """
    You will be given a short paragraph about a book.
    Extract the author, title, and publication year of the book.
    Return the result as JSON with the keys author, title, and year.
    If any piece of information is not found, fill the spot with NULL
    """
    response = client.chat.completions.create(
        model=MODEL_ID,
        messages=[{"role": "developer", "content": system_prompt},
                  {"role": "user", "content": paragraph}],
        response_format={"type": "json_object"}
    )
    return response.choices[0].message.content

JSON mode constrains shape but not values. Accents and non-English titles may or may not appear, so guard at the parser.
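A minimal guard at the parser, assuming the JSON-mode response shape above (the function name is illustrative):

import json

def safe_parse_book_info(raw: str) -> dict:
    """Validate JSON-mode output before trusting it downstream."""
    data = json.loads(raw)            # raises ValueError if malformed
    for key in ("author", "title", "year"):
        data.setdefault(key, "NULL")  # fall back to the schema's unknown marker
    return data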

Structured outputs via dataclass

from dataclasses import dataclass
from enum import Enum
from pydantic_ai import Agent

class CurrencyEnum(str, Enum):
    USD = 'USD'
    UKP = 'UKP'
    INR = 'INR'
    EUR = 'EUR'

@dataclass
class Invoice:
    purpose: str
    amount: float
    currency: CurrencyEnum = CurrencyEnum.USD

agent = Agent(model, result_type=Invoice, system_prompt=system_prompt)
response = agent.run_sync("Requesting reimbursement for taxi ride to airport. "
                          "I paid $32.30.")
# Invoice(purpose='taxi ride to airport', amount=32.3,
#         currency=<CurrencyEnum.USD: 'USD'>)

You're guaranteed to get back an Invoice. Don't beg an LLM for compliance. Instead of "Please answer YES or NO, in all caps", write:

from typing import Literal
agent = Agent(model, result_type=Literal["YES", "NO"])

Considerations

BNF vs. Pydantic

Aspect | BNF | Pydantic / dataclass
Ease of use | Harder | Just a class
Latency | Client-side via Logits Masking | Server-side, fewer round-trips
Model support | Needs logprob access | Universal
Flexibility | More expressive (e.g., credit-card validation rules) | Limited beyond Enum

Example BNF for US credit cards, which is almost impossible to express in pure Pydantic:

<credit_card_number> ::= <visa_number> | <mc_number> | <amex_number>
<visa_number> ::= "4" <digit>{12,15}
<mc_number>   ::= ("51".."55" <digit>{14}) | ("2221".."2720" <digit>{12})
<amex_number> ::= "34" <digit>{13} | "37" <digit>{13}
<digit>       ::= "0" | "1" | ... | "9"

JavaScript developers can use Zod in place of Pydantic with Ollama or OpenAI.

When to choose Logits Masking instead

Reach for Logits Masking when the rules are pure logic rather than representation, or when the BNF gets too complex to debug. Pure logic shows up when the rule depends on the content (mask competitor in launch content but not in in-market content), when the rules come from a database or rules engine and vary by client, when masking depends on user input (autocomplete is the classic example), or when checking the rule requires invoking an external tool.

Failure modes

There are three to watch for: endless whitespace, where the grammar's whitespace token is always allowed and the model loops on it; increased refusals, where nested fields or long structures push the model into corners with no valid continuation; and inaccurate results from an overly restrictive grammar. The fix for that last one is to provide an escape hatch: currency_rate: float | Literal["Unknown"].

Grammar is sometimes called structured outputs or constrained decoding, but be careful: not every "structured outputs" implementation actually uses constrained decoding. LangGraph (June 2025) does an extra postprocessing LLM call instead, which is wasteful and less reliable. Anthropic's Claude (July 2025) similarly doesn't appear to use constrained decoding, and recommends prefilling the response with the start of the desired format.


Pattern 3 — Style Transfer

Style Transfer teaches a model to convert content from a readily available form into a desired style by showing it example input/output pairs.

Problem

The pattern fits when you have content you want to restyle, when the style is hard to articulate as rules but easy to recognize, and when you have at least a handful of hand-converted examples. Common situations: academic papers becoming blog posts, generic content getting brand-specific styling, LinkedIn posts getting reformatted for X or BlueSky or Instagram, technical docs being retargeted at beginner / intermediate / expert audiences, or rough notes being polished into professional emails.

A zero-shot prompt to convert notes to a professional email returns generic, overly formal text with arbitrary placeholders ([Your Name] one time, [Name] the next), which breaks any downstream substitution.

Solution

Two approaches.

Option 1: Few-shot learning

Add 1 to 10 input/output pairs to the prompt:

def generate_text(input_text):
    in_context_examples = [{
        "input_text": "The movie was fantastic!",
        "output_text": "The cinematography was exceptional, with masterful "
                       "use of light and shadow to convey emotional depth.",
    }, ...]

    # Build a few-shot prompt from the input/output pairs.
    prompt = "Convert the following text into the following style:\n\n"
    for ex in in_context_examples:
        prompt += f"\nInput: {ex['input_text']}\nOutput: {ex['output_text']}\n"
    prompt += f"\nInput: {input_text}\nOutput:\n"
    return prompt

Option 2: Model fine-tuning

Fine-tune on roughly 100 to 1000+ pairs. The win is higher fidelity (you can do harder restyling like paper-to-brochure with extensive vocabulary remapping) and faster, cheaper inference because the prompt drops to the bare minimum (no examples). The cost is data curation and governance, recurring training costs every time the desired style changes, training expertise to pick a learning rate that doesn't trigger catastrophic forgetting, and ops expertise to host the resulting model.

OpenAI fine-tuning:

training_file = client.files.create(
    file=open("fine_tuning_dataset.jsonl", "rb"),
    purpose="fine-tune"
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo"
)

import time

# Poll until the fine-tuning job finishes.
while True:
    job_status = client.fine_tuning.jobs.retrieve(job.id)
    if job_status.status in ['succeeded', 'failed']:
        break
    time.sleep(30)   # avoid hammering the API while polling

completion = client.chat.completions.create(
    model=job_status.fine_tuned_model,
    messages=messages
)

OpenAI hosts the fine-tuned model, which takes the operational burden off you.

Example: Style Transfer in Images

You can transfer Caspar David Friedrich's Wanderer Above the Sea of Fog style to a Star Wars subject.

The trick is Stable Diffusion + ControlNet with a depth map as the control image, which preserves the spatial layout (foreground figure, distant horizon):

pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
    use_safetensors=True
)

depth_estimator = pipeline("depth-estimation")
depth_image = depth_estimator(image)["depth"]
wanderer_depth_map = ...unsqueeze(0).half().to("cuda")

prompt = "Star Wars' Darth Vader with a red light saber"
output = pipe(
    prompt,
    image=wanderer_image,
    control_image=wanderer_depth_map,
    ...
).images[0]

The output reuses the original's pose and perspective (the wanderer's cane lands as a lightsaber in the same position) while the model sneaks in some creative flourishes (spaceships on the horizon). Adjusting the relative weights controls how much creativity is allowed.

Considerations

Style Transfer is much simpler than Logits Masking or Grammar but offers no strict enforcement. Fine-tuning has a higher success rate but is still implicit. Use Logits Masking or Grammar when conformance is not optional.

A few things drive success rate. Bigger models generalize from few examples better, at the cost of more expense and latency. Context limits are real: too many examples eat the window, dilute the message, and start contradicting each other. And inference speed degrades as prompts grow because the attention pass takes longer, so for real-time UX a fine-tuned shorter prompt usually wins.

If you keep hitting context-window limits, consider context engineering, picking only the best examples, or Adapter Tuning (Pattern 15, Chapter 5).

Style transfer of images was introduced by Gatys, Ecker, and Bethge (2015) and reviewed by Jing et al. (2018). Text style transfer with LLMs is in Reif et al. (2021).


Pattern 4 — Reverse Neutralization

Reverse Neutralization generates content in a desired style when all you have are examples of the desired style, with no input/output pairs.

Problem

You want a complaint letter to Lufthansa in your personal style, but you've never written one before. Or in legalese specific to Tamil Nadu, but your firm has no Lufthansa-related notices yet. Zero-shot is out. Style Transfer is out because there are no input/output pairs to learn from, only desired-output examples.

Solution

Insert an intermediate neutral form that the foundational model can produce easily.

Building the fine-tuned model (training time)

Take your styled examples and ask the foundational model to rephrase them into a neutral form ("professional emails between executives" or "freshman college reading level"). Then flip the dataset: the neutral version is the input, the original styled version is the target output. Fine-tune the base model on this reversed dataset.

Inference

At inference time, the foundational model generates content in the neutral form from a generic prompt, and the fine-tuned model converts the neutral content into the desired style.
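Wired together, inference is just two calls; a sketch assuming an OpenAI-style client, where FINE_TUNED_MODEL_ID is a hypothetical name for the model produced at training time:

def generate_styled(generic_prompt):
    # Stage 1: the foundational model drafts the content in the neutral form.
    neutral = client.chat.completions.create(
        model=MODEL_ID,
        messages=[{"role": "user", "content": generic_prompt}],
    ).choices[0].message.content

    # Stage 2: the fine-tuned model restyles the neutral draft.
    styled = client.chat.completions.create(
        model=FINE_TUNED_MODEL_ID,
        messages=[{"role": "user", "content": neutral}],
    ).choices[0].message.content
    return styled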

Step one is to neutralize an existing legal notice:

messages = [
  {'role': 'system',
   'content': "You are a helpful assistant who will convert the given text "
              "into text that is understandable by a freshman college student."},
  {'role': 'user',
   'content': "Neutralize the tone and style from the following legal text "
              "and express it for a nonlegal audience: 'The plaintiff hereby "
              "moves for summary judgment pursuant to Rule 56(c)…'"}
]

Output:

The person suing is asking the court to decide the case without a full trial.
They say that the other person broke the contract listed in Exhibit A, and they
should get money to make up for it.

The model strips the legalese (plaintiff becomes person, material breach becomes broke the contract) while keeping the meaning. Reverse this for the fine-tuning data: input is plain English, output is legalese.

Example 2: Personal Style

This time the training data is highly stylized personal emails, complete with emoji and exclamation points.

Step one, neutralize:

Subject: Welcome to the Team
Body: Hi Emily,
I would like to extend a warm welcome to you as a new member of the Customer
Success team. We are looking forward to your contributions...

Step two, build the dataset by flipping inputs and outputs into JSONL:

{"messages": [
  {"role": "system",
   "content": "You are a helpful assistant converting the neutralized email into personalized email."},
  {"role": "user", "content": "<neutral version>"},
  {"role": "assistant", "content": "<original styled version with emojis>"}
]}

100 to 1000 examples is the typical range. The example uses 200.

Step three, fine-tune via OpenAI's managed service or any framework from Chapter 1 / Pattern 16.

Step four, run inference. The foundational model writes a neutral letter to Gretl about a presentation, and the fine-tuned model rewrites it in the personal style:

Subject: 🎉 Exciting Opportunity: Unleash Your Marketing Magic at the 2026
FIFA World Cup! ⚽

Hi Gretl!
I hope this message finds you in fantastic spirits! 🌟 I am absolutely thrilled
to invite you to present...

Considerations

The neutral form has to be repeatable. Different LLMs interpret "neutral" differently, so use cosine similarity between embeddings of the original and the neutral text as a sanity check. Watch for loss of information when style is intertwined with content, and watch for over-neutralization that strips the text of clarity entirely. Experiment with multiple neutral forms before picking one.
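One way to run that sanity check, sketched with the sentence-transformers library; the model name and 0.8 threshold are illustrative:

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def meaning_preserved(original, neutral, threshold=0.8):
    """Flag neutralizations that drift too far from the original meaning."""
    emb = embedder.encode([original, neutral], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold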

On the dataset side, use varied, high-quality input examples that cover the application's full scope. Apply NLP techniques like topic modeling to verify diversity at both the styled and neutral stages. Skewed distributions cause fine-tuning failures.

The neutralization step on its own is useful beyond this pattern. It supports privacy and bias removal (strip identifiable style) and content standardization (uniform tone across many authors). The pattern is analogous to back translation (Beddiar, Jahan, Oussalah 2021; Edunov et al. 2018) used in machine translation to expand datasets.


Pattern 5 — Content Optimization

Content Optimization uses preference tuning to produce optimal content even when you can't articulate what makes content "good."

Problem

Traditional A/B testing only works when you have a hypothesis about what differentiates the two styles.

Without a hypothesis, you hit three walls. Indistinguishable sets, with nothing to vary. Indeterminate tests, with no statistically significant signal. And inability to use the results: even if you spot a difference, you can't reproduce the winning style.

Solution

Reframe each wall. Instead of distinguishable sets, compare two pieces at a time and define winners pairwise. Instead of statistically significant tests, drop the per-test threshold and just collect lots of pairs. Instead of changing the prompt, change the model: tune the LLM via DPO so it produces content like the winners.

Four steps: generate pairs, pick a winner, create a preference dataset, preference-tune the LLM.

Step 1: Generate pairs from the same prompt

Three sub-techniques. Repeated generation uses the same prompt with nonzero temperature, top-K > 1, and no caching. Mistral-7B-Instruct on "Where does the term 'knee-jerk reaction' come from?" gives back two distinctly different styles, one layperson-targeted and one assuming anatomy knowledge. Changing generation settings randomizes temperature or top-P:

import random

paired_content = []
for _ in range(2):
    # Randomize the temperature so the two generations differ in style.
    response = pipe(input_message,
                    do_sample=True,
                    temperature=random.uniform(0.2, 0.9))
    paired_content.append(response[0]['generated_text'][-1]['content'])

Prompt rewriting asks the LLM to rephrase the prompt without changing intent (more concise, more verbose, add a follow-on question). Adding a follow-on question in particular tends to yield substantively different responses.
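A sketch of the prompt-rewriting variant; the rewrite instruction wording is an assumption:

import random

def rewrite_prompt(prompt):
    """Ask the model to rephrase the prompt without changing its intent."""
    instruction = random.choice([
        "Rewrite this prompt more concisely",
        "Rewrite this prompt more verbosely",
        "Rewrite this prompt and add a follow-on question",
    ])
    response = pipe([{"role": "user",
                      "content": f"{instruction}, keeping the intent "
                                 f"unchanged:\n{prompt}"}])
    return response[0]['generated_text'][-1]['content']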

Step 2: Compare and pick a winner

Three variants here. Human labeling shows two pieces to an expert, takes a majority vote across a panel, or shows them as drafts and watches which users select. Using an evaluator runs an industry rubric (the 4Ps and 3Cs for marketing, for example) or runs both pieces against a programmatic check (for SQL generation, run both queries against an in-memory database and prefer the correct one, breaking ties on conciseness then speed). Or use LLM-as-judge: hand both pieces to a frontier model that already knows the framework and ask which is better.

Example: Gemini 2.0 Flash rated some Amazon marketing copy on the 4Ps + 3Cs rubric and gave a 6, with the reasoning "…could be improved by including more information about the price and availability of the devices."

The third variant uses real downstream behavior. Direct measurement splits your audience and measures clicks, signatures, or read time. Matching prompts pairs semantically similar user queries and compares resolution time.

Step 2 is the most important step of the four. The choice of variant matters a lot. The choice of reward function matters even more, and you have to verify that it agrees with intuition. Watch out for metric gaming: optimizing for engagement time can quietly reward hard-to-read content. Distinguish between the objective (the true business goal, which is often unmeasurable) and the metric (a proxy for the objective). Always interpret metrics in terms of the objective, not the other way around.

Step 3: Create the preference dataset

{
  "prompt": "Where does the term \"knee-jerk reaction\" come from?",
  "chosen": "The term 'knee-jerk reaction' refers to an immediate, often unreflective response to a stimulus...",
  "rejected": "The term 'knee-jerk reaction' comes from the medical reflex test where the knee jerks up..."
}

Optionally split into train/eval if you're doing early stopping.

Step 4: Preference tuning with DPO

Use Direct Preference Optimization (Rafailov et al. 2023) via the TRL package. DPO is much faster than RLHF.

MODEL_ID = "Qwen/Qwen2-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

training_args = DPOConfig(output_dir="Qwen-DPO")
trainer = DPOTrainer(model=model, args=training_args,
                     processing_class=tokenizer,
                     train_dataset=train_dataset)
trainer.train()
trainer.save_model(training_args.output_dir)

Example: Classified Ads

The goal is a small Qwen2 0.5B model that writes good neighborhood-marketplace classified ads. Zero-shot fails: Qwen2 produces generic manufacturer-style copy ("…fits beginners or seasoned riders alike") when prompted to sell a 3-year-old used bike for $300.

Step one, generate two ads at random temperatures.

Step two, LLM-as-judge with Gemini 2.0 Flash, evaluating against five criteria (clarity, audience, brevity, contact info, truthfulness):

@dataclass
class AdsComparison:
    ad_a_is_better_than_ad_b: bool
    reasoning: str

Gemini's verdict on the bike pair: ad_b wins because it includes the price and targets adults. Both ads falsely claim a lifetime warranty, so neither is great, but the comparison still produces useful gradients.

Step three, assemble preferences:

def create_preference_example(item, price):
    # Generate two candidate ads, then let the judge pick the winner.
    ad1 = create_classified_ad(item, price)
    ad2 = create_classified_ad(item, price)
    score = score_ad(ad1, ad2)
    pe = {"prompt": SYSTEM_PROMPT +
          f"Write an ad to sell a {item} priced at {price}"}
    if score.ad_a_is_better_than_ad_b:
        pe['chosen'], pe['rejected'] = ad1, ad2
    else:
        pe['chosen'], pe['rejected'] = ad2, ad1
    pe['score_reason'] = score.reasoning
    return pe

Step four, load JSONL and run DPO. Roughly 3 minutes on 8 vCPU + L4 GPU for 100 examples.

Inference with the tuned model gives clean, persuasive output:

Pachinko, the classic tale of a man's obsession with gambling and his love for
a woman he meets while playing a pachinko game. A rare edition priced at $5.
For more information or to arrange pickup, please contact [Your Name] at
[Your Phone Number]. Thank you!

Considerations

Choosing variants

Content type | Pick winner via
Open-ended user-facing | Human labeling
Behavior-driving content | Outcome metric (with frustration weighting)
Consumed by automation | Evaluator (does the code compile? run fast?)

Use what you already have: defined metrics, rubrics, UI choice presentation, query logs.

In-distribution requirement

The LLM you preference-tune has to be capable of producing the chosen content in its original distribution. The easy way is to use the same model for content generation and tuning, like the example does with Qwen2-0.5B. The hard way is to generate content with a bigger model, then SFT-tune the smaller target before doing DPO. The Qwen2 ads weren't good, but they were ads, which was enough.

Extension to images

Use DiffusionDPO to preference-tune image diffusion models: generate two images per prompt, evaluate by downstream behavior (article click-through, say), and run the DPO training script:

accelerate launch --mixed_precision="fp16" train.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --output_dir="tmp-sd15"

Continuous improvement

DPO needs few high-quality examples; quality beats quantity. Improve by upgrading the evaluator (this matters most), by diversifying prompts (log production prompts and feedback, add subpar-response prompts to the training set, topic-model to find outliers, route bug intake into the training set), and by iterating: once the model is trained, repeat steps 1 through 4 with the new model. Gains eventually plateau, but you can push to the ceiling of the model size.

DeepSeek-R1 (early 2025) used iterative DPO on synthetic data for easily verifiable problems, which is the same "aha moment" idea.

The required ingredients are a fast, high-quality evaluator, systematic prompt collection, and training until saturation.

Preference tuning was originally RLHF (Christiano et al. 2017; Ouyang et al. 2022 for LLMs). DPO is from Rafailov et al. 2023.


Summary

Pattern | Problem | Solution | When to use
Logits Masking (1) | Generated text must follow style rules for brand / accuracy / compliance | Intercept sampling and zero out nonconforming continuations | Banning brand words, dynamic rules, autocomplete grounded in docs
Grammar (2) | Output must conform to a format / data schema | BNF or schema/dataclass passed to the model framework, which masks logits | SQL timestamps, JSON extraction, structured outputs
Style Transfer (3) | Convert content to a tone/style that's hard to express as rules but easy to demonstrate | Few-shot examples or fine-tuning on input/output pairs | Brand rewrites, paper to blog, social-media reformatting, image style transfer
Reverse Neutralization (4) | Generate content in a style when only desired-style examples exist (no input/output pairs) | Generate neutral form via foundational model; fine-tune to convert neutral to desired style | Local legalese, personal communication style
Content Optimization (5) | Optimize for "best" style without knowing what factors matter | Generate pairs, evaluator picks winner, preference dataset, DPO | Ad copy, marketing content, anywhere outcome metrics exist

The five patterns trade off enforcement against ease of use. Logits Masking gives strict dynamic enforcement but needs logit access, which often forces locally hosted models. Grammar gives a universal structured-output approach via Pydantic or dataclass, with BNF as the escape hatch when validation logic is needed. Style Transfer is the simplest to set up and the weakest at enforcement. Reverse Neutralization solves the no-input-pair problem via an intermediate neutral form. Content Optimization sidesteps the "what makes good?" question entirely by tuning the model on outcomes.
