Generative AI Design Patterns by Valliappa Lakshmanan & Hannes Hapke

Chapter 6: Improving Reliability

Introduction

Foundation models are stochastic: the same input gives different outputs, and the same prompt behaves differently across model versions. This chapter's four patterns attack that unreliability from different angles. LLM-as-Judge (Pattern 17) gives you a scalable way to evaluate generations against custom rubrics. Reflection (Pattern 18) feeds critiques back into the model to fix earlier outputs. Dependency Injection (Pattern 19) borrows the classic software pattern so you can develop and test each step of a chain independently. Prompt Optimization (Pattern 20) systematically generates and selects prompts so that updates to the underlying model don't force a manual prompt-engineering rebuild.


Pattern 17 — LLM-as-Judge

LLM-as-Judge uses an LLM to evaluate the quality of GenAI outputs against a scoring rubric. It's the middle ground between fully automated metrics and human evaluation.

Problem

A lot of patterns in this book need an evaluation step (Content Optimization, Node Postprocessing, Tree of Thoughts, Evol-Instruct). The traditional answers all fail in different ways. Outcome measurement is the gold standard but hard to attribute (sales depend on too many factors), so people fall back on proxies (engagement, clicks) and multi-armed bandits. Human evaluation is the next best, but it doesn't scale: it's expensive, time-consuming, biased, and limited by rater availability. Automated metrics like BLEU and ROUGE measure n-gram overlap with a reference; they don't capture semantics, nuance, or factual correctness, and you can't customize a BLEU score for marketing-content quality.

What we want is something fast, scalable, customizable, and a good proxy for outcomes, without having to deploy the content to find out.

Solution

Three options.

Option 1: Prompting

Build a custom rubric and ask an LLM to apply it. Set temperature=0 and cache aggressively for repeatability:

Given an article and a summary, provide a score in the range of 1-5 for each:
- Factual accuracy
- Completeness of key points
- Conciseness
- Clarity

For each score, provide a brief justification.

Calibrate by spelling out the rubric:

- Factual accuracy
  * 1 if any information in the summary misrepresents the article
  * 5 if all statements in the summary are grounded in the article
- Completeness of key points
  * 1 if multiple high-impact points are missing
  * 3 if all the major points are present
  * 5 if all the major points are present and the more important points receive more coverage
...

Three design questions to settle up front. What are you evaluating: a single piece of content, a pairwise comparison, or a ranking against a reference? How will scores be used: binary, numeric, ranking, or categorical? Downstream decisions usually need a binary score. How are humans involved: do you need explanations, or conversational cross-examination of the judge?

Option 2: ML

Combine LLM-as-Judge scores into a single outcome-predicting score using a trained ML classifier. Create a scoring rubric for prompting, collect historical content paired with real outcomes (from CRM, did the customer purchase within 30 days?), score the historical content with LLM-as-Judge, and train a classifier on (rubric scores → outcome). The ML model discounts criteria that don't matter or are inconsistent, as long as you don't overfit.
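A minimal sketch of that classifier step, assuming scikit-learn and judge-scored items already joined with CRM outcomes; the field names are illustrative:

# Sketch: train a classifier that maps LLM-as-Judge rubric scores to a real
# outcome (here, "purchased within 30 days" pulled from the CRM).
# Field names and data layout are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

CRITERIA = ["accuracy", "completeness", "conciseness", "clarity"]

def train_outcome_scorer(scored_content: list[dict]) -> LogisticRegression:
    X = np.array([[item[c] for c in CRITERIA] for item in scored_content])
    y = np.array([item["purchased"] for item in scored_content])
    # The learned weights discount rubric criteria that don't predict the outcome.
    return LogisticRegression().fit(X, y)

def outcome_score(model: LogisticRegression, judge_scores: dict) -> float:
    x = np.array([[judge_scores[c] for c in CRITERIA]])
    return model.predict_proba(x)[0, 1]   # probability of a purchase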

Option 3: Fine-tuning

When rubrics are hard to articulate ("persuasive marketing"), have human experts score content with the same rubric you intend, then Adapter-tune (Pattern 15) an LLM to mimic their scores. Useful when you need the LLM to apply a domain standard like medical diagnostic checklists.

Example: Voter Pamphlet Arguments

Evaluate persuasiveness of arguments for and against Washington State's I-1491 (gun safety initiative). Outcome measurement is confounded by partisanship, human evaluation is biased on gun topics, and BLEU/ROUGE doesn't measure persuasion. Best fit: prompting.

Rubric drawn from voter-education research:

- Centers the voter: easy to understand impact at different socioeconomic / education levels
- Pyramid information: most essential info first, details last
- Understandable: plain language, simple sentences, minimizes jargon
- Clarity: clear call to action
- Caters to undecided: endorsements from neutrals, comparisons to alternatives

Embed the rubric in the prompt and run it on each argument with GPT-4o mini.
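A minimal sketch of that judge call, assuming the OpenAI Python SDK's structured-output parsing; the schema fields mirror the rubric and the exact names are illustrative:

# Sketch: embed the rubric in the prompt and ask GPT-4o mini for structured
# 1-5 scores per criterion. Field names are illustrative.
from pydantic import BaseModel
from openai import OpenAI

class ArgumentScores(BaseModel):
    centers_the_voter: int
    pyramids_information: int
    understandable: int
    clear_call_to_action: int
    caters_to_undecided: int
    justification: str

client = OpenAI()

def judge(rubric: str, argument: str) -> ArgumentScores:
    resp = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        temperature=0,                      # repeatability, as advised above
        messages=[
            {"role": "system", "content": rubric},
            {"role": "user", "content": argument},
        ],
        response_format=ArgumentScores,
    )
    return resp.choices[0].message.parsed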

Considerations

Inconsistency

Even at temperature=0, you want similar inputs to receive similar scores. Three things help. Coarse scores: 1 to 5 is OK, 1 to 100 is much harder, and binary yes/no is best of all. Multiple criteria instead of one aggregate score, which is a form of CoT, with few-shot examples to calibrate. And multiple evaluations: multiple LLMs as different stakeholders ("LLM-as-jury"), combined with binary scoring, which gives you "polling".
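A sketch of the jury/polling idea; ask_judge() is a hypothetical helper that poses a yes/no rubric question to one model, and the model names are illustrative:

# "LLM-as-jury" polling (sketch): several judge models answer the same binary
# rubric question, and the score is the fraction of "yes" votes.
# ask_judge() is a hypothetical helper; the model names are illustrative.
JUDGES = ["gpt-4o-mini", "claude-3-7-sonnet-latest", "gemini-2.0-flash"]

def poll(question: str, content: str) -> float:
    votes = [ask_judge(model, question, content) == "yes" for model in JUDGES]
    return sum(votes) / len(votes)   # e.g., 2 of 3 judges say yes -> 0.67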

Leniency

LLMs grade like a generous professor. Mitigations include direct comparison instead of separate scoring (as in Pattern 5, Content Optimization), group rewards (DeepSeek's GRPO normalizes each response's score against its group's average), and lowered expectations: use scores only to flag clear failures, where the gap is dramatic (roughly 0 when the context is lost vs. 0.95 when it isn't).
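A minimal sketch of group-relative scoring in that spirit (not DeepSeek's exact formulation): judge several candidates for the same prompt and normalize each score against the group, so a uniformly generous judge cancels out.

import statistics

def group_relative(scores: list[float]) -> list[float]:
    # Normalize each candidate's judge score by the group mean and spread,
    # so only relative quality within the group matters.
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores) or 1.0   # avoid dividing by zero
    return [(s - mean) / std for s in scores]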

Bias

Three biases to watch for. Self-bias: LLMs favor their own outputs, so use a different LLM to evaluate than to generate. Length bias: judges favor longer responses even when the content is the same. Positional bias: they favor information at the start or end of the input and miss the middle.

LLMs prefer well-written (low-perplexity) text, even if it's inaccurate, over badly written but accurate text. Consider fine-tuned small models like PandaLM or task-specific evaluators like PatronusAI.

Caveats

Asking for a score and an explanation may degrade evaluation performance, since self-explanation introduces bias. If you don't need explanations, take the probability-weighted mean score of the score distribution instead of argmax.
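A sketch of the probability-weighted score, assuming an OpenAI-style chat API that exposes top_logprobs and a judge prompted to answer with a single digit from 1 to 5 as its first token:

import math
from openai import OpenAI

client = OpenAI()

def expected_score(judge_prompt: str) -> float:
    """Probability-weighted mean of a 1-5 score, instead of taking the argmax token."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": judge_prompt}],
        max_tokens=1,          # the judge is prompted to answer with one digit
        logprobs=True,
        top_logprobs=10,
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    weighted, total = 0.0, 0.0
    for cand in top:
        token = cand.token.strip()
        if token in {"1", "2", "3", "4", "5"}:
            p = math.exp(cand.logprob)
            weighted += p * int(token)
            total += p
    return weighted / total if total else float("nan")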

When evaluating, true outcomes beat KPIs beat LLM-as-Judge. Pick rubric criteria that proxy real outcomes or business KPIs.

References: Bai et al. 2023 (Language-Model-as-Examiner); Shankar et al. 2024; Balog/Metzler/Qin 2025; Gu et al. 2025; Hamel Husain 2024; Eugene Yan 2025.


Pattern 18 — Reflection

Reflection is an agentic pattern: invoke the LLM, evaluate the response, modify the prompt with the critique, regenerate. The "self" in self-reflection is misleading. The evaluator can be a different LLM, an external tool, or a human.

Problem

In a chat UI, you give follow-up feedback and the model corrects. In a stateless API, every call is independent. How do you generate the critique automatically and feed it back?

Solution

The pipeline is straightforward. Call the LLM with the user prompt. Send the response to an evaluator (LLM-as-Judge, an external tool, or a human) that returns a critique, not just a score. Use the critique to construct a modified prompt. Regenerate, and loop until the response clears a quality threshold.
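A minimal sketch of that loop; generate(), evaluate(), and QUALITY_THRESHOLD stand in for whatever the application actually uses:

# Sketch of a Reflection loop. generate() calls the LLM, evaluate() returns a
# (score, critique) pair; both are placeholders for the application's own pieces.
MAX_ATTEMPTS = 3

def reflect(user_prompt: str) -> str:
    prompt = user_prompt
    response = generate(prompt)
    for _ in range(MAX_ATTEMPTS):
        score, critique = evaluate(user_prompt, response)
        if score >= QUALITY_THRESHOLD:
            break
        # Feed the critique back in a modified prompt and regenerate.
        prompt = (f"{user_prompt}\n\nA previous attempt received this critique:\n"
                  f"{critique}\nAddress it in your answer.")
        response = generate(prompt)
    return response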

Two variants worth knowing. Maximum attempts caps the loop (typically at one retry), which avoids needing a "good enough" threshold. Conversational state treats the evaluator as a participant and appends its messages to the conversation history; the AutoGen example interleaves Reviewer, Coder, and User messages.

Example: Logo Design

Evaluator

Build with LLM-as-Judge using Claude 3.7 Sonnet:

Analyze the following proposed logo for {company}.
{company_description}
Score the logo 1-5 on each of the following criteria:
- Clear what the company name is
- Logo and image appropriate for what the company does
- Doesn't conflict with well-known brands or competitors
- Streamlined and clean design
- Stands out and is easy to recognize
Explain your scores.

For Pydantic's logo, the critique notes: "The geometric pyramid/triangle symbol… suggests structure, validation, and frameworks - all relevant to Pydantic's core business… However, it doesn't specifically reference their AI agent framework or Logfire observability platform, which is why I didn't give a perfect score."

Logo designer

Use Gemini 2.0 (multimodal) to generate an initial logo from a text prompt:
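The initial-generation call isn't shown here; a minimal sketch, assuming the same google-genai client, GEMINI constant, and response handling as the regeneration code later in this section:

# Sketch of the initial generation call, mirroring the regeneration code below.
# client, GEMINI, and the response layout are assumptions carried from that code.
from io import BytesIO
from PIL import Image
from google.genai import types

def design_initial_logo(company, company_description, output_filename):
    prompt = f"Design a logo for {company}. {company_description}"
    response = client.models.generate_content(
        model=GEMINI,
        contents=[prompt],
        config=types.GenerateContentConfig(response_modalities=['TEXT', 'IMAGE'])
    )
    # Save the returned image part to disk.
    for part in response.candidates[0].content.parts:
        if part.inline_data is not None:
            Image.open(BytesIO(part.inline_data.data)).save(output_filename)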

Aggregating five criteria scores:

def score(self) -> float:
    # Criteria are scored 1-5; the weights sum to 100, so dividing by 500
    # normalizes the aggregate to (0, 1].
    return (10 * self.clarity +
            10 * self.appropriateness +
            30 * self.no_conflicts +
            30 * self.clean_design +
            20 * self.easy_to_recognize) / 500.0

The Pydantic logo gets 0.9; the sushi-restaurant logo also gets 0.9. LLM judges are too lenient, so the workaround is to do exactly one round of critique, with no threshold needed.

Apply criticism

Use the following feedback to generate detailed extra instructions to send back
to the designer of the logo.

{critique}

Claude returns refinements like "Consider adding a subtle, unique twist to the nigiri illustration to differentiate it from other sushi restaurants…".

Regenerate

def design_logo(company, company_description, output_filename,
                previous_logo, changes_to_make):
    prompt = f"""
        Here's a proposed logo image for {company}. {company_description}
        Please edit the image and make the following changes.
        Return only the final image after all edits.
        {changes_to_make}
    """
    previous_image = Image.open(previous_logo)
    contents = [prompt, previous_image]

    response = client.models.generate_content(
        model=GEMINI,
        contents=contents,
        config=types.GenerateContentConfig(
            response_modalities=['TEXT', 'IMAGE']
        )
    )
    # Save the edited logo to output_filename (assumes the google-genai response
    # layout used in the initial-generation sketch above).
    for part in response.candidates[0].content.parts:
        if part.inline_data is not None:
            Image.open(BytesIO(part.inline_data.data)).save(output_filename)

The regenerated logo incorporates Claude's suggestions for distinctiveness.

Considerations

Reflection is one of Andrew Ng's four agentic patterns. It matters for three reasons. Quality and robustness: you catch errors before they reach the user, and the critique logs reveal edge cases. It mitigates reasoning limitations through iterative refinement: it's easier to critique and fix a draft than to plan a perfect response up front, especially on novel tasks. And it gives transparency through explicit reasoning traces.

Cost vs. quality

Multiple LLM calls means more latency and cost. Provider availability problems compound: you may already need retries to make the first call land, and Reflection multiplies tail latency.

For code generation, reflection (with compilers or sandboxes) often pays off because broken builds cost more than extra calls. For real-time chatbots or game engines, reflection may add unacceptable latency. Tune reflection depth based on time, problem characteristics, and business impact.

You can also do beam-style reflection: generate multiple drafts, critique all of them, prune poor ones at each step. More expensive, often better quality.

Getting evaluation right

The evaluator is everything. Use a different LLM for evaluation than for generation (we used Gemini for images, Claude for critique) to avoid self-bias. Apply LLM-as-Judge's leniency mitigations.

References: Flavell 1979 (cognitive monitoring); Rabinowitz et al. 2018 (theory of mind RL); Reflexion (Shinn et al. 2023); self-refinement (Madaan et al. 2023); Self-RAG (Asai et al. 2023); Dou et al. 2024 (code bug reduction). Amazon uses Reflection on product listings.


Pattern 19 — Dependency Injection

Dependency Injection makes each step in an LLM chain mockable so you can develop and test components independently.

Problem

Three difficulties of testing GenAI apps. Nondeterminism: same prompt, different outputs. Models change quickly: pinning to a model version forfeits future improvements. And LLM-agnostic code: frameworks like PydanticAI and LangChain make code portable, but prompts aren't portable, so you have to test on multiple LLMs. Chains compound the problem because each step's output is the next step's input.

Example: Improve Marketing Description

Two-step chain (CoT + Reflection): generate critique, then apply one improvement.

Step 1: critique

from dataclasses import dataclass
from typing import List

from pydantic_ai import Agent

@dataclass
class Critique:
    target_audience: List[str]
    improvements: List[str]

def critique(in_text):
    prompt = """
    You are an expert marketer for technology books.
    You will be given the marketing description for a book.
    Identify the target audience by roles (e.g., Data Analyst, Data Engineer).
    Suggest exactly 5 ways that the *marketing description* can be improved...
    Do not suggest improvements to the book itself.
    **Marketing Description**: ...
    """
    agent = Agent(GEMINI, result_type=Critique)
    return agent.run_sync([prompt, in_text]).data

Run on Machine Learning Design Patterns and you get a list of audience roles and 5 improvement suggestions.

Test step 1

def assert_critique(critique):
    assert len(critique.improvements) > 3, "Should have 4+ improvements"
    assert len(critique.target_audience) > 0, "Should have 1+ role"

You can also use LLM-as-Judge for nuanced checks. The suggestion "Include a section on troubleshooting" sounds like a book change, not a description change, and an LLM evaluator can catch that.
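A sketch of such a check, reusing the PydanticAI Agent setup from the critique step; the prompt wording is illustrative:

# Sketch: an LLM-as-Judge test that flags suggestions requiring changes to the
# book rather than to its marketing description.
judge = Agent(GEMINI, result_type=bool)

def assert_description_only(c: Critique):
    for suggestion in c.improvements:
        ok = judge.run_sync([
            "Answer true only if this suggestion can be implemented by editing "
            "the marketing description alone, without changing the book itself:",
            suggestion,
        ]).data
        assert ok, f"Suggestion requires changing the book: {suggestion}"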

Step 2: implement a suggestion

@dataclass
class Improvement:
    change: str
    reason: str
    modified_marketing_description: str

def improve(marketing_text, c):
    prompt = """
    You are a helpful marketing assistant.
    You will be given the marketing description for a book, its target audience,
    and a list of suggested changes.

    Pick one change from the list that best meets these criteria:
    - It does not require changing the book itself, only the marketing description.
    - It will make the book much more appealing to the target audience.
    - It requires only 1-5 lines to be changed in the text...
    """
    # Mirrors the critique step (sketch): pass the description and critique to a
    # structured-output agent and return the chosen Improvement.
    agent = Agent(GEMINI, result_type=Improvement)
    return agent.run_sync([prompt, marketing_text, str(c)]).data

Test step 2

def assert_improvement(improvement, orig_text, c):
    assert improvement.change in c.improvements, "Chosen change not in original list"
    nlines_changed = ...  # difflib
    assert 0 < nlines_changed <= 5, f"{nlines_changed} lines changed, not 1-5"

But to test step 2 you need a Critique object. The naive way calls step 1's LLM:

def improvement_chain(a_text):
    a_critique = critique(a_text)
    improved = improve(a_text, a_critique)
    assert_improvement(improved, a_text, a_critique)

Now step 2 can't be tested independently of step 1.

Solution

Pass each step as a parameter with a default:

from typing import Callable

def improvement_chain(
    in_text: str,
    critique_fn: Callable[[str], Critique] = critique,
    improve_fn: Callable[[str, Critique], Improvement] = improve
) -> Improvement:
    c = critique_fn(in_text)
    assert_critique(c)
    improved = improve_fn(in_text, c)
    assert_improvement(improved, in_text, c)
    return improved

Python's assert is stripped under -O, so you can keep assertions on in dev/test and off in production with no code change. Pytest rewrites assert statements so failures show the values involved.

Mock step 1

def mock_critique(in_text: str) -> Critique:
    return Critique(
        target_audience=["AI Engineers", "Machine Learning Engineers", "Software Engineers"],
        improvements=[
            "Use more precise language to define the problems the book solves.",
            "Add specific examples of how the design patterns have been used to solve real-world problems.",
            "Highlight the benefits of using design patterns...",
            "Emphasize the book's practical approach...",
            "Include testimonials from data scientists...",
        ]
    )

improved = improvement_chain(mldp_text, critique_fn=mock_critique)

Considerations

You can mock objects (abstract classes, inheritance), not just functions. Use idiomatic patterns for your language. Hardcoded mocks get harder as inter-step coupling grows.
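A sketch of object-style injection for the same example: each step implements a small abstract interface, and the test supplies a stub (names are illustrative).

from abc import ABC, abstractmethod

class Critiquer(ABC):
    @abstractmethod
    def critique(self, in_text: str) -> Critique: ...

class LLMCritiquer(Critiquer):
    def critique(self, in_text: str) -> Critique:
        return critique(in_text)          # the real LLM-backed step

class StubCritiquer(Critiquer):
    def critique(self, in_text: str) -> Critique:
        return mock_critique(in_text)     # canned Critique for tests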

Mock external functions too. Network latency and service availability shouldn't slow dev/test.

References: Fowler 2004 on DI in software engineering. As of writing, no GenAI framework natively supports DI. PydanticAI comes closest with prompt and tool injection (but not agent injection).


Pattern 20 — Prompt Optimization

Prompt Optimization uses a framework to generate prompt variations, evaluate them on a dataset, and select the best, so updating dependencies (model version, toolchain) doesn't force manual prompt rewriting.

Problem

Building a GenAI app involves a lot of trial-and-error prompt engineering. When the model is upgraded, all of that manual experimentation has to be redone, and long, detailed prompts are especially brittle. We need a way to update prompts automatically.

Solution

Four components:

- A pipeline of steps: the chain of LLM calls and tools.
- A dataset of input/output pairs (supervised) or inputs only (paired with a fitness evaluator), ranging from one example to thousands.
- An evaluator that automatically scores pipeline outputs, either against references or via LLM-as-Judge.
- An optimizer that generates prompt variations, runs them on the dataset, and returns the optimized pipeline.

The fundamental theorem of software engineering applies: "All problems in computer science can be solved by another level of indirection."

Example: O'Reilly Book Blurb Improvement (DSPy)

Pipeline

import os
from typing import List

import dspy
from pydantic import BaseModel, Field

lm = dspy.LM("claude-3-7-sonnet-latest",
             api_key=os.environ['ANTHROPIC_API_KEY'])
dspy.configure(lm=lm)

class Blurb(BaseModel):
    about_topic: str = Field(description="Why the topic of book is worth learning")
    about_book: str = Field(description="What book contains")
    target_audience: List[str] = Field(description="Roles such as Data Engineer, Data Analyst")
    learning_objectives: List[str] = Field(description="4-6 'You will learn how to ___' objectives")

class BlurbExtraction(dspy.Signature):
    text: str = dspy.InputField(desc="Text from backcover")
    blurb: Blurb = dspy.OutputField(desc="Extracted information")

class BlurbImprovement(dspy.Signature):
    current_cover: Blurb = dspy.InputField(desc="Current information on book")
    about_topic: str = dspy.OutputField(desc="More catchy statement why topic is worth learning")
    about_book: str = dspy.OutputField(desc="More appealing description of book contents")
    target_audience: List[str] = dspy.OutputField(desc="Aspirational roles list. Restrict to top 3.")
    learning_objectives: List[str] = dspy.OutputField(desc="Rephrased to be more appealing. Exactly 6.")

class BlurbPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        self.extract_info = dspy.ChainOfThought(BlurbExtraction)
        self.improve_blurb = dspy.ChainOfThought(BlurbImprovement)

    def forward(self, in_text):
        cover_info = self.extract_info(text=in_text)
        improved_cover = self.improve_blurb(current_cover=cover_info.blurb)
        # make_blurb (not shown) converts the prediction's fields back into a Blurb.
        return cover_info.blurb, make_blurb(improved_cover.toDict())

Evaluator (compare improved vs. original)

LLM-as-Judge with side-by-side comparison, which is more reliable than independent scores:

class BlurbScore(dspy.Signature):
    reference_blurb: Blurb = dspy.InputField()
    blurb_to_evaluate: Blurb = dspy.InputField()
    topic_score: float = dspy.OutputField(desc="-1 to 1: how much more appealing the topic description is...")
    contents_score: float = dspy.OutputField(desc="-1 to 1: how much more appealing the book contents...")
    objectives_score: List[float] = dspy.OutputField(desc="-1 to 1: appealing-ness of each objective...")

def calc_aggregate_score(blurb, p):
    # Topic and contents scores are weighted 10x each; each objective contributes
    # one point, so the aggregate stays in [-1, 1] before the length penalty.
    result = (p.topic_score * 10 +
              p.contents_score * 10 +
              sum(p.objectives_score)) / (20 + len(p.objectives_score))
    # Penalize blurbs that run past the target length, and floor the score at zero.
    num_lines = len(blurb.toMarketingCopy().splitlines())
    if num_lines > TARGET_MAX_LINES:
        result -= 0.1 * (num_lines - TARGET_MAX_LINES)
    return max(result, 0)

Optimization on a single blurb

def score_reward(args, pred):
    orig_blurb, improved_blurb = pred
    # ScorerPipeline (not shown) wraps the BlurbScore judge and aggregate score above.
    scorer = ScorerPipeline()
    return scorer(orig_blurb, improved_blurb)

optimized_pipeline = dspy.BestOfN(
    module=BlurbPipeline(),
    N=10,
    reward_fn=score_reward,
    threshold=0.95
)

10 trials took the score from 0.63 (baseline) to 0.74. The downside is 10 inference runs per blurb.

Few-shot bootstrapping (one inference at runtime)

Pick 3 of 10 examples as few-shot demonstrations, evaluate the prompt on the other 7, and iterate to find the best demo set:

from dspy.teleprompt import BootstrapFewShot

blurbs = [dspy.Example(in_text=b).with_inputs("in_text") for b in blurbs]

optimizer = BootstrapFewShot(metric=evaluate_blurb)
optimized_pipeline = optimizer.compile(BlurbPipeline(), trainset=blurbs)
optimized_pipeline.save("optimized_pipeline", save_program=True)

orig_blurb, optimized_blurb = optimized_pipeline(in_text=mldp_text)

The result is a single prompt that works well across many books. Re-run optimization when the LLM version changes. Your code itself contains no prompts.

Considerations

A simple prompt library (config-file prompts) helps with version control but doesn't solve the core problem. You still re-experiment manually when dependencies change.

Beyond best-of-N and few-shot, DSPy supports fine-tuning the LLM on your pipeline if your dataset is large.

Other Prompt-Optimization frameworks worth knowing: AdalFlow, PromptWizard. PydanticAI is considering it.

Once you have prompt-optimization infrastructure, you also have a path to building an LLM-as-Judge dataset from logged prompts and human feedback, and to fine-tuning a task-tuned LLM from logged prompts.

References: DSPy (Khattab et al. 2023, "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines").


Summary

| Pattern | Problem | Solution | When to use |
| --- | --- | --- | --- |
| LLM-as-Judge (17) | Open-ended evaluation is hard with outcome metrics, humans, or BLEU/ROUGE | Custom rubric scored by an LLM; or ML on rubric outputs; or a fine-tuned scoring LLM | Any pattern that needs evaluation: Content Optimization, ToT, Reflection, Evol-Instruct |
| Reflection (18) | Stateless API; can't iterate on suboptimal responses | Loop: critique, modified prompt, regenerate | Code generation, image generation, complex tasks where iterative refinement helps |
| Dependency Injection (19) | Can't develop or test chain steps independently | Pass step implementations as parameters; mock during dev/test | Any chained LLM application or anything that uses external tools |
| Prompt Optimization (20) | Updating prompts each time the LLM changes is brittle | Framework auto-generates and selects prompts using a dataset and evaluator | Production apps with regular dependency upgrades |

The patterns combine. LLM-as-Judge becomes the evaluator inside Reflection and Prompt Optimization. Dependency Injection makes it possible to swap each step's implementation. Prompt Optimization can use Reflection's critique loop within the optimizer. A reliable production GenAI application typically has a clear evaluator (LLM-as-Judge), Reflection where the cost makes sense, mockable dependencies, and an automated prompt update mechanism, so model upgrades become routine rather than painful.