Generative AI Design Patterns by Valliappa Lakshmanan & Hannes Hapke

Chapter 9: Setting Safeguards

Introduction

GenAI applications are inherently risky. Foundational models are nondeterministic and may hallucinate, generate toxic content, or drift from your application's purpose. The four patterns in this chapter wrap protective layers around LLMs. Template Generation (29) avoids dynamic generation by pregenerating reviewable templates. Assembled Reformat (30) keeps facts intact by splitting content creation into a low-risk data-assembly step and a low-risk reformatting step. Self-Check (31) detects hallucinations by inspecting token probabilities. Guardrails (32) wrap LLM calls with input/output filtering for security, privacy, content moderation, hallucination, and alignment.


Pattern 29 — Template Generation

Template Generation pregenerates content templates that humans review offline. At inference time the application performs only deterministic string replacement, so the output is safe to send without per-message human review.

Problem

You're a tour operator sending personalized thank-you notes after each booking. Volume is thousands per day, way too much for per-note human review. But fully dynamic LLM-generated notes risk including upsells, off-brand language, or worse.

Solution

Use the LLM to generate a small number of templates (one per tour × package × language). Humans review and edit those templates once. At inference time, fill placeholders with deterministic values from the booking record.

Example

Pregenerate 3 destinations × 4 package types × 2 languages = 24 templates:

DESTINATIONS = ["Toledo, Spain", "Avila & Segovia", "Escorial Monastery"]
PACKAGE_TYPES = ["Family", "Individual", "Group", "Singles"]
LANGUAGES = ["English", "Polish"]

for dest in DESTINATIONS:
    for package_type in PACKAGE_TYPES:
        for lang in LANGUAGES:
            template = create_template(dest, package_type, lang)
            db.insert(dest, package_type, lang, template)

The template-generation prompt asks the LLM to use placeholders for runtime substitution:

def create_template(tour_destination, package_type, language):
    prompt = f"""
    You are a tour guide working on behalf of Tours GenAI S.L. Write a
    personalized letter in {language} to a customer who has purchased a
    {package_type} tour package to visit {tour_destination}. ...
    Use [CUSTOMER_NAME] to indicate the place to be replaced by their name and
    [TOUR_GUIDE] to indicate the place to be replaced by your name.
    """
    template = zero_shot(GEMINI, prompt)
    template = human_edit_confirm(template)
    return template

A reviewed English template might be:

Dear [CUSTOMER_NAME],

I'm absolutely thrilled to welcome you to Toledo! I'm [TOUR_GUIDE], and I'll be your guide for your family tour. ...

See you soon, [TOUR_GUIDE]

A native Polish-speaking reviewer can fix subtle issues: Polish has grammatical gender, so the verb following [TOUR_GUIDE] may need a masculine or feminine form, and Polish audiences expect fewer exclamation points. It's easier to fix these once in the template than to review thousands of generated letters.

At inference:

template = db.retrieve(booked_tour.destination, booked_tour.package_type, booked_tour.language)
email_body = (template
    .replace("[CUSTOMER_NAME]", booked_tour.customer_name)
    .replace("[TOUR_GUIDE]", booked_tour.tour_guide.name))

Considerations

This pattern works when the number of templates is tractable. If combinations explode, consider Assembled Reformat (Pattern 30) or Guardrails (Pattern 32). Pair with ML for personalization: pregenerate landing pages and use a propensity model or recommendation engine to select which to show.
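
The pairing with classical ML can be as simple as scoring each pregenerated, reviewed template with a propensity model and serving the winner. A rough sketch (the LandingPageTemplate class, the feature encoding, and propensity_model are illustrative assumptions, not the book's code):

from dataclasses import dataclass

@dataclass
class LandingPageTemplate:
    template_id: str
    body: str              # human-reviewed, pregenerated by the LLM
    features: list[float]  # e.g., tone and offer type, encoded for the model

def select_landing_page(user_features: list[float],
                        templates: list[LandingPageTemplate],
                        propensity_model) -> LandingPageTemplate:
    """Return the reviewed template this user is most likely to respond to."""
    # propensity_model is any scikit-learn-style classifier exposing
    # predict_proba(); it scores P(conversion | user, template).
    scores = [
        propensity_model.predict_proba([user_features + t.features])[0][1]
        for t in templates
    ]
    return max(zip(scores, templates), key=lambda pair: pair[0])[1]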

References: Mail merge (1980s WordStar). LLM-generated templates: Lakshmanan (2024).


Pattern 30 — Assembled Reformat

Assembled Reformat splits content creation into two low-risk steps. First, assemble raw facts using non-LLM or low-risk methods (database, OCR, RAG, Tool Calling, Template Generation). Second, reformat that assembled content with an LLM. Rephrasing or summarizing introduces far fewer inaccuracies than generating from scratch.

Problem

A product catalog page that hallucinates "alkaline" instead of "lithium" for a camera battery is a real liability. Lithium batteries can't go in checked airline baggage. With hundreds of thousands of products, dynamic generation is too risky and per-page human review is infeasible.

Solution

Identify the risk-bearing attributes (battery type, dimensions, warranty period, materials, price). Assemble them via deterministic or low-risk methods. Then ask the LLM to reformat, never to generate the facts themselves.

Example: Product Catalog

from pydantic import Field
from pydantic.dataclasses import dataclass

@dataclass
class CatalogContent:
    part_name: str = Field(description="Common name of part")
    part_id: str = Field(description="unique part id in catalog")
    part_description: str = Field(description="One paragraph description...")
    failure_modes: list[str] = Field(description="list of common reasons why a customer might need to replace this part")
    warranty_period: int = Field(description="number of years that the part is under warranty")
    price: str = Field(description="price of part")

part_name, part_id, warranty_period, and price come from a database. part_description and failure_modes come from extraction over the equipment manual using a low temperature (0.1 or 0).
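
A rough sketch of the assembly step (the lookup_part and extract_fields callables are placeholders for a database query and a structured-extraction call; they are assumptions, not the book's code):

def assemble_catalog_content(part_id: str, manual_text: str,
                             lookup_part, extract_fields) -> CatalogContent:
    # lookup_part: plain database query returning the trusted fields.
    # extract_fields: a temperature~0 structured-extraction call over the
    # manual, so the model paraphrases the source rather than inventing.
    row = lookup_part(part_id)
    extracted = extract_fields(manual_text, ["part_description", "failure_modes"])
    return CatalogContent(
        part_name=row["part_name"],
        part_id=row["part_id"],
        part_description=extracted["part_description"],
        failure_modes=extracted["failure_modes"],
        warranty_period=row["warranty_period"],
        price=row["price"],
    )

The assembled object might look like this: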

CatalogContent(
    part_name='wet_end',
    part_id='X34521PL',
    part_description='The wet end of a paper machine is the section where the paper web is formed...',
    failure_modes=['Web breaks', 'Uneven sheet formation', 'Poor drainage'],
    warranty_period=3,
    price='$23295'
)

Reformat with the LLM:

Write content in Markdown that will go in the Replacement Parts part of the
manufacturer's website. Include a placeholder for an image and include a
description of the image. Optimize the content for SEO. Also make it appealing
to potential buyers.

**Part Information:**
{item}

The output is appealing prose grounded in the three "acceptable" failure modes. It won't invent new failure modes the manufacturer doesn't want publicized.
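
The glue between the two steps is small. A minimal sketch, reusing the prompt above and the document's llm.complete convention (the exact client call is a placeholder; run it at a low temperature):

REFORMAT_PROMPT = """
Write content in Markdown that will go in the Replacement Parts part of the
manufacturer's website. Include a placeholder for an image and include a
description of the image. Optimize the content for SEO. Also make it appealing
to potential buyers.

**Part Information:**
{item}
"""

def reformat_catalog_page(item: CatalogContent) -> str:
    # The LLM only rephrases the assembled facts; it is never asked to invent
    # battery types, prices, warranty periods, or failure modes on its own.
    prompt = REFORMAT_PROMPT.format(item=item)
    return llm.complete(prompt).text  # placeholder LLM client, low temperature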

Considerations

Validate by extracting facts twice (different methods) and comparing. Use Self-Check (Pattern 31) on the reformatted output to catch hallucinations. Use LLM-as-Judge (Pattern 17) to verify that reformatting preserved the source facts. Try Template Generation (Pattern 29) first, since it allows whole-template review and is even safer. Use Assembled Reformat when item count is too large to template. The pattern works for relatively static content (product catalogs). For per-user personalization (marketing landing pages), use Template Generation.

References: Lakshmanan (2024).


Pattern 31 — Self-Check

Self-Check uses token probabilities (logprobs) to detect potential hallucinations.

Problem

Hallucination rates are dropping over time:

Vectara's text-summarization measurements show the best LLM dropping from 1.3% (Dec 2024) to 0.7% (Apr 2025), with the 25th-best dropping from 4.1% to 2.4%. But constrained or complex tasks still hallucinate. Image-to-text field extraction sits at 90 to 97% accuracy, which means 3 to 10% of extracted invoice numbers are hallucinated. Errors compound through downstream LLM calls.

You could compare three LLMs trained independently, but frontier models share training data, and tripling the calls triples cost. Can a single LLM's output reveal its uncertainty?

Solution

LLMs return logprobs (log probabilities) for each generated token, along with several alternative candidates at each position. Probability is e^logprob. Confident tokens have probabilities near 1; uncertain ones don't.

Requesting logprobs from OpenAI

import math
from openai import OpenAI

client = OpenAI()

message = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[...],
    logprobs=True,      # return the logprob of every generated token
    top_logprobs=5,     # plus the 5 most likely alternatives at each position
)

response_text = message.choices[0].message.content
logprobs = message.choices[0].logprobs

for token_info in logprobs.content:
    token = token_info.token
    probability = math.exp(token_info.logprob)
    if token_info.top_logprobs:
        for alt_token in token_info.top_logprobs:
            if alt_token.token != token:
                alt_probability = math.exp(alt_token.logprob)

How logprobs behave

Asking "What year was Ataturk born?", the model says 1881 (tokens: 188 then 1).

The year tokens have probabilities near 1.0, and alternative years (1980s, 1830s, 1930s) are near zero.

But the starting token At has only 58% probability:

The reason is that "Mustafa Kemal Atatürk was born in 1881" is also a valid response. The probability of at (lowercase) at position 2 reflects the umlaut variant Atatürk. Low logprob doesn't always mean hallucination. It can mean valid alternatives.

Detecting actual hallucinations

Asking GPT-3.5 "Who is John Cole Howard?" (a fabricated person):

John Cole Howard is a fictional character from the TV show The Office, portrayed by actor Ed Helms.

The tokens The and Ed have probabilities below 50%. The model is guessing. Ed Helms's character on The Office is actually Andy Bernard.

Approaches to limit false positives

- Identify tokens of interest: only check logprobs at known positions (in structured output).
- Sample generated sequences: generate multiple completions and check whether they agree on the answer (compare via embedding similarity). Different leading tokens but the same answer is fine.
- Normalize across token length with perplexity: perplexity = e^(-(1/N) Σ log p_i); lower perplexity means more confident (see the sketch below).
- Build an ML model: train on token probabilities, embedding distances, perplexity, and contextual features. This last approach is the most robust.
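
As a concrete illustration of the perplexity option (a minimal sketch; token_infos stands in for the logprobs.content list returned by the OpenAI call above):

import math

def sequence_perplexity(token_infos) -> float:
    # perplexity = exp(-(1/N) * sum of logprobs); lower is more confident.
    logprobs = [t.logprob for t in token_infos]
    return math.exp(-sum(logprobs) / len(logprobs))

def mean_token_probability(token_infos) -> float:
    # Equivalent per-token view: the geometric mean of token probabilities,
    # which is simply 1 / perplexity.
    return 1.0 / sequence_perplexity(token_infos)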

Example: Receipt Extraction

Receipts have four fields: billed_amount, tax, tip, paid_amount. The total is a checksum, but if a field is smudged you must impute it. Ask the LLM:

You are a helpful AI assistant that helps parse restaurant receipts.
I will give you a set of parsed values containing the following on each line:
billed_amount, tax, tip, paid_amount

If tax is missing, calculate it as 9.21% of the billed_amount.
If the tip is missing, calculate it as (paid_amount - billed_amount - tax).
If the paid_amount is missing, calculate it as (billed_amount + tax + tip).
Do not add any headers or explanations.

Track each token's logprob and aggregate per row:

import math

# result_df holds the parsed rows; its last column will hold the row confidence.
line_no = 0
confidence_of_line = 1.0
last_col_no = len(result_df.columns) - 1
for token_info in logprobs.content:
    probability = math.exp(token_info.logprob)
    # A row is only as trustworthy as its least-confident token.
    confidence_of_line = min(confidence_of_line, probability)
    result_df.iloc[line_no, last_col_no] = confidence_of_line
    if '\n' in token_info.token:  # a newline marks the end of a receipt row
        line_no += 1
        confidence_of_line = 1.0

Result table:

billed_amount | tax   | tip  | paid_amount | Confidence
312.32        | 28.76 | 60.0 | 401.08      | 0.962668
312.32        | 28.76 | 60.0 | 400.00      | 0.551552
312.32        | 28.76 | 60.0 | 400.08      | 0.562172
312.21        | 28.84 | 50.0 | 391.05      | 0.172516
312.43        | 28.80 | 60.0 | 401.23      | 0.170295
300.00        | 27.63 | 60.0 | 387.63      | 0.999290

Confidence is high (>0.9) only for rows with no imputation, around 0.55 for rows with one imputation, and around 0.17 for two imputations. Confidence neatly tracks where the model had to guess.
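
In practice, a simple threshold on that column routes the shaky rows to a human (a two-line sketch, assuming the per-row confidence landed in a column named Confidence of the result_df built above):

needs_review = result_df[result_df["Confidence"] < 0.9]    # imputed rows go to a human
auto_approved = result_df[result_df["Confidence"] >= 0.9]  # clean rows flow straight through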

Considerations

A simpler alternative is to explicitly give the model an out: ask it to respond "I don't know" when uncertain, or use Grammar:

currency_rate: float | Literal["Unknown"]
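
In a structured-output setting that might look like the following (a sketch using pydantic; the ExchangeRate model and field names are illustrative, not the book's):

from typing import Literal

from pydantic import BaseModel

class ExchangeRate(BaseModel):
    currency_pair: str
    # An explicit "Unknown" option lets the model opt out instead of
    # hallucinating a plausible-looking number.
    currency_rate: float | Literal["Unknown"]

Supplying the schema as the response format makes "Unknown" a legal completion, so the model doesn't have to improvise a refusal.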

Self-Check is great for detecting inconsistent RAG retrievals. If two retrieved chunks contradict each other, the generated tokens tend to have low logprobs.

Not all models expose logprobs. Anthropic doesn't (as of writing). OpenAI does broadly. Gemini Flash supports responseLogprobs. Llama is the most permissive but requires self-hosting.

References: SelfCheckGPT (Manakul, Liusie, Gales 2023); Quevedo et al. (2024); Valentin et al. (2024).


Pattern 32 — Guardrails

Guardrails wrap LLM calls with preprocessing and postprocessing layers that enforce ethical, legal, and functional constraints across inputs, outputs, retrieved knowledge, and tool parameters.

Problem

GenAI apps need to defend against several categories of attack and failure. Security: prompt injection (direct or hidden in consumed data) and jailbreaks. CMU researchers (2023) found that random-character suffixes can manipulate LLM behavior. Data privacy: exposing PII, trade secrets, or confidential content. Content moderation: filtering hate speech, violence, sexual content. Hallucination: accuracy in critical fields like health, law, and finance. Alignment: adhering to brand voice and avoiding competitors, politics, and religion.

You don't want to sprinkle these checks across every code path. The error-handling surface area gets out of hand.

Solution

Guardrails insert uniform processing layers before the LLM (input sanitization), after retrieval, before tool calls, and after generation.
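
Put together, the hook points look roughly like this. The sketch below is illustrative only; each scanner or hook is passed in as a callable rather than tied to any particular framework:

from typing import Callable

def guarded_answer(
    user_query: str,
    scan_input: Callable[[str], str],        # injection, toxicity, PII, banned topics
    retrieve: Callable[[str], list[str]],
    scan_retrieved: Callable[[str], str],    # drop or sanitize poisoned chunks
    generate: Callable[[str, list[str]], str],
    scan_output: Callable[[str], str],       # moderation, hallucination, alignment
) -> str:
    query = scan_input(user_query)                         # before the LLM
    chunks = [scan_retrieved(c) for c in retrieve(query)]  # after retrieval
    # A tool-calling step would get the same treatment: validate the tool
    # parameters with a guardrail before executing any side effect.
    draft = generate(query, chunks)
    return scan_output(draft)                              # after generation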

Prebuilt guardrails

Some APIs have these built in. Gemini lets you block hate speech:

from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[prompt, media, ...],
    config=types.GenerateContentConfig(
        safety_settings=[
            types.SafetySetting(
                category=types.HarmCategory.HARM_CATEGORY_HATE_SPEECH,
                threshold=types.HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
            ),
        ]
    ),
)

Frameworks: NVIDIA NeMo, Guardrails AI, LLM Guard.

LLM Guard scanning a prompt for toxicity or prompt injection (it uses post-trained SLMs internally: unitary/unbiased-toxic-roberta, ProtectAI/deberta-v3-base-prompt-injection-v2):

from llm_guard.input_scanners import Toxicity, PromptInjection, Regex
# Each scanner's MatchType enum is imported from its own module, e.g.
# from llm_guard.input_scanners.toxicity import MatchType
scanner = Toxicity(threshold=0.5, match_type=MatchType.SENTENCE)
sanitized_prompt, is_valid, _ = scanner.scan(prompt)

scanner = PromptInjection(threshold=0.5, match_type=MatchType.FULL)
sanitized_prompt, is_valid, _ = scanner.scan(prompt)

scanner = Regex(
    patterns=[r"Bearer [A-Za-z0-9-._~+/]+"],
    is_blocked=True,
    match_type=MatchType.SEARCH,
    redact=True,
)
sanitized_prompt, is_valid, risk_score = scanner.scan(prompt)

Custom guardrails

LLM-as-Judge as a guardrail, rejecting banned topics:

banned_topics = ["religion", "politics", "sexual innuendo"]
system_prompt = f"""
I will give you a piece of text. Check whether the text touches on any of these
topics.

    {banned_topics}

Return True or False, with no preamble or special markers.
Text:
"""
response = llm.complete(system_prompt + "\n" + prompt).text.strip()
is_valid = (response == "False")

Applying a chain of guardrails

def apply_guardrails(guardrails, prompt):
    # Chain LLM Guard-style scanners; abort on the first check that fails.
    sanitized_prompt = prompt
    for scanner in guardrails:
        sanitized_prompt, is_valid, _ = scanner.scan(sanitized_prompt)
        if not is_valid:
            raise Exception("...")
    return sanitized_prompt

Example: Guarded Jane Austen RAG

Custom PII redactor (replace personal names with generic identifiers):

def guardrail_replace_names(to_scan: str):
    system_prompt = """
I will give you a piece of text. In that piece of text, replace any personal
names with a generic identifier.

Example:
  Input: I met Sally in the store.
  Output: I met a woman in the store.

Return only the modified text, with no preamble or special markers.
    """
    sanitized_output = llm.complete(system_prompt + "\n" + to_scan).text.strip()
    no_change = (sanitized_output == to_scan)
    return {
        "guardrail_type": "PII Removal",
        "activated": not no_change,
        "should_stop": False,
        "sanitized_output": sanitized_output,
    }

Custom banned-topic filter (LLM-as-Judge):

def guardrail_banned_topics(to_scan: str):
    banned_topics = ["religion", "politics", "sexual innuendo"]
    system_prompt = f"""
I will give you a piece of text. Check whether the text touches on any of these
topics.

    {banned_topics}

Return True or False, with no preamble or special markers.
Text:
    """
    response = llm.complete(system_prompt + "\n" + to_scan).text.strip()
    is_banned = (response == "True")
    return {
        "guardrail_type": "Banned Topic",
        "activated": is_banned,
        "should_stop": is_banned,
        "sanitized_output": to_scan,
    }
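
These dict-returning guardrails can be chained with a helper analogous to the scanner chain shown earlier. A minimal sketch of that chaining variant (an assumption; the book's own helper for this style isn't reproduced here):

def apply_guardrails(text: str, guardrails) -> dict:
    result = {"guardrail_type": None, "activated": False,
              "should_stop": False, "sanitized_output": text}
    for guardrail in guardrails:
        result = guardrail(result["sanitized_output"])
        if result["should_stop"]:
            break  # stop at the first guardrail that rejects the text
    return result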

All guardrails follow the same signature so they compose. Wrap the query engine:

class GuardedQueryEngine(RetrieverQueryEngine):
    def __init__(self, query_engine):
        self._query_engine = query_engine

    def query(self, query):
        # Input guardrails: redact personal names, block banned topics.
        gd = apply_guardrails(query, [guardrail_replace_names, guardrail_banned_topics])
        if not gd["should_stop"]:
            print(f"Modified Query: {gd['sanitized_output']}")
            query_response = self._query_engine.query(gd["sanitized_output"])
            # Output guardrails: re-check the generated answer before returning it.
            gd = apply_guardrails(str(query_response), [guardrail_banned_topics])
            if not gd["should_stop"]:
                return Response(gd["sanitized_output"], source_nodes=query_response.source_nodes)
        return Response(str(gd))

Behavior. "Are parish priests expected to be role models?" gets blocked because religion is banned. "Would Mr. Darcy be an appealing match if he were not wealthy?" gets rewritten to "Would a man be an appealing match if he were not wealthy?" before reaching the LLM.

Considerations

Guardrails add engineering complexity and latency. Use prebuilt SLMs where possible.

When the only risk is using poisoned output, you can run guardrails in parallel with the main LLM call:

try:
    input_guardrail_results, turn_result = await asyncio.gather(
        apply_guardrails(...),
        llm.complete(...),
    )
except InputGuardrailTriggered:
    ...

The main LLM call still executes (the LLM itself isn't protected), but if a guardrail triggers, you discard the LLM's result rather than using it downstream.

A few tradeoffs to be honest about. Stricter guardrails reduce model capability and add latency. Attackers eventually bypass: guardrails are an arms race. Don't build heavily customized guardrails. Build ones you can swap to a new framework or model. Maintain an evaluation dataset of attack scenarios plus max acceptable latency, and periodically test commercial guardrail systems.

References: Dong et al. (2024); OWASP prompt injection classifications. QED42 built prompt-based guardrails for a legal search; Acrolinx uses LLM-as-Judge for brand-voice consistency.


Summary

Pattern | Problem | Solution | When to use
Template Generation (29) | Per-message review can't scale; full dynamic generation too risky | Pregenerate small set of templates; humans review once; deterministic substitution at inference | Personalized B2C communications
Assembled Reformat (30) | Need appealing presentation but dynamic content too risky | Low-risk assembly (DB, RAG, OCR), then low-risk reformatting (LLM rephrasing) | Product catalogs, fact-driven content
Self-Check (31) | Need to detect hallucinations cost-effectively | Threshold logprobs / perplexity at key tokens; sample sequences; train ML detector | Factual extraction, structured outputs, RAG conflict detection
Guardrails (32) | Need security, privacy, content moderation, alignment around LLM | Pre/post-processing layers around inputs, retrieval, tools, outputs | Public-facing apps, adversarial environments

A production GenAI app with adversarial users typically does Template Generation or Assembled Reformat for the content path, layered with Guardrails for the defense path, and Self-Check on critical extraction tasks. Reach for Template Generation first whenever you can: it gives complete review coverage. Move to Assembled Reformat when you have hundreds of thousands of items but the risk is in specific facts. Use Self-Check on extracted fields, RAG outputs, and structured generation. It's cheap once you have logprobs. Treat Guardrails as a swappable wrapper. Don't over-customize: reuse commercial systems and re-evaluate periodically as attacks evolve.