Chapter 9: Setting Safeguards
Introduction
GenAI applications are inherently risky. Foundation models are nondeterministic and may hallucinate, generate toxic content, or drift from your application's purpose. The four patterns in this chapter wrap protective layers around LLMs. Template Generation (29) avoids dynamic generation by pregenerating reviewable templates. Assembled Reformat (30) keeps facts intact by splitting content creation into a low-risk data-assembly step and a low-risk reformatting step. Self-Check (31) detects hallucinations by inspecting token probabilities. Guardrails (32) wrap LLM calls with input/output filtering for security, privacy, content moderation, hallucination, and alignment.
Pattern 29 — Template Generation
Template Generation pregenerates content templates that humans review offline. At inference time the application performs only deterministic string replacement, so the output is safe to send without per-message human review.
Problem
You're a tour operator sending personalized thank-you notes after each booking. Volume is thousands per day, way too much for per-note human review. But fully dynamic LLM-generated notes risk including upsells, off-brand language, or worse.
Solution
Use the LLM to generate a small number of templates (one per tour × package × language). Humans review and edit those templates once. At inference time, fill placeholders with deterministic values from the booking record.
Example
Pregenerate 3 destinations × 4 package types × 2 languages = 24 templates:
DESTINATIONS = ["Toledo, Spain", "Avila & Segovia", "Escorial Monastery"]
PACKAGE_TYPES = ["Family", "Individual", "Group", "Singles"]
LANGUAGES = ["English", "Polish"]
for dest in DESTINATIONS:
    for package_type in PACKAGE_TYPES:
        for lang in LANGUAGES:
            template = create_template(dest, package_type, lang)
            db.insert(dest, package_type, lang, template)
The template-generation prompt asks the LLM to use placeholders for runtime substitution:
def create_template(tour_destination, package_type, language):
    prompt = f"""
    You are a tour guide working on behalf of Tours GenAI S.L. Write a
    personalized letter in {language} to a customer who has purchased a
    {package_type} tour package to visit {tour_destination}. ...
    Use [CUSTOMER_NAME] to indicate the place to be replaced by their name and
    [TOUR_GUIDE] to indicate the place to be replaced by your name.
    """
    template = zero_shot(GEMINI, prompt)
    template = human_edit_confirm(template)
    return template
A reviewed English template might be:
Dear [CUSTOMER_NAME],
I'm absolutely thrilled to welcome you to Toledo! I'm [TOUR_GUIDE], and I'll be your guide for your family tour. ...
See you soon, [TOUR_GUIDE]
A native-Polish reviewer can fix subtle issues. Polish has grammatical gender so the verb after [TOUR_GUIDE] may need a masculine vs. feminine form, and Polish audiences expect fewer exclamation points. Easier to fix once than to review thousands of generated letters.
At inference:
template = db.retrieve(booked_tour.destination, booked_tour.package_type, booked_tour.language)
email_body = (template
.replace("[CUSTOMER_NAME]", booked_tour.customer_name)
.replace("[TOUR_GUIDE]", booked_tour.tour_guide.name))
Considerations
This pattern works when the number of templates is tractable. If combinations explode, consider Assembled Reformat (Pattern 30) or Guardrails (Pattern 32). Pair with ML for personalization: pregenerate landing pages and use a propensity model or recommendation engine to select which to show.
References: Mail merge (1980s WordStar). LLM-generated templates: Lakshmanan (2024).
Pattern 30 — Assembled Reformat
Assembled Reformat splits content creation into two low-risk steps. First, assemble raw facts using non-LLM or low-risk methods (database, OCR, RAG, Tool Calling, Template Generation). Second, reformat that assembled content with an LLM. Rephrasing or summarizing introduces far fewer inaccuracies than generating from scratch.
Problem
A product catalog page that hallucinates "alkaline" instead of "lithium" for a camera battery is a real liability. Lithium batteries can't go in checked airline baggage. With hundreds of thousands of products, dynamic generation is too risky and per-page human review is infeasible.
Solution
Identify the risk-bearing attributes (battery type, dimensions, warranty period, materials, price). Assemble them via deterministic or low-risk methods. Then ask the LLM to reformat, never to generate the facts themselves.
Example: Product Catalog
@dataclass
class CatalogContent:
    part_name: str = Field("Common name of part")
    part_id: str = Field("Unique part ID in catalog")
    part_description: str = Field("One paragraph description...")
    failure_modes: list[str] = Field("List of common reasons why a customer might need to replace this part")
    warranty_period: int = Field("Number of years that the part is under warranty")
    price: str = Field("Price of part")
part_name, part_id, warranty_period, and price come from a database. part_description and failure_modes come from extraction over the equipment manual using a low temperature (0.1 or 0).
CatalogContent(
    part_name='wet_end',
    part_id='X34521PL',
    part_description='The wet end of a paper machine is the section where the paper web is formed...',
    failure_modes=['Web breaks', 'Uneven sheet formation', 'Poor drainage'],
    warranty_period=3,
    price='$23295'
)
Reformat with the LLM:
Write content in Markdown that will go in the Replacement Parts part of the
manufacturer's website. Include a placeholder for an image and include a
description of the image. Optimize the content for SEO. Also make it appealing
to potential buyers.
**Part Information:**
{item}
The output is appealing prose grounded in the three "acceptable" failure modes. It won't invent new failure modes the manufacturer doesn't want publicized.
Considerations
Validate by extracting facts twice (different methods) and comparing. Use Self-Check (Pattern 31) on the reformatted output to catch hallucinations. Use LLM-as-Judge (Pattern 17) to verify that reformatting preserved the source facts. Try Template Generation (Pattern 29) first, since it allows whole-template review and is even safer. Use Assembled Reformat when item count is too large to template. The pattern works for relatively static content (product catalogs). For per-user personalization (marketing landing pages), use Template Generation.
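The double-extraction check can be sketched in a few lines: pull the risk-bearing fields through two independent paths and route any disagreement to human review. The field names, dictionaries, and extraction sources below are illustrative, not from an actual catalog system.

```python
# Hypothetical double-extraction validation: compare risk-bearing fields
# from two independent extraction paths and flag disagreements for review.

def fields_agree(a: dict, b: dict, fields: list[str]) -> list[str]:
    """Return the names of fields on which the two extractions disagree."""
    return [f for f in fields if a.get(f) != b.get(f)]

RISK_FIELDS = ["battery_type", "warranty_period", "price"]

# Illustrative results from, say, an LLM extraction pass and a database lookup:
llm_extraction = {"battery_type": "lithium", "warranty_period": 3, "price": "$23295"}
db_lookup = {"battery_type": "lithium", "warranty_period": 3, "price": "$23295"}

mismatches = fields_agree(llm_extraction, db_lookup, RISK_FIELDS)
if mismatches:
    print(f"Route to human review: {mismatches}")
```

Fields that disagree are exactly the ones worth a human look; fields that agree across genuinely independent methods are unlikely to both be wrong in the same way.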
References: Lakshmanan (2024).
Pattern 31 — Self-Check
Self-Check uses token probabilities (logprobs) to detect potential hallucinations.
Problem
Hallucinations are dropping over time:
Vectara's text-summarization measurements show the best LLM dropping from 1.3% (Dec 2024) to 0.7% (Apr 2025), with the 25th-best dropping from 4.1% to 2.4%. But constrained or complex tasks still hallucinate. Image-to-text field extraction sits at 90 to 97% accuracy, which means 3 to 10% of extracted invoice numbers are hallucinated. Errors compound through downstream LLM calls.
You could compare three LLMs trained independently, but frontier models share training data, and tripling the calls triples cost. Can a single LLM's output reveal its uncertainty?
Solution
LLMs return logprobs (log probabilities) for each generated token, along with the logprobs of several alternative candidate tokens. Probability is e^logprob. Confident tokens have probabilities near 1; uncertain ones don't.
Requesting logprobs from OpenAI
import math

message = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[...],
    logprobs=True,
    top_logprobs=5
)
response_text = message.choices[0].message.content
logprobs = message.choices[0].logprobs

for token_info in logprobs.content:
    token = token_info.token
    probability = math.e ** token_info.logprob
    if token_info.top_logprobs:
        for alt_token in token_info.top_logprobs:
            if alt_token.token != token:
                alt_probability = math.e ** alt_token.logprob
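Once logprobs are in hand, flagging uncertain tokens takes only a threshold check. A minimal sketch, using a hard-coded list of (token, logprob) pairs in place of real model output:

```python
import math

def low_confidence_tokens(token_logprobs, threshold=0.8):
    """Return (token, probability) pairs whose probability falls below threshold."""
    flagged = []
    for token, logprob in token_logprobs:
        p = math.exp(logprob)
        if p < threshold:
            flagged.append((token, round(p, 3)))
    return flagged

# Illustrative logprobs (synthetic, not real model output):
tokens = [("At", math.log(0.58)), ("at", math.log(0.99)), ("urk", math.log(0.999)),
          (" was", math.log(0.97)), (" born", math.log(0.995)), (" in", math.log(0.98)),
          (" 188", math.log(0.999)), ("1", math.log(0.999))]

print(low_confidence_tokens(tokens))  # flags only the "At" token
```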
How logprobs behave
Asking "What year was Ataturk born?", the model says 1881 (tokens: 188 then 1).
The year tokens have probabilities near 1.0, and alternative years (1980s, 1830s, 1930s) are near zero.
But the starting token At has only 58% probability.
The reason is that "Mustafa Kemal Atatürk was born in 1881" is also a valid response, and the lowercase at alternative at the second position reflects the umlaut spelling Atatürk, which tokenizes differently. A low logprob doesn't always mean hallucination; it can mean there are valid alternatives.
Detecting actual hallucinations
Asking GPT-3.5 "Who is John Cole Howard?" (a fabricated person):
John Cole Howard is a fictional character from the TV show The Office, portrayed by actor Ed Helms.
The tokens The and Ed have probabilities below 50%. The model is guessing. Ed Helms's character on The Office is actually Andy Bernard.
Approaches to limit false positives
Several approaches limit false positives:

- Identify tokens of interest: only check logprobs at known positions (in structured output).
- Sample generated sequences: generate multiple completions and check whether they agree on the answer (compare via embedding similarity). Different leading tokens but the same answer is fine.
- Normalize across token length with perplexity: perplexity = e^(-(1/N) Σ log p_i). Lower perplexity means more confidence.
- Build an ML model: train on token probabilities, embedding distances, perplexity, and contextual features.

That last approach is the most robust.
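The perplexity normalization follows directly from the formula; the logprob lists below are synthetic:

```python
import math

def perplexity(logprobs: list[float]) -> float:
    """Sequence perplexity from per-token logprobs: e^(-(1/N) * sum(logprobs))."""
    return math.exp(-sum(logprobs) / len(logprobs))

confident = [math.log(0.99)] * 10   # every token near probability 1
uncertain = [math.log(0.5)] * 10    # every token a coin flip

print(perplexity(confident))  # ≈ 1.01
print(perplexity(uncertain))  # ≈ 2.0
```

Because it is averaged over N, perplexity lets you compare sequences of different lengths, which raw summed logprobs do not.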
Example: Receipt Extraction
Receipts have four fields: billed_amount, tax, tip, paid_amount. The total is a checksum, but if a field is smudged you must impute it. Ask the LLM:
You are a helpful AI assistant that helps parse restaurant receipts.
I will give you a set of parsed values containing the following on each line:
billed_amount, tax, tip, paid_amount
If tax is missing, calculate it as 9.21% of the billed_amount.
If the tip is missing, calculate it as (paid_amount - billed_amount - tax).
If the paid_amount is missing, calculate it as (billed_amount + tax + tip).
Do not add any headers or explanations.
Track each token's logprob and aggregate per row:
line_no = 0
confidence_of_line = 1.0
last_col_no = len(result_df.iloc[0]) - 1
for token_info in logprobs.content:
probability = (2.718281828459045 ** token_info.logprob)
confidence_of_line = min(confidence_of_line, probability)
result_df.iloc[line_no, last_col_no] = confidence_of_line
if '\n' in token_info.token:
line_no += 1
confidence_of_line = 1.0
Result table:
| billed_amount | tax | tip | paid_amount | Confidence |
|---|---|---|---|---|
| 312.32 | 28.76 | 60.0 | 401.08 | 0.962668 |
| 312.32 | 28.76 | 60.0 | 400.00 | 0.551552 |
| 312.32 | 28.76 | 60.0 | 400.08 | 0.562172 |
| 312.21 | 28.84 | 50.0 | 391.05 | 0.172516 |
| 312.43 | 28.80 | 60.0 | 401.23 | 0.170295 |
| 300.00 | 27.63 | 60.0 | 387.63 | 0.999290 |
Confidence is high (>0.9) only for rows with no imputation, around 0.55 for rows with one imputation, and around 0.17 for two imputations. Confidence neatly tracks where the model had to guess.
Considerations
A simpler alternative is to explicitly give the model an out: ask it to respond "I don't know" when uncertain, or use Grammar:
currency_rate: float | Literal["Unknown"]
Self-Check is useful for detecting inconsistent RAG retrievals: if two retrieved chunks contradict each other, the generated tokens tend to have low logprobs.
Not all models expose logprobs. Anthropic doesn't (as of this writing), OpenAI exposes them across most models, Gemini Flash supports responseLogprobs, and Llama is the most permissive but requires self-hosting.
References: SelfCheckGPT (Manakul, Liusie, Gales 2023); Quevedo et al. (2024); Valentin et al. (2024).
Pattern 32 — Guardrails
Guardrails wrap LLM calls with preprocessing and postprocessing layers that enforce ethical, legal, and functional constraints across inputs, outputs, retrieved knowledge, and tool parameters.
Problem
GenAI apps need to defend against several categories of attack and failure:

- Security: prompt injection (direct or hidden in consumed data) and jailbreaks. CMU researchers (2023) found that random-character suffixes can manipulate LLM behavior.
- Data privacy: exposing PII, trade secrets, or confidential content.
- Content moderation: filtering hate speech, violence, and sexual content.
- Hallucination: accuracy in critical fields like health, law, and finance.
- Alignment: adhering to brand voice and avoiding competitors, politics, and religion.
You don't want to sprinkle these checks across every code path. The error-handling surface area gets out of hand.
Solution
Guardrails insert uniform processing layers before the LLM (input sanitization), after retrieval, before tool calls, and after generation.
Prebuilt guardrails
Some APIs have these built in. Gemini lets you block hate speech:
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[prompt, media, ...],
    config=types.GenerateContentConfig(
        safety_settings=[
            types.SafetySetting(
                category=types.HarmCategory.HARM_CATEGORY_HATE_SPEECH,
                threshold=types.HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
            ),
        ]
    )
)
Frameworks: NVIDIA NeMo, Guardrails AI, LLM Guard.
LLM Guard scanning a prompt for toxicity or prompt injection (it uses post-trained SLMs internally: unitary/unbiased-toxic-roberta, ProtectAI/deberta-v3-base-prompt-injection-v2):
from llm_guard.input_scanners import Toxicity, PromptInjection, Regex
# Each scanner module defines its own MatchType enum, e.g.:
# from llm_guard.input_scanners.toxicity import MatchType

scanner = Toxicity(threshold=0.5, match_type=MatchType.SENTENCE)
sanitized_prompt, is_valid, _ = scanner.scan(prompt)

scanner = PromptInjection(threshold=0.5, match_type=MatchType.FULL)
sanitized_prompt, is_valid, _ = scanner.scan(prompt)

scanner = Regex(
    patterns=[r"Bearer [A-Za-z0-9-._~+/]+"],
    is_blocked=True,
    match_type=MatchType.SEARCH,
    redact=True,
)
sanitized_prompt, is_valid, risk_score = scanner.scan(prompt)
Custom guardrails
LLM-as-Judge as a guardrail, rejecting banned topics:
banned_topics = ["religion", "politics", "sexual innuendo"]
system_prompt = f"""
I will give you a piece of text. Check whether the text touches on any of these
topics.
{banned_topics}
Return True or False, with no preamble or special markers.
Text:
"""
response = llm.complete(system_prompt + "\n" + to_scan).text.strip()
is_valid = (response == "False")
Applying a chain of guardrails
def apply_guardrails(guardrails, prompt):
    sanitized_prompt = prompt
    for scanner in guardrails:
        sanitized_prompt, is_valid, _ = scanner(sanitized_prompt)
        if not is_valid:
            raise Exception("...")
    return sanitized_prompt
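To see such a chain in action without wiring up LLM Guard, here is a self-contained sketch with two toy scanners that follow the same (sanitized, is_valid, risk_score) convention; the chain function is repeated so the snippet runs standalone:

```python
import re

def redact_bearer_tokens(prompt):
    """Toy scanner: redact bearer tokens; never blocks."""
    sanitized = re.sub(r"Bearer [A-Za-z0-9\-._~+/]+", "[REDACTED]", prompt)
    return sanitized, True, 0.0

def block_empty(prompt):
    """Toy scanner: block prompts that are empty after sanitization."""
    is_valid = bool(prompt.strip())
    return prompt, is_valid, 0.0 if is_valid else 1.0

def apply_guardrails(guardrails, prompt):
    """Run scanners in order, threading the sanitized prompt through."""
    sanitized_prompt = prompt
    for scanner in guardrails:
        sanitized_prompt, is_valid, _ = scanner(sanitized_prompt)
        if not is_valid:
            raise ValueError("prompt rejected by guardrail")
    return sanitized_prompt

print(apply_guardrails([redact_bearer_tokens, block_empty],
                       "Call the API with Bearer abc123 please"))
# → Call the API with [REDACTED] please
```

With real LLM Guard scanners, each `scanner.scan` call already returns this tuple, so a `lambda p: scanner.scan(p)` slots into the same chain.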
Example: Guarded Jane Austen RAG
Custom PII redactor (replace personal names with generic identifiers):
def guardrail_replace_names(to_scan: str):
    system_prompt = """
    I will give you a piece of text. In that piece of text, replace any personal
    names with a generic identifier.
    Example:
    Input: I met Sally in the store.
    Output: I met a woman in the store.
    Return only the modified text, with no preamble or special markers.
    """
    sanitized_output = llm.complete(system_prompt + "\n" + to_scan).text.strip()
    no_change = (sanitized_output == to_scan)
    return {
        "guardrail_type": "PII Removal",
        "activated": not no_change,
        "should_stop": False,
        "sanitized_output": sanitized_output,
    }
Custom banned-topic filter (LLM-as-Judge):
def guardrail_banned_topics(to_scan: str):
    banned_topics = ["religion", "politics", "sexual innuendo"]
    system_prompt = f"""
    I will give you a piece of text. Check whether the text touches on any of these
    topics.
    {banned_topics}
    Return True or False, with no preamble or special markers.
    Text:
    """
    response = llm.complete(system_prompt + "\n" + to_scan).text.strip()
    is_banned = (response == "True")
    return {
        "guardrail_type": "Banned Topic",
        "activated": is_banned,
        "should_stop": is_banned,
        "sanitized_output": to_scan,
    }
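These dict-returning guardrails need a slightly different chain from the earlier tuple-returning scanners: thread sanitized_output through each guardrail in order, record which ones activated, and stop early on should_stop. A hedged sketch of such a chain, with toy guardrails standing in for the LLM-backed ones above:

```python
def apply_guardrails(text, guardrails):
    """Run dict-returning guardrails in order; stop early if one says to."""
    result = {"should_stop": False, "sanitized_output": text, "activated": []}
    for guardrail in guardrails:
        gd = guardrail(result["sanitized_output"])
        if gd["activated"]:
            result["activated"].append(gd["guardrail_type"])
        result["sanitized_output"] = gd["sanitized_output"]
        if gd["should_stop"]:
            result["should_stop"] = True
            break
    return result

# Toy stand-ins for guardrail_replace_names and guardrail_banned_topics:
def toy_replace_names(text):
    sanitized = text.replace("Sally", "a woman")
    return {"guardrail_type": "PII Removal", "activated": sanitized != text,
            "should_stop": False, "sanitized_output": sanitized}

def toy_banned_topics(text):
    banned = "politics" in text.lower()
    return {"guardrail_type": "Banned Topic", "activated": banned,
            "should_stop": banned, "sanitized_output": text}

result = apply_guardrails("I met Sally to discuss books.",
                          [toy_replace_names, toy_banned_topics])
print(result["sanitized_output"])  # → I met a woman to discuss books.
```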
All guardrails follow the same dict-returning signature, so they compose: a variant of apply_guardrails that threads sanitized_output through the chain and aggregates should_stop can then wrap the query engine:
class GuardedQueryEngine(RetrieverQueryEngine):
    def __init__(self, query_engine):
        self._query_engine = query_engine

    def query(self, query):
        gd = apply_guardrails(query, [guardrail_replace_names, guardrail_banned_topics])
        if not gd["should_stop"]:
            print(f"Modified Query: {gd['sanitized_output']}")
            query_response = self._query_engine.query(gd["sanitized_output"])
            gd = apply_guardrails(str(query_response), [guardrail_banned_topics])
            if not gd["should_stop"]:
                return Response(gd["sanitized_output"], source_nodes=query_response.source_nodes)
        return Response(str(gd))
Behavior. "Are parish priests expected to be role models?" gets blocked because religion is banned. "Would Mr. Darcy be an appealing match if he were not wealthy?" gets rewritten to "Would a man be an appealing match if he were not wealthy?" before reaching the LLM.
Considerations
Guardrails add engineering complexity and latency. Use prebuilt SLMs where possible.
When the only risk is using poisoned output, you can run guardrails in parallel with the main LLM call:
try:
    input_guardrail_results, turn_result = await asyncio.gather(
        apply_guardrails(...),
        llm.complete(...),
    )
except InputGuardrailTriggered:
    ...
The second call still executes (the LLM isn't protected) but you don't use its results downstream.
A few tradeoffs to be honest about. Stricter guardrails reduce model capability and add latency. Attackers eventually bypass: guardrails are an arms race. Don't build heavily customized guardrails. Build ones you can swap to a new framework or model. Maintain an evaluation dataset of attack scenarios plus max acceptable latency, and periodically test commercial guardrail systems.
References: Dong et al. (2024); OWASP prompt injection classifications. QED42 built prompt-based guardrails for a legal search; Acrolinx uses LLM-as-Judge for brand-voice consistency.
Summary
| Pattern | Problem | Solution | When to use |
|---|---|---|---|
| Template Generation (29) | Per-message review can't scale; full dynamic generation too risky | Pregenerate small set of templates; humans review once; deterministic substitution at inference | Personalized B2C communications |
| Assembled Reformat (30) | Need appealing presentation but dynamic content too risky | Low-risk assembly (DB, RAG, OCR), low-risk reformatting (LLM rephrasing) | Product catalogs, fact-driven content |
| Self-Check (31) | Need to detect hallucinations cost-effectively | Threshold logprobs / perplexity at key tokens; sample sequences; train ML detector | Factual extraction, structured outputs, RAG conflict detection |
| Guardrails (32) | Need security, privacy, content moderation, alignment around LLM | Pre/post-processing layers around inputs, retrieval, tools, outputs | Public-facing apps, adversarial environments |
A production GenAI app with adversarial users typically does Template Generation or Assembled Reformat for the content path, layered with Guardrails for the defense path, and Self-Check on critical extraction tasks. Reach for Template Generation first whenever you can: it gives complete review coverage. Move to Assembled Reformat when you have hundreds of thousands of items but the risk is in specific facts. Use Self-Check on extracted fields, RAG outputs, and structured generation. It's cheap once you have logprobs. Treat Guardrails as a swappable wrapper. Don't over-customize: reuse commercial systems and re-evaluate periodically as attacks evolve.