Chapter 4: Adding Knowledge — Syncopation
Introduction
This chapter continues the RAG ladder from Chapter 3. After the foundation (Basic RAG, Semantic Indexing, Indexing at Scale), the four patterns here go after harder problems: questions whose words don't appear in the knowledge base (Index-Aware Retrieval), retrieved chunks that aren't actually relevant (Node Postprocessing), users' trust in the system's answers (Trustworthy Generation), and complex queries that need multistep reasoning across multiple knowledge sources (Deep Search). Each pattern composes with the earlier ones, so pick the components that fit your use case rather than turning everything on.
Pattern 9 — Index-Aware Retrieval
Index-Aware Retrieval uses your knowledge of the indexed text, its vocabulary and structure, to push retrieval past Basic RAG and Semantic Indexing.
Problem
The "find chunks similar to the query" assumption breaks in four common cases. The question may not be present in the knowledge base: ask "What's a historical attraction within a 2-hour train ride from Madrid?" against a knowledge base that contains "Toledo is primarily located on the right (north) bank of the Tagus…" and "Work began on a high-speed link to Madrid, which entered service on November 15, 2005", and neither chunk shares keywords or meaning with the question. The knowledge base may use technical language that the query doesn't match: a user asks about "Muslim palaces" but the chunk calls Alhambra a "Nasrid fortress". The answer may be a fine detail buried in a long chunk, where a whole-chunk embedding hides a sub-paragraph fact (a passing mention of muqarnas in a long architectural description). And the answer may need a holistic interpretation across chunks: "What caused the collapse of Alhambra?" chains the Nueva Planta decrees to the centralized Spanish state to the expulsion of Nasrid rulers, and you can't retrieve that chain without already knowing it.
Solution
Four composable components.
Component 1: Hypothetical Answers (HyDE)
Instead of matching to the query, ask the model to answer the query (without grounding) and match the hypothetical document against the knowledge base. The hypothetical answer pulls vocabulary closer to the indexed chunks.
```python
from llama_index.core.llms import ChatMessage

def create_hypothetical_answer(question):
    # Ask the model for an ungrounded guess; even a wrong guess tends
    # to share vocabulary with the indexed chunks.
    messages = [
        ChatMessage(role="system",
                    content="Answer the following question in 2-3 sentences. "
                            "If you don't know the answer, make an educated guess."),
        ChatMessage(role="user", content=question)
    ]
    return str(llm.chat(messages))

def hyde_rag(question):
    # Retrieve against the hypothetical answer instead of the raw query.
    answer = create_hypothetical_answer(question)
    return semantic_rag(answer)
```
When the knowledge base contains differing viewpoints (mask use, fluoridation, abortion), generate hypothetical answers for each perspective and run RAG limited to documents matching that perspective.
HyDE was introduced by Gao et al. (2022). It works even when the hypothetical answer is wrong, because the wrong answer is still in the right vocabulary.
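A minimal sketch of the per-perspective variant described above, assuming chunks carry a perspective metadata field and that semantic_rag accepts a metadata filter (both are assumptions, not shown in the earlier snippets):

```python
PERSPECTIVES = ["supportive", "critical"]

def multi_perspective_hyde_rag(question):
    # One hypothetical answer per viewpoint, with retrieval restricted
    # to documents tagged with that viewpoint (filter support assumed).
    answers = {}
    for perspective in PERSPECTIVES:
        hypothetical = create_hypothetical_answer(
            f"Answer from a {perspective} perspective: {question}")
        answers[perspective] = semantic_rag(
            hypothetical, filters={"perspective": perspective})
    return answers
```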
Component 2: Query Expansion
Expand the query with context, synonyms, and disambiguation before searching. This bridges users' nontechnical vocabulary to your knowledge base's specialized terms.
```python
def add_context_to_query(question):
    messages = [
        ChatMessage(role="system", content="""
The following question is about topics discussed in a second-century book about
Alexander the Great. Clarify the question posed in the following ways:
* Expand to include second-century names. For example, a question about Iranians
  should include answers about Parthians, Persians, Medes, Bactrians, etc.
* Provide context on terms. For example, explain that Ammonites came from Jordan
  or that Philip was the father of Alexander.
Provide only the clarified question without any preamble or instructions.
""".strip()),
        ChatMessage(role="user", content=question)
    ]
    return str(llm.chat(messages))

def qryexp_rag(question):
    expanded_question = add_context_to_query(question)
    return semantic_rag(expanded_question)
```
Query expansion combines with HyDE, as in the sketch below.
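A minimal composition sketch, reusing the two helpers defined above (expansion runs first, so the hypothetical answer is generated from the clarified question):

```python
def qryexp_hyde_rag(question):
    # Expand the query, answer it hypothetically, then retrieve
    # against the hypothetical answer.
    expanded = add_context_to_query(question)
    hypothetical = create_hypothetical_answer(expanded)
    return semantic_rag(hypothetical)
```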
Component 3: Hybrid Search
Index chunks both on keywords (BM25) and embeddings, then score as a weighted average. In LlamaIndex:
```python
query_engine = index.as_query_engine(
    vector_store_query_mode="hybrid",
    similarity_top_k=2,
    alpha=0.25,
)
```
alpha=0.0 is pure BM25, alpha=1.0 is pure vector. Postgres, Pinecone, and Weaviate support hybrid natively. If your store doesn't, run two retrievers and combine results (see Pattern 10's reranking discussion).
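A minimal sketch of that fallback, assuming bm25_retriever and vector_retriever both return nodes with scores normalized to [0, 1] (the retriever names and the normalization are assumptions):

```python
def manual_hybrid_retrieve(query, alpha=0.25, top_k=2):
    # Weighted average of keyword and vector scores, mirroring
    # what alpha does in native hybrid mode.
    scores, nodes = {}, {}
    for node in bm25_retriever.retrieve(query):
        scores[node.node_id] = (1 - alpha) * node.score
        nodes[node.node_id] = node
    for node in vector_retriever.retrieve(query):
        scores[node.node_id] = scores.get(node.node_id, 0.0) + alpha * node.score
        nodes[node.node_id] = node
    ranked_ids = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [nodes[i] for i in ranked_ids]
```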
Component 4: GraphRAG
Store chunks in a graph database like Neo4j, with relationships. Retrieval can pull related chunks once a partial-answer chunk is found, build trees where parent embeddings summarize children, repeat nodes in different contexts, and pre-generate query-focused summaries.
If you don't have a structured knowledge graph, ask an LLM to extract one:
```python
# LangChain's LLMGraphTransformer extracts entities and relationships;
# it expects a list of Document objects rather than raw text.
from langchain_experimental.graph_transformers import LLMGraphTransformer

llm_transformer = LLMGraphTransformer(llm=llm)
graph_documents = llm_transformer.convert_to_graph_documents(documents)

# Persist the graph to Neo4j and query it with a graph-aware retriever.
graph_store = Neo4jGraphStore(...)
graph_store.write_graph(graph_documents)
graph_rag_retriever = KnowledgeGraphRAGRetriever(...)
query_engine = RetrieverQueryEngine.from_args(graph_rag_retriever)
```
GraphRAG goes beyond independent-chunk retrieval by leveraging explicit entity relationships.
Example: Anabasis of Alexander
Using a second-century history of Alexander the Great as the knowledge base, semantic-only RAG works for some questions. "How did Alexander treat the people of the places he conquered?" returns chunks about Hyparna and the Thebans. "Where did Alexander die?" correctly returns "The provided text does not contain information about where Alexander died", which is the desired behavior for enterprise grounding.
It fails for fine details and holistic questions. "Describe the relationship between Alexander and Diogenes" succeeds at chunk size 100 but fails at chunk size 1024 (LlamaIndex default). "What was Alexander's strategy against Darius III?" returns tactics from a single battle, not strategy. "How did the Persian king fight the Greeks?" is limited because the book uses Parthians/Macedonians, not Persians/Greeks.
With HyDE, the hypothetical answer "Alexander's strategy centered on forcing decisive battles to cripple the Persian army…" uses vocabulary much closer to the source. Retrieval brings in chunks across the entire book, and the answer becomes a holistic description of army formation at Gaugamela. HyDE also surfaces the Diogenes anecdote at chunk size 1024.
With query expansion, "How did the Persian king fight the Greeks?" becomes "How did the Achaemenid Persian king Darius III, as described in Arrian's Anabasis Alexandri… engage in military conflict with the Macedonians…", which matches many more chunks. The grounded answer covers Bactrian cavalry, scythe-bearing chariots, and Macedonian counter-tactics.
Considerations
HyDE and query expansion both lean on the foundational model's prior knowledge of the domain. If the model has no knowledge, expansions can be hallucinated, obsolete, or irrelevant, and the RAG's retrieval becomes grounded in those errors. Ask "What patterns is Alexander best known for?" and the model might generate Christopher Alexander's architectural design patterns instead of phalanxes. Wrong domain leads to wrong retrieval. Query expansion can also drift from user intent (expanding to alliances when the user only cares about the king). And GraphRAG introduces errors from poorly constructed relationships, with earlier versions of documents bleeding into the result.
References: HyDE (Gao et al. 2022); query optimization survey (Azad and Deepak 2017); LLM-era query optimization (Song and Zheng 2024); GraphRAG survey (Peng et al. 2024).
Pattern 10 — Node Postprocessing
Node Postprocessing inserts a step between retrieval and generation that increases relevance, reduces ambiguity, and personalizes responses.
Problem
Even good retrieval has issues:
- Similarity isn't relevance: a chunk full of geological terms (a table of contents) can rank top for "Describe the geology of the Grand Canyon" without containing the answer.
- Within a chunk, only some sentences help; the rest dilute the prompt.
- Ambiguous entities collide (Grand Canyon of the Colorado vs. Grand Canyon of the Yellowstone).
- Conflicting or obsolete content (multiple software versions, regional laws) ends up retrieved together.
- By default, the same answer goes to every user, with no personalization.
Solution
The core operation is reranking, plus a set of LLM-driven improvements that fold naturally into it.
Reranking
After retrieval, ask an LLM to score each chunk's relevance to the query:
```
You will be given a query and some text. Assign a relevance score between 0 and 1,
where 1 means that the text contains the answer to the question.

**Query**: {query}
**Full Text**: {node.text}
```
This is LLM-as-Judge (Pattern 17). Rerankers are far more accurate than embeddings, because embeddings compress everything into one vector and a reranker can examine the whole chunk.
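A sketch of that scoring loop in the PydanticAI style used later in this pattern (model and nodes are assumed to exist; the prompt mirrors the one above):

```python
from pydantic_ai import Agent

def llm_rerank(query, nodes, top_n=3):
    # Score each retrieved node with the relevance prompt, keep the best.
    agent = Agent(
        model, result_type=float,
        system_prompt="You will be given a query and some text. "
                      "Assign a relevance score between 0 and 1, where 1 means "
                      "that the text contains the answer to the question.")
    scored = [(agent.run_sync(
                   f"**Query**: {query}\n**Full Text**: {node.text}").data,
               node)
              for node in nodes]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [node for _, node in scored[:top_n]]
```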
Use a fine-tuned reranker like BGE to keep cost down:
```python
# pc is a Pinecone client; the BGE reranker runs via its inference API.
reranked_nodes = pc.inference.rerank(
    model="bge-reranker-v2-m3",
    query=query,
    documents=nodes,
    top_n=3,
    return_documents=True,
)
```
Reranking adds latency and cost (one LLM call per retrieved node).
Hybrid Search via Reranking
If you're going to rerank anyway, pull from multiple retrievers (BM25 + semantic), combine their lists, and rerank. That sidesteps the score-comparison problem of merging incompatible scoring scales.
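A sketch, pooling candidates from both retrievers and reusing the llm_rerank sketch above (the retriever names are again assumptions):

```python
def hybrid_rerank_retrieve(query, top_n=3):
    # Pool candidates from both retrievers; the reranker produces
    # comparable scores, so no cross-retriever normalization is needed.
    candidates = {node.node_id: node
                  for node in (bm25_retriever.retrieve(query)
                               + vector_retriever.retrieve(query))}
    return llm_rerank(query, list(candidates.values()), top_n=top_n)
```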
Query Expansion and Decomposition
Send different versions of the query to different retrievers (a synonym-expanded version to BM25, for example). Or split a complex query into subparts, retrieve each, and combine.
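A decomposition sketch under the same assumptions (the splitter prompt is illustrative, not from the book):

```python
def decompose_and_retrieve(query, top_n=3):
    # Split a complex query into sub-questions, retrieve for each,
    # then rerank the pooled candidates against the original query.
    splitter = Agent(
        model, result_type=list[str],
        system_prompt="Split the query into independent sub-questions. "
                      "Return the original query unchanged if it is simple.")
    sub_queries = splitter.run_sync(query).data
    pooled = {}
    for sub_query in sub_queries:
        for node in vector_retriever.retrieve(sub_query):
            pooled[node.node_id] = node
    return llm_rerank(query, list(pooled.values()), top_n=top_n)
```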
Filtering for Obsolete Information
Use metadata to keep only the latest version:
```python
latest_year = max([chunk.publication_year for chunk in chunks])
chunks = [c for c in chunks if c.publication_year == latest_year]
```
Detecting conflicts between chunks would need pairwise comparison: N×(N-1) LLM calls, often cost-prohibitive.
Contextual Compression
When you call an LLM to score relevance, also have it strip the chunk down to just the relevant sentences:
```python
from dataclasses import dataclass
from pydantic_ai import Agent

@dataclass
class Chunk:
    full_text: str
    relevant_text: str
    relevance_score: float

def process_node(query, node):
    # One structured-output call both scores the chunk and strips it
    # down to the relevant sentences.
    system_prompt = """
    You will be given a query and some text.
    1. Remove information from the text that is not relevant to answering the question.
    2. Assign a relevance score between 0 and 1, where 1 means that the text answers the question.
    """
    agent = Agent(model, result_type=Chunk, system_prompt=system_prompt)
    return agent.run_sync(f"**Query**: {query}\n**Full Text**: {node.text}").data
```
Folding compression into reranking limits the LLM-call count.
Disambiguation
```python
@dataclass
class DisambiguationResult:
    is_ambiguous: bool
    ambiguous_term: str
    possibility_1: str
    possibility_2: str

def disambiguate(query, node1, node2):
    system_prompt = """
    You will be given a query and two retrieved passages. Respond by saying whether
    the two passages are referring to two different entities with the same term...
    """
    agent = Agent(model, result_type=DisambiguationResult,
                  system_prompt=system_prompt)
    return agent.run_sync(
        f"**Query**: {query}\n**Passage 1**: {node1.text}\n"
        f"**Passage 2**: {node2.text}"
    ).data
```
Only N–1 calls (compare the first chunk to each subsequent one). Like compression, this can be folded into the reranking call.
Personalization and Conversation History
The postprocessing slot is a natural place to inject user context. A travel chatbot adds the user's planned travel dates so the writeup matches the season. Conversation history can be summarized and threaded in:
```python
joke = agent.run_sync('Tell me a joke.')
joke2 = agent.run_sync('Make the joke longer and add a punchline.',
                       message_history=joke.new_messages())
```
You can also pull dynamic context based on what was retrieved, not just what was asked. If the retrieved nodes are about luxury watches even though the query isn't, pull in the user's relevant search history.
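A sketch of the travel-chatbot injection, with a hypothetical user-profile object:

```python
def personalize_context(user, nodes):
    # Prepend user facts so generation matches the season and the
    # user's interests; retrieved nodes pass through unchanged.
    profile = (f"The user is traveling on {user.travel_dates} "
               f"and prefers {user.travel_style} trips.")
    return [profile] + [node.text for node in nodes]
```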
Example: Geology of the Grand Canyon
A semantic RAG with top_k=2 on geology textbooks returns the table of contents as the top chunk, then a chunk about the plateau north of the Grand Canyon. The generated answer claims the Grand Canyon has stairs "a score (20) miles in breadth", which is wrong (the Grand Canyon averages 10 miles). With top_k=4 the wrong answer persists.
With Node Postprocessing, reranking elevates a chunk that begins "Running water has gulched the walls, and weathering has everywhere attacked and driven them back…". Combined with the plateau chunk, the synthesized answer correctly attributes formation to river and wind erosion.
For the query "Name the characteristics of coal-bearing strata in Newcastle", the disambiguator catches that retrieved chunks reference both Newcastle, Pennsylvania and Newcastle, England. Production should ask the user which one.
Considerations
Rerankers are slow. They shift compute from index time to query time. Folding reranking, compression, and disambiguation into a single structured-output call (Pattern 2, Grammar) amortizes the cost. That requires a foundational model that can do all three, which specialized rerankers like BGE can't.
References: neural ranking survey (Guo et al. 2019); fine-tuned LLaMA as retriever and reranker (Ma et al. 2023); contextual compression (Verma 2024); ambiguity benchmark (Chen et al. 2021).
Pattern 11 — Trustworthy Generation
Trustworthy Generation is a set of techniques that increase users' trust in RAG answers, since errors can never be fully eliminated.
Problem
RAG fails in ways that erode trust:
- Retrieval failures surface irrelevant chunks or miss critical information.
- Context reliability suffers when retrieved content is outdated, biased, or incorrect.
- Reasoning errors arise from misinterpreting evidence across chunks.
- Hallucination risks remain, especially on complex topics where the model fabricates or blends information.
For a medical-RAG question like "What are the best treatment options for Type 1 diabetics?", the system has to convey how confident it is in the answer and warn about outdated or non-peer-reviewed sources.
Solution
A toolbox of techniques.
Out-of-Domain Detection
Decline (or route elsewhere) when the query falls outside your knowledge base. Three approaches, often best combined:
- Embedding distance: similarity scores drop steeply on out-of-domain requests, so track them over time and tune a threshold (see the sketch after this list).
- Zero-shot classification: categorize the query against labels like ["Medical", "Not Medical"] and require a minimum confidence.
- Required keywords: the query must contain a term from a domain dictionary (a medical dictionary, for example).
On detection, short-circuit the pipeline and either reply "I can't answer that" or route the query somewhere else (Google Maps for directions, say).
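A sketch of the embedding-distance check using a LlamaIndex retriever; the threshold is an illustrative value that must be tuned against logged queries:

```python
def is_out_of_domain(query, index, threshold=0.55):
    # If even the best-matching chunk is dissimilar to the query,
    # treat the request as out of domain.
    top = index.as_retriever(similarity_top_k=1).retrieve(query)
    return len(top) == 0 or top[0].score < threshold
```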
Citations
Three approaches:
- Source-level tracking generates citations from retrieval lineage using chunk metadata. Simple, but tends to over-cite.
- Classification-based approaches use a classifier to distinguish common knowledge from citation-worthy facts. More precise, but more complex (you have to provide or fine-tune the classifier).
- Token-level attribution tracks metadata through the LLM's attention mechanism so every token can be attributed to its sources. This handles paraphrasing and mixed-source attribution, but as of this writing it's an active research area with no production-ready open-source implementation.
Guardrails
Insert checks at every stage of the pipeline:
- Before retrieval: filter harmful queries via OOD detection, sanitize input to prevent prompt injection, and restrict the document store to high-trust sources.
- After retrieval, before generation: track chunk metadata, prioritize by source authority and a relevance threshold, fact-check via reflective RAG, run a privacy-compliance check, apply a freshness check (exclude chunks older than 6 months, say), enforce source diversity, and run a harmful-content scan.
- After generation: enforce citations, fact-check the final response against trusted sources, and run privacy and harmful-content scans. If anything fails, rewrite and regenerate.
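A sketch of the post-generation stage as a regenerate-on-failure loop (the check functions are placeholders for whatever guardrail tooling you adopt):

```python
def guarded_generate(query, chunks, max_attempts=3):
    # Regenerate until the response passes every post-generation check.
    for _ in range(max_attempts):
        response = generate_answer(query, chunks)
        if (has_citations(response)
                and passes_fact_check(response, chunks)
                and passes_privacy_scan(response)):
            return response
    return "I can't answer that reliably."
```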
For tooling, Guardrails AI covers PII, OOD, jailbreak, and profanity checks. DeepEval does RAG metrics and red-teaming. Ragas covers answer correctness, context recall, and related RAG metrics. All three also give you observability hooks.
Observability
Track context relevance, response relevance, faithfulness, and context recall and precision. Tools include Arize Phoenix, Comet Opik, Langfuse, and Langtrace. Transparent input/output evaluation directly increases stakeholder trust.
Human Feedback
Online: human review post-retrieval (up/down vote chunks), human review of low-confidence responses before sending. Offline: rank retrieved chunks (the basis for a fine-tuned domain-specific embedding model), review responses for correctness (the basis for fine-tuning the generation model). Feedback comes in three flavors: explicit (thumbs-up/down on chunks or responses), implicit (engagement metrics like response usage frequency), and validation/annotation (subject-matter-expert review).
Corrective RAG (CRAG)
Adds an evaluator that scores retrieved chunks before generation. If chunks are irrelevant or ambiguous, augment with web search or other dynamic sources, or decompose-then-recompose to filter out irrelevant content from chunks. CRAG composes with traditional RAG (Yan et al. 2024; LangGraph CRAG tutorial).
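A sketch of the CRAG gate (score_relevance and web_search are placeholders for the evaluator and the dynamic source):

```python
def corrective_retrieve(query, chunks, min_score=0.7):
    # Keep only chunks the evaluator scores as relevant; fall back to
    # web search when nothing in the knowledge base is good enough.
    scored = [(score_relevance(query, chunk), chunk) for chunk in chunks]
    good = [chunk for score, chunk in scored if score >= min_score]
    return good if good else web_search(query)
```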
Self-RAG
Three components: self-evaluation (critique retrieved docs for relevance and quality), adaptive retrieval (decide whether to retrieve more), and controlled generation (modulate between retrieved and parametric knowledge). All three use prompting (LLM-as-Judge, Pattern 17). The pipeline is more complex with more failure points, but it reduces hallucinations and improves explainability (Asai et al. 2023).
UI Design
Visual indicators of trust matter: citation links, source previews, confidence meters, inline citations with one-click access to sources. Progressive disclosure (simple results first, "explore deeper" options). Feedback mechanisms (thumbs up/down, correction). Filters by date or source authority. Refinement of queries based on initial responses.
Example: Adding Citations Classifier-Style
The flow is: generate the initial response using normal RAG, chunk the response (by sentence), classify each chunk as needing a citation, look up sources in the document store, insert citation markers into the response while deduping sources, and warn or regenerate when no citation source is found.
```python
import json
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate

def needs_citation(content):
    # Classify one sentence as common knowledge vs. citation-worthy.
    llm = ChatOpenAI(model_name=LLM_MODEL)
    prompt = PromptTemplate.from_template("""
    Check if the content requires citations. The return should be true or false
    in this JSON format: {{"requires_citations": true}}
    Content: {content}
    """)
    response = llm.invoke(prompt.format(content=content))
    return json.loads(response.content)["requires_citations"]

def check_sources(sentence):
    # Look up candidate citation sources for a single sentence.
    vectorstore = load_vector_store()
    similar_chunks = vectorstore.similarity_search(sentence, k=5)
    return similar_chunks

# Walk the response sentence by sentence, appending citations.
# review_sentences holds each sentence, its needs-citation flag ("review"),
# and the chunks returned by check_sources.
response_with_citations = ""
for review_sentence in review_sentences:
    response_with_citations += review_sentence["sentence"]
    if review_sentence["review"] and len(review_sentence["chunks"]) == 0:
        response_with_citations += " [Citation needed] "
    elif review_sentence["review"] and len(review_sentence["chunks"]) > 0:
        file_references = set([x.metadata["source"]
                               for x in review_sentence["chunks"]])
        citation = format_citation(file_references, file_to_citation)
        response_with_citations += citation
        response_with_citations += " "
```
When no citation source is found, mark as [Citation needed], color-code the sentence, drop it, or force regeneration.
Sample output:
The Brandenburg Concertos (BWV 1046–1051) are a collection of six instrumental works by Johann Sebastian Bach, presented to Christian Ludwig, Margrave of Brandenburg-Schwedt, in 1721. [1] These concertos are highly regarded as some of the greatest orchestral compositions of the Baroque era. [1] … Recent research has indicated that some of the material for the concertos may have been based on earlier music composed by Bach for other purposes. [1, 2]
References: [1] raw_texts/bach_brandenburg-concertos.txt [2] raw_texts/bach_mass-in-b-minor-bach.txt
Considerations
More tooling means more failure points and more latency, so evaluate the tradeoffs. Threshold-based retrieval and filtering is highly domain-specific and needs constant tuning. Strict guardrails can drop too much information and hurt UX. Verification and human-in-the-loop don't scale indefinitely, since growing knowledge bases produce more conflicts.
The alternatives when this pattern alone isn't enough are combining multiple knowledge sources (parametric, nonparametric, knowledge graphs) and focusing on explainability rather than auto-fixing: surface confidence scores, reasoning steps, and side-by-side citations and let users decide.
References: Guardrails AI docs; OpenAI Cookbook on guardrails; ML6 on guardrail intervention levels; Qi et al. 2024 and Phukan et al. 2024 on token-level attribution; Huang et al. 2024 on citation generation; Yan et al. 2024 (CRAG); Asai et al. 2023 (Self-RAG); Google NotebookLM for inline citations.
Pattern 12 — Deep Search
Deep Search uses an iterative loop of retrieval, thinking, and generation to answer complex queries that single-shot RAG can't handle.
Problem
Simple RAG breaks on a few specific shapes of question:
- Context window constraints: "Compare the economic impacts of climate change mitigation across developing and developed nations…" needs more evidence than fits in one context.
- Query ambiguity: Pattern 10 detects ambiguity but only suggests follow-up questions; it doesn't resolve the query when one meaning is far more likely.
- Information staleness: preindexed information can be outdated, with no verification mechanism.
- Reasoning depth: "What would be the implications of using transformer models with linear attention mechanisms for real-time video processing on edge devices?" requires connecting concepts across documents.
- Multihop queries: "What programming languages would be most suitable for implementing the algorithms described in the latest quantum machine learning papers from MIT?" needs to find the papers, then identify the algorithms, then choose languages.
Solution
Three additions over traditional RAG: a thinking step between retrieval and generation that asks what's missing, iteration across multiple retrieval-and-generation rounds, and external tools (search engines, enterprise APIs) rather than just one knowledge base. State is maintained across iterations, updated queries fetch new info that gets appended to context, and the entire context drives the final synthesis.
This addresses each problem differently:
- Context window constraints: accumulate knowledge across iterations and discard less relevant information.
- Query ambiguity: decompose or refine the query, with reasoning to disambiguate.
- Information staleness and verification: fact-check across multiple sources, including real-time data.
- Reasoning depth: multistep inference across iterations.
- Multihop queries fall out of the iteration model: early iterations inform later searches.
Deep Search and Deep Research differ only in output. Both iterate with a thinking step, but Deep Search produces a concise answer while Deep Research produces a long-form report. This book treats Deep Research as a Deep Search variant.
Implementing with foundational models
The reasoning steps in the pipeline (parsing the query, ranking results, extracting relevant text, finding gaps, synthesizing) all become LLM calls. Identifying gaps:
```python
def get_next_queries(original_query, sub_queries, synthesis):
    prompt = f"""Determine whether there is a logical or information gap in the
    answer based on the original query, previous sub queries, and the response. If
    the current response answers the question without any logical or information
    gaps, return an empty list. If there is a gap, provide a list of up to 3 search
    queries to fill in the gap.

    **Original Query**: {original_query}
    **Previous Sub Queries**: {sub_queries}
    **Current answer**: {synthesis}
    """
    agent = Agent(llm, result_type=list[str])
    return agent.run_sync(prompt).data
```
Iterative refinement
Start simple and let the answer drive the next step. "Compare the economic impacts…" becomes "What is the economic impact of climate change mitigation strategies?" first, then progressively fills in country-by-country and cost-vs-benefit details.
Evaluation metrics
This is the most important part of the pattern. The quality of responses depends directly on the quality of evaluation. Use a framework like Ragas and create a weighted average of relevance, comprehensiveness, accuracy and factual correctness, coherence and logical flow, citation quality, and efficiency. Add tool-specific metrics where needed (SQL correctness for SQL queries, for example). Curate a reference dataset of (question, answer) pairs and verify your scores match intuition. Patronus AI offers evaluator-trained models.
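A sketch of the weighted composite, assuming the individual metrics have already been computed (by Ragas or your own evaluators) into a dict of scores in [0, 1]; the weights are illustrative:

```python
WEIGHTS = {  # illustrative weights; tune against your reference dataset
    "relevance": 0.25, "comprehensiveness": 0.20, "accuracy": 0.25,
    "coherence": 0.10, "citation_quality": 0.10, "efficiency": 0.10,
}

def composite_score(metric_scores):
    # Weighted average over the per-metric scores.
    return sum(weight * metric_scores[name] for name, weight in WEIGHTS.items())
```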
Information integration
The synthesis step has to do cross-document entity resolution, detect contradictions and resolve them, cluster sources by perspective when contradictions are perspective-driven, rank sources by credibility (especially web sources), and apply temporal reasoning.
Example: Deep Search on Wikipedia
Retrieval
```python
import wikipedia

# WikipediaPage is the book's own container (title, url, and a
# relevant_text field filled in later), not the wikipedia library's page type.
def search_wikipedia(query):
    wikipedia.set_lang("en")
    results = wikipedia.search(query)
    pages = []
    for title in results:
        page = wikipedia.page(title)
        pages.append(WikipediaPage(title=page.title, url=page.url))
    return pages
```
For "What were the causes of the Liberian civil war?", this returns First Liberian Civil War, Liberians United for Reconciliation and Democracy, Americo-Liberian people, and similar.
Rank pages with an LLM:
```python
def rank_pages(query, pages):
    agent = Agent(model=MODEL_ID, result_type=list[WikipediaPage])
    prompt = (f"Rank these Wikipedia pages by relevance to the query: "
              f"\"{query}\".\nPages: {pages}")
    response = agent.run_sync(prompt)
    return response.data
```
Then extract relevant text from each top-ranked page using another LLM call.
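A sketch of that extraction step, an add_relevant_text helper matching the orchestration code below (assumes WikipediaPage has a mutable relevant_text field; the prompt is illustrative):

```python
def add_relevant_text(query, page):
    # Fetch the full article, then keep only the passages that bear
    # on the query.
    content = wikipedia.page(page.title).content
    agent = Agent(model=MODEL_ID, result_type=str)
    page.relevant_text = agent.run_sync(
        f'Extract the text relevant to the query "{query}" '
        f"from the following article:\n{content}").data
```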
Generation
```
Answer the following query based on the given information.

Query: {query}

Relevant information: {[page.relevant_text for page in pages]}
```
Thinking — find follow-ups
For Liberia, the model produces:
- How did the socio-economic disparities specifically manifest…?
- In what specific ways did regional instability contribute…?
- Can you elaborate on Doe-regime human rights abuses that fueled the rebellion?
Each follow-up feeds back into retrieval. Many chatbots already use this pattern visually, with follow-up questions appearing as clickable chips.
Orchestration
```python
@dataclass
class Section:
    query: str
    answer: str
    sections: list['Section']

def create_section(query):
    # One retrieval-and-generation round for a single query.
    pages = search_wikipedia(query)
    ranked_pages = rank_pages(query, pages)[:3]
    for page in ranked_pages:
        add_relevant_text(query, page)
    answer = synthesize_answer(query, ranked_pages)
    return Section(query=query, answer=answer, sections=list())

def add_subsections(parent):
    # The thinking step: turn gaps into follow-up queries.
    follow_ups = identify_gaps_and_followups(parent.query, parent.answer)
    for follow_up in follow_ups:
        section = create_section(follow_up)
        parent.sections.append(section)

def deep_search(query, depth, report=None):
    if report is None:
        report = create_section(query)
    add_subsections(report)
    if depth > 1:
        for section in report.sections:
            deep_search(section.query, depth - 1, section)
    return report

report = deep_search(
    query="What were some of the famous victories of Napoleon Bonaparte?",
    depth=1)
```
The output is a hierarchical report:
What were some of the famous victories of Napoleon Bonaparte?
- Siege of Toulon (1793), 13 Vendémiaire (1795), … Battle of Austerlitz (1805) …
## What was the significance of the Battle of Austerlitz, and why is it considered one of Napoleon's greatest victories?
- Effectively destroyed the Third Coalition…
## What were the key strategies or tactics that Napoleon employed in these battles to secure victory?
- Overwhelming enemy forces, rapid marching, concentrated firepower…
## What were the consequences of these victories in terms of Napoleon's power and the political landscape of Europe?
- First Consul for life in 1802, Emperor in 1804…
Considerations
Deep Search is slow. Multiple iterations, each with retrieval plus LLM calls. Parallelize subqueries where you can and cut iteration short when the answer is good enough. Combine with Content Optimization (Pattern 5) to tune response style based on user preferences.
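A sketch of the fan-out, parallelizing create_section from the example above with a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def add_subsections_parallel(parent, max_workers=4):
    # Follow-up queries are independent, so each subsection's
    # retrieval and synthesis can run concurrently.
    follow_ups = identify_gaps_and_followups(parent.query, parent.answer)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parent.sections.extend(pool.map(create_section, follow_ups))
```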
References: STORM (Stanford, 2024) builds Wikipedia-style articles from scratch; DeepSeek-R1 (early 2025) popularized a visible thinking mode; node-DeepResearch (open source). All the major frontier-model providers (OpenAI, Google, Anthropic) now offer "deep research" modes.
Summary
| Pattern | Problem | Solution | Usage |
|---|---|---|---|
| Basic RAG (6) | Knowledge cutoff, confidential data, hallucinations | Retrieve + ground responses | Customer service, internal search, analyst tools, legal research, competitive intelligence |
| Semantic Indexing (7) | Keyword indexing fails on complex / multi-modal docs | Embeddings of text, images, video, tables | Anything natural-language; multimodal corpora |
| Indexing at Scale (8) | Outdated / contradictory info in production | Metadata, query filtering, reranking | Production knowledge bases that evolve over time |
| Index-Aware Retrieval (9) | Question doesn't match KB vocabulary; fine details / holistic answers | Hypothetical answers (HyDE), query expansion, hybrid search, GraphRAG | Mismatched user/expert vocabulary; multi-fact questions |
| Node Postprocessing (10) | Irrelevant content, ambiguous entities, generic answers | Reranking + compression + disambiguation + personalization | Trust-critical retrieval; ambiguous-entity domains |
| Trustworthy Generation (11) | Errors can't be eliminated, preserve user trust | OOD detection, citations, guardrails, human feedback, CRAG, self-RAG, UI | High-stakes domains: medical, legal, financial |
| Deep Search (12) | Complex queries: context limits, multihop, deep reasoning | Iterative retrieve + think + generate; external tools | Research, multi-source synthesis, deep-research workflows |
The patterns compose. A production RAG might use Semantic Indexing plus Indexing at Scale (metadata) plus Index-Aware Retrieval (HyDE) plus Node Postprocessing (reranking + compression) plus Trustworthy Generation (citations + guardrails), with Deep Search reserved for the complex queries. Pick the components that match your data and your users. Over-engineering hurts UX and cost.