Chapter 4: Adding Knowledge — Syncopation
Introduction
This chapter continues the RAG ladder from Chapter 3. After the foundation (Basic RAG, Semantic Indexing, Indexing at Scale), the four patterns here go after harder problems: questions whose words don't appear in the knowledge base (Index-Aware Retrieval), retrieved chunks that aren't actually relevant (Node Postprocessing), users' trust in the system's answers (Trustworthy Generation), and complex queries that need multistep reasoning across multiple knowledge sources (Deep Search). Each pattern composes with the earlier ones, so pick the components that fit your use case rather than turning everything on.
Pattern 9 — Index-Aware Retrieval
Index-Aware Retrieval uses your knowledge of the indexed text, its vocabulary and structure, to push retrieval past Basic RAG and Semantic Indexing.
Problem
The "find chunks similar to the query" assumption breaks in four common cases. The question may not be present in the knowledge base: ask "What's a historical attraction within a 2-hour train ride from Madrid?" against a knowledge base that contains "Toledo is primarily located on the right (north) bank of the Tagus…" and "Work began on a high-speed link to Madrid, which entered service on November 15, 2005", and neither chunk shares keywords or meaning with the question. The knowledge base may use technical language that the query doesn't match: a user asks about "Muslim palaces" but the chunk calls Alhambra a "Nasrid fortress". The answer may be a fine detail buried in a long chunk, where a whole-chunk embedding hides a sub-paragraph fact (a passing mention of muqarnas in a long architectural description). And the answer may need a holistic interpretation across chunks: "What caused the collapse of Alhambra?" chains the Nueva Planta decrees to the centralized Spanish state to the expulsion of Nasrid rulers, and you can't retrieve that chain without already knowing it.
Solution
Four composable components.
Component 1: Hypothetical Answers (HyDE)
Instead of matching to the query, ask the model to answer the query (without grounding) and match the hypothetical document against the knowledge base. The hypothetical answer pulls vocabulary closer to the indexed chunks.
```python
from llama_index.core.llms import ChatMessage

def create_hypothetical_answer(question):
    # Ask the model for an ungrounded guess; even a wrong guess tends
    # to share vocabulary with the indexed chunks.
    messages = [
        ChatMessage(role="system",
                    content="Answer the following question in 2-3 sentences. "
                            "If you don't know the answer, make an educated guess."),
        ChatMessage(role="user", content=question)
    ]
    return str(llm.chat(messages))

def hyde_rag(question):
    # Retrieve against the hypothetical answer instead of the raw query.
    answer = create_hypothetical_answer(question)
    return semantic_rag(answer)
```
When the knowledge base contains differing viewpoints (mask use, fluoridation, abortion), generate hypothetical answers for each perspective and run RAG limited to documents matching that perspective.
HyDE was introduced by Gao et al. (2022). It works even when the hypothetical answer is wrong, because the wrong answer is still in the right vocabulary.
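A minimal sketch of the per-perspective variant described above, assuming chunks carry a perspective metadata field and that semantic_rag accepts a metadata filter (both are assumptions, not shown in the earlier snippets):

```python
PERSPECTIVES = ["supportive", "critical"]

def multi_perspective_hyde_rag(question):
    # One hypothetical answer per viewpoint, with retrieval restricted
    # to documents tagged with that viewpoint (filter support assumed).
    answers = {}
    for perspective in PERSPECTIVES:
        hypothetical = create_hypothetical_answer(
            f"Answer from a {perspective} perspective: {question}")
        answers[perspective] = semantic_rag(
            hypothetical, filters={"perspective": perspective})
    return answers
```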
Component 2: Query Expansion
Expand the query with context, synonyms, and disambiguation before searching. This bridges users' nontechnical vocabulary to your knowledge base's specialized terms.
```python
def add_context_to_query(question):
    messages = [
        ChatMessage(role="system", content="""
The following question is about topics discussed in a second-century book about
Alexander the Great. Clarify the question posed in the following ways:
* Expand to include second-century names. For example, a question about Iranians
  should include answers about Parthians, Persians, Medes, Bactrians, etc.
* Provide context on terms. For example, explain that Ammonites came from Jordan
  or that Philip was the father of Alexander.
Provide only the clarified question without any preamble or instructions.
""".strip()),
        ChatMessage(role="user", content=question)
    ]
    return str(llm.chat(messages))

def qryexp_rag(question):
    expanded_question = add_context_to_query(question)
    return semantic_rag(expanded_question)
```
Query expansion combines with HyDE, as in the sketch below.
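A minimal composition sketch, reusing the two helpers defined above (expansion runs first, so the hypothetical answer is generated from the clarified question):

```python
def qryexp_hyde_rag(question):
    # Expand the query, answer it hypothetically, then retrieve
    # against the hypothetical answer.
    expanded = add_context_to_query(question)
    hypothetical = create_hypothetical_answer(expanded)
    return semantic_rag(hypothetical)
```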
Component 3: Hybrid Search
Index chunks both on keywords (BM25) and embeddings, then score as a weighted average. In LlamaIndex:
```python
query_engine = index.as_query_engine(
    vector_store_query_mode="hybrid",
    similarity_top_k=2,
    alpha=0.25,
)
```
alpha=0.0 is pure BM25, alpha=1.0 is pure vector. Postgres, Pinecone, and Weaviate support hybrid natively. If your store doesn't, run two retrievers and combine results (see Pattern 10's reranking discussion).
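A minimal sketch of that fallback, assuming bm25_retriever and vector_retriever both return nodes with scores normalized to [0, 1] (the retriever names and the normalization are assumptions):

```python
def manual_hybrid_retrieve(query, alpha=0.25, top_k=2):
    # Weighted average of keyword and vector scores, mirroring
    # what alpha does in native hybrid mode.
    scores, nodes = {}, {}
    for node in bm25_retriever.retrieve(query):
        scores[node.node_id] = (1 - alpha) * node.score
        nodes[node.node_id] = node
    for node in vector_retriever.retrieve(query):
        scores[node.node_id] = scores.get(node.node_id, 0.0) + alpha * node.score
        nodes[node.node_id] = node
    ranked_ids = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [nodes[i] for i in ranked_ids]
```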
Component 4: GraphRAG
Store chunks in a graph database like Neo4j, with relationships. Retrieval can pull related chunks once a partial-answer chunk is found, build trees where parent embeddings summarize children, repeat nodes in different contexts, and pre-generate query-focused summaries.
If you don't have a structured knowledge graph, ask an LLM to extract one:
```python
# LangChain's LLMGraphTransformer extracts entities and relationships;
# it expects a list of Document objects rather than raw text.
from langchain_experimental.graph_transformers import LLMGraphTransformer

llm_transformer = LLMGraphTransformer(llm=llm)
graph_documents = llm_transformer.convert_to_graph_documents(documents)

# Persist the graph to Neo4j and query it with a graph-aware retriever.
graph_store = Neo4jGraphStore(...)
graph_store.write_graph(graph_documents)
graph_rag_retriever = KnowledgeGraphRAGRetriever(...)
query_engine = RetrieverQueryEngine.from_args(graph_rag_retriever)
```
GraphRAG goes beyond independent-chunk retrieval by leveraging explicit entity relationships.
Example: Anabasis of Alexander
Using a second-century history of Alexander the Great as the knowledge base, semantic-only RAG works for some questions. "How did Alexander treat the people of the places he conquered?" returns chunks about Hyparna and the Thebans. "Where did Alexander die?" correctly returns "The provided text does not contain information about where Alexander died", which is the desired behavior for enterprise grounding.
It fails for fine details and holistic questions. "Describe the relationship between Alexander and Diogenes" succeeds at chunk size 100 but fails at chunk size 1024 (LlamaIndex default). "What was Alexander's strategy against Darius III?" returns tactics from a single battle, not strategy. "How did the Persian king fight the Greeks?" is limited because the book uses Parthians/Macedonians, not Persians/Greeks.
With HyDE, the hypothetical answer "Alexander's strategy centered on forcing decisive battles to cripple the Persian army…" uses vocabulary much closer to the source. Retrieval brings in chunks across the entire book, and the answer becomes a holistic description of army formation at Gaugamela. HyDE also surfaces the Diogenes anecdote at chunk size 1024.
With query expansion, "How did the Persian king fight the Greeks?" becomes "How did the Achaemenid Persian king Darius III, as described in Arrian's Anabasis Alexandri… engage in military conflict with the Macedonians…", which matches many more chunks. The grounded answer covers Bactrian cavalry, scythe-bearing chariots, and Macedonian counter-tactics.
Considerations
HyDE and query expansion both lean on the foundational model's prior knowledge of the domain. If the model has no knowledge, expansions can be hallucinated, obsolete, or irrelevant, and the RAG's retrieval becomes grounded in those errors. Ask "What patterns is Alexander best known for?" and the model might generate Christopher Alexander's architectural design patterns instead of phalanxes. Wrong domain leads to wrong retrieval. Query expansion can also drift from user intent (expanding to alliances when the user only cares about the king). And GraphRAG introduces errors from poorly constructed relationships, with earlier versions of documents bleeding into the result.
References: HyDE (Gao et al. 2022); query optimization survey (Azad and Deepak 2017); LLM-era query optimization (Song and Zheng 2024); GraphRAG survey (Peng et al. 2024).
Pattern 10 — Node Postprocessing
Node Postprocessing inserts a step between retrieval and generation that increases relevance, reduces ambiguity, and personalizes responses.
Problem
Even good retrieval has issues:
- Similarity isn't relevance: a chunk full of geological terms (a table of contents) can rank top for "Describe the geology of the Grand Canyon" without containing the answer.
- Within a chunk, only some sentences help; the rest dilute the prompt.
- Ambiguous entities collide (Grand Canyon of the Colorado vs. Grand Canyon of the Yellowstone).
- Conflicting or obsolete content (multiple software versions, regional laws) ends up retrieved together.
- By default, the same answer goes to every user, with no personalization.
Solution
The core operation is reranking, plus a set of LLM-driven improvements that fold naturally into it.
Reranking
After retrieval, ask an LLM to score each chunk's relevance to the query:
```
You will be given a query and some text. Assign a relevance score between 0 and 1,
where 1 means that the text contains the answer to the question.

**Query**: {query}
**Full Text**: {node.text}
```
This is LLM-as-Judge (Pattern 17). Rerankers are far more accurate than embeddings, because embeddings compress everything into one vector and a reranker can examine the whole chunk.
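A sketch of that scoring loop in the PydanticAI style used later in this pattern (model and nodes are assumed to exist; the prompt mirrors the one above):

```python
from pydantic_ai import Agent

def llm_rerank(query, nodes, top_n=3):
    # Score each retrieved node with the relevance prompt, keep the best.
    agent = Agent(
        model, result_type=float,
        system_prompt="You will be given a query and some text. "
                      "Assign a relevance score between 0 and 1, where 1 means "
                      "that the text contains the answer to the question.")
    scored = [(agent.run_sync(
                   f"**Query**: {query}\n**Full Text**: {node.text}").data,
               node)
              for node in nodes]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [node for _, node in scored[:top_n]]
```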
Use a fine-tuned reranker like BGE to keep cost down:
```python
# pc is a Pinecone client; the BGE reranker runs via its inference API.
reranked_nodes = pc.inference.rerank(
    model="bge-reranker-v2-m3",
    query=query,
    documents=nodes,
    top_n=3,
    return_documents=True,
)
```
Reranking adds latency and cost (one LLM call per retrieved node).
Hybrid Search via Reranking
If you're going to rerank anyway, pull from multiple retrievers (BM25 + semantic), combine their lists, and rerank. That sidesteps the score-comparison problem of merging incompatible scoring scales.
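A sketch, pooling candidates from both retrievers and reusing the llm_rerank sketch above (the retriever names are again assumptions):

```python
def hybrid_rerank_retrieve(query, top_n=3):
    # Pool candidates from both retrievers; the reranker produces
    # comparable scores, so no cross-retriever normalization is needed.
    candidates = {node.node_id: node
                  for node in (bm25_retriever.retrieve(query)
                               + vector_retriever.retrieve(query))}
    return llm_rerank(query, list(candidates.values()), top_n=top_n)
```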
Query Expansion and Decomposition
Send different versions of the query to different retrievers (a synonym-expanded version to BM25, for example). Or split a complex query into subparts, retrieve each, and combine.
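A decomposition sketch under the same assumptions (the splitter prompt is illustrative, not from the book):

```python
def decompose_and_retrieve(query, top_n=3):
    # Split a complex query into sub-questions, retrieve for each,
    # then rerank the pooled candidates against the original query.
    splitter = Agent(
        model, result_type=list[str],
        system_prompt="Split the query into independent sub-questions. "
                      "Return the original query unchanged if it is simple.")
    sub_queries = splitter.run_sync(query).data
    pooled = {}
    for sub_query in sub_queries:
        for node in vector_retriever.retrieve(sub_query):
            pooled[node.node_id] = node
    return llm_rerank(query, list(pooled.values()), top_n=top_n)
```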
Filtering for Obsolete Information
Use metadata to keep only the latest version:
```python
latest_year = max([chunk.publication_year for chunk in chunks])
chunks = [c for c in chunks if c.publication_year == latest_year]
```
Detecting conflicts between chunks would need pairwise comparison: N×(N-1) LLM calls, often cost-prohibitive.
Contextual Compression
When you call an LLM to score relevance, also have it strip the chunk down to just the relevant sentences:
```python
from dataclasses import dataclass
from pydantic_ai import Agent

@dataclass
class Chunk:
    full_text: str
    relevant_text: str
    relevance_score: float

def process_node(query, node):
    # One structured-output call both scores the chunk and strips it
    # down to the relevant sentences.
    system_prompt = """
    You will be given a query and some text.
    1. Remove information from the text that is not relevant to answering the question.
    2. Assign a relevance score between 0 and 1, where 1 means that the text answers the question.
    """
    agent = Agent(model, result_type=Chunk, system_prompt=system_prompt)
    return agent.run_sync(f"**Query**: {query}\n**Full Text**: {node.text}").data
```
Folding compression into reranking limits the LLM-call count.
Disambiguation
```python
@dataclass
class DisambiguationResult:
    is_ambiguous: bool
    ambiguous_term: str
    possibility_1: str
    possibility_2: str

def disambiguate(query, node1, node2):
    system_prompt = """
    You will be given a query and two retrieved passages. Respond by saying whether
    the two passages are referring to two different entities with the same term...
    """
    agent = Agent(model, result_type=DisambiguationResult,
                  system_prompt=system_prompt)
    return agent.run_sync(
        f"**Query**: {query}\n**Passage 1**: {node1.text}\n"
        f"**Passage 2**: {node2.text}"
    ).data
```
Only N–1 calls (compare the first chunk to each subsequent one). Like compression, this can be folded into the reranking call.
Personalization and Conversation History
The postprocessing slot is a natural place to inject user context. A travel chatbot adds the user's planned travel dates so the writeup matches the season. Conversation history can be summarized and threaded in:
```python
joke = agent.run_sync('Tell me a joke.')
joke2 = agent.run_sync('Make the joke longer and add a punchline.',
                       message_history=joke.new_messages())
```
You can also pull dynamic context based on what was retrieved, not just what was asked. If the retrieved nodes are about luxury watches even though the query isn't, pull in the user's relevant search history.
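A sketch of the travel-chatbot injection, with a hypothetical user-profile object:

```python
def personalize_context(user, nodes):
    # Prepend user facts so generation matches the season and the
    # user's interests; retrieved nodes pass through unchanged.
    profile = (f"The user is traveling on {user.travel_dates} "
               f"and prefers {user.travel_style} trips.")
    return [profile] + [node.text for node in nodes]
```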
Example: Geology of the Grand Canyon
A semantic RAG with top_k=2 on geology textbooks returns the table of contents as the top chunk, then a chunk about the plateau north of the Grand Canyon. The generated answer claims the Grand Canyon has stairs "a score (20) miles in breadth", which is wrong (the Grand Canyon averages 10 miles). With top_k=4 the wrong answer persists.
With Node Postprocessing, reranking elevates a chunk that begins "Running water has gulched the walls, and weathering has everywhere attacked and driven them back…". Combined with the plateau chunk, the synthesized answer correctly attributes formation to river and wind erosion.
For the query "Name the characteristics of coal-bearing strata in Newcastle", the disambiguator catches that retrieved chunks reference both Newcastle, Pennsylvania and Newcastle, England. Production should ask the user which one.
Considerations
Rerankers are slow. They shift compute from index time to query time. Folding reranking, compression, and disambiguation into a single structured-output call (Pattern 2, Grammar) amortizes the cost. That requires a foundational model that can do all three, which specialized rerankers like BGE can't.
References: neural ranking survey (Guo et al. 2019); fine-tuned LLaMA as retriever and reranker (Ma et al. 2023); contextual compression (Verma 2024); ambiguity benchmark (Chen et al. 2021).
Pattern 11 — Trustworthy Generation
Trustworthy Generation is a set of techniques that increase users' trust in RAG answers, since errors can never be fully eliminated.
Problem
RAG fails in ways that erode trust:
- Retrieval failures surface irrelevant chunks or miss critical information.
- Context reliability suffers when retrieved content is outdated, biased, or incorrect.
- Reasoning errors arise from misinterpreting evidence across chunks.
- Hallucination risks remain, especially on complex topics where the model fabricates or blends information.
For a medical-RAG question like "What are the best treatment options for Type 1 diabetics?", the system has to convey how confident it is in the answer and warn about outdated or non-peer-reviewed sources.
Solution
A toolbox of techniques.
Out-of-Domain Detection
Decline (or route elsewhere) when the query falls outside your knowledge base. Three approaches, often best combined:
- Embedding distance: similarity scores drop steeply on out-of-domain requests, so track them over time and tune a threshold (see the sketch after this list).
- Zero-shot classification: categorize the query against labels like ["Medical", "Not Medical"] and require a minimum confidence.
- Required keywords: the query must contain a term from a domain dictionary (a medical dictionary, for example).
On detection, short-circuit the pipeline and either reply "I can't answer that" or route the query somewhere else (Google Maps for directions, say).
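A sketch of the embedding-distance check using a LlamaIndex retriever; the threshold is an illustrative value that must be tuned against logged queries:

```python
def is_out_of_domain(query, index, threshold=0.55):
    # If even the best-matching chunk is dissimilar to the query,
    # treat the request as out of domain.
    top = index.as_retriever(similarity_top_k=1).retrieve(query)
    return len(top) == 0 or top[0].score < threshold
```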
Citations
Three approaches:
- Source-level tracking generates citations from retrieval lineage using chunk metadata. Simple, but tends to over-cite.
- Classification-based approaches use a classifier to distinguish common knowledge from citation-worthy facts. More precise, but more complex (you have to provide or fine-tune the classifier).
- Token-level attribution tracks metadata through the LLM's attention mechanism so every token can be attributed to its sources. This handles paraphrasing and mixed-source attribution, but as of this writing it's an active research area with no production-ready open-source implementation.
Guardrails
Insert checks at every stage of the pipeline:
- Before retrieval: filter harmful queries via OOD detection, sanitize input to prevent prompt injection, and restrict the document store to high-trust sources.
- After retrieval, before generation: track chunk metadata, prioritize by source authority and a relevance threshold, fact-check via reflective RAG, run a privacy-compliance check, apply a freshness check (exclude chunks older than 6 months, say), enforce source diversity, and run a harmful-content scan.
- After generation: enforce citations, fact-check the final response against trusted sources, and run privacy and harmful-content scans. If anything fails, rewrite and regenerate.
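A sketch of the post-generation stage as a regenerate-on-failure loop (the check functions are placeholders for whatever guardrail tooling you adopt):

```python
def guarded_generate(query, chunks, max_attempts=3):
    # Regenerate until the response passes every post-generation check.
    for _ in range(max_attempts):
        response = generate_answer(query, chunks)
        if (has_citations(response)
                and passes_fact_check(response, chunks)
                and passes_privacy_scan(response)):
            return response
    return "I can't answer that reliably."
```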
For tooling, Guardrails AI covers PII, OOD, jailbreak, and profanity checks. DeepEval does RAG metrics and red-teaming. Ragas covers answer correctness, context recall, and related RAG metrics. All three also give you observability hooks.
Observability
Track context relevance, response relevance, faithfulness, and context recall and precision. Tools include Arize Phoenix, Comet Opik, Langfuse, and Langtrace. Transparent input/output evaluation directly increases stakeholder trust.
Human Feedback
Online: human review post-retrieval (up/down vote chunks), human review of low-confidence responses before sending. Offline: rank retrieved chunks (the basis for a fine-tuned domain-specific embedding model), review responses for correctness (the basis for fine-tuning the generation model). Feedback comes in three flavors: explicit (thumbs-up/down on chunks or responses), implicit (engagement metrics like response usage frequency), and validation/annotation (subject-matter-expert review).
Corrective RAG (CRAG)
Adds an evaluator that scores retrieved chunks before generation. If chunks are irrelevant or ambiguous, augment with web search or other dynamic sources, or decompose-then-recompose to filter out irrelevant content from chunks. CRAG composes with traditional RAG (Yan et al. 2024; LangGraph CRAG tutorial).
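A sketch of the CRAG gate (score_relevance and web_search are placeholders for the evaluator and the dynamic source):

```python
def corrective_retrieve(query, chunks, min_score=0.7):
    # Keep only chunks the evaluator scores as relevant; fall back to
    # web search when nothing in the knowledge base is good enough.
    scored = [(score_relevance(query, chunk), chunk) for chunk in chunks]
    good = [chunk for score, chunk in scored if score >= min_score]
    return good if good else web_search(query)
```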
Self-RAG
Three components: self-evaluation (critique retrieved docs for relevance and quality), adaptive retrieval (decide whether to retrieve more), and controlled generation (modulate between retrieved and parametric knowledge). All three use prompting (LLM-as-Judge, Pattern 17). The pipeline is more complex with more failure points, but it reduces hallucinations and improves explainability (Asai et al. 2023).
UI Design
Visual indicators of trust matter: citation links, source previews, confidence meters, inline citations with one-click access to sources. Progressive disclosure (simple results first, "explore deeper" options). Feedback mechanisms (thumbs up/down, correction). Filters by date or source authority. Refinement of queries based on initial responses.
Example: Adding Citations Classifier-Style
The flow is: generate the initial response using normal RAG, chunk the response (by sentence), classify each chunk as needing a citation, look up sources in the document store, insert citation markers into the response while deduping sources, and warn or regenerate when no citation source is found.
```python
import json
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate

def needs_citation(content):
    # Classify one sentence as common knowledge vs. citation-worthy.
    llm = ChatOpenAI(model_name=LLM_MODEL)
    prompt = PromptTemplate.from_template("""
    Check if the content requires citations. The return should be true or false
    in this JSON format: {{"requires_citations": true}}
    Content: {content}
    """)
    response = llm.invoke(prompt.format(content=content))
    return json.loads(response.content)["requires_citations"]

def check_sources(sentence):
    # Look up candidate citation sources for a single sentence.
    vectorstore = load_vector_store()
    similar_chunks = vectorstore.similarity_search(sentence, k=5)
    return similar_chunks

# Walk the response sentence by sentence, appending citations.
# review_sentences holds each sentence, its needs-citation flag ("review"),
# and the chunks returned by check_sources.
response_with_citations = ""
for review_sentence in review_sentences:
    response_with_citations += review_sentence["sentence"]
    if review_sentence["review"] and len(review_sentence["chunks"]) == 0:
        response_with_citations += " [Citation needed] "
    elif review_sentence["review"] and len(review_sentence["chunks"]) > 0:
        file_references = set([x.metadata["source"]
                               for x in review_sentence["chunks"]])
        citation = format_citation(file_references, file_to_citation)
        response_with_citations += citation
        response_with_citations += " "
```
When no citation source is found, mark as [Citation needed], color-code the sentence, drop it, or force regeneration.
Sample output:
The Brandenburg Concertos (BWV 1046–1051) are a collection of six instrumental works by Johann Sebastian Bach, presented to Christian Ludwig, Margrave of Brandenburg-Schwedt, in 1721. [1] These concertos are highly regarded as some of the greatest orchestral compositions of the Baroque era. [1] … Recent research has indicated that some of the material for the concertos may have been based on earlier music composed by Bach for other purposes. [1, 2]
References: [1] raw_texts/bach_brandenburg-concertos.txt [2] raw_texts/bach_mass-in-b-minor-bach.txt
Considerations
More tooling means more failure points and more latency, so evaluate the tradeoffs. Threshold-based retrieval and filtering is highly domain-specific and needs constant tuning. Strict guardrails can drop too much information and hurt UX. Verification and human-in-the-loop don't scale indefinitely, since growing knowledge bases produce more conflicts.
The alternatives when this pattern alone isn't enough are combining multiple knowledge sources (parametric, nonparametric, knowledge graphs) and focusing on explainability rather than auto-fixing: surface confidence scores, reasoning steps, and side-by-side citations and let users decide.
References: Guardrails AI docs; OpenAI Cookbook on guardrails; ML6 on guardrail intervention levels; Qi et al. 2024 and Phukan et al. 2024 on token-level attribution; Huang et al. 2024 on citation generation; Yan et al. 2024 (CRAG); Asai et al. 2023 (Self-RAG); Google NotebookLM for inline citations.
Pattern 12 — Deep Search
Deep Search uses an iterative loop of retrieval, thinking, and generation to answer complex queries that single-shot RAG can't handle.
Problem
Simple RAG breaks on a few specific shapes of question:
- Context window constraints: "Compare the economic impacts of climate change mitigation across developing and developed nations…" needs more evidence than fits in one context.
- Query ambiguity: Pattern 10 detects ambiguity but only suggests follow-up questions; it doesn't resolve the query when one meaning is far more likely.
- Information staleness: preindexed information can be outdated, with no verification mechanism.
- Reasoning depth: "What would be the implications of using transformer models with linear attention mechanisms for real-time video processing on edge devices?" requires connecting concepts across documents.
- Multihop queries: "What programming languages would be most suitable for implementing the algorithms described in the latest quantum machine learning papers from MIT?" needs to find the papers, then identify the algorithms, then choose languages.
Solution
Three additions over traditional RAG: a thinking step between retrieval and generation that asks what's missing, iteration across multiple retrieval-and-generation rounds, and external tools (search engines, enterprise APIs) rather than just one knowledge base. State is maintained across iterations, updated queries fetch new info that gets appended to context, and the entire context drives the final synthesis.
This addresses each problem differently:
- Context window constraints: accumulate knowledge across iterations and discard less relevant information.
- Query ambiguity: decompose or refine the query, with reasoning to disambiguate.
- Information staleness and verification: fact-check across multiple sources, including real-time data.
- Reasoning depth: multistep inference across iterations.
- Multihop queries fall out of the iteration model: early iterations inform later searches.
Deep Search and Deep Research differ only in output. Both iterate with a thinking step, but Deep Search produces a concise answer while Deep Research produces a long-form report. This book treats Deep Research as a Deep Search variant.
Implementing with foundational models
The reasoning steps in the pipeline (parsing the query, ranking results, extracting relevant text, finding gaps, synthesizing) all become LLM calls. Identifying gaps:
```python
def get_next_queries(original_query, sub_queries, synthesis):
    prompt = f"""Determine whether there is a logical or information gap in the
    answer based on the original query, previous sub queries, and the response. If
    the current response answers the question without any logical or information
    gaps, return an empty list. If there is a gap, provide a list of up to 3 search
    queries to fill in the gap.

    **Original Query**: {original_query}
    **Previous Sub Queries**: {sub_queries}
    **Current answer**: {synthesis}
    """
    agent = Agent(llm, result_type=list[str])
    return agent.run_sync(prompt).data
```
Iterative refinement
Start simple and let the answer drive the next step. "Compare the economic impacts…" becomes "What is the economic impact of climate change mitigation strategies?" first, then progressively fills in country-by-country and cost-vs-benefit details.
Evaluation metrics
This is the most important part of the pattern. The quality of responses depends directly on the quality of evaluation. Use a framework like Ragas and create a weighted average of relevance, comprehensiveness, accuracy and factual correctness, coherence and logical flow, citation quality, and efficiency. Add tool-specific metrics where needed (SQL correctness for SQL queries, for example). Curate a reference dataset of (question, answer) pairs and verify your scores match intuition. Patronus AI offers evaluator-trained models.
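A sketch of the weighted composite, assuming the individual metrics have already been computed (by Ragas or your own evaluators) into a dict of scores in [0, 1]; the weights are illustrative:

```python
WEIGHTS = {  # illustrative weights; tune against your reference dataset
    "relevance": 0.25, "comprehensiveness": 0.20, "accuracy": 0.25,
    "coherence": 0.10, "citation_quality": 0.10, "efficiency": 0.10,
}

def composite_score(metric_scores):
    # Weighted average over the per-metric scores.
    return sum(weight * metric_scores[name] for name, weight in WEIGHTS.items())
```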
Information integration
The synthesis step has to do cross-document entity resolution, detect contradictions and resolve them, cluster sources by perspective when contradictions are perspective-driven, rank sources by credibility (especially web sources), and apply temporal reasoning.
Example: Deep Search on Wikipedia
Retrieval
```python
import wikipedia

# WikipediaPage is the book's own container (title, url, and a
# relevant_text field filled in later), not the wikipedia library's page type.
def search_wikipedia(query):
    wikipedia.set_lang("en")
    results = wikipedia.search(query)
    pages = []
    for title in results:
        page = wikipedia.page(title)
        pages.append(WikipediaPage(title=page.title, url=page.url))
    return pages
```
For "What were the causes of the Liberian civil war?", this returns First Liberian Civil War, Liberians United for Reconciliation and Democracy, Americo-Liberian people, and similar.
Rank pages with an LLM:
```python
def rank_pages(query, pages):
    agent = Agent(model=MODEL_ID, result_type=list[WikipediaPage])
    prompt = (f"Rank these Wikipedia pages by relevance to the query: "
              f"\"{query}\".\nPages: {pages}")
    response = agent.run_sync(prompt)
    return response.data
```
Then extract relevant text from each top-ranked page using another LLM call.
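A sketch of that extraction step, an add_relevant_text helper matching the orchestration code below (assumes WikipediaPage has a mutable relevant_text field; the prompt is illustrative):

```python
def add_relevant_text(query, page):
    # Fetch the full article, then keep only the passages that bear
    # on the query.
    content = wikipedia.page(page.title).content
    agent = Agent(model=MODEL_ID, result_type=str)
    page.relevant_text = agent.run_sync(
        f'Extract the text relevant to the query "{query}" '
        f"from the following article:\n{content}").data
```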
Generation
```
Answer the following query based on the given information.

Query: {query}

Relevant information: {[page.relevant_text for page in pages]}
```
Thinking — find follow-ups
For Liberia, the model produces:
- How did the socio-economic disparities specifically manifest…?
- In what specific ways did regional instability contribute…?
- Can you elaborate on Doe-regime human rights abuses that fueled the rebellion?
Each follow-up feeds back into retrieval. Many chatbots already use this pattern visually, with follow-up questions appearing as clickable chips.
Orchestration
```python
@dataclass
class Section:
    query: str
    answer: str
    sections: list['Section']

def create_section(query):
    # One retrieval-and-generation round for a single query.
    pages = search_wikipedia(query)
    ranked_pages = rank_pages(query, pages)[:3]
    for page in ranked_pages:
        add_relevant_text(query, page)
    answer = synthesize_answer(query, ranked_pages)
    return Section(query=query, answer=answer, sections=list())

def add_subsections(parent):
    # The thinking step: turn gaps into follow-up queries.
    follow_ups = identify_gaps_and_followups(parent.query, parent.answer)
    for follow_up in follow_ups:
        section = create_section(follow_up)
        parent.sections.append(section)

def deep_search(query, depth, report=None):
    if report is None:
        report = create_section(query)
    add_subsections(report)
    if depth > 1:
        for section in report.sections:
            deep_search(section.query, depth - 1, section)
    return report

report = deep_search(
    query="What were some of the famous victories of Napoleon Bonaparte?",
    depth=1)
```
The output is a hierarchical report:
What were some of the famous victories of Napoleon Bonaparte?
- Siege of Toulon (1793), 13 Vendémiaire (1795), … Battle of Austerlitz (1805) …
## What was the significance of the Battle of Austerlitz, and why is it considered one of Napoleon's greatest victories?
- Effectively destroyed the Third Coalition…
## What were the key strategies or tactics that Napoleon employed in these battles to secure victory?
- Overwhelming enemy forces, rapid marching, concentrated firepower…
## What were the consequences of these victories in terms of Napoleon's power and the political landscape of Europe?
- First Consul for life in 1802, Emperor in 1804…
Considerations
Deep Search is slow. Multiple iterations, each with retrieval plus LLM calls. Parallelize subqueries where you can and cut iteration short when the answer is good enough. Combine with Content Optimization (Pattern 5) to tune response style based on user preferences.
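A sketch of the fan-out, parallelizing create_section from the example above with a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def add_subsections_parallel(parent, max_workers=4):
    # Follow-up queries are independent, so each subsection's
    # retrieval and synthesis can run concurrently.
    follow_ups = identify_gaps_and_followups(parent.query, parent.answer)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parent.sections.extend(pool.map(create_section, follow_ups))
```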
References: STORM (Stanford, 2024) builds Wikipedia-style articles from scratch; DeepSeek-R1 (early 2025) popularized a visible thinking mode; node-DeepResearch (open source). All the major frontier-model providers (OpenAI, Google, Anthropic) now offer "deep research" modes.
Summary
| Pattern | Problem | Solution | Usage |
|---|---|---|---|
| Basic RAG (6) | Knowledge cutoff, confidential data, hallucinations | Retrieve + ground responses | Customer service, internal search, analyst tools, legal research, competitive intelligence |
| Semantic Indexing (7) | Keyword indexing fails on complex / multi-modal docs | Embeddings of text, images, video, tables | Anything natural-language; multimodal corpora |
| Indexing at Scale (8) | Outdated / contradictory info in production | Metadata, query filtering, reranking | Production knowledge bases that evolve over time |
| Index-Aware Retrieval (9) | Question doesn't match KB vocabulary; fine details / holistic answers | Hypothetical answers (HyDE), query expansion, hybrid search, GraphRAG | Mismatched user/expert vocabulary; multi-fact questions |
| Node Postprocessing (10) | Irrelevant content, ambiguous entities, generic answers | Reranking + compression + disambiguation + personalization | Trust-critical retrieval; ambiguous-entity domains |
| Trustworthy Generation (11) | Errors can't be eliminated, preserve user trust | OOD detection, citations, guardrails, human feedback, CRAG, self-RAG, UI | High-stakes domains: medical, legal, financial |
| Deep Search (12) | Complex queries: context limits, multihop, deep reasoning | Iterative retrieve + think + generate; external tools | Research, multi-source synthesis, deep-research workflows |
The patterns compose. A production RAG might use Semantic Indexing plus Indexing at Scale (metadata) plus Index-Aware Retrieval (HyDE) plus Node Postprocessing (reranking + compression) plus Trustworthy Generation (citations + guardrails), with Deep Search reserved for the complex queries. Pick the components that match your data and your users. Over-engineering hurts UX and cost.