Chapter 6: RAG and Agents
Introduction
Models need both instructions and the right information to do a task. The last chapter covered instructions. This one is about context construction. The two dominant patterns:
- RAG (Retrieval-Augmented Generation) pulls relevant info from external memory.
- Agents use tools to act on the environment (search the web, send emails, run SQL).
RAG is mostly about constructing context. Agents do that and let models actually interact with the world. The chapter walks through both and finishes with memory, which is what keeps both alive.
Section 1: RAG
RAG augments generation by retrieving relevant information from an external memory source. That source can be an internal database, prior chat sessions, or the open internet.
The retrieve-then-generate pattern was introduced by Chen et al. (2017) in Reading Wikipedia to Answer Open-Domain Questions and got its name from Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020).
Context construction is to foundation models what feature engineering was to classical ML.
Will Long Context Make RAG Obsolete?
No.
- The data you have grows faster than context windows.
- Even models with long context don't use long context well, and every token costs latency and money.
- Anthropic's guidance: knowledge bases under 200K tokens (~500 pages) are fine to just stuff into the prompt.
1.1 RAG Architecture
A RAG system has two pieces: a retriever and a generator.
The retriever does two things:
- Indexing processes data so it can be retrieved quickly.
- Querying finds data relevant to the user's question.
Documents have to be split into manageable chunks. The book uses document and chunk interchangeably.
1.2 Retrieval Algorithms
Two main mechanisms: term-based (keyword/lexical) and embedding-based (semantic). Industry often calls the same split sparse vs. dense, but term-based vs. embedding-based is more accurate: SPLADE, for example, is embedding-based yet produces sparse vectors.
Term-Based Retrieval
Find documents containing the query keywords. Two challenges:
- Many docs may contain a term. Use TF (term frequency): more occurrences means more relevance.
- Some terms matter more than others: "for" and "at" carry less signal than "vietnamese" and "recipes". Use IDF (inverse document frequency):
IDF(t) = log(N / C(t))
TF-IDF Score(D, Q) = Σᵢ IDF(tᵢ) × f(tᵢ, D)
where N is the total number of documents, C(t) is the number of documents containing term t, and f(tᵢ, D) is how often query term tᵢ appears in document D.
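A minimal sketch of these formulas in plain Python (no retrieval library; the toy corpus is illustrative):

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_terms, corpus):
    """Score one document against a query with TF-IDF.
    corpus: list of tokenized documents (lists of terms)."""
    N = len(corpus)
    doc_counts = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        c_t = sum(1 for doc in corpus if term in doc)  # C(t): docs containing the term
        if c_t == 0:
            continue
        idf = math.log(N / c_t)
        tf = doc_counts[term]                          # f(t, D): raw term frequency in D
        score += idf * tf
    return score

corpus = [
    ["vietnamese", "recipes", "for", "dinner"],
    ["recipes", "for", "banana", "bread"],
    ["machine", "learning", "for", "beginners"],
]
query = ["vietnamese", "recipes"]
print([tf_idf_score(query, doc, corpus) for doc in corpus])
```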
Common implementations:
- Elasticsearch, built on Lucene, uses an inverted index mapping terms to documents:
| Term | Doc count | (doc index, term frequency) |
|---|---|---|
| banana | 2 | (10, 3), (5, 2) |
| machine | 4 | (1, 5), (10, 1), (38, 9), (42, 5) |
| learning | 3 | (1, 5), (38, 7), (42, 5) |
- BM25 modifies TF-IDF by normalizing TF by document length. Variants: BM25+, BM25F. Still a tough baseline.
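BM25 is easy to try without standing up Elasticsearch; a minimal sketch assuming the third-party rank_bm25 package:

```python
# pip install rank_bm25  (assumed third-party package)
from rank_bm25 import BM25Okapi

corpus = [
    "easy vietnamese recipes for dinner",
    "banana bread recipe with ripe bananas",
    "introduction to machine learning",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "vietnamese recipes".split()
print(bm25.get_scores(query))               # BM25 score per document
print(bm25.get_top_n(query, corpus, n=1))   # best-matching original document
```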
Tokenization matters. Convert to lowercase, drop punctuation and stopwords, treat common n-grams as single terms (so "hot dog" doesn't split into "hot" + "dog").
Embedding-Based Retrieval
Rank by semantic similarity. Querying is two steps:
- Embedding model converts the query to an embedding.
- Retriever fetches the top-k nearest data chunks (by cosine similarity).
The DB that stores embeddings is a vector database. The hard part isn't storage. It's vector search.
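A minimal sketch of the two-step query flow with a brute-force cosine scan (the naive k-NN baseline discussed next); embed() is a hypothetical stand-in for a real embedding model:

```python
import numpy as np

def embed(texts):
    """Hypothetical stand-in: call your real embedding model here."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))

def top_k(query, chunks, chunk_embeddings, k=3):
    q = embed([query])[0]
    # cosine similarity between the query and every indexed chunk
    sims = chunk_embeddings @ q / (
        np.linalg.norm(chunk_embeddings, axis=1) * np.linalg.norm(q)
    )
    best = np.argsort(-sims)[:k]
    return [(chunks[i], float(sims[i])) for i in best]

chunks = ["refund policy chunk", "shipping times chunk", "pricing tiers chunk"]
chunk_embeddings = embed(chunks)            # indexing: done once, stored in a vector DB
print(top_k("how do refunds work?", chunks, chunk_embeddings, k=2))
```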
Vector Search Algorithms
- k-NN (naive) is O(N) scan. Precise but slow.
- ANN (approximate nearest neighbor) is what's actually used in practice:
- LSH (Locality-Sensitive Hashing): hash similar vectors to the same buckets.
- HNSW (Hierarchical Navigable Small World): multi-layer graph traversal.
- Product Quantization: decompose vectors into lower-dim subvectors. Backbone of FAISS.
- IVF (Inverted File Index): k-means cluster the vectors and search closest centroids first.
- Annoy (Spotify): random binary tree forest.
Libraries: FAISS, ScaNN (Google), Annoy (Spotify), Hnswlib, SPTAG (Microsoft), FLANN.
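A minimal FAISS sketch (assuming the faiss-cpu package) contrasting the exact flat index with an HNSW index; random vectors stand in for real embeddings:

```python
# pip install faiss-cpu numpy  (assumed)
import faiss
import numpy as np

d = 384                                              # embedding dimension
xb = np.random.rand(10_000, d).astype("float32")     # indexed chunk embeddings
xq = np.random.rand(1, d).astype("float32")          # query embedding

flat = faiss.IndexFlatL2(d)                          # exact k-NN: O(N) scan per query
flat.add(xb)
distances, ids = flat.search(xq, 5)
print(ids)

hnsw = faiss.IndexHNSWFlat(d, 32)                    # ANN: graph traversal, 32 links/node
hnsw.add(xb)
distances, ids = hnsw.search(xq, 5)                  # faster, small recall trade-off
print(ids)
```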
Comparing the Two
| | Term-based | Embedding-based |
|---|---|---|
| Speed | Much faster | Embedding gen + vector search slower |
| Performance | Strong out of the box, hard to improve | Can outperform after finetuning; handles natural queries |
| Cost | Cheaper | Embedding + vector storage/search expensive (1/5 to 1/2 of API spend not uncommon) |
Retriever Quality Metrics
- Context precision: fraction of retrieved docs that are relevant.
- Context recall: fraction of relevant docs that were retrieved. Hard to compute in production because it requires labeling all docs.
- NDCG, MAP, MRR when ranking matters.
- For embeddings: MTEB (Muennighoff et al., 2023).
- ANN-Benchmarks scores ANN algorithms on recall, QPS (queries/sec), build time, and index size.
- BEIR (Benchmarking IR) covers 14 retrieval benchmarks.
Evaluate the retriever, the embeddings, and the end-to-end RAG output separately.
Combining Retrieval Algorithms (Hybrid Search)
- Sequential. Cheap retriever fetches candidates, expensive reranker reorders. Term-based finds anything containing "transformer", then vector search filters to the neural-architecture meaning.
- Ensemble parallel. Multiple retrievers in parallel, combined with Reciprocal Rank Fusion (RRF):
Score(D) = Σᵢ 1 / (k + rᵢ(D))
where k≈60 controls the influence of lower-ranked docs.
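RRF is simple enough to implement directly; a minimal sketch with k = 60:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Combine multiple ranked lists of doc IDs into one fused ranking.
    rankings: list of lists, each ordered best-first by one retriever."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc3", "doc1", "doc7"]
vector_ranking = ["doc1", "doc9", "doc3"]
print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
# doc1 and doc3 rise to the top because both retrievers rank them highly
```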
1.3 Retrieval Optimization
Chunking Strategies
- Equal-length by character, word, sentence, or paragraph.
- Recursive splitting: section → paragraph → sentence as needed.
- Specialized splitting: per-language splitters for code, Q&A pairs kept as single chunks, and different rules for different document types and natural languages.
- Overlap. Stops key info from being cut off (chunk size 2,048 chars + 20-char overlap).
- Chunk size shouldn't exceed model context (or the embedding model's context).
- Token-based chunking uses the generative model's tokenizer. Switching models means reindexing.
Smaller chunks mean more diverse results per query, but information can be lost across chunk boundaries. They also increase indexing and storage overhead (halving chunk size roughly doubles the number of chunks).
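A minimal character-level chunker with overlap, using the example sizes above:

```python
def chunk_text(text, chunk_size=2048, overlap=20):
    """Split text into fixed-size character chunks with a small overlap,
    so information on a boundary appears in both neighboring chunks."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "A" * 5000
chunks = chunk_text(doc)
print(len(chunks), [len(c) for c in chunks])   # 3 chunks: 2048, 2048, 944 chars
```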
Reranking
Rerank for accuracy or to fit a context budget:
- Cheap retriever, then expensive reranker.
- Time-based reranking for time-sensitive apps (news, emails, stocks).
- In context construction, exact rank matters less than it does in search results, but it still matters: models attend more to information at the start and end of the context.
Query Rewriting
Also called query reformulation, normalization, or expansion.
User: When was the last time John Doe bought something from us?
AI: John last bought a Fruity Fedora hat from us two weeks ago, on January 3, 2030.
User: How about Emily Doe?
The last query is ambiguous in isolation. Rewriting it (often via another model) gives "When was the last time Emily Doe bought something from us?"
Identity resolution complicates this. "How about his wife?" requires looking up the wife.
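A minimal query-rewriting sketch, assuming an OpenAI-style chat API; the model name and prompt wording are illustrative:

```python
from openai import OpenAI  # assumed OpenAI-style client

client = OpenAI()

def rewrite_query(conversation_history: str, latest_query: str) -> str:
    """Turn a context-dependent follow-up into a standalone query."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system",
             "content": "Rewrite the user's last question so it is fully "
                        "self-contained, resolving pronouns and references "
                        "from the conversation. Output only the rewritten question."},
            {"role": "user",
             "content": f"Conversation:\n{conversation_history}\n\n"
                        f"Last question: {latest_query}"},
        ],
    )
    return response.choices[0].message.content

# "How about Emily Doe?" -> "When was the last time Emily Doe bought something from us?"
```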
Contextual Retrieval
Augment chunks with extra context to improve retrieval:
- Tags, keywords, descriptions, reviews.
- Auto-extracted entities (error codes like EADDRNOTAVAIL).
- Augment with questions the chunk can answer (great for customer support).
- Augment with the original document title or summary so chunks know their parent context.
- Anthropic uses an LLM to generate a 50-100 token context summary per chunk:
<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.
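Applying a prompt like the one above per chunk might look like this; llm is a placeholder for whatever model call you use:

```python
def situate_chunk(whole_document: str, chunk: str, llm) -> str:
    """Ask a model for a short context summary that situates the chunk."""
    prompt = (
        f"<document>\n{whole_document}\n</document>\n"
        "Here is the chunk we want to situate within the whole document:\n"
        f"<chunk>\n{chunk}\n</chunk>\n"
        "Please give a short succinct context to situate this chunk within "
        "the overall document for the purposes of improving search retrieval "
        "of the chunk. Answer only with the succinct context and nothing else."
    )
    return llm(prompt)  # llm: any callable that returns the model's text

def build_contextualized_chunks(document: str, chunks: list[str], llm) -> list[str]:
    # Prepend the generated context to each chunk before indexing it
    return [f"{situate_chunk(document, c, llm)}\n\n{c}" for c in chunks]
```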
Evaluating Retrieval Solutions
When picking a retriever or vector DB, check:
- Retrieval mechanisms (hybrid?).
- Embedding models / vector search algorithms supported.
- Scalability (storage and query).
- Indexing throughput / bulk processing.
- Query latency per algorithm.
- Pricing (volume? queries?).
1.4 RAG Beyond Texts
Multimodal RAG
If the generator handles multiple modalities, retrieve images, audio, and video too.
Workflow:
- Generate CLIP embeddings for all texts and images.
- Embed the query.
- Retrieve closest items across modalities.
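A minimal sketch of that workflow, assuming the sentence-transformers CLIP wrapper and placeholder image paths:

```python
# pip install sentence-transformers pillow  (assumed packages)
import torch
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")   # CLIP: texts and images share one space

texts = ["a recipe for banana bread", "quarterly sales figures for hats"]
images = [Image.open(path) for path in ["fedora.jpg", "chart.png"]]  # placeholder paths

text_emb = model.encode(texts, convert_to_tensor=True)
image_emb = model.encode(images, convert_to_tensor=True)
corpus_emb = torch.cat([text_emb, image_emb])  # one shared index across modalities

query_emb = model.encode(["photo of a fruity fedora hat"], convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
print(hits)  # corpus_id indexes into texts + images, ranked by cosine similarity
```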
RAG with Tabular Data
For structured data, the workflow becomes:
- Text-to-SQL (semantic parsing).
- SQL execution.
- Generation based on results.
If you have many tables and the schemas exceed context, add an intermediate "predict relevant tables" step.
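A minimal sketch of the three-step tabular workflow against SQLite; llm is a placeholder for the model call, and executing generated SQL blindly is unsafe outside a sandbox:

```python
import sqlite3

def answer_from_table(question: str, db_path: str, llm) -> str:
    """llm: any callable that takes a prompt string and returns model text."""
    conn = sqlite3.connect(db_path)
    schema = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type='table'"
    ).fetchall()

    # Step 1: text-to-SQL (semantic parsing), conditioned on the schema
    sql = llm(f"Schema:\n{schema}\n\nWrite one SQLite query that answers: {question}")

    # Step 2: SQL execution (validate/sandbox generated SQL before running it for real)
    rows = conn.execute(sql).fetchall()

    # Step 3: generation based on the results
    return llm(f"Question: {question}\nQuery results: {rows}\nAnswer concisely.")
```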
Section 2: Agents
"The study and design of rational agents." — Russell & Norvig
An agent is anything that perceives its environment and acts on it. It's characterized by:
- Environment. Minecraft, the internet, a kitchen, a road system.
- Tools / actions it has access to.
ChatGPT is an agent (web search, code execution, image gen). RAG systems are agents (text/image retrievers, SQL executors are tools).
Agents need stronger models for two reasons:
- Compound mistakes. 95% per-step accuracy turns into 60% over 10 steps and 0.6% over 100.
- Higher stakes. Actions in the world have real consequences.
2.1 Tools
Three categories:
Knowledge Augmentation (Context Construction)
- Text/image retrievers, SQL executors.
- Internal knowledge: people search, inventory APIs, Slack retrieval, email reader.
- Web browsing keeps models from going stale. Covers search, news, GitHub, X, Reddit, LinkedIn APIs. Internet APIs need careful selection.
Capability Extension
These address inherent model weaknesses:
- Calculator fixes arithmetic.
- Calendar, timezone converter, unit converter, translator.
- Code interpreter runs Python, etc. Security risk: code injection. Must be sandboxed.
- Modality conversion. DALL-E for image gen, OCR, transcription, image captioning, LaTeX rendering, browser for HTML.
Chameleon (Lu et al., 2023): GPT-4 with 13 tools beat GPT-4 alone:
- ScienceQA: +11.37% over the best published few-shot.
- TabMWP (tabular math): +17%.
Write Actions
Tools that change the environment: SQL DELETE/UPDATE, send email, initiate bank transfers. Critical for full automation but they need trust and security.
The right tools enable models the way Excel and cranes enabled humans. Most providers now support function calling.
2.2 Planning
A task is a goal plus constraints (a 2-week SF→India trip with a $5,000 budget, for example).
Planning is hard. Don't do it inline with execution. A 1,000-step bogus plan can burn time and money. Decouple planning from execution:
Three components: plan generator, plan validator, executor. With multiple agents, this becomes a multi-agent system.
Validate plans via:
- Heuristics. Invalid actions, too many steps.
- AI judges. Does the plan look reasonable?
Generate multiple plans in parallel and pick the best.
An intent classifier (another agent) helps planning. For out-of-scope intents, classify as IRRELEVANT and politely refuse. Saves FLOPs.
Humans can plug in at any stage: provide a high-level plan, validate, execute risky actions. Define automation level per action.
Process Summary
- Plan generation (task decomposition).
- Reflection / error correction on the plan.
- Execution (function calls).
- Reflection / error correction on outcomes.
Foundation Models as Planners
LeCun and Kambhampati argue autoregressive LLMs can't plan. The counter-argument:
- A model that detects a bad path can revise toward a different action, effectively backtracking.
- "Reasoning with Language Model is Planning with World Model" (Hao et al., 2023): LLMs encode enough world knowledge to predict outcomes, which lets them produce coherent plans.
- Search + state-tracking augmentation also helps.
FM agents vs. RL agents. An RL agent trains its planner with reinforcement learning; an FM agent uses the model as the planner directly (prompted or finetuned), which takes far less time and resources. The two approaches are likely to converge.
Plan Generation via Prompting
SYSTEM PROMPT
Propose a plan to solve the task. You have access to 5 actions:
get_today_date()
fetch_top_products(start_date, end_date, num_products)
fetch_product_info(product_name)
generate_query(task_history, tool_output)
generate_response(query)
The plan must be a sequence of valid actions.
Examples
Task: "Tell me about Fruity Fedora"
Plan: [fetch_product_info, generate_query, generate_response]
Task: "What was the best selling product last week?"
Plan: [fetch_top_products, generate_query, generate_response]
Task: {USER INPUT}
Plan:
Parameters often have to be inferred from earlier tool outputs. When the info isn't enough, models guess. Guesses can be wrong.
To improve planning:
- Better system prompt and more examples.
- Better tool descriptions.
- Refactor complex functions into simpler ones.
- Stronger model.
- Finetune for plan generation.
Function Calling
Workflow:
- Create a tool inventory by declaring each function: name, params, doc.
- Specify allowed tools per query:
- required: must use a tool.
- none: no tools.
- auto: model decides.
Example response:
ModelResponse(
    finish_reason='tool_calls',
    message=chat.Message(
        content=None,
        role='assistant',
        tool_calls=[
            ToolCall(
                function=Function(arguments='{"lbs":40}', name='lbs_to_kg'),
                type='function',
            )
        ],
    ),
)
Always inspect the parameter values used in each function call.
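End to end, declaring a tool and letting the model decide whether to call it might look like this, assuming an OpenAI-style chat API (model name illustrative):

```python
import json
from openai import OpenAI  # assumed OpenAI-style client

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "lbs_to_kg",
        "description": "Convert a weight from pounds to kilograms.",
        "parameters": {
            "type": "object",
            "properties": {"lbs": {"type": "number", "description": "weight in pounds"}},
            "required": ["lbs"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",                                  # illustrative model name
    messages=[{"role": "user", "content": "How many kilograms is 40 lbs?"}],
    tools=tools,
    tool_choice="auto",                                   # or "required" / "none"
)

call = response.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
# Inspect the arguments before actually executing the function.
```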
Planning Granularity
A more detailed plan is easier to execute but harder to generate; a higher-level plan is easier to generate but harder to execute. Hierarchical planning combines both. Using exact function names couples the planner to the tool inventory: renaming get_time() to get_current_time() means updating prompts or re-finetuning. Natural-language plans are more robust, but they need a translator (like Chameleon's program generator) to turn them into executable calls. Translation is much simpler than planning.
Complex Plans (Control Flows)
- Sequential. B after A.
- Parallel. A and B simultaneously (fetch top 100 products + fetch each price).
- If. Branch on outcome.
- For loop. Repeat until condition met.
Parallel control flows reduce perceived latency. Check what your agent framework supports.
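A minimal sketch of the parallel pattern with asyncio; fetch_price is a hypothetical async tool:

```python
import asyncio

async def fetch_price(product: str) -> tuple[str, float]:
    """Hypothetical tool call; a real implementation would hit an API."""
    await asyncio.sleep(0.1)          # simulate network latency
    return product, 9.99

async def main():
    products = ["Fruity Fedora", "Banana Beanie", "Cherry Cap"]
    # Parallel: all price lookups run concurrently instead of one after another
    prices = await asyncio.gather(*(fetch_price(p) for p in products))
    print(dict(prices))

asyncio.run(main())
```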
Reflection and Error Correction
Reflection points:
- After receiving a query: is it feasible?
- After plan generation: does it make sense?
- After each step: on track?
- After completion: task done?
ReAct (Yao et al., 2022) interleaves reasoning and action:
Thought 1: ...
Act 1: ...
Observation 1: ...
...
Thought N: ...
Act N: Finish [Response to query]
Reflexion (Shinn et al., 2023) adds a separate evaluator and self-reflection module. The agent proposes a new "trajectory" each step. Coding example: code fails 1/3 of test cases, agent reflects on the missing edge case (all-negative arrays), generates new code.
The cost is in latency and tokens. Thoughts and observations consume tokens. Examples in prompts increase input tokens significantly.
2.3 Tool Selection
Tool choice depends on environment, task, and model. The literature spans a wide range. Toolformer uses 5 tools, Chameleon 13, Gorilla 1,645 APIs.
More tools means more capability and more difficulty using them. Selection guidance:
- Compare agent performance with different tool sets.
- Ablation. Remove tools that don't hurt performance.
- Spot tools the agent frequently misuses. Fix the prompt or finetune, or replace.
- Plot the tool-call distribution.
Lu et al. (2023):
- Different tasks need different tools (ScienceQA leans on knowledge retrieval, TabMWP on math).
- Different models prefer different tools (GPT-4 picks broader, ChatGPT favors image captioning).
Framework comparisons: AutoGPT focuses on social media APIs. Composio focuses on enterprise APIs.
Tool Transition
After tool X, how often is tool Y called next? Frequently-paired tools can be combined into composite tools.
Voyager (Wang et al., 2023) uses a skill manager that tracks new skills (coding programs). Successful new skills go into a skill library for reuse.
2.4 Agent Failure Modes and Evaluation
Benchmarks: Berkeley Function Calling Leaderboard, AgentOps eval harness, TravelPlanner.
Planning Failures
Tool-use failures:
- Invalid tool. Calls bing_search, which isn't in the inventory.
- Valid tool, invalid params. Wrong number of params.
- Valid tool, incorrect param values. lbs=100 when it should be 120.
Goal failures:
- Plan doesn't solve the task.
- Plan ignores constraints (over budget, wrong city).
- Time is the often-overlooked constraint. A late grant proposal is useless.
Reflection errors: model declares completion when it isn't done (assigns 40 of 50 people to rooms but says it's finished).
Evaluate via a (task, tool inventory) dataset. Generate K plans per task. Track:
- Percentage of valid plans.
- Average plans needed to get a valid one.
- Percentage of valid tool calls.
- Frequency of invalid tool, wrong params, wrong values.
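A minimal harness for these metrics; generate_plan is the planner under test and is_valid_plan encodes your heuristics (both hypothetical callables):

```python
def evaluate_planner(tasks, tool_inventory, generate_plan, is_valid_plan, k=5):
    """Generate k plans per task; report the validity metrics listed above."""
    valid, total, first_valid_attempts = 0, 0, []
    for task in tasks:
        first_valid = None
        for attempt in range(1, k + 1):
            plan = generate_plan(task, tool_inventory)
            total += 1
            if is_valid_plan(plan, tool_inventory):
                valid += 1
                if first_valid is None:
                    first_valid = attempt
        if first_valid is not None:
            first_valid_attempts.append(first_valid)
    return {
        "pct_valid_plans": valid / total,
        "avg_plans_until_valid": (
            sum(first_valid_attempts) / len(first_valid_attempts)
            if first_valid_attempts else None
        ),
        "tasks_never_solved": len(tasks) - len(first_valid_attempts),
    }
```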
Tool Failures
- Wrong outputs from a correct tool (bad caption, bad SQL).
- Translation errors (NL plan to executable command).
- Missing tool. Task can't be solved with the current inventory (e.g., needs internet).
Test each tool independently. Print every tool call and output for inspection.
Efficiency
- Average steps per task.
- Average cost per task.
- Per-action time and cost. Watch for expensive outliers.
Human and AI efficiency aren't directly comparable. Visiting 100 web pages is slow for a human and trivial for AI in parallel.
Section 3: Memory
Memory is the mechanism for retaining and using information. Three types in AI:
| Type | Description | Capacity / persistence |
|---|---|---|
| Internal knowledge | What the model learned during training | Doesn't change without retraining; available for all queries |
| Short-term memory | The model's context | Fast, limited, single-session |
| Long-term memory | External data via retrieval | Persistent across tasks, deletable without retraining |
Human analogy: knowing how to breathe is internal knowledge. The name of someone you just met is short-term. Books and notes are long-term.
3.1 Why You Need a Memory System
- Manage information overflow within a session.
- Persist information between sessions. An AI coach has to remember your situation.
- Boost consistency. Model can reference its prior answers.
- Maintain data structural integrity. Excel sheet for sales leads, queue for an action sequence.
3.2 Memory Management
Two functions: management (add and delete) and retrieval (similar to RAG retrieval). Long-term memory is cheap and extensible. Short-term memory is bounded by context length.
Allocate a context budget. For example, reserve 30% for retrieved info and at most 70% for short-term memory. Overflow goes to long-term.
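A minimal sketch of enforcing that budget with FIFO truncation; count_tokens is a placeholder for your model's tokenizer, and messages are assumed to be OpenAI-style dicts:

```python
def count_tokens(text: str) -> int:
    """Placeholder: swap in the tokenizer that matches your model."""
    return len(text.split())

def fit_short_term_memory(messages, context_window=8_000, short_term_share=0.7):
    """Keep the most recent messages that fit the short-term budget (FIFO drop).
    Everything dropped here would be written to long-term memory instead."""
    budget = int(context_window * short_term_share)
    kept, used = [], 0
    for message in reversed(messages):          # newest first
        tokens = count_tokens(message["content"])
        if used + tokens > budget:
            break
        kept.append(message)
        used += tokens
    overflow = messages[: len(messages) - len(kept)]
    return list(reversed(kept)), overflow       # (in-context, send to long-term)
```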
Strategies
- FIFO. Drop the oldest. Used by OpenAI / LangChain (last N messages or tokens). Risk: critical early messages (purpose statements) get dropped.
- Redundancy removal. Summarize the conversation, track named entities.
- Bae et al. (2022): summary plus a classifier per sentence (memory only / summary only / both / neither) gives the new memory.
- Liu et al. (2023): reflection-based. After each action, decide whether to insert, merge, or replace memory (especially when new info contradicts old).
How to handle contradictions depends on the use case. Keep newer? Use AI to judge? Keep both for diverse perspectives?
Summary
- RAG retrieves relevant info from external memory and uses it to generate responses. Retriever quality is decisive.
- Term-based retrieval (Elasticsearch, BM25) is fast and strong out of the box. Embedding-based retrieval can outperform after finetuning but is more expensive.
- Vector search algorithms (LSH, HNSW, Product Quantization, IVF, Annoy) make embedding retrieval scalable.
- Hybrid search combines retrievers sequentially (cheap + reranker) or in parallel (RRF).
- Optimizations: chunking strategy, reranking, query rewriting, contextual retrieval.
- RAG goes beyond text. Multimodal RAG (images, audio, video) and tabular RAG (text-to-SQL).
- Agents are AI plus tools plus environment. Tools augment knowledge, extend capabilities, and enable write actions.
- Planning is a search problem. Decouple planning from execution. Use intent classification, hierarchical planning, natural-language plans plus a translator. Handle complex control flows (sequential, parallel, if, for).
- Reflection (ReAct, Reflexion) significantly improves agent quality.
- Tool selection is empirical. Different models and tasks prefer different tool sets. Watch tool transitions and build composite skills (Voyager).
- Failure modes: planning (tool/goal/reflection errors), tool (wrong outputs, translation errors, missing tools), efficiency (steps, cost, time).
- Memory comes in three flavors: internal, short-term (context), and long-term (external retrieval), with FIFO, summarization, or reflection-based management strategies.