Generative AI Design Patterns by Valliappa Lakshmanan & Hannes Hapke

Chapter 3: Adding Knowledge — Basics

Introduction

Foundational models are closed systems whose knowledge ends at their training cutoff. Retraining is wildly expensive (2025-era frontier models cost tens of millions of dollars to train), and even continued pretraining of a 13B model on a million pages of new content costs tens of thousands of dollars. The practical answer is Retrieval-Augmented Generation (RAG), introduced by Lewis et al. in 2020, which adds knowledge at runtime by fetching relevant content and dropping it into the prompt context.

The patterns in Chapters 3 and 4 stack. Each pattern fixes a limitation of the one before it, climbing a sophistication ladder, and they're meant to be read in order.

This chapter is the bass line: Basic RAG (Pattern 6), Semantic Indexing (Pattern 7), and Indexing at Scale (Pattern 8). Chapter 4 layers in advanced retrieval, postprocessing, trustworthy generation, and deep search.


Pattern 6 — Basic RAG

A basic RAG system has three stages: indexing, retrieval, and generation. Almost nobody runs Basic RAG as-is in production. Its limitations are what motivate the rest of the chapter.

Problem

Foundational models pretrain on Common Crawl, Wikipedia, arXiv, GitHub, Reddit, EDGAR, Project Gutenberg, and similar sources. Enterprise needs run past what's in there. The model has a static knowledge cutoff, so it can't know events after its training date. Even huge models are lossy compressions of their training set, which means capacity limits start to bite. And private data such as confidential reports, subscriber-only research, and customer order history is absent by definition.

When asked about something outside its training, the model still picks token continuations. That's useful for creative writing ("a poem in the style of Rumi about a lover in a different time zone") and bad for facts. You get hallucinations, which are plausible-sounding but factually wrong, and you get no way to cite sources, because the generated text isn't tied to any.

Solution: Grounding

LLMs preferentially use information present in the prompt when generating. Adding relevant text to the context grounds the response.

Priming

Asking "Suggest three small cities to visit in Europe…" gives wildly different answers each time. Prepend "The best food in France is found in Lyon" before the same question and you reliably get foodie cities. That's the priming effect, and the same trick can override existing knowledge:

The Seahawks traded two offensive stars over the weekend, with receiver DK
Metcalf going to the Steelers and quarterback Geno Smith headed to the Raiders.

Who does Geno Smith play for?

The model says "Raiders" even though it was trained when he was a Seahawk. The same approach injects confidential or personal information, like a customer's recent orders for a support ticket.

Always tell the model what to do when no match is found. Add "Say none of them if the message does not match any of the above orders", and if you're using Grammar (Pattern 2), make sure "None of them" is one of the allowed responses.
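For example, a minimal order-matching prompt with that fallback spelled out (the order details are illustrative):

Here are the customer's recent orders:
1. Order 1001: trail running shoes, delivered March 3
2. Order 1002: rain jacket, shipped March 10

Which of the above orders is the following message about?
Say "None of them" if the message does not match any of the above orders.

Message: My package arrived with a broken zipper.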

Relevance and runtime compute

Adding all knowledge to the prompt isn't feasible, so you can only add chunks that are relevant to the query. Identifying relevance needs the query, which means it has to happen at runtime. That makes RAG a form of runtime compute.

So a RAG system has two jobs: index chunks for fast retrieval, and retrieve the most relevant ones at query time.

The Two Pipelines

The indexing pipeline runs offline (batch or event-triggered). It converts source documents into indexed chunks and stores them in a document store. The question-answering pipeline runs at query time, retrieves the relevant chunks, and generates the answer using them.

Indexing

Chunks should be information-dense because prompt tokens are limited. Strip excess whitespace and attach metadata so you can cite later. With LlamaIndex:

import re
import time

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore

# Strip excess whitespace so chunks stay information-dense
content = text[start_pos:end_pos].strip()
content = re.sub(r'\n{3,}', '\n\n', content)

# Attach metadata so answers can be cited later
document = Document(
    text=content,
    metadata={
        "source": url,
        "filename": filename,
        "date_loaded": time.strftime("%Y-%m-%d %H:%M:%S")
    }
)

# Split into overlapping chunks ("nodes" in LlamaIndex)
node_parser = SentenceSplitter(chunk_size=200, chunk_overlap=20)
nodes = node_parser.get_nodes_from_documents([document])

# In-memory document store; swap for a persistent backend in production
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)

LlamaIndex calls chunks nodes, anticipating knowledge-graph organization. The 20-token overlap (chunk_overlap=20) preserves information across chunk boundaries. The default SimpleDocumentStore is in-memory; LlamaIndex also supports MongoDB, Postgres, Redis, and Firestore:

from llama_index.storage.docstore.firestore import FirestoreDocumentStore

docstore = FirestoreDocumentStore.from_database(project="project-id", database="(default)")

Retrieval — TF-IDF and BM25

To find chunks relevant to "Describe the relationship between Alexander and Diogenes", use term frequency × inverse document frequency:

tfidf(chunk, term) = count(term, chunk) / Σ_t count(t, chunk)
                     × log(count(chunks) / count(chunks containing term))

Stop words like "the" are dropped. Sum tfidf across query terms for total chunk relevance.

There's a problem with raw TF-IDF. Alexander appears 1,311 times and Diogenes only 6 times in the source, so Alexander's tfidf dominates and retrieved chunks may not actually mention Diogenes. The fix is to saturate the numerator: count / (count + k) for some k > 0, so that very frequent terms stop dominating. BM25 does exactly that, plus an information-theoretic correction in the denominator.
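As a rough illustration of that saturation (the value of k here is just a tuning knob, not something from the book):

def saturated_tf(count, k=1.2):
    # The contribution approaches 1 as the raw count grows, so a term that
    # appears 1,311 times no longer drowns out one that appears 6 times
    return count / (count + k)

saturated_tf(1311)   # ~0.999
saturated_tf(6)      # ~0.83

Build a BM25 retriever with LlamaIndex: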

from llama_index.retrievers.bm25 import BM25Retriever

query = "Describe the relationship between Alexander and Diogenes"

retriever = BM25Retriever.from_defaults(
    docstore=docstore,
    similarity_top_k=5)

retrieved_nodes = retriever.retrieve(query)

Sample retrieved node:

Node ID: ee1ef41e-3e31-4e07-9949-5e585a50651c
Similarity: 4.2463765144348145
Text: But Diogenes said that he wanted nothing else, except that he and his
attendants would stand out of the sunlight. Alexander is said to have expressed
his admiration of Diogenes's conduct.

Generation

Drop relevant chunks into the prompt as system messages, then put the query last:

import os

from llama_index.core.llms import ChatMessage
from llama_index.llms.anthropic import Anthropic

# System messages carry the retrieved chunks; the user query goes last
messages = [
    ChatMessage(role="system",
                content="Use the following text to answer the given question.")
]
messages += [ChatMessage(role="system", content=node.text) for node in retrieved_nodes]
messages += [ChatMessage(role="user", content=query)]

llm = Anthropic(
    model="claude-3-7-sonnet-latest",
    api_key=os.environ['ANTHROPIC_API_KEY'],
    temperature=0.2
)
response = llm.chat(messages)

The response is grounded in the retrieved chunks. Send the chunks back to the caller too, so they can build a citations list.
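For instance, a minimal citations list can be assembled from the metadata attached at indexing time (field names follow the indexing example above):

citations = [
    {
        "source": node.metadata.get("source"),
        "filename": node.metadata.get("filename"),
        "snippet": node.text[:200],
    }
    for node in retrieved_nodes
]

Return citations alongside response so the caller can display sources.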

Example: Equipment Manual

For "What should I do if the diaphragm is ruptured?", BM25 retrieves the right paragraphs from the manual:

Node ID: 6afc9709-…
Text: Inspect to see if diaphragm is intact. If diaphragm is ruptured,
replace the safety head with an unbroken head.
Score: 4.869

The generated answer correctly says to replace the safety head and explains the surrounding hand-unscrew / yoke-block caveats.

Ingesting PDFs

PDFs mix text, tables, images, headers and footers, and nonlinear layouts. There are four common approaches: direct text extraction with libraries like unstructured, docling, papermage, and marker, each with its own strengths (papermage for academic papers, marker for Markdown output); AI-enabled parsing like LlamaParse for complex documents (it extracts text, tables, images, and equations); multimodal LLMs like GPT-4o, Gemini-2.5, Gemma3, or Llama4 fed page screenshots; and managed services like Vertex AI RAG Engine and Glean that accept PDFs directly.

Considerations

Why none of the Chapter 2 patterns add knowledge

Logits Masking can't add knowledge because you can only mask tokens the model would actually generate; tokens about Pope Leo XIV are vanishingly unlikely from a model whose training cutoff predates his 2025 election. Few-shot prompting, fine-tuning, and instruction tuning shape how the model responds, but they can't supply a fact the model never saw and that isn't present at inference time.

RAG vs. large context window

If the document fits in the model's context window (200K to 2M tokens, depending on the model), skip retrieval and load the whole thing. For querying a tax return:

from google import genai
from google.genai import types

client = genai.Client()  # assumes the Gemini API key is set in the environment
GEMINI = "gemini-2.5-flash"  # illustrative model name; any Gemini model with context-caching support works

def answer_question(prompt, cached_tax_return):
    # cached_tax_return is the resource name of a previously created context cache
    response = client.models.generate_content(
        model=GEMINI,
        contents=prompt,
        config=types.GenerateContentConfig(cached_content=cached_tax_return)
    )
    return response.text

answer_question("How much did Obama claim in business expenses?", "cachedContents/wc0yof...")

Use server-side Prompt Caching (Pattern 25) so you don't keep resending the document. The model processes the document once and reuses the cached computation for every later question.
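A minimal sketch of creating that cache with the google-genai SDK (the file path and TTL are illustrative):

# Upload the tax return once, then cache it server-side
uploaded_pdf = client.files.upload(file="tax_return.pdf")

cache = client.caches.create(
    model=GEMINI,
    config=types.CreateCachedContentConfig(
        contents=[uploaded_pdf],
        ttl="3600s",   # keep the cache for an hour
    ),
)

# Pass the cache's resource name ("cachedContents/...") to answer_question
answer_question("How much did Obama claim in business expenses?", cache.name)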

Limitations of Basic RAG

Two big ones drive the rest of the chapter. First, BM25 needs an exact match, so asking "What should I do if the diaphragm is broken?" returns a different (worse) answer than "…ruptured?" because keyword overlap is lower. Second, chunk size is a tradeoff: 100-character chunks may not include the follow-on step, and bigger chunks cost more and slow generation.

A RAG with only embedding-based retrieval and no BM25 is often inadequate. Keyword matching is still essential for product codes, SKUs, and exact strings.


Pattern 7 — Semantic Indexing

Semantic Indexing uses meaning-based embeddings rather than keywords, which fixes BM25's limitations on natural language, images, video, and tables.

Problem

Keyword indexing fails in a handful of recurring ways. Synonyms and pronouns: "AI" in the query won't match "Artificial Intelligence" in the chunk, and "The President" won't match the president's name. Overall meaning: "How do AI systems handle medical terminology ambiguity?" won't match a chunk about misinterpreting "CHF" because that chunk doesn't contain the word "ambiguity". Cross-language: a Spanish query won't match an English chunk. Multimodal documents: images and video are entirely opaque to BM25. Layout context: tables and figure captions live next to images, not in the keyword soup. And exact matches give false positives, since "CHF" could mean congestive heart failure, critical heat flux, or Swiss Francs.

Solution

Use an ML model to encode chunks into fixed-dimensional vectors that capture meaning.

The chunks are stored in a vector store, indexed by their embedding.

Embeddings

Embeddings place semantically similar content close together in vector space. With:

chunks = [
    "I really enjoyed the film we watched last night",
    "The movie was excellent",
    "I didn't like the documentary",
    "The cinematic experience was remarkable"
]

from sklearn.feature_extraction.text import CountVectorizer

# Keyword-based (bag-of-words counts)
vectorizer = CountVectorizer()
keyword_vectors = vectorizer.fit_transform(chunks)

Keyword similarity is low across these chunks (movie ≠ film, excellent ≠ great).

Embedding similarity is much higher because all four texts discuss films:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')
embedding_vectors = model.encode(chunks)
query = "The film was great"
query_embedding = model.encode([query])[0]
similarities = cosine_similarity([query_embedding], embedding_vectors)

Higher-dimensional embeddings hold more information but slow similarity search. Brute-force search cost grows with the number of embeddings times their dimensionality, and all-pairs comparisons are quadratic in the number of embeddings. Optimal design uses the minimum dimensionality that works, dimensionality reduction, and approximate nearest neighbor search where possible.

You can't embed entire books in one vector. Nuance gets lost, retrieval suffers, and costs explode. Use chunking.

Semantic Chunking

| Strategy | How | Pros / Cons |
| --- | --- | --- |
| Length-based with overlap | Fixed-length chunks, overlapping | Simplest; can still split context |
| Sentence-based | Group sentences until a size threshold | Preserves semantic blocks; struggles with topic transitions |
| Paragraph-based | Use paragraph breaks | Good for structured docs |
| Document-structure | Markdown/markup headings | Works when structure is rich |
| Semantic-shift | Topic modeling (LDA, NMF) or embedding-based shift detection | Coherent topics per chunk |

LangChain's TextSplitter is fine for prototyping but not recommended for production.
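As an illustration of the paragraph-based strategy in the table above, here is a minimal splitter that assumes paragraphs are separated by blank lines (a sketch, not a production chunker):

def paragraph_chunks(text, max_chars=1000):
    # Greedily pack whole paragraphs into chunks up to max_chars
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks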

Handling Other Modalities

Images

Two paths. Use OCR or a vision LLM (like Llama-3.2-9B) to describe the image, replace the image with the description, and chunk the text. Or embed the image directly with a multimodal model that shares vector space with text.
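A minimal sketch of the second path, using a CLIP model from sentence-transformers that embeds images and text into a shared vector space (the model name and file path are illustrative):

from PIL import Image
from sentence_transformers import SentenceTransformer

clip = SentenceTransformer("clip-ViT-B-32")

# Image and text land in the same space, so a text query can retrieve images
image_embedding = clip.encode(Image.open("pump_diagram.png"))
text_embedding = clip.encode("exploded view of the pump safety head")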

Video

Transcribe the audio (text becomes chunks) and extract keyframes at intervals or scene changes (image embeddings).

Tables

Handle missing values and ensure consistent formatting first. Then choose a chunking strategy:

| Strategy | Best for |
| --- | --- |
| Table-based (whole table as one embedding) | Small tables, coarse retrieval |
| Sliding-window over the table (with headers attached to each chunk) | Large tables |
| Row-based | Semantically disconnected rows (e.g., bank transactions) |
| Column-based | Long temporal series; columns share semantics |

Always preserve the table's name and column-header descriptions in the surrounding text chunk.
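A minimal sketch of row-based chunking with pandas, repeating the table name and column headers in every chunk so that context survives (the column names are illustrative):

import pandas as pd

transactions = pd.DataFrame({
    "date": ["2025-01-03", "2025-01-05"],
    "merchant": ["Grocery Mart", "City Transit"],
    "amount": [82.17, 2.75],
})

# One text chunk per row, each carrying the table name and headers
row_chunks = [
    "Table: bank transactions. " +
    ", ".join(f"{col}: {row[col]}" for col in transactions.columns)
    for _, row in transactions.iterrows()
]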

Handling Industry Jargon

A query for "heart attack" won't match docs that say acute myocardial infarction or cardiac infarction. A query for "discovery" in legal docs should also match disclosure and deposition.

Synonym expansion in the query:

What was the timeline for discovery in federal court?
→ What was the timeline for discovery|disclosure|deposition in federal court?

You can also expand the original documents, which is more thorough but inflates the index. There are three ways to build a synonym dictionary: a manual jargon glossary (straightforward, labor-intensive), statistical co-occurrence analysis (automated, noisier), and LLMs (watch out for hallucinations).
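A minimal sketch of query-time expansion with a hand-built glossary (the dictionary entries are illustrative):

import re

LEGAL_SYNONYMS = {
    "discovery": ["disclosure", "deposition"],
}

def expand_query(query, synonyms=LEGAL_SYNONYMS):
    # Replace each known term with an OR-group of its synonyms
    for term, alternates in synonyms.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        replacement = "|".join([term] + alternates)
        query = re.sub(pattern, replacement, query, flags=re.IGNORECASE)
    return query

expand_query("What was the timeline for discovery in federal court?")
# -> "What was the timeline for discovery|disclosure|deposition in federal court?"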

Track directionality. Exchange-traded fund IS-A index fund, but not the other way around. Don't auto-expand index fund → ETF or you'll over-retrieve.

Contextual Retrieval

Small chunks lose surrounding context. Contextual retrieval prepends a chunk-specific summary before embedding.

Original chunk:

The company's losses decreased by 10% YoY

Contextualized:

This chunk is from Walmart's financial report, released in Q4/2025. The previous quarter's earnings increased by 2%. The company's losses decreased by 10% year on year.

To generate the context, send Anthropic's recommended prompt to the LLM:

<document>
{{WHOLE_DOCUMENT}}
</document>

Here is the chunk we want to situate within the whole document
<chunk>
{{CHUNK_CONTENT}}
</chunk>

Please give a short succinct context to situate this chunk within the overall
document for the purposes of improving search retrieval of the chunk. Answer
only with the succinct context and nothing else.

Anthropic reports this reduces incorrect retrieval rates by 67% consistently across domains. Cache the document portion (Prompt Caching, Pattern 25) so the repeated processing is cheap.
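A minimal sketch of the indexing-time loop, assuming the same LlamaIndex llm and ChatMessage used in the Basic RAG generation step (the template is the one above; the helper name is illustrative):

CONTEXT_PROMPT = """<document>
{whole_document}
</document>

Here is the chunk we want to situate within the whole document
<chunk>
{chunk_content}
</chunk>

Please give a short succinct context to situate this chunk within the overall
document for the purposes of improving search retrieval of the chunk. Answer
only with the succinct context and nothing else."""

def contextualize(whole_document, chunk_content):
    # Prepend the generated context to the chunk before embedding it
    messages = [ChatMessage(role="user",
                            content=CONTEXT_PROMPT.format(
                                whole_document=whole_document,
                                chunk_content=chunk_content))]
    context = llm.chat(messages).message.content
    return f"{context} {chunk_content}"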

Hierarchical Chunking (RAPTOR)

Retrieving short chunks loses holistic context. Big chunks lose nuance during embedding. Hierarchical chunking builds a tree to get both.

The process is chunk, embed, cluster, summarize, embed those, cluster, summarize, and so on until one root summary remains. Each chunk and summary is embedded.

Inference walks the tree top-down: match against first-level nodes, descend into the matched node's children, and so on. Retrieval returns concepts at varying granularity, from high-level summaries down to specific chunks.

This is RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval), a simplified GraphRAG (which appears in Pattern 9). Docugami uses hierarchical chunking in knowledge-graph RAG.
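A minimal sketch of that build loop under simplifying assumptions: k-means for the clustering step and a summarize helper that calls an LLM (both are illustrative, not the book's implementation):

import numpy as np
from sklearn.cluster import KMeans

def build_tree(chunks, embed, summarize, branching=5):
    # Each level: embed, cluster, summarize each cluster, then recurse on the summaries
    levels = [chunks]
    while len(levels[-1]) > 1:
        texts = levels[-1]
        vectors = np.array([embed(t) for t in texts])
        n_clusters = max(1, len(texts) // branching)
        labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(vectors)
        summaries = [
            summarize([t for t, l in zip(texts, labels) if l == c])
            for c in range(n_clusters)
        ]
        levels.append(summaries)
    return levels  # every level's texts (and their embeddings) get indexed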

Example: Indexing a Product Catalog

Embedding a structured catalog row by combining text fields, normalized numerics, and booleans:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sentence_transformers import SentenceTransformer

def encode_text_embeddings(model, text_data):
    # One embedding per catalog row, built from its text fields
    embeddings = []
    for _, row in text_data.iterrows():
        text = (f"Product: {row['name']}. "
                f"Description: {row['description']}. "
                f"Category: {row['category']}")
        embeddings.append(model.encode(text))
    return embeddings

def encode_numeric_data(numeric_data):
    # Normalize numeric columns so they are comparable in scale
    scaler = StandardScaler()
    normalized_numeric = scaler.fit_transform(numeric_data)
    return normalized_numeric

def create_hybrid_embeddings(text_embeddings, numeric_data, boolean_data):
    # Concatenate text embedding + normalized numerics + booleans per product
    hybrid_embeddings = []
    for i in range(len(text_embeddings)):
        combined = np.concatenate([text_embeddings[i],
                                  numeric_data[i],
                                  boolean_data[i]])
        hybrid_embeddings.append(combined)
    return np.array(hybrid_embeddings)

Save the scaler. You need the same transformation for new data at query time.

Index in ChromaDB:

import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) to persist

collection = client.get_or_create_collection(
    name="product-catalog",
    metadata={"hnsw:space": "cosine"}
)
collection.add(ids=ids, embeddings=create_hybrid_embeddings(...))

At query time you don't know the user's preferred numeric values, so use the median vector across products as a placeholder:

text_embedding = model.encode([query])[0]
median_numeric = calculate_median_values(df)
boolean_embedding = np.array([1])  # in-stock products

query_embedding = np.concatenate([text_embedding, median_numeric, boolean_embedding])
results = collection.query(query_embeddings=[query_embedding.tolist()], n_results=1)

For "Top Notebook for Gaming and Work", the closest result is the Gaming Laptop Pro.

Considerations

Domain-specific embedding models (medical, legal, financial) sharply improve retrieval precision and latency. They cluster jargon (so MI and myocardial infarction land in the same neighborhood) and can be smaller than general-purpose models.

The pattern still has limitations to plan around. The fixed-dimensional bottleneck means complex chunks get compressed into the same vector size as simple ones. Chunking can separate related concepts. Vector DB scalability becomes a problem at millions of vectors, where ANN methods (ScaNN, Faiss) trade accuracy for speed. Embeddings don't natively distinguish recent from outdated information. They find related content but don't reason or deduce. Multimodal embeddings can misalign, so "apple computers" might surface fruit images. And tabular fields get under-weighted, since concatenating two normalized numbers onto a 512-dim text embedding leaves them overshadowed unless you carefully design the joint structure.

In spite of those, Semantic Indexing is the core of most production RAG.

References: Bengio et al. (2000) introduced embeddings; Olah (2014) has a famous visual explanation; Schwaber-Cohen and Patel (2025) of Pinecone describe chunking's role.


Pattern 8 — Indexing at Scale

Indexing at Scale is the operational discipline you need when a RAG goes from POC to production. It addresses the problems that emerge over time as the knowledge base grows.

Problem

Disambiguation

"Fluid" means liquid in everyday English but liquids and gases in physics. To know whether a chunk about fluids matches a question on oxygen, you need the chunk's context: audience, domain.

Data freshness

Information goes stale. The CDC's COVID isolation guidelines went from 10 days (early 2020) to 5 days (Dec 2021) to dropped entirely (Feb 2024). Each update adds conditions, so naive deletion loses context.

Contradictory information

Hypertension is the cleanest example. Pre-2017 the bar was 140/90 mm Hg, AHA/ACC dropped it to 130/80 in November 2017, and AAFP contradicted that in 2022 by recommending 140/90 for adverse-effect reasons. As the index grows, similarity-only retrieval starts surfacing inconsistent answers, sometimes on the same query in the same session.

Model lifecycle

Proprietary embedding APIs deprecate (typically 12+ months notice). When the API disappears you have to reindex everything because embeddings between model versions are incompatible.

Solution: Metadata

Metadata gives the system context for disambiguation, freshness filtering, and contradiction resolution.

Types of metadata to attach

There are four useful tiers. Document-level metadata covers source URL or ID, creation and modification timestamps, author, topic tags, reading level, and length. Chunk-level metadata covers position in the document (chapter / section / paragraph), entities mentioned (people, orgs, places), semantic role (definition, example, conclusion), and language or locale. Domain or enterprise-specific metadata covers things like API versions and language for code, methodology and findings for papers, SKU and price for products, jurisdiction and statutes for legal, or DMA requirements. And AuthN/AuthZ/confidentiality metadata covers roles, authentication methods, consent requirements, and encryption/redaction needs.

Some vector DBs only support binary filters, others support continuous values (less performant). Consider storing metadata separately if it hurts vector DB performance.

Detecting contradictory content

Metadata enables conflict resolution. Timestamping flags different timestamps with conflicting content as a potential contradiction (or evolution). Source provenance distinguishes authority levels. Subject categorization groups related info to surface conflicts within a domain. Version tracking identifies outdated facts.

Two chunks for "treatment for X":

Chunk 1: "Medication A is recommended..."
  Source: National Health Guidelines | 2023-03 | Journal of Medical Practice Vol 45

Chunk 2: "Medication A is no longer recommended... use B instead."
  Source: Medical Research Institute | 2025-01 | Recent Clinical Findings Vol 12

Prefer the more recent or more reputable source.

Detecting outdated content

Three handling options:

| Option | What | Trade-off |
| --- | --- | --- |
| Retrieval filtering | Only return chunks newer than a threshold | Index stays large; retrievals slower |
| Document store pruning | Drop chunks older than N days | Faster retrieval but historical info lost |
| Result reranking | Boost recent / authoritative chunks after retrieval | Most flexible; needs ranking logic (sketched below) |
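A minimal sketch of the reranking option, re-scoring retrieved results with a recency decay (the result structure, half-life, and the assumption that higher scores are better are all illustrative):

from datetime import datetime

def rerank_by_recency(results, half_life_days=365):
    # Decay each similarity score by the age of its chunk
    def adjusted(result):
        age_days = (datetime.now() - datetime.strptime(
            result["metadata"]["created_at"], "%Y-%m-%d")).days
        return result["score"] * 0.5 ** (age_days / half_life_days)
    return sorted(results, key=adjusted, reverse=True)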

Managing model lifecycle

Before committing to an embedding model, prefer ones with long support cycles, or open-weight models you can host yourself. Use the MTEB (Massive Text Embedding Benchmark) leaderboard to choose. As of writing: Gemini #1, Qwen2 (Alibaba, open) #3 (only ~10% behind), OpenAI text-embedding-3-large #13. Open-weight models give full lifecycle control with little performance cost.

You'll still want to switch models when a new one holds the same info content at 25% of the original dimensionality, or when a newer cutoff date is needed for recent events and new vocabulary.

The cost of switching can be ugly. A US patent RAG ingests roughly 1,000 patents per day, which is 350K patents per year, with millions of pages and many millions of chunks. Re-embedding all of that is not cheap.

Example: Metadata Filtering

Annotate chunks at index time:

documents = [
    {'id': 1, 'text': '...', 'source': 'New York Times', 'created_at': '2025-01-01'},
    ...
]

# Attach source and date metadata to each chunk at index time
metadata = []
for doc in documents:
    metadata.append({'source': doc['source'],
                     'created_at': doc['created_at']})

collection.add(ids=ids, embeddings=vectors, metadatas=metadata)

Filter at query time (ChromaDB syntax):

# filters is a dict of metadata constraints, e.g. {"created_at": "2025-01-01"}
where_conditions = []
for key, value in filters.items():
    where_conditions.append({key: value})

if len(where_conditions) > 1:
    where = {"$and": where_conditions}
elif len(where_conditions) == 1:
    where = where_conditions[0]
else:
    where = None  # no filtering

results = collection.query(
    query_embeddings=[query_embedding.tolist()],
    where=where
)

Without a filter, "Top Gaming Laptop" returns both 2024 and 2025 versions. Adding filters={"created_at": "2025-01-01"} excludes the 2024 result.

Considerations

Metadata only helps to the extent it's complete and consistent. Binary-only filters in some vector DBs limit nuance. Date alone isn't sufficient for "relevance", since a 2020 analysis can still be the right answer and recent technical docs can already be obsolete. And the right metadata is domain-specific: academic_institution matters for papers, not for customer service.

If metadata can't carry the load on its own, the alternatives are domain-specific indexes (split by domain and route queries, faster than filtering one giant index), incremental indexing (only modify changed or new docs), and maintaining semantic relationships across versions (keep older versions rather than deleting them, useful for tracking evolution).

References: Chen, Zhang, and Choi (2022) on calibrating retrieval-conflict diagnosis; Wang et al. (2025) for ambiguity / misinformation / noise datasets.


Summary

Chapter 3 lays down the bass line for RAG: how to ground a model in retrieved content (Basic RAG), how to retrieve by meaning rather than keyword (Semantic Indexing), and how to keep the system honest as the index grows (Indexing at Scale). The core warnings are that you shouldn't replace BM25 entirely with embeddings (keyword matching is still vital for product codes and exact strings), embedding models change and you need to design for replaceability, and metadata directionality matters (ETF implies index fund, not the other way). Chapter 4 layers on advanced retrieval (hypothetical answers, query expansion, GraphRAG), node postprocessing (reranking, contextual compression), trustworthy generation (handling errors gracefully), and deep search (iterative multi-hop information retrieval).