Chapter 8: Semantic Search and Retrieval-Augmented Generation
Introduction
Search was the first place LLMs hit broad industry adoption. Google rolled BERT into Search in 2019, calling it "one of the biggest leaps forward in the history of Search," and Bing announced similar transformer-based improvements weeks later. The capability behind those changes is semantic search: matching by meaning, not keywords. On a separate track, generative LLMs hallucinate confidently. Retrieval-augmented generation (RAG) combines search and generation: retrieve relevant documents first, then ground the LLM's answer in them. This chapter covers three model families behind these systems (dense retrieval, rerankers, and RAG) plus chunking, evaluation metrics (MAP, nDCG), and advanced RAG patterns (multi-query, multi-hop, agentic).
Section 1: Three Categories of Search-with-LLMs
1.1 Dense Retrieval
Embed both query and documents into the same vector space, then return nearest neighbors.
1.2 Reranking
Take a shortlist from an existing retriever and reorder by relevance using a more expensive model.
1.3 Retrieval-Augmented Generation (RAG)
Retrieve, then generate an answer grounded in the retrieved sources, ideally with citations.
Section 2: Dense Retrieval
The intuition is that similar texts have nearby embeddings.
A query embedding lands near its relevant documents in that same space.
Two questions are worth flagging. Should distant results be excluded? That's a designer's call, often a max-distance threshold. Are queries and answers always semantically close? Not naturally. Embedding models need fine-tuning on (query, answer) pairs (see §2.4).
The standard pipeline: chunk the documents, embed each chunk, store in a vector index.
2.1 Worked Example — Cohere on the Interstellar Wikipedia Article
import cohere, numpy as np, pandas as pd
co = cohere.Client(api_key='...')
text = """Interstellar is a 2014 epic science fiction film... [full Wikipedia intro]"""
# Split on periods into sentence-level chunks, dropping empties
texts = [t.strip(' \n') for t in text.split('.') if t.strip()]
# Embed each chunk as a search document
response = co.embed(texts=texts, input_type="search_document").embeddings
embeds = np.array(response)
embeds.shape # (15, 4096)
Build a FAISS index:
import faiss
index = faiss.IndexFlatL2(embeds.shape[1])  # flat index = exact (exhaustive) L2 search
index.add(np.float32(embeds))
Search function:
def search(query, number_of_results=3):
    # Queries use a different input_type than documents
    query_embed = co.embed(texts=[query], input_type="search_query").embeddings[0]
    distances, ids = index.search(np.float32([query_embed]), number_of_results)
    return pd.DataFrame({
        'texts': np.array(texts)[ids[0]],
        'distance': distances[0],
    })
search("how precise was the science")
| | texts | distance |
|---|---|---|
| 0 | "praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics" | 10757.4 |
| 1 | "Caltech theoretical physicist and 2017 Nobel laureate Kip Thorne was an executive producer..." | 11566.1 |
| 2 | "uses extensive practical and miniature effects..." | 11922.8 |
The top result perfectly answers the query, and it doesn't share any keywords with it. This is what makes semantic search distinct.
2.2 Comparison with Keyword Search (BM25)
from rank_bm25 import BM25Okapi
# Tokenize the sentences and build a BM25 index (see the sketch below), then run the same query:
keyword_search("how precise was the science")
1.789 Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan
1.373 Caltech theoretical physicist Kip Thorne was an executive producer...
0.000 It stars Matthew McConaughey, Anne Hathaway, ...
BM25 ranks the result that matches the word "science" highest, but that result doesn't actually answer the question.
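The snippet above assumes a tokenizer, a BM25 index, and a keyword_search helper that weren't shown. A minimal sketch over the same sentence list (the tokenizer is one simple choice, not the only one):
import re
from rank_bm25 import BM25Okapi

def bm25_tokenizer(sentence):
    # lowercase, strip punctuation, split on whitespace
    return re.sub(r"[^\w\s]", " ", sentence.lower()).split()

bm25 = BM25Okapi([bm25_tokenizer(t) for t in texts])

def keyword_search(query, top_k=3):
    scores = bm25.get_scores(bm25_tokenizer(query))
    for i in np.argsort(scores)[::-1][:top_k]:
        print(round(float(scores[i]), 3), texts[i])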
2.3 Caveats of Dense Retrieval
A few things to watch for. Out-of-corpus queries always return something, so use distance thresholds and click-tracking. For exact-phrase matches, keyword search wins, and hybrid search (dense + BM25) is often the right answer. Domain shift matters: a model trained on web or Wikipedia text won't perform well on legal corpora without fine-tuning. Chunking strategy matters when answers span multiple sentences.
2.4 Chunking Long Texts
Transformers have a context limit. Two patterns dominate:
One vector per document is quick to implement but compresses away most of the content. You either embed only the title or intro (loses information) or average chunk embeddings (loses specificity).
Multiple vectors per document is the better default. Chunk the text, embed each piece, and index the chunks for richer retrieval.
A handful of chunking strategies show up in practice. One sentence per chunk is too granular and gives weak context. One paragraph per chunk works well if paragraphs are short, otherwise group around 3 to 8 sentences. Add the document title to each chunk for topical context. Use overlapping chunks: duplicate a few sentences across adjacent chunks so context isn't sliced cleanly.
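A minimal sketch of the overlapping-chunks strategy, applied to the sentence list from §2.1 (window size, overlap, and the title string are illustrative choices):
def chunk_sentences(sentences, title, chunk_size=4, overlap=1):
    # Slide a window of chunk_size sentences, repeating `overlap` sentences between neighbors
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(sentences), step):
        window = sentences[start:start + chunk_size]
        if window:
            chunks.append(title + ": " + ". ".join(window))
        if start + chunk_size >= len(sentences):
            break
    return chunks

chunks = chunk_sentences(texts, title="Interstellar (film)")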
2.5 Nearest Neighbor Search vs. Vector Databases
For small corpora (thousands to tens of thousands of items), a plain NumPy distance calculation is fine. For millions of items or more, use approximate nearest neighbor libraries like Annoy or FAISS, which deliver millisecond retrieval and scale to GPUs and clusters. Vector databases like Weaviate, Pinecone, and Chroma add CRUD on the index without rebuilds, plus filtering and metadata.
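For the small-corpus case, brute-force search needs no index at all. A sketch using cosine similarity over the embeds matrix from §2.1 (the helper name is ours):
def numpy_search(query_embed, embeds, top_k=3):
    # cosine similarity between the query and every stored embedding
    sims = embeds @ query_embed / (
        np.linalg.norm(embeds, axis=1) * np.linalg.norm(query_embed))
    return np.argsort(sims)[::-1][:top_k]   # indices of the closest chunks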
2.6 Fine-tuning Embedding Models
Default embedding models cluster topically but don't necessarily put a query close to its answer. Fine-tuning teaches them to.
For the sentence "Interstellar premiered on October 26, 2014, in Los Angeles":
- Positive query 1: "Interstellar release date"
- Positive query 2: "When did Interstellar premiere"
- Negative query: "Interstellar cast"
Fine-tuning pulls positive pairs together and pushes negative pairs apart.
(See Chapter 10 for the algorithmic details of contrastive learning.)
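As a sketch of what that training loop can look like, one common recipe uses the sentence-transformers library with an in-batch contrastive loss; the base model, pairs, and hyperparameters below are illustrative, not a prescription:
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("thenlper/gte-small")   # any embedding model can be the starting point

# (query, answer) positives; MultipleNegativesRankingLoss treats the other answers
# in the batch as negatives, and a third text per example can supply a hard negative
train_examples = [
    InputExample(texts=["Interstellar release date",
                        "Interstellar premiered on October 26, 2014, in Los Angeles"]),
    InputExample(texts=["When did Interstellar premiere",
                        "Interstellar premiered on October 26, 2014, in Los Angeles"]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)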
Section 3: Reranking
Most organizations already have search systems. The cheapest LLM upgrade is to drop a reranker at the end.
3.1 Cohere Rerank Example
results = co.rerank(query="how precise was the science",
                    documents=texts, top_n=3, return_documents=True)
for idx, r in enumerate(results.results):
    print(idx, r.relevance_score, r.document.text)
0 0.170 "praise from many astronomers for its scientific accuracy..."
1 0.070 "worldwide gross over $677 million..."
2 0.004 "Kip Thorne was an executive producer..."
Notice the big gap between top-1 and the rest. The reranker is highly confident.
3.2 First-Stage Shortlist + Reranker (Hybrid Pipeline)
In practice, you wouldn't pass all documents to the reranker (too expensive). The first stage shortlists 10 to 1000 candidates with cheap retrieval (BM25 or dense), and the reranker reorders.
def keyword_and_reranking_search(query, top_k=3, num_candidates=10):
    # BM25 → top num_candidates
    bm25_scores = bm25.get_scores(bm25_tokenizer(query))
    top_n = np.argpartition(bm25_scores, -num_candidates)[-num_candidates:]
    bm25_hits = sorted(top_n, key=lambda i: -bm25_scores[i])
    # Rerank the shortlist → top_k
    docs = [texts[i] for i in bm25_hits]
    return co.rerank(query=query, documents=docs, top_n=top_k, return_documents=True)
On the multilingual MIRACL benchmark, adding a reranker boosts nDCG@10 from 36.5 to 62.8.
3.3 How Rerankers Work — Cross-Encoder
Unlike dense retrieval, which embeds query and document independently, a cross-encoder feeds the model both at once and outputs a relevance score.
This is more accurate (the model can attend across both texts) but slower, which is why rerankers run on a shortlist, not the full corpus. The classic paper here is "Multi-stage document ranking with BERT" (monoBERT). It's essentially classification: input is query plus document, output is a relevance score in [0, 1].
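A sketch of the same idea with an open cross-encoder from the sentence-transformers library (the checkpoint is one public example, not the model behind Cohere's reranker):
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "how precise was the science"
scores = cross_encoder.predict([(query, doc) for doc in texts])   # one relevance score per pair
top_3 = sorted(zip(scores, texts), reverse=True)[:3]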
Section 4: Retrieval Evaluation Metrics
Evaluating retrieval needs three things: a text archive, a set of queries, and relevance judgments for each query (which docs are relevant?).
4.1 Mean Average Precision (MAP)
Imagine comparing two systems on the same query, each returning three results that contain exactly one relevant document. Counting "how many relevant results in the top 3" scores them identically, which is too coarse. If System 1 puts the relevant doc at position 1 and System 2 puts it at position 3, System 1 is clearly better, so we need a metric that rewards position.
Precision @ k = (relevant results in top k) / k.
Average precision is the average of "precision at position p" computed at each position where a relevant document appears.
For a query with one relevant document at position 1: precision@1 = 1.0, so AP = 1.0.
If that single relevant doc lands at position 3 instead, AP is penalized: precision@3 = 1/3, so AP = 0.33.
With multiple relevant docs, precision is taken at each position holding a relevant result and then averaged: relevant docs at positions 1 and 3 give AP = (1.0 + 2/3) / 2 ≈ 0.83.
Mean Average Precision (MAP) is the mean of AP across all queries in the test suite.
Why "mean" and "average"? "MAP" sounds nicer than "average average precision." That's it.
4.2 nDCG
Normalized Discounted Cumulative Gain generalizes MAP for graded relevance (not just binary): one document can be more relevant than another. Position is also discounted, with top results counting more.
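A sketch of the computation, assuming graded relevance labels for the returned results in rank order:
import math

def dcg(relevances):
    # each position's gain is discounted logarithmically, so top results count more
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))   # best possible ordering of the same labels
    return dcg(relevances) / ideal if ideal > 0 else 0.0

ndcg([3, 2, 0, 1])   # graded relevance of the four returned results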
Section 5: Retrieval-Augmented Generation (RAG)
A RAG pipeline retrieves relevant chunks and then prompts an LLM to answer using only those chunks.
This is grounded generation: the retrieved context grounds the LLM in the right domain, reducing hallucinations and enabling "chat with my data."
5.1 Cohere RAG with Citations
results = search("income generated")
docs_dict = [{'text': t} for t in results['texts']]
response = co.chat(message="income generated", documents=docs_dict)
print(response.text)
# "The film generated a worldwide gross of over $677 million,
# or $773 million with subsequent re-releases."
response.citations includes spans like (start=21, end=36, text='worldwide gross', document_ids=['doc_0']). The model tells you which source backs each claim.
5.2 Local RAG with Phi-3 + LangChain + FAISS
from langchain import LlamaCpp, PromptTemplate
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
llm = LlamaCpp(model_path="Phi-3-mini-4k-instruct-fp16.gguf",
               n_gpu_layers=-1, max_tokens=500, n_ctx=2048, seed=42)
embedding_model = HuggingFaceEmbeddings(model_name='thenlper/gte-small')
db = FAISS.from_texts(texts, embedding_model)
template = """<|user|>
Relevant information:
{context}
Provide a concise answer to the following question using the relevant information provided above:
{question}<|end|>
<|assistant|>"""
prompt = PromptTemplate(template=template, input_variables=["context", "question"])
rag = RetrievalQA.from_chain_type(
    llm=llm, chain_type='stuff',
    retriever=db.as_retriever(),
    chain_type_kwargs={"prompt": prompt},
)
rag.invoke('Income generated')
chain_type='stuff' stuffs all retrieved chunks into the same prompt. Other chain types (map_reduce, refine, map_rerank) handle many chunks beyond the context window.
Section 6: Advanced RAG Techniques
6.1 Query Rewriting
In a chat setting, the user message is rarely a clean search query:
"We have an essay due tomorrow. We have to write about some animal. I love penguins. I could write about them. But I could also write about dolphins. Are they animals? Maybe. Let's do dolphins. Where do they live for example?"
An LLM rewrites this as "Where do dolphins live". Clean, focused, much better retrieval.
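A sketch of how the rewrite step can be wired up, reusing the LlamaCpp llm from Section 5.2 (the prompt wording is illustrative, and .invoke assumes a reasonably recent LangChain):
rewrite_template = """<|user|>
Rewrite the conversation below as one short, standalone search query.
Conversation:
{message}<|end|>
<|assistant|>"""

user_message = "We have an essay due tomorrow. ... Where do they live for example?"   # full message above
search_query = llm.invoke(rewrite_template.format(message=user_message))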
6.2 Multi-Query RAG
Some questions need several searches:
"Compare the financial results of Nvidia in 2020 vs. 2023"
The query rewriter splits this into two parallel queries: "Nvidia 2020 financial results" and "Nvidia 2023 financial results". Pass results from both into a single grounded-generation prompt.
A bonus pattern: let the rewriter decide that no search is needed when the model can answer confidently from priors.
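Structurally, multi-query RAG is just several retrieval calls feeding one generation call. A sketch reusing search() and co.chat from earlier sections, and assuming the index holds financial filings rather than the Interstellar article:
sub_queries = ["Nvidia 2020 financial results", "Nvidia 2023 financial results"]
docs = []
for q in sub_queries:
    docs += [{'text': t} for t in search(q)['texts']]   # parallel searches, pooled results
response = co.chat(message="Compare the financial results of Nvidia in 2020 vs. 2023",
                   documents=docs)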
6.3 Multi-Hop RAG
Sequential queries where step 2 depends on step 1's results:
"Who are the largest car manufacturers in 2023? Do they each make EVs or not?"
Step 1: "largest car manufacturers 2023" → Toyota, VW, Hyundai
Step 2.1: "Toyota Motor Corporation electric vehicles"
Step 2.2: "Volkswagen AG electric vehicles"
Step 2.3: "Hyundai Motor Company electric vehicles"
6.4 Query Routing
Route to different data sources based on the question. HR questions go to Notion. Customer questions go to Salesforce. Engineering questions go to Confluence.
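A toy router can simply ask the model to pick a source before retrieving. Everything below (the source names, their descriptions, the fallback) is a placeholder, and llm is the model from Section 5.2:
sources = {
    "notion": "HR policies and internal documents",
    "salesforce": "customer accounts and support cases",
    "confluence": "engineering designs and runbooks",
}

def route(question):
    prompt = ("Pick the single best data source for this question. "
              f"Options: {', '.join(sources)}.\n"
              f"Question: {question}\nAnswer with one option only:")
    choice = llm.invoke(prompt).strip().lower()
    return choice if choice in sources else "confluence"   # arbitrary fallback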
6.5 Agentic RAG
Combining all the above gives the LLM agent-like behavior: choosing tools, sequencing actions, and deciding when to retrieve. Data sources become tools (read and write, e.g., posting to Notion). This requires capable models: Cohere Command R+ is a good open-weights option; otherwise reach for the largest managed models.
Section 7: Evaluating RAG Systems
Beyond retrieval metrics, RAG generation quality is evaluated along multiple axes ("Evaluating verifiability in generative search engines", 2023):
| Axis | Definition |
|---|---|
| Fluency | Is the text readable and cohesive? |
| Perceived utility | Is the answer helpful and informative? |
| Citation recall | Are all factual claims supported by the cited sources? |
| Citation precision | Do the cited sources actually back their associated statements? |
Human evaluation is the gold standard. LLM-as-a-judge automates this via libraries like Ragas, which adds two more axes. Faithfulness asks whether the answer matches the retrieved context. Answer relevance asks whether the answer addresses the question.
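A sketch of what a Ragas evaluation call looks like; the interface below follows an earlier 0.1-style release and the library's API has shifted across versions, so treat the exact imports and column names as assumptions:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = Dataset.from_dict({
    "question": ["How much income did Interstellar generate?"],
    "answer":   ["The film grossed over $677 million worldwide."],
    "contexts": [["...worldwide gross over $677 million..."]],
})
scores = evaluate(data, metrics=[faithfulness, answer_relevancy])   # an LLM judge scores each axis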
Summary
- LLMs upgrade search via three families: dense retrieval (semantic similarity), rerankers (cross-encoders that rescore shortlists), and RAG (search plus grounded generation).
- Dense retrieval needs chunking (sentence or paragraph windows, often overlapping), an index (NumPy → FAISS or Annoy → vector DB at scale), and ideally fine-tuning (Chapter 10) to align query and document embeddings.
- Hybrid search (dense + BM25) handles both semantic queries and exact-phrase queries.
- Rerankers (cross-encoders) sit at the end of a pipeline and dramatically boost relevance, with MIRACL nDCG@10 going from 36.5 to 62.8 in the chapter's example.
- Mean Average Precision (MAP) rewards systems for placing relevant docs at high positions. nDCG generalizes to graded relevance.
- RAG = retrieval + grounded generation. Cohere's co.chat exposes span-level citations. LangChain's RetrievalQA builds the same locally with Phi-3 + FAISS.
- Advanced patterns: query rewriting, multi-query (parallel searches), multi-hop (sequential dependent queries), routing (per-source dispatch), agentic (tool use over data sources).
- Evaluating RAG goes beyond retrieval metrics. Combine fluency, utility, citation recall and precision, and faithfulness with human review and LLM-as-a-judge tooling like Ragas.