Advanced Text Generation Techniques and Tools

Notes on Hands-On Large Language Models by Jay Alammar & Maarten Grootendorst

Chapter 7: Advanced Text Generation Techniques and Tools

Introduction

Prompt engineering only takes you so far. To build systems on top of LLMs you need ways to load and orchestrate models, give them memory across turns, chain multiple prompts together, and let them reach out to external tools. This chapter introduces these patterns through LangChain. We cover Model I/O (loading quantized GGUF models), Chains (linking prompts and modules), Memory (buffer, windowed buffer, summary), and Agents with the ReAct framework. The same ideas show up in newer frameworks like DSPy and Haystack.


Section 1: Model I/O — Loading Quantized Models

1.1 Quantization

A neural net's weights are floating-point numbers. Quantization reduces the bits used to store each weight (e.g., 16-bit to 8-bit). The model gets smaller and faster at the cost of a slight accuracy hit.

The analogy is saying "14:16" instead of "14:16 and 12 seconds". Same useful info, less precision. Chapter 12 covers the algorithm. For now we just use a quantized model.

Rule of thumb: stick to ≥4-bit quantization. Below that (3-bit, 2-bit) the quality drop is noticeable — better to pick a smaller model at higher precision.
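To make the idea concrete, here is a toy absmax quantizer in plain Python. This is not the scheme llama.cpp actually uses (Chapter 12 covers real algorithms); the function names and sample weights are illustrative.

```python
def quantize_absmax(weights, bits=8):
    # Map floats onto signed integers in [-qmax, qmax], scaled by the
    # largest absolute weight.
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

weights = [0.12, -0.53, 0.91, -0.07]
quantized, scale = quantize_absmax(weights)
restored = dequantize(quantized, scale)
# Each restored value is close to the original but not identical:
# that small gap is the "12 seconds" the quantized clock drops.
```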

We use the FP16 (16-bit) variant of Phi-3 in GGUF format, which is the format llama-cpp-python expects.

1.2 Loading Phi-3 GGUF in LangChain

wget https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-fp16.gguf

from langchain import LlamaCpp

llm = LlamaCpp(
    model_path="Phi-3-mini-4k-instruct-fp16.gguf",
    n_gpu_layers=-1,    # all layers on GPU
    max_tokens=500,
    n_ctx=2048,
    seed=42,
    verbose=False,
)

llm.invoke("Hi! My name is Maarten. What is 1 + 1?")
# ''   ← empty! Phi-3 needs its chat template

Without the proper template, Phi-3 returns nothing. With transformers.pipeline the template is applied automatically. With LlamaCpp we have to wire it in via a chain.
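A minimal illustration of the wrapping that has to happen (the chain in Section 2 does this properly via a PromptTemplate; the helper name here is made up):

```python
def phi3_wrap(user_message):
    # Wrap a raw prompt in Phi-3's special tokens so the model sees
    # the chat format it was trained on.
    return f"<s><|user|>\n{user_message}<|end|>\n<|assistant|>"

wrapped = phi3_wrap("Hi! My name is Maarten. What is 1 + 1?")
# Passing `wrapped` to llm.invoke(...) yields a real answer instead of ''.
```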

The chapter's examples work with any LLM. For closed-source: from langchain.chat_models import ChatOpenAI; chat_model = ChatOpenAI(openai_api_key="MY_KEY").


Section 2: Chains

A chain wires modular components (prompt template, memory, tools, even other chains) onto an LLM.

2.1 Single Chain: Prompt Template + LLM

Phi-3's chat template uses four special tokens: <s>, <|user|>, <|assistant|>, and <|end|>.

from langchain import PromptTemplate

template = """<s><|user|>
{input_prompt}<|end|>
<|assistant|>"""
prompt = PromptTemplate(template=template, input_variables=["input_prompt"])

basic_chain = prompt | llm    # `|` is LangChain's pipe operator

basic_chain.invoke({"input_prompt": "Hi! My name is Maarten. What is 1 + 1?"})
# "The answer to 1 + 1 is 2..."

The same pattern works for templated parametric prompts:

template = "Create a funny name for a business that sells {product}."
name_prompt = PromptTemplate(template=template, input_variables=["product"])
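Under the hood the substitution is just named string formatting; plain str.format shows what the filled prompt looks like before it reaches the LLM:

```python
template = "Create a funny name for a business that sells {product}."
# PromptTemplate performs (roughly) this substitution when the chain runs.
prompt_text = template.format(product="rubber ducks")
```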

2.2 Multi-Step Chains

For complex tasks, split the work across multiple sequential prompts so each handles a subtask.

A story-generation example with three subtasks: title → character description → story.

from langchain import LLMChain

# 1. Title chain
template = """<s><|user|>
Create a title for a story about {summary}. Only return the title.<|end|>
<|assistant|>"""
title = LLMChain(llm=llm,
                 prompt=PromptTemplate(template=template, input_variables=["summary"]),
                 output_key="title")

# 2. Character chain — uses summary AND title
template = """<s><|user|>
Describe the main character of a story about {summary} with the title {title}. Use only two sentences.<|end|>
<|assistant|>"""
character = LLMChain(llm=llm,
                     prompt=PromptTemplate(template=template, input_variables=["summary", "title"]),
                     output_key="character")

# 3. Story chain — uses summary, title, AND character
template = """<s><|user|>
Create a story about {summary} with the title {title}. The main character is: {character}. Only return the story and it cannot be longer than one paragraph.<|end|>
<|assistant|>"""
story = LLMChain(llm=llm,
                 prompt=PromptTemplate(template=template, input_variables=["summary", "title", "character"]),
                 output_key="story")

llm_chain = title | character | story
llm_chain.invoke("a girl that lost her mother")
# {'summary': '...', 'title': '...', 'character': '...', 'story': '...'}

This pattern beats a single mega-prompt for two reasons. Each subtask gets its own focus, and we can access intermediate outputs (e.g., the title alone) for downstream use.
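The data flow can be sketched in plain Python with a stub standing in for the model (fake_llm is not a real call), which makes the growing state dictionary visible:

```python
def fake_llm(prompt):
    # Stub standing in for a real model call.
    return f"<response to: {prompt[:40]}>"

def run_story_pipeline(summary):
    state = {"summary": summary}
    state["title"] = fake_llm(
        f"Create a title for a story about {state['summary']}.")
    state["character"] = fake_llm(
        f"Describe the main character of a story about {state['summary']} "
        f"with the title {state['title']}.")
    state["story"] = fake_llm(
        f"Create a story about {state['summary']} with the title "
        f"{state['title']}. The main character is: {state['character']}.")
    return state

result = run_story_pipeline("a girl that lost her mother")
# result carries every intermediate output, just like the chain's dict.
```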


Section 3: Memory — Helping LLMs Remember

LLMs are stateless between calls. By default they forget your name as soon as the next prompt arrives.

3.1 ConversationBufferMemory

Append the entire conversation history to each prompt.

from langchain.memory import ConversationBufferMemory

template = """<s><|user|>Current conversation:{chat_history}
{input_prompt}<|end|>
<|assistant|>"""
prompt = PromptTemplate(template=template, input_variables=["input_prompt", "chat_history"])

memory = ConversationBufferMemory(memory_key="chat_history")
llm_chain = LLMChain(prompt=prompt, llm=llm, memory=memory)

llm_chain.invoke({"input_prompt": "Hi! My name is Maarten. What is 1 + 1?"})
llm_chain.invoke({"input_prompt": "What is my name?"})
# 'Your name is Maarten.'

Now the model remembers earlier turns.

The drawback is that the stored history grows without bound, eventually exceeding the model's context length.
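The growth is easy to see with character counts standing in for tokens (a rough proxy; the messages are the chapter's examples plus one made-up turn):

```python
history = []
prompt_lengths = []
for user_msg in ["Hi! My name is Maarten. What is 1 + 1?",
                 "What is my name?",
                 "What was my first question?"]:
    # Every new prompt carries the entire history before it.
    prompt = "\n".join(history + [user_msg])
    prompt_lengths.append(len(prompt))
    history.append(user_msg)
    history.append("<assistant reply>")
# prompt_lengths increases every turn and never shrinks.
```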

3.2 ConversationBufferWindowMemory

Keep only the last k exchanges:

from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(k=2, memory_key="chat_history")

After three turns where the user mentioned their age in turn 1, asking "What is my age?" returns "I'm unable to determine your age" because turn 1 has aged out of the window.

The tradeoff is clear: this fixes context bloat but loses old information entirely.
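Conceptually the window is just a fixed-size deque of exchanges. A sketch (not LangChain's implementation; the class and turns below are illustrative):

```python
from collections import deque

class WindowMemory:
    """Keep only the last k user/assistant exchanges."""
    def __init__(self, k):
        self.turns = deque(maxlen=k)   # old items fall off automatically

    def save(self, user, assistant):
        self.turns.append((user, assistant))

    def history(self):
        return "\n".join(f"User: {u}\nAI: {a}" for u, a in self.turns)

memory = WindowMemory(k=2)
memory.save("Hi! I am 33 years old.", "Nice to meet you!")
memory.save("What is 1 + 1?", "2.")
memory.save("What is 3 + 3?", "6.")
# The age from turn 1 has been evicted, so "What is my age?" would fail.
```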

3.3 ConversationSummaryMemory

Use a second LLM call to summarize the conversation into a running synopsis.

from langchain.memory import ConversationSummaryMemory

summary_prompt_template = """<s><|user|>Summarize the conversations and update with the new lines.

Current summary:
{summary}

new lines of conversation:
{new_lines}

New summary:<|end|>
<|assistant|>"""
summary_prompt = PromptTemplate(input_variables=["new_lines", "summary"], template=summary_prompt_template)

memory = ConversationSummaryMemory(llm=llm, memory_key="chat_history", prompt=summary_prompt)
llm_chain = LLMChain(prompt=prompt, llm=llm, memory=memory)

Each user turn produces two LLM calls: one to update the summary and one to generate the response. You can use a smaller, faster LLM for the summarizer.

Summary memory is concise but can lose specific details. The chapter's example had to infer the original question rather than recall it verbatim.
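The update rule can be sketched with a stub summarizer; the real one is the second LLM call, and the length cap here crudely stands in for actual compression:

```python
def stub_summarizer(current_summary, new_lines):
    # Fold the new exchange into the running summary. A real LLM would
    # compress; this stub just concatenates and caps the length.
    combined = (current_summary + " " + new_lines).strip()
    return combined[-300:]

summary = ""
exchanges = [("Hi, I'm Maarten. What is 1 + 1?", "1 + 1 is 2."),
             ("What is my name?", "Your name is Maarten.")]
for user, ai in exchanges:
    summary = stub_summarizer(summary, f"User: {user} AI: {ai}")
# The summary stays bounded while still mentioning both turns.
```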

3.4 Memory Type Tradeoffs

  • Buffer — Pros: simplest; nothing within the context window is lost. Cons: token-heavy; only workable with large-context LLMs; retrieval of early details degrades as the history grows.
  • Windowed buffer — Pros: constant size; friendly to small-context models. Cons: forgets everything past the window; no compression within it.
  • Summary — Pros: captures the full history compactly; supports very long conversations. Cons: an extra LLM call per turn (slower); fidelity depends on the summarizer's quality.

Section 4: Agents — Letting LLMs Decide

So far chains follow paths we defined. Agents let the LLM decide what to do next, including calling external tools.

LLMs are notoriously bad at math, but with a calculator tool they suddenly aren't. Add a search engine, weather API, or anything else, and capabilities multiply.

There are two new components. Tools are functions the agent can invoke. The agent type is the strategy for choosing actions, and we use ReAct.

4.1 ReAct: Reasoning + Acting

ReAct interleaves three steps in a loop. Thought is when the LLM reasons about what to do. Action invokes a tool with specific input. Observation receives the tool's output.

Example: "What's a MacBook Pro in EUR?" The agent searches the web for the USD price, then uses the calculator to convert, then returns the final answer.
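The loop itself is simple. Here is a toy version with stubbed tools and a hard-coded trace; a real agent lets the LLM choose each action, and the price below is made up:

```python
def web_search(query):
    # Stub observation; a real tool would hit a search API.
    return "MacBook Pro price: 2000 USD"

def calculator(expression):
    # eval is acceptable here only because the toy input is trusted.
    return str(eval(expression))

tools = {"search": web_search, "calculator": calculator}

# Hard-coded Thought/Action pairs so the control flow is visible.
trace = [("Find the USD price", "search", "MacBook Pro price USD"),
         ("Convert to EUR", "calculator", "2000 * 0.85")]

observations = []
for thought, action, action_input in trace:
    observations.append(tools[action](action_input))

final_answer = observations[-1]
```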

4.2 Building a ReAct Agent in LangChain

The Phi-3-mini we've used isn't powerful enough to reliably follow ReAct's complex format. Switch to GPT-3.5:

import os
from langchain_openai import ChatOpenAI

os.environ["OPENAI_API_KEY"] = "MY_KEY"
openai_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

The ReAct prompt template:

react_template = """Answer the following questions as best you can. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought:{agent_scratchpad}"""

prompt = PromptTemplate(template=react_template,
                        input_variables=["tools", "tool_names", "input", "agent_scratchpad"])

The {agent_scratchpad} is where past Thought/Action/Observation entries get accumulated.
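How the scratchpad grows can be sketched directly (the step contents below are made up for illustration):

```python
scratchpad = ""
steps = [("I should look up the price first.", "duckduck",
          "MacBook Pro price USD", "Results: ... $2,249.00 ...")]
for thought, action, action_input, observation in steps:
    # Each completed step is appended before the prompt is re-sent, so
    # the LLM sees its own earlier reasoning and the tool results.
    scratchpad += (f" {thought}\nAction: {action}\n"
                   f"Action Input: {action_input}\n"
                   f"Observation: {observation}\nThought:")
```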

Define the tools:

from langchain.agents import load_tools, Tool
from langchain.tools import DuckDuckGoSearchResults

search = DuckDuckGoSearchResults()
search_tool = Tool(
    name="duckduck",
    description="A web search engine. Use this as a search engine for general queries.",
    func=search.run,
)

tools = load_tools(["llm-math"], llm=openai_llm)  # math via LLM-backed calculator
tools.append(search_tool)

Build and run the agent:

from langchain.agents import AgentExecutor, create_react_agent

agent = create_react_agent(openai_llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools,
                               verbose=True, handle_parsing_errors=True)

agent_executor.invoke({
    "input": "What is the current price of a MacBook Pro in USD? "
             "How much would it cost in EUR if the exchange rate is 0.85 EUR for 1 USD?"
})

Output:

The current price of a MacBook Pro in USD is $2,249.00.
It would cost approximately 1911.65 EUR with an exchange rate of 0.85 EUR for 1 USD.

4.3 Cautions

Autonomous agents are powerful but error-prone. There's no human in the loop at intermediate steps, so bad reasoning compounds. Tool output may be wrong (the search engine might surface stale or wrong prices). Reliability requires design: return source URLs, ask for confirmation between steps, and log every Thought/Action/Observation for auditing.


Summary

  • Quantization shrinks model size with minimal quality loss. Aim for ≥4-bit. GGUF is the file format llama-cpp-python uses. LangChain's LlamaCpp loads GGUF models.
  • A chain wires a prompt template (and other modules) to an LLM with the | operator. Use chains to apply chat templates automatically and to compose multi-step pipelines like title → character → story.
  • LLMs are stateless. Memory modules give them context across turns:
    • ConversationBufferMemory keeps full history (token-heavy)
    • ConversationBufferWindowMemory keeps only the last k turns (forgets older context)
    • ConversationSummaryMemory keeps a running summary via a second LLM call (compact but adds latency and may drop specifics)
  • Agents let the LLM choose what to do. The ReAct framework loops Thought → Action → Observation, where actions are tool calls (calculators, search engines, APIs).
  • ReAct dramatically expands what an LLM can do alone, but autonomous loops need careful guardrails: smaller LLMs may not reliably follow the format, tool outputs can be wrong, and intermediate steps need observability.
  • Combining chains, memory, and agents is where LLM-based systems start to shine. The rest of the book builds on these primitives.