Chapter 7: Advanced Text Generation Techniques and Tools
Introduction
Prompt engineering only takes you so far. To build systems on top of LLMs you need ways to load and orchestrate models, give them memory across turns, chain multiple prompts together, and let them reach out to external tools. This chapter introduces these patterns through LangChain. We cover Model I/O (loading quantized GGUF models), Chains (linking prompts and modules), Memory (buffer, windowed buffer, summary), and Agents with the ReAct framework. The same ideas show up in newer frameworks like DSPy and Haystack.
Section 1: Model I/O — Loading Quantized Models
1.1 Quantization
A neural net's weights are floating-point numbers. Quantization reduces the bits used to store each weight (e.g., 16-bit to 8-bit). The model gets smaller and faster at the cost of a slight accuracy hit.
The analogy is saying "14:16" instead of "14:16 and 12 seconds". Same useful info, less precision. Chapter 12 covers the algorithm. For now we just use a quantized model.
Rule of thumb: stick to ≥4-bit quantization. Below that (3-bit, 2-bit) the quality drop is noticeable — better to pick a smaller model at higher precision.
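To make the idea concrete, here is a toy sketch of linear 8-bit quantization in NumPy. It is an illustration only, not the GGUF scheme (Chapter 12 covers the real algorithms):
import numpy as np
weights = np.array([0.21, -1.53, 0.04, 0.88, -0.37], dtype=np.float32)
scale = np.abs(weights).max() / 127                    # one scale for the tensor
quantized = np.round(weights / scale).astype(np.int8)  # 1 byte per weight
restored = quantized.astype(np.float32) * scale        # approximate recovery
print(restored)  # close to the originals, but some precision is gone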
We use the FP16 (16-bit) variant of Phi-3 in GGUF format, which is the format llama-cpp-python expects.
1.2 Loading Phi-3 GGUF in LangChain
wget https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-fp16.gguf
from langchain import LlamaCpp
llm = LlamaCpp(
    model_path="Phi-3-mini-4k-instruct-fp16.gguf",
    n_gpu_layers=-1,  # all layers on GPU
    max_tokens=500,
    n_ctx=2048,
    seed=42,
    verbose=False,
)
llm.invoke("Hi! My name is Maarten. What is 1 + 1?")
# '' ← empty! Phi-3 needs its chat template
Without the proper template, Phi-3 returns nothing. With transformers.pipeline the template is applied automatically. With LlamaCpp we have to wire it in via a chain.
The chapter's examples work with any LLM. For a closed-source model such as OpenAI's:
from langchain.chat_models import ChatOpenAI
chat_model = ChatOpenAI(openai_api_key="MY_KEY")
Section 2: Chains
A chain wires modular components (prompt template, memory, tools, even other chains) onto an LLM.
2.1 Single Chain: Prompt Template + LLM
Phi-3's chat template uses four special tokens: <s>, <|user|>, <|assistant|>, and <|end|>.
from langchain import PromptTemplate
template = """<s><|user|>
{input_prompt}<|end|>
<|assistant|>"""
prompt = PromptTemplate(template=template, input_variables=["input_prompt"])
basic_chain = prompt | llm # `|` is LangChain's pipe operator
basic_chain.invoke({"input_prompt": "Hi! My name is Maarten. What is 1 + 1?"})
# "The answer to 1 + 1 is 2..."
The same pattern works for parametric prompts; note that the chat-template tokens still need to wrap the instruction:
template = """<s><|user|>
Create a funny name for a business that sells {product}.<|end|>
<|assistant|>"""
name_prompt = PromptTemplate(template=template, input_variables=["product"])
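A quick usage sketch (the product value is illustrative):
name_chain = name_prompt | llm
print(name_chain.invoke({"product": "socks"}))  # output varies run to run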
2.2 Multi-Step Chains
For complex tasks, split the work across multiple sequential prompts so each handles a subtask.
A story-generation example with three subtasks: title → character description → story.
from langchain import LLMChain
# 1. Title chain
template = """<s><|user|>
Create a title for a story about {summary}. Only return the title.<|end|>
<|assistant|>"""
title = LLMChain(llm=llm,
                 prompt=PromptTemplate(template=template, input_variables=["summary"]),
                 output_key="title")
# 2. Character chain — uses summary AND title
template = """<s><|user|>
Describe the main character of a story about {summary} with the title {title}. Use only two sentences.<|end|>
<|assistant|>"""
character = LLMChain(llm=llm,
                     prompt=PromptTemplate(template=template, input_variables=["summary", "title"]),
                     output_key="character")
# 3. Story chain — uses summary, title, AND character
template = """<s><|user|>
Create a story about {summary} with the title {title}. The main character is: {character}. Only return the story and it cannot be longer than one paragraph.<|end|>
<|assistant|>"""
story = LLMChain(llm=llm,
                 prompt=PromptTemplate(template=template, input_variables=["summary", "title", "character"]),
                 output_key="story")
llm_chain = title | character | story
llm_chain.invoke("a girl that lost her mother")
# {'summary': '...', 'title': '...', 'character': '...', 'story': '...'}
This pattern beats a single mega-prompt for two reasons. Each subtask gets its own focus, and we can access intermediate outputs (e.g., the title alone) for downstream use.
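Since invoke() returns a dictionary keyed by each chain's output_key (as shown above), the intermediate pieces are directly addressable:
result = llm_chain.invoke("a girl that lost her mother")
print(result["title"])      # reuse the title alone, e.g., as a document name
print(result["character"])  # or feed the character sketch to another chain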
Section 3: Memory — Helping LLMs Remember
LLMs are stateless between calls. By default they forget your name as soon as the next prompt arrives.
3.1 ConversationBufferMemory
Append the entire conversation history to each prompt.
from langchain.memory import ConversationBufferMemory
template = """<s><|user|>Current conversation:{chat_history}
{input_prompt}<|end|>
<|assistant|>"""
prompt = PromptTemplate(template=template, input_variables=["input_prompt", "chat_history"])
memory = ConversationBufferMemory(memory_key="chat_history")
llm_chain = LLMChain(prompt=prompt, llm=llm, memory=memory)
llm_chain.invoke({"input_prompt": "Hi! My name is Maarten. What is 1 + 1?"})
llm_chain.invoke({"input_prompt": "What is my name?"})
# 'Your name is Maarten.'
Now the model has memory: it recalls the name given two turns earlier. The drawback is that the history grows without bound, eventually exceeding the context length (n_ctx=2048 here).
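To watch the bloat happen, inspect what the memory prepends to every prompt via its standard load_memory_variables accessor:
history = memory.load_memory_variables({})["chat_history"]
print(len(history))  # grows with every turn; eventually it overflows n_ctx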
3.2 ConversationBufferWindowMemory
Keep only the last k exchanges:
from langchain.memory import ConversationBufferWindowMemory
memory = ConversationBufferWindowMemory(k=2, memory_key="chat_history")
With k=2, if the user mentions their age in turn 1 and then chats for two more turns, asking "What is my age?" returns "I'm unable to determine your age" because turn 1 has slid out of the window.
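A sketch reproducing that experiment, reusing the prompt and chain setup from 3.1 (inputs are illustrative):
llm_chain = LLMChain(prompt=prompt, llm=llm, memory=memory)
llm_chain.invoke({"input_prompt": "Hi! My name is Maarten and I am 33 years old. What is 1 + 1?"})
llm_chain.invoke({"input_prompt": "What is 3 + 3?"})
llm_chain.invoke({"input_prompt": "What is my name?"})  # turn 1 is still inside the k=2 window
llm_chain.invoke({"input_prompt": "What is my age?"})   # turn 1 has now slid out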
The tradeoff is clear: this fixes context bloat but loses old information entirely.
3.3 ConversationSummaryMemory
Use a second LLM call to summarize the conversation into a running synopsis.
from langchain.memory import ConversationSummaryMemory
summary_prompt_template = """<s><|user|>Summarize the conversations and update with the new lines.
Current summary:
{summary}
new lines of conversation:
{new_lines}
New summary:<|end|>
<|assistant|>"""
summary_prompt = PromptTemplate(input_variables=["new_lines", "summary"], template=summary_prompt_template)
memory = ConversationSummaryMemory(llm=llm, memory_key="chat_history", prompt=summary_prompt)
llm_chain = LLMChain(prompt=prompt, llm=llm, memory=memory)
Each user turn produces two LLM calls, one for the summarization and one for the response. You can use a smaller and faster LLM for the summarizer.
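Nothing ties the summarizer to the chat model. A sketch of swapping in a smaller model; the GGUF path below is hypothetical:
small_llm = LlamaCpp(model_path="a-smaller-model.gguf", n_ctx=2048, verbose=False)  # hypothetical file
memory = ConversationSummaryMemory(llm=small_llm, memory_key="chat_history", prompt=summary_prompt)
llm_chain = LLMChain(prompt=prompt, llm=llm, memory=memory)  # responses still use the main llm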
Summary memory is concise but can lose specific details. The chapter's example had to infer the original question rather than recall it verbatim.
3.4 Memory Type Tradeoffs
| Memory | Pros | Cons |
|---|---|---|
| Buffer | Simplest; nothing lost in-window | Token-heavy; needs a large-context LLM; finding specific facts in a long transcript gets harder |
| Windowed Buffer | Constant-size; small-context-friendly | Forgets past the window; no compression within window |
| Summary | Captures full history compactly; supports very long conversations | Extra LLM call per turn (slower); fidelity tied to summarizer's quality |
Section 4: Agents — Letting LLMs Decide
So far chains follow paths we defined. Agents let the LLM decide what to do next, including calling external tools.
LLMs are notoriously bad at math, but with a calculator tool they suddenly aren't. Add a search engine, weather API, or anything else, and capabilities multiply.
Agents introduce two new components. Tools are functions the agent can invoke. The agent type is the strategy for choosing actions; here we use ReAct.
4.1 ReAct: Reasoning + Acting
ReAct interleaves three steps in a loop. Thought is when the LLM reasons about what to do. Action invokes a tool with specific input. Observation receives the tool's output.
Example: "What's a MacBook Pro in EUR?" The agent searches the web for the USD price, then uses the calculator to convert, then returns the final answer.
4.2 Building a ReAct Agent in LangChain
The Phi-3-mini we've used isn't powerful enough to reliably follow ReAct's complex format. Switch to GPT-3.5:
import os
from langchain_openai import ChatOpenAI
os.environ["OPENAI_API_KEY"] = "MY_KEY"
openai_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
The ReAct prompt template:
react_template = """Answer the following questions as best you can. You have access to the following tools:
{tools}
Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Begin!
Question: {input}
Thought:{agent_scratchpad}"""
prompt = PromptTemplate(
    template=react_template,
    input_variables=["tools", "tool_names", "input", "agent_scratchpad"],
)
The {agent_scratchpad} is where past Thought/Action/Observation entries get accumulated.
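To make the loop concrete, here is a simplified sketch of the control flow an executor runs with this prompt. It is not LangChain's actual implementation; the regex parsing and the names-only tool listing are bare-minimum assumptions.
import re
def react_loop(llm_call, tools, question, max_steps=5):
    # llm_call: function str -> str; tools: dict of tool name -> function
    scratchpad = ""
    for _ in range(max_steps):
        text = llm_call(react_template.format(
            tools="\n".join(tools),  # simplified: names only, no descriptions
            tool_names=", ".join(tools),
            input=question,
            agent_scratchpad=scratchpad,
        ))
        if "Final Answer:" in text:  # the model decided it is done
            return text.split("Final Answer:")[-1].strip()
        action = re.search(r"Action: (.*)", text).group(1).strip()
        action_input = re.search(r"Action Input: (.*)", text).group(1).strip()
        observation = tools[action](action_input)  # run the chosen tool
        scratchpad += f"{text}\nObservation: {observation}\nThought:"
    return "Stopped: step limit reached."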
Define the tools:
from langchain.agents import load_tools, Tool
from langchain.tools import DuckDuckGoSearchResults
search = DuckDuckGoSearchResults()
search_tool = Tool(
    name="duckduck",
    description="A web search engine. Use it as a search engine for general queries.",
    func=search.run,
)
tools = load_tools(["llm-math"], llm=openai_llm) # math via LLM-backed calculator
tools.append(search_tool)
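It can help to call the tools directly before handing them to the agent (queries are illustrative):
print(search.run("current price MacBook Pro USD"))
print(tools[0].run("2249 * 0.85"))  # the llm-math calculator tool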
Build and run the agent:
from langchain.agents import AgentExecutor, create_react_agent
agent = create_react_agent(openai_llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools,
                               verbose=True, handle_parsing_errors=True)
agent_executor.invoke({
    "input": "What is the current price of a MacBook Pro in USD? "
    "How much would it cost in EUR if the exchange rate is 0.85 EUR for 1 USD?"
})
Output:
The current price of a MacBook Pro in USD is $2,249.00.
It would cost approximately 1911.65 EUR with an exchange rate of 0.85 EUR for 1 USD.
4.3 Cautions
Autonomous agents are powerful but error-prone. There's no human in the loop at intermediate steps, so bad reasoning compounds. Tool output may be wrong (the search engine might surface stale or wrong prices). Reliability requires design: return source URLs, ask for confirmation between steps, and log every Thought/Action/Observation for auditing.
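One concrete guardrail: wrap a tool's function so a human approves each call before it runs. A minimal sketch; the wrapper name is ours, not a LangChain API.
def with_confirmation(tool_func, name):
    # Ask a human before every tool call; deny by default.
    def wrapper(tool_input):
        reply = input(f"Agent wants {name}({tool_input!r}). Allow? [y/N] ")
        if reply.strip().lower() != "y":
            return "Tool call denied by the user."
        return tool_func(tool_input)
    return wrapper
guarded_search = Tool(
    name="duckduck",
    description="A web search engine. Use it as a search engine for general queries.",
    func=with_confirmation(search.run, "duckduck"),
)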
Summary
- Quantization shrinks model size with minimal quality loss. Aim for ≥4-bit. GGUF is the file format `llama-cpp-python` uses, and LangChain's `LlamaCpp` loads GGUF models.
- A chain wires a prompt template (and other modules) to an LLM with the `|` operator. Use chains to apply chat templates automatically and to compose multi-step pipelines like title → character → story.
- LLMs are stateless. Memory modules give them context across turns:
  - `ConversationBufferMemory` keeps the full history (token-heavy)
  - `ConversationBufferWindowMemory` keeps only the last k turns (forgets older context)
  - `ConversationSummaryMemory` keeps a running summary via a second LLM call (compact, but adds latency and may drop specifics)
- Agents let the LLM choose what to do. The ReAct framework loops Thought → Action → Observation, where actions are tool calls (calculators, search engines, APIs).
- ReAct dramatically expands what an LLM can do alone, but autonomous loops need careful guardrails: smaller LLMs may not reliably follow the format, tool outputs can be wrong, and intermediate steps need observability.
- Combining chains, memory, and agents is where LLM-based systems start to shine. The rest of the book builds on these primitives.