Chapter 7: Enabling Agents to Take Action
Introduction
Earlier chapters were about generating content. The three patterns here cross the line into agentic behavior: letting LLMs interact with the world. Tool Calling (21) lets the model invoke external functions. Code Execution (22) lets it run code in a sandbox. Multiagent Collaboration (23) orchestrates multiple specialized agents through hierarchical, peer-to-peer, or market-based architectures. Together with Reflection (Pattern 18), these patterns are usually considered the threshold beyond which an LLM application becomes an agent.
Pattern 21 — Tool Calling
Tool Calling lets an LLM invoke an external function by emitting special tokens with the function name and arguments. A client-side postprocessor calls the function and feeds the result back to the LLM.
Problem
Multimodal LLMs generate content. They can write "Book one seat on TK 161…" but can't actually book a flight, issue a refund, or move money. RAG (Chapters 3-4) injects new information but doesn't act.
Solution
The LLM is trained to emit a structured tool-call token when it determines a function should be called:
[CALL_TOOL: book_flight, TK 161, 2025-06-12, Economy]
A client program parses the call, invokes book_flight(...), and substitutes the result back:
Thanks for booking a flight with us!
Here's your flight confirmation:
{fd.flight_number}
...
I have billed your {fd.payment_method} for {fd.invoiced_amount}.
What Tool Calling unlocks
- Up-to-date knowledge (current news, weather, stock prices).
- Personalization through email, calendar, and similar connectors.
- Enterprise APIs against internal databases and search engines.
- Calculations through calculators, GIS, and optimization solvers.
- ReAct (Reasoning + Acting), which interleaves CoT with tool calls so the model adapts its plan based on tool outputs (illustrated below). Because ReAct is just CoT plus Tool Calling, it isn't treated as a separate pattern in this book.
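For illustration, a ReAct transcript for the booking task might look like this; the tool names search_flights and book_flight are hypothetical:

Thought: I should confirm a matching flight exists before booking.
Action: search_flights("MRU", "IST", "2025-06-12")
Observation: [{"flight_code": "TK 161", "cabin": "economy", "seats": 4}, ...]
Thought: TK 161 has economy seats; book one.
Action: book_flight("TK 161", "2025-06-12", "economy")
Observation: {"status": "CONFIRMED", "booking_reference": "ABC123"}
Thought: Booking confirmed; summarize the details for the user.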
A note on naming. Toolformer was the inventors' term. Andrew Ng and Anthropic call it Tool Use; OpenAI and Gemini call it Function Calling. This book uses "Tool Calling" because tool is more general than function (it covers APIs and remote proxies) and calling emphasizes that the LLM only emits the call; the client invokes it.
OpenAI Function Calling (Low-Level)
Step 1: Implement the function
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import List

import requests

@dataclass
class BookingData:
    ...

class CabinClass(Enum):
    ...

def book_flight(flight_code: str,
                departure_date: datetime,
                cabin_class: CabinClass,
                passenger_details: List[PassengerInfo]) -> BookingData:
    # Call the airline's booking endpoint and wrap the response.
    response = requests.post("https://api.turkishairlines.com/...", json={...})
    booking_data = response.json()
    return BookingData(**booking_data)
Step 2: Pass tool definitions to the LLM
tools = [{
    "type": "function",
    "name": "book_flight",
    "description": "Books a flight using the airline API",
    "parameters": {
        "type": "object",
        "properties": {
            "flight_code": {"type": "string",
                            "description": "IATA flight code like AA 123"},
            "departure_date": {"type": "string",
                               "description": "YYYY-MM-DD format..."},
            "cabin_class": {"type": "string",
                            "enum": ["economy", "premium_economy", "business", "first"],
                            "description": "Class of travel"},
            ...
        }
    }
}]

response = client.responses.create(
    model="gpt-4.1",
    input=[{"role": "user",
            "content": "Book me an economy class ticket from Mauritius to Istanbul on June 12..."}],
    tools=tools,
)
Tool functions need self-descriptive names and clear descriptions. The model uses them to decide when and how to call.
Provider variations are real: Anthropic uses input_schema (not parameters), and Llama nests the definition under a function key. Use an LLM-agnostic framework (PydanticAI, LangChain, LangGraph, LiteLLM); coding directly against one provider's client API will lock you in.
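As a sketch of what the LLM-agnostic route looks like, here is a simplified booking tool registered with PydanticAI; the flattened signature and canned return value are illustrative, not the book's full example:

from pydantic_ai import Agent

agent = Agent("openai:gpt-4.1")

@agent.tool_plain
def book_flight(flight_code: str, departure_date: str, cabin_class: str) -> str:
    """Books a flight using the airline API."""
    return f"Booked {flight_code} ({cabin_class}) for {departure_date}"

result = agent.run_sync("Book an economy seat on TK 161 for 2025-06-12.")
print(result.output)   # .output in recent PydanticAI versions (.data in older ones)

Swapping the model string for another provider (say, an Anthropic model) leaves the tool code untouched; the framework handles each provider's schema quirks.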
Step 3: Process and invoke client-side
The model never invokes the function itself (a security boundary). It returns:

{
    "type": "function_call",
    "id": "fc_12345xyz",
    "call_id": "call_12345xyz",
    "name": "book_flight",
    "arguments": "{\"flight_code\":\"TK 161\",..."
}
You parse and call:
tool_call = response.output[0]
if tool_call.name == "book_flight":
    args = json.loads(tool_call.arguments)
    result = book_flight(args["flight_code"], ...)
Step 4: Send the result back
input_messages.append(tool_call)
input_messages.append({
    "type": "function_call_output",
    "call_id": tool_call.call_id,
    "output": json.dumps(result)
})

response_2 = client.responses.create(model="gpt-4.1",
                                     input=input_messages, tools=tools)
Step 5: Final response
The model integrates the result:
Great news! I've successfully booked your flight from Mauritius (MRU) to Istanbul (IST). Booking details: ...
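In practice, Steps 2 through 5 run in a loop until the model stops requesting tools. A minimal driver sketch, assuming a hypothetical dispatch() registry that maps tool names to the Python functions from Step 1:

# Loop until the model answers in plain text instead of requesting a tool.
while True:
    response = client.responses.create(model="gpt-4.1",
                                       input=input_messages, tools=tools)
    calls = [item for item in response.output
             if item.type == "function_call"]
    if not calls:
        break                                   # no more tool requests
    for call in calls:
        result = dispatch(call.name, json.loads(call.arguments))  # dispatch() is hypothetical
        input_messages.append(call)             # echo the model's call back
        input_messages.append({"type": "function_call_output",
                               "call_id": call.call_id,
                               "output": json.dumps(result)})

print(response.output_text)                     # the final integrated answer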
LangGraph + MCP (High-Level)
The Model Context Protocol (MCP) standardizes how tool servers describe their tools and accept calls, so a client can pass the definitions to any LLM. LangGraph wraps the client-side steps.
MCP server (decorator-driven)
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("flight_booking")

@mcp.tool()
async def book_flight(flight_code: str,
                      departure_date: datetime,
                      cabin_class: CabinClass,
                      passenger_details: List[PassengerInfo]) -> BookingData:
    """
    Books a flight using the airline API

    Args:
        flight_code: IATA airline flight code such as AA 123
        departure_date: Date of departure
        cabin_class: Class of travel (economy, premium_economy, business, first)
        passenger_details: List of passenger information including names and passport details

    Returns:
        Booking confirmation details including booking reference, flight numbers, and total price
    """
    ...

if __name__ == "__main__":
    mcp.run(transport="stdio")  # for in-process Python
    # or mcp.run(transport="streamable-http")  # for cross-language / remote
The function name, parameter names, and docstrings carry all the info the LLM needs.
MCP client
from langchain_mcp_adapters.client import MultiServerMCPClient
from langgraph.prebuilt import create_react_agent

async with MultiServerMCPClient({
    "flight_booking": {
        "command": "python",
        "args": ["/path/to/flight_booking.py"],
        "transport": "stdio",
    },
    "flight_options": {
        "url": "http://localhost:8000/mcp",
        "transport": "streamable_http",
    }
}) as client:
    agent = create_react_agent(
        "anthropic:claude-3-7-sonnet-latest",
        client.get_tools()
    )
    booking_details = await agent.ainvoke(
        {"messages": [{"role": "user",
                       "content": "Book me an economy class ticket from Mauritius to Istanbul on June 12..."}]}
    )
The ReAct agent reasons about when to call tools and how to incorporate their responses.
Example: Weather Question
For "Will it rain in Chicago on Tuesday?" you need Chicago's lat/lon and the weather forecast at that lat/lon.
import os

import googlemaps
import requests
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather")  # must be created before the @mcp.tool() decorators run
gmaps = googlemaps.Client(key=os.environ["GOOGLE_MAPS_API_KEY"])  # env var name is illustrative
headers = {"User-Agent": "weather-app (you@example.com)"}  # the NWS API requires a User-Agent

@mcp.tool()
async def get_weather_from_nws(latitude: float, longitude: float) -> list:
    """Fetches weather data from the National Weather Service API for a specific geographic location."""
    base_url = "https://api.weather.gov/points/"
    points_url = f"{base_url}{latitude},{longitude}"
    response = requests.get(points_url, headers=headers)
    metadata = response.json()
    forecast_url = metadata.get("properties", {}).get("forecast")
    response = requests.get(forecast_url, headers=headers)
    weather_data = response.json()
    return weather_data.get("properties", {}).get("periods")

@mcp.tool()
async def latlon_geocoder(location: str) -> tuple[float, float]:
    """Converts a place name such as "Kalamazoo, Michigan" to latitude and longitude coordinates"""
    geocode_result = gmaps.geocode(location)
    return (round(geocode_result[0]['geometry']['location']['lat'], 4),
            round(geocode_result[0]['geometry']['location']['lng'], 4))

if __name__ == '__main__':
    mcp.run(transport="streamable-http")
The ReAct agent automatically chains them: geocode, then weather. A few-shot system prompt with worked steps improves reliability.
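A sketch of such a system prompt, with one worked chain; the question and coordinates are illustrative:

system_prompt = """You answer weather questions by calling tools.
Worked example:
Question: Will it rain in Kalamazoo, Michigan on Tuesday?
Step 1: latlon_geocoder("Kalamazoo, Michigan") -> (42.2917, -85.5872)
Step 2: get_weather_from_nws(42.2917, -85.5872) -> list of forecast periods
Step 3: Find the Tuesday periods and answer from their precipitation text.
"""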
Considerations
Improving reliability
A few things move the needle:

- Clear, detailed function names and parameter descriptions, with documented policies (when to search vs. book, how long search results stay valid).
- Enum types via Grammar (Pattern 2) to constrain inputs.
- Fewer tools: as of June 2025, 3 to 10 tools is the sweet spot; more reduces accuracy.
- Don't make the model fill in information you already have deterministically.
- Descriptive error messages, combined with Reflection (Pattern 18) to retry.
MCP limitations (May 2025)
Three things to know:

- Security: MCP doesn't enforce auth; Cloudflare's Workers OAuth Provider Library fills the gap.
- Collaboration: MCP is mostly client-to-server; for agent-to-agent, see Google's A2A and IBM's ACP.
- Streaming: standard MCP calls time out after 30 to 60 seconds, so use streamable HTTP for long-running operations.
Prompt injection
Tool Calling expands the attack surface: adversaries plant malicious instructions in the untrusted data that flows through tools, steering downstream calls. Six defense patterns from Beurer-Kellner et al. 2025:
| Pattern | How |
|---|---|
| Action-Selector | Predefined action set; no feedback to agent |
| Plan-Then-Execute | Agent commits to a fixed plan; tool feedback can't deviate from it |
| Map-Reduce | Isolated subagents process untrusted prompts; reduce step uses constrained Action-Selector |
| Dual-LLM | Privileged LLM uses tools; sandboxed LLM processes untrusted data without tools |
| Code-Then-Execute | LLM writes a program that calls tools and spawns unprivileged LLMs for untrusted text |
| Context-Minimization | Strip the user's original prompt from context in subsequent steps |
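As one concrete example, a minimal Plan-Then-Execute sketch; plan_llm(), TOOLS, and summarizer_llm() are illustrative names, not a real API:

# The plan is frozen before any untrusted tool output is seen, so injected
# text in a result cannot add or reorder actions.
plan = plan_llm(user_request)    # e.g. [("latlon_geocoder", {...}), ("get_weather_from_nws", {...})]
observations = []
for tool_name, args in plan:
    observations.append(TOOLS[tool_name](**args))    # outputs never reach the planner
answer = summarizer_llm(user_request, observations)  # this LLM has no tool access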
References: ReAct (Yao et al. 2022); Toolformer (Schick et al. 2023); Beurer-Kellner et al. 2025. GitHub, Sentry, and Zapier all expose MCP servers.
Pattern 22 — Code Execution
Code Execution has the LLM generate code in a programming language or DSL, and an external system (typically a sandbox) executes it.
Problem
Tool Calling fits when a function takes a small list of parameters. It breaks down when the function expects a long DSL string: graph-drawing (Matplotlib, Mermaid), image annotation (ImageMagick), or database queries (SQL). LLMs are also bad at directly producing graphs and annotated images, but they're good at writing the code that produces them.
Solution
The LLM emits the DSL, a postprocessor sends it to the executor, and the result is returned. For database updates, have the LLM produce SQL and send it as a single transaction (see the sketch below); don't ask the LLM to maintain integrity itself.
Combine with ReAct so some interleaved steps are code execution.
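A minimal sketch of the single-transaction idea, assuming generated_statements holds the SQL statements the LLM produced (sqlite3 shown for brevity):

import sqlite3

conn = sqlite3.connect("bookings.db")
cur = conn.cursor()
try:
    for stmt in generated_statements:
        cur.execute(stmt)
    conn.commit()        # all statements land together
except sqlite3.Error as err:
    conn.rollback()      # a failure undoes everything
    print(f"Transaction rolled back: {err}")  # candidate feedback for a Reflection retry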
Example: Basketball Tournament Bracket
LLM emits Graphviz DOT for the matchups:
**Input**:
Saturday, March 29, 2025 (Elite Eight)
(1) Florida 84, (3) Texas Tech 79
(1) Duke 85, (2) Alabama 65
...
**Output**:
"Florida" -> "Texas Tech" [label="84-79"]
"Duke" -> "Alabama" [label="85-65"]
"Auburn" -> "Michigan State" [label="70-64"]
"Houston" -> "Tennessee" [label="69-50"]
subgraph cluster_elite_eight {
label = "Elite Eight"
{rank = same; "Texas Tech"; "Alabama"; "Michigan State"; "Tennessee"; }
}
Save and execute:
dot -Grankdir=LR -Tpng tournament.dot -o tournament.png
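Or, from Python, a sketch that runs the same command in a constrained subprocess:

import subprocess

subprocess.run(
    ["dot", "-Grankdir=LR", "-Tpng", "tournament.dot", "-o", "tournament.png"],
    check=True,    # raise on a non-zero exit so the error can flow back to the LLM
    timeout=10,    # guard against runaway rendering
)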
Considerations
- Sandboxing is the first concern: containers (Docker), VMs, or specialized runtimes; constrain CPU, memory, network, and time; monitor for infinite loops. Even sandboxes have escape vectors, so keep base images patched.
- Validate before execution: syntax check, static analysis, formal correctness checking (sketched below).
- Reflect on failures: compiler or runtime errors flow back to the LLM as feedback for a retry.
- Code Execution works best when the generated code is a narrow DSL with a parser (Graphviz DOT): a narrower target means more reliable LLM output.
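For the validation step, even a bare syntax check pays for itself when the generated code is Python; ask_llm_to_fix() is an illustrative Reflection hook:

import ast

try:
    ast.parse(generated_code)   # cheap syntax check before the sandbox even starts
except SyntaxError as err:
    generated_code = ask_llm_to_fix(generated_code, str(err))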
References: CodeT5 (Wang et al. 2021); AlphaCode, Codex, StarCoder; Huynh and Lin 2025 (LLM code-gen survey); HumanEval (Chen et al. 2021). Claude uses Mermaid for diagrams; Gemini generates Pandas code for financial analysis.
Pattern 23 — Multiagent Collaboration
Multiagent Collaboration orchestrates specialized single-purpose agents in human-organization-like structures (hierarchies, peer-to-peer networks, markets) to solve complex tasks beyond a single LLM call.
Problem
Single agents have four weaknesses. Cognitive bottlenecks, since context is finite and integration across domains is hard. Decreasing parameter efficiency, since bigger models give diminishing returns at huge cost. Limited reasoning depth, since sequential transformers can't pursue parallel reasoning paths. And domain adaptation issues, since fine-tuning a single model risks catastrophic forgetting.
Solution
Multiagent systems offer several benefits:

- Task decomposition: split complex problems among specialized agents.
- Parallel processing: multiple agents work simultaneously.
- Hierarchical problem-solving: high-level agents coordinate specialists.
- Domain-specific expertise: fine-tune small agents per domain.
- Functional specialization: agents serve as interfaces to different role-specific systems.
- Scalability: scale horizontally, replace agents independently, allocate compute dynamically.
- Robustness: replicate critical capabilities; agents verify each other.
- Emergent capabilities: interactions produce behaviors not explicitly trained.
Architectures
Hierarchical structures are tree-like: executive-worker relationships and multilevel hierarchies. The simplest is prompt chaining (a sequential workflow), where the first agent's output becomes the second agent's input. LangChain example:
from langchain.chains import LLMChain, SequentialChain
from langchain.prompts import PromptTemplate

paragraph_prompt = PromptTemplate(
    input_variables=["topic"],
    template="Write a concise and entertaining paragraph on {topic}.")
paragraph_chain = LLMChain(llm=llm, prompt=paragraph_prompt, output_key="paragraph")

title_prompt = PromptTemplate(
    input_variables=["paragraph"],
    template="Write a catchy title for ... {paragraph}")
title_chain = LLMChain(llm=llm, prompt=title_prompt, output_key="title")

keywords_prompt = PromptTemplate(
    input_variables=["paragraph", "title"],
    template="Extract up to 5 keywords ... {title} {paragraph}")
keywords_chain = LLMChain(llm=llm, prompt=keywords_prompt, output_key="keywords")

overall_chain = SequentialChain(
    chains=[paragraph_chain, title_chain, keywords_chain],
    input_variables=["topic"],
    output_variables=["paragraph", "title", "keywords"],
)
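Invoking the chain runs all three steps in order; the topic here is illustrative:

result = overall_chain.invoke({"topic": "the James Webb Space Telescope"})
print(result["title"])
print(result["keywords"])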
Peer-to-peer networks distribute authority equally; agents collaborate as peers via voting or consensus. CrewAI example with consensus on accept/reject/revise:
voting_and_consensus_task = Task(
    description="Review the preliminary recommendations from all editors... "
                "Engage in up to 3 rounds of discussion to reach consensus on "
                "ACCEPT, REJECT, or REVISE.",
    expected_output="The final decision (ACCEPT, REJECT, or REVISE)...",
    agent=senior_editor,  # CrewAI tasks take a single owning agent; the senior editor chairs the vote
    context=[senior_editor_review_task, content_editor_review_task,
             research_editor_review_task],
    callback=lambda output: print(f"## Final Decision: {output.raw_output}")
)
Market-based systems use auctions and utility maximization. Sealed-bid:
def run_auction(agents, car_description):
    # Sealed-bid: each agent submits one private bid; the highest wins.
    bids = {}
    for agent in agents:
        prompt = f"Here is the car for auction:\n {car_description}\n\nWhat is your maximum bid?"
        bid_response = agent.run(prompt)
        bids[agent.name] = int(bid_response.output)
    highest_bid = 0
    winner = None
    for agent_name, bid_amount in bids.items():
        if bid_amount > highest_bid:
            highest_bid = bid_amount
            winner = agent_name
    return winner, highest_bid
English auction (open ascending) is similar but iterates with a price increment until everyone passes. Auctions generalize to any task assignment where each agent can independently estimate fitness for the task.
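A sketch of that open ascending variant, reusing the same hypothetical agent.run() API as the sealed-bid code above:

def run_english_auction(agents, car_description, increment=500):
    # Raise the price round by round until all but one bidder has passed.
    price, winner, active = 0, None, list(agents)
    while len(active) > 1:
        asking = price + increment
        still_in = []
        for agent in active:
            prompt = (f"Here is the car for auction:\n{car_description}\n\n"
                      f"The price stands at {price}. Will you bid {asking}? Answer YES or NO.")
            if "YES" in agent.run(prompt).output.upper():
                still_in.append(agent)
        if not still_in:
            break                       # everyone passed; the current price stands
        price, active = asking, still_in
        winner = active[0].name         # simplification: first accepting bidder leads the round
    return winner, price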
Human-in-the-loop gives one agent a human-proxy role to resolve conflicts or inject preferences.
Use cases
The most common, useful, and least complex use of multiagent systems is breadth-first or parallel execution where the fastest wins (sketched below). Beyond that:

- Complex reasoning with agents specialized in math, law, and science domains.
- Multistep problem solving (planning, execution, monitoring, adaptation).
- Collaborative content creation (research, outline, draft, edit, fact-check).
- Adversarial verification (red team vs. blue team).
- Specialized-domain integration across modalities (text, image, audio, video) or channels (web, voice, text).
- Self-improving systems where evaluator agents critique others.
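A sketch of fastest-wins parallelism, assuming each agent exposes an async ainvoke() (as LangGraph agents do):

import asyncio

async def first_answer(agents, task):
    jobs = [asyncio.create_task(agent.ainvoke(task)) for agent in agents]
    done, pending = await asyncio.wait(jobs, return_when=asyncio.FIRST_COMPLETED)
    for job in pending:
        job.cancel()                    # stop the slower agents
    return done.pop().result()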
Example: Educational Content Pipeline (AG2)
A hybrid design: a hierarchical router (Task Assigner), a peer-to-peer review panel, a secretary summary, and a writer rewrite.
Step 1: Set up agents
import os
from typing import Literal

from autogen import ConversableAgent, LLMConfig
from pydantic import BaseModel

llm_config = LLMConfig(
    api_type="google",
    model="gemini-2.0-flash",
    api_key=os.environ.get("GEMINI_API_KEY"),
    temperature=0.2,
)

with llm_config:
    history_writer = ConversableAgent(name="history_writer",
                                      system_message="You are a historian ...")
    math_writer = ConversableAgent(name="math_writer",
                                   system_message="You are a math teacher ...")
    human = ConversableAgent(name="human", human_input_mode="ALWAYS")

class TaskAssignmentResponse(BaseModel):
    writer: Literal['HISTORIAN', 'MATH WRITER']

llm_task_config = LLMConfig(..., temperature=0.0,
                            response_format=TaskAssignmentResponse)
with llm_task_config:
    task_assigner = ConversableAgent(name="task_assigner",
                                     system_message=task_assigner_prompt)
The Task Assigner uses Grammar (Pattern 2) to constrain output to one of two writer roles. This pattern (one classifier fronting a worker pool) is called a router.
Step 2: Route to a writer
import json

task_response = human.run(recipient=task_assigner, message=question, max_turns=1)
task_response.process()
writer_role = json.loads(task_response.messages[-1]['content'])['writer']
writer = history_writer if writer_role == 'HISTORIAN' else math_writer
Step 3: Initial draft
content_response = task_assigner.run(recipient=writer, message=question, max_turns=1)
initial_draft = content_response.messages[-1]['content']
For "Why was the Battle of Plassey so pivotal?" the historian agent produces a draft about the British East India Company's expansion. For "x² + 50 = 150" the math agent walks through the algebraic steps including the ± sign caveat.
Step 4-5: Peer-to-peer review panel
reviewers = []
with llm_config:
    reviewers.append(ConversableAgent(name="district_admin", ...))
    reviewers.append(ConversableAgent(name="school_admin", ...))
    reviewers.append(ConversableAgent(name="secretary", ...))
    reviewers.append(ConversableAgent(name="conservative_parent", ...))
    reviewers.append(ConversableAgent(name="liberal_parent", ...))

pattern = RoundRobinPattern(
    initial_agent=reviewers[0],
    agents=reviewers,
    user_agent=None,
    group_manager_args={"llm_config": llm_config},
)
reviews, context, last_agent = initiate_group_chat(
    pattern=pattern,
    max_rounds=len(reviewers) + 1,  # everyone speaks once
    messages=f"You are part of a review panel ...\n{question}\n...\n{answer}"
)
Each reviewer gives feedback from their perspective. The district admin asks for plain language. The conservative parent wants neutral framing. The liberal parent wants it not to whitewash.
Step 6-8: Secretary summarizes, writer rewrites
The secretary (last agent in the round-robin) consolidates feedback into actionable directives. The writing agent receives:
rewrite_response = last_agent.run(
    recipient=history_writer, max_turns=2,
    message=f"""Please incorporate the feedback from a review panel...
You were asked to write the answer for this question:
{question}
You wrote the following:
{answer}
The reviewer panel has provided the following feedback:
{reviews.chat_history[-1]['content']}
Incorporate the feedback to rewrite the content.""")
The final response is more balanced. It keeps the historical narrative but neutralizes loaded vocabulary.
Considerations
Costs and complexity
Coordination cost grows nonlinearly with the number of agents. Anthropic recommends "simple, composable patterns rather than complex frameworks". Prefer peer or parallel agents to reduce wall-clock time. Communication consistency is hard when agents run at different speeds and latencies; specializing agents per task trades consistency complexity for communication complexity.
A2A (Agent-to-Agent) protocol
The A2A protocol decreases multiagent communication complexity. A Python PydanticAI agent can be exposed as A2A:
from pydantic_ai import Agent
agent = Agent('openai:gpt-4.1', ...)
app = agent.to_a2a()
uvicorn agent_to_a2a:app --host 0.0.0.0 --port 8093
A TypeScript Mastra client invokes it:
import { randomUUID } from "node:crypto";
import { A2A } from "@mastra/client-js";

const a2a = new A2A({ serverUrl: "https://...server.com:8093" });
const task = await a2a.sendTask({
  id: randomUUID(),
  message: { role: 'user', parts: [...] }
});
const stream = a2a.streamTaskUpdates(task.id, (update) => {
  console.log("Task update:", update);
});
Failure modes
A 2025 analysis found that 40 to 80 percent of multiagent tasks fail. Fourteen failure modes group into three categories:

- Specification issues (bad prompts, poorly defined roles, LLM limitations): agents don't follow tasks or roles, repeat steps, lose context, or fail to recognize completion.
- Interagent misalignment (conversations reset, agents don't ask for clarification, task derailment, withheld information, reasoning-action mismatches): hard to diagnose because different root causes look alike.
- Task verification problems: inadequate verification and premature termination.
If a single-agent system plus better UX plus human-in-the-loop suffices, prefer it over multiagent.
References: Andrew Ng's four agentic patterns (May 2024); OpenAI adds LLM-as-Judge / Parallelization / Router / Guardrails; Anthropic distinguishes pre-specified workflows from autonomous agents; Google's 2025 multiagent whitepaper; Cemri et al. 2025 on failure modes; Anthropic's 2024 blog on composable patterns. Devin spawns subagents (planning, coding, debugging, web search). Chapter 10 builds an end-to-end multiagent system.
Summary
| Pattern | Problem | Solution | When to use |
|---|---|---|---|
| Tool Calling (21) | LLM can't act on the world | LLM emits structured tool call, client invokes, result fed back | Real-time data, enterprise APIs, calculations, optimization, ReAct workflows |
| Code Execution (22) | Functions take a DSL, not parameters | LLM generates code; sandbox runs it | Graphs (Matplotlib, Mermaid), SQL, ImageMagick, database updates |
| Multiagent Collaboration (23) | Multistep tasks, parallel reasoning, specialized expertise | Specialized agents in hierarchical / peer / market / hybrid architectures | Parallel processing, complex reasoning, content creation, adversarial verification, self-improvement |
A real agentic application typically chains these: a top-level multiagent system whose specialists each do Tool Calling and Code Execution, with Reflection (Pattern 18) for retries and LLM-as-Judge (Pattern 17) for evaluation. Add Grammar (Pattern 2) at every typed boundary. Add the six prompt-injection defenses where untrusted data flows through tools.
The practical guidance is straightforward. Start with a single agent plus Tool Calling plus Reflection, and add multiagent only when single agent plus better UX clearly fails. Prefer few well-described tools over many: 3 to 10 is the practical limit at the time of writing. Use MCP for tool servers, A2A for agent-to-agent, and an LLM-agnostic framework throughout. Sandbox all generated code, validate before running, and reflect on failures. Multiagent failure rates run 40 to 80 percent, so tune ruthlessly and prefer simple composable patterns over heavyweight frameworks.