Pagefy

Pagefy

Back to AI Engineering

Composable Agentic Workflows

Generative AI Design Patterns by Valliappa Lakshmanan & Hannes HapkeBuy the book
Sign in to save bookmarks, reading progress, and highlights.

Chapter 10: Composable Agentic Workflows

Introduction

The closing chapter integrates the 32 patterns into a single production-ready application that shows how the patterns interact. Rather than build a complete product, the chapter walks through a vertical slice: an end-to-end workflow for generating educational content. The application runs in two modes. Copilot is the AI assistant where humans accept or edit each step. Agent is fully autonomous. Agent mode is just copilot mode with no human edits, so understanding copilot behavior tells you how the autonomous version runs.

Following Anthropic's "Building Effective Agents" guidance ("the most successful implementations use simple, composable patterns rather than complex frameworks"), the workflow is built without a multiagent framework, using PydanticAI, LlamaIndex, Streamlit, and standard Python.


Agentic Workflow

The workflow has more writers than the Chapter 7 version (including a GenAI writer that retrieves from this book), and two stages of review instead of one.

A note on terminology. Copilot is AI-assisted (user-driven workflow with AI recommendations). Agent is autonomous AI. Agentic is anywhere on that spectrum.

Running the Application

Setup

python -m venv agentic_ai/
source agentic_ai/bin/activate
python -m pip install -r requirements.txt
# Then edit keys.env and add your Gemini API key

Use a virtual environment so library versions don't conflict with other projects.

You can swap LLM providers in utils/llms.py:

BEST_MODEL = "gemini-2.5-pro"
DEFAULT_MODEL = "gemini-2.5-flash"
SMALL_MODEL = "gemini-2.5-flash-lite-preview-06-17"

The three settings let the application trade off quality, cost, and speed per agent. Logging is configured in logging.json: INFO to console, DEBUG to prompts.log, guards.log, feedback.log.

Copilot mode

python -m streamlit run streamlit_app.py

Agent mode

python cmdline_app.py

What the Application Does

The user enters a topic ("Battle of the Bulge"), and the Task Assigner agent picks a writer.

The user can override the recommendation:

This logs human feedback automatically:

Every piece of user-generated content should follow this AI-recommendation plus implicit-feedback pattern. UX design matters here: feedback collection has to be unobtrusive and comprehensive.

Editable artifacts

Anthropic introduced the term artifact for editable AI outputs. Three UX options for the draft:

Text boxes for direct editing (the default). Display as regular text with an Edit button. A chat interface for natural-language commands ("add more keywords related to the location of the battle"). Chat-style instructions need Long-Term Memory (Pattern 28) so the system remembers each user's prior edits and applies them on future runs.

If only the current step's context matters, embed the chat in the page. If the entire workflow's history matters, place the chat in a side panel.


System Architecture

Five interacting components: agents for each step, multi-agent orchestration to advance through the workflow, governance/monitoring/security, a learning pipeline for continuous improvement, and a data creation/collection/curation program.

Agent Patterns

Each step's agent should be implementable independently. The patterns each agent typically uses: Chain of Thought (13) for planning, Basic RAG (6) plus Index-Aware Retrieval (9) for data retrieval, Tool Calling (21) to invoke external systems, Reflection (18) plus Self-Check (31) for error recovery, and Template Generation (29) plus Assembled Reformat (30) for risk and creativity tradeoffs.

Choose the right framework per agent. RAG-heavy agents may use LlamaIndex. Agents that just generate from prompts may use PydanticAI. Don't force one framework on every agent.

Example: Panel Secretary agent (PydanticAI)

from pydantic_ai import Agent

class PanelSecretary:
    def __init__(self):
        system_prompt = PromptService.render_prompt("secretary_system_prompt")
        self.agent = Agent(llms.DEFAULT_MODEL,
                           output_type=str,
                           retries=2,
                           system_prompt=system_prompt)

The system prompt is loaded from a Jinja2-templated config so different installations can have different prompts. Default model is used because summarization isn't quality- or speed-critical. PydanticAI keeps the code LLM- and cloud-agnostic.

retries=2 is the try-and-try-again antipattern from Pattern 1, but acceptable here because LLM calls succeed over 90% of the time, keeping refusal rate under 1%.

Example: GenAI writer agent (LlamaIndex for RAG)

from llama_index.core import StorageContext, load_index_from_storage

def __init__(self):
    storage_context = StorageContext.from_defaults(persist_dir="data")
    index = load_index_from_storage(storage_context)
    self.retriever = index.as_retriever(similarity_top_k=3)

async def write_response(self, topic: str, prompt: str) -> Article:
    nodes = self.retriever.retrieve(topic)
    ...

Example: Consolidating reviews

async def consolidate(self, topic, article, reviews_so_far):
    reviews_text = []
    for reviewer, review in reviews_so_far:
        reviews_text.append(f"BEGIN review by {reviewer.name}:\n{review}\nEND review\n")

    prompt = PromptService.render_prompt("Secretary_consolidate_reviews",
                                         topic=topic, article=article, reviews=reviews_text)
    result = await self.agent.run(prompt)
    return result.output

async/await enables concurrency. The whole workflow state (topic, article, reviews) flows through the prompt context.

Long-term memory for user instructions

When the user says "Write history articles in bullet points" in the chat box, the application stores it for future runs:

import composable_app.utils.long_term_memory as ltm
ltm.add_to_memory(modify_instruction, metadata={
    "topic": topic,
    "writer": writer.name()
})

When a writer creates a future draft, it pulls relevant memories:

prompt_vars = {
    "prompt_name": f"GenericWriter_write_about",
    "content_type": get_content_type(self.writer),
    "additional_instructions": ltm.search_relevant_memories(
        f"{self.writer.name}, write about {topic}"),
    "topic": topic
}
prompt = PromptService.render_prompt(**prompt_vars)

Multi-Agent Architecture

Agents are orchestrated directly in code, not through a multiagent framework. Agent mode walks through the workflow sequentially:

async def write_about(self, topic):
    writer = WriterFactory.create_writer(await self.find_writer(topic))

    logger.info(f"Assigning {topic} to {writer.name()}")
    draft = await writer.write_about(topic)

    logger.info("Sending article to review panel")
    panel_review = await reviewer_panel.get_panel_review_of_article(topic, draft)

    article = await writer.revise_article(topic, draft, panel_review)
    return article

In copilot mode, each Streamlit page invokes its own agent:

@st.cache_resource
def write_about(writer_name, topic):
    writer = st.session_state.writer
    assert writer.name() == writer_name  # so caching works
    article = asyncio.run(writer.write_about(topic))
    return article

# on every redraw
ai_generated_draft = write_about(writer.name(), topic)

Two patterns at work: Grammar (2) via the Article structured output, and Prompt Caching (25) via @st.cache_resource to prevent redundant LLM calls on page redraws.

The "Next" button advances to the next agent's page:

if st.button("Next"):
    st.switch_page("pages/3_PanelReview1.py")

When the user modifies the draft text, the writer agent rewrites the article in response:

def modify_draft():
    modify_instruction = st.session_state.modify_instruction
    draft = asyncio.run(writer.revise_article(topic,
                                              st.session_state.draft,
                                              modify_instruction))
    st.session_state.draft = draft

with st.form("Modification form", clear_on_submit=True):
    st.text_input(label="Modification instructions", value="", key="modify_instruction")
    st.form_submit_button(label="Modify", on_click=modify_draft)

The main advantage of skipping multiagent frameworks is direct control flow. You write the orchestration logic where it matters.

Governance, Monitoring, Security

Input guardrails (Pattern 32) wrap every input. The implementation uses LLM-as-Judge (Pattern 17):

class InputGuardrail:
    def __init__(self, name, condition, should_reject=True):
        self.system_prompt = PromptService.render_prompt(
            "InputGuardrail_prompt",
            condition=condition)
        self.agent = Agent(llms.SMALL_MODEL,
                           output_type=bool,
                           retries=2,
                           system_prompt=self.system_prompt)

Prompt:

You are an AI agent that acts as a guardrail to prevent prompt injection and other
adversarial attacks.
Is the following condition met by the input?
** CONDITION **
{{ condition }}
async def is_acceptable(self, prompt, raise_exception=False):
    result = await self.agent.run(prompt)
    if not result.output:
        raise InputGuardrailException(f"{self.id} failed on {prompt}")
    return True

Parallel guardrail / call

Run the guardrail and the main LLM call in parallel to avoid added latency:

_, result = await asyncio.gather(
    self.topic_guardrail.is_acceptable(topic),
    self.agent.run(prompt)
)
return result.output

If the guardrail throws, asyncio.gather cancels the second call before its result is consumed.

All guardrail invocations log to guards.log for monitoring (spotting unusual attacks, fine-tuning detection on the actual distribution). Pair with Degradation Testing (Pattern 27) to find GPU and latency bottlenecks, then apply Chapter 8 patterns to fix them. Add access controls, policy management, audit logging, and human-in-the-loop checkpoints.

Learning Pipeline

In copilot mode, before each Next button advances, the application checks whether the user edited the AI output. If so, it logs the diff:

if st.button("Next"):
    if st.session_state.draft != st.session_state.ai_generated_draft:
        record_human_feedback("initial_draft",
                              ai_input=topic,
                              ai_response=st.session_state.ai_generated_draft,
                              human_choice=st.session_state.draft)
        logger.info(f"User has changed the draft to {st.session_state.draft}")

    st.switch_page("pages/3_PanelReview1.py")

In addition, all prompts, inputs, and outputs are logged to prompts.log and evals.log for offline evaluation:

from composable_app.utils import save_for_eval as evals
evals.record_ai_response("initial draft",
                         ai_input=prompt_vars,
                         ai_response=initial_draft)

See evals/evaluate_keywords.py for an offline evaluation example.

Outcome metrics

You won't know the real outcome until much later. For an educational-content app, possible metrics include appeal (number of teachers who include the topic in lesson plans), engagement (number of students who read through to the end), and functional performance (fraction of students who answer national exam questions on the topic correctly).

Feed feedback and outcomes into Content Optimization (Pattern 5) for preference tuning, Adapter Tuning (Pattern 15) for narrow-task fine-tuning, and Prompt Optimization (Pattern 20) for automatic prompt updates.

For high-traffic consumer apps, sample prompts rather than logging every one to avoid degrading throughput.

Data Program

Organic feedback often isn't enough. The data size tends to be too small (too few human corrections to learn from). The data complexity is wrong (high-value tasks are rare, common tasks are simple). Detailed feedback is hard to attribute (experts edit the final output, obscuring which agent caused the error). Automation fatigue kicks in (as AI improves, humans skim outputs, the Automation Paradox). And labels are sometimes incorrect (humans aren't perfect either, experts have personal styles).

Pair organic feedback with systematic data creation. Hire people to walk through the workflow (expensive but reliable). Use Evol-Instruct (Pattern 16) to create complex variations. Use Self-Check (Pattern 31) to flag where AI content or human feedback may be problematic, which addresses automation fatigue.


Deployment

The architecture's composability buys you a handful of things. Modularity and reusability: components reuse across applications, and Dependency Injection (Pattern 19) enables independent dev/test. Technical flexibility: pick the best tool per agent. Standard protocols, tools, and packages: PydanticAI (LLM-agnostic), LlamaIndex (RAG), Mem0 (long-term memory), MCP (tools), A2A (agent-to-agent). Independent scaling per component. Failure isolation, where one component failing doesn't bring down the system. Accelerated development by combining smaller services rather than building from scratch. And security and compliance through existing controls.

The book's example is pure open-source Python. For production, common stacks split frontend in TypeScript and backend in Python.


Summary

The chapter shows how the 32 patterns compose into a real workflow. A few practical takeaways. Build copilot first, agent second: agent mode is copilot with no edits, the same code runs both ways, and copilot is also where you collect organic feedback. Skip the multiagent frameworks when you can: direct orchestration in code is simpler, easier to debug, and gives full control. Choose framework per agent: PydanticAI for LLM-agnostic generation, LlamaIndex for RAG, Streamlit for UI, Mem0 for memory. Configure prompts and models externally so Jinja2 templates plus per-environment config let you swap settings without code changes. Run guardrails in parallel with the main call to keep latency low. Log everything: prompts, guardrails, human feedback, evaluations. Pair organic data with synthetic data, using Evol-Instruct (16) to create complex training examples and Self-Check (31) to flag automation-fatigue cases. Tie metrics to outcomes, not just task-level metrics: appeal, engagement, and functional performance matter more than per-step quality scores. And use three LLM tiers (BEST_MODEL, DEFAULT_MODEL, SMALL_MODEL) so each agent can trade quality for speed and cost, with the small model going to guardrails (low risk, high frequency).

The 32 patterns aren't a menu where you pick one. They're an interlocking toolkit. Real production agentic workflows blend half a dozen at minimum: a Task Assigner uses Grammar plus LLM-as-Judge for guardrails, writers use RAG plus CoT plus Reflection, reviewers use Multiagent Collaboration plus LLM-as-Judge, the secretary uses Reflection plus Long-Term Memory, and the entire pipeline uses Prompt Caching plus Degradation Testing for serving, with Content Optimization plus Prompt Optimization plus Adapter Tuning for continuous improvement.

The patterns make GenAI practical for real-world use cases. Agents that get better over time depend on architecture that lets you observe, evaluate, and improve each piece independently. That's what composable agentic workflows ultimately deliver.

Previous chapter

Setting Safeguards