Chapter 10: AI Engineering Architecture and User Feedback
Introduction
Earlier chapters covered individual techniques for adapting foundation models. This chapter pulls them together, in two halves:
- AI engineering architecture. Start from the simplest application and add components (context, guardrails, routing/gateway, caching, agent patterns) until you've got a production-grade system. Then add monitoring/observability and orchestration.
- User feedback. For AI applications, feedback isn't just a UX signal. It's the proprietary data that powers the data flywheel. Conversational interfaces unlock new feedback types but make extracting clean signals harder.
Section 1: AI Engineering Architecture
1.1 Simplest Architecture
No augmentation, no guardrails, no optimization. Add components only when you actually need them:
- Enhance context with retrieval and tools.
- Guardrails for safety.
- Router and gateway for multi-model and security.
- Caches for cost and latency.
- Agent patterns for complex flows and write actions.
Plus monitoring/observability and orchestration.
1.2 Step 1: Enhance Context
Context construction is feature engineering for foundation models. The mechanisms (Chapter 6):
- Text/image/tabular retrieval.
- Tools: web search, news, weather, events.
Most providers (OpenAI, Anthropic, Google) support file uploads and tools, but limits and configurations differ.
1.3 Step 2: Guardrails
Input Guardrails
Two main risks:
- Leaking private info to external APIs (e.g., Samsung employees pasting proprietary code into ChatGPT).
- Bad prompts (jailbreak/injection, Chapter 5).
PII detection options:
- Block the entire query.
- Mask sensitive parts, replace them with placeholders, then unmask the response via a reverse PII map (sketched below).
Common sensitive classes: personal info, faces, IP/privileged keywords.
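A minimal sketch of the mask-then-unmask flow, using a regex email detector purely for illustration (production systems use trained PII/NER detectors); all names here are made up:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # illustrative detector

def mask_pii(query: str) -> tuple[str, dict[str, str]]:
    """Replace detected PII with placeholders; keep a reverse map."""
    reverse_map: dict[str, str] = {}

    def _sub(match: re.Match) -> str:
        placeholder = f"[EMAIL_{len(reverse_map)}]"
        reverse_map[placeholder] = match.group(0)
        return placeholder

    return EMAIL_RE.sub(_sub, query), reverse_map

def unmask_pii(response: str, reverse_map: dict[str, str]) -> str:
    """Restore the original values in the model's response."""
    for placeholder, original in reverse_map.items():
        response = response.replace(placeholder, original)
    return response

masked, pii_map = mask_pii("Email alice@example.com about the invoice.")
# masked == "Email [EMAIL_0] about the invoice."
# Send `masked` to the external API, then: unmask_pii(model_response, pii_map)
```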
Output Guardrails
Catch failures and define how to handle them. Failure types:
- Quality: malformatted (invalid JSON), factually inconsistent, generally bad.
- Security: toxic, leaks PII, triggers remote tool execution, brand-risk.
Track both violation rate AND false refusal rate (Chapter 5). Over-secure systems are useless.
Mitigation tactics:
- Retry on empty or malformatted output (sketched after this list). Costs latency and extra calls.
- Parallel calls. Send query twice, pick the better one. More cost, less latency.
- Human fallback. Transfer based on phrases, anger detection, conversation length.
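A sketch of the retry tactic, assuming a hypothetical `call_model` function and JSON-formatted outputs:

```python
import json

def generate_with_retries(call_model, prompt: str, max_retries: int = 2):
    """Retry on empty or malformatted output.
    Each retry costs latency and an extra call."""
    for _ in range(max_retries + 1):
        output = call_model(prompt)
        if not output:
            continue  # empty output: retry
        try:
            return json.loads(output)  # malformatted JSON raises ValueError
        except ValueError:
            continue
    return None  # all retries failed: fall back, e.g., to a human agent
```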
Implementation
- Guardrails versus latency is a tradeoff. Some teams skip them entirely. Risky.
- Streaming mode undermines output guardrails: partial responses reach the user before the full output can be evaluated.
- Self-host vs. API: third-party APIs typically ship with guardrails built in. Self-hosting means adding your own, but it reduces the risk of sending data to external parties.
Solutions: Meta Purple Llama, NVIDIA NeMo Guardrails, Azure PyRIT, Azure AI content filters, Perspective API, OpenAI moderation API.
1.4 Step 3: Model Router and Gateway
Router
Different queries go to different models or handlers:
- Specialization (technical troubleshooting vs. billing).
- Cost savings (cheap model for simple queries).
- Out-of-scope decline ("As a chatbot, I don't have the ability to vote.").
- Disambiguation. Ask for clarification.
- Next-action prediction for agents.
- Memory routing. Pick which memory tier to query.
Build with smaller models (GPT-2, BERT, Llama 7B) or trained-from-scratch classifiers. Routers must be fast and cheap.
When routing across models with different context limits, you may need to truncate or re-route to a larger-context model.
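As a sketch, a router can be as simple as a small intent classifier in front of a routing table; `classify_intent`, the route names, and the model names below are all illustrative:

```python
ROUTES = {
    "technical": "strong-model",  # specialization: harder queries
    "billing": "cheap-model",     # cost savings: simple queries
    "out_of_scope": None,         # decline politely
}

def route(query: str, classify_intent) -> str | None:
    """classify_intent is an assumed small, fast model (e.g., a fine-tuned BERT)."""
    intent = classify_intent(query)
    return ROUTES.get(intent)  # None means: return a canned decline message
```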
Gateway
The gateway centralizes:
- Unified interface to all models.
- Access control (no leaked org tokens).
- Cost and rate-limit management.
- Fallback policies for API failures.
- Optional: load balancing, logging, analytics, caching, guardrails.
A minimal Flask sketch of such a gateway (`openai_model` and `gemini_model` are assumed wrappers around the provider SDKs):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/model", methods=["POST"])
def model_gateway():
    data = request.get_json()
    model_type = data.get("model_type")
    model_name = data.get("model_name")
    input_data = data.get("input_data")
    max_tokens = data.get("max_tokens")
    if model_type == "openai":
        result = openai_model(input_data, model_name, max_tokens)
    elif model_type == "gemini":
        result = gemini_model(input_data, model_name, max_tokens)
    else:
        return jsonify({"error": f"unsupported model_type: {model_type}"}), 400
    return jsonify(result)
```
Off-the-shelf gateways: Portkey AI Gateway, MLflow AI Gateway, Wealthsimple LLM Gateway, TrueFoundry, Kong, Cloudflare.
1.5 Step 4: Reduce Latency with Caches
Two system-level cache types (separate from KV cache and prompt cache from Chapter 9):
Exact Caching
Fetch only on exact match. Useful for:
- Multi-step queries (CoT).
- Time-consuming actions (retrieval, SQL, web search).
- Embedding-based retrieval (skip vector search if the query is already cached).
Implementations: in-memory, Redis, PostgreSQL, tiered storage. Eviction: LRU, LFU, FIFO.
Don't cache user-specific queries ("status of my recent order"), time-sensitive ones ("how's the weather"), or permission-sensitive responses. Caching can leak data: user X's policy answer accidentally returned to user Y.
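A minimal in-memory sketch of exact caching; the hashed, per-user key is the point (Redis, eviction policies, and TTLs omitted):

```python
import hashlib

cache: dict[str, str] = {}  # in production: Redis with LRU/LFU eviction

def cache_key(query: str, user_id: str | None = None) -> str:
    # Scope keys per user for permission-sensitive responses so that
    # user X's cached answer can never be served to user Y.
    raw = f"{user_id or 'global'}::{query}"
    return hashlib.sha256(raw.encode()).hexdigest()

def get_or_generate(query: str, generate, user_id: str | None = None) -> str:
    key = cache_key(query, user_id)
    if key in cache:
        return cache[key]  # exact match only
    response = generate(query)
    cache[key] = response
    return response
```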
Semantic Caching
Reuse a cached response if a new query is semantically similar:
- Embed the query.
- Vector search the cache.
- If similarity is above the threshold, return cached. Otherwise, process and cache.
Risks: bad embeddings, threshold tuning, vector search latency and cost. Worth it only if the cache hit rate is high.
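A sketch of that loop, assuming a hypothetical `embed` function and brute-force cosine similarity (real systems use a vector index such as FAISS); the 0.95 threshold is illustrative and must be tuned:

```python
import numpy as np

semantic_cache: list[tuple[np.ndarray, str]] = []  # (embedding, response)
SIM_THRESHOLD = 0.95  # illustrative; tune per application

def semantic_lookup(query: str, embed, generate) -> str:
    q = embed(query)
    q = q / np.linalg.norm(q)  # normalize so dot product = cosine similarity
    for cached_emb, cached_resp in semantic_cache:  # brute-force scan
        if float(q @ cached_emb) >= SIM_THRESHOLD:
            return cached_resp  # similar enough: reuse the cached response
    response = generate(query)
    semantic_cache.append((q, response))
    return response
```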
1.6 Step 5: Agent Patterns
Add loops, parallel execution, and conditional branches (Chapter 6). After generation, the system can decide to retrieve more or invoke a tool again.
Write actions: compose email, place order, transfer money. Massive capability gain, massive risk gain. Apply with utmost care.
1.7 Monitoring and Observability
Observability should be designed in, not bolted on.
DevOps metrics:
- MTTD is mean time to detection.
- MTTR is mean time to response.
- CFR is change failure rate.
Monitoring vs. observability. Monitoring tracks external outputs. Observability is a stronger assumption: internal state can be inferred from outputs (logs/metrics), so you don't need to ship new code to debug.
Metrics
Design around failure modes. Categories:
- Format failures (invalid JSON, fixable vs. unfixable).
- Open-ended generation quality. Factual consistency, conciseness, creativity, positivity (often AI-judge-computed).
- Safety. Toxicity, PII leaks, guardrail trigger rate, refusal rate, abnormal queries.
- User behavior signals: early termination, turns per conversation, tokens per input, tokens per output, output diversity.
- Component-specific. RAG retrieval has context relevance and context precision. Vector DB has storage and query latency.
- Latency. TTFT, TPOT, total (Chapter 9).
- Cost. TPS, RPS, query volume, rate-limit usage.
Track per user, per release, per prompt version, per type, per time slice. Combine spot checks with exhaustive checks. Correlate metrics with business north stars (DAU, session duration, subscriptions).
Logs and Traces
Log everything. Configurations (model, sampling settings, prompt template), user query, final prompt, output, intermediate outputs, tool calls, tool outputs, component start/end, crashes. Tag with IDs for traceability.
For fast debugging, logs need to be available and accessible quickly. 15-minute delays kill incident response.
Traces stitch related events into a complete request timeline.
A trace shows: user query, actions, retrieved docs, final prompt, response, with time and cost per step.
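A sketch of the kind of structured event to emit per step; sharing a trace_id is what lets events be stitched into a trace. The schema and field names are illustrative:

```python
import json
import time
import uuid

def log_event(trace_id: str, step: str, **fields) -> None:
    """Emit one structured event; a trace = all events with the same trace_id."""
    event = {
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex,
        "step": step,         # e.g., "retrieval", "generation", "guardrail"
        "timestamp": time.time(),
        **fields,             # model, prompt_version, latency_ms, cost_usd, ...
    }
    print(json.dumps(event))  # in production: ship to your log store

trace_id = uuid.uuid4().hex
log_event(trace_id, "retrieval", num_docs=4, latency_ms=120)
log_event(trace_id, "generation", model="some-model", latency_ms=850)
```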
Drift Detection
Three drift sources:
- System prompt changes. Template updates, typo fixes. Your prompt drifts without you knowing (a cheap detection sketch follows this list).
- User behavior changes. Users adapt to AI, just as they did with Google Search; e.g., average query length drops over time without an obvious cause.
- Underlying model changes. Same API, different weights. Voiceflow saw a 10% performance drop when migrating from gpt-3.5-turbo-0301 to gpt-3.5-turbo-1106.
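Prompt drift in particular is cheap to catch: fingerprint the template and alert when the hash changes. A minimal sketch; the file name and stored hash are illustrative:

```python
import hashlib

def prompt_fingerprint(template: str) -> str:
    """Any edit to the template, even a typo fix, changes the fingerprint."""
    return hashlib.sha256(template.encode()).hexdigest()[:12]

last_seen = "a1b2c3d4e5f6"  # loaded from your metrics store (illustrative)
current = prompt_fingerprint(open("system_prompt.txt").read())
if current != last_seen:
    # Log the change so quality regressions can be tied back to it.
    print(f"system prompt changed: {last_seen} -> {current}")
```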
1.8 AI Pipeline Orchestration
An orchestrator does:
- Component definition. Declare available models, retrievers, tools, and evaluators.
- Chaining. Compose into a pipeline.
Example pipeline (a code sketch follows the list):
- Process raw query.
- Retrieve relevant data.
- Build prompt.
- Generate response.
- Evaluate.
- Return or escalate to human.
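That pipeline as plain functions, before reaching for any orchestrator; every callable on `components` is an assumed stand-in:

```python
def answer(query: str, components) -> str:
    """Chain the six steps; `components` bundles assumed stand-in callables."""
    cleaned = components.process(query)              # 1. process raw query
    docs = components.retrieve(cleaned)              # 2. retrieve relevant data
    prompt = components.build_prompt(cleaned, docs)  # 3. build prompt
    response = components.generate(prompt)           # 4. generate response
    score = components.evaluate(cleaned, response)   # 5. evaluate
    if score < components.threshold:                 # 6. return or escalate
        return components.escalate_to_human(query)
    return response
```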
Tools: LangChain, LlamaIndex, Flowise, Langflow, Haystack.
An AI pipeline orchestrator isn't the same as a general-purpose workflow orchestrator (Airflow, Metaflow).
Don't reach for an orchestrator on day one. They abstract away details and add complexity. Build first, adopt later.
When evaluating orchestrators:
- Integration and extensibility. Does it support your models and components? Easy to add new ones?
- Complex pipelines. Branching, parallelism, error handling.
- Ease of use, performance, scalability. Intuitive APIs, good docs, no hidden API calls or latency overhead.
Section 2: User Feedback
User feedback is proprietary data, and that data is your competitive moat. It powers the data flywheel (Chapter 8). Open source apps lose this advantage: users self-deploy, and you never see the feedback.
Feedback is user data. Respect privacy. Tell users how their data is used.
2.1 Extracting Conversational Feedback
- Explicit feedback: thumbs up/down, star rating, "did we solve your problem?".
- Implicit feedback is inferred from actions. Highly application-dependent.
The conversational interface makes giving feedback natural, but extracting clean signals is harder.
Natural Language Feedback
Early Termination
If users stop a generation halfway, exit, tell the bot to stop, or just leave the agent hanging, the conversation probably isn't going well.
Error Correction
- "No, ...", "I meant, ..." means the response missed the mark.
- Rephrasing can be detected with heuristics or ML (see the sketch after this list).
- Action-correcting feedback is common in agentic tasks: "You should also check XYZ GitHub page".
- Confirmation requests like "Are you sure?", "Check again", "Show me the sources" may signal lack of trust.
- Direct edits to model output are a strong negative signal AND a perfect preference pair (original = losing, edited = winning).
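A hedged heuristic for flagging correction and distrust turns; the phrase lists are illustrative, and a trained classifier would do better:

```python
CORRECTION_PREFIXES = ("no,", "no ", "i meant", "that's not", "actually,")
DISTRUST_PHRASES = ("are you sure", "check again", "show me the sources")

def classify_followup(user_turn: str) -> str | None:
    """Cheap string heuristics; illustrative, not production-grade."""
    t = user_turn.strip().lower()
    if t.startswith(CORRECTION_PREFIXES):
        return "error_correction"      # previous response missed the mark
    if any(p in t for p in DISTRUST_PHRASES):
        return "confirmation_request"  # possible lack of trust
    return None
```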
Complaints
Eight types from the FITS dataset (Yuan et al., 2023):
| Type | Description | % |
|---|---|---|
| 1 | Clarify demand again | 26.54% |
| 2 | Doesn't answer / irrelevant / asks user to find out | 16.20% |
| 3 | Point out specific search results | 16.17% |
| 4 | Suggest using search results | 15.27% |
| 5 | Factually incorrect / not grounded | 11.27% |
| 6 | Not specific/accurate/complete/detailed | 9.39% |
| 7 | Bot is not confident ("I am not sure" / "I don't know") | 4.17% |
| 8 | Repetition / rudeness | 0.99% |
Sentiment
Frustration, disappointment, ridicule. Track sentiment across the conversation arc. The call center pattern of "starts angry, ends happy" indicates resolution.
Refusal Rate
Frequent "Sorry, I don't know" or "As a language model..." responses signal unhappy users.
Other Conversational Feedback
Regeneration
Choosing to regenerate often means the first response wasn't good. But it could also be exploration ("show me options"). Stronger signal under usage-based billing than under subscription.
Conversation Organization
- Delete is a strong negative.
- Rename means the response was good but the auto-title was bad.
- Share is ambiguous. Could be "this is great" or "look at this disaster".
- Bookmark is positive.
Conversation Length
Application-dependent:
- AI companion: long conversation means engagement (good).
- Customer support: long conversation means unable to resolve (bad).
Dialogue Diversity
Long and repetitive means stuck in a loop.
2.2 Feedback Design
When to Collect
In the Beginning
Calibrate at onboarding (Face ID, voice wake-words, language-learning skill assessments). Make it optional unless it's genuinely required: Face ID needs calibration; a conversational app should default to neutral behavior and learn over time.
When Something Bad Happens
- Downvote, regenerate (same model or another).
- Conversational corrections: "Too cliché", "I want something shorter".
- Let users still complete tasks. Fix categories, transfer to a human agent.
- Inpainting for image generation lets users select a region and describe a fix.
When Model Has Low Confidence
Show two response options for the user to choose. That gives you preference data.
Tradeoff: full responses give informed feedback. Partial responses cut reading load. Both are used in production.
Positive Feedback?
Apple's Human Interface Guidelines warn against asking for both positive and negative feedback: good results should be the baseline, and asking for praise signals they're exceptions. Many PMs argue positive feedback reveals high-impact features. The compromise: limit frequency (e.g., ask a 1% sample of users).
How to Collect
- Seamless. Integrate feedback into the workflow and make it easy to ignore.
- Incentivize. Explain what feedback is for: personalization, analytics, training.
Examples:
- Midjourney: a 4-image grid with options to upscale (strongest positive), generate variations (weaker positive), or regenerate (negative).
- GitHub Copilot: ghost-text suggestions. Tab to accept, keep typing to reject.
- Standalone apps (ChatGPT, Claude) struggle with this. They don't know if a generated email actually got sent.
For deeper analysis you need conversation context (last 5-10 turns). User consent or a data donation flow may be required.
Don't ask users to do the impossible.
Don't make labels confusing.
Decide private vs. public feedback. X made likes private in 2024. Private signals are more candid. Public signals enable discoverability and explainability.
2.3 Feedback Limitations
Biases
- Leniency bias. Users skew positive to avoid follow-up questions, or because giving negative feedback feels rude. Uber's average driver rating is 4.8, and dropping below 4.6 risks deactivation. One fix: reword the low-rating labels so they lack strong negative connotations:
  - "Great ride. Great driver."
  - "Pretty good."
  - "Nothing to complain about but nothing stellar either."
  - "Could've been better."
  - "Don't match me with this driver again."
- Randomness. Users don't read both responses; they click randomly.
- Position bias. The first option gets clicks regardless of quality. Mitigate via random shuffling, as in the sketch below.
- Preference biases. Verbose-but-wrong wins over short-but-right. Recency bias favors the last option seen.
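A sketch of the random-shuffle mitigation: shuffle before display and log which underlying option was shown in which slot so clicks can be de-biased later:

```python
import random

def present_options(responses: list[str]) -> list[tuple[int, str]]:
    """Shuffle so no response always lands in the favored first slot.
    Returns (original_index, response) pairs; log the original index
    alongside the user's click."""
    order = list(enumerate(responses))
    random.shuffle(order)
    return order
```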
Degenerate Feedback Loop
Predictions influence feedback, which influences the next iteration, which amplifies biases.
- Video recommender: A ranks higher, gets more clicks, gets boosted further, leaves B behind. "Filter bubbles."
- Cat photos: a few users like cat photos, the system generates more cats, attracts cat lovers, endless cats.
- Sycophancy. Sharma et al. (2023): models trained on user feedback tend to give responses matching the user's view, even when wrong.
User feedback improves UX. Used indiscriminately, it can perpetuate biases and destroy your product.
Summary
- Architecture is built incrementally: simple model API, context construction, input/output guardrails, router and gateway, caches (exact, semantic), agent patterns and write actions.
- Observability has to be designed in. Track failure-mode-specific metrics, business-correlated metrics, and latency/cost. Log everything, build traces, and detect drift in prompts, users, and underlying models.
- Orchestrators (LangChain, LlamaIndex, Flowise, etc.) help compose pipelines but add complexity. Adopt later. Evaluate for integrations, complex flows, and ease of use.
- User feedback is proprietary data, which is your competitive moat. The conversational interface unlocks rich implicit signals: early termination, error correction, complaints, sentiment, regeneration, conversation organization and length, dialogue diversity.
- Feedback design: collect throughout the journey, but unobtrusively. Calibrate at start, capture on failure, ask for choice when uncertain. Be cautious about asking for positive feedback. Use seamless workflow-integrated patterns (Midjourney, GitHub Copilot).
- Limits: biases (leniency, randomness, position, preference) and degenerate feedback loops can perpetuate harm. Audit feedback before using it to train.
- AI engineering is moving closer to product because data flywheel plus product experience are the most durable competitive advantages.