AI Engineering by Chip Huyen

Chapter 1: Introduction to Building AI Applications with Foundation Models

Introduction

The story of AI after 2020 is really a story about scale. Language models got bigger, picked up new modalities, and turned into foundation models that anyone can call over an API. That shift created a new job: AI engineering, which is just building applications on top of models someone else trained. This chapter walks through how we got from old-school language models to today's foundation models, what people are actually building with them, how to plan an AI product without burning cash, and what the AI engineering stack looks like compared to traditional ML.


Section 1: The Rise of AI Engineering

Foundation models are descendants of large language models, which themselves trace back to plain language models in the 1950s. Three things pushed the evolution forward: self-supervision unlocked scale, multimodality made models more general, and model-as-a-service dropped the barrier to entry to almost zero.

1.1 From Language Models to Large Language Models

A language model stores statistical information about a language. Give it the context "My favorite color is __" and an English language model should pick "blue" more often than "car". Simple idea, but everything else builds on top of it.

Tokens and Tokenization

The unit a language model actually works with is a token. A token can be a character, a word, or part of a word like -tion.

Tokenization is the process of chopping text into tokens. For GPT-4 a token is roughly three-quarters of a word, so 100 tokens lands around 75 words. The vocabulary is the full set of tokens a model can produce. Mixtral 8x7B has 32,000. GPT-4 has 100,256.

Why tokens instead of words or characters? A few reasons:

  1. Tokens break words into pieces that carry meaning (e.g., cooking → cook + ing).
  2. The token vocabulary is way smaller than a word vocabulary, which makes the model more efficient.
  3. Tokens let the model handle words it has never seen before (e.g., chatgpting → chatgpt + ing).
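The longest-match idea behind subword tokenization can be sketched with a toy tokenizer. Real tokenizers (e.g., BPE) learn their vocabulary from data; the hand-picked vocabulary below is purely illustrative:

```python
# Toy greedy longest-match subword tokenizer. Real tokenizers learn their
# merges from data; this hand-picked vocabulary is only for illustration.
VOCAB = {"cook", "chat", "gpt", "ing", "tion"}

def tokenize(word, vocab=VOCAB):
    tokens, i = [], 0
    while i < len(word):
        # Take the longest vocabulary entry that starts at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown: fall back to a single character
            i += 1
    return tokens

print(tokenize("cooking"))     # ['cook', 'ing']
print(tokenize("chatgpting"))  # ['chat', 'gpt', 'ing']
```

The character fallback is what lets the model handle any input at all: even a word with no known subwords still tokenizes, just less efficiently.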

Two Types of Language Models

A masked language model fills in the blank. It predicts a missing token using both the tokens before it and the tokens after it. BERT is the classic example. These are good for non-generative work like sentiment analysis, classification, and code debugging.

An autoregressive language model predicts the next token using only what came before. That's the format you need for text generation, and it's what almost everyone uses today.
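The difference between the two objectives comes down to what context each model conditions on. A toy illustration (not a real model):

```python
# Toy illustration of the two objectives. A masked LM sees the whole
# sentence with a blank; an autoregressive LM sees only the prefix.
def masked_context(tokens, i):
    return tokens[:i] + ["<MASK>"] + tokens[i + 1:]

def autoregressive_context(tokens, i):
    return tokens[:i]

sentence = ["My", "favorite", "color", "is", "blue"]
print(masked_context(sentence, 4))          # ['My', 'favorite', 'color', 'is', '<MASK>']
print(autoregressive_context(sentence, 4))  # ['My', 'favorite', 'color', 'is']
```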

A model whose outputs are open-ended is generative, which is where the name "generative AI" comes from. The simplest mental model is a completion machine:

  • Prompt: "To be or not to be"
  • Completion: ", that is the question."

Translation, summarization, coding, classification. A surprising number of tasks can be reframed as completion.
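Reframing a task as completion usually just means writing the prompt so the answer is the natural continuation. The templates below are illustrative examples, not from the book:

```python
# Reframing tasks as completion: phrase the prompt so the model's "next
# tokens" are the answer. These templates are illustrative examples only.
def as_completion(task, text):
    templates = {
        "translate": f"Translate to French: {text}\nFrench:",
        "summarize": f"Summarize in one sentence: {text}\nSummary:",
        "classify": f"Is the sentiment positive or negative? {text}\nSentiment:",
    }
    return templates[task]

print(as_completion("classify", "I love street food."))
```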

Self-Supervision

Supervision means training on labeled data, and labeled data is slow and expensive to produce. Labeling 1M images at 5¢ each costs $50,000, and ImageNet contains over 14 million.

Self-supervision sidesteps the problem by inferring labels from the input itself. For language modeling, every input sequence is also its own label set, because the next token is the label. The sentence "I love street food." gives you several training examples for free:

Input (context) | Output (next token)
<BOS> | I
<BOS>, I | love
<BOS>, I, love | street
<BOS>, I, love, street | food
<BOS>, I, love, street, food | .
<BOS>, I, love, street, food, . | <EOS>
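The table above can be generated mechanically: every prefix of the sequence predicts the token that follows it.

```python
# Build (context, next-token) training pairs from one sentence, as in the
# table above: every prefix predicts the token that follows it.
def next_token_pairs(tokens):
    seq = ["<BOS>"] + tokens + ["<EOS>"]
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]

pairs = next_token_pairs(["I", "love", "street", "food", "."])
print(pairs[0])    # (['<BOS>'], 'I')
print(pairs[-1])   # (['<BOS>', 'I', 'love', 'street', 'food', '.'], '<EOS>')
print(len(pairs))  # 6
```

A five-word sentence yields six training examples for free, which is exactly why self-supervision scales so well.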

Self-supervision is not the same thing as unsupervised learning. Self-supervised learning still uses labels, the labels are just derived from the input. Unsupervised learning doesn't use labels at all.

Model Size and Scaling

A parameter is a variable the training process updates. More parameters generally means more capacity to learn.

  • GPT (June 2018): 117M parameters, considered large at the time.
  • GPT-2 (Feb 2019): 1.5B parameters.
  • Today: 100B+ is what people mean by large.

Bigger models also need more training data. There's more capacity to fill, so you need more data to actually use it.

1.2 From Large Language Models to Foundation Models

LLMs only handle text. Foundation models push past that, handling vision, audio, video, 3D, even protein structures.

A multimodal model works with more than one modality. A large multimodal model (LMM) is the generative version.

CLIP and Natural Language Supervision

OpenAI's CLIP pulled off a clever trick called natural language supervision: instead of paying people to label images, they trained on (image, text) pairs that already co-occurred on the internet. They ended up with 400M pairs, around 400× the size of ImageNet, with zero manual labeling. CLIP isn't generative, it's an embedding model that produces joint embeddings for text and images. Those embeddings are the backbone of generative multimodal models like Flamingo, LLaVA, and Gemini.

Task-Specific to General-Purpose

Because of their scale and training, foundation models are general-purpose. The same LLM can do sentiment analysis and translation. To bend a general-purpose model toward a specific task, three techniques show up over and over:

  1. Prompt engineering, where you write detailed instructions and examples.
  2. Retrieval-augmented generation (RAG), where you wire the model up to an external database.
  3. Finetuning, where you keep training the model on domain-specific data.
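Of the three, RAG is the easiest to sketch. A real system retrieves by embedding similarity from a vector store; the toy retriever below uses naive word overlap just to show the shape of the pattern, and the documents are made up:

```python
# Minimal RAG sketch. Real systems retrieve by embedding similarity from a
# vector store; naive word overlap stands in for that here. Docs are made up.
def retrieve(query, docs):
    q = set(query.lower().split())
    return max(docs, key=lambda d: len(q & set(d.lower().split())))

def build_prompt(query, docs):
    context = retrieve(query, docs)
    return f"Context: {context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "Refunds are processed within 5 business days.",
    "Shipping is free for orders over $50.",
]
print(build_prompt("How long do refunds take?", docs))
```

Whatever the retriever, the pattern is the same: fetch relevant context, stuff it into the prompt, and let the model answer from it instead of from memory alone.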

1.3 From Foundation Models to AI Engineering

AI engineering means building applications on top of foundation models. Traditional ML engineering meant developing the models themselves. The conditions that turned this into a real discipline almost overnight:

  1. General-purpose AI capabilities. Models can suddenly do things that used to require years of research. The user base exploded.
  2. A wall of money. Goldman Sachs estimated $100B of AI investment in the US, $200B globally by 2025. One in three S&P 500 companies mentioned AI in Q2 2023 earnings calls, 3× the year before.
  3. Low barrier to entry. APIs hide the infrastructure. You can build with English instead of code.

In two years, four open source AI tools (AutoGPT, Stable Diffusion Web UI, LangChain, Ollama) racked up more GitHub stars than Bitcoin.


Section 2: Foundation Model Use Cases

The list of possible AI applications is basically endless. Different shops slice the space differently:

  • AWS: customer experience, employee productivity, process optimization.
  • O'Reilly 2024 survey: programming, data analysis, customer support, marketing copy, other copy, research, web design, art.
  • Deloitte: cost reduction, process efficiency, growth, accelerating innovation.

Eloundou et al. (2023) ranked occupations by AI exposure. The most exposed: interpreters and translators, tax preparers, web designers, writers, mathematicians, financial quantitative analysts. The least exposed: cooks, stonemasons, athletes.

2.1 Common Use Case Categories

Category | Consumer examples | Enterprise examples
Coding | Coding | Coding
Image and video production | Photo and video editing, design, presentation | Ad generation
Writing | Email, social media, blog posts | Copywriting, SEO, reports, design docs
Education | Tutoring, essay grading | Employee onboarding, upskill training
Conversational bots | Chatbot, AI companion | Customer support, product copilots
Information aggregation | Summarization, talk-to-your-docs | Summarization, market research
Data organization | Image search, memex | Knowledge management, document processing
Workflow automation | Travel/event planning | Data extraction, lead generation

Enterprises tend to roll out the lower-risk, internal-facing stuff first (think internal knowledge management) before letting AI talk to actual customers.

2.2 Coding

The single most popular use case. GitHub Copilot hit $100M ARR in two years.

McKinsey measured developer productivity gains: 2× on documentation, 25-50% on code generation and refactoring, almost nothing on highly complex tasks. The tooling around AI coding covers a lot of ground:

  • Pulling structured data out of web pages and PDFs (AgentGPT)
  • English to code (DB-GPT, SQL Chat, PandasAI)
  • Screenshot to code (screenshot-to-code, draw-a-ui)
  • Translating between languages (GPT-Migrate, AI Code Translator)
  • Documentation (Autodoc), tests (PentestGPT), commit messages (AI Commits)

2.3 Image and Video Production

The probabilistic nature of these models is a feature for creative work. Standout startups: Midjourney (image generation, $200M ARR in a year and a half), Adobe Firefly (photo editing), and Runway / Pika Labs / Sora for video generation.

What people actually use it for:

  • AI-generated profile photos. Facebook took down accounts using them in 2019. Now they're standard.
  • Marketing and ads. Generate promo images and videos, A/B test variations, swap seasons or locations cheaply.

2.4 Writing

LLMs are good at writing because writing is what they're trained for. An MIT study (Noy & Zhang, 2023) gave 453 professionals access to ChatGPT and found 40% time savings and 18% higher output quality. The gains skew toward less-skilled writers, which closes the skill gap.

For consumers it's rewriting emails in a different tone, turning bullet points into paragraphs, and outright drafting essays and books. For enterprises it's sales and marketing emails, ad copy, performance reports, and SEO. The downside is content farms and AI-generated junk travel guides flooding the internet.

2.5 Education

AI is showing up across the whole stack of teaching:

  • Summarizing textbooks and generating personalized lesson plans.
  • Adapting material to learning styles (auditory, visual, code-based).
  • Generating quizzes, grading answers, debate practice.
  • Roleplay practice for language learning.

Khan Academy ships AI teaching assistants. Chegg's stock went from $28 to $2 as students moved their homework help to AI.

2.6 Conversational Bots

For consumers: companions, therapists, personality emulation, digital partners. Researchers even use bots to simulate small societies. For enterprises: customer support is the top hit, plus product copilots that walk you through filing claims or doing taxes. Beyond text, voice assistants (Siri, Alexa) and 3D bots in games (smart NPCs in Inworld and Convai) extend the same idea.

2.7 Information Aggregation

74% of generative AI users use it to distill complex stuff into something readable. The patterns:

  • Talk-to-your-docs: chew through contracts, disclosures, papers.
  • "Fast Breakdown" templates (Instacart): summarize meeting notes, emails, and Slack into facts, open questions, and action items.
  • Surface the customer or competitor info that actually matters.

2.8 Data Organization

We keep producing more unstructured and semi-structured data: photos, videos, logs, PDFs. AI helps wrangle it.

It can describe images and videos in text, then match a text query against the visuals (Google Photos, Google Image Search). It can write the analysis code itself, build visualizations, flag outliers, and predict. On the enterprise side, the killer use is pulling structured information out of contracts, receipts, and IDs. The intelligent document processing (IDP) industry is projected at $12.81B by 2030.

2.9 Workflow Automation

End users want their AI to book restaurants, file refunds, plan trips, fill out forms. Enterprises want it to handle leads, invoices, reimbursements, customer requests, and data entry.

The phrase that covers all of this is AI agents: AIs that can plan and use tools like search engines, calendars, and phones. Chapter 6 goes deep on this.


Section 3: Planning AI Applications

Building a cool demo is easy. Shipping something profitable is hard. So plan first.

3.1 Use Case Evaluation

Start with why. The motivations sort themselves into a rough risk ranking:

  1. Existential threat. Competitors with AI can put you out of business. Financial analysis, insurance, advertising are good examples.
  2. Profit or productivity opportunity. This is most companies. AI can lower acquisition cost, improve retention, and help sales and marketing pull harder.
  3. FOMO and R&D. You don't want to be Kodak, Blockbuster, or BlackBerry.

If AI is an existential threat, build in-house. If it's just a productivity boost, buying off the shelf usually saves time and money.

The Role of AI and Humans

Three Apple-inspired axes for thinking about AI's role in your product:

The first axis is critical or complementary. Can the app work without AI? Face ID is critical. Smart Compose is complementary. The more critical, the higher the accuracy and reliability bar.

The second is reactive or proactive. Reactive features respond to user actions, like a chatbot. Proactive features surface output without being asked, like Google Maps traffic alerts. Proactive features need a higher quality bar because the user didn't ask for them.

The third is dynamic or static. Dynamic features keep updating with user feedback (Face ID, ChatGPT memory). Static features get refreshed on a schedule.

For human involvement, Microsoft's Crawl-Walk-Run framework is a clean way to think about it:

  1. Crawl means humans are required.
  2. Walk means AI talks to internal employees directly.
  3. Run means more automation, including direct external interactions.

AI Product Defensibility

If it's easy for you to build, it's easy for someone else too. The three competitive advantages:

  • Technology. For most companies, this looks similar to what everyone else has.
  • Data. Startups can build a moat by getting to market first and accumulating usage data. The "data flywheel."
  • Distribution. Usually owned by the big players.

The biggest single risk is your product becoming a feature inside the underlying model.

3.2 Setting Expectations

How will you measure success? For a customer support chatbot, you'd track percentage of messages automated, message throughput, response speed, and human labor saved. Customer satisfaction is its own thing and has to be measured separately.

The usefulness threshold is how good the system has to be before anyone wants to use it. The metrics typically break down as:

  • Quality metrics (response quality)
  • Latency metrics: TTFT (time to first token), TPOT (time per output token), and total latency.
  • Cost metrics: cost per inference request.
  • Other things like interpretability and fairness.
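For autoregressive models, the latency metrics combine in a simple way: total latency is roughly TTFT plus TPOT for each remaining output token. A quick sketch (the numbers are hypothetical):

```python
# Rough latency model for autoregressive generation: the first token costs
# TTFT, each subsequent token costs TPOT. Numbers below are hypothetical.
def total_latency(ttft_s, tpot_s, output_tokens):
    return ttft_s + tpot_s * (output_tokens - 1)

print(total_latency(0.3, 0.02, 100))  # 0.3 + 0.02 * 99, roughly 2.28 seconds
```

One practical consequence: for short responses TTFT dominates, while for long responses TPOT does, so which one you optimize depends on your typical output length.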

3.3 Milestone Planning

The last mile is where AI projects go to die. A demo is easy, productizing is brutal:

  • The UltraChat team put it well: "the journey from 0 to 60 is easy, whereas progressing from 60 to 100 becomes exceedingly challenging."
  • LinkedIn (2024) hit 80% in a month, then needed 4 more months to clear 95%. Most of that time went to hallucinations and product polish.

3.4 Maintenance

AI moves fast, which creates its own headaches:

  • Today's best option becomes tomorrow's worst. Model providers cut prices in half right after you finished building in-house.
  • Models converge on similar APIs but each one has quirks. You need versioning and an evaluation pipeline.
  • Regulation risk. GDPR cost businesses $9B. Compute export controls and IP issues are live.

Section 4: The AI Engineering Stack

4.1 Three Layers of the AI Stack

Three layers. Most people start at the top and only go deeper as they have to.

  1. Application development. Feeding the model good prompts and the right context. Needs heavy evaluation. Most of the activity in the last two years has been here.
  2. Model development. Modeling, training, finetuning, inference optimization, dataset engineering. Also needs heavy evaluation.
  3. Infrastructure. Model serving, data and compute management, monitoring.

In 2023 the application and application-development repos saw the most growth. Infrastructure grew slower, because the core needs (resource management, serving, monitoring) hadn't really changed.

4.2 AI Engineering Versus ML Engineering

Three high-level differences:

  1. You don't have to train your own model. AI engineering is about model adaptation, not training.
  2. The models are bigger and more compute-hungry. Inference optimization matters more, and you'll work with big GPU clusters.
  3. Outputs are open-ended. Evaluation gets a lot harder.

Two flavors of model adaptation show up:

  • Prompt-based approaches don't touch the weights. They're easy to start with and need little data. Sometimes they're not enough for complex or strict tasks.
  • Finetuning updates weights. More complex, more data, but you get better quality, latency, and cost. Required for tasks the model wasn't trained on.

4.3 Model Development Layer

Modeling and Training

Tools: TensorFlow, Hugging Face Transformers, PyTorch.

The ML knowledge that helps: clustering, logistic regression, decision trees, neural network architectures (feedforward, recurrent, convolutional, transformer), gradient descent, loss functions, regularization. For AI engineering, this knowledge is nice-to-have, not required.

Training Terminology

Term | Meaning
Training | Any process that changes model weights
Pre-training | Training from scratch with random initialization. For LLMs this is usually text completion. By far the most resource-intensive (98% of InstructGPT's compute)
Finetuning | Continuing training on a model that's already been trained. Much cheaper
Post-training | Training that happens after pre-training. Conceptually the same as finetuning. Post-training is what model developers do; finetuning is what application developers do

A couple of footnotes: Quantization changes weight values but isn't training. Prompt engineering is also not training, even though people sometimes call it that.

Dataset Engineering

Curating, generating, and annotating data. With foundation models a few things shift:

  • Annotating open-ended queries is harder than close-ended ones.
  • You're working with unstructured data, not tables.
  • The work moves toward deduplication, tokenization, context retrieval, and quality control (stripping out sensitive or toxic data).

Inference Optimization

Foundation models are autoregressive, meaning tokens come out one at a time. If a model takes 10ms per token, 100 tokens take a full second. Getting that down to the 100ms latency the web expects is one of the hardest problems in the stack.
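The one-token-at-a-time constraint is easy to see in a decoding loop. In the sketch below, `model_step` is a stand-in for a full forward pass, and the dummy model is purely illustrative:

```python
# Toy autoregressive decoding loop: every new token requires a full model
# step, so latency grows linearly with output length. `model_step` is a
# stand-in for a real forward pass.
def generate(model_step, prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tokens.append(model_step(tokens))  # one "forward pass" per token
    return tokens

# Dummy model: always continues with the current sequence length.
out = generate(lambda toks: len(toks), [101, 102], max_new_tokens=3)
print(out)  # [101, 102, 2, 3, 4]
```

Because each step depends on the tokens produced so far, the steps can't be parallelized away; that's why techniques like speculative decoding and KV caching exist.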

Category | Traditional ML | Foundation Models
Modeling and training | ML knowledge required | ML knowledge nice-to-have
Dataset engineering | Feature engineering, tabular data | Deduplication, tokenization, context retrieval, quality control
Inference optimization | Important | Even more important

4.4 Application Development Layer

Three jobs at this layer: evaluation, prompt engineering, and AI interface.

Evaluation

When everyone uses the same foundation model, the model itself stops being the differentiator. The application development process is what sets products apart. You need evaluation to:

  • Pick a model and benchmark progress.
  • Decide when something is ready to ship.
  • Catch issues and find improvements once it's live.

A couple of things make evaluation hard:

  • Open-ended outputs. There's no exhaustive ground-truth answer key.
  • Adaptation techniques. They can swing apparent performance dramatically.

A real example: Gemini Ultra's MMLU score went from 83.7% to 90.04% just by switching prompt engineering from 5-shot to CoT@32.

Prompt Engineering and Context Construction

Getting models to do what you want just from input, without touching weights. That includes the prompt itself, plus the context, tools, and (for long-running tasks) some kind of memory management.

AI Interface

The shape an AI app can take is wide: standalone web/desktop/mobile apps, browser extensions, chatbots inside Slack, Discord, WeChat, or WhatsApp, plug-ins for VSCode, Shopify, and Microsoft 365, voice assistants, embodied AR/VR.

Tools for putting AI apps together: Streamlit, Gradio, Plotly Dash.

Category | Traditional ML | Foundation Models
AI interface | Less important | Important
Prompt engineering | Not applicable | Important
Evaluation | Important | More important

4.5 AI Engineering Versus Full-Stack Engineering

Because the application development layer matters so much, AI engineering keeps drifting toward full-stack. Python is still common but JavaScript APIs are catching up fast (LangChain.js, Transformers.js, OpenAI Node, Vercel AI SDK), and a lot of AI engineers come out of web and full-stack backgrounds.

The workflow flips on its head:

  • Traditional ML: gather data, train the model, build the product.
  • AI engineering: build the product, then invest in data and models if it actually goes anywhere.

Summary

  • Self-supervision killed the labeling bottleneck and let language models scale into LLMs.
  • LLMs picked up multiple modalities and turned into foundation models: general-purpose, broadly capable.
  • The availability of foundation models created AI engineering as a discipline. Three drivers: general-purpose capabilities, money pouring in, and a low barrier to entry.
  • Common use cases cover coding, image/video, writing, education, conversational bots, information aggregation, data organization, and workflow automation.
  • Planning an AI application means thinking about why you're building it, the role of AI versus humans, defensibility, success metrics, milestones, and maintenance.
  • The AI engineering stack has three layers: application development, model development, and infrastructure.
  • The big shifts from ML engineering: less training, more model adaptation and evaluation, and more weight on inference optimization, prompt engineering, and user interfaces.