Chapter 8: Addressing Constraints
Introduction
Production LLM deployment is a different game from prototyping. It adds hard constraints around hardware capacity, latency budgets, and cost ceilings. The five patterns in this chapter go after those constraints. Small Language Model (24) fits within cost and latency budgets via distillation, quantization, or speculative decoding. Prompt Caching (25) eliminates redundant work for repeated requests. Inference Optimization (26) maximizes throughput on self-hosted LLMs through continuous batching, speculative decoding, and prompt compression. Degradation Testing (27) defines the metrics that pinpoint where quality starts to slip. Long-Term Memory (28) lets agents remember across sessions without overflowing the context window.
Pattern 24 — Small Language Model
The Small Language Model (SLM) pattern uses distillation, quantization, or speculative decoding to fit within cost and latency constraints without unduly compromising quality.
Problem
Frontier model serving needs state-of-the-art GPUs and memory. Llama 4 Scout requires 4×H100 80GB. At June 2025 pricing, that's $10/hour per instance. Cloud capacity is also routinely scarce.
You can't just swap to a smaller model on hard tasks. Asking Gemma 3 27B to document Python code yields proper docstrings with Args / Attributes sections; Gemma 3 1B returns plain English summaries. Smaller means lower quality by default.
But smaller models are faster:
| Model | Tokens / second |
|---|---|
| Gemma 3 27B | 3.26 |
| Gemma 3 1B | 8.82 |
Solution
Three options.
Option 1: Distillation
Transfer knowledge from a large teacher model to a small student model by training the student to mimic the teacher's outputs (Hinton, Vinyals, Dean 2015). Most enterprise apps need only narrow knowledge, so let the student "forget" everything else and focus its parameters on your task.
The training loop, for each batch (shown as it would appear inside a Trainer-style class holding teacher_model, temperature, and alpha):
import torch
import torch.nn.functional as F

# 1. Get teacher logits (no gradients needed)
with torch.no_grad():
    teacher_outputs = self.teacher_model(**inputs)
    teacher_logits = teacher_outputs.logits

# 2. Get student logits
student_outputs = model(**inputs)
student_logits = student_outputs.logits

# 3. Standard task loss
task_loss = student_outputs.loss

# 4. Temperature scaling on both before computing the distillation loss
student_logits = student_logits / self.temperature
teacher_logits = teacher_logits / self.temperature

# 5. KL divergence between the two distributions
distillation_loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    F.softmax(teacher_logits, dim=-1),
    reduction="batchmean",
) * (self.temperature ** 2)

# 6. Combine
loss = (1 - self.alpha) * task_loss + self.alpha * distillation_loss
KL divergence forces the student to mimic the teacher's distribution. Temperature scaling softens the distribution: instead of one token at p≈1 and the rest near 0, you get p=0.6, 0.2, 0.1, etc., which preserves the teacher's "dark knowledge" about alternative tokens. alpha balances task loss vs. distillation loss.
There are two extensions worth knowing. A meta distillation loop iterates distillation across multiple rounds, gradually shrinking the model. Ensemble distillation (Allen-Zhu and Li 2020) distills multiple specialized teachers into one student.
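For the ensemble variant, a minimal sketch of the loss, assuming k teacher models whose logits share one vocabulary; averaging the softened teacher distributions before the KL term is one common formulation:
import torch
import torch.nn.functional as F

def ensemble_distillation_loss(student_logits, teacher_logits_list, temperature=2.0):
    # Average the teachers' softened distributions into one target
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    # Student mimics the averaged ensemble, scaled by T^2 as before
    return F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        teacher_probs,
        reduction="batchmean",
    ) * temperature ** 2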
Option 2: Quantization
LLMs typically store weights as FP32 (4 bytes). 70B params × 4 bytes is roughly 280GB just for weights. Quantize to INT8 (1 byte) or INT4 (½ byte) and the memory drops 4 to 8 times. Accuracy degrades slightly, but it's usually a small BLEU drop, not a catastrophe.
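The arithmetic is easy to sanity-check (a back-of-the-envelope sketch; a real deployment also needs memory for activations and the KV cache):
def weight_memory_gb(params_billion: float, bytes_per_weight: float) -> float:
    # Weights only: (params_billion * 1e9) * bytes / 1e9 bytes-per-GB
    return params_billion * bytes_per_weight

for fmt, nbytes in [("FP32", 4), ("INT8", 1), ("INT4", 0.5)]:
    print(f"70B @ {fmt}: {weight_memory_gb(70, nbytes):.0f} GB")
# 70B @ FP32: 280 GB
# 70B @ INT8:  70 GB
# 70B @ INT4:  35 GB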
When to quantize:
| Stage | Methods |
|---|---|
| Pretraining | Train at low precision; quantization-aware training (QAT) injects fake quantized ops in forward pass so the model learns to be robust |
| During training | Mixed-precision (DeepSpeed, Megatron-LM); dynamic quantization adjusting based on activation statistics (PyTorch) |
| Post-training | Weight-only (GPTQ, AWQ); full-model (static with calibration set or dynamic at inference); QLoRA, SPQR, BitNet (1-bit) |
Practical post-training quantization with BitsAndBytes:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "google/gemma-3-1b-it"  # any causal LM checkpoint works here

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",           # NF4 - statistically optimized for LLM weights
    bnb_4bit_use_double_quant=True,      # quantize the quantization constants too
)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
    torch_dtype=torch.float16,
    token=hf_token,                      # Hugging Face access token, defined elsewhere
)
load_in_4bit cuts memory roughly 8x. nf4 preserves the statistical properties of language-model weights better than uniform quantization.
Option 3: Speculative decoding
Two models work together. A draft (student) model rapidly proposes a sequence of tokens, and a target (teacher) model verifies them in parallel. If the teacher agrees, the whole drafted sequence is accepted. If it rejects a token, the teacher supplies its own token at that point and drafting resumes from there.
It works because not all tokens require a big model. Common phrasing: the student handles it. Rare or context-specific tokens: the teacher.
Step 1: Student: "The [talented] [chef]" Teacher: ✓ accepts
Step 2: Student: "cooked [a] [delicious]" Teacher: ✓ accepts
Step 3: Student: "[soup]" Teacher: ✗ rejects
Teacher generates: "bouillabaisse"
Step 4: Student: "[for] [dinner]" Teacher: ✓ accepts
The speed comes from cheap draft generation plus batch verification of multiple tokens at once.
vLLM supports it natively:
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-2-9b-it",              # target (verifier) model
    tensor_parallel_size=1,
    speculative_model="google/gemma-2-2b-it",  # draft model
    num_speculative_tokens=5,                  # tokens drafted per verification step
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = ["..."]  # your batch of prompts
outputs = llm.generate(prompts, sampling_params)
Example: Python Documentation SLM
Two stages: distill Gemma 3 12B into Gemma 3 1B for code documentation, then quantize the 1B model to 4-bit.
Generate training data with Claude (1,000 Python code examples). For real projects, log production prompts.
Distill (10 epochs on A100 80GB, around 1 hour). The untuned 1B model produced a flat plain-English summary; the distilled 1B model produces docstrings comparable to the 12B teacher:
def add_task(self, title, description, tags=None):
    """
    Adds a new task to the task list.

    Args:
        title: The title of the task.
        description: A description of the task.
        tags: Optional[List[str]] of tags for the task. If None, no tags are included.
    """
Quantize to INT4 and inference drops from several minutes to 19 seconds with no visible quality loss.
Considerations
Distillation costs the student generality, and it inherits the teacher's biases. Quantization trades accuracy for efficiency: not all architectures and hardware support every format well, and transformer attention is sensitive to quantization in specific layers.
Alternatives: model sharding (a latency play only) splits the model across GPUs. Parallelization processes multiple requests simultaneously, and continuous batching (Pattern 26) beats fixed parallelization. Prompt Caching (Pattern 25) is instant for repeated requests. QAT models such as Gemma 3 QAT deliver substantially better efficiency out of the box. Adapter Tuning (Pattern 15) covers domain specialization.
References: Hinton et al. 2015; GPTQ; AWQ; QLoRA; BitNet (1-bit transformers); speculative decoding (Leviathan, Kalman, Matias 2022); Xia et al. 2024 inference survey.
Pattern 25 — Prompt Caching
Prompt Caching reuses previously generated responses (client-side) or model internal states (server-side) for repeated or similar prompts.
Problem
Production usage skews to a few popular questions. 31% of cable callers report outages. 30% of bank calls are about login. 40% of physical-store calls ask for hours. Recomputing the same response wastes hardware utilization, makes users wait, and inflates costs.
Solution
Two main families: client-side (you operate the cache) and server-side (the model provider does).
Client-side caching (memoization)
LangChain's built-in cache:
from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache
set_llm_cache(InMemoryCache()) # also Redis, Cassandra
OpenAI client cache via environment variable:
import os
os.environ["OPENAI_CACHE_DIR"] = "./oai_cache"
Semantic caching
Exact-match keys are brittle. Three approaches to fuzzy matching:
Canonical forms stem, normalize, and replace synonyms before keying. Multiple keys for the same response: generate semantic variants of the request and store the response under all of them. Embedding-based similarity uses a vector store and a similarity threshold (GPTCache).
The risk with all three is that users can get the same response to subtly different queries, and nuance gets lost.
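A minimal sketch of the embedding route, with an in-memory list standing in for a real vector store; the embed callable and the 0.9 threshold are assumptions to tune:
import numpy as np

class SemanticCache:
    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # callable: str -> np.ndarray
        self.threshold = threshold
        self.entries = []           # list of (embedding, response)

    def get(self, prompt):
        query = self.embed(prompt)
        for emb, response in self.entries:
            sim = np.dot(query, emb) / (np.linalg.norm(query) * np.linalg.norm(emb))
            if sim >= self.threshold:
                return response     # close enough: reuse the cached answer
        return None

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))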
Server-side prompt caching (prefix caching)
The provider stores internal model states for common prompt prefixes (long system prompts, examples). Subsequent prompts with the same prefix load the cached state and skip redundant computation. This doesn't affect creativity because it only reuses state, not output tokens. The big win is on TTFT: time to first token drops dramatically, which is great for streaming chats.
OpenAI and Google apply prefix caching automatically once the prompt exceeds a minimum length (roughly 1,024 tokens); Anthropic exposes it through explicit cache-control markers on the prefix. vLLM and SGLang provide automatic prefix caching for open-source serving.
Context caching (Gemini, for example) caches multimedia content for reuse across queries.
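On Anthropic, for example, the cacheable prefix is marked explicitly (a sketch; the model name and LONG_SYSTEM_PROMPT are placeholders):
from anthropic import Anthropic

client = Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-latest",             # illustrative model name
    max_tokens=512,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,               # the stable prefix worth caching
        "cache_control": {"type": "ephemeral"},   # marks the prefix cache point
    }],
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
)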
Example: Hash-based Client Cache
Hash the prompt to get a file name, store the JSON response, and add a semantic-variant generator:
import hashlib
import json
import os
from pathlib import Path
from anthropic import Anthropic

class PromptCache:
    def __init__(self, cache_dir=".prompt_cache"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)
        self.client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

    def _get_cache_key(self, prompt):
        # Hash the prompt text to get a stable file name
        return hashlib.md5(prompt.encode()).hexdigest()

    def get_cached_response(self, prompt):
        cache_path = self.cache_dir / f"{self._get_cache_key(prompt)}.json"
        if cache_path.exists():
            with open(cache_path) as f:
                return json.load(f)
        return None

    def cache_response(self, prompt, response):
        # Store the response JSON under the prompt's hash key
        cache_path = self.cache_dir / f"{self._get_cache_key(prompt)}.json"
        with open(cache_path, "w") as f:
            json.dump(response, f)
For "Explain prompt caching in 100 words", generate 10 semantic variants ("What is prompt caching?", "Briefly explain prompt caching technology", …) and cache the response under all keys.
Considerations
Multitenancy: cache keys must include user identity (or use a federated semantic cache, Gill et al. 2024), or one user's responses will leak to another. Invalidation: set a TTL (provider defaults are around 5 minutes) and always invalidate on model version changes. Client-side caching short-circuits the entire response, the maximum latency win; server-side caching only improves TTFT but requires no work on your part.
References: GPTCache; Gill et al. 2024 (federated semantic cache); Jha and Wang 2023 (auto prefix caching, vLLM). Anthropic reports 90% cost / 85% latency savings for long prompts. OpenAI reports 80% latency / 50% cost savings on prompts over 1024 tokens. Notion uses Claude prompt caching.
Pattern 26 — Inference Optimization
Inference Optimization improves model-serving efficiency for self-hosted LLMs via continuous batching, speculative decoding, and prompt compression.
Problem
Self-hosting is necessary for sensitive data (healthcare, finance, legal). But GPUs are scarce and expensive, and users still expect ChatGPT-level latency on real-time apps. You need maximum efficiency.
Solution
Option 1: Continuous batching
Traditional batching pads all prompts in a batch to the same length and waits for the longest one to finish. That's wasteful for LLM prompts, which vary widely in length: padding wastes GPU cycles, and users with short prompts wait on the longest request in the batch.
Continuous batching pulls requests from a queue and slots them into GPU cores as cores free up. Request granularity is the forward pass, not the batch. Each iteration: every active sequence advances, finished sequences are removed, new sequences fill freed slots. Kernels handle dynamic resizing of attention matrices and KV cache.
vLLM and SGLang do this by default. Just submit batches of requests, not individual ones:
# DON'T: submit requests one at a time
for prompt in prompts:
    _ = model.generate(prompt, sampling_params)

# DO: submit the whole batch and let the engine schedule it
_ = model.generate(prompts, sampling_params)
Option 2: Speculative decoding
See Pattern 24. Same mechanism: draft model proposes tokens, target model verifies in parallel.
Option 3: Prompt compression
Long prompts blow up the KV cache and consume GPU memory. Agent histories and document workflows are common offenders. Two flavors (Li et al. 2024).
Hard compression is human-readable shrinkage:
Use regexes or have an LLM compress the prompt. Verify by asking another LLM to reconstruct the original and checking for information loss. Most LLMs respond to compressed prompts much as they do to the originals.
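A sketch of the LLM-based route with the OpenAI client; the instructions are illustrative, and the reconstruction check is a heuristic, not a guarantee:
from openai import OpenAI

client = OpenAI()

def _ask(instruction, text):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"{instruction}\n\n{text}"}],
    )
    return resp.choices[0].message.content

def compress_prompt(prompt):
    compressed = _ask("Compress this prompt as much as possible while "
                      "keeping every instruction and fact intact:", prompt)
    # Verification round-trip: can another call recover the substance?
    reconstruction = _ask("Expand this compressed prompt back into full detail:",
                          compressed)
    report = _ask(f"Original:\n{prompt}\n\nReconstruction:\n{reconstruction}\n\n"
                  "List any information lost. Reply NONE if nothing is lost.", "")
    return compressed, report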
Soft compression encodes the prompt as continuous vectors:
<bach_1> <bach_2> <bach_3> ... <bach_n>
Question: Analyze Bach's compositional techniques in his keyboard works...
Each <bach_n> is a dense vector encoding a paragraph. The 500xCompressor pushes this further by providing compressed prompts as KV values directly. Soft compressions are model-specific, so don't share between Llama and GPT.
Example
A continuous-batching benchmark on the same hardware showed a 23x speedup over individual requests:
Number of samples: 100
Individual processing time: 106.11 seconds (0.94 samples/sec)
Batch processing time: 4.60 seconds (21.74 samples/sec)
Speedup factor: 23.07x
Speculative decoding gave roughly 14.2% latency reduction. Tune num_speculative_tokens carefully: too aggressive and the target model rejects too often, making it slower than no speculation.
References: continuous batching (Yu et al. 2022); Anyscale 2023 explanation; Li et al. 2024 prompt compression survey; 500xCompressor; Leviathan et al. 2022 (speculative decoding); Xia et al. 2024.
Pattern 27 — Degradation Testing
Degradation Testing identifies bottlenecks and the points where service quality starts to slip, not just where it fails. Standard load testing (look for 4xx/5xx) isn't fine-grained enough for LLM serving.
Problem
LLM apps look like web servers but require deeper performance metrics. Saying "95% of requests serve under 0.3 seconds" misses the point. You need to know which constraint pushes you into the failing 5%: request size, concurrency, GPU memory, or the model's context window, so you can plan capacity and tune.
Solution: Core Metrics
Four key metrics.
Time to first token (TTFT)
The lag between submission and the first response token. Critical for streaming UIs because once tokens start streaming, the user is reading and can tolerate slower full-response latency.
Reducing TTFT: shorten the prompt (Prompt Compression, Pattern 26). Cache the prefix (Prompt Caching, Pattern 25), putting predictable text first and dynamic (RAG) text last so the cache can hit. Give the KV cache more GPU memory. Reduce the served model's maximum context window so the smaller cache runs faster. In multi-step apps, show progress to reduce perceived TTFT.
End-to-end request latency (EERL)
Total time including queuing, network, KV cache creation, and full response generation. Track P50, P95, P99 across realistic query distributions.
Reducing EERL: upgrade hardware (an L4 to an A100, or a specialized ASIC like Groq). Reduce output tokens by asking for only the differences from a cached reference, or by using few-shot examples that demonstrate concise answers. Parallelize subtasks. Use speculative execution at the workflow level: start Step 2 with a guessed result for Step 1, verify Step 1 separately, keep Step 2 if the guess held, and relaunch it otherwise (sketched below).
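A sketch of that workflow-level speculation with asyncio, assuming run_step1, run_step2, and a guess are supplied by the caller:
import asyncio

async def speculative_pipeline(run_step1, run_step2, guess_step1):
    # Launch the real Step 1 and a speculative Step 2 concurrently
    step1_task = asyncio.create_task(run_step1())
    step2_task = asyncio.create_task(run_step2(guess_step1))

    step1_result = await step1_task
    if step1_result == guess_step1:
        return await step2_task          # guess held: speculation paid off
    step2_task.cancel()                  # guess failed: redo Step 2 for real
    return await run_step2(step1_result)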
Tokens per second (TPS)
TPS = Total Tokens Generated / (T_end − T_start)
System throughput. The saturation point is where adding more requests stops increasing TPS. Verify against vendor-published TPS for hosted models.
If you can't hit your TPS, you can throttle, cache more, use a smaller model, or differentiate (peak vs. off-peak models, paid vs. free).
Requests per second (RPS)
RPS = Completed Requests / (T_end − T_start)
Closely related to TPS but ignores response length. Track success rate, error rate, and concurrent users alongside it.
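All four metrics fall out of raw request logs (a sketch, assuming each record carries the timestamps and token count shown):
import statistics

def summarize(records):
    # Each record: dict with t_submit, t_first_token, t_done, tokens_out
    ttfts = [r["t_first_token"] - r["t_submit"] for r in records]
    latencies = sorted(r["t_done"] - r["t_submit"] for r in records)
    window = max(r["t_done"] for r in records) - min(r["t_submit"] for r in records)
    return {
        "ttft_p50": statistics.median(ttfts),
        "eerl_p95": latencies[int(0.95 * (len(latencies) - 1))],
        "tps": sum(r["tokens_out"] for r in records) / window,
        "rps": len(records) / window,
    }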
Scalability, stress, and load
Scalability is how throughput evolves as load gradually increases. Find inflection points before efficiency declines. Track throughput vs. load, response-time degradation, resource utilization, scaling efficiency, breaking point.
Stress analysis pushes beyond normal to find breakpoints and observe whether failure is graceful, partial, or catastrophic. Track max load capacity, failure threshold, recovery time, error rate under stress, system availability, resource exhaustion point.
Load testing simulates realistic peak conditions to validate that the system holds up under expected high demand. Track peak load performance, response time under load, error rate at peak, queue length, peak resource utilization.
A well-performing setup shows no failures, low TTFT, and high, consistent TPS even at 100 users × 25 requests each. A poorly performing one shows most requests failing, an average TTFT of 52 seconds, and degrading TPS.
Mitigations for under-resourced setups: a bigger GPU; multi-GPU parallelism (data parallelism runs a full model copy per GPU, model parallelism partitions one model across GPUs); and the other patterns in this chapter.
Example: Benchmark Tool
$ python llm_benchmark_openai.py \
--requests-per-user 25 \
--num-users 100
========================================================
BENCHMARK SUMMARY
========================================================
Endpoint: https://api.openai.com/v1/chat/completions
Model: gpt-4o-mini
Users: 100
Requests per user: 25
Total requests: 2500
Successful: 2499
Failed: 1
Success rate: 100.0%
Total duration: 190.57s
PERFORMANCE METRICS:
Average TTFT: 3.556s
95th percentile TTFT: 4.206s
Average tokens/sec: 24.1
95th percentile tokens/sec: 31.5
Overall throughput: 1914.5 tokens/sec
========================================================
Tools to know: LLMPerf (Ray Project), LangSmith (LangChain observability), Arize Phoenix, vLLM/SGLang built-in benchmarks, AgentOps, PromptTools.
References: PagedAttention (Kwon et al. 2023, vLLM).
Pattern 28 — Long-Term Memory
Long-Term Memory lets LLM apps remember across sessions without bloating the prompt context.
Problem
LLM calls are stateless. Apps simulate state by prepending conversation history, but transformers scale quadratically with sequence length, so even Gemini's 1M-token window is cost-prohibitive to fill repeatedly.
Solution: Four Types of Memory
Working memory
Within-session message history. Prune intelligently: never break a (user, assistant) pair, always keep the system prompt:
from langchain_core.messages import trim_messages
from langchain_openai import ChatOpenAI

trimmed = trim_messages(
    messages,
    strategy="last",                        # keep the most recent messages
    token_counter=ChatOpenAI(model=MODEL_ID),
    max_tokens=1000,
    start_on="human",                       # never start mid-(user, assistant) pair
    end_on=("human", "tool"),
    include_system=True,                    # always keep the system prompt
)
Episodic memory
Cross-session message retrieval: find prior conversations relevant to the current query. Persist messages to a database and search by content plus metadata (user, recency, topic). RAG-style retrieval (cosine similarity, keywords, or hybrid).
Procedural memory
User profile and system instructions. Either let the user supply a system prompt directly (Bench.io style) or extract preferences from messages ("I'm allergic to nuts" gets added to the profile), as sketched below.
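A sketch of the extraction route with the OpenAI client; the extraction prompt and profile shape are illustrative:
import json
from openai import OpenAI

client = OpenAI()

EXTRACT_PROMPT = (
    "From the user's message, extract durable preferences or facts as "
    'JSON, e.g. {"allergies": ["nuts"]}. Reply {} if there are none.'
)

def update_profile(profile, user_message):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": EXTRACT_PROMPT},
                  {"role": "user", "content": user_message}],
        response_format={"type": "json_object"},
    )
    # Merge extracted facts into the persistent user profile
    profile.update(json.loads(resp.choices[0].message.content))
    return profile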
Semantic memory
Content-driven memory: find prior facts relevant to the current question. Different from episodic memory: episodic is recency-driven, semantic is content-driven.
In LangGraph:
import json
from langgraph.store.postgres import PostgresStore

user_id = "megan"  # example user
store = PostgresStore(connection_string="postgresql://.../dbname")
trip_memories_ns = (user_id, "trip_memories")

memory = {"trip": {"from": "SEA", "to": "KEF", "depart_time": ...}}
memory_id = str(hash(json.dumps(memory, sort_keys=True)))  # stable string key
store.put(trip_memories_ns, memory_id, memory)

# Later
most_recent_trip = store.search(trip_memories_ns)[-1]
Example: Mem0
import os
import tempfile
from mem0 import Memory

temp_dir = tempfile.mkdtemp()  # scratch directory for Mem0's history store

config = {
    "vector_store": {
        "provider": "chroma",
        "config": {"collection_name": "mem0_basic_example", "path": "/tmp/chroma_db"}
    },
    "llm": {
        "provider": "openai",
        "config": {"model": "gpt-4o-mini", "temperature": 0.1}
    },
    "embedder": {
        "provider": "openai",
        "config": {"model": "text-embedding-3-small"}
    },
    "history_db_path": os.path.join(temp_dir, "history.db")
}
memory = Memory.from_config(config)
# Add a conversation
conversation = [
{"role": "user", "content": "I'm looking to travel from Seattle to Reykjavik..."},
{"role": "assistant", "content": "The best way to travel from Seattle to Reykjavik is to fly..."}
]
memory.add(conversation, user_id="megan")
# Search later
relevant_memories = memory.search(
query="What are my options to travel to Reykjavik?",
user_id="megan", limit=3
)
# Returns: ["Interested in travel from Seattle to Reykjavik"]
Behind the scenes
On memory.add, Mem0 prompts the LLM to extract memorable facts (preferences, plans, relationships) and ignores small talk and reconstructible knowledge. The extracted text is embedded and stored in ChromaDB:
collection = chroma_client.get_or_create_collection("mem0_basic_example")
collection.add(
    embeddings=[embedding_vector],
    documents=["User wants to travel from Seattle to Reykjavik"],
    metadatas=[{"user_id": "megan", "created_at": "2025-07-15T10:30:00Z",
                "memory_id": "uuid-12345", "category": "customer_info"}],
    ids=["memory_uuid_12345"]
)
Procedural memory is split across stores: KV (Redis) for quick lookups (user:megan:destination → "Reykjavik"), graph DB (Neo4j) for relationships, relational DB for audit trail.
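The KV piece, for instance, is a plain Redis lookup (a sketch with redis-py; the key layout is illustrative):
import redis

r = redis.Redis()  # assumes a local Redis instance

# Hot profile fields keyed for O(1) lookup
r.set("user:megan:destination", "Reykjavik")
r.get("user:megan:destination")  # b"Reykjavik"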
On memory.search, embed the query and run a similarity search over the vector store filtered by user_id. Layer in KV and graph queries. Filters allow narrowing, so you can retrieve food-preference memories only when booking a flight:
filters = {
"AND": [{"categories": {"contains": "food_preferences"}}]
}
client.search(query, user_id="megan", filters=filters)
For session-scoped (short-term) memory, use run_id instead of (or in addition to) user_id.
Considerations
Deploy Mem0 as a microservice in production, not as an in-process Python module. Write memories asynchronously or in a background thread so you don't slow UX. Match the memory type to the use case: working memory for chatbots, episodic for multi-step workflows, procedural for personalization, semantic for large-document processing.
Prefer semantic over episodic when possible. Storing all messages and retrieving by similarity is brittle and high-latency. Extract structured memories upfront, which gives you fewer entries that are more searchable and easier to debug.
References: Sumers et al. 2023 (memory taxonomy); Wang et al. 2023 (latent-space LLM memory); Mem0 (Chhikara et al. 2025); LangMem.
Summary
| Pattern | Problem | Solution | When to use |
|---|---|---|---|
| Small Language Model (24) | Frontier models too costly / slow / hard to host | Distillation, quantization, speculative decoding | Narrow-scope tasks, edge devices, GPU-constrained environments |
| Prompt Caching (25) | Repeated queries waste compute; long prefixes inflate TTFT | Client-side memoization, semantic caching, server-side prefix caching | Repeated query workloads, multitenant systems, streaming chats |
| Inference Optimization (26) | Self-hosted LLMs need max throughput | Continuous batching, speculative decoding, prompt compression | Self-hosted production, real-time apps, high-throughput serving |
| Degradation Testing (27) | Need to find quality slip-points, not just failure | Track TTFT, EERL, TPS, RPS; scalability, stress, and load tests | Any production deployment that needs SLAs |
| Long-Term Memory (28) | LLMs are stateless; full history can't fit in prompts | Working / episodic / procedural / semantic memory via Mem0 etc. | Chatbots, agents, multi-session workflows, personalization |
A production-grade deployment composes all five. A distilled-and-quantized SLM serves under continuous batching with speculative decoding, a prefix cache short-circuits common system prompts, Mem0 manages user memory, and the whole stack is monitored with degradation-testing dashboards. Each pattern attacks a different production constraint. Pick the ones that map to your bottleneck rather than applying all of them defensively.