Reliable, Scalable, and Maintainable Applications

Designing Data-Intensive Applications by Martin Kleppmann

Chapter 1: Reliable, Scalable, and Maintainable Applications

Introduction

Many applications today are data-intensive, as opposed to compute-intensive. Raw CPU power is rarely a limiting factor — bigger problems are usually the amount of data, the complexity of data, and the speed at which it is changing.

A data-intensive application is typically built from standard building blocks:

Databases — Store data so that they, or another application, can find it again later.
Caches — Remember the result of an expensive operation, to speed up reads.
Search Indexes — Allow users to search data by keyword or filter it in various ways.
Stream Processing — Send a message to another process, to be handled asynchronously.
Batch Processing — Periodically crunch a large amount of accumulated data.

Section 1: Thinking About Data Systems

We typically think of databases, queues, caches, etc. as very different categories of tools. However, the boundaries between categories are becoming blurred:

Datastores are used as message queues (e.g., Redis).
Message queues have database-like durability guarantees (e.g., Apache Kafka).

Increasingly, applications have such demanding requirements that a single tool can no longer meet all data processing and storage needs. The work is broken down into tasks performed by different tools, stitched together using application code.

For example, if you have an application-managed caching layer (using Memcached or similar), or a full-text search server (such as Elasticsearch or Solr) separate from your main database, it is normally the application code's responsibility to keep those caches and indexes in sync with the main database.

When you combine several tools to provide a service, the service's API usually hides implementation details from clients.
Your composite data system may provide certain guarantees: e.g., that the cache will be correctly invalidated or updated on writes so that outside clients see consistent results.
You have essentially created a new, special-purpose data system from smaller, general-purpose components.
You are now not only an application developer, but also a data system designer.
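
As a rough illustration of application code keeping a cache in sync with the main database, here is a minimal sketch. The `ProfileService` class and its dictionaries are hypothetical stand-ins (a real system would talk to a database and to Memcached or similar); the point is only that the write path invalidates the cached copy so clients never see a stale value.

```python
class ProfileService:
    """Hypothetical composite data system: a 'database' plus a read cache."""

    def __init__(self):
        self._db = {}      # stands in for the main database
        self._cache = {}   # stands in for Memcached or similar
        self.db_reads = 0  # counts how often a read had to hit the database

    def write(self, user_id, profile):
        # Update the source of truth first, then invalidate the cached copy
        # so the next read cannot return a stale value.
        self._db[user_id] = profile
        self._cache.pop(user_id, None)

    def read(self, user_id):
        if user_id in self._cache:
            return self._cache[user_id]
        self.db_reads += 1
        profile = self._db.get(user_id)
        if profile is not None:
            self._cache[user_id] = profile
        return profile


service = ProfileService()
service.write(1, "Alice")
first = service.read(1)    # cache miss: goes to the database
second = service.read(1)   # cache hit: served from the cache
service.write(1, "Alice B.")
third = service.read(1)    # cache was invalidated, so this read is fresh
```

Callers only see `read` and `write`; the invalidation rule is an implementation detail hidden behind that small API.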

Key Design Questions

When designing a data system or service, tricky questions arise:

How do you ensure that the data remains correct and complete, even when things go wrong internally?
How do you provide consistently good performance to clients, even when parts of your system are degraded?
How do you scale to handle an increase in load?
What does a good API for the service look like?

Many factors influence the design: the skills and experience of the people involved, legacy system dependencies, the timescale for delivery, the organization's tolerance of risk, regulatory constraints, etc.

Three Core Concerns

This book focuses on three concerns important in most software systems:

Reliability — The system should continue to work correctly even in the face of adversity (hardware or software faults, and even human error).
Scalability — As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.
Maintainability — Over time, many different people will work on the system, and they should all be able to work on it productively.

Section 2: Reliability

Reliability means "continuing to work correctly, even when things go wrong." For software, typical expectations include:

The application performs the function that the user expected.
It can tolerate the user making mistakes or using the software in unexpected ways.
Its performance is good enough for the required use case, under the expected load and data volume.
The system prevents any unauthorized access and abuse.

Faults vs Failures

A fault is one component of the system deviating from its spec.
A failure is when the system as a whole stops providing the required service to the user.
It is impossible to reduce the probability of a fault to zero; therefore it is best to design fault-tolerance mechanisms that prevent faults from causing failures.
It can make sense to deliberately induce faults to test fault-tolerance machinery (e.g., Netflix Chaos Monkey — randomly kills individual processes without warning to ensure fault-tolerance is continually exercised and tested).
Many critical bugs are actually due to poor error handling; by deliberately inducing faults, you increase confidence that faults will be handled correctly when they occur naturally.
Although we generally prefer tolerating faults over preventing faults, there are cases where prevention is better (e.g., security — if an attacker has compromised a system, that event cannot be undone).
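
The distinction between a fault and a failure can be sketched with deliberately injected faults. This is only an illustration under assumed names: `FlakyBackend` and `fetch_with_retry` are hypothetical, and a simple retry loop stands in for whatever fault-tolerance mechanism a real system would use.

```python
class FlakyBackend:
    """Hypothetical component that faults on chosen call numbers (injected faults)."""

    def __init__(self, fail_on_calls):
        self.fail_on_calls = set(fail_on_calls)
        self.calls = 0

    def fetch(self):
        self.calls += 1
        if self.calls in self.fail_on_calls:
            raise ConnectionError("injected fault")
        return "ok"


def fetch_with_retry(backend, attempts=3):
    """Tolerate up to attempts - 1 consecutive faults before giving up."""
    last_error = None
    for _ in range(attempts):
        try:
            return backend.fetch()
        except ConnectionError as error:
            last_error = error
    # Only now does the fault become a failure visible to the caller.
    raise RuntimeError("service unavailable") from last_error


# Two injected faults are tolerated: the third attempt succeeds.
tolerated = fetch_with_retry(FlakyBackend(fail_on_calls=[1, 2]))

# Three injected faults exhaust the retries, so the caller sees a failure.
try:
    fetch_with_retry(FlakyBackend(fail_on_calls=[1, 2, 3]))
    failed = False
except RuntimeError:
    failed = True
```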

2.1 Hardware Faults

Hard disks crash, RAM becomes faulty, power grids have blackouts, network cables get unplugged.
Hard disks have a mean time to failure (MTTF) of about 10 to 50 years.
- On a storage cluster with 10,000 disks, expect on average one disk to die per day.
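
The "one disk per day" figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes disks fail independently at a constant rate of roughly 1/MTTF, which is a simplification; the function name is made up for illustration.

```python
def expected_failures_per_day(disk_count, mttf_years):
    """Average disk failures per day, assuming independent failures at a constant rate."""
    return disk_count / (mttf_years * 365)


low = expected_failures_per_day(10_000, 50)   # optimistic MTTF of 50 years
high = expected_failures_per_day(10_000, 10)  # pessimistic MTTF of 10 years
print(f"{low:.2f} to {high:.2f} failures per day")  # prints "0.55 to 2.74 failures per day"
```

Under these assumptions the range brackets one failure per day, which matches the order of magnitude quoted above.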

Approaches

Add redundancy to individual hardware components:
- RAID configuration for disks.
- Dual power supplies and hot-swappable CPUs.
- Batteries and diesel generators for backup power.
- This approach cannot completely prevent hardware problems from causing failures, but it is well understood and can often keep a machine running uninterrupted for years.
Until recently, hardware redundancy was sufficient for most applications — multi-machine redundancy was only required by a small number of applications where high availability was absolutely essential.
Software fault-tolerance techniques — As data volumes and computing demands have increased, more applications use larger numbers of machines, which proportionally increases the rate of hardware faults.
- In some cloud platforms (e.g., AWS), it is fairly common for virtual machine instances to become unavailable without warning, as platforms prioritize flexibility and elasticity over single-machine reliability.
- Hence there is a move toward systems that can tolerate the loss of entire machines.
- Operational advantage: allows rolling upgrades — patching one node at a time without downtime of the entire system (e.g., applying OS security patches).

2.2 Software Errors

Systematic errors within the system are harder to anticipate and tend to cause many more failures than uncorrelated hardware faults because they are correlated across nodes. Examples include:

A software bug that crashes every instance given a particular bad input (e.g., the June 30, 2012 leap second bug in the Linux kernel).
A runaway process that uses up shared resources — CPU time, memory, disk space, or network bandwidth.
A dependent service that slows down, becomes unresponsive, or returns corrupted responses.
Cascading failures, where a small fault in one component triggers faults in others.

Mitigations

The bugs that cause these kinds of software faults often lie dormant for a long time until triggered by an unusual set of circumstances. The software is making some assumption about its environment — and while that assumption is usually true, it eventually stops being true for some reason.
There is no quick solution to systematic faults. Lots of small things can help:
- Carefully thinking about assumptions and interactions in the system.
- Thorough testing and process isolation.
- Allowing processes to crash and restart.
- Measuring, monitoring, and analyzing system behavior in production.
- If a system is expected to provide some guarantee (e.g., in a message queue, that the number of incoming messages equals the number of outgoing messages), it can constantly check itself while running and raise an alert if a discrepancy is found.
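
The self-checking idea in the last point might look roughly like the following. `AuditedQueue` is a hypothetical in-memory queue, not a real message broker; it compares its own counters against what is actually stored and reports a discrepancy.

```python
from collections import deque


class AuditedQueue:
    """Hypothetical in-memory queue that audits its own message counts."""

    def __init__(self):
        self._items = deque()
        self.enqueued = 0
        self.dequeued = 0

    def put(self, message):
        self._items.append(message)
        self.enqueued += 1

    def get(self):
        message = self._items.popleft()
        self.dequeued += 1
        return message

    def check_invariant(self):
        # Every message that came in is either still queued or has gone out.
        expected_in_flight = self.enqueued - self.dequeued
        return expected_in_flight == len(self._items)


queue = AuditedQueue()
for n in range(3):
    queue.put(n)
queue.get()
healthy = queue.check_invariant()

# Simulate a bug that silently drops a message without counting it.
queue._items.pop()
after_bug = queue.check_invariant()
```

In production the check would typically run periodically and raise an alert rather than return a boolean.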

2.3 Human Errors

Humans are known to be unreliable. One study of large internet services found that configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages.

Approaches to Mitigate Human Errors

Minimize opportunities for error — Well-designed abstractions, APIs, and admin interfaces that make it easy to do "the right thing" and discourage "the wrong thing." However, if interfaces are too restrictive, people will work around them, negating their benefit — a tricky balance.
Decouple mistake-prone areas from failure-prone areas — Provide fully featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users.
Test thoroughly at all levels — Unit tests, whole-system integration tests, and manual tests. Automated testing is especially valuable for covering corner cases that rarely arise in normal operation.
Allow quick and easy recovery — Make it fast to roll back configuration changes, roll out new code gradually (so bugs affect only a small subset of users), and provide tools to recompute data (in case the old computation was incorrect).
Set up detailed and clear monitoring — Performance metrics and error rates (telemetry). Like a rocket after launch, telemetry is essential for tracking what is happening and understanding failures. Monitoring provides early warning signals and allows checking whether assumptions or constraints are being violated.
Good management practices and training — A complex and important aspect.

How Important Is Reliability?

Bugs in business applications cause lost productivity and legal risks.
Outages of ecommerce sites have huge costs in lost revenue and reputation damage.
Even "noncritical" applications have a responsibility to users (e.g., a photo app storing family memories).
Reliability may be sacrificed for cost reasons (prototypes, narrow profit margins), but this should be a conscious decision.

Section 3: Scalability

Scalability is the term used to describe a system's ability to cope with increased load. It is not a one-dimensional label — discussing scalability means considering questions like:

"If the system grows in a particular way, what are our options for coping with the growth?"
"How can we add computing resources to handle the additional load?"

3.1 Describing Load

Load can be described with a few numbers called load parameters. The best choice depends on the architecture:

Requests per second to a web server.
Ratio of reads to writes in a database.
Number of simultaneously active users in a chat room.
Hit rate on a cache.

Twitter Example

Two of Twitter's main operations (data from November 2012):

Post tweet — A user publishes a new message to their followers (4.6k requests/sec average, over 12k at peak).
Home timeline — A user views tweets posted by people they follow (300k requests/sec).

The scaling challenge is fan-out: each user follows many people, and each user is followed by many people. Fan-out is a term borrowed from electronic engineering; in transaction processing systems it describes the number of requests to other services needed to serve one incoming request.

Approach 1: Query on Read

Insert new tweet into a global collection.
When a user requests their home timeline, look up all people they follow, find all their tweets, and merge them sorted by time.

In a relational database, this could be expressed as:

SELECT tweets.*, users.* FROM tweets
  JOIN users   ON tweets.sender_id    = users.id
  JOIN follows ON follows.followee_id = users.id
  WHERE follows.follower_id = :current_user_id;  -- bind parameter: ID of the user viewing the timeline

Twitter initially used this approach but the systems struggled to keep up with the load of home timeline queries.

Approach 2: Pre-compute on Write

Maintain a cache for each user's home timeline (like a mailbox).
When a user posts a tweet, look up all followers and insert the tweet into each of their home timeline caches.
Reading the home timeline is then cheap because it's pre-computed.

This works better because the average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads — so it's preferable to do more work at write time and less at read time.
Trade-off: A tweet is delivered to ~75 followers on average, so 4.6k tweets/sec becomes 345k writes/sec to home timeline caches.
- Some users have over 30 million followers — a single tweet may result in 30 million+ writes.
- Twitter aims to deliver tweets to followers within 5 seconds.
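
A minimal in-memory sketch of Approach 2 is shown below. The dictionaries and function names are hypothetical and say nothing about how Twitter actually implemented it; the sketch only shows that the cost moves to the write path, and it repeats the write-amplification arithmetic from the trade-off above.

```python
from collections import defaultdict

followers = defaultdict(set)        # author -> set of follower ids
home_timelines = defaultdict(list)  # user -> cached timeline, newest first


def follow(follower, author):
    followers[author].add(follower)


def post_tweet(author, text):
    """Fan the tweet out to every follower's cache; returns the number of cache writes."""
    for user in followers[author]:
        home_timelines[user].insert(0, (author, text))
    return len(followers[author])


def read_home_timeline(user):
    # Reads are cheap: the timeline has already been assembled at write time.
    return home_timelines[user]


follow("bob", "alice")
follow("carol", "alice")
writes = post_tweet("alice", "hello")

# The arithmetic from the text: 4.6k tweets/sec times ~75 followers each.
cache_writes_per_sec = 4_600 * 75
```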

Hybrid Approach (Final Solution)

Most users' tweets are fanned out at write time (Approach 2).
Celebrities (users with very large follower counts) are excepted from fan-out.
Celebrity tweets are fetched separately and merged at read time (Approach 1).
This hybrid delivers consistently good performance.
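
The hybrid can be sketched by combining both paths. Everything here is an assumption made for illustration: the names, the in-memory stores, and the tiny `CELEBRITY_THRESHOLD`. The sketch also ignores what happens when an account crosses the threshold after some of its tweets were already fanned out.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 2  # hypothetical cutoff, tiny so the demo stays small

followers = defaultdict(set)          # author -> follower ids
following = defaultdict(set)          # user -> authors they follow
tweets_by_author = defaultdict(list)  # global store: author -> [(timestamp, text)]
timeline_cache = defaultdict(list)    # user -> [(timestamp, author, text)]


def follow(follower, author):
    followers[author].add(follower)
    following[follower].add(author)


def is_celebrity(author):
    return len(followers[author]) >= CELEBRITY_THRESHOLD


def post_tweet(author, timestamp, text):
    tweets_by_author[author].append((timestamp, text))
    if not is_celebrity(author):
        # Approach 2: fan out on write for ordinary accounts.
        for user in followers[author]:
            timeline_cache[user].append((timestamp, author, text))


def read_home_timeline(user):
    entries = list(timeline_cache[user])
    # Approach 1: pull celebrity tweets at read time and merge them in.
    for author in following[user]:
        if is_celebrity(author):
            entries.extend((ts, author, text) for ts, text in tweets_by_author[author])
    return sorted(entries, reverse=True)  # newest first


follow("bob", "alice")
follow("bob", "star")
follow("carol", "star")
post_tweet("alice", 1, "lunch")      # ordinary account: written to bob's cache
post_tweet("star", 2, "new album")   # celebrity: stored once, no fan-out
bob_timeline = read_home_timeline("bob")
```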

Key insight: The distribution of followers per user is a key load parameter for discussing scalability, as it determines the fan-out load.

3.2 Describing Performance

Once load is described, investigate what happens when load increases:

When you increase a load parameter and keep system resources unchanged, how is performance affected?
When you increase a load parameter, how much do you need to increase resources to keep performance unchanged?

Key Metrics

Batch processing systems (e.g., Hadoop): Care about throughput — records processed per second, or total time for a job.
Online systems: Care about response time — time between a client sending a request and receiving a response.

Latency vs Response Time

Response time is what the client sees: service time + network delays + queueing delays.
Latency is the duration a request is waiting to be handled (awaiting service).
These are often used synonymously, but they are not the same.

Response Time as a Distribution

Even making the same request repeatedly yields slightly different response times each time.
Random additional latency can be introduced by: a context switch to a background process, loss of a network packet and TCP retransmission, a garbage collection pause, a page fault forcing a read from disk, mechanical vibrations in the server rack, and many other causes.
Therefore, think of response time not as a single number, but as a distribution of values.

Percentiles

The mean (average) is not a good metric for "typical" response time, because it doesn't tell you how many users actually experienced that delay.
Median (p50) — The halfway point. Half of requests are faster, half are slower. Also known as the 50th percentile.
- Note: the median refers to a single request. If a user makes several requests (over a session, or because several resources are included in a single page), the probability that at least one is slower than the median is much greater than 50%.
Higher percentiles — p95, p99, p999 are the response time thresholds at which 95%, 99%, or 99.9% of requests are faster.
- These are also known as tail latencies.
- Amazon uses the 99.9th percentile for internal services because the slowest requests often come from the most valuable customers (most data, most purchases).
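- Amazon has also observed that a 100 ms increase in response time reduces sales by 1%.
- Others report that a 1-second slowdown reduces a customer satisfaction metric by 16%.
Optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed too expensive by Amazon, with too little benefit for its purposes.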
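
To make the percentile definitions concrete, here is a small sketch using the nearest-rank method (one of several common definitions; production systems usually use approximate algorithms instead). The sample data is invented: 98 fast requests and 2 outliers.

```python
import math


def percentile(sorted_times, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not sorted_times:
        raise ValueError("no samples")
    # Multiply before dividing so integer p values avoid floating-point surprises.
    rank = math.ceil(p * len(sorted_times) / 100)
    return sorted_times[max(rank, 1) - 1]


# Hypothetical response times in milliseconds: 98 fast requests and 2 outliers.
response_times_ms = sorted([20] * 50 + [30] * 48 + [900, 1500])

mean = sum(response_times_ms) / len(response_times_ms)
p50 = percentile(response_times_ms, 50)
p95 = percentile(response_times_ms, 95)
p99 = percentile(response_times_ms, 99)
```

With this data the mean is 48.4 ms, higher than both the median (20 ms) and p95 (30 ms), because two outliers drag it up. Only p99 (900 ms) reveals the slow tail.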

SLOs and SLAs

Service Level Objectives (SLOs) and Service Level Agreements (SLAs) define expected performance and availability.
Example SLA: median response time < 200 ms, 99th percentile < 1 s, service up at least 99.9% of the time.

Head-of-Line Blocking

Queueing delays often account for a large part of response time at high percentiles.
A server can only process a small number of things in parallel (limited by CPU cores). A small number of slow requests can hold up the processing of subsequent requests — an effect known as head-of-line blocking.
Even if subsequent requests are fast to process, the client sees a slow overall response time due to waiting for the prior request to complete.
Because of this effect, it is important to measure response times on the client side.
When generating load artificially to test scalability, the load-generating client needs to keep sending requests independently of the response time. If the client waits for the previous request to complete before sending the next one, it artificially keeps queues shorter than in reality, skewing measurements.
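
Both effects, head-of-line blocking and the measurement skew from a load generator that waits for responses, can be seen in a toy simulation. The model is an assumption chosen for simplicity: a single server that handles one request at a time in arrival order, with made-up service times.

```python
def simulate(service_times_ms, interval_ms, wait_for_response):
    """Single FIFO server. Returns the response time the client sees for each request."""
    response_times = []
    server_free_at = 0
    send_at = 0
    for service in service_times_ms:
        start = max(send_at, server_free_at)  # the request queues if the server is busy
        done = start + service
        response_times.append(done - send_at)
        server_free_at = done
        if wait_for_response:
            # Closed loop: the next request is not sent until this one returns.
            send_at = max(send_at + interval_ms, done)
        else:
            # Open loop: keep to the schedule regardless of responses.
            send_at = send_at + interval_ms
    return response_times


# Hypothetical workload: one slow request (100 ms) followed by fast ones (10 ms).
service_times = [100, 10, 10, 10, 10]
open_loop = simulate(service_times, interval_ms=20, wait_for_response=False)
closed_loop = simulate(service_times, interval_ms=20, wait_for_response=True)
```

The open-loop client observes `[100, 90, 80, 70, 60]`: the fast requests are slow because they queue behind the first one. The closed-loop client observes `[100, 10, 10, 10, 10]`, which hides the queueing entirely.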

Tail Latency Amplification

In backend services called multiple times per end-user request, even one slow call makes the entire request slow.
The more backend calls required, the higher the chance of hitting a slow one.
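
The amplification can be estimated with a short calculation, assuming each backend call is independently slow with the same probability (real systems often violate this, so treat the numbers as a rough guide). The function name is illustrative.

```python
def prob_at_least_one_slow(backend_calls, slow_fraction):
    """Chance that one or more of n independent calls lands in the slow fraction."""
    return 1 - (1 - slow_fraction) ** backend_calls


# If 1% of backend calls are slow (slower than p99):
one_call = prob_at_least_one_slow(1, 0.01)
hundred_calls = prob_at_least_one_slow(100, 0.01)

# The same formula covers the earlier note about the median: with 5 requests,
# the chance that at least one is slower than the median.
five_vs_median = prob_at_least_one_slow(5, 0.5)
```

Under the independence assumption, a request that fans out to 100 backend calls has roughly a 63% chance of hitting at least one call in the slowest 1%, and a page of 5 requests has about a 97% chance that at least one is slower than the median.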

Calculating Percentiles

Algorithms for efficient approximation: forward decay, t-digest, HdrHistogram.
Never average percentiles — the correct way is to add the histograms.
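
A small example shows why averaging percentiles goes wrong. The two per-machine histograms are invented, and exact counts in a `Counter` stand in for the bucketed histograms a real monitoring system would keep.

```python
from collections import Counter


def percentile_from_histogram(histogram, p):
    """Nearest-rank percentile from a {response_time_ms: count} histogram (integer p)."""
    total = sum(histogram.values())
    rank = -(-p * total // 100)  # integer ceiling of p% of the sample count
    seen = 0
    for value in sorted(histogram):
        seen += histogram[value]
        if seen >= rank:
            return value
    raise ValueError("empty histogram")


# Hypothetical per-machine histograms: machine A is busy and fast, B is idle and slow.
machine_a = Counter({10: 990, 50: 10})
machine_b = Counter({10: 5, 500: 5})

# Wrong: average the two machines' p99 values.
averaged_p99 = (percentile_from_histogram(machine_a, 99)
                + percentile_from_histogram(machine_b, 99)) / 2

# Right: add the histograms, then take the percentile of the combined data.
merged_p99 = percentile_from_histogram(machine_a + machine_b, 99)
```

Averaging gives 255 ms, while the p99 of the combined data is 50 ms. The average gives the nearly idle machine the same weight as the busy one.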

3.3 Approaches for Coping with Load

An architecture appropriate for one level of load is unlikely to cope with 10x that load.
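If you are working on a fast-growing service, you will likely need to rethink your architecture on every order-of-magnitude increase in load, or perhaps even more often.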

Scaling Up vs Scaling Out

Scaling up (vertical scaling) — Moving to a more powerful machine.
Scaling out (horizontal scaling) — Distributing load across multiple smaller machines (also called shared-nothing architecture).
In reality, good architectures involve a pragmatic mixture of both approaches.

Elastic vs Manual Scaling

Elastic systems automatically add computing resources when load increases.
- Useful if load is highly unpredictable.
Manually scaled systems are simpler and may have fewer operational surprises.

Stateless vs Stateful

Distributing stateless services across multiple machines is fairly straightforward.
Taking stateful data systems from a single node to distributed introduces a lot of additional complexity.
Common wisdom: keep your database on a single node (scale up) until scaling cost or high-availability requirements force distribution.

No Magic Scaling Sauce

The architecture of large-scale systems is usually highly specific to the application.
The problem may be volume of reads, writes, data to store, complexity, response time requirements, access patterns, or a mixture.
For example, a system designed to handle 100,000 requests/sec, each 1 kB in size, looks very different from a system designed for 3 requests/min, each 2 GB in size — even though both have the same data throughput.
An architecture that scales well is built around assumptions of which operations are common vs rare — the load parameters. If those assumptions turn out to be wrong, the engineering effort for scaling is at best wasted, and at worst counterproductive.
In an early-stage startup or unproven product, it's usually more important to iterate quickly on product features than to scale to some hypothetical future load.
Scalable architectures are built from general-purpose building blocks arranged in familiar patterns.
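
The request-size comparison above (100,000 requests/sec at 1 kB each versus 3 requests/min at 2 GB each) can be verified with a quick calculation. The sketch assumes decimal units (1 kB = 1,000 bytes, 1 GB = 1,000,000,000 bytes).

```python
def throughput_mb_per_sec(requests, per_seconds, request_size_bytes):
    """Data throughput in MB/s (decimal megabytes)."""
    return requests * request_size_bytes / per_seconds / 1_000_000


KB = 1_000
GB = 1_000_000_000

many_small = throughput_mb_per_sec(100_000, 1, 1 * KB)  # 100,000 requests per second
few_large = throughput_mb_per_sec(3, 60, 2 * GB)        # 3 requests per minute
```

Both come out to 100 MB/s, yet the first system is dominated by per-request overhead and the second by moving large payloads.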

Section 4: Maintainability

The majority of the cost of software is not in its initial development, but in its ongoing maintenance — fixing bugs, keeping systems operational, investigating failures, adapting to new platforms, modifying for new use cases, repaying technical debt, and adding new features.

Many people working on software systems dislike maintenance of so-called legacy systems — perhaps it involves fixing other people's mistakes, working with outdated platforms, or systems forced to do things they were never intended for. Every legacy system is unpleasant in its own way.

However, we can and should design software to minimize pain during maintenance, and thus avoid creating legacy software ourselves. Three design principles:

4.1 Operability: Making Life Easy for Operations

"Good operations can often work around the limitations of bad software, but good software cannot run reliably with bad operations."

Operations Team Responsibilities

Monitoring system health and quickly restoring service.
Tracking down causes of problems (failures, degraded performance).
Keeping software and platforms up to date, including security patches.
Keeping tabs on how different systems affect each other.
Anticipating future problems (e.g., capacity planning).
Establishing good practices for deployment and configuration management.
Performing complex maintenance tasks (e.g., platform migrations).
Maintaining system security as configuration changes are made.
Defining processes for predictable operations and stable production environments.
Preserving organizational knowledge as people come and go.

Good Operability Means

Providing visibility into runtime behavior with good monitoring.
Good support for automation and integration with standard tools.
Avoiding dependency on individual machines.
Good documentation and an easy-to-understand operational model.
Good default behavior with freedom to override defaults.
Self-healing where appropriate, with manual control when needed.
Exhibiting predictable behavior, minimizing surprises.

4.2 Simplicity: Managing Complexity

As projects get larger, they often become very complex and difficult to understand, slowing down everyone and increasing maintenance cost. A software project mired in complexity is sometimes described as a big ball of mud.

Symptoms of Complexity

Explosion of the state space.
Tight coupling of modules.
Tangled dependencies.
Inconsistent naming and terminology.
Hacks for performance problems.
Special-casing to work around issues elsewhere.

When complexity makes maintenance hard, budgets and schedules are often overrun. In complex software, there is also a greater risk of introducing bugs when making a change — hidden assumptions, unintended consequences, and unexpected interactions are more easily overlooked.

Accidental vs Essential Complexity

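Accidental complexity is not inherent in the problem the software solves (as seen by the users) but arises only from the implementation.
Essential complexity, by contrast, is inherent in the problem itself and cannot be removed by a better implementation.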
Making a system simpler does not necessarily mean reducing functionality — it means removing accidental complexity.

Abstraction

The best tool for removing accidental complexity is abstraction.
A good abstraction hides implementation detail behind a clean, simple-to-understand façade. It can also be used for a wide range of different applications — reuse is more efficient than reimplementing, and quality improvements in the abstracted component benefit all applications that use it.
Examples:
- High-level programming languages abstract away machine code, CPU registers, and syscalls.
- SQL abstracts away complex on-disk and in-memory data structures, concurrent requests from other clients, and inconsistencies after crashes.
Finding good abstractions is very hard, especially in distributed systems — there are many good algorithms, but it is much less clear how to package them into abstractions that keep complexity manageable.

4.3 Evolvability: Making Change Easy

System requirements are in constant flux: you learn new facts, previously unanticipated use cases emerge, business priorities change, users request new features, new platforms replace old platforms, legal or regulatory requirements change, and growth forces architectural changes.

Agile working patterns provide a framework for adapting to change. Technical tools include test-driven development (TDD) and refactoring.
Most Agile techniques focus on a fairly small, local scale (a couple of source code files within the same application). This book searches for ways of increasing agility at the level of a larger data system — perhaps consisting of several different applications or services with different characteristics.
- For example, how would you "refactor" Twitter's architecture for assembling home timelines from Approach 1 to Approach 2?
The ease of modifying a data system is closely linked to its simplicity and abstractions — simple and easy-to-understand systems are usually easier to modify than complex ones.
Evolvability (also known as extensibility, modifiability, or plasticity) = agility on a data system level.

Summary

Key Takeaways

An application must meet functional requirements (what it should do) and nonfunctional requirements (reliability, scalability, maintainability, security, compliance, compatibility).
Reliability — Making systems work correctly even when faults occur. Faults can be in hardware (random, uncorrelated), software (systematic, hard to deal with), and humans (inevitable mistakes). Fault-tolerance techniques hide certain faults from end users.
Scalability — Having strategies for keeping performance good even when load increases. Requires ways of describing load and performance quantitatively (load parameters, percentiles). Add processing capacity to remain reliable under high load.
Maintainability — Making life better for engineering and operations teams. Good abstractions reduce complexity. Good operability means good visibility and effective management. Design for evolvability.
There is no easy fix — but certain patterns and techniques keep reappearing across different kinds of applications.

Important References

Michael Stonebraker and Uğur Çetintemel: "'One Size Fits All': An Idea Whose Time Has Come and Gone," at 21st International Conference on Data Engineering (ICDE), April 2005.
Walter L. Heimerdinger and Charles B. Weinstock: "A Conceptual Framework for System Fault Tolerance," Technical Report CMU/SEI-92-TR-033, Software Engineering Institute, Carnegie Mellon University, October 1992.
Ding Yuan, Yu Luo, Xin Zhuang, et al.: "Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems," at 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2014.
Yury Izrailevsky and Ariel Tseitlin: "The Netflix Simian Army," techblog.netflix.com, July 19, 2011.
Richard I. Cook: "How Complex Systems Fail," Cognitive Technologies Laboratory, April 2000.
David Oppenheimer, Archana Ganapathi, and David A. Patterson: "Why Do Internet Services Fail, and What Can Be Done About It?," at 4th USENIX Symposium on Internet Technologies and Systems (USITS), March 2003.
Raffi Krikorian: "Timelines at Scale," at QCon San Francisco, November 2012.
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, et al.: "Dynamo: Amazon's Highly Available Key-Value Store," at 21st ACM Symposium on Operating Systems Principles (SOSP), October 2007.
Jeffrey Dean and Luiz André Barroso: "The Tail at Scale," Communications of the ACM, volume 56, number 2, pages 74–80, February 2013.
James Hamilton: "On Designing and Deploying Internet-Scale Services," at 21st Large Installation System Administration Conference (LISA), November 2007.
Brian Foote and Joseph Yoder: "Big Ball of Mud," at 4th Conference on Pattern Languages of Programs (PLoP), September 1997.
Ben Moseley and Peter Marks: "Out of the Tar Pit," at BCS Software Practice Advancement (SPA), 2006.
Frederick P. Brooks: "No Silver Bullet – Essence and Accident in Software Engineering," in The Mythical Man-Month, Anniversary edition, Addison-Wesley, 1995.

