The Trouble with Distributed Systems
Designing Data-Intensive Applications by Martin Kleppmann, Chapter 8: The Trouble with Distributed Systems
Introduction
Working with distributed systems is fundamentally different from writing software on a single computer — there are lots of new and exciting ways for things to go wrong. This chapter is a thoroughly pessimistic overview of problems that may occur: unreliable networks, unreliable clocks, and process pauses. In distributed systems, suspicion, pessimism, and paranoia pay off.
Section 1: Faults and Partial Failures
On a single computer, things are fairly predictable: either it works or it doesn't. If an internal fault occurs, we prefer a total crash rather than returning a wrong result. Computers hide the fuzzy physical reality and present an idealized system model with mathematical perfection.
In a distributed system, some parts may be broken while others work fine — this is a partial failure. Partial failures are nondeterministic: the same operation may sometimes work and sometimes fail. You may not even know whether something succeeded or not.
Cloud Computing vs Supercomputing
| | Supercomputing (HPC) | Cloud Computing |
|---|---|---|
| Fault handling | Checkpoint to storage; stop entire cluster on failure; restart from checkpoint | Tolerate faults; keep running; rolling upgrades |
| Availability | Offline batch jobs (can be stopped/restarted) | Online services (must serve users with low latency) |
| Hardware | Specialized, reliable hardware; RDMA | Commodity machines; higher failure rates |
| Network | Specialized topologies (meshes, toruses) | IP and Ethernet (Clos topologies) |
| Scale | If something is always broken, system spends time recovering instead of working | Kill bad VMs, request new ones |
In a system with thousands of nodes, something is always broken. Fault handling must be part of the software design.
Building a Reliable System from Unreliable Components
- Error-correcting codes allow accurate transmission over a noisy channel.
- TCP provides reliable transport on top of unreliable IP (retransmits lost packets, eliminates duplicates, reorders).
- The system can be more reliable than its parts, but there is always a limit.
Section 2: Unreliable Networks
Most distributed systems use shared-nothing architecture with asynchronous packet networks. When you send a request and expect a response, many things can go wrong:
- Your request may have been lost (someone unplugged a cable).
- Your request may be waiting in a queue (network or recipient overloaded).
- The remote node may have failed (crashed or powered down).
- The remote node may have temporarily stopped responding (GC pause) but will respond later.
- The remote node processed your request, but the response was lost.
- The remote node processed your request, but the response is delayed.
The sender can't tell whether the packet was delivered. The only option is a timeout: give up waiting and assume the response won't arrive. But you still don't know if the request got through.
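In code, the timeout is typically the caller's only tool. A sketch in Java, where `sendRequest` is a hypothetical remote call (here it simply never completes, simulating a lost request or a lost response):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutDemo {
    // Hypothetical remote call: this future is never completed,
    // standing in for a lost request or a lost response.
    static CompletableFuture<String> sendRequest() {
        return new CompletableFuture<>();
    }

    // Returns the response, or null if the timeout fires first.
    // Note: on timeout we still don't know whether the remote node
    // received and processed the request — only that no reply arrived.
    static String callWithTimeout(long millis) {
        try {
            return sendRequest().get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return null; // give up waiting; the real outcome is unknown
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The key point the code makes explicit: the `TimeoutException` branch cannot distinguish any of the failure modes listed above.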
Network Faults in Practice
- One study found ~12 network faults per month in a medium-sized datacenter.
- Adding redundant networking gear doesn't help as much as hoped (doesn't guard against human error — a major cause of outages).
- EC2 is notorious for frequent transient network glitches. Sharks bite undersea cables.
- Even if faults are rare, your software must handle them. Deliberately trigger network problems in testing (Chaos Monkey).
Detecting Faults
- Load balancers need to stop sending requests to dead nodes.
- In single-leader replication, followers need to detect leader failure for failover.
- Some feedback is possible: RST/FIN packets if process crashed, notification scripts if process crashed but OS is running, ICMP Destination Unreachable from routers.
- But you can't count on rapid feedback. If you want to be sure a request succeeded, you need a positive response from the application itself.
Timeouts and Unbounded Delays
- Long timeout → long wait until a node is declared dead.
- Short timeout → higher risk of incorrectly declaring a node dead (premature failover during a temporary slowdown makes things worse).
- If the network guaranteed maximum delay d and nodes guaranteed handling within r, then 2d + r would be a reasonable timeout. But asynchronous networks have unbounded delays and servers can't guarantee maximum handling time.
Network Congestion and Queueing
Variability in packet delays is mostly due to queueing:
- Network switch queue — Multiple nodes sending to the same destination. If the queue fills up, packets are dropped.
- OS queue — All CPU cores busy; incoming requests queued until the application handles them.
- TCP flow control — Limits the rate of sending to avoid overloading the network.
- TCP retransmission — Lost packets are retransmitted after a timeout, adding delay.
TCP vs UDP
- UDP avoids flow control and retransmission → lower and more predictable latency, but unreliable.
- Good for latency-sensitive applications where delayed data is worthless (VoIP, videoconferencing).
Phi Accrual Failure Detector
Rather than fixed timeouts, systems can continuously measure response times and jitter, and automatically adjust timeouts. Used by Akka and Cassandra.
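A sketch of the accrual idea, using the simplification (as in Cassandra's detector) that heartbeat inter-arrival times are exponentially distributed, so the suspicion level is phi = elapsed / (mean interval × ln 10). Class and window size are illustrative, not any library's actual API:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class PhiAccrual {
    private final Deque<Double> intervals = new ArrayDeque<>();
    private final int windowSize;
    private double lastHeartbeat = Double.NaN;

    public PhiAccrual(int windowSize) { this.windowSize = windowSize; }

    // Record a heartbeat arrival; keep a sliding window of inter-arrival times.
    public void heartbeat(double nowMillis) {
        if (!Double.isNaN(lastHeartbeat)) {
            intervals.addLast(nowMillis - lastHeartbeat);
            if (intervals.size() > windowSize) intervals.removeFirst();
        }
        lastHeartbeat = nowMillis;
    }

    // Higher phi = more suspicion that the node is dead.
    // With exponential intervals, P(heartbeat still arrives) = exp(-elapsed/mean),
    // and phi = -log10 of that probability = elapsed / (mean * ln 10).
    public double phi(double nowMillis) {
        if (intervals.isEmpty()) return 0.0;
        double mean = intervals.stream()
                .mapToDouble(Double::doubleValue).average().orElse(1.0);
        double elapsed = nowMillis - lastHeartbeat;
        return elapsed / (mean * Math.log(10));
    }
}
```

The effective timeout adapts automatically: a node with jittery heartbeats accumulates suspicion more slowly than one with very regular ones.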
Synchronous vs Asynchronous Networks
- Telephone networks use circuit switching: a fixed amount of bandwidth is reserved for each call. No queueing → bounded delay. But wastes bandwidth when idle.
- Datacenter networks and the internet use packet switching: optimized for bursty traffic. Bandwidth shared dynamically → queueing → unbounded delays. But maximizes utilization.
- Quality of Service (QoS) and admission control can emulate circuit switching on packet networks, but this is not currently enabled in multi-tenant datacenters or the public internet.
Section 3: Unreliable Clocks
Each machine has its own clock (quartz crystal oscillator). Clocks can be synchronized via NTP (Network Time Protocol), but synchronization is limited by network delay.
Time-of-Day Clocks vs Monotonic Clocks
| | Time-of-Day Clock | Monotonic Clock |
|---|---|---|
| Returns | Current date and time (wall-clock time, seconds since epoch) | Arbitrary value (e.g., nanoseconds since boot) |
| Synchronized | Yes, via NTP. May jump backward if adjusted. | No synchronization needed. NTP may slew the rate but never jumps. |
| Use for | Points in time (when did this happen?) | Durations (how long did this take?) |
| Cross-machine comparison | Meaningful (if synchronized) | Meaningless |
| Suitable for measuring elapsed time | No (may jump) | Yes |
| API | `clock_gettime(CLOCK_REALTIME)`, `System.currentTimeMillis()` | `clock_gettime(CLOCK_MONOTONIC)`, `System.nanoTime()` |
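The distinction matters in practice: a duration measured with the time-of-day clock can come out negative or wildly wrong if NTP steps the clock mid-measurement. A sketch using the monotonic clock, which NTP adjustments cannot disturb:

```java
public class ClockDemo {
    // System.nanoTime() is monotonic: its absolute value is meaningless
    // (e.g., nanoseconds since boot), but differences between readings
    // on the same machine are reliable elapsed-time measurements.
    static long timeSleepNanos(long sleepMillis) {
        long start = System.nanoTime();
        try {
            Thread.sleep(sleepMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return System.nanoTime() - start;
    }
}
```

Using `System.currentTimeMillis()` here instead would work most of the time, which is exactly what makes the bug hard to find.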
Clock Synchronization and Accuracy
NTP and hardware clocks are fickle:
- Quartz clocks drift (a drift of 200 ppm means ~6 ms of drift for a clock resynchronized every 30 seconds, or ~17 seconds of drift for a clock synchronized once a day).
- NTP can be firewalled off without anyone noticing.
- NTP accuracy limited by network delay (minimum ~35 ms error over the internet).
- Some NTP servers are wrong or misconfigured.
- Leap seconds (a minute that is 59 or 61 seconds long) have crashed many large systems. Workaround: smearing (spreading the adjustment gradually over the course of a day).
- VM pauses — When a CPU core is shared, each VM is paused for tens of milliseconds while another runs.
- User-controlled devices — Hardware clocks may be deliberately set to wrong time.
Relying on Synchronized Clocks
Incorrect clocks easily go unnoticed — they don't cause dramatic crashes, just silent and subtle data loss. Monitor clock offsets between all machines; declare nodes with excessive drift as dead.
Timestamps for Ordering Events (Dangerous!)
- Last Write Wins (LWW) with time-of-day clocks: even with <3 ms clock skew, writes can be ordered incorrectly. A node with a lagging clock silently drops writes.
- LWW cannot distinguish sequential writes from truly concurrent writes.
- Logical clocks (incrementing counters) are a safer alternative for ordering events.
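One well-known logical clock is the Lamport clock — a minimal sketch of the standard construction: each node increments its counter on every local event, attaches it to outgoing messages, and on receipt advances to the maximum of both counters plus one, so causally later events always get larger timestamps:

```java
public class LamportClock {
    private long counter = 0;

    // A local event (including sending a message) bumps the counter.
    public synchronized long tick() {
        return ++counter;
    }

    // On receiving a message stamped with the sender's counter,
    // advance past both clocks so causality is preserved.
    public synchronized long receive(long senderTimestamp) {
        counter = Math.max(counter, senderTimestamp) + 1;
        return counter;
    }

    public synchronized long current() { return counter; }
}
```

Unlike wall-clock timestamps, these values say nothing about real time — only about ordering, which is exactly what LWW actually needs.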
Clock Confidence Intervals
- A clock reading is not a precise point in time — it has an uncertainty range.
- Most systems don't expose this uncertainty.
- Google's TrueTime API (Spanner) explicitly reports a confidence interval `[earliest, latest]` — the actual time is somewhere in that interval. Uses GPS receivers and atomic clocks in each datacenter (~7 ms uncertainty).
Synchronized Clocks for Global Snapshots (Spanner)
- Spanner uses TrueTime confidence intervals for snapshot isolation across datacenters.
- If two intervals don't overlap, the ordering is definite. If they overlap, the ordering is uncertain.
- Spanner deliberately waits for the length of the confidence interval before committing, ensuring causal ordering.
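The non-overlap rule reduces to a simple interval comparison — a toy model, not Spanner's actual API: A definitely happened before B only if A's latest possible time precedes B's earliest possible time.

```java
public class TimeInterval {
    public final long earliest, latest; // milliseconds, earliest <= latest

    public TimeInterval(long earliest, long latest) {
        this.earliest = earliest;
        this.latest = latest;
    }

    // Ordering is definite only when the intervals do not overlap.
    // Commit wait amounts to: after choosing a timestamp, wait until
    // `latest` has passed, so any later transaction's interval
    // cannot overlap with this one.
    public static boolean definitelyBefore(TimeInterval a, TimeInterval b) {
        return a.latest < b.earliest;
    }
}
```

With ~7 ms uncertainty, the deliberate wait is short enough to be practical, which is why Spanner invests in GPS and atomic clock hardware to keep the interval small.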
Process Pauses
A thread can be paused for a significant time at any point:
- Garbage collection (GC) — "Stop-the-world" pauses can last seconds or even minutes.
- VM suspension — VM can be suspended and resumed (live migration). Pause can be arbitrary length.
- Context switching — OS or hypervisor switches to another thread/VM. Steal time.
- Synchronous disk I/O — Thread paused waiting for slow disk (especially network filesystems like EBS).
- Swapping to disk (paging) — Memory access triggers a page fault → slow disk I/O.
- Unix signal handling (e.g., SIGSTOP/SIGCONT).
A node must assume its execution can be paused for a significant length of time. During the pause, the rest of the world keeps moving and may declare the paused node dead.
Response Time Guarantees
- Hard real-time systems (aircraft, rockets, robots) have specified deadlines. Require restricted programming languages, no dynamic memory allocation, enormous testing. Very expensive. Not economical for most server-side systems.
Limiting GC Impact
- Treat GC pauses like brief planned outages: stop sending requests to the node, wait for outstanding requests to finish, then GC while no requests are in progress.
- Some financial trading systems use this approach.
Section 4: Knowledge, Truth, and Lies
A node in the network cannot know anything for sure — it can only make guesses based on messages it receives (or doesn't receive). If a remote node doesn't respond, there is no way of knowing what state it is in, because problems in the network cannot reliably be distinguished from problems at a node.
4.1 The Truth Is Defined by the Majority
A node cannot trust its own judgment. A distributed system cannot rely on a single node. Instead, many algorithms rely on a quorum — voting among nodes. Decisions require some minimum number of votes to reduce dependence on any one node.
If a quorum of nodes declares another node dead, then it must be considered dead, even if that node still feels alive. The individual node must abide by the quorum decision and step down. Most commonly, the quorum is an absolute majority (more than half the nodes).
4.2 The Leader and the Lock
Situations where only one node is allowed to do something:
- Only one node is the leader for a database partition (to avoid split brain).
- Only one transaction is allowed to hold the lock for a particular resource (to prevent concurrent writes and corruption).
- Only one user is allowed to register a particular username.
Even if a node believes it is "the chosen one," the other nodes may have declared it dead and elected a replacement. If the old node continues acting in its self-appointed capacity, it can cause data corruption.
Fencing Tokens
Consider a request-handling loop using a lease:
    while (true) {
        request = getIncomingRequest();

        // Ensure that the lease always has at least 10 seconds remaining
        if (lease.expiryTimeMillis - System.currentTimeMillis() < 10000) {
            lease = lease.renew();
        }

        if (lease.isValid()) {
            process(request);
        }
    }
Two problems: (1) it relies on synchronized clocks (the expiry time is set by a different machine), and (2) even with a local monotonic clock, the code assumes very little time passes between checking the time and processing the request. A 15-second GC pause around lease.isValid() could mean the lease expires before the request is processed — but the thread won't notice until the next loop iteration, by which time it may have already done something unsafe.
- Every time the lock server grants a lock or lease, it returns a fencing token — a number that increases every time a lock is granted.
- Every write request to the storage service must include the current fencing token.
- The storage server rejects any write with a token lower than one it has already processed.
- Example: Client 1 gets token 33, pauses (lease expires). Client 2 gets token 34, writes successfully. Client 1 wakes up, sends write with token 33 → rejected by storage server.
- ZooKeeper's `zxid` or `cversion` can be used as fencing tokens (guaranteed monotonically increasing).
- The resource itself must check tokens — it's not sufficient for clients to check their own lock status.
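The storage-side check can be sketched as: remember the highest token seen per key, and reject anything lower. Class and method names here are illustrative, not any real storage service's API:

```java
import java.util.HashMap;
import java.util.Map;

public class FencedStorage {
    // Highest fencing token accepted so far, per key.
    private final Map<String, Long> highestToken = new HashMap<>();
    private final Map<String, String> data = new HashMap<>();

    // Returns true if the write is accepted, false if rejected as stale.
    public synchronized boolean write(String key, String value, long token) {
        long highest = highestToken.getOrDefault(key, Long.MIN_VALUE);
        if (token < highest) {
            return false; // stale lock holder — reject the write
        }
        highestToken.put(key, token);
        data.put(key, value);
        return true;
    }
}
```

This reproduces the example above: client 2's write with token 34 raises the bar, so client 1's delayed write with token 33 is refused no matter how long its GC pause lasted.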
4.3 Byzantine Faults
Fencing tokens handle nodes that are inadvertently acting in error. But if a node deliberately wants to subvert the system, it could send messages with a fake fencing token.
In this book we assume nodes are unreliable but honest: they may be slow or unresponsive, but if they respond, they tell the truth to the best of their knowledge.
A Byzantine fault is when a node sends arbitrary faulty or corrupted responses (e.g., claiming to have received a message when it didn't). The Byzantine Generals Problem: reaching consensus when some participants may be traitors sending contradictory messages.
The Byzantine Generals Problem
- A generalization of the Two Generals Problem, which imagines two army generals who need to agree on a battle plan. They can only communicate by messenger, and messengers sometimes get delayed or lost (like packets in a network).
- In the Byzantine version, there are n generals who need to agree, but some are traitors in their midst. Most generals are loyal and send truthful messages, but the traitors may try to deceive and confuse the others by sending fake or untrue messages (while trying to remain undiscovered). It is not known in advance who the traitors are.
- The name comes from Byzantium (the ancient Greek city that became Constantinople, now Istanbul). There isn't historic evidence that Byzantine generals were especially prone to intrigue — the name derives from "Byzantine" meaning excessively complicated, bureaucratic, devious. Lamport chose the nationality to avoid offending readers — he was advised that calling it "The Albanian Generals Problem" was not such a good idea.
When Byzantine Fault Tolerance Matters
- Aerospace — Radiation can corrupt memory/CPU registers → unpredictable responses. Flight control systems must tolerate Byzantine faults.
- Multiple organizations — Participants may attempt to cheat (e.g., peer-to-peer networks like Bitcoin/blockchains).
When It Doesn't
- In your own datacenter, all nodes are controlled by your organization. Byzantine fault-tolerant protocols are too complex and expensive for most server-side systems.
- Web applications need to expect malicious client behavior, but handle it with input validation, sanitization, and output escaping — not Byzantine fault-tolerant protocols.
4.4 Weak Forms of Lying
Even without full Byzantine fault tolerance, it's worth guarding against weak forms of "lying" — invalid messages due to hardware issues, software bugs, and misconfiguration. These protection mechanisms are not full-blown Byzantine fault tolerance (they would not withstand a determined adversary), but they are simple and pragmatic steps toward better reliability:
- Network packet corruption — Usually caught by TCP/UDP checksums, but sometimes evades detection. Simple measures are usually sufficient, such as application-level checksums.
- Input sanitization — A publicly accessible application must carefully sanitize any inputs from users: check values are within reasonable ranges, limit string sizes to prevent denial of service through large memory allocations. An internal service behind a firewall may get away with less strict checks, but basic sanity-checking of values (e.g., in protocol parsing) is a good idea.
- NTP with multiple servers — Client contacts all servers, estimates their errors, and checks that a majority of servers agree on some time range. As long as most servers are okay, a misconfigured NTP server reporting incorrect time is detected as an outlier and excluded from synchronization. Multiple servers make NTP more robust than relying on a single server.
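A toy version of the outlier-exclusion idea — real NTP clients use Marzullo-style interval intersection, but simply dropping readings far from the median conveys the principle:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class NtpOutliers {
    // Keep only readings within `tolerance` of the median. If a
    // majority of servers roughly agree, one badly misconfigured
    // server's reading is excluded from synchronization.
    static List<Double> excludeOutliers(List<Double> readings, double tolerance) {
        List<Double> sorted = new ArrayList<>(readings);
        Collections.sort(sorted);
        double median = sorted.get(sorted.size() / 2);
        List<Double> kept = new ArrayList<>();
        for (double r : readings) {
            if (Math.abs(r - median) <= tolerance) kept.add(r);
        }
        return kept;
    }
}
```

The robustness comes from the majority assumption: if most servers lie, the median itself is wrong, which is why this is a weak-lying defense rather than Byzantine fault tolerance.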
4.5 System Models
To reason about distributed algorithms, we define a system model — an abstraction describing what faults an algorithm may assume.
Timing Models
| Model | Assumption |
|---|---|
| Synchronous | Bounded network delay, bounded process pauses, bounded clock error. Not realistic for most practical systems. |
| Partially synchronous | Behaves like synchronous most of the time, but sometimes exceeds bounds. Most realistic model. |
| Asynchronous | No timing assumptions at all. No clocks, no timeouts. Very restrictive. |
Node Failure Models
| Model | Assumption |
|---|---|
| Crash-stop | A node can only fail by crashing. Once crashed, it's gone forever. |
| Crash-recovery | Nodes may crash and later restart. Stable storage (disk) survives crashes; in-memory state is lost. Most useful model. |
| Byzantine | Nodes may do absolutely anything, including sending contradictory messages. |
The most useful combination for real systems: partially synchronous model with crash-recovery faults.
4.6 Correctness of Algorithms
Define properties that a correct algorithm must satisfy. Example for fencing tokens:
- Uniqueness — No two requests return the same value.
- Monotonic sequence — If request x completed before y began, then token_x < token_y.
- Availability — A node that requests a token and doesn't crash eventually receives a response.
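These properties can be demonstrated against a toy in-process token issuer — a real lock service such as ZooKeeper must enforce them across an unreliable network, which is the hard part, but within one process a single atomic counter satisfies them trivially:

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy fencing-token issuer. Uniqueness and monotonic sequence follow
// from incrementAndGet being atomic; availability follows from the
// call being local and non-blocking.
public class TokenService {
    private final AtomicLong next = new AtomicLong(0);

    public long acquire() {
        return next.incrementAndGet();
    }
}
```

Stating the properties separately from the implementation is the point: any correct distributed algorithm for fencing tokens must satisfy the same three, however different its internals.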
4.7 Safety and Liveness Properties
| | Safety | Liveness |
|---|---|---|
| Informal definition | Nothing bad happens | Something good eventually happens |
| If violated | Can point to a specific point in time when it was broken. Violation cannot be undone. | May not hold now, but there is always hope it will be satisfied in the future. |
| Examples | Uniqueness, monotonic sequence | Availability, eventual consistency |
| Requirement | Must always hold, even if all nodes crash or the entire network fails | Allowed to make caveats (e.g., only if a majority of nodes are working) |
Distinguishing safety and liveness helps deal with difficult system models: safety properties must hold in all situations; liveness properties can have conditions.
4.8 Mapping System Models to the Real World
- Algorithms in the crash-recovery model assume stable storage survives crashes. But what if the disk is corrupted?
- Theoretical models are simplified abstractions. A real implementation may still need to handle "impossible" situations (even if the handling is just `printf("Sucks to be you")` and `exit(666)`).
- Proving an algorithm correct in a system model doesn't guarantee the implementation always behaves correctly — but it's a very good first step.
- Theoretical analysis and empirical testing are equally important.
Summary
Key Takeaways
Problems in Distributed Systems:
- Unreliable Networks — Packets may be lost, delayed, duplicated, or reordered. The reply may also be lost or delayed. If you don't get a reply, you don't know why. The only option is a timeout.
- Unreliable Clocks — Clocks may be significantly out of sync (despite NTP). May jump forward or backward. Don't rely on them for ordering events. Use logical clocks instead.
- Process Pauses — A process may pause for a substantial amount of time (GC, VM suspension, context switching, disk I/O, swapping) and be declared dead by other nodes, then resume without knowing it was paused.
Consequences:
- A node cannot know anything for sure — only make guesses based on messages received.
- Faults are hard to detect — most algorithms rely on timeouts, which can't distinguish network from node failures.
- No global variable, no shared memory, no common knowledge. The only way information flows is via the unreliable network.
- Major decisions cannot be made by a single node — require quorum protocols.
Key Concepts:
- Fencing tokens — Monotonically increasing numbers that prevent stale lock holders from corrupting data.
- Byzantine faults — Nodes that lie. Relevant for aerospace and blockchains, but too expensive for most server-side systems.
- System models — Partially synchronous + crash-recovery is the most realistic.
- Safety properties must always hold; liveness properties can have caveats.
If you can keep things on a single machine, it is generally worth doing so. But for scalability, fault tolerance, and low latency, distributed systems are necessary — and these problems must be confronted.
Important References
- Leslie Lamport: "Time, Clocks, and the Ordering of Events in a Distributed System," Communications of the ACM, volume 21, number 7, pages 558–565, July 1978.
- James C. Corbett, Jeffrey Dean, Michael Epstein, et al.: "Spanner: Google's Globally-Distributed Database," at 10th USENIX OSDI, October 2012.
- Leslie Lamport, Robert Shostak, and Marshall Pease: "The Byzantine Generals Problem," ACM TOPLAS, volume 4, number 3, pages 382–401, July 1982.
- Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer: "Consensus in the Presence of Partial Synchrony," Journal of the ACM, volume 35, number 2, pages 288–323, April 1988.
- Peter Bailis and Kyle Kingsbury: "The Network Is Reliable," ACM Queue, volume 12, number 7, pages 48–55, July 2014.
- Phillipa Gill, Navendu Jain, and Nachiappan Nagappan: "Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications," at ACM SIGCOMM, August 2011.
- Naohiro Hayashibara, Xavier Défago, Rami Yared, and Takuya Katayama: "The φ Accrual Failure Detector," JAIST Technical Report IS-RR-2004-010, May 2004.
- Flavio P. Junqueira and Benjamin Reed: ZooKeeper: Distributed Process Coordination, O'Reilly Media, 2013.
- Mark Cavage: "There's Just No Getting Around It: You're Building a Distributed System," ACM Queue, volume 11, number 4, pages 80–89, April 2013.
- Jeff Hodges: "Notes on Distributed Systems for Young Bloods," somethingsimilar.com, January 14, 2013.
- Kyle Kingsbury: "Call Me Maybe: Cassandra," aphyr.com, September 24, 2013.
- Jay Kreps: "Getting Real About Distributed System Reliability," blog.empathybox.com, March 19, 2012.
- Caitie McCaffrey: "Clients Are Jerks: AKA How Halo 4 DoSed the Services at Launch & How We Survived," caitiem.com, June 23, 2015.
- Frank McSherry, Michael Isard, and Derek G. Murray: "Scalability! But at What COST?," at 15th USENIX HotOS, May 2015.