Consistency and Consensus
Designing Data-Intensive Applications by Martin Kleppmann, Chapter 9: Consistency and Consensus
Introduction
The best way of building fault-tolerant systems is to find some general-purpose abstractions with useful guarantees, implement them once, and then let applications rely on those guarantees. One of the most important abstractions for distributed systems is consensus: getting all of the nodes to agree on something.
This chapter covers:
- Linearizability — the strongest commonly used consistency model
- Ordering Guarantees — causality, total ordering, total order broadcast
- Distributed Transactions and Consensus — 2PC, fault-tolerant consensus algorithms
Section 1: Consistency Guarantees
Most replicated databases provide at least eventual consistency (convergence): if you stop writing and wait, eventually all read requests return the same value. This is a very weak guarantee — it doesn't say when replicas will converge.
Eventual consistency is hard for application developers because it's so different from a single-threaded program where you assign a value and immediately read it back. Stronger consistency models have worse performance or are less fault-tolerant, but are easier to use correctly.
Transaction isolation (avoiding race conditions from concurrent transactions) and distributed consistency (coordinating replica state despite delays and faults) are mostly independent concerns, though there is some overlap.
Section 2: Linearizability
The idea: make a system appear as if there were only one copy of the data, and all operations on it are atomic. Also known as atomic consistency, strong consistency, immediate consistency, or external consistency.
Linearizability is a recency guarantee: as soon as one client successfully completes a write, all clients reading from the database must be able to see the value just written.
Alice and Bob Example
Alice sees the final score, tells Bob. Bob refreshes but his request goes to a lagging replica — he sees the game as still ongoing. Since Bob's query started after he heard Alice's result, this is a linearizability violation.
What Makes a System Linearizable?
- Operations on a register: read(x) ⇒ v, write(x, v) ⇒ r, cas(x, v_old, v_new) ⇒ r
- If a read is concurrent with a write, it may return either the old or the new value.
- But once any read returns the new value, all subsequent reads must also return the new value — the value atomically "flips" at some point during the write.
The requirement: the lines joining operation markers always move forward in time, never backward.
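The register semantics above can be modeled as a single in-process register whose operations are made atomic with a mutex. This is only a minimal sketch of the abstract read/write/cas interface (class and method names are illustrative, not from the book); a real linearizable system must provide the same guarantee across replicas, which is the hard part.

```python
import threading

class LinearizableRegister:
    """A single-copy register: read, write, and compare-and-set are atomic."""
    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def write(self, value):
        with self._lock:
            self._value = value

    def cas(self, expected, new):
        # Atomically set to `new` only if the current value equals `expected`.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

reg = LinearizableRegister(0)
reg.write(1)
assert reg.cas(1, 2) is True   # succeeds: current value was 1
assert reg.cas(1, 3) is False  # fails: the value has already flipped to 2
assert reg.read() == 2
```

Once the cas to 2 succeeds, every subsequent read returns 2: the value "flips" at one atomic point, which is exactly the behavior the timing-diagram requirement describes.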
Linearizability vs Serializability
- Serializability — An isolation property of transactions (multiple objects). Guarantees transactions behave as if executed in some serial order. That order can differ from actual execution order.
- Linearizability — A recency guarantee on reads and writes of a single register (individual object). Doesn't group operations into transactions.
- Strict serializability (strong-1SR) = both serializability + linearizability. Achieved by 2PL or actual serial execution. SSI is not linearizable (reads from a consistent snapshot).
Relying on Linearizability
- Locking and leader election — All nodes must agree which node owns the lock / is the leader. ZooKeeper and etcd use consensus algorithms for this. Oracle RAC uses linearizable locks per disk page.
- Constraints and uniqueness guarantees — Unique usernames, non-negative bank balances, no double-booking. Requires a single up-to-date value all nodes agree on. (Some constraints can be treated loosely — e.g., overbooking a flight.)
- Cross-channel timing dependencies — When there are multiple communication channels (e.g., file storage + message queue), linearizability prevents race conditions between them.
Implementing Linearizable Systems
| Replication Method | Linearizable? |
|---|---|
| Single-leader | Potentially (if reads from leader or synchronous followers). Not if using snapshot isolation or if there are concurrency bugs. Failover can violate linearizability. |
| Consensus algorithms | Yes — ZooKeeper, etcd. Contain measures to prevent split brain and stale replicas. |
| Multi-leader | Not linearizable (concurrent writes on multiple nodes, async replication). |
| Leaderless (Dynamo-style) | Probably not. LWW with clocks is almost certainly nonlinearizable. Sloppy quorums ruin it. Even strict quorums can be nonlinearizable. |
Linearizability and Quorums
Even with w + r > n, race conditions can make strict quorums nonlinearizable. Making Dynamo-style quorums linearizable requires synchronous read repair before returning results + writer reading latest state before writing — at the cost of reduced performance. A linearizable compare-and-set cannot be implemented this way — it requires a consensus algorithm.
The Cost of Linearizability
- If your application requires linearizability and some replicas are disconnected → those replicas become unavailable (must wait or return error).
- If your application does not require linearizability → each replica can process requests independently → remains available during network problems.
The CAP Theorem
- Popularly: Consistency, Availability, Partition tolerance — pick 2 out of 3. But this is misleading: network partitions are a fault, not a choice. They will happen.
- Better phrasing: Consistent or Available when Partitioned.
- CAP is of very narrow scope (only linearizability + network partitions). It has been superseded by more precise results and is of mostly historical interest today. Best avoided.
Linearizability and Network Delays
- Even RAM on a modern multi-core CPU is not linearizable (each core has its own cache, asynchronously updated to main memory). The reason: performance, not fault tolerance.
- Linearizability is slow — response time is at least proportional to the uncertainty of delays in the network. A faster algorithm for linearizability does not exist. Weaker consistency models can be much faster.
Section 3: Ordering Guarantees
Ordering and Causality
Ordering helps preserve causality (cause comes before effect). Examples throughout the book:
- Consistent prefix reads — seeing the answer before the question
- Replication — an update to a row that doesn't exist yet
- Concurrent writes — happens-before relationship
- Snapshot isolation — consistent with causality (effects of all causally prior operations are visible)
- Write skew — causally dependent on observation of current state
- Alice and Bob — Bob should see the score after hearing Alice's exclamation
Causal Order is Partial, Not Total
- Total order — Any two elements can be compared (e.g., natural numbers: 5 < 13).
- Partial order — Some elements are incomparable (e.g., mathematical sets: {a,b} vs {b,c}).
| | Linearizability | Causality |
|---|---|---|
| Order type | Total order (single timeline, all operations ordered) | Partial order (concurrent operations are incomparable) |
| Concurrency | No concurrent operations (single copy, atomic) | Concurrent operations exist (timeline branches and merges, like Git) |
Linearizability Implies Causality
Any linearizable system preserves causality correctly. But linearizability harms performance and availability. The good news: causal consistency is the strongest possible consistency model that does not slow down due to network delays and remains available during network failures. Many systems that appear to require linearizability actually only need causal consistency.
Sequence Number Ordering
Tracking all causal dependencies is impractical. Instead, use sequence numbers or timestamps from a logical clock (counter incremented for every operation, not a physical clock).
- In single-leader replication: the leader increments a counter for each operation → total order consistent with causality.
- Without a single leader, noncausal generators (odd/even counters, physical timestamps, block allocation) are not consistent with causality.
Lamport Timestamps
- A pair of (counter, node ID). Compare by counter first, then node ID as tiebreaker.
- Key idea: every node and client tracks the maximum counter value seen so far and includes it on every request. When a node receives a request with a higher counter, it immediately increases its own counter to that maximum.
- This ensures ordering is consistent with causality.
- Lamport timestamps vs version vectors: Lamport timestamps enforce a total ordering but cannot distinguish concurrent from causally dependent operations. Version vectors can.
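The Lamport timestamp rules above are short enough to sketch directly. This is a minimal single-process illustration (class and method names are my own); in a real system the timestamps ride along on network messages.

```python
class LamportClock:
    """Logical clock producing (counter, node_id) timestamps.

    Every node tracks the maximum counter value seen so far; on receiving
    a message it fast-forwards its own counter to that maximum.
    """
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        # Local event or message send: increment, then stamp.
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, timestamp):
        # Message receipt: jump to the max counter seen, then increment.
        counter, _ = timestamp
        self.counter = max(self.counter, counter) + 1
        return (self.counter, self.node_id)

a, b = LamportClock(1), LamportClock(2)
t1 = a.tick()         # (1, 1)
t2 = b.observe(t1)    # (2, 2), causally after t1
t3 = a.tick()         # (2, 1), concurrent with t2
assert t1 < t2        # causal order preserved (tuples compare counter first)
assert t3 < t2        # still totally ordered, even though t2 and t3 are concurrent
```

The last assertion shows the limitation described above: the ordering of t3 and t2 is arbitrary (node ID tiebreak), and nothing in the timestamps reveals that they were concurrent. Version vectors would.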
Timestamp Ordering Is Not Sufficient
Lamport timestamps define a total order consistent with causality, but are not sufficient for many problems (e.g., unique username registration). The total order only emerges after collecting all operations — a node cannot know in real-time if another node is concurrently claiming the same username with a lower timestamp. You need to know when the order is finalized → total order broadcast.
Total Order Broadcast
Also called atomic broadcast. A protocol for exchanging messages between nodes with two safety properties:
- Reliable delivery — If a message is delivered to one node, it is delivered to all nodes.
- Totally ordered delivery — Messages are delivered to every node in the same order.
Uses:
- Database replication (state machine replication) — Every replica processes the same writes in the same order.
- Serializable transactions — Every message is a deterministic transaction; all nodes process in the same order.
- Lock service with fencing tokens — Every lock request is appended to the log; sequence numbers serve as fencing tokens (ZooKeeper's zxid).
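The fencing-token use case can be sketched on the storage side: the protected resource remembers the highest token it has seen and rejects anything older, so a paused client that still believes it holds the lock cannot corrupt data. The class below is illustrative, not from the book.

```python
class FencedStorage:
    """Rejects writes carrying a fencing token older than one already seen."""
    def __init__(self):
        self.max_token = -1
        self.value = None

    def write(self, token, value):
        if token <= self.max_token:
            raise PermissionError(f"stale fencing token {token}")
        self.max_token = token
        self.value = value

store = FencedStorage()
store.write(33, "from client A")   # token 33 issued with A's lock
store.write(34, "from client B")   # A's lease expired; B got token 34
try:
    store.write(33, "late write")  # A wakes from a long GC pause: rejected
except PermissionError:
    pass
assert store.value == "from client B"
```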
Implementing Linearizable Storage Using Total Order Broadcast
Total order broadcast is asynchronous (no guarantee of when a message is delivered). Linearizability is a recency guarantee. But you can build linearizable storage on top of TOB:
- Append a message to the log claiming the username.
- Read the log, wait for your message to be delivered back.
- If the first message for that username is yours → success. Otherwise → abort.
For linearizable reads: sequence reads through the log, or query the latest log position and wait, or read from a synchronously updated replica (chain replication).
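The username-claim procedure can be sketched with an in-process list standing in for the total order broadcast log (names are illustrative). The key point: every node reads the same log in the same order, so every node independently reaches the same verdict about which claim came first.

```python
class TotalOrderLog:
    """In-process stand-in for total order broadcast: one append-only log
    that every node reads back in the same order."""
    def __init__(self):
        self.entries = []

    def append(self, message):
        self.entries.append(message)
        return len(self.entries) - 1  # position at which the message was delivered

def claim_username(log, username, node_id):
    # 1. Append a message tentatively claiming the username.
    my_pos = log.append((username, node_id))
    # 2. Read the log back up to our own message.
    # 3. Succeed only if the *first* claim for this username is ours.
    for name, owner in log.entries[:my_pos + 1]:
        if name == username:
            return owner == node_id
    return False

log = TotalOrderLog()
assert claim_username(log, "alice", node_id="n1") is True   # first claim wins
assert claim_username(log, "alice", node_id="n2") is False  # later claim aborts
assert claim_username(log, "bob", node_id="n2") is True
```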
Implementing Total Order Broadcast Using Linearizable Storage
Use a linearizable register with atomic increment-and-get. For every message, increment the register and attach the sequence number. Unlike Lamport timestamps, these numbers have no gaps — if you have message 4 and receive message 6, you know to wait for message 5.
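The reverse direction can be sketched in the same spirit (class names are illustrative): a linearizable counter hands out gap-free sequence numbers, and a receiver holds back out-of-order messages until the gap is filled, which is exactly why "message 5 is missing" is a meaningful, actionable fact here.

```python
class AtomicCounter:
    """Stand-in for a linearizable register with increment-and-get."""
    def __init__(self):
        self._n = 0

    def increment_and_get(self):
        self._n += 1
        return self._n

counter = AtomicCounter()

def broadcast(payload):
    # Attach a gap-free sequence number from the linearizable register.
    return (counter.increment_and_get(), payload)

class Receiver:
    """Delivers messages strictly in sequence order. Because the numbers
    have no gaps, a missing number means 'wait', never 'skip'."""
    def __init__(self):
        self.next_seq = 1
        self.held_back = {}
        self.delivered = []

    def receive(self, message):
        seq, payload = message
        self.held_back[seq] = payload
        while self.next_seq in self.held_back:
            self.delivered.append(self.held_back.pop(self.next_seq))
            self.next_seq += 1

r = Receiver()
m1, m2, m3 = broadcast("a"), broadcast("b"), broadcast("c")
r.receive(m1)
r.receive(m3)              # out of order: held back while waiting for 2
assert r.delivered == ["a"]
r.receive(m2)              # gap filled: 2 and 3 delivered in order
assert r.delivered == ["a", "b", "c"]
```

With Lamport timestamps this hold-back strategy is impossible: the numbering has gaps, so a receiver can never tell whether a lower-numbered message is still in flight.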
The Equivalence
A linearizable compare-and-set register, total order broadcast, and consensus are all equivalent — if you can solve one, you can transform it into a solution for the others.
Section 4: Distributed Transactions and Consensus
Consensus: getting several nodes to agree on something. Important for:
- Leader election — All nodes must agree who the leader is (avoid split brain → data loss).
- Atomic commit — All nodes must agree on the outcome of a distributed transaction (all commit or all abort).
The FLP Impossibility Result
Fischer, Lynch, and Paterson proved that there is no algorithm that always reaches consensus if there is a risk of a node crash — in the asynchronous system model (no clocks or timeouts). But if the algorithm can use timeouts or random numbers, consensus becomes solvable. Distributed systems can usually achieve consensus in practice.
4.1 Two-Phase Commit (2PC)
The most common way of solving atomic commit across multiple nodes. (Note: 2PC ≠ 2PL — completely different things.)
Uses a coordinator (transaction manager) that doesn't appear in single-node transactions.
The Protocol
- Application requests a globally unique transaction ID from the coordinator.
- Application begins a single-node transaction on each participant, attached to the global transaction ID.
- When ready to commit, coordinator sends a prepare request to all participants.
- Each participant ensures it can definitely commit under all circumstances (writes data to disk, checks constraints). By replying "yes," it surrenders the right to abort but doesn't actually commit yet.
- Coordinator makes a definitive commit or abort decision (only commit if all voted "yes"). Writes decision to its transaction log on disk — this is the commit point.
- Coordinator sends commit/abort request to all participants. If it fails, it retries forever — the decision is irrevocable.
Two "points of no return": participant votes "yes" (promises to commit), and coordinator decides (irrevocable).
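The protocol above can be condensed into a toy sketch (class and function names are my own, and the durable logging, network, and retry machinery that make real 2PC work are reduced to comments).

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: promise to commit under all circumstances (a real
        # database writes everything to disk and checks constraints here).
        if self.can_commit:
            self.state = "prepared"   # surrenders the right to abort
            return "yes"
        self.state = "aborted"
        return "no"

    def finish(self, decision):
        # Phase 2: apply the coordinator's irrevocable decision.
        self.state = decision

def two_phase_commit(participants):
    # Phase 1: collect votes; any "no" forces a global abort.
    votes = [p.prepare() for p in participants]
    decision = "committed" if all(v == "yes" for v in votes) else "aborted"
    # Commit point: a real coordinator durably logs `decision` here, then
    # retries phase 2 forever until every participant has heard it.
    for p in participants:
        p.finish(decision)
    return decision

ok = [Participant("db1"), Participant("queue")]
assert two_phase_commit(ok) == "committed"

mixed = [Participant("db1"), Participant("db2", can_commit=False)]
assert two_phase_commit(mixed) == "aborted"
assert all(p.state == "aborted" for p in mixed)
```

The in-doubt problem described next is visible in the sketch: if the coordinator vanished between `prepare()` and `finish()`, a prepared participant would be stuck, unable to either commit or abort on its own.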
Coordinator Failure
If the coordinator crashes after participants vote "yes" but before sending the commit/abort request, participants are stuck in doubt (uncertain). They cannot unilaterally commit or abort. The only way to complete is to wait for the coordinator to recover and read its transaction log.
Three-Phase Commit (3PC)
An alternative that is theoretically nonblocking, but assumes bounded network delay and bounded response times — doesn't work in most practical systems. 2PC continues to be used despite the coordinator failure problem.
4.2 Distributed Transactions in Practice
Two types often conflated:
- Database-internal — Nodes of the same distributed database (e.g., VoltDB, MySQL Cluster). Can use optimized internal protocols. Often work quite well.
- Heterogeneous — Different technologies (e.g., two different databases, or a database + message broker). Much more challenging.
Exactly-Once Message Processing
A message from a queue can be acknowledged as processed if and only if the database transaction for processing it was successfully committed. Both the message acknowledgment and the database writes are atomically committed in a single distributed transaction. If either fails, both are aborted → message is safely redelivered.
XA Transactions
X/Open XA (eXtended Architecture) — A standard (since 1991) for implementing 2PC across heterogeneous technologies. Supported by PostgreSQL, MySQL, DB2, SQL Server, Oracle, ActiveMQ, HornetQ, MSMQ, IBM MQ.
XA is a C API (not a network protocol) for interfacing with a transaction coordinator. The coordinator is often a library in the same application process. It keeps track of participants and uses a local disk log for commit/abort decisions.
Holding Locks While In Doubt
The critical problem: database transactions take row-level locks on modified rows. These locks cannot be released until the transaction commits or aborts. If the coordinator crashes and takes 20 minutes to restart, locks are held for 20 minutes. If the coordinator's log is lost, locks are held forever (until manual resolution). Other transactions that need the same data are blocked.
Recovering from Coordinator Failure
Orphaned in-doubt transactions can occur when the coordinator cannot determine the outcome (e.g., corrupted log). These sit forever, holding locks. The only way out: an administrator manually decides to commit or roll back — under high stress during a production outage.
Heuristic decisions — An emergency escape hatch allowing a participant to unilaterally decide. "Heuristic" is a euphemism for probably breaking atomicity.
Limitations of Distributed Transactions
- The coordinator is a single point of failure (often not highly available by default).
- Makes application servers stateful (coordinator logs are crucial durable state).
- XA is a lowest common denominator — cannot detect deadlocks across systems, doesn't work with SSI.
- 2PC requires all participants to respond → tendency to amplify failures (runs counter to fault tolerance).
4.3 Fault-Tolerant Consensus
One or more nodes propose values, and the consensus algorithm decides on one of those values.
Properties
- Uniform agreement — No two nodes decide differently.
- Integrity — No node decides twice.
- Validity — If a node decides value v, then v was proposed by some node.
- Termination — Every node that does not crash eventually decides some value (the fault-tolerance property — a liveness property).
The termination property means 2PC does not satisfy consensus (it can get stuck waiting for the coordinator). Consensus requires at least a majority of nodes to be functioning.
Safety properties (agreement, integrity, validity) are always met, even if a majority fails. A large-scale outage stops processing but cannot corrupt the consensus system.
Consensus Algorithms and Total Order Broadcast
Best-known: Viewstamped Replication (VSR), Paxos, Raft, Zab. They decide on a sequence of values (total order broadcast), not just a single value.
- VSR (1988, Oki and Liskov) — One of the earliest fault-tolerant consensus algorithms. Uses a primary replica that proposes operations; if the primary fails, a view change protocol elects a new primary with an incremented view number. Revisited in 2012 by Liskov and Cowling.
- Paxos (Lamport) — The most widely studied. Single-decree Paxos decides one value; Multi-Paxos optimizes for deciding a sequence of values. Notoriously difficult to implement correctly.
- Raft (2014, Ongaro and Ousterhout) — Designed explicitly for understandability. Uses a term number for leader epochs. Widely adopted in practice (etcd, CockroachDB, TiKV).
- Zab (ZooKeeper Atomic Broadcast) — Used by ZooKeeper. Similar to Raft but developed independently.
There are quite a few similarities between these algorithms, but they are not the same. VSR, Raft, and Zab implement total order broadcast directly. In the case of Paxos, this optimization is known as Multi-Paxos.
Total order broadcast = repeated rounds of consensus (each round decides the next message to deliver):
- Agreement → all nodes deliver same messages in same order
- Integrity → messages not duplicated
- Validity → messages not corrupted or fabricated
- Termination → messages not lost
Epoch Numbering and Quorums
All consensus protocols use a leader internally, but don't guarantee the leader is unique. Instead: an epoch number (ballot number in Paxos, view number in VSR, term number in Raft) guarantees the leader is unique within each epoch.
- When the leader is thought to be dead → vote to elect a new leader with an incremented epoch number.
- Before deciding anything, the leader must collect votes from a quorum (typically a majority).
- Two rounds of voting: one to choose a leader, one to vote on a proposal. The quorums must overlap — at least one node participated in both votes.
Key differences from 2PC: the coordinator is elected (not fixed), and only a majority is needed (not every participant).
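The quorum-overlap argument is a small counting fact worth seeing concretely: any two strict majorities of the same node set must share at least one node. A brute-force check (helper name is my own):

```python
from itertools import combinations

def majorities(nodes):
    """All quorums of strict-majority size for a given node set."""
    q = len(nodes) // 2 + 1
    return [set(c) for c in combinations(nodes, q)]

# Any two majority quorums intersect, so the quorum that elects a new
# leader always shares a node with the quorum that voted on the previous
# leader's proposals; that node carries knowledge of prior decisions.
nodes = ["n1", "n2", "n3", "n4", "n5"]
assert all(a & b for a in majorities(nodes) for b in majorities(nodes))
```

With 5 nodes a majority is 3, and since 3 + 3 > 5 two disjoint majorities are impossible; the same arithmetic underlies the "3 nodes tolerate 1 failure, 5 tolerate 2" rule below.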
Limitations of Consensus
- Voting is a kind of synchronous replication → performance cost.
- Requires a strict majority (3 nodes to tolerate 1 failure, 5 nodes to tolerate 2).
- Assumes a fixed set of nodes (dynamic membership extensions are less well understood).
- Relies on timeouts for failure detection → frequent false leader elections in environments with variable network delays → terrible performance.
- Sensitive to network problems (e.g., Raft can bounce leadership between two nodes if one network link is unreliable).
4.4 Membership and Coordination Services
ZooKeeper and etcd are designed to hold small amounts of data that fit in memory, replicated using fault-tolerant total order broadcast.
ZooKeeper is modeled after Google's Chubby lock service (2006, Mike Burrows). Chubby implements not only consensus but also a lock service with advisory locks, small-file storage, and session/lease management. It is used internally at Google for leader election, naming, and configuration. ZooKeeper is the open-source equivalent, implementing the same ideas with an interesting set of additional features.
ZooKeeper Features
- Linearizable atomic operations — Atomic compare-and-set for implementing distributed locks (implemented as leases with expiry time).
- Total ordering of operations — Monotonically increasing transaction ID (zxid) and version number (cversion) serve as fencing tokens.
- Failure detection — Clients maintain long-lived sessions with periodic heartbeats. If heartbeats cease beyond the session timeout → session declared dead → locks automatically released (ephemeral nodes).
- Change notifications — Clients can watch for changes (new nodes joining, sessions timing out). Avoids frequent polling.
Use Cases
- Allocating work to nodes — Leader/primary election, partition assignment, rebalancing. Uses atomic operations, ephemeral nodes, and notifications.
- Service discovery — Services register their network endpoints in ZooKeeper when they start up. (DNS is the traditional approach and doesn't require consensus; but if your consensus system already knows the leader, it makes sense to use it for service discovery too.)
- Membership services — Determining which nodes are currently active and live members of a cluster. Coupling failure detection with consensus allows nodes to agree on which nodes are alive or dead.
ZooKeeper runs on a fixed number of nodes (usually 3 or 5) and performs majority votes among those nodes, while supporting a potentially large number of clients — "outsourcing" coordination work to an external service.
Used by: HBase, Hadoop YARN, OpenStack Nova, Kafka. MongoDB uses its own config server. LinkedIn's Espresso uses Helix (built on ZooKeeper).
Summary
Equivalent Problems (All Reducible to Consensus)
| Problem | What is "decided" |
|---|---|
| Linearizable compare-and-set registers | Whether to set the value based on current value |
| Atomic transaction commit | Whether to commit or abort |
| Total order broadcast | The order in which to deliver messages |
| Locks and leases | Which client acquired the lock |
| Membership/coordination service | Which nodes are alive or dead |
| Uniqueness constraint | Which conflicting write to allow |
Three Ways to Handle Leader Failure
- Wait for recovery — System blocked in the meantime (XA/JTA coordinators). Doesn't satisfy termination.
- Manual failover — Humans choose a new leader. "Consensus by act of God." Limited by human speed.
- Automatic leader election — Requires a consensus algorithm (VSR, Paxos, Raft, Zab). The recommended approach.
Key Insights
- Linearizability makes replicated data appear as a single copy with atomic operations. Easy to understand but slow (especially with network delays). CAP theorem is of mostly historical interest.
- Causal consistency is the strongest model that doesn't slow down due to network delays and remains available during network failures. Many systems that seem to need linearizability actually only need causal consistency.
- Lamport timestamps provide a total order consistent with causality but are not sufficient for real-time decisions (the order only emerges after the fact).
- Total order broadcast is equivalent to consensus. It provides the foundation for database replication, serializable transactions, and lock services.
- 2PC provides atomic commit but is a blocking protocol (coordinator failure → participants stuck in doubt holding locks).
- Fault-tolerant consensus (Raft, Paxos, Zab) uses epoch numbering and quorum voting. Requires a majority of nodes. Provides total order broadcast and linearizable operations.
- ZooKeeper/etcd outsource consensus, failure detection, and membership services to a dedicated coordination service.
Important References
- Maurice P. Herlihy and Jeannette M. Wing: "Linearizability: A Correctness Condition for Concurrent Objects," ACM TOPLAS, volume 12, number 3, pages 463–492, July 1990.
- Leslie Lamport: "Time, Clocks, and the Ordering of Events in a Distributed System," Communications of the ACM, volume 21, number 7, pages 558–565, July 1978.
- Michael J. Fischer, Nancy Lynch, and Michael S. Paterson: "Impossibility of Distributed Consensus with One Faulty Process," Journal of the ACM, volume 32, number 2, pages 374–382, April 1985.
- Diego Ongaro and John K. Ousterhout: "In Search of an Understandable Consensus Algorithm (Extended Version)," at USENIX ATC, June 2014.
- Leslie Lamport: "Paxos Made Simple," ACM SIGACT News, volume 32, number 4, pages 51–58, December 2001.
- Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini: "Zab: High-Performance Broadcast for Primary-Backup Systems," at 41st IEEE DSN, June 2011.
- Flavio P. Junqueira and Benjamin Reed: ZooKeeper: Distributed Process Coordination, O'Reilly Media, 2013.
- Mike Burrows: "The Chubby Lock Service for Loosely-Coupled Distributed Systems," at 7th USENIX OSDI, November 2006.
- Eric A. Brewer: "CAP Twelve Years Later: How the 'Rules' Have Changed," IEEE Computer Magazine, volume 45, number 2, pages 23–29, February 2012.
- Martin Kleppmann: "Please Stop Calling Databases CP or AP," martin.kleppmann.com, May 11, 2015.
- Seth Gilbert and Nancy Lynch: "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services," ACM SIGACT News, volume 33, number 2, pages 51–59, June 2002.
- Jim N. Gray and Leslie Lamport: "Consensus on Transaction Commit," ACM TODS, volume 31, number 1, pages 133–160, March 2006.
- Pat Helland: "Life Beyond Distributed Transactions: An Apostate's Opinion," at 3rd CIDR, January 2007.
- Tushar Deepak Chandra and Sam Toueg: "Unreliable Failure Detectors for Reliable Distributed Systems," Journal of the ACM, volume 43, number 2, pages 225–267, March 1996.
- Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman: Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987.