Replication
Designing Data-Intensive Applications by Martin Kleppmann, Chapter 5: Replication
Introduction
Replication means keeping a copy of the same data on multiple machines connected via a network. Reasons to replicate data:
- Keep data geographically close to users (reduce latency)
- Allow the system to continue working even if some parts have failed (increase availability)
- Scale out the number of machines that can serve read queries (increase read throughput)
All of the difficulty in replication lies in handling changes to replicated data. Three popular algorithms:
- Single-leader replication
- Multi-leader replication
- Leaderless replication
Section 1: Leaders and Followers
Each node that stores a copy of the database is called a replica. The most common solution is leader-based replication (also known as active/passive or master–slave replication):
- One replica is designated the leader (master/primary). All writes must be sent to the leader.
- Other replicas are followers (read replicas/slaves/secondaries/hot standbys). The leader sends data changes to all followers as a replication log or change stream. Each follower applies writes in the same order.
- Reads can go to the leader or any follower. Writes are only accepted on the leader.
Used by: PostgreSQL (9.0+), MySQL, Oracle Data Guard, SQL Server AlwaysOn, MongoDB, RethinkDB, Espresso, Kafka, RabbitMQ highly available queues, DRBD.
1.1 Synchronous Versus Asynchronous Replication
- Synchronous — The leader waits until the follower confirms it received the write before reporting success to the user.
- Advantage: follower is guaranteed to have an up-to-date copy consistent with the leader.
- Disadvantage: if the synchronous follower doesn't respond, the write cannot be processed. The leader must block all writes.
- Asynchronous — The leader sends the message but doesn't wait for a response.
Semi-Synchronous
It is impractical for all followers to be synchronous (any one node outage halts the whole system). In practice: one follower is synchronous, the others are asynchronous. If the synchronous follower becomes unavailable, one of the asynchronous followers is made synchronous. This guarantees an up-to-date copy on at least two nodes (leader + one synchronous follower).
Fully Asynchronous
- If the leader fails and is not recoverable, any writes not yet replicated to followers are lost (write is not guaranteed to be durable even if confirmed to the client).
- Advantage: the leader can continue processing writes even if all followers have fallen behind.
- Asynchronous replication is widely used, especially with many followers or geographically distributed replicas.
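The trade-off between synchronous and asynchronous replication can be seen in a minimal sketch (not any real database's API; the Leader/Follower classes are invented for illustration). In synchronous mode the leader cannot confirm a write that the follower did not receive; in asynchronous mode the leader confirms anyway, so that write can be lost if the leader then crashes.

```python
class Follower:
    def __init__(self, reachable=True):
        self.reachable = reachable
        self.data = {}

    def apply(self, key, value):
        if not self.reachable:
            raise ConnectionError("follower unreachable")
        self.data[key] = value


class Leader:
    def __init__(self, follower, synchronous):
        self.follower = follower
        self.synchronous = synchronous
        self.data = {}

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Block until the follower confirms; an unreachable follower
            # means the write cannot be reported as successful.
            self.follower.apply(key, value)
        else:
            # Fire-and-forget: report success even if replication failed,
            # so this write is lost if the leader crashes before retrying.
            try:
                self.follower.apply(key, value)
            except ConnectionError:
                pass
        return "ok"
```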
Research on Replication
- Losing data when the leader fails is a serious problem for asynchronously replicated systems, so researchers have continued investigating replication methods that do not lose data but still provide good performance and availability.
- Chain replication is a variant of synchronous replication that has been successfully implemented in systems such as Microsoft Azure Storage. In chain replication, writes are propagated through a chain of nodes in sequence (head → intermediate → tail), and reads are served from the tail. This provides strong consistency while distributing the load of serving reads.
- There is a strong connection between consistency of replication and consensus (getting several nodes to agree on a value), explored in detail in Chapter 9.
1.2 Setting Up New Followers
How to ensure a new follower has an accurate copy without downtime:
- Take a consistent snapshot of the leader's database at some point in time (without locking the entire database). Third-party tools exist, such as innobackupex for MySQL.
- Copy the snapshot to the new follower node.
- The follower connects to the leader and requests all data changes since the snapshot. The snapshot must be associated with an exact position in the leader's replication log (PostgreSQL: log sequence number; MySQL: binlog coordinates).
- When the follower has processed the backlog, it has caught up and can continue processing changes as they happen.
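The steps above can be sketched as follows. This is a hypothetical illustration (ReplicationLog, snapshot, changes_since, and bootstrap_follower are all invented names); the key point is that the snapshot records the exact log position it corresponds to, like PostgreSQL's log sequence number or MySQL's binlog coordinates.

```python
class ReplicationLog:
    def __init__(self):
        self.entries = []  # each entry: (key, value)

    def append(self, key, value):
        self.entries.append((key, value))


class Leader:
    def __init__(self):
        self.data = {}
        self.log = ReplicationLog()

    def write(self, key, value):
        self.data[key] = value
        self.log.append(key, value)

    def snapshot(self):
        # A consistent snapshot together with the log position it
        # corresponds to (analogous to an LSN / binlog coordinates).
        return dict(self.data), len(self.log.entries)

    def changes_since(self, position):
        return self.log.entries[position:]


def bootstrap_follower(leader):
    data, position = leader.snapshot()                 # steps 1-2: snapshot + copy
    for key, value in leader.changes_since(position):  # step 3: replay the backlog
        data[key] = value
    return data                                        # step 4: caught up
```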
1.3 Handling Node Outages
Follower Failure: Catch-up Recovery
- Each follower keeps a log of data changes received from the leader.
- On crash/restart or network interruption, the follower connects to the leader and requests all changes that occurred while it was disconnected.
- Once applied, it has caught up and continues receiving the stream.
Leader Failure: Failover
One of the followers must be promoted to be the new leader. This process is called failover.
Automatic failover steps:
- Determining the leader has failed — Most systems use a timeout (e.g., 30 seconds without response → assumed dead).
- Choosing a new leader — Election by majority of remaining replicas, or appointment by a previously elected controller node. Best candidate: replica with the most up-to-date data (minimize data loss). This is a consensus problem.
- Reconfiguring the system — Clients send writes to the new leader. If the old leader comes back, it must become a follower and recognize the new leader.
Things That Can Go Wrong with Failover
- Unreplicated writes lost — If async replication, the new leader may not have all writes from the old leader. Most common solution: old leader's unreplicated writes are discarded (may violate durability expectations).
- Dangerous with external coordination — GitHub incident: out-of-date MySQL follower promoted to leader, reused autoincrementing primary keys that were also used in Redis → private data disclosed to wrong users.
- Split brain — Two nodes both believe they are the leader. If both accept writes with no conflict resolution → data loss or corruption. Safety catch: STONITH (Shoot The Other Node In The Head) — shut down one node if two leaders detected. If poorly designed, both nodes may be shut down.
- Timeout too short — Unnecessary failovers during temporary load spikes or network glitches, making the situation worse.
For these reasons, some operations teams prefer manual failover even if automatic failover is supported.
1.4 Implementation of Replication Logs
Statement-Based Replication
- Leader logs every write statement (INSERT, UPDATE, DELETE) and sends to followers.
- Problems: Nondeterministic functions (NOW(), RAND()) generate different values on each replica; statements depending on existing data must execute in exactly the same order; side effects (triggers, stored procedures) may differ.
- Used in MySQL before version 5.1. VoltDB uses it safely by requiring deterministic transactions.
Write-Ahead Log (WAL) Shipping
- The leader sends its WAL (the same log used for crash recovery) across the network to followers.
- Used in PostgreSQL and Oracle.
- Disadvantage: WAL describes data at a very low level (which bytes changed in which disk blocks) → closely coupled to the storage engine. Cannot run different software versions on leader and followers → zero-downtime upgrades are difficult.
Logical (Row-Based) Log Replication
- Uses a different log format from the storage engine (logical log vs. physical representation).
- A sequence of records describing writes at the granularity of a row:
- Inserted row: new values of all columns.
- Deleted row: enough info to uniquely identify the row (primary key, or all column values).
- Updated row: enough info to identify the row + new values of changed columns.
- Decoupled from storage engine → backward compatible, allows different software versions or even different storage engines on leader and followers.
- Easier for external applications to parse → enables change data capture (sending database contents to data warehouses, custom indexes, caches).
- MySQL's binlog (row-based replication) uses this approach.
Trigger-Based Replication
- Uses database triggers and stored procedures to log changes to a separate table, read by an external process that replicates to another system.
- Examples: Oracle GoldenGate, Databus for Oracle, Bucardo for Postgres.
- Greater overhead and more prone to bugs, but useful for flexibility (replicate a subset of data, replicate between different database types, custom conflict resolution).
Section 2: Problems with Replication Lag
Leader-based replication requires all writes to go through a single node, but read-only queries can go to any replica. For read-heavy workloads (common on the web), you can create many followers and distribute reads across them — a read-scaling architecture.
This only realistically works with asynchronous replication (synchronous replication to all followers would mean any single node failure makes the entire system unavailable for writing).
Eventual Consistency
If an application reads from an asynchronous follower, it may see outdated information. If you run the same query on the leader and a follower at the same time, you may get different results. This inconsistency is temporary — if you stop writing and wait, followers will eventually catch up. This is called eventual consistency.
The replication lag (delay between a write on the leader and being reflected on a follower) may be a fraction of a second in normal operation, but can increase to several seconds or even minutes under load or network problems.
Three examples of problems caused by replication lag:
2.1 Reading Your Own Writes
A user submits data (sent to the leader), then views it (read from a follower). If the follower hasn't caught up, the user's own data appears to be lost.
We need read-after-write consistency (also called read-your-writes consistency): if the user reloads the page, they will always see any updates they submitted themselves. No promises about other users' updates.
Techniques to Implement Read-After-Write Consistency
- Read from the leader for things the user may have modified. Example: on a social network, always read the user's own profile from the leader, other users' profiles from followers.
- Track time of last update — For one minute after the last update, make all reads from the leader. Or monitor replication lag and prevent queries on followers more than one minute behind.
- Client remembers timestamp of most recent write — The system ensures the replica serving reads reflects updates at least until that timestamp. The timestamp could be a logical timestamp (log sequence number) or the actual system clock (requires clock synchronization).
- Multi-datacenter complexity — Any request needing the leader must be routed to the datacenter containing the leader.
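The "client remembers the timestamp of its most recent write" technique can be sketched with a logical log position as the timestamp. This is an invented System/Replica API, and falling back to the leader is only one possible policy (alternatives include waiting or trying another replica).

```python
class Replica:
    def __init__(self):
        self.data = {}
        self.applied_position = 0  # how far this replica has replayed the log


class System:
    def __init__(self):
        self.leader = Replica()
        self.follower = Replica()

    def write(self, key, value):
        self.leader.data[key] = value
        self.leader.applied_position += 1
        return self.leader.applied_position  # client remembers this position

    def read(self, key, min_position):
        replica = self.follower
        if replica.applied_position < min_position:
            # Follower hasn't caught up to the client's last write:
            # serve from the leader to guarantee read-after-write.
            replica = self.leader
        return replica.data.get(key)
```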
Cross-Device Read-After-Write Consistency
If the user enters information on one device and views it on another:
- Timestamp metadata needs to be centralized (code on one device doesn't know about updates on another).
- If replicas are in different datacenters, no guarantee connections from different devices route to the same datacenter — may need to route all of a user's devices to the same datacenter.
2.2 Monotonic Reads
A user makes several reads from different replicas. They might see data from a fresh replica first, then from a stale replica — appearing as though time went backward (e.g., a comment appears, then disappears on refresh).
Monotonic reads guarantee that if one user makes several reads in sequence, they will not see time go backward — they will not read older data after having previously read newer data. It's a lesser guarantee than strong consistency, but stronger than eventual consistency.
Implementation: Make sure each user always reads from the same replica (e.g., chosen based on a hash of the user ID, not randomly). If that replica fails, reroute to another.
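A minimal sketch of that replica-pinning idea, using a stable hash of the user ID (SHA-256 rather than Python's per-process randomized hash(), so the mapping is consistent across clients):

```python
import hashlib

def replica_for_user(user_id, replicas):
    # Hash the user ID to a replica index, so every read by the same
    # user goes to the same replica (until that replica fails).
    digest = hashlib.sha256(user_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]
```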
2.3 Consistent Prefix Reads
A violation of causality: an observer sees the answer before the question.
Example: Mr. Poons asks a question, Mrs. Cake answers. But if Mrs. Cake's words go through a follower with little lag and Mr. Poons's words go through a follower with greater lag, a third person sees the answer before the question.
Consistent prefix reads guarantee: if a sequence of writes happens in a certain order, anyone reading those writes will see them appear in the same order.
- This is a particular problem in partitioned (sharded) databases where different partitions operate independently with no global ordering of writes.
- Solution: Make sure causally related writes are written to the same partition. Or use algorithms that explicitly track causal dependencies.
2.4 Solutions for Replication Lag
- If the application can tolerate replication lag growing to minutes or even hours, no special measures are needed.
- If it results in a bad user experience, design the system to provide stronger guarantees (read-after-write, monotonic reads, etc.).
- Dealing with these issues in application code is complex and easy to get wrong.
- Transactions exist to provide stronger guarantees so the application can be simpler. However, many distributed databases have abandoned transactions, claiming they are too expensive — this is overly simplistic.
Section 3: Multi-Leader Replication
Leader-based replication has one major downside: there is only one leader, and all writes must go through it. If you can't connect to the leader, you can't write.
Multi-leader configuration (master–master / active/active): allow more than one node to accept writes. Each leader simultaneously acts as a follower to the other leaders.
3.1 Use Cases
Multi-Datacenter Operation
With a leader in each datacenter, regular leader–follower replication is used within each datacenter; between datacenters, each leader replicates changes to leaders in other datacenters.
| | Single-Leader | Multi-Leader |
|---|---|---|
| Performance | Every write goes over the internet to the leader's datacenter (significant latency) | Every write processed in the local datacenter, replicated asynchronously (inter-datacenter delay hidden from users) |
| Datacenter outage tolerance | Failover needed to promote a follower in another datacenter | Each datacenter continues operating independently; replication catches up when failed datacenter comes back |
| Network problem tolerance | Very sensitive to inter-datacenter link problems (writes are synchronous over this link) | Tolerates network problems better (async replication) |
Tools: Tungsten Replicator (MySQL), BDR (PostgreSQL), GoldenGate (Oracle).
Big downside: The same data may be concurrently modified in two different datacenters → write conflicts must be resolved. Multi-leader replication is often considered dangerous territory that should be avoided if possible.
Clients with Offline Operation
Applications that need to work while disconnected (e.g., calendar apps on mobile/laptop). Every device has a local database acting as a leader; asynchronous multi-leader replication (a sync process) runs whenever the device is online. Each device is essentially a "datacenter" with an extremely unreliable network connection. CouchDB is designed for this mode.
Collaborative Editing
Real-time collaborative editing (Etherpad, Google Docs) is essentially multi-leader replication. For fast collaboration, make the unit of change very small (e.g., a single keystroke) and avoid locking — but this brings all the challenges of multi-leader replication including conflict resolution.
3.2 Handling Write Conflicts
Conflict Avoidance
The simplest strategy: ensure all writes for a particular record go through the same leader. Example: route a particular user's requests to the same datacenter. Breaks down if you need to change the designated leader (datacenter failure, user relocation).
Converging Toward a Consistent State
In multi-leader, there is no defined ordering of writes. The database must resolve conflicts in a convergent way (all replicas arrive at the same final value):
- Last Write Wins (LWW) — Give each write a unique ID (e.g., timestamp), pick the highest ID as the winner, discard others. Popular but dangerously prone to data loss.
- Replica priority — Higher-numbered replica always takes precedence. Also implies data loss.
- Merge values — E.g., order alphabetically and concatenate ("B/C").
- Record the conflict — Preserve all information in an explicit data structure, resolve later (perhaps by prompting the user).
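Two of the convergent strategies above, sketched side by side (helper names are invented; the unique IDs here are just integers standing in for timestamps):

```python
def last_write_wins(writes):
    # writes: list of (unique_id, value); the highest ID wins and all
    # other concurrent writes are silently discarded (data loss!).
    return max(writes)[1]

def merge_values(writes):
    # Preserve every value by sorting and concatenating, e.g. "B/C".
    return "/".join(sorted(value for _, value in writes))
```

With conflicting writes [(1, "B"), (2, "C")], last_write_wins returns "C" and discards "B", while merge_values returns "B/C" and keeps both.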
Custom Conflict Resolution Logic
- On write — Database calls a conflict handler as soon as conflict is detected (e.g., Bucardo allows Perl snippets). Runs in background, must execute quickly.
- On read — All conflicting writes are stored. Next time data is read, multiple versions are returned to the application, which resolves the conflict. CouchDB works this way.
Conflict resolution usually applies at the level of an individual row or document, not an entire transaction.
Automatic Conflict Resolution
- CRDTs (Conflict-free Replicated Datatypes) — Data structures (sets, maps, ordered lists, counters) that can be concurrently edited and automatically resolve conflicts. Implemented in Riak 2.0.
- Mergeable persistent data structures — Track history explicitly (like Git), use a three-way merge function.
- Operational transformation — Algorithm behind Etherpad and Google Docs, designed for concurrent editing of ordered lists (e.g., characters in a text document).
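To make the CRDT idea concrete, here is a grow-only counter (G-Counter), one of the simplest CRDTs: each replica increments only its own slot, and merging takes the per-replica maximum, so replicas converge to the same value regardless of merge order. (A sketch of the general technique, not Riak's implementation.)

```python
class GCounter:
    def __init__(self, replica_id, n_replicas):
        self.replica_id = replica_id
        self.counts = [0] * n_replicas  # one slot per replica

    def increment(self):
        # Each replica only ever touches its own slot.
        self.counts[self.replica_id] += 1

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so concurrent edits converge no matter how merges are ordered.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]
```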
3.3 Multi-Leader Replication Topologies
- Circular topology — Each node receives writes from one node and forwards to another. MySQL's default.
- Star topology — One designated root node forwards writes to all others. Can be generalized to a tree.
- All-to-all — Every leader sends writes to every other leader. Most general.
In circular and star topologies, writes pass through several nodes. Each write is tagged with node identifiers to prevent infinite replication loops. Problem: if one node fails, it interrupts the flow (manual reconfiguration needed).
All-to-all has better fault tolerance (messages travel along different paths), but can have issues with writes arriving in the wrong order due to different network link speeds:
This is a causality problem: an update depends on a prior insert, but a replica may receive the update before the insert. Timestamps are not sufficient (clocks can't be trusted). Version vectors can be used to order events correctly, but conflict detection is poorly implemented in many multi-leader systems.
Section 4: Leaderless Replication
Some systems abandon the concept of a leader entirely, allowing any replica to directly accept writes from clients. Popularized by Amazon's in-house Dynamo system. Open source Dynamo-style datastores: Riak, Cassandra, Voldemort.
4.1 Writing When a Node Is Down
In a leaderless configuration, the client sends the write to all replicas in parallel. If some are unavailable, the write succeeds as long as enough replicas acknowledge it.
When reading, the client sends requests to several nodes in parallel and uses version numbers to determine which value is newer.
Read Repair
When a client reads from several nodes and detects a stale response, it writes the newer value back to the stale replica. Works well for frequently read values.
Anti-Entropy Process
A background process that constantly looks for differences between replicas and copies missing data. Does not copy writes in any particular order; there may be a significant delay.
4.2 Quorums for Reading and Writing
If there are n replicas, every write must be confirmed by w nodes, and we must query at least r nodes for each read. As long as w + r > n, we expect to get an up-to-date value when reading (at least one of the r nodes must have the latest value).
- Common choice: n odd (typically 3 or 5), w = r = (n + 1) / 2.
- w = n, r = 1 → optimized for fast reads.
- w = 1, r = n → optimized for fast writes.
- With n = 3, w = 2, r = 2 → tolerate one unavailable node.
- With n = 5, w = 3, r = 3 → tolerate two unavailable nodes.
- Reads and writes are always sent to all n replicas in parallel; w and r determine how many we wait for.
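The quorum condition is just an overlap argument: if w + r > n, the set of nodes written to and the set of nodes read from must share at least one node, which therefore has the latest value. A one-line check:

```python
def quorum_overlaps(n, w, r):
    # True if every read set of size r must intersect every write set
    # of size w among n replicas (pigeonhole: w + r > n).
    return w + r > n
```

For example, n=3, w=2, r=2 satisfies the condition (and tolerates one node down), while n=3, w=1, r=1 does not: a read may land entirely on replicas that missed the write.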
4.3 Limitations of Quorum Consistency
Even with w + r > n, stale values can be returned due to:
- Sloppy quorums — w writes may end up on different nodes than the r reads.
- Concurrent writes — No clear ordering; if LWW is used, writes can be lost due to clock skew.
- Concurrent write and read — Write may be reflected on only some replicas.
- Write succeeded on fewer than w replicas — Not rolled back on replicas where it succeeded.
- Node restored from stale replica — Number of replicas with new value may fall below w.
Dynamo-style databases are optimized for use cases that can tolerate eventual consistency. The parameters w and r adjust the probability of stale reads but are not absolute guarantees.
4.4 Sloppy Quorums and Hinted Handoff
During a network interruption, a client may not be able to reach the n "home" nodes for a value, but can reach other nodes in the cluster.
- Sloppy quorum — Accept writes on reachable nodes that aren't among the designated n "home" nodes. (Like staying on a neighbor's couch when locked out of your house.)
- Hinted handoff — Once the network interruption is fixed, writes temporarily accepted by other nodes are sent to the appropriate "home" nodes. (Your neighbor asks you to go home once you find your keys.)
Sloppy quorums increase write availability but mean that even with w + r > n, you cannot be sure to read the latest value (it may be on nodes outside of n). It's only an assurance of durability, not consistency.
- Riak: sloppy quorums enabled by default.
- Cassandra and Voldemort: disabled by default.
4.5 Detecting Concurrent Writes
Dynamo-style databases allow several clients to concurrently write to the same key → conflicts arise even with strict quorums.
Last Write Wins (LWW)
- Attach a timestamp to each write, pick the biggest as the winner, discard others.
- Only supported conflict resolution in Cassandra; optional in Riak.
- Achieves eventual convergence but at the cost of durability — only one of several concurrent writes survives, others are silently discarded.
- Safe only if a key is written once and thereafter treated as immutable (e.g., use a UUID as the key).
The "Happens-Before" Relationship and Concurrency
- Operation A happens before B if B knows about A, depends on A, or builds upon A.
- Two operations are concurrent if neither happens before the other (neither knows about the other).
- Three possibilities: A happened before B, B happened before A, or A and B are concurrent.
Capturing Causal Dependencies (Version Numbers)
Algorithm:
- Server maintains a version number for every key, incremented on every write.
- When a client reads, the server returns all values not yet overwritten + the latest version number. Client must read before writing.
- When a client writes, it includes the version number from the prior read and must merge all values it received.
- Server overwrites all values with that version number or below, but keeps values with a higher version number (those are concurrent).
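The four steps above, sketched for a single key on a single node (VersionedKey is an invented name). The server keeps sibling values tagged with the version at which each was written; a write overwrites only the siblings the client has already seen, i.e., those at or below the version from its prior read.

```python
class VersionedKey:
    def __init__(self):
        self.version = 0
        self.siblings = []  # list of (version, value)

    def read(self):
        # Return all values not yet overwritten, plus the latest version.
        return self.version, [value for _, value in self.siblings]

    def write(self, prior_version, merged_value):
        # prior_version: the version number from the client's last read
        # (0 if the client never read); merged_value must merge all
        # sibling values the client received in that read.
        self.version += 1
        # Keep only siblings the client could not have seen (concurrent).
        self.siblings = [s for s in self.siblings if s[0] > prior_version]
        self.siblings.append((self.version, merged_value))
        return self.version
```

Two clients writing with prior_version=0 produce two siblings; a later write that includes the version of a read merging them collapses the siblings back to one value.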
Merging Concurrently Written Values (Siblings)
- Riak calls concurrent values siblings.
- Simple approach: take the union of sibling values (e.g., merge shopping carts).
- Problem with removals: a removed item reappears in the union. Solution: use a tombstone (deletion marker with a version number) instead of actually deleting.
- CRDTs (e.g., Riak's datatype support) can automatically merge siblings, including preserving deletions.
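A simplified sketch of the tombstone idea (not Riak's actual data model): each sibling records items as present (True) or deleted (False, a tombstone), and the merge lets a tombstone win over plain presence, so a naive union cannot resurrect a removed item.

```python
def merge_siblings(siblings):
    # siblings: list of carts, each mapping item -> True (present)
    # or False (tombstone recording a deletion).
    merged = {}
    for cart in siblings:
        for item, present in cart.items():
            # A tombstone in any sibling marks the item as removed.
            merged[item] = merged.get(item, True) and present
    return {item for item, present in merged.items() if present}
```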
Version Vectors
- With multiple replicas (no leader), use a version number per replica as well as per key.
- Each replica increments its own version number on writes and tracks version numbers from other replicas.
- The collection of version numbers from all replicas is called a version vector.
- Dotted version vectors (used in Riak 2.0) are a refined variant.
- Version vectors are sent from database to clients on reads, and back to the database on writes (Riak calls this the causal context).
- Note: version vectors ≠ vector clocks (subtle difference — version vectors are the right structure for comparing replica state).
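Comparing two version vectors tells you whether one write happened before the other or the two are concurrent, which is exactly the information needed to decide whether a value can be overwritten or must be kept as a sibling. A minimal sketch (vectors as dicts mapping replica ID to version, with missing entries treated as 0):

```python
def compare(a, b):
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a happened before b"   # b supersedes a
    if b_le_a:
        return "b happened before a"   # a supersedes b
    return "concurrent"                # keep both as siblings
```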
Summary
Key Takeaways
Purposes of Replication:
- High availability — System keeps running even when machines/datacenters go down.
- Disconnected operation — Application works during network interruptions.
- Latency — Data geographically close to users.
- Scalability — Handle higher read volume via replicas.
Three Approaches:
- Single-leader — All writes to one node. Reads from any replica (may be stale). Easy to understand, no conflict resolution.
- Multi-leader — Writes to any of several leaders. More robust to faults/latency, but harder to reason about, requires conflict resolution.
- Leaderless — Writes and reads to/from several nodes in parallel. Quorum-based. Most robust but weakest consistency guarantees.
Consistency Models for Replication Lag:
- Read-after-write consistency — Users always see data they submitted themselves.
- Monotonic reads — Users don't see time go backward.
- Consistent prefix reads — Users see data in causal order.
Concurrency and Conflict Resolution:
- Multi-leader and leaderless allow concurrent writes → conflicts.
- LWW is simple but loses data.
- Version vectors and the happens-before relationship determine concurrency.
- Merging siblings (union, CRDTs, tombstones) preserves data.
Important References
- Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, et al.: "Dynamo: Amazon's Highly Available Key-Value Store," at 21st ACM SOSP, October 2007.
- Leslie Lamport: "Time, Clocks, and the Ordering of Events in a Distributed System," Communications of the ACM, volume 21, number 7, pages 558–565, July 1978.
- Werner Vogels: "Eventually Consistent," ACM Queue, volume 6, number 6, pages 14–19, October 2008.
- Douglas B. Terry: "Replicated Data Consistency Explained Through Baseball," Microsoft Research, Technical Report MSR-TR-2011-137, October 2011.
- Douglas B. Terry, Alan J. Demers, Karin Petersen, et al.: "Session Guarantees for Weakly Consistent Replicated Data," at 3rd PDIS, September 1994.
- Robbert van Renesse and Fred B. Schneider: "Chain Replication for Supporting High Throughput and Availability," at 6th USENIX OSDI, December 2004.
- Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski: "A Comprehensive Study of Convergent and Commutative Replicated Data Types," INRIA Research Report no. 7506, January 2011.
- Martin Kleppmann and Alastair R. Beresford: "A Conflict-Free Replicated JSON Datatype," arXiv:1608.03960, August 13, 2016.
- David K. Gifford: "Weighted Voting for Replicated Data," at 7th ACM SOSP, December 1979.
- Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman: "Flexible Paxos: Quorum Intersection Revisited," arXiv:1608.06696, August 24, 2016.
- Peter Bailis, Shivaram Venkataraman, Michael J. Franklin, et al.: "Quantifying Eventual Consistency with PBS," Communications of the ACM, volume 57, number 8, pages 93–102, August 2014.
- Nuno Preguiça, Carlos Baquero, Paulo Sérgio Almeida, et al.: "Dotted Version Vectors: Logical Clocks for Optimistic Replication," arXiv:1011.5808, November 26, 2010.
- Jesse Newland: "GitHub Availability This Week," github.com, September 14, 2012.
- Yoshinori Matsunobu: "Semi-Synchronous Replication at Facebook," April 1, 2014.
- Chengzheng Sun and Clarence Ellis: "Operational Transformation in Real-Time Group Editors: Issues, Algorithms, and Achievements," at ACM CSCW, November 1998.