Transactions
Designing Data-Intensive Applications by Martin Kleppmann, Chapter 7: Transactions
Introduction
In the harsh reality of data systems, many things can go wrong: the database software or hardware may fail in the middle of a write, the application may crash partway through a series of operations, network interruptions may cut off the application from the database, several clients may overwrite each other's changes, a client may read partially updated data, and race conditions between clients can cause subtle bugs.
Transactions are the mechanism for simplifying these issues. A transaction groups several reads and writes into a logical unit: either the entire transaction succeeds (commit) or it fails (abort/rollback). Error handling becomes much simpler — no need to worry about partial failure.
Transactions were created to simplify the programming model for applications accessing a database. They are not a law of nature — they have advantages and limitations.
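The all-or-nothing behavior is easy to see with a minimal sketch (using Python's built-in sqlite3 module; the table and amounts are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('checking', 500), ('savings', 500)")
conn.commit()

# A transfer that fails partway through is rolled back entirely.
try:
    conn.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 'checking'")
    raise RuntimeError("simulated crash between the two writes")
    # never reached: the matching credit and the commit
    # conn.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 'savings'")
    # conn.commit()
except RuntimeError:
    conn.rollback()  # abort: the first UPDATE is undone

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # both accounts still at 500
```

Because the transaction aborted, the application can safely retry the whole transfer without worrying about a half-applied debit.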
Section 1: The Slippery Concept of a Transaction
1.1 The Meaning of ACID
The safety guarantees provided by transactions are described by ACID: Atomicity, Consistency, Isolation, and Durability (coined in 1983 by Härder and Reuter). In practice, one database's ACID ≠ another's — ACID has become mostly a marketing term.
Systems that don't meet ACID are sometimes called BASE (Basically Available, Soft state, Eventual consistency) — even more vague than ACID.
Atomicity
- In ACID, atomicity is not about concurrency (that's isolation). It describes what happens if a client makes several writes and a fault occurs partway through.
- If the transaction cannot be completed, it is aborted and the database discards or undoes all writes made so far.
- The application can be sure that an aborted transaction changed nothing, so it can safely retry.
- Perhaps abortability would have been a better term.
Consistency
The word "consistency" is terribly overloaded (at least 4 different meanings):
- Replica consistency (Chapter 5)
- Consistent hashing (Chapter 6)
- Linearizability in CAP theorem (Chapter 9)
- ACID consistency — Application-specific invariants that must always be true (e.g., credits and debits must balance)
ACID consistency is a property of the application, not the database. The application defines transactions correctly to preserve invariants. The letter C doesn't really belong in ACID.
Isolation
Concurrently executing transactions are isolated from each other — they cannot step on each other's toes. The classic formalization is serializability: the result is the same as if transactions had run serially (one after another).
In practice, serializable isolation is rarely used due to performance penalty. Some popular databases (e.g., Oracle 11g) don't even implement it — Oracle's "serializable" actually implements snapshot isolation.
Durability
Once a transaction has committed successfully, any data it has written will not be forgotten, even if there is a hardware fault or crash.
- Single-node: data written to nonvolatile storage (hard drive/SSD) + write-ahead log.
- Replicated database: data successfully copied to some number of nodes.
- Perfect durability does not exist — various risk-reduction techniques (writing to disk, replication, backups) should be used together.
1.2 Single-Object and Multi-Object Operations
Single-Object Writes
Storage engines provide atomicity (log for crash recovery) and isolation (lock on each object) on the level of a single object. Some databases also provide:
- Atomic increment operations (no read-modify-write cycle needed)
- Compare-and-set operations (write only if value hasn't changed)
These are useful but are not transactions in the usual sense — a transaction groups multiple operations on multiple objects.
The Need for Multi-Object Transactions
- Foreign key references — Inserting records that refer to each other must keep references valid.
- Denormalized data — Document databases lacking joins encourage denormalization; updating denormalized info requires updating several documents atomically.
- Secondary indexes — Must be updated consistently with the data; without isolation, a record could appear in one index but not another.
Handling Errors and Aborts
A key feature of transactions: can be aborted and safely retried on error. But retrying isn't perfect:
- Transaction may have succeeded but network failed during acknowledgment → performed twice (need deduplication).
- Error due to overload → retrying makes it worse (use exponential backoff).
- Only worth retrying after transient errors (deadlock, network interruption), not permanent errors (constraint violation).
- Side effects outside the database (e.g., sending email) may happen even if transaction is aborted.
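These retry rules can be sketched as a small helper (the error classes and function names here are illustrative assumptions, not a real library's API):

```python
import random
import time

# Which failures count as transient is an application decision; these
# classes are made up for illustration.
class TransientError(Exception): ...   # e.g. deadlock, network interruption
class PermanentError(Exception): ...   # e.g. constraint violation

def run_with_retries(transaction, max_attempts=5, base_delay=0.01):
    """Retry only transient failures, backing off exponentially with jitter."""
    for attempt in range(max_attempts):
        try:
            return transaction()
        except PermanentError:
            raise  # retrying a constraint violation will never succeed
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff avoids making an overload situation worse.
            time.sleep(base_delay * (2 ** attempt) * random.random())

# Usage: a transaction that fails twice with a transient error, then succeeds.
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("deadlock detected")
    return "committed"

print(run_with_retries(flaky))  # "committed" after two retries
```

Note that this sketch does nothing about the first caveat above: if the commit succeeded but the acknowledgment was lost, retrying re-executes the transaction, so deduplication still has to be handled separately.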
Section 2: Weak Isolation Levels
Concurrency bugs are hard to find by testing (only triggered with unlucky timing). Databases provide transaction isolation to hide concurrency issues. Serializable isolation guarantees transactions have the same effect as if they ran serially — but it has a performance cost, so many systems use weaker levels.
2.1 Read Committed
The most basic level. Two guarantees:
- No dirty reads — You only see data that has been committed.
- No dirty writes — You only overwrite data that has been committed.
No Dirty Reads
Why prevent dirty reads:
- Seeing a partially updated state is confusing and may cause incorrect decisions.
- If a transaction aborts, you'd see data that was never actually committed (rolled back).
No Dirty Writes
Example: Alice and Bob simultaneously buy the same car. Without preventing dirty writes, the sale could be awarded to Bob but the invoice sent to Alice.
Note: read committed does not prevent the race condition in Figure 7-1 (two counter increments) — that's a different problem (lost updates).
Implementation
- Dirty writes prevented by row-level locks: a transaction must acquire a lock on an object before modifying it, held until commit or abort.
- Dirty reads prevented by remembering both the old committed value and the new uncommitted value. Other transactions read the old value until the new value is committed. (Using read locks would harm performance — one long write transaction blocks all reads.)
- Read committed is the default in Oracle 11g, PostgreSQL, SQL Server 2012, MemSQL, and many others.
2.2 Snapshot Isolation and Repeatable Read
Read committed still allows concurrency bugs. Example: read skew (nonrepeatable read).
Alice has $1,000 split across two accounts ($500 each). A transaction transfers $100 between them. If Alice reads one account before the transfer and the other after, she sees only $900 total — $100 has "vanished." This is acceptable under read committed (both values were committed when read), but problematic for:
- Backups — A backup taken during writes could contain inconsistencies that become permanent on restore.
- Analytic queries and integrity checks — Scanning large parts of the database would return nonsensical results if observing different points in time.
Snapshot isolation solves this: each transaction reads from a consistent snapshot of the database — all data committed at the start of the transaction. Even if data is subsequently changed, each transaction sees only the old data from that point in time.
Supported by: PostgreSQL, MySQL (InnoDB), Oracle, SQL Server, and others.
Implementation: Multi-Version Concurrency Control (MVCC)
- Key principle: readers never block writers, and writers never block readers.
- The database keeps several different committed versions of an object (various in-progress transactions may need different points in time).
- Each transaction gets a unique, always-increasing transaction ID (txid).
- Each row has a `created_by` field (txid that inserted it) and a `deleted_by` field (txid that requested deletion, initially empty).
- An update is internally translated into a delete + create.
- A garbage collection process removes rows marked for deletion when no transaction can access them anymore.
Visibility Rules
An object is visible to a transaction if:
- At the time the reader's transaction started, the transaction that created the object had already committed.
- The object is not marked for deletion, or if it is, the transaction that requested deletion had not yet committed when the reader's transaction started.
More specifically:
- Any writes by transactions in progress at snapshot time are ignored (even if they subsequently commit).
- Any writes by aborted transactions are ignored.
- Any writes by transactions with a later txid are ignored (regardless of commit status).
- All other writes are visible.
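The visibility rules above can be modeled in a few lines (a toy model for illustration; the field and function names are assumptions, not any database's internals):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Version:
    created_by: int                 # txid that inserted this row version
    deleted_by: Optional[int] = None  # txid that requested deletion, if any

def visible(v: Version, snapshot_txid: int, in_progress: set, aborted: set) -> bool:
    """Apply the MVCC visibility rules for a snapshot taken by snapshot_txid."""
    def committed_before_snapshot(txid):
        return (txid <= snapshot_txid        # not a later transaction
                and txid not in in_progress  # not still running at snapshot time
                and txid not in aborted)     # not rolled back
    if not committed_before_snapshot(v.created_by):
        return False
    # A deletion only hides the row if the deleting txn committed before us.
    if v.deleted_by is not None and committed_before_snapshot(v.deleted_by):
        return False
    return True

# Transaction 13 takes a snapshot while txid 12 is still in progress.
assert visible(Version(created_by=10), 13, in_progress={12}, aborted=set())
assert not visible(Version(created_by=12), 13, in_progress={12}, aborted=set())  # in progress
assert not visible(Version(created_by=10, deleted_by=11), 13, in_progress={12}, aborted=set())
assert not visible(Version(created_by=10), 13, in_progress=set(), aborted={10})  # aborted
assert not visible(Version(created_by=14), 13, in_progress=set(), aborted=set()) # later txid
```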
Indexes and Snapshot Isolation
- One option: index points to all versions of an object; index queries filter out invisible versions.
- Another approach (CouchDB, Datomic, LMDB): append-only/copy-on-write B-trees — don't overwrite pages, create new copies. Each write transaction creates a new B-tree root that is a consistent snapshot.
Repeatable Read and Naming Confusion
- The SQL standard doesn't have the concept of snapshot isolation (it predates the invention).
- PostgreSQL and MySQL call their snapshot isolation level "repeatable read" (to claim standards compliance).
- Oracle calls it "serializable" (even though it's weaker than true serializability).
- IBM DB2 uses "repeatable read" to mean serializability.
- Nobody really knows what repeatable read means.
2.3 Preventing Lost Updates
The lost update problem occurs with a read-modify-write cycle: two transactions read a value, modify it, and write back. The second write doesn't include the first modification — one modification is lost (the later write clobbers the earlier write).
Scenarios: incrementing a counter, updating an account balance, adding an element to a JSON list, two users editing a wiki page simultaneously.
Solutions
1. Atomic Write Operations (best choice when applicable)
```sql
UPDATE counters SET value = value + 1 WHERE key = 'foo';
```
- Implemented by taking an exclusive lock on the object when read (cursor stability), or forcing all atomic operations to execute on a single thread.
- MongoDB provides atomic operations for local modifications to JSON documents; Redis for data structures like priority queues.
- ORM frameworks can accidentally bypass atomic operations with unsafe read-modify-write cycles.
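The contrast between an unsafe read-modify-write cycle and an atomic write can be sketched with sqlite3 (table name follows the counters example above; the values are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (key TEXT PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES ('foo', 41)")

# Unsafe: read-modify-write in application code. Between the SELECT and the
# UPDATE, another client could commit a write, and that write would be
# clobbered by ours (a lost update).
(value,) = conn.execute("SELECT value FROM counters WHERE key = 'foo'").fetchone()
conn.execute("UPDATE counters SET value = ? WHERE key = 'foo'", (value + 1,))

# Safe: the database applies the increment atomically, so concurrent
# increments cannot be lost.
conn.execute("UPDATE counters SET value = value + 1 WHERE key = 'foo'")

(value,) = conn.execute("SELECT value FROM counters WHERE key = 'foo'").fetchone()
print(value)  # 43
```

This is exactly the pattern an ORM can accidentally hide: reading an object into memory, mutating it, and saving it back is the unsafe variant, even though it looks harmless.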
2. Explicit Locking
```sql
BEGIN TRANSACTION;
SELECT * FROM figures
  WHERE name = 'robot' AND game_id = 222
  FOR UPDATE; -- lock all returned rows
-- Check whether move is valid, then update
UPDATE figures SET position = 'c4' WHERE id = 1234;
COMMIT;
```
- `FOR UPDATE` tells the database to lock all rows returned by the query.
- Useful when atomic operations aren't sufficient (e.g., game logic that can't be expressed as a database query).
- Easy to forget a necessary lock → introduce a race condition.
3. Automatically Detecting Lost Updates
- Allow read-modify-write cycles to execute in parallel; if the transaction manager detects a lost update, abort and force retry.
- PostgreSQL's repeatable read, Oracle's serializable, and SQL Server's snapshot isolation automatically detect lost updates.
- MySQL/InnoDB's repeatable read does not detect lost updates.
- Great feature: doesn't require special application code, less error-prone.
4. Compare-and-Set
```sql
UPDATE wiki_pages SET content = 'new content'
  WHERE id = 1234 AND content = 'old content';
```
- Allow an update only if the value hasn't changed since you last read it. If it has, the update has no effect → retry.
- Caution: If the database allows the WHERE clause to read from an old snapshot, this may not be safe.
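A compare-and-set update can be driven from application code by checking how many rows the UPDATE affected (a sketch using Python's sqlite3; the table mirrors the wiki-page example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wiki_pages (id INTEGER PRIMARY KEY, content TEXT)")
conn.execute("INSERT INTO wiki_pages VALUES (1234, 'old content')")

def cas_update(expected, new):
    """Write only if the content still matches what we last read."""
    cur = conn.execute(
        "UPDATE wiki_pages SET content = ? WHERE id = 1234 AND content = ?",
        (new, expected))
    return cur.rowcount == 1  # 0 rows updated: someone else got there first

assert cas_update("old content", "new content")        # succeeds
assert not cas_update("old content", "other content")  # stale read: no effect, retry
```

On failure the application re-reads the current value, reapplies its change, and tries the compare-and-set again.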
5. Conflict Resolution and Replication
- In multi-leader or leaderless replication, locks and compare-and-set don't apply (no single up-to-date copy).
- Allow concurrent writes to create conflicting versions (siblings), resolve and merge after the fact.
- Commutative atomic operations (e.g., incrementing a counter) work well in replicated contexts — Riak 2.0 datatypes prevent lost updates across replicas.
- Last Write Wins (LWW) is prone to lost updates — unfortunately the default in many replicated databases.
2.4 Write Skew and Phantoms
Write skew is a generalization of the lost update problem: two transactions read the same objects, then update different objects based on what they read.
Doctor On-Call Example
Hospital requires at least one doctor on call. Alice and Bob are both on call, both feeling unwell, both click "go off call" at the same time. Both transactions check: 2 doctors on call → safe to go off call. Both proceed. Result: no doctor on call — requirement violated.
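The race can be simulated in a few lines: both transactions check the requirement against their own snapshot, then write different objects, so neither dirty-write prevention nor lost update detection fires (a toy illustration; the data structures are made up):

```python
on_call = {"alice": True, "bob": True}  # shared committed state

# Both transactions take their snapshot before either commits.
alice_snapshot = dict(on_call)
bob_snapshot = dict(on_call)

# Alice's transaction: the check passes against her snapshot...
if sum(alice_snapshot.values()) >= 2:
    on_call["alice"] = False   # ...so she goes off call

# Bob's transaction: his snapshot still shows two doctors on call...
if sum(bob_snapshot.values()) >= 2:
    on_call["bob"] = False     # ...so he goes off call too

print(on_call)  # {'alice': False, 'bob': False}: nobody is on call
```

Each transaction updates only its own row, so the writes never touch the same object; the invariant is violated anyway.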
Options for Preventing Write Skew
- Atomic single-object operations don't help (multiple objects involved).
- Automatic lost update detection doesn't help: write skew is not detected by PostgreSQL's repeatable read, MySQL/InnoDB's repeatable read, Oracle's serializable, or SQL Server's snapshot isolation.
- Database constraints may help for some cases, but multi-object constraints are rarely supported.
- Explicit locking with `SELECT ... FOR UPDATE` on the rows the transaction depends on.
- True serializable isolation is the only complete solution.
More Examples of Write Skew
- Meeting room booking — Two users book the same room at the same time. Snapshot isolation doesn't prevent the conflict.
- Multiplayer game — Two players move different figures to the same position.
- Claiming a username — Two users try to register the same username simultaneously. (A unique constraint solves this one.)
- Preventing double-spending — Two spending items inserted concurrently that together cause a negative balance.
Phantoms Causing Write Skew
The pattern:
1. A `SELECT` query checks whether some requirement is satisfied.
2. Application code decides how to continue based on the result.
3. A write (`INSERT`, `UPDATE`, or `DELETE`) changes the precondition of step 2.
A phantom is when a write in one transaction changes the result of a search query in another transaction. Snapshot isolation avoids phantoms in read-only queries, but in read-write transactions, phantoms can lead to write skew.
The problem: in some cases (meeting room booking, username claiming), the query checks for the absence of rows, and the write adds a row. SELECT FOR UPDATE can't attach locks to rows that don't exist yet.
Materializing Conflicts
Artificially introduce a lock object into the database. Example: create a table of all possible room/time-slot combinations ahead of time. A transaction locks (SELECT FOR UPDATE) the relevant rows before checking and inserting.
- Takes a phantom and turns it into a lock conflict on concrete rows.
- Hard and error-prone to figure out; ugly to let concurrency control leak into the application data model.
- Should be considered a last resort — serializable isolation is much preferable.
Section 3: Serializability
Serializable isolation is the strongest isolation level. It guarantees that even though transactions may execute in parallel, the end result is the same as if they had executed one at a time, serially. The database prevents all possible race conditions.
Three techniques for implementing serializability:
- Actual Serial Execution
- Two-Phase Locking (2PL)
- Serializable Snapshot Isolation (SSI)
3.1 Actual Serial Execution
The simplest way: remove concurrency entirely — execute only one transaction at a time, in serial order, on a single thread.
Why This Became Feasible (~2007)
- RAM became cheap enough to keep the entire active dataset in memory → transactions execute much faster (no waiting for disk I/O).
- OLTP transactions are usually short and make a small number of reads and writes. Long-running analytic queries can run on a consistent snapshot (snapshot isolation) outside the serial execution loop.
Implemented in: VoltDB/H-Store, Redis, Datomic.
Stored Procedures
Interactive client/server transactions spend most of their time in network communication between application and database. With single-threaded serial execution, throughput would be dreadful: the database would sit idle waiting for the application to issue the next statement of the current transaction.
Solution: the application submits the entire transaction code to the database ahead of time as a stored procedure.
Modern implementations use general-purpose languages: VoltDB uses Java/Groovy, Datomic uses Java/Clojure, Redis uses Lua.
VoltDB uses stored procedures for replication: executes the same stored procedure on each replica (requires deterministic procedures).
Partitioning
Single-threaded execution limits throughput to one CPU core. To scale: partition data so each partition has its own transaction processing thread running independently.
- Single-partition transactions scale linearly with CPU cores.
- Cross-partition transactions require coordination across all partitions → vastly slower (VoltDB: ~1,000 cross-partition writes/sec vs. much higher single-partition throughput).
Constraints of Serial Execution
- Every transaction must be small and fast (one slow transaction stalls everything).
- Active dataset must fit in memory.
- Write throughput must be low enough for a single CPU core (or transactions must be partitionable).
- Cross-partition transactions are possible but limited.
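The core idea, one thread consuming whole transactions from a queue, can be sketched as follows (an illustrative toy, not how VoltDB or Redis is actually implemented):

```python
import queue
import threading

# All transactions run on one thread, one at a time, so no locks are needed
# inside the transaction code itself.
store = {"balance": 100}
tx_queue = queue.Queue()

def transaction_processor():
    while True:
        transaction, done = tx_queue.get()
        if transaction is None:   # shutdown sentinel
            break
        transaction(store)        # executes with exclusive access to the data
        done.set()

worker = threading.Thread(target=transaction_processor)
worker.start()

def submit(transaction):
    """Block until the serial executor has run this transaction."""
    done = threading.Event()
    tx_queue.put((transaction, done))
    done.wait()

# Clients submit whole transactions (cf. stored procedures): each runs to
# completion before the next one starts, so the schedule is trivially serial.
submit(lambda db: db.update(balance=db["balance"] - 30))
submit(lambda db: db.update(balance=db["balance"] + 10))
tx_queue.put((None, None))
worker.join()
print(store["balance"])  # 80
```

The constraint list above falls out of this structure directly: one slow callback in the queue stalls every transaction behind it.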
3.2 Two-Phase Locking (2PL)
For ~30 years, the only widely used algorithm for serializability. (Note: 2PL ≠ 2PC — two-phase locking is completely different from two-phase commit.)
How It Works
Several transactions can concurrently read the same object, but as soon as anyone wants to write, exclusive access is required:
- If A has read an object and B wants to write → B must wait until A commits/aborts.
- If A has written an object and B wants to read → B must wait until A commits/aborts.
Key difference from snapshot isolation: in 2PL, writers block readers AND readers block writers. (Snapshot isolation: readers never block writers, writers never block readers.)
Implementation: Shared and Exclusive Locks
- Read → acquire lock in shared mode (multiple transactions can hold shared locks simultaneously; must wait if exclusive lock exists).
- Write → acquire lock in exclusive mode (no other transaction may hold any lock).
- Read then write → upgrade shared lock to exclusive lock.
- Locks held until end of transaction (commit or abort). First phase: acquire locks; second phase: release all locks.
Used by: MySQL (InnoDB) and SQL Server for serializable isolation; DB2 for repeatable read.
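A minimal shared/exclusive lock in the spirit of these rules might look like this (a sketch built on Python's threading.Condition; real databases add lock upgrading, deadlock detection, and per-object lock tables):

```python
import threading

class SharedExclusiveLock:
    """Many readers may hold the lock in shared mode; a writer needs
    exclusive mode and must wait until no one else holds the lock at all."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_shared(self):
        with self._cond:
            while self._writer:               # writers block readers
                self._cond.wait()
            self._readers += 1

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_exclusive(self):
        with self._cond:
            while self._writer or self._readers > 0:  # readers block writers
                self._cond.wait()
            self._writer = True

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

lock = SharedExclusiveLock()
lock.acquire_shared()
lock.acquire_shared()   # two readers share the lock without blocking
lock.release_shared()
lock.release_shared()
lock.acquire_exclusive()  # would block while any shared holder remained
lock.release_exclusive()
```

Under 2PL a transaction would hold such locks until commit or abort, which is exactly why one slow lock-heavy transaction can stall everything else.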
Deadlocks
Transaction A waits for B's lock, B waits for A's lock. The database automatically detects deadlocks and aborts one transaction so others can proceed. The aborted transaction must be retried. Deadlocks occur much more frequently under 2PL than under weaker isolation levels.
Performance
The big downside: significantly worse throughput and response times than weak isolation.
- Overhead of acquiring/releasing locks.
- Reduced concurrency — if two transactions may cause any race condition, one must wait.
- Unstable latencies — one slow transaction or one that acquires many locks can cause the rest of the system to grind to a halt.
Predicate Locks
To prevent phantoms, we need locks that apply to objects matching a search condition (not just specific rows). A predicate lock belongs to all objects matching some condition:
- Transaction A reads objects matching a condition → acquires shared predicate lock.
- Transaction A wants to insert/update/delete → must check if old or new value matches any existing predicate lock.
- The key idea: predicate locks apply even to objects that don't yet exist (phantoms).
Index-Range Locks (Next-Key Locking)
Predicate locks don't perform well (checking for matching locks is time-consuming). Most databases with 2PL use index-range locking — a simplified approximation:
- Instead of locking "bookings for room 123 between noon and 1pm," lock "all bookings for room 123" or "all bookings between noon and 1pm."
- Attach a shared lock to the index entry (e.g., room_id = 123 or the time range in a time-based index).
- Any transaction that wants to insert/update/delete a matching record must update the same index → encounters the lock → forced to wait.
- Not as precise as predicate locks (may lock a bigger range), but much lower overhead — a good compromise.
- If no suitable index exists, the database falls back to a shared lock on the entire table.
3.3 Serializable Snapshot Isolation (SSI)
A promising algorithm that provides full serializability with only a small performance penalty compared to snapshot isolation. First described in 2008 (Michael Cahill's PhD thesis).
Used in: PostgreSQL (since version 9.1), FoundationDB.
Pessimistic vs Optimistic Concurrency Control
- 2PL is pessimistic — If anything might go wrong, wait until it's safe. Like mutual exclusion in multi-threaded programming.
- Serial execution is pessimistic to the extreme — Each transaction has an exclusive lock on the entire database/partition.
- SSI is optimistic — Transactions continue without blocking. When a transaction wants to commit, the database checks whether isolation was violated; if so, the transaction is aborted and retried.
Optimistic concurrency control performs badly under high contention (many transactions accessing the same objects → high abort rate). But with enough spare capacity and low contention, it performs better than pessimistic approaches.
SSI is based on snapshot isolation + an algorithm for detecting serialization conflicts among writes.
Detecting Stale MVCC Reads
- Under snapshot isolation, a transaction ignores writes by uncommitted transactions (MVCC visibility rules).
- SSI tracks when a transaction ignores another transaction's writes. When the transaction wants to commit, the database checks whether any of the ignored writes have now been committed. If so → abort.
- Why wait until commit? A read-only transaction wouldn't need to be aborted. The other transaction may yet abort. By avoiding unnecessary aborts, SSI preserves snapshot isolation's support for long-running reads.
Detecting Writes That Affect Prior Reads
- When a transaction writes to the database, it looks in the indexes for other transactions that have recently read the affected data.
- Similar to acquiring a write lock, but instead of blocking, the lock acts as a tripwire: it notifies the transactions that their data may no longer be up to date.
- The first transaction to commit succeeds; when the conflicting transaction tries to commit, the conflicting write has already been committed → must abort.
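A toy, optimistic commit-time check in this spirit might look like the following (illustrative names only; real SSI tracks far more, including index-range reads, and this sketch is closer to plain optimistic concurrency control than to the full algorithm):

```python
class AbortError(Exception):
    pass

class Database:
    def __init__(self):
        self.data = {}
        self.versions = {}  # key -> number of committed writes so far

    def begin(self):
        return Transaction(self)

class Transaction:
    def __init__(self, db):
        self.db = db
        self.reads = {}   # key -> version observed when read
        self.writes = {}  # buffered writes, applied only on commit

    def read(self, key):
        self.reads[key] = self.db.versions.get(key, 0)
        return self.db.data.get(key)

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # First committer wins: abort if any key we read has moved on since.
        for key, seen in self.reads.items():
            if self.db.versions.get(key, 0) != seen:
                raise AbortError(f"stale read of {key!r}")
        for key, value in self.writes.items():
            self.db.data[key] = value
            self.db.versions[key] = self.db.versions.get(key, 0) + 1

db = Database()
t1, t2 = db.begin(), db.begin()
t1.read("doctors_on_call"); t2.read("doctors_on_call")  # same snapshot
t1.write("doctors_on_call", 1)
t1.commit()                  # first committer succeeds
t2.write("doctors_on_call", 1)
try:
    t2.commit()              # the conflicting write has already committed
except AbortError:
    print("t2 aborted; retry the transaction")
```

Nothing blocks while the transactions run; the conflict surfaces only at commit time, which is what distinguishes this optimistic approach from 2PL.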
Performance of SSI
- vs 2PL: One transaction doesn't need to block waiting for another's locks. Writers don't block readers, readers don't block writers → much more predictable latency. Read-only queries run on a consistent snapshot without locks.
- vs Serial Execution: Not limited to a single CPU core. FoundationDB distributes conflict detection across multiple machines → scales to very high throughput. Transactions can read and write across multiple partitions.
- The rate of aborts affects overall performance. Read-write transactions should be fairly short (long-running read-only transactions are okay). SSI is probably less sensitive to slow transactions than 2PL or serial execution.
Summary
Race Conditions and Isolation Levels
| Race Condition | Read Committed | Snapshot Isolation | Serializable |
|---|---|---|---|
| Dirty reads | ✅ Prevented | ✅ Prevented | ✅ Prevented |
| Dirty writes | ✅ Prevented | ✅ Prevented | ✅ Prevented |
| Read skew (nonrepeatable reads) | ❌ | ✅ Prevented (MVCC) | ✅ Prevented |
| Lost updates | ❌ | ⚠️ Some implementations detect | ✅ Prevented |
| Write skew | ❌ | ❌ | ✅ Prevented |
| Phantom reads | ❌ | ⚠️ Read-only OK; read-write problematic | ✅ Prevented |
Three Approaches to Serializable Transactions
| | Serial Execution | Two-Phase Locking (2PL) | SSI |
|---|---|---|---|
| Approach | Single-threaded, one at a time | Shared/exclusive locks, readers block writers | Optimistic, check at commit time |
| Performance | Limited to one CPU core | Significantly worse throughput, unstable latencies | Small penalty vs snapshot isolation |
| Scalability | Partitioning helps, but cross-partition is slow | Deadlocks, lock contention | Distributes across multiple machines |
| Best for | Small, fast transactions; dataset fits in memory | Traditional RDBMS serializability | General purpose; read-heavy workloads |
Important References
- Jim N. Gray, Raymond A. Lorie, Gianfranco R. Putzolu, and Irving L. Traiger: "Granularity of Locks and Degrees of Consistency in a Shared Data Base," in Modelling in Data Base Management Systems, 1976.
- Theo Härder and Andreas Reuter: "Principles of Transaction-Oriented Database Recovery," ACM Computing Surveys, volume 15, number 4, pages 287–317, December 1983.
- Hal Berenson, Philip A. Bernstein, Jim N. Gray, et al.: "A Critique of ANSI SQL Isolation Levels," at ACM SIGMOD, May 1995.
- Alan Fekete, Dimitrios Liarokapis, Elizabeth O'Neil, et al.: "Making Snapshot Isolation Serializable," ACM Transactions on Database Systems, volume 30, number 2, pages 492–528, June 2005.
- Michael J. Cahill, Uwe Röhm, and Alan Fekete: "Serializable Isolation for Snapshot Databases," at ACM SIGMOD, June 2008.
- Dan R. K. Ports and Kevin Grittner: "Serializable Snapshot Isolation in PostgreSQL," at 38th VLDB, August 2012.
- Michael Stonebraker, Samuel Madden, Daniel J. Abadi, et al.: "The End of an Architectural Era (It's Time for a Complete Rewrite)," at 33rd VLDB, September 2007.
- Robert Kallman, Hideaki Kimura, Jonathan Natkins, et al.: "H-Store: A High-Performance, Distributed Main Memory Transaction Processing System," Proceedings of the VLDB Endowment, volume 1, number 2, pages 1496–1499, August 2008.
- Martin Kleppmann: "Hermitage: Testing the 'I' in ACID," martin.kleppmann.com, November 25, 2014.
- Peter Bailis, Aaron Davidson, Alan Fekete, et al.: "Highly Available Transactions: Virtues and Limitations," at 40th VLDB, September 2014.
- Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman: Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987.
- Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton: "Architecture of a Database System," Foundations and Trends in Databases, volume 1, number 2, pages 141–259, November 2007.
- Donald D. Chamberlin, Morton M. Astrahan, et al.: "A History and Evaluation of System R," Communications of the ACM, volume 24, number 10, pages 632–646, October 1981.
- James C. Corbett, Jeffrey Dean, Michael Epstein, et al.: "Spanner: Google's Globally-Distributed Database," at 10th USENIX OSDI, October 2012.