The Future of Data Systems
Designing Data-Intensive Applications by Martin Kleppmann, Chapter 12: The Future of Data Systems
Introduction
This final chapter shifts perspective toward the future — proposing ideas for how we might fundamentally improve the ways we design and build applications that are reliable, scalable, and maintainable. The goal: applications that are robust, correct, evolvable, and ultimately beneficial to humanity.
Section 1: Data Integration
There is no one right solution for any given problem; there are many approaches, each appropriate in different circumstances. In complex applications, data is often used in several different ways, and no single piece of software is suitable for all of them. You inevitably end up cobbling together several different pieces of software.
1.1 Combining Specialized Tools
Besides the database and search index, you may need: analytics systems (data warehouses, batch/stream processing), caches, denormalized views, machine learning/recommendation systems, and notification systems. The need for data integration becomes apparent when you zoom out and consider dataflows across an entire organization.
1.2 Reasoning About Dataflows
When copies of the same data are maintained in several storage systems, be very clear about inputs and outputs: where is data written first, and which representations are derived from which sources?
- Write data to a system of record database first, capture changes via CDC, and apply them to the search index in the same order.
- If CDC is the only way of updating the index, you can be confident the index is entirely derived from the system of record and therefore consistent with it.
- Allowing the application to directly write to both the database and the search index introduces the problems of dual writes (race conditions, partial failures).
- Updating derived data based on an event log can be made deterministic and idempotent, making fault recovery easy.
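The last point can be made concrete with a minimal sketch: a consumer that applies an ordered change log to a derived view (here, a toy search index). All names (`SearchIndex`, `apply_event`) are invented for illustration; the idea is only that tracking the last applied log offset makes replay after a crash harmless.

```python
class SearchIndex:
    """A derived view kept in sync by replaying an ordered change log."""

    def __init__(self):
        self.docs = {}           # doc_id -> document
        self.last_applied = -1   # offset of the last event applied

    def apply_event(self, offset, event):
        # Idempotence: skip events at or below the last applied offset,
        # so replaying the log after a crash cannot double-apply changes.
        if offset <= self.last_applied:
            return
        if event["op"] == "upsert":
            self.docs[event["id"]] = event["doc"]
        elif event["op"] == "delete":
            self.docs.pop(event["id"], None)
        self.last_applied = offset

log = [
    {"op": "upsert", "id": "a", "doc": "hello"},
    {"op": "upsert", "id": "b", "doc": "world"},
    {"op": "delete", "id": "a"},
]

index = SearchIndex()
for offset, event in enumerate(log):
    index.apply_event(offset, event)
# Replaying the whole log (e.g., during fault recovery) changes nothing:
for offset, event in enumerate(log):
    index.apply_event(offset, event)
```

Because the derivation is deterministic and idempotent, recovery is simply "rewind and replay", with no need for distributed coordination between the database and the index.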
1.3 Derived Data vs Distributed Transactions
| | Distributed Transactions | Log-Based Derived Data |
|---|---|---|
| Ordering | Locks for mutual exclusion | Log for ordering |
| Atomicity | Atomic commit (2PC) | Deterministic retry + idempotence |
| Consistency | Linearizability (read-your-writes) | Asynchronous (no timing guarantees by default) |
| Fault tolerance | XA has poor fault tolerance | More robust and practical |
Log-based derived data is the most promising approach for integrating different data systems.
1.4 The Limits of Total Ordering
Constructing a totally ordered event log is feasible for small systems (single-leader replication). But limitations emerge at scale:
- Throughput exceeds a single machine → must partition the log across multiple machines → order between partitions is ambiguous.
- Multiple geographically distributed datacenters → separate leader per datacenter → undefined ordering across datacenters.
- Microservices → each service has independent durable state → no defined order for events from different services.
- Client-side state (offline-capable apps) → clients and servers see events in different orders.
Deciding on a total order is equivalent to consensus — and most consensus algorithms are designed for situations where the throughput of a single node is sufficient. Approaches for partial ordering: logical timestamps, recording causal dependencies with unique event identifiers, conflict resolution algorithms (CRDTs).
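Logical timestamps are the simplest of these approaches. A minimal sketch of a Lamport clock: each node increments a counter on every local event, and on receiving a message it jumps its counter past the sender's timestamp, so causally later events always carry larger timestamps (though unrelated events may still be ordered arbitrarily).

```python
class LamportClock:
    """Logical timestamps: consistent with causality, not a total order."""

    def __init__(self):
        self.time = 0

    def send(self):
        # Increment before attaching the timestamp to an outgoing message.
        self.time += 1
        return self.time

    def receive(self, remote_ts):
        # Jump past the sender's timestamp, then increment.
        self.time = max(self.time, remote_ts) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t1 = a.send()        # a's counter: 1
t2 = b.receive(t1)   # b's counter: max(0, 1) + 1 = 2
t3 = b.send()        # b's counter: 3
t4 = a.receive(t3)   # a's counter: max(1, 3) + 1 = 4
```

Note what this does not give you: two concurrent events on different nodes may receive timestamps in either order, which is exactly why conflict resolution (e.g., CRDTs) is still needed for concurrent writes.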
1.5 Batch and Stream Processing
Batch and stream processors are the tools for transforming data from a system of record into derived datasets (search indexes, materialized views, recommendations, aggregate metrics).
- Spark performs stream processing on top of batch (microbatching).
- Apache Flink performs batch processing on top of stream.
- In principle, one can emulate the other, though performance characteristics vary.
Maintaining Derived State
- Batch processing has a functional flavor: deterministic, pure functions, immutable inputs, append-only outputs.
- Stream processing extends this with managed, fault-tolerant state.
- Helpful to think of derived data as a data pipeline (transformation function applied to a system of record).
Reprocessing Data for Application Evolution
- Stream processing reflects changes with low delay.
- Batch processing allows reprocessing large amounts of historical data to derive new views.
- Reprocessing enables restructuring a dataset into a completely different model (beyond simple schema changes like adding an optional field).
- Gradual migration — Maintain old and new schemas side by side as two independently derived views. Shift users gradually. Every stage is easily reversible.
Schema Migrations on Railways
In the early days of railway building in 19th-century England, there were various competing standards for the gauge (distance between the two rails). Trains built for one gauge couldn't run on tracks of another gauge. After a single standard gauge was decided upon in 1846, tracks with other gauges had to be converted — but how do you do this without shutting down the train line for months? The solution: first convert the track to dual gauge by adding a third rail. This can be done gradually, and trains of both gauges can run on the line using two of the three rails. Eventually, once all trains are converted, the nonstandard rail can be removed. This is analogous to maintaining old and new schemas side by side. Nevertheless, it is expensive, which is why nonstandard gauges still exist today (e.g., the BART system in San Francisco uses a different gauge from the majority of the US).
The Lambda Architecture
- Incoming data recorded as immutable events. Two parallel systems: batch (Hadoop MapReduce) for correctness, stream (Storm) for speed.
- Stream processor produces approximate updates quickly; batch processor later produces corrected version. The reasoning: batch processing is simpler and less prone to bugs, while stream processors are thought to be less reliable and harder to make fault-tolerant.
- Problems:
- Maintaining the same logic in both a batch and a stream processing framework is significant additional effort. Libraries like Summingbird provide an abstraction for computations that can run in either context, but the operational complexity of debugging, tuning, and maintaining two different systems remains.
- Since the stream and batch pipelines produce separate outputs, they need to be merged to respond to user requests. This merge is fairly easy for simple aggregations over tumbling windows, but becomes significantly harder for complex operations like joins and sessionization.
- Although reprocessing the entire historical dataset is great, doing so frequently is expensive on large datasets. The batch pipeline often needs to process incremental batches (e.g., an hour's worth at the end of every hour), which raises problems with handling stragglers and windows that cross batch boundaries. Incrementalizing a batch computation adds complexity, making it more akin to the streaming layer — running counter to the goal of keeping the batch layer simple.
Unifying Batch and Stream Processing
More recent work allows both batch and stream computations in the same system, requiring:
- Ability to replay historical events through the same engine that handles recent events.
- Exactly-once semantics for stream processors.
- Windowing by event time, not processing time.
Section 2: Unbundling Databases
At the most abstract level, databases, Hadoop, and operating systems all perform the same functions: store data and allow you to process and query it.
2.1 Composing Data Storage Technologies
A database already does many things internally that look like derived data systems:
- Secondary indexes — Derived from primary data, kept in sync on every write.
- Materialized views — Precomputed cache of query results.
- Replication logs — Keep copies on other nodes up to date.
- Full-text search indexes — Built into some relational databases.
Creating an index (CREATE INDEX) is remarkably similar to setting up a new follower replica or bootstrapping CDC — the database reprocesses the existing dataset and derives the index as a new view.
Batch and stream processors are like elaborate implementations of triggers, stored procedures, and materialized view maintenance — but provided by various different pieces of software, running on different machines, administered by different teams.
2.2 Federated Databases vs Unbundled Databases
| | Federated Databases (Polystore) | Unbundled Databases |
|---|---|---|
| Focus | Unifying reads — unified query interface across storage engines | Unifying writes — synchronizing writes across systems |
| Approach | High-level query language over diverse backends (e.g., PostgreSQL foreign data wrappers) | Asynchronous event log with idempotent writes |
| Philosophy | Relational tradition — single integrated system | Unix tradition — composing loosely coupled components |
The traditional approach to synchronizing writes (distributed transactions / XA) is the wrong solution. An ordered log of events with idempotent consumers is a much simpler abstraction that works across heterogeneous systems.
Advantages of Log-Based Integration (Loose Coupling)
- System level — Asynchronous event streams make the system more robust to outages. If a consumer fails, the event log buffers messages; the faulty consumer catches up when fixed.
- Human level — Different teams can independently develop, improve, and maintain different derived data systems. Each team focuses on doing one thing well.
2.3 Designing Applications Around Dataflow
The "database inside-out" approach — composing specialized storage and processing systems with application code. Related to dataflow languages (Oz, Juttle), functional reactive programming (Elm), logic programming (Bloom), and even spreadsheets (automatic recalculation when inputs change).
Application Code as a Derivation Function
- Secondary index → straightforward transformation (pick indexed fields, sort).
- Full-text search index → NLP functions (language detection, stemming, synonyms) + inverted index.
- Machine learning model → derived from training data via feature extraction and statistical analysis.
- Cache → aggregation of data in the form displayed in the UI.
When the derivation function is application-specific (ML, full-text search, caching), it needs to be custom application code, not just a database's built-in feature. Modern deployment tools (Mesos, YARN, Docker, Kubernetes) are better suited for running this code than database stored procedures.
Stream Processors and Services
Composing stream operators into dataflow systems is similar to microservices, but with one-directional, asynchronous message streams instead of synchronous request/response.
Example: currency conversion for a purchase. Instead of making a synchronous RPC to an exchange rate service (which adds latency and a failure dependency), use a stream-table join between purchase events and exchange rate update events. The stream processor subscribes to the exchange rate changelog and maintains a local copy — no network request needed at query time.
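A minimal sketch of that stream-table join (all names invented for illustration): one handler consumes the exchange-rate changelog and maintains a local table; another enriches purchase events by looking up the locally held rate, with no network request at processing time.

```python
rates = {}  # local replica of the exchange-rate table, built from the changelog

def on_rate_update(event):
    """Consume the exchange-rate changelog and update the local table."""
    rates[event["currency"]] = event["rate_to_usd"]

def on_purchase(event):
    """Stream-table join: enrich each purchase with the currently known rate."""
    rate = rates[event["currency"]]
    return {**event, "amount_usd": event["amount"] * rate}

on_rate_update({"currency": "EUR", "rate_to_usd": 1.10})
enriched = on_purchase({"currency": "EUR", "amount": 100.0})
```

The trade-off versus a synchronous RPC: lower latency and no runtime dependency on the rate service being up, at the cost of the local table lagging slightly behind the source of truth.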
2.4 Observing Derived State
Write Path and Read Path
- Write path — Precomputed eagerly when data comes in (batch and stream processing, updating derived datasets). Like eager evaluation.
- Read path — Computed lazily when someone asks for it (serving user requests). Like lazy evaluation.
- The derived dataset is where the write path and read path meet — a trade-off between work done at write time vs read time.
- Caches, indexes, and materialized views shift the boundary: more work on the write path (precomputing) to save effort on the read path.
Stateful, Offline-Capable Clients
- On-device state is a cache of state on the server. The pixels on screen are a materialized view; model objects are a local replica of remote state.
- Server-sent events and WebSockets allow the server to actively push state changes to the browser.
- Extending the write path all the way to the end-user device — state changes flow from one device, through event logs and stream processors, to another device's UI. Propagated with fairly low delay (under one second end-to-end).
Reads Are Events Too
Read requests can be represented as streams of events, routed to the same stream operator as writes — performing a stream-table join between read queries and the database. A one-off read passes through the join and is forgotten; a subscribe request is a persistent join with past and future events.
Recording read events enables better tracking of causal dependencies and data provenance (e.g., what the user saw before making a purchase decision).
Section 3: Aiming for Correctness
We want applications that are reliable and correct — well-defined semantics even in the face of faults. Transactions (atomicity, isolation, durability) have been the traditional tools, but their foundations are weaker than they seem (weak isolation levels, limited to single datacenter, limited scale).
3.1 The End-to-End Argument for Databases
Just because an application uses a data system with strong safety properties (e.g., serializable transactions), that does not mean the application is guaranteed to be free from data loss or corruption. Application bugs can still write incorrect data.
Exactly-Once Execution
- Making operations idempotent is one of the most effective approaches — same effect whether executed once or multiple times.
- Requires additional metadata (e.g., set of operation IDs that have updated a value) and fencing when failing over.
Duplicate Suppression
- TCP suppresses duplicate packets within a single connection, but not across connections.
- If a client sends a COMMIT but the connection drops before the acknowledgment arrives, the client doesn't know whether the transaction committed. Retrying may execute it twice.
- Even database-level deduplication doesn't help with duplicates from the end-user device (e.g., a user retrying an HTTP POST on a weak cellular connection).
Operation Identifiers
- Generate a unique identifier (UUID) for each operation on the client side. Include it as a hidden form field.
- Pass the operation ID all the way through to the database. Use a uniqueness constraint on the request ID to ensure each operation executes only once.
- The requests table also acts as an event log (hinting at event sourcing).
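The operation-ID pattern above can be sketched in a few lines (table and column names are invented for illustration, using an in-memory SQLite database): the client generates a UUID once, and a uniqueness constraint on the requests table turns any retry into a harmless no-op.

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE requests (
        request_id   TEXT PRIMARY KEY,  -- uniqueness constraint = dedup
        from_account TEXT,
        to_account   TEXT,
        amount       INTEGER
    )
""")

def transfer(request_id, from_account, to_account, amount):
    """Execute at most once per request_id, even if the client retries."""
    try:
        db.execute(
            "INSERT INTO requests VALUES (?, ?, ?, ?)",
            (request_id, from_account, to_account, amount),
        )
        return "applied"
    except sqlite3.IntegrityError:
        return "duplicate"   # already processed; safe to ignore

op_id = str(uuid.uuid4())    # generated once, on the client side
first = transfer(op_id, "A", "B", 100)
retry = transfer(op_id, "A", "B", 100)   # network retry: suppressed
```

Because the deduplication key travels end-to-end from the client, it survives dropped connections and retried HTTP requests, which TCP-level or per-connection deduplication cannot.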
The End-to-End Argument
Saltzer, Reed, and Clark (1984): "The function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the endpoints of the communication system."
- TCP duplicate suppression, stream processor exactly-once semantics, and database transactions are all useful low-level reliability features, but they are not sufficient by themselves for end-to-end correctness.
- The application itself needs to take end-to-end measures (e.g., duplicate suppression with operation IDs).
- Low-level mechanisms reduce the probability of problems at higher levels, but the remaining higher-level faults still need to be handled.
3.2 Enforcing Constraints
Uniqueness Constraints Require Consensus
- In a distributed setting, enforcing uniqueness requires consensus — typically implemented by funneling all events through a single leader node.
- Can be scaled out by partitioning based on the value that needs to be unique (e.g., partition by username).
Uniqueness in Log-Based Messaging
A stream processor consumes all messages in a log partition sequentially on a single thread:
- Every request for a username is appended to a partition determined by the hash of the username.
- The stream processor reads requests sequentially, tracks which usernames are taken, and emits success or rejection messages to an output stream.
- The client watches the output stream for its result.
This is the same algorithm as implementing linearizable storage using total order broadcast. Scales by increasing the number of partitions.
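The algorithm above can be sketched as follows (all names invented for illustration): claims are routed to a partition by hash of the username, and a single-threaded processor per partition grants the first claim and rejects the rest, in log order.

```python
NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]

def submit_claim(username, client_id):
    """Append a claim request to the log partition owning this username."""
    p = hash(username) % NUM_PARTITIONS
    partitions[p].append({"username": username, "client": client_id})

def process_partition(log):
    """Sequential, single-threaded processing: the first claim wins."""
    taken = set()
    output = []   # success/rejection messages for the output stream
    for req in log:
        if req["username"] in taken:
            output.append((req["client"], "rejected"))
        else:
            taken.add(req["username"])
            output.append((req["client"], "granted"))
    return output

submit_claim("alice", "c1")
submit_claim("alice", "c2")   # routed to the same partition as c1's claim
results = [r for p in partitions for r in process_partition(p)]
```

No cross-partition coordination is needed because every claim for a given username lands in the same partition, where the log order decides the winner deterministically.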
Multi-Partition Request Processing Without Atomic Commit
Example: transferring money from account A to account B (different partitions):
- The transfer request gets a unique request ID and is appended to a log partition based on the request ID.
- A stream processor reads the request and emits two messages: a debit instruction (partitioned by A) and a credit instruction (partitioned by B), both including the original request ID.
- Further processors consume the debit/credit streams, deduplicate by request ID, and apply changes to account balances.
No distributed transaction needed — the request is durably logged as a single message first, then derived instructions are emitted. If the stream processor crashes, it resumes from its last checkpoint and may re-emit messages, but the deduplication in step 3 prevents double-processing.
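The three-step flow can be sketched like this (names invented for illustration): the request is logged once, a processor fans it out into debit and credit instructions carrying the request ID, and downstream consumers deduplicate before applying.

```python
balances = {"A": 500, "B": 0}
applied = set()   # (request_id, account) pairs already applied downstream

def fan_out(request):
    """Step 2: derive debit and credit instructions from one logged request."""
    rid = request["id"]
    debit = {"rid": rid, "account": request["from"], "delta": -request["amount"]}
    credit = {"rid": rid, "account": request["to"], "delta": +request["amount"]}
    return [debit, credit]

def apply_instruction(instr):
    """Step 3: idempotent apply — re-emitted duplicates are ignored."""
    key = (instr["rid"], instr["account"])
    if key in applied:
        return
    applied.add(key)
    balances[instr["account"]] += instr["delta"]

request = {"id": "req-1", "from": "A", "to": "B", "amount": 100}
for instr in fan_out(request):
    apply_instruction(instr)
# Simulate the stream processor crashing and re-emitting after its checkpoint:
for instr in fan_out(request):
    apply_instruction(instr)
```

The debit and credit are applied asynchronously and may be briefly visible in an intermediate state, but the end-to-end request ID guarantees that each is applied exactly once.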
3.3 Timeliness and Integrity
The term "consistency" conflates two different requirements:
| | Timeliness | Integrity |
|---|---|---|
| Meaning | Users observe the system in an up-to-date state | No corruption, no contradictory or false data |
| Violation | "Eventual consistency" — temporary, resolves by waiting | "Perpetual inconsistency" — permanent, requires explicit repair |
| Importance | Annoying if violated | Catastrophic if violated |
| Example | Credit card transaction not yet appearing on statement (normal) | Statement balance ≠ sum of transactions (very bad) |
ACID transactions provide both timeliness (linearizability) and integrity (atomic commit). Event-based dataflow systems decouple them:
- No guarantee of timeliness (asynchronous by design), unless explicitly built.
- Integrity can be achieved through:
- Representing writes as a single immutable message (event sourcing).
- Deterministic derivation functions for all other state updates.
- End-to-end operation identifiers for duplicate suppression and idempotence.
- Immutable messages allowing reprocessing to recover from bugs.
Loosely Interpreted Constraints
Many real applications can get away with weaker notions of uniqueness:
- Two people register the same username → send one an apology, ask them to choose another (compensating transaction).
- More items ordered than in stock → order more stock, apologize for the delay, offer a discount.
- Airlines routinely overbook flights — compensation processes handle excess demand.
- Bank overdraft → charge an overdraft fee.
The cost of the apology is a business decision. If acceptable, you can write optimistically and check constraints after the fact — no need for linearizable constraints.
Coordination-Avoiding Data Systems
These observations mean dataflow systems can provide strong integrity guarantees without requiring coordination, achieving better performance and fault tolerance:
- Can operate across multiple datacenters in a multi-leader configuration with asynchronous replication.
- Weak timeliness (not linearizable without coordination) but strong integrity.
- Serializable transactions still useful at small scope for maintaining derived state.
- Synchronous coordination only introduced where strictly needed (e.g., before an irrecoverable operation).
3.4 Trust, but Verify
Data can become corrupted on disks, in networks (evading TCP checksums), and through software bugs (even in mature databases like MySQL and PostgreSQL).
Auditing
- Checking the integrity of data. HDFS and Amazon S3 run background processes that continually read back files and compare to replicas.
- Test your backups — don't just trust they work.
Designing for Auditability
- Event-based systems provide better auditability: user input represented as a single immutable event, state updates derived deterministically.
- The provenance of data is much clearer with explicit dataflow.
- Can rerun batch/stream processors to verify derived state matches expectations.
- Continuous end-to-end integrity checks increase confidence and allow faster iteration.
Cryptographic Tools
- Blockchains and distributed ledgers (Bitcoin, Ethereum, etc.) explore cryptographic proofs of integrity.
- Certificate transparency and Merkle trees can prove that a log has not been tampered with.
- Making these algorithms scalable for general data systems is an area of ongoing research.
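As a small illustration of the Merkle-tree idea (a sketch, not how any particular certificate-transparency log is implemented): hash the log entries pairwise up to a single root, so that changing any entry changes the root, making tampering detectable from one short hash.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Compute the Merkle root of a list of byte-string log entries."""
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])   # duplicate the last node if odd
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

log = [b"event-1", b"event-2", b"event-3"]
root = merkle_root(log)
tampered = merkle_root([b"event-1", b"event-X", b"event-3"])
# root != tampered: any modification changes the root hash
```

In a real system the tree structure additionally allows compact inclusion and consistency proofs (logarithmic in the log size), which is what makes auditing large logs practical.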
Section 4: Doing the Right Thing
Every system is built for a purpose, but consequences reach far beyond that purpose. Many datasets are about people — their behavior, interests, identity. We must treat such data with humanity and respect. Software development increasingly involves making important ethical choices.
Predictive Analytics
- Using data to predict weather or disease spread is one thing; predicting whether a convict will reoffend, a loan applicant will default, or an insurance customer will make expensive claims has a direct effect on individual people's lives.
- Someone who has (accurately or falsely) been labeled as risky by an algorithm may suffer a large number of "no" decisions. Systematically being excluded from jobs, air travel, insurance, property rental, and financial services has been called "algorithmic prison" — without proof of guilt and with little chance of appeal.
Bias and Discrimination
- Algorithms trained on historical data can learn and systematize existing biases. Even without explicitly using protected characteristics, they can use proxies (e.g., zip code or IP address as proxy for race in racially segregated neighborhoods).
- "Machine learning is like money laundering for bias" — the algorithm takes biased data as input and produces output that appears objective but codifies discrimination.
- Predictive analytics merely extrapolate from the past; if the past is discriminatory, they codify that discrimination. If we want the future to be better, moral imagination is required — something only humans can provide.
Responsibility and Accountability
- When algorithms make mistakes, who is accountable? If a self-driving car causes an accident, who is responsible? If an automated credit scoring algorithm systematically discriminates, is there recourse?
- A credit score summarizes "How did you behave in the past?" whereas predictive analytics works on "Who is similar to you, and how did people like you behave?" — this implies stereotyping people.
- Much data is statistical: even if the probability distribution is correct on the whole, individual cases may well be wrong.
Feedback Loops
- Predictive systems can create self-reinforcing cycles: bad credit score → can't find work → worse credit score → even harder to find work. A downward spiral due to poisonous assumptions, hidden behind a camouflage of mathematical rigor.
- Recommendation systems may show people only opinions they already agree with, leading to echo chambers where stereotypes, misinformation, and polarization breed.
Privacy and Tracking
Surveillance
- As a thought experiment, replace "data" with "surveillance": "In our surveillance-driven organization we collect real-time surveillance streams and store them in our surveillance warehouse. Our surveillance scientists use advanced analytics and surveillance processing in order to derive new insights."
- We have built the greatest mass surveillance infrastructure the world has ever seen. Even the most totalitarian regimes could only dream of putting a microphone in every room and forcing every person to carry a location-tracking device — yet we voluntarily throw ourselves into this world. The difference is just that data is collected by corporations rather than government agencies.
Consent and Freedom of Choice
- Users have little knowledge of what data they feed into databases or how it is processed. Most privacy policies do more to obscure than to illuminate.
- For a user who does not consent to surveillance, the only alternative is not to use the service. But if a service is "regarded by most people as essential for basic social participation," opting out is not a real choice — surveillance becomes inescapable.
Privacy and Use of Data
- Having privacy does not mean keeping everything secret; it means having the freedom to choose which things to reveal to whom. It is a decision right, an aspect of autonomy.
- When data is extracted through surveillance infrastructure, privacy rights are transferred from the individual to the data collector. Companies choose to keep much of the outcome secret because revealing it would be perceived as creepy.
Data as Assets and Power
- Behavioral data is sometimes called "data exhaust" — but from an economic point of view, if targeted advertising pays for a service, behavioral data is the service's core asset. The application is merely a means to lure users into feeding personal information into the surveillance infrastructure.
- Data brokers — a shady industry operating in secrecy, purchasing, aggregating, analyzing, and reselling intrusive personal data, mostly for marketing purposes.
- Data is not just an asset but a "toxic asset" (Bruce Schneier) — whenever we collect data, we must balance benefits with the risk of it falling into the wrong hands through breaches, hostile governments, or unscrupulous management.
- "It is poor civic hygiene to install technologies that could someday facilitate a police state."
The Way Forward
- Data is the pollution problem of the information age (Bruce Schneier). Protecting privacy is the environmental challenge. Just as the Industrial Revolution had a dark side (pollution, exploitation) that needed regulation, our transition to the information age has problems we must confront.
- We need a culture shift: stop regarding users as metrics to be optimized. Remember they are humans who deserve respect, dignity, and agency.
- Self-regulate data collection practices. Educate end users about how their data is used.
- Don't retain data forever — purge when no longer needed.
- Enforce access control through cryptographic protocols, not merely by policy.
- Consider not just today's political environment, but all possible future governments that might access the data.
Summary
Key Takeaways
Data Integration:
- No single tool satisfies all needs. Combine specialized tools with clear dataflow reasoning.
- Log-based derived data (CDC, event sourcing) is more robust than distributed transactions for integrating heterogeneous systems.
- Batch and stream processing are converging. Lambda architecture has been superseded by unified systems.
Unbundling Databases:
- Batch/stream processors are like triggers, stored procedures, and materialized view maintenance — but across different systems.
- Federated databases unify reads; unbundled databases unify writes (via asynchronous event logs with idempotent consumers).
- The "database inside-out" approach: application code as derivation functions, stream processors as services, write path vs read path.
- Extend the write path to end-user devices (server-sent events, WebSockets). Reads are events too.
Aiming for Correctness:
- End-to-end argument: low-level reliability features (TCP, transactions) are necessary but not sufficient. Applications need end-to-end measures (operation identifiers, idempotence).
- Uniqueness constraints can be enforced via log-based messaging (sequential processing per partition).
- Multi-partition operations without atomic commit: log the request as a single message, derive instructions, deduplicate downstream.
- Timeliness (eventual consistency) vs Integrity (perpetual inconsistency) — integrity is much more important.
- Coordination-avoiding data systems: strong integrity without coordination, weak timeliness. Coordination only where strictly needed.
- Design for auditability: immutable events, deterministic derivations, end-to-end integrity checks.
Doing the Right Thing:
- Treat data about people with humanity and respect.
- Be aware of bias, feedback loops, surveillance, and the limits of consent.
- Data is a toxic asset — balance benefits with risks. Privacy is a fundamental right.
Important References
- Jerome H. Saltzer, David P. Reed, and David D. Clark: "End-to-End Arguments in System Design," ACM TOCS, volume 2, number 4, pages 277–288, November 1984.
- Jay Kreps: "The Log: What Every Software Engineer Should Know About Real-Time Data's Unifying Abstraction," engineering.linkedin.com, December 16, 2013.
- Pat Helland and Dave Campbell: "Building on Quicksand," at 4th CIDR, January 2009.
- Pat Helland: "Life Beyond Distributed Transactions: An Apostate's Opinion," at 3rd CIDR, January 2007.
- Pat Helland: "Immutability Changes Everything," at 7th CIDR, January 2015.
- Martin Kleppmann: "Turning the Database Inside-out with Apache Samza," at Strange Loop, September 2014.
- Nathan Marz and James Warren: Big Data: Principles and Best Practices of Scalable Real-Time Data Systems, Manning, 2015.
- Jay Kreps: "Questioning the Lambda Architecture," oreilly.com, July 2, 2014.
- Peter Bailis, Alan Fekete, Michael J. Franklin, et al.: "Coordination-Avoiding Database Systems," Proceedings of the VLDB Endowment, volume 8, number 3, pages 185–196, November 2014.
- Cathy O'Neil: Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, Crown Publishing, 2016.
- Bruce Schneier: Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World, W. W. Norton, 2015.
- Shoshana Zuboff: "Big Other: Surveillance Capitalism and the Prospects of an Information Civilization," Journal of Information Technology, volume 30, number 1, pages 75–89, April 2015.
- Frank McSherry, Derek G. Murray, Rebecca Isaacs, and Michael Isard: "Differential Dataflow," at 6th CIDR, January 2013.
- Martin Kleppmann and Jay Kreps: "Kafka, Samza and the Unix Philosophy of Distributed Data," IEEE Data Engineering Bulletin, volume 38, number 4, pages 4–14, December 2015.
- Maciej Cegłowski: "Haunted by Data," idlewords.com, October 2015.