Encoding and Evolution
Designing Data-Intensive Applications by Martin Kleppmann, Chapter 4: Encoding and Evolution
Introduction
Applications inevitably change over time. A change to an application's features often requires a change to the data it stores. However, in a large application, code changes often cannot happen instantaneously:
- Server-side applications — You may want to perform a rolling upgrade (staged rollout), deploying the new version to a few nodes at a time. This allows deployment without service downtime and encourages more frequent releases.
- Client-side applications — You're at the mercy of the user, who may not install the update for some time.
This means old and new versions of the code, and old and new data formats, may coexist in the system at the same time.
Backward and Forward Compatibility
- Backward compatibility — Newer code can read data that was written by older code. (Normally not hard — you know the old format and can handle it.)
- Forward compatibility — Older code can read data that was written by newer code. (Trickier — requires older code to ignore additions made by a newer version.)
Section 1: Formats for Encoding Data
Programs work with data in two different representations:
- In memory — Objects, structs, lists, arrays, hash tables, trees (optimized for CPU access, using pointers).
- On disk / over the network — A self-contained sequence of bytes (pointers wouldn't make sense to another process).
Translation between the two:
- In-memory → bytes = encoding (also called serialization or marshalling)
- Bytes → in-memory = decoding (also called parsing, deserialization, unmarshalling)
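A minimal round-trip illustration using Python's standard json module (the record contents here are just example data): encoding turns an in-memory object into a self-contained byte sequence, and decoding produces a fresh in-memory object from those bytes.

```python
import json

# An in-memory representation: a dict living in this process's heap.
record = {"userName": "Martin", "favoriteNumber": 1337}

# Encoding (serialization): in-memory object -> self-contained byte sequence.
encoded = json.dumps(record).encode("utf-8")

# Decoding (deserialization/parsing): bytes -> a new in-memory object.
decoded = json.loads(encoded.decode("utf-8"))

assert decoded == record          # same value...
assert decoded is not record      # ...but a distinct object in memory
```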
1.1 Language-Specific Formats
Many languages have built-in encoding: Java's java.io.Serializable, Ruby's Marshal, Python's pickle, Kryo for Java, etc.
Problems
- Tied to a particular programming language — Reading data in another language is very difficult. Commits you to your current language for a long time.
- Security problems — Decoding needs to instantiate arbitrary classes. An attacker can get your application to decode an arbitrary byte sequence → remotely execute arbitrary code.
- Versioning is an afterthought — Often neglect forward and backward compatibility.
- Efficiency is an afterthought — Java's built-in serialization is notorious for bad performance and bloated encoding.
Generally a bad idea to use language-specific encoding for anything other than very transient purposes.
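The security problem is easy to demonstrate with Python's pickle. During encoding, pickle calls an object's `__reduce__` method to learn how to reconstruct it; during decoding, it executes whatever callable that method returned. In this sketch, `eval("6 * 7")` stands in for arbitrary attacker-controlled code:

```python
import pickle

class Malicious:
    # pickle records the return value of __reduce__ in the byte stream;
    # pickle.loads then CALLS that callable with the given arguments.
    def __reduce__(self):
        return (eval, ("6 * 7",))  # stands in for arbitrary attacker code

payload = pickle.dumps(Malicious())
result = pickle.loads(payload)     # eval() runs as a side effect of decoding
assert result == 42
```

This is why untrusted byte sequences must never be fed to decoders like pickle, Marshal, or Java's built-in deserialization.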
1.2 JSON, XML, and Binary Variants
JSON and XML are widely known, widely supported, and almost as widely disliked. CSV is another popular language-independent format.
Subtle Problems
- Number encoding ambiguity — XML and CSV can't distinguish numbers from digit-strings. JSON doesn't distinguish integers from floating-point, and doesn't specify precision.
- Integers greater than 2^53 cannot be exactly represented in IEEE 754 double-precision floating-point. Twitter includes tweet IDs twice in its JSON API: once as a number, once as a string, to work around JavaScript's parsing issues.
- No binary string support — JSON and XML support Unicode strings but not binary strings (sequences of bytes). Workaround: Base64 encoding, which increases data size by 33%.
- Schema support is optional and complex — XML Schema is fairly widespread; many JSON tools don't bother with schemas. CSV has no schema at all.
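Two of these problems can be verified directly in Python, which (like JavaScript) represents floats as IEEE 754 doubles:

```python
import base64

# Integers above 2^53 lose precision once forced through a double:
big_id = 2**53 + 1
assert float(big_id) == float(2**53)     # two distinct ints, same double
assert int(float(big_id)) != big_id      # round-tripping changed the value

# Base64 inflates binary data by one third (4 output chars per 3 input bytes):
raw = bytes(300)
assert len(base64.b64encode(raw)) == 400  # 300 bytes -> 400 bytes (+33%)
```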
Despite these flaws, JSON, XML, and CSV are good enough for many purposes, especially as data interchange formats between organizations.
Binary Encoding
For internal use within an organization, binary formats can be more compact and faster to parse. Binary encodings for JSON include MessagePack, BSON, BJSON, UBJSON, BISON, Smile. For XML: WBXML, Fast Infoset.
Since they don't prescribe a schema, they must include all field names in the encoded data.
MessagePack Example
Encoding the following record:
```json
{
  "userName": "Martin",
  "favoriteNumber": 1337,
  "interests": ["daydreaming", "hacking"]
}
```
- The binary encoding is 66 bytes long, only slightly less than the 81 bytes of textual JSON (with whitespace removed).
- Not clear whether such a small space reduction is worth the loss of human-readability.
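The 81-byte figure for the whitespace-free textual JSON can be reproduced with Python's json module using compact separators:

```python
import json

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

# separators=(",", ":") removes the spaces json.dumps inserts by default.
compact = json.dumps(record, separators=(",", ":"))
assert len(compact.encode("utf-8")) == 81
```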
1.3 Thrift and Protocol Buffers
Apache Thrift (originally developed at Facebook) and Protocol Buffers (protobuf, originally developed at Google) are binary encoding libraries based on the same principle. Both made open source in 2007–08.
Both require a schema for any data that is encoded.
Thrift Schema (IDL)
```thrift
struct Person {
  1: required string userName,
  2: optional i64 favoriteNumber,
  3: optional list<string> interests
}
```
Protocol Buffers Schema
```protobuf
message Person {
  required string user_name = 1;
  optional int64 favorite_number = 2;
  repeated string interests = 3;
}
```
Both come with a code generation tool that produces classes implementing the schema in various programming languages.
Key Difference from MessagePack
The encoded data contains field tags (numbers 1, 2, 3) instead of field names. Field tags are like aliases — a compact way of identifying fields without spelling out the name.
Thrift BinaryProtocol (59 bytes)
Thrift CompactProtocol (34 bytes)
Packs field type and tag number into a single byte, uses variable-length integers (numbers between –64 and 63 encoded in one byte, –8192 to 8191 in two bytes, etc.).
Protocol Buffers (33 bytes)
Similar bit packing to Thrift's CompactProtocol.
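The variable-length integer scheme can be sketched as a zigzag encoding (interleaving signed values so small magnitudes stay small) followed by packing 7 bits per byte, with the high bit meaning "more bytes follow". This is a simplified sketch of the idea, not the exact Thrift or protobuf implementation:

```python
def zigzag(n: int) -> int:
    """Interleave signed values: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ... (64-bit)."""
    return (n << 1) ^ (n >> 63)

def varint(u: int) -> bytes:
    """Pack an unsigned int 7 bits per byte; high bit set = more bytes follow."""
    out = bytearray()
    while True:
        byte = u & 0x7F
        u >>= 7
        if u:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

# Numbers between -64 and 63 fit in one byte, -8192..8191 in two:
assert len(varint(zigzag(63))) == 1
assert len(varint(zigzag(-64))) == 1
assert len(varint(zigzag(8191))) == 2
assert len(varint(zigzag(-8192))) == 2
assert len(varint(zigzag(8192))) == 3
```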
Note: required vs optional makes no difference to encoding — it's just a runtime check that fails if a required field is not set.
Field Tags and Schema Evolution
- An encoded record is the concatenation of its encoded fields, each identified by its tag number and annotated with a datatype.
- You can change the name of a field (encoded data never refers to field names), but you cannot change a field's tag (would invalidate all existing encoded data).
Forward Compatibility (old code reads new data)
- Add new fields with new tag numbers. Old code that doesn't recognize a new tag number simply ignores that field (the datatype annotation tells the parser how many bytes to skip).
Backward Compatibility (new code reads old data)
- New code can always read old data because tag numbers still have the same meaning.
- Every field added after initial deployment must be optional or have a default value (if required, new code reading old data would fail the check).
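Forward compatibility hinges on the datatype (wire type) annotation telling an old parser how many bytes to skip. The following is a simplified sketch of a Protocol Buffers-style parser handling only the varint and length-delimited wire types, which is enough to show the skipping logic:

```python
def read_varint(buf: bytes, i: int):
    """Return (value, next_index) for a varint starting at buf[i]."""
    value, shift = 0, 0
    while True:
        b = buf[i]; i += 1
        value |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            return value, i

def parse(buf: bytes, known_tags) -> dict:
    """Decode fields with known tags; silently skip the rest."""
    fields, i = {}, 0
    while i < len(buf):
        key, i = read_varint(buf, i)
        tag, wire_type = key >> 3, key & 0x7
        if wire_type == 0:                    # varint
            value, i = read_varint(buf, i)
        elif wire_type == 2:                  # length-delimited (string/bytes)
            length, i = read_varint(buf, i)
            value, i = buf[i:i + length], i + length
        else:
            raise ValueError("wire type not handled in this sketch")
        if tag in known_tags:                 # unknown tag -> value discarded
            fields[tag] = value
    return fields

# Field 1 (varint 150) is known to old code; field 2 (a string) was added later.
msg = bytes([0x08, 0x96, 0x01, 0x12, 0x02]) + b"hi"
assert parse(msg, known_tags={1}) == {1: 150}        # old code skips field 2
assert parse(msg, known_tags={1, 2}) == {1: 150, 2: b"hi"}
```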
Removing Fields
- You can only remove a field that is optional (required fields can never be removed).
- You can never reuse the same tag number again.
Datatype Changes
- Changing a 32-bit integer to 64-bit: new code reads old data fine (fills missing bits with zeros). Old code reading new data may truncate the value.
- Protocol Buffers: changing optional (single-valued) to repeated (multi-valued) is safe. New code reading old data sees a list with zero or one elements; old code reading new data sees only the last element of the list. Thrift has a dedicated list datatype that doesn't allow this evolution but supports nested lists.
1.4 Avro
Apache Avro was started in 2009 as a subproject of Hadoop, because Thrift was not a good fit for Hadoop's use cases.
Two schema languages: Avro IDL (human editing) and JSON (machine-readable).
Avro IDL Schema
```avdl
record Person {
  string userName;
  union { null, long } favoriteNumber = null;
  array<string> interests;
}
```
Key difference: no tag numbers in the schema.
Avro Binary Encoding (32 bytes — the most compact)
- Nothing in the encoded data identifies fields or their datatypes. The encoding is simply values concatenated together.
- To parse, go through fields in the order they appear in the schema. The binary data can only be decoded correctly if the code reading the data uses the exact same schema as the code that wrote it.
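A toy sketch of schema-driven encoding (strings length-prefixed with a single byte, nothing identifying fields; field order comes entirely from the schema list). It shows why decoding with a different schema silently misinterprets the bytes:

```python
def encode(schema, record) -> bytes:
    """Concatenate values in schema order; no field names or tags in the output."""
    out = bytearray()
    for name in schema:
        value = record[name].encode("utf-8")
        out.append(len(value))              # 1-byte length prefix (toy limit: 255)
        out += value
    return bytes(out)

def decode(schema, buf: bytes) -> dict:
    """Only the schema tells us which bytes belong to which field."""
    record, i = {}, 0
    for name in schema:
        length = buf[i]; i += 1
        record[name] = buf[i:i + length].decode("utf-8"); i += length
    return record

schema = ["userName", "city"]
data = encode(schema, {"userName": "Martin", "city": "Cambridge"})
assert decode(schema, data) == {"userName": "Martin", "city": "Cambridge"}
# A mismatched schema reassigns the same bytes to the wrong fields:
assert decode(["city", "userName"], data) == {"city": "Martin", "userName": "Cambridge"}
```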
The Writer's Schema and the Reader's Schema
- Writer's schema — The schema the application uses when encoding data.
- Reader's schema — The schema the application expects when decoding data.
- The key idea: the writer's schema and reader's schema don't have to be the same — they only need to be compatible.
- The Avro library resolves differences by looking at both schemas side by side and translating.
- Fields matched by field name (not position). Fields in different order are fine.
- Field in writer's schema but not reader's → ignored.
- Field expected by reader but not in writer's schema → filled with default value.
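The resolution rules above can be sketched with a toy model in which a schema is just an ordered list of (name, default) pairs and a record is a dict (real Avro also resolves datatypes and nested structures):

```python
def resolve(writer_schema, reader_schema, record) -> dict:
    """Translate a record written with writer_schema into reader_schema's shape."""
    writer_fields = {name for name, _default in writer_schema}
    result = {}
    for name, default in reader_schema:
        if name in writer_fields:
            result[name] = record[name]     # matched by name, order irrelevant
        else:
            result[name] = default          # missing from writer -> default value
    # Fields the writer had but the reader doesn't expect are simply dropped.
    return result

writer = [("userName", None), ("photoURL", None)]
reader = [("favoriteNumber", 0), ("userName", None)]   # reordered, one new field
record = {"userName": "Martin", "photoURL": "http://example.com/m.jpg"}

assert resolve(writer, reader, record) == {"favoriteNumber": 0, "userName": "Martin"}
```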
Schema Evolution Rules
- You may only add or remove a field that has a default value.
- Adding a field without a default value breaks backward compatibility. Removing a field without a default value breaks forward compatibility.
- To allow null, you must use a union type: `union { null, long, string } field;` — Avro doesn't have optional/required markers like Thrift/Protobuf.
- Changing a field's datatype is possible if Avro can convert the type.
- Changing a field's name: use aliases in the reader's schema (backward compatible but not forward compatible).
How Does the Reader Know the Writer's Schema?
- Large file with lots of records — Include the writer's schema once at the beginning of the file (Avro object container files).
- Database with individually written records — Include a version number at the beginning of each record; keep a list of schema versions in the database. Reader fetches the writer's schema for that version. (Espresso works this way.)
- Sending records over a network — Negotiate the schema version on connection setup, use it for the lifetime of the connection (Avro RPC protocol).
Dynamically Generated Schemas
Avro's key advantage over Thrift/Protobuf: no tag numbers → friendlier to dynamically generated schemas.
- Example: dump a relational database to a binary file. Generate an Avro schema from the relational schema (each table → a record, each column → a field, column name → field name).
- If the database schema changes, just generate a new Avro schema. The data export process doesn't need to pay attention to schema changes.
- With Thrift/Protobuf, field tags would have to be assigned by hand every time the database schema changes.
Code Generation
- Thrift and Protobuf rely on code generation (useful in statically typed languages like Java, C++, C#).
- In dynamically typed languages (JavaScript, Ruby, Python), code generation is less useful and often frowned upon.
- Avro provides optional code generation. Without it, you can simply open an Avro object container file and look at the data (the file is self-describing).
1.5 The Merits of Schemas
Binary encodings based on schemas (Protocol Buffers, Thrift, Avro) have nice properties:
- More compact than "binary JSON" variants — can omit field names from encoded data.
- Schema is valuable documentation — required for decoding, so guaranteed to be up to date (unlike manually maintained docs).
- Database of schemas allows checking forward and backward compatibility before deployment.
- Code generation from schema enables type checking at compile time in statically typed languages.
Schema evolution allows the same flexibility as schemaless/schema-on-read JSON databases, while providing better guarantees about your data and better tooling.
Section 2: Modes of Dataflow
Whenever you want to send data to another process with which you don't share memory, you need to encode it as a sequence of bytes. Three common ways data flows between processes:
- Via databases
- Via service calls (REST and RPC)
- Via asynchronous message passing
2.1 Dataflow Through Databases
The process that writes to the database encodes the data; the process that reads from the database decodes it. You can think of storing something in the database as sending a message to your future self.
- Backward compatibility is clearly necessary (otherwise your future self can't decode what you previously wrote).
- Forward compatibility is also often required — in a rolling upgrade, a newer version may write a value that is subsequently read by an older version still running.
Preserving Unknown Fields
If you add a field to a record schema, newer code writes a value for that new field. An older version reads the record, updates it, and writes it back. The desirable behavior: old code should keep the new field intact, even though it couldn't interpret it.
If you decode a database value into model objects and later reencode, the unknown field might be lost in translation. You need to be aware of this at the application level.
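A sketch of the pitfall and one application-level fix, using JSON and a made-up set of known fields: instead of decoding into a fixed model and discarding everything else, carry the unrecognized keys alongside and write them back on re-encoding.

```python
import json

KNOWN_FIELDS = {"userName", "favoriteNumber"}   # what this (older) code models

def decode(raw: bytes):
    """Split a record into modelled fields and unknown fields to preserve."""
    data = json.loads(raw)
    model = {k: v for k, v in data.items() if k in KNOWN_FIELDS}
    extras = {k: v for k, v in data.items() if k not in KNOWN_FIELDS}
    return model, extras

def encode(model: dict, extras: dict) -> bytes:
    """Re-encode the model, writing the unknown fields back untouched."""
    return json.dumps({**model, **extras}).encode("utf-8")

# Newer code wrote a record with a field this older code doesn't know about:
raw = b'{"userName": "Martin", "favoriteNumber": 1337, "photoURL": "x"}'
model, extras = decode(raw)
model["favoriteNumber"] = 42                     # the update we wanted to make
round_tripped = json.loads(encode(model, extras))
assert round_tripped["photoURL"] == "x"          # unknown field survived
assert round_tripped["favoriteNumber"] == 42
```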
Data Outlives Code
- Within a single database, you may have values written five milliseconds ago and values written five years ago.
- When you deploy a new application version, you replace old code within minutes. But the five-year-old data will still be there in the original encoding unless explicitly rewritten. This is summed up as "data outlives code".
- Rewriting (migrating) data into a new schema is expensive on large datasets, so most databases avoid it. Most relational databases allow simple schema changes (e.g., adding a column with a null default) without rewriting existing data — the database fills in nulls for missing columns when old rows are read.
- LinkedIn's Espresso uses Avro for storage, leveraging Avro's schema evolution rules.
Archival Storage
- Database snapshots (for backup or loading into a data warehouse) are typically encoded using the latest schema, even if the source contained a mixture of schema versions.
- Since the dump is written in one go and is immutable, formats like Avro object container files are a good fit.
- Also a good opportunity to encode in an analytics-friendly column-oriented format such as Parquet.
2.2 Dataflow Through Services: REST and RPC
The most common arrangement: clients and servers. The server exposes an API over the network (a service), and clients connect to make requests.
- A server can itself be a client to another service (e.g., a web app server is a client to a database).
- This approach decomposes a large application into smaller services by area of functionality — traditionally called service-oriented architecture (SOA), more recently refined as microservices architecture.
- Services expose an application-specific API (not arbitrary queries like databases) — providing encapsulation and fine-grained restrictions on what clients can do.
- Key design goal: make services independently deployable and evolvable. Each service owned by one team, releasing new versions frequently without coordinating with other teams.
Web Services
When HTTP is the underlying protocol. Used in several contexts:
- Client application (mobile app, JavaScript web app) making requests over the public internet.
- One service to another within the same organization/datacenter (middleware).
- One service to another owned by a different organization (public APIs, OAuth, credit card processing).
REST vs SOAP
- REST — A design philosophy building upon HTTP principles. Emphasizes simple data formats, URLs for identifying resources, HTTP features for cache control, authentication, content type negotiation. Associated with microservices. An API designed according to REST principles is called RESTful.
- SOAP — An XML-based protocol for network API requests. Aims to be independent from HTTP. Comes with a sprawling multitude of related standards (WS-*). API described using WSDL (Web Services Description Language) — enables code generation but is not human-readable. Users rely heavily on tool support and IDEs.
- SOAP has fallen out of favor in most smaller companies but is still used in many large enterprises.
- RESTful APIs favor simpler approaches with less code generation. Definition formats like OpenAPI (Swagger) can describe RESTful APIs and produce documentation.
The Problems with Remote Procedure Calls (RPCs)
RPC tries to make a request to a remote network service look the same as calling a local function (location transparency). This is fundamentally flawed:
- Unpredictable — A network request may be lost, or the remote machine may be slow/unavailable. You must anticipate network problems and retry.
- Timeouts — A network request may return without a result. You don't know if the request got through or not.
- Idempotence needed — Retrying may cause the action to be performed multiple times (the request got through but the response was lost). Need deduplication mechanisms.
- Variable latency — A network request is much slower than a function call, and latency is wildly variable.
- Encoding overhead — Parameters must be encoded into bytes. Problematic with larger objects.
- Cross-language type translation — The RPC framework must translate datatypes between languages (e.g., JavaScript's problems with numbers > 2^53).
Part of the appeal of REST is that it doesn't try to hide the fact that it's a network protocol.
Current Directions for RPC
Despite the problems, RPC isn't going away. New generation of frameworks:
- gRPC — RPC using Protocol Buffers.
- Finagle — Uses Thrift.
- Rest.li — Uses JSON over HTTP.
- Thrift and Avro come with RPC support included.
These frameworks are more explicit about remote requests being different from local calls:
- Futures (promises) — Encapsulate asynchronous actions that may fail. Simplify parallel requests to multiple services.
- Streams (gRPC) — A call consists of a series of requests and responses over time.
- Service discovery — Allowing a client to find which IP address and port number a particular service is at.
REST vs RPC Trade-offs
- Custom RPC with binary encoding can achieve better performance than JSON over REST.
- RESTful APIs have significant advantages: good for experimentation and debugging (curl, web browser), supported by all mainstream languages, vast ecosystem of tools (servers, caches, load balancers, proxies, firewalls, monitoring).
- REST is predominant for public APIs. RPC is mainly for requests between services within the same organization/datacenter.
Data Encoding and Evolution for RPC
- Assume servers are updated first, clients second → need backward compatibility on requests and forward compatibility on responses.
- Compatibility properties are inherited from the encoding format used (Thrift, gRPC/Protobuf, Avro RPC, SOAP/XML schemas, RESTful/JSON).
- Service compatibility is harder across organizational boundaries — the provider often has no control over clients and cannot force upgrades. May need to maintain multiple versions of the API side by side.
- API versioning approaches: version number in the URL, HTTP Accept header, or a version stored on the server per client and selected via API keys (e.g., Stripe's approach).
2.3 Message-Passing Dataflow
Asynchronous message-passing systems are somewhere between RPC and databases:
- Like RPC: a client's request (a message) is delivered to another process with low latency.
- Like databases: the message goes via an intermediary called a message broker (message queue / message-oriented middleware), which stores the message temporarily.
Advantages of a Message Broker
- Acts as a buffer if the recipient is unavailable or overloaded → improves reliability.
- Can automatically redeliver messages to a crashed process → prevents message loss.
- Avoids the sender needing to know the IP address and port of the recipient (useful in cloud deployments where VMs come and go).
- Allows one message to be sent to several recipients.
- Logically decouples the sender from the recipient (sender publishes messages, doesn't care who consumes them).
Communication is usually one-way (sender doesn't expect a reply) and asynchronous (sender doesn't wait for delivery).
Message Brokers
- Past: commercial enterprise software (TIBCO, IBM WebSphere, webMethods).
- Now: open source — RabbitMQ, ActiveMQ, HornetQ, NATS, Apache Kafka.
- One process sends a message to a named queue or topic; the broker delivers it to one or more consumers/subscribers.
- A topic provides one-way dataflow. A consumer may publish to another topic (chaining) or to a reply queue (request/response pattern).
- Message brokers don't enforce any particular data model — a message is just a sequence of bytes with some metadata. Use any encoding format.
- If encoding is backward and forward compatible, publishers and consumers can be changed and deployed independently.
Distributed Actor Frameworks
The actor model is a programming model for concurrency:
- Logic is encapsulated in actors (instead of dealing directly with threads, race conditions, locking, deadlock).
- Each actor has some local state (not shared), communicates with other actors by sending/receiving asynchronous messages.
- Message delivery is not guaranteed (messages may be lost in error scenarios).
- Each actor processes only one message at a time → no need to worry about threads.
In distributed actor frameworks, the same message-passing mechanism works across multiple nodes. Messages are transparently encoded, sent over the network, and decoded on the other side.
Location transparency works better in the actor model than in RPC, because the actor model already assumes messages may be lost.
A distributed actor framework integrates a message broker and the actor programming model into a single framework. Rolling upgrades still require forward and backward compatibility.
Three Popular Distributed Actor Frameworks
- Akka — Uses Java's built-in serialization by default (no forward/backward compatibility). Can be replaced with Protocol Buffers for rolling upgrades.
- Orleans — Uses a custom encoding format that doesn't support rolling upgrades by default. To deploy a new version: set up a new cluster, move traffic, shut down the old one. Custom serialization plug-ins can be used.
- Erlang OTP — Surprisingly hard to make changes to record schemas despite many high-availability features. Rolling upgrades are possible but need careful planning. The experimental maps datatype (Erlang R17, 2014) may make this easier.
Summary
Key Takeaways
Encoding Formats:
- Language-specific encodings — Restricted to a single language, often fail to provide compatibility. Avoid.
- Textual formats (JSON, XML, CSV) — Widespread, compatibility depends on usage. Vague about datatypes (numbers, binary strings). Optional schema languages.
- Binary schema-driven formats (Thrift, Protocol Buffers, Avro) — Compact, efficient, clearly defined forward and backward compatibility semantics. Schemas are useful for documentation and code generation. Downside: data must be decoded before it's human-readable.
Encoding Size Comparison:
| Format | Bytes |
|---|---|
| JSON (textual, no whitespace) | 81 |
| MessagePack (binary JSON) | 66 |
| Thrift BinaryProtocol | 59 |
| Thrift CompactProtocol | 34 |
| Protocol Buffers | 33 |
| Avro | 32 |
Modes of Dataflow:
- Databases — Writer encodes, reader decodes. Data outlives code. Preserve unknown fields.
- RPC and REST APIs — Client encodes request, server decodes and encodes response, client decodes response. REST predominant for public APIs; RPC for internal services.
- Asynchronous message passing (message brokers or actors) — Nodes communicate by sending encoded messages. One-way, asynchronous, decoupled.
With care, backward/forward compatibility and rolling upgrades are quite achievable.
Important References
- Mark Slee, Aditya Agarwal, and Marc Kwiatkowski: "Thrift: Scalable Cross-Language Services Implementation," Facebook technical report, April 2007.
- "Protocol Buffers Developer Guide," Google, Inc.
- "Apache Avro 1.7.7 Documentation," avro.apache.org.
- Martin Kleppmann: "Schema Evolution in Avro, Protocol Buffers and Thrift," martin.kleppmann.com, December 5, 2012.
- Roy Thomas Fielding: "Architectural Styles and the Design of Network-Based Software Architectures," PhD Thesis, University of California, Irvine, 2000.
- Sam Newman: Building Microservices, O'Reilly Media, 2015.
- Pat Helland: "Data on the Outside Versus Data on the Inside," at 2nd CIDR, January 2005.
- Michi Henning: "The Rise and Fall of CORBA," ACM Queue, volume 4, number 5, pages 28–34, June 2006.
- Andrew D. Birrell and Bruce Jay Nelson: "Implementing Remote Procedure Calls," ACM TOCS, volume 2, number 1, pages 39–59, February 1984.
- Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall: "A Note on Distributed Computing," Sun Microsystems Laboratories, Technical Report TR-94-29, November 1994.
- Steve Vinoski: "Convenience over Correctness," IEEE Internet Computing, volume 12, number 4, pages 89–92, July 2008.
- "grpc-common Documentation," Google, Inc.
- Tony Hoare: "Null References: The Billion Dollar Mistake," at QCon London, March 2009.
- Philip A. Bernstein, Sergey Bykov, Alan Geller, et al.: "Orleans: Distributed Virtual Actors for Programmability and Scalability," Microsoft Research Technical Report MSR-TR-2014-41, March 2014.
- "OpenAPI Specification (Swagger)," swagger.io.