Pagefy

Partitioning

Designing Data Intensive Applications by Martin Kleppmann

Chapter 6: Partitioning

Introduction

For very large datasets or very high query throughput, replication alone is not sufficient — we need to break the data up into partitions (also known as sharding).

Terminology Across Databases

Term      | Used By
----------|-----------------------------------
Partition | General / established term
Shard     | MongoDB, Elasticsearch, SolrCloud
Region    | HBase
Tablet    | Bigtable
vnode     | Cassandra, Riak
vBucket   | Couchbase

Each piece of data belongs to exactly one partition. In effect, each partition is a small database of its own. The main reason for partitioning is scalability: different partitions can be placed on different nodes in a shared-nothing cluster, distributing data across many disks and query load across many processors.


Section 1: Partitioning and Replication

Partitioning is usually combined with replication so that copies of each partition are stored on multiple nodes for fault tolerance. A node may store more than one partition. With leader–follower replication, each node may be the leader for some partitions and a follower for others.


Section 2: Partitioning of Key-Value Data

The goal is to spread data and query load evenly across nodes. If partitioning is unfair, we call it skewed. A partition with disproportionately high load is called a hot spot.

Assigning records to nodes randomly distributes data evenly, but when you read a particular item you have no way of knowing which node it is on, so you have to query all nodes in parallel. We can do better.


2.1 Partitioning by Key Range

Assign a continuous range of keys (from some minimum to some maximum) to each partition, like volumes of a paper encyclopedia.

  • The ranges are not necessarily evenly spaced — partition boundaries need to adapt to the data distribution.
  • Boundaries can be chosen manually by an administrator or automatically by the database.
  • Used by: Bigtable, HBase, RethinkDB, MongoDB (before version 2.4).

Advantages

  • Within each partition, keys can be kept in sorted order → efficient range scans.
  • The key can be treated as a concatenated index to fetch several related records in one query.
  • Example: sensor data with key = timestamp (year-month-day-hour-minute-second) → easily fetch all readings from a particular month.

Disadvantage: Hot Spots

  • Certain access patterns lead to hot spots. If the key is a timestamp, all writes go to the same partition (today's partition) while others sit idle.
  • Solution: Prefix each timestamp with the sensor name so partitioning is first by sensor name, then by time. Write load is spread across partitions. Trade-off: fetching values for multiple sensors within a time range requires a separate range query for each sensor name.
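The sensor-prefixed scheme above can be sketched in Python. This is a minimal illustration with hypothetical sensor names and hand-picked partition boundaries; a real system would derive the boundaries from the observed key distribution:

```python
import bisect

# Each entry is the lowest compound (sensor, timestamp) key a partition
# owns; partition i covers keys from BOUNDARIES[i] up to BOUNDARIES[i+1].
BOUNDARIES = [("sensor-a", ""), ("sensor-h", ""), ("sensor-p", "")]

def partition_for(sensor: str, timestamp: str) -> int:
    """Find the partition owning this compound key in O(log n)."""
    return bisect.bisect_right(BOUNDARIES, (sensor, timestamp)) - 1

# Writes for different sensors land on different partitions, so write
# load is spread even though every sensor is writing "now" timestamps:
partition_for("sensor-b", "2024-06-01T12:00")  # partition 0
partition_for("sensor-q", "2024-06-01T12:00")  # partition 2
```

Within one partition the keys stay sorted, so all readings for one sensor over a time range are a contiguous scan; fetching a time range across many sensors still needs one range query per sensor, as noted above.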

2.2 Partitioning by Hash of Key

Use a hash function to determine the partition for a given key. A good hash function takes skewed data and makes it uniformly distributed.

  • Assign each partition a range of hashes (rather than a range of keys). Every key whose hash falls within a partition's range is stored in that partition.
  • The hash function need not be cryptographically strong. Cassandra and MongoDB use MD5; Voldemort uses Fowler–Noll–Vo.
  • Caution: Java's Object.hashCode() and Ruby's Object#hash may return different hash values in different processes — not suitable for partitioning.
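As a sketch, hash partitioning with a stable hash and contiguous hash ranges might look like this (the partition count and key format are illustrative):

```python
import hashlib

N_PARTITIONS = 8      # illustrative; fixed up front in real systems
HASH_SPACE = 2 ** 64  # we use the first 8 bytes of the MD5 digest

def partition_for(key: str) -> int:
    """Map a key to a partition by which contiguous hash range it falls in.
    MD5 is stable across processes and languages, unlike Object.hashCode()."""
    h = int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")
    return h * N_PARTITIONS // HASH_SPACE

# Skewed keys (e.g. sequential user IDs) spread uniformly over partitions,
# and every process computes the same partition for the same key.
```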

Consistent Hashing

  • Defined by Karger et al. — a way of evenly distributing load using randomly chosen partition boundaries without central control.
  • The term is confusing ("consistent" has nothing to do with replica consistency or ACID consistency). This approach doesn't work very well for databases in practice, so it's best to just call it hash partitioning.

Disadvantage: No Efficient Range Queries

  • By using the hash of the key, we lose the ability to do efficient range queries. Keys that were once adjacent are now scattered across all partitions.
  • MongoDB (hash-based sharding): range queries must be sent to all partitions.
  • Range queries on primary key are not supported by Riak, Couchbase, or Voldemort.

Cassandra's Compromise: Compound Primary Key

  • A table can be declared with a compound primary key consisting of several columns.
  • Only the first part of the key is hashed to determine the partition; the other columns are used as a concatenated index for sorting data in SSTables.
  • A query cannot search for a range of values within the first column, but if it specifies a fixed value for the first column, it can perform an efficient range scan over the other columns.
  • Example: primary key (user_id, update_timestamp) → efficiently retrieve all updates by a particular user within a time interval, sorted by timestamp. Different users may be on different partitions, but within each user, updates are ordered on a single partition.
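Cassandra's compromise can be sketched in Python: hash only the first key column to pick the partition, and keep rows sorted by the remaining columns within it. The names and the in-memory lists here are stand-ins for real SSTables:

```python
import bisect
import hashlib

N_PARTITIONS = 4
partitions = [[] for _ in range(N_PARTITIONS)]  # each kept sorted, like an SSTable

def _partition(user_id: str) -> int:
    # Only the first key column is hashed to choose the partition.
    h = int.from_bytes(hashlib.md5(user_id.encode("utf-8")).digest()[:8], "big")
    return h % N_PARTITIONS

def insert(user_id: str, ts: str, value: str) -> None:
    bisect.insort(partitions[_partition(user_id)], (user_id, ts, value))

def updates_between(user_id: str, ts_lo: str, ts_hi: str) -> list:
    """Range scan over the non-hashed column: one partition, one
    contiguous slice (ts_lo inclusive, ts_hi exclusive)."""
    rows = partitions[_partition(user_id)]
    lo = bisect.bisect_left(rows, (user_id, ts_lo))
    hi = bisect.bisect_left(rows, (user_id, ts_hi))
    return rows[lo:hi]
```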

2.3 Skewed Workloads and Relieving Hot Spots

Hashing reduces hot spots but can't avoid them entirely. If all reads and writes are for the same key (e.g., a celebrity user on social media with millions of followers), all requests go to the same partition.

Technique: Random Prefix or Suffix

  • Add a random number (e.g., two-digit decimal) to the beginning or end of the hot key → splits writes evenly across 100 different keys on different partitions.
  • Trade-off: Reads must now read from all 100 keys and combine results. Only makes sense for the small number of hot keys — unnecessary overhead for the vast majority of keys. Need bookkeeping to track which keys are being split.
  • Most data systems currently cannot automatically compensate for highly skewed workloads — it's the application's responsibility.
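The splitting technique might be sketched like this; the hot-key set, suffix count, and key format are all illustrative:

```python
import random
from collections import defaultdict

store = defaultdict(list)     # stand-in for a partitioned key-value store
HOT_KEYS = {"celebrity:123"}  # bookkeeping: which keys are split
N_SPLITS = 100                # two-digit decimal suffix -> 100 sub-keys

def write(key: str, value) -> None:
    if key in HOT_KEYS:
        # Spread writes for the hot key over 100 sub-keys, which hash
        # to (mostly) different partitions.
        key = f"{key}#{random.randrange(N_SPLITS):02d}"
    store[key].append(value)

def read(key: str) -> list:
    if key not in HOT_KEYS:
        return store[key]
    # The price: a read must gather and combine all 100 sub-keys.
    values = []
    for i in range(N_SPLITS):
        values.extend(store[f"{key}#{i:02d}"])
    return values
```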

Section 3: Partitioning and Secondary Indexes

Secondary indexes don't identify a record uniquely — they are a way of searching for occurrences of a particular value (e.g., find all cars whose color is red). They are the bread and butter of relational databases and common in document databases too. The problem: secondary indexes don't map neatly to partitions.

Two main approaches:


3.1 Document-Partitioned Indexes (Local Indexes)

Each partition maintains its own secondary indexes, covering only the documents in that partition. When you write a document, you only deal with the partition containing that document ID.

  • Writing: Simple — only update the partition containing the document.
  • Reading: Must send the query to all partitions and combine results. This is called scatter/gather — can be expensive and prone to tail latency amplification.
  • Used by: MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud, VoltDB.
  • Database vendors recommend structuring partitioning so secondary index queries can be served from a single partition, but this is not always possible (especially with multiple secondary indexes in a single query).
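A toy sketch of scatter/gather over local indexes; the documents, partitions, and `color` field are made up for illustration:

```python
# Two partitions, each holding some documents and indexing only its own.
partitions = [
    {1: {"color": "red"}, 2: {"color": "yellow"}},
    {3: {"color": "red"}},
]

def local_index(docs: dict, field: str) -> dict:
    """Build this partition's secondary index over one field."""
    index = {}
    for doc_id, doc in docs.items():
        index.setdefault(doc[field], []).append(doc_id)
    return index

def scatter_gather(field: str, value: str) -> list:
    """No single partition knows all the matches, so the query is sent
    to every partition and the partial results are combined."""
    hits = []
    for docs in partitions:
        hits.extend(local_index(docs, field).get(value, []))
    return sorted(hits)
```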

3.2 Term-Partitioned Indexes (Global Indexes)

A global index covers data in all partitions, but is itself partitioned (differently from the primary key index).

  • The term we're looking for determines the partition of the index (e.g., color:red → partition 0; color:yellow → partition 1).
  • Can partition by the term itself (useful for range scans) or by a hash of the term (more even load distribution).

       | Document-Partitioned (Local)                     | Term-Partitioned (Global)
-------|--------------------------------------------------|------------------------------------------------------------------------
Reads  | Scatter/gather across all partitions (expensive) | Single partition lookup (efficient)
Writes | Only update one partition (fast)                 | May need to update multiple partitions of the index (slower, more complicated)

  • In practice, updates to global secondary indexes are often asynchronous (the change may not be immediately reflected in the index).
  • Amazon DynamoDB: global secondary indexes updated within a fraction of a second normally, but may experience longer delays during infrastructure faults.
  • Also used by: Riak's search feature, Oracle data warehouse.
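A toy sketch of a term-partitioned index; the term boundary and posting lists are illustrative:

```python
index_partitions = [{}, {}]  # two index partitions

def index_partition_for(term: str) -> int:
    # Partition by the term itself: terms below "s" go to partition 0.
    # Partitioning by hash(term) instead would even out the load.
    return 0 if term < "s" else 1

def index_write(term: str, doc_id: int) -> None:
    # A single document write may touch several index partitions (one
    # per indexed term), which is why updates are often asynchronous.
    index_partitions[index_partition_for(term)].setdefault(term, []).append(doc_id)

def index_read(term: str) -> list:
    # A read needs only the one partition that owns the term.
    return index_partitions[index_partition_for(term)].get(term, [])
```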

Section 4: Rebalancing Partitions

Over time, query throughput increases, dataset size grows, or machines fail — data and requests need to be moved from one node to another. This is called rebalancing.

Requirements

  1. After rebalancing, load should be shared fairly between nodes.
  2. The database should continue accepting reads and writes during rebalancing.
  3. No more data than necessary should be moved (minimize network and disk I/O).

4.1 How NOT to Do It: hash mod N

hash(key) mod N seems easy, but if N changes, most keys need to be moved. Example: hash(key) = 123456 → on node 6 with 10 nodes, node 3 with 11 nodes, node 0 with 12 nodes. Excessively expensive.
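A quick Python check makes the cost concrete; plain integers stand in for hash values here, which is fair since a good hash is uniform:

```python
def moved_fraction(n_before: int, n_after: int, n_keys: int = 100_000) -> float:
    """Fraction of keys whose node changes when the cluster grows from
    n_before to n_after nodes under the naive hash(key) mod N scheme."""
    moved = sum(1 for h in range(n_keys) if h % n_before != h % n_after)
    return moved / n_keys

print(f"{moved_fraction(10, 11):.0%}")  # → 91% of all keys move
```

Compare that with the fixed-number-of-partitions approach below, where adding an eleventh node moves only about 1/11 of the data.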


4.2 Fixed Number of Partitions

Create many more partitions than nodes and assign several partitions to each node. Example: 10 nodes, 1,000 partitions → ~100 partitions per node.

  • When a node is added, it steals a few partitions from every existing node. When removed, the reverse happens.
  • Only entire partitions are moved. The number of partitions doesn't change, nor does the assignment of keys to partitions — only the assignment of partitions to nodes changes.
  • Can account for mismatched hardware by assigning more partitions to more powerful nodes.
  • Used by: Riak, Elasticsearch, Couchbase, Voldemort.
  • The number of partitions is usually fixed at setup and not changed afterward. Must be chosen high enough for future growth, but not so high that management overhead becomes excessive.
  • Difficulty: Choosing the right number if dataset size is highly variable (partitions too large → expensive rebalancing; too small → too much overhead).
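A sketch of the bookkeeping, assuming hypothetical node names and a round-robin initial assignment; note that only the partition→node mapping ever changes, never the key→partition mapping:

```python
N_PARTITIONS = 1000

def initial_assignment(nodes: list) -> dict:
    """Round-robin the fixed set of partition IDs over the nodes."""
    return {p: nodes[p % len(nodes)] for p in range(N_PARTITIONS)}

def add_node(assignment: dict, new_node: str) -> dict:
    """The new node steals whole partitions from every existing node
    until loads are even; no more data moves than necessary."""
    nodes = sorted(set(assignment.values()))
    target = len(assignment) // (len(nodes) + 1)
    moved = {}
    for node in nodes:
        owned = [p for p, owner in assignment.items() if owner == node]
        for p in owned[target:]:  # give away only the excess
            moved[p] = new_node
    return {**assignment, **moved}

before = initial_assignment(["n0", "n1", "n2", "n3"])
after = add_node(before, "n4")
# Exactly 200 of the 1,000 partitions change owner (the minimum), and
# every node ends up holding 200 partitions.
```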

4.3 Dynamic Partitioning

For key range–partitioned databases, fixed boundaries would be inconvenient (all data could end up in one partition).

  • When a partition grows to exceed a configured size (HBase default: 10 GB), it is split into two. When a partition shrinks below a threshold, it is merged with an adjacent partition. Similar to B-tree top-level splitting.
  • After splitting, one half can be transferred to another node to balance load.
  • Advantage: Number of partitions adapts to total data volume — small data = few partitions (low overhead); huge data = bounded partition size.
  • Caveat: An empty database starts with a single partition (all writes go to one node). To mitigate: pre-splitting — configure an initial set of partitions on an empty database (HBase, MongoDB). Requires knowing the key distribution in advance.
  • Works with both key-range and hash-partitioned data (MongoDB 2.4+ supports both).
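The split step might be sketched as follows, with a tiny row-count threshold standing in for HBase's 10 GB size threshold:

```python
SPLIT_THRESHOLD = 4  # rows, standing in for a byte-size threshold

class Partition:
    """A key-range partition holding its keys in sorted order."""
    def __init__(self, keys=()):
        self.keys = sorted(keys)

    def maybe_split(self):
        """Split at the median key once the partition grows too large,
        much like a B-tree node; one half can then be transferred to
        another node to balance load."""
        if len(self.keys) <= SPLIT_THRESHOLD:
            return [self]
        mid = len(self.keys) // 2
        return [Partition(self.keys[:mid]), Partition(self.keys[mid:])]
```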

4.4 Partitioning Proportionally to Nodes

A third option (used by Cassandra and Ketama): a fixed number of partitions per node (Cassandra default: 256 per node).

  • Partition size grows proportionally to dataset size while node count is unchanged; when nodes are added, partitions become smaller.
  • When a new node joins, it randomly chooses existing partitions to split and takes ownership of one half of each.
  • Randomization can produce unfair splits, but averages out over many partitions. Cassandra 3.0 introduced an improved algorithm to avoid unfair splits.
  • Requires hash-based partitioning (boundaries picked from the hash function range). Corresponds most closely to the original definition of consistent hashing.

4.5 Automatic or Manual Rebalancing

  • Fully automatic — Convenient but can be unpredictable. Rebalancing is expensive (rerouting requests, moving large amounts of data). Can overload the network and harm performance.
  • Dangerous with automatic failure detection — An overloaded node is temporarily slow → other nodes conclude it's dead → automatically rebalance → puts additional load on the overloaded node and network → cascading failure.
  • Fully manual — Administrator explicitly configures partition assignment.
  • Middle ground (Couchbase, Riak, Voldemort) — System generates a suggested partition assignment, but requires an administrator to commit it before taking effect.
  • Having a human in the loop is slower but helps prevent operational surprises.

Section 5: Request Routing

When a client wants to make a request, how does it know which node to connect to? This is an instance of service discovery.

Three Approaches

  1. Contact any node (e.g., via round-robin load balancer). If that node owns the partition, it handles the request; otherwise, it forwards to the appropriate node.
  2. Routing tier — All requests go to a partition-aware load balancer first, which forwards to the correct node. The routing tier doesn't handle requests itself.
  3. Client-aware — The client knows the partitioning and connects directly to the appropriate node.

ZooKeeper

Many distributed data systems rely on a separate coordination service like ZooKeeper to keep track of cluster metadata:

  • Each node registers itself in ZooKeeper.
  • ZooKeeper maintains the authoritative mapping of partitions to nodes.
  • Routing tier or partitioning-aware clients subscribe to this information.
  • When a partition changes ownership or a node is added/removed, ZooKeeper notifies the routing tier.
  • Used by: LinkedIn's Espresso (via Helix), HBase, SolrCloud, Kafka.
  • MongoDB uses its own config server implementation and mongos daemons as the routing tier.

Gossip Protocol

Cassandra and Riak use a gossip protocol among nodes to disseminate cluster state changes. Requests can be sent to any node, which forwards to the appropriate node (approach 1). Avoids dependency on an external coordination service but puts more complexity in the database nodes.

Parallel Query Execution

MPP (Massively Parallel Processing) relational database products used for analytics support much more sophisticated queries (joins, filtering, grouping, aggregation). The MPP query optimizer breaks complex queries into execution stages and partitions, many of which execute in parallel on different nodes.


Summary

Key Takeaways

Two Main Partitioning Approaches:

  1. Key range partitioning — Keys are sorted; a partition owns a range from some minimum to some maximum. Efficient range queries, but risk of hot spots. Typically rebalanced by dynamic splitting.
  2. Hash partitioning — Hash function applied to each key; partition owns a range of hashes. Destroys key ordering (range queries inefficient), but distributes load more evenly. Commonly uses a fixed number of partitions.
  3. Hybrid — Compound key: one part identifies the partition (hashed), another part for sort order (Cassandra).

Secondary Index Partitioning:

  • Document-partitioned (local) — Index stored in same partition as data. Single partition updated on write; scatter/gather on read.
  • Term-partitioned (global) — Index partitioned separately by indexed values. Efficient reads from single partition; writes may update multiple partitions (often async).

Rebalancing Strategies:

  • Don't use hash mod N.
  • Fixed number of partitions (Riak, Elasticsearch, Couchbase).
  • Dynamic partitioning with splitting/merging (HBase, MongoDB).
  • Proportional to nodes (Cassandra).
  • Prefer human in the loop to prevent cascading failures.

Request Routing:

  • Contact any node (gossip protocol — Cassandra, Riak).
  • Routing tier (ZooKeeper — HBase, Kafka, Espresso).
  • Client-aware partitioning.

Important References

  1. David J. DeWitt and Jim N. Gray: "Parallel Database Systems: The Future of High Performance Database Systems," Communications of the ACM, volume 35, number 6, pages 85–98, June 1992.
  2. David Karger, Eric Lehman, Tom Leighton, et al.: "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web," at 29th Annual ACM STOC, pages 654–663, 1997.
  3. Avinash Lakshman and Prashant Malik: "Cassandra – A Decentralized Structured Storage System," at 3rd ACM SIGOPS LADIS, October 2009.
  4. Martin Kleppmann: "Java's hashCode Is Not Safe for Distributed Systems," martin.kleppmann.com, June 18, 2012.
  5. John Lamping and Eric Veach: "A Fast, Minimal Memory, Consistent Hash Algorithm," arxiv.org, June 2014.
  6. Kishore Gopalakrishna, Shi Lu, Zhen Zhang, et al.: "Untangling Cluster Management with Helix," at ACM SoCC, October 2012.
  7. Shivnath Babu and Herodotos Herodotou: "Massively Parallel Databases and MapReduce Systems," Foundations and Trends in Databases, volume 5, number 1, pages 1–104, November 2013.
  8. Eric Evans: "Rethinking Topology in Cassandra," at ApacheCon Europe, November 2012.
  9. Ikai Lan: "App Engine Datastore Tip: Monotonically Increasing Values Are Bad," ikaisays.com, January 25, 2011.
  10. Branimir Lambov: "New Token Allocation Algorithm in Cassandra 3.0," datastax.com, January 28, 2016.
