Pagefy



Designing Data-Intensive Applications by Martin Kleppmann

Chapter 3: Storage and Retrieval

Introduction

On the most fundamental level, a database needs to do two things: when you give it some data, it should store the data, and when you ask it again later, it should give the data back to you.

As an application developer, you need to select a storage engine appropriate for your application. In order to tune a storage engine to perform well on your kind of workload, you need a rough idea of what it's doing under the hood.

There is a big difference between storage engines optimized for transactional workloads and those optimized for analytics. Two families of storage engines:

  1. Log-structured storage engines
  2. Page-oriented storage engines (such as B-trees)

Section 1: Data Structures That Power Your Database

The World's Simplest Database

Consider a key-value store implemented as two Bash functions:

#!/bin/bash
db_set () {
    echo "$1,$2" >> database
}
db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
  • db_set appends to the end of a file — very efficient (O(1) writes).
  • db_get scans the entire file — terrible performance (O(n) reads).
  • Many databases internally use a log: an append-only sequence of records (not necessarily human-readable).

The Need for Indexes

  • To efficiently find a value for a particular key, we need an index — additional metadata on the side that acts as a signpost.
  • An index is an additional structure derived from the primary data. It doesn't affect the contents of the database, only the performance of queries.
  • Important trade-off: Well-chosen indexes speed up read queries, but every index slows down writes (because the index must be updated on every write).
  • Databases don't usually index everything by default — you choose indexes manually based on your application's typical query patterns.

Section 2: Hash Indexes

The simplest indexing strategy: keep an in-memory hash map where every key is mapped to a byte offset in the data file — the location at which the value can be found.

  • When you append a new key-value pair, update the hash map to reflect the offset of the data you just wrote.
  • To look up a value, use the hash map to find the offset, seek to that location, and read the value.
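The scheme above can be sketched in a few lines of Python (the class name, file layout, and CSV-style record format are illustrative, echoing the Bash example earlier):

```python
import os
import tempfile

class HashIndexedLog:
    """Append-only log file plus an in-memory hash map from each key to
    the byte offset of the key's most recent record (Bitcask-style sketch)."""

    def __init__(self, path):
        self.path = path
        self.index = {}                  # key -> byte offset of latest record

    def set(self, key, value):
        record = f"{key},{value}\n".encode()
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)       # append position = current file size
            offset = f.tell()
            f.write(record)
        self.index[key] = offset         # point the index at the new record

    def get(self, key):
        if key not in self.index:
            return None
        with open(self.path, "rb") as f:
            f.seek(self.index[key])      # one seek, then read one record
            line = f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]

# Tiny demo in a temporary directory:
db = HashIndexedLog(os.path.join(tempfile.mkdtemp(), "database"))
db.set("42", "San Francisco")
db.set("42", "SF")                       # later write shadows the earlier one
```

Note that the old record for key 42 is still in the file; only the index forgets it, which is why compaction (below) is needed.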

Bitcask

This is essentially what Bitcask (the default storage engine in Riak) does:

  • Offers high-performance reads and writes.
  • Requirement: all keys must fit in available RAM (hash map is kept entirely in memory).
  • Values can use more space than available memory — loaded from disk with just one disk seek.
  • Well suited for workloads where the value for each key is updated frequently (e.g., URL → play count for a cat video). Many writes per key, but feasible to keep all keys in memory.

Compaction and Segment Merging

How to avoid running out of disk space with an append-only log:

  1. Break the log into segments — Close a segment file when it reaches a certain size, write subsequent data to a new segment.
  2. Compaction — Throw away duplicate keys in the log, keeping only the most recent update for each key.
  3. Merging — Merge segments together while performing compaction. Segments are never modified after being written — the merged segment is written to a new file. Merging happens in a background thread; reads and writes continue using old segments until merging is complete, then old segments are deleted.
  • Each segment has its own in-memory hash table mapping keys to file offsets.
  • To find a key: check the most recent segment's hash map first, then the second-most-recent, and so on.
  • Merging keeps the number of segments small, so lookups don't check many hash maps.
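A minimal sketch of compaction and merging, with in-memory lists standing in for segment files (all names and example data are illustrative):

```python
def compact(segment):
    """Within one segment, keep only the most recent value for each key
    (later records in the list represent later appends)."""
    latest = {}
    for key, value in segment:
        latest[key] = value              # later appends overwrite earlier ones
    return list(latest.items())

def merge_segments(segments):
    """Merge several segments, given oldest first, into one compacted
    segment; on duplicate keys the newer segment's value wins."""
    merged = {}
    for segment in segments:             # oldest -> newest
        for key, value in segment:
            merged[key] = value
    return list(merged.items())

# Example: play counts for cat videos, appended over time.
old_seg = [("mew", 1078), ("purr", 2103), ("purr", 2104), ("mew", 1079)]
new_seg = [("purr", 2105), ("yawn", 511)]
```

In a real engine the merged output would be written to a new segment file in the background while reads continue against the old ones.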

Implementation Details

  • File format — A binary format is faster and simpler than CSV: first encode the length of a string in bytes, then the raw string itself (no escaping needed).
  • Deleting records — Append a special deletion record called a tombstone. During merging, the tombstone tells the process to discard previous values for that key.
  • Crash recovery — In-memory hash maps are lost on restart. Can restore by reading entire segment files (slow), or store a snapshot of each segment's hash map on disk (Bitcask's approach) for faster recovery.
  • Partially written records — Use checksums to detect and ignore corrupted parts of the log.
  • Concurrency control — Only one writer thread (strictly sequential appends). Segment files are append-only and immutable, so they can be read concurrently by multiple threads.

Why Append-Only?

  • Sequential writes are much faster than random writes, especially on magnetic spinning-disk hard drives. Also preferable on SSDs.
  • Concurrency and crash recovery are much simpler with immutable segment files.
  • Merging old segments avoids data file fragmentation over time.

Limitations of Hash Table Index

  • The hash table must fit in memory. An on-disk hash map is difficult to make perform well (lots of random I/O, expensive to grow, hash collisions require fiddly logic).
  • Range queries are not efficient — you cannot easily scan over all keys between kitty00000 and kitty99999; you'd have to look up each key individually.

Section 3: SSTables and LSM-Trees

Sorted String Tables (SSTables)

A simple but powerful change: require that the sequence of key-value pairs is sorted by key. Also require that each key only appears once within each merged segment file (compaction ensures this).

Advantages Over Hash Indexes

1. Efficient Merging — Like the mergesort algorithm: read input files side by side, copy the lowest key to the output file, repeat. When the same key appears in multiple segments, keep the value from the most recent segment.

2. Sparse In-Memory Index — You no longer need an index of all keys in memory. Because keys are sorted, you can jump to a known offset and scan from there. A sparse index (one key for every few kilobytes of segment file) is sufficient.

3. Block Compression — Since read requests scan over several key-value pairs in a range, records can be grouped into a block and compressed before writing to disk. Each entry of the sparse index points at the start of a compressed block. Saves disk space and reduces I/O bandwidth.

Constructing and Maintaining SSTables

How to get data sorted by key when incoming writes can occur in any order:

  1. Memtable — When a write comes in, add it to an in-memory balanced tree data structure (e.g., red-black tree or AVL tree). This in-memory tree is called a memtable.
  2. When the memtable exceeds a threshold (typically a few megabytes), write it out to disk as an SSTable file (efficient because the tree already maintains sorted order). This becomes the most recent segment. Writes continue to a new memtable instance.
  3. To serve a read request: first check the memtable, then the most recent on-disk segment, then the next-older segment, etc.
  4. Periodically run a merging and compaction process in the background to combine segment files and discard overwritten/deleted values.
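The write path above can be sketched entirely in memory, with a dict standing in for the balanced-tree memtable and sorted lists standing in for on-disk SSTable files (names and the tiny flush threshold are illustrative):

```python
import bisect

class TinyLSM:
    """In-memory sketch of the memtable -> SSTable write path."""

    def __init__(self, flush_threshold=4):
        self.memtable = {}               # stand-in for a balanced tree
        self.segments = []               # sorted segments, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            # Flush: write out the memtable as a new sorted segment.
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:                 # 1. check the memtable
            return self.memtable[key]
        for seg in reversed(self.segments):      # 2. newest segment first
            keys = [k for k, _ in seg]
            i = bisect.bisect_left(keys, key)
            if i < len(seg) and seg[i][0] == key:
                return seg[i][1]
        return None

db = TinyLSM(flush_threshold=2)
db.put("a", 1)
db.put("b", 2)        # threshold reached: memtable flushed to a segment
db.put("a", 9)        # newer value stays in the memtable
```

A real engine would also append each put to a write-ahead log (next section) so the memtable survives a crash.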

Crash Recovery

  • If the database crashes, the most recent writes in the memtable (not yet written to disk) are lost.
  • Solution: keep a separate write-ahead log on disk to which every write is immediately appended (not sorted — only for crash recovery).
  • Every time the memtable is written out to an SSTable, the corresponding log can be discarded.

Making an LSM-Tree out of SSTables

This algorithm is essentially what is used in:

  • LevelDB and RocksDB — key-value storage engine libraries designed to be embedded into other applications.
  • Cassandra and HBase — both inspired by Google's Bigtable paper (which introduced the terms SSTable and memtable).
  • Lucene (used by Elasticsearch and Solr) — uses a similar method for its term dictionary (key = term, value = list of document IDs containing that term, i.e., the postings list).

Originally described by Patrick O'Neil et al. as the Log-Structured Merge-Tree (LSM-Tree). Storage engines based on this principle of merging and compacting sorted files are called LSM storage engines.

Performance Optimizations

  • Bloom filters — A memory-efficient data structure for approximating the contents of a set. Can tell you if a key does not appear in the database, saving many unnecessary disk reads for nonexistent keys. (LSM-tree lookups for nonexistent keys are slow without this — must check memtable + all segments.)
  • Compaction strategies:
    • Size-tiered compaction — Newer and smaller SSTables are successively merged into older and larger SSTables. Used by HBase; supported by Cassandra.
    • Leveled compaction — Key range is split into smaller SSTables, older data moved into separate "levels," allowing compaction to proceed more incrementally with less disk space. Used by LevelDB and RocksDB.
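A minimal sketch of the Bloom-filter idea (sizes and the hash construction are arbitrary choices for illustration, not what any particular engine uses):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: can say a key is definitely absent (skip all
    segment lookups) or possibly present (go check the segments)."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0                    # big integer used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False positives are possible; false negatives are not.
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("kitty00042")
```

Real engines size the filter from the expected number of keys and a target false-positive rate.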

Key properties of LSM-trees:

  • Data stored in sorted order → efficient range queries.
  • Disk writes are sequential → remarkably high write throughput.
  • Works well even when the dataset is much bigger than available memory.

Section 4: B-Trees

The most widely used indexing structure. Introduced in 1970 and called "ubiquitous" less than 10 years later. The standard index implementation in almost all relational databases, and many nonrelational databases too.

Like SSTables, B-trees keep key-value pairs sorted by key, allowing efficient key-value lookups and range queries. But the design philosophy is very different.

Key Differences from Log-Structured Indexes

  Property        Log-Structured (LSM)                      B-Trees
  Unit            Variable-size segments (several MB+)      Fixed-size blocks/pages (traditionally 4 KB)
  Write pattern   Always write sequentially                 Overwrite a page on disk in place
  Organization    Append-only files                         Tree of pages with references

How B-Trees Work

  • The database is broken down into fixed-size pages (traditionally 4 KB, sometimes bigger), read or written one page at a time. This corresponds closely to the underlying hardware (disks are also arranged in fixed-size blocks).
  • Each page is identified using an address or location (like a pointer, but on disk instead of in memory). Pages refer to other pages, constructing a tree of pages.
  • One page is the root of the B-tree. To look up a key, start at the root.
  • Each page contains several keys and references to child pages. Each child is responsible for a continuous range of keys, and the keys between references indicate the boundaries between those ranges.
  • Eventually you reach a leaf page containing individual keys, which either contains the value inline or references to pages where values can be found.
  • The number of references to child pages in one page is called the branching factor (typically several hundred).

Capacity

A four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB.

A B-tree with n keys always has a depth of O(log n) — most databases fit into a B-tree that is three or four levels deep.
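The 256 TB figure follows directly from the numbers above (decimal terabytes assumed):

```python
branching_factor = 500
page_size_bytes = 4 * 1024               # 4 KB pages
levels = 4

# Each level multiplies the number of reachable pages by the branching
# factor, so four levels reach 500^4 leaf pages of 4 KB each.
leaf_pages = branching_factor ** levels
capacity_bytes = leaf_pages * page_size_bytes
capacity_tb = capacity_bytes / 10**12    # decimal terabytes
```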

Updates and Inserts

  • Update: Search for the leaf page containing the key, change the value, write the page back to disk. All references to that page remain valid.
  • Insert: Find the page whose range encompasses the new key and add it. If there isn't enough free space, the page is split into two half-full pages, and the parent page is updated to account for the new subdivision.

This algorithm ensures the tree remains balanced.

Making B-Trees Reliable

Write-Ahead Log (WAL)

  • Overwriting a page is a dangerous operation. Some operations require several pages to be overwritten (e.g., splitting a page requires writing the two split pages + updating the parent). If the database crashes mid-operation, you get a corrupted index (e.g., an orphan page).
  • Solution: a write-ahead log (WAL), also known as a redo log — an append-only file to which every B-tree modification must be written before it can be applied to the tree pages. Used to restore the B-tree to a consistent state after a crash.

Concurrency Control

  • Updating pages in place requires careful concurrency control if multiple threads access the B-tree simultaneously.
  • Typically done by protecting the tree's data structures with latches (lightweight locks).
  • Log-structured approaches are simpler: merging happens in the background without interfering with incoming queries, and old segments are atomically swapped for new ones.

B-Tree Optimizations

  1. Copy-on-write — Instead of overwriting pages and maintaining a WAL for crash recovery, write modified pages to a different location and create new versions of the parent pages pointing at the new location (used by LMDB). Also useful for concurrency control (snapshot isolation).
  2. Key abbreviation — Save space by not storing the entire key in interior pages, only enough to act as boundaries between key ranges. Allows higher branching factor and fewer levels (sometimes called B+ tree).
  3. Sequential leaf layout — Try to lay out leaf pages in sequential order on disk for efficient range scans. Difficult to maintain as the tree grows. LSM-trees have an advantage here since they rewrite large segments in one go during merging.
  4. Sibling pointers — Each leaf page has references to its left and right sibling pages, allowing scanning keys in order without jumping back to parent pages.
  5. Fractal trees — B-tree variants that borrow log-structured ideas to reduce disk seeks.

Section 5: Comparing B-Trees and LSM-Trees

As a rule of thumb:

  • LSM-trees are typically faster for writes.
  • B-trees are thought to be faster for reads (LSM-trees must check several data structures and SSTables at different stages of compaction).

However, benchmarks are often inconclusive and sensitive to workload details. Test with your particular workload.

Advantages of LSM-Trees

Lower Write Amplification

  • Write amplification = one write to the database resulting in multiple writes to disk over the database's lifetime.
  • B-trees write every piece of data at least twice (WAL + tree page), plus overhead from writing an entire page even if only a few bytes changed. Some engines overwrite the same page twice to avoid partially updated pages.
  • LSM-trees also rewrite data multiple times (compaction and merging), but can sometimes have lower write amplification than B-trees.
  • Write amplification is of particular concern on SSDs, which can only overwrite blocks a limited number of times before wearing out.

Higher Write Throughput

  • LSM-trees sequentially write compact SSTable files rather than overwriting several pages in the tree.
  • Particularly important on magnetic hard drives where sequential writes are much faster than random writes.

Better Compression

  • LSM-trees can be compressed better and produce smaller files on disk.
  • B-trees leave some disk space unused due to fragmentation (page splits, rows not fitting in existing pages).
  • LSM-trees periodically rewrite SSTables to remove fragmentation → lower storage overheads, especially with leveled compaction.
  • On SSDs: lower write amplification and reduced fragmentation allow more read/write requests within available I/O bandwidth.

Downsides of LSM-Trees

Compaction Interference

  • Compaction can interfere with ongoing reads and writes. At higher percentiles, response time can be quite high. B-trees can be more predictable.

Compaction Can't Keep Up

  • At high write throughput, disk bandwidth is shared between initial writes and compaction threads.
  • If compaction cannot keep up with incoming writes: number of unmerged segments grows → disk space fills up, reads slow down (more segment files to check).
  • SSTable-based engines typically don't throttle incoming writes even if compaction can't keep up → need explicit monitoring.

Key Uniqueness

  • In B-trees, each key exists in exactly one place in the index.
  • LSM-trees may have multiple copies of the same key in different segments.
  • B-trees are attractive for strong transactional semantics: transaction isolation can use locks on ranges of keys, directly attached to the tree.

Bottom Line

  • B-trees are very ingrained in database architecture and provide consistently good performance for many workloads.
  • In new datastores, log-structured indexes are becoming increasingly popular.
  • No quick rule for which is better — test empirically.

Section 6: Other Indexing Structures

Secondary Indexes

  • A primary key uniquely identifies one row/document/vertex. A secondary index is not unique — there may be many rows with the same key.
  • Solved by: making each value a list of matching row identifiers (like a postings list), or making each key unique by appending a row identifier.
  • Both B-trees and LSM-trees can be used as secondary indexes.
  • Crucial for performing joins efficiently.
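A sketch of the postings-list approach with hypothetical rows (the table data and index name are made up):

```python
from collections import defaultdict

# Rows keyed by primary key, plus a secondary index on "city": each
# index value maps to a postings list of matching row identifiers.
rows = {
    1: {"name": "Alice", "city": "London"},
    2: {"name": "Bob",   "city": "Paris"},
    3: {"name": "Carol", "city": "London"},
}

city_index = defaultdict(list)
for pk, row in rows.items():
    city_index[row["city"]].append(pk)

# Lookup goes through the index to row identifiers, then to the rows.
londoners = [rows[pk] for pk in city_index["London"]]
```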

Storing Values Within the Index

Heap Files

  • The value in an index can be the actual row, or a reference to the row stored elsewhere.
  • The place where rows are stored is called a heap file (stores data in no particular order).
  • Heap file approach avoids duplicating data when multiple secondary indexes are present — each index just references a location in the heap file.
  • When updating a value without changing the key: efficient if the new value is not larger. If larger, the record may need to be moved → either update all indexes or leave a forwarding pointer.

Clustered Index

  • Stores the indexed row directly within the index (eliminates the extra hop from index to heap file).
  • In MySQL's InnoDB, the primary key is always a clustered index; secondary indexes refer to the primary key (not a heap file location).
  • In SQL Server, you can specify one clustered index per table.

Covering Index (Index with Included Columns)

  • A compromise: stores some of a table's columns within the index.
  • Allows some queries to be answered using the index alone (the index covers the query).
  • Trade-off: speeds up reads, but requires additional storage and adds overhead on writes.

Multi-Column Indexes

Concatenated Index

  • Combines several fields into one key by appending one column to another.
  • Like an old-fashioned phone book: index from (lastname, firstname) to phone number.
  • Can find all people with a particular last name, or a particular lastname-firstname combination. But useless for finding people by first name alone.
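A sketch of the phone-book example, with a sorted list plus binary search standing in for a real B-tree (the data and the "\uffff" sentinel are illustrative):

```python
import bisect

# Concatenated (lastname, firstname) keys kept in sorted order.
entries = sorted([
    (("Smith", "Alice"), "555-0101"),
    (("Jones", "Carol"), "555-0103"),
    (("Smith", "Bob"),   "555-0102"),
])
keys = [k for k, _ in entries]

def by_lastname(lastname):
    """All people with a given last name: the sort order groups them
    together, so two binary searches bound the range. Finding people by
    first name alone would still require a full scan."""
    lo = bisect.bisect_left(keys, (lastname, ""))
    hi = bisect.bisect_right(keys, (lastname, "\uffff"))   # sentinel upper bound
    return entries[lo:hi]
```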

Multi-Dimensional Indexes

  • Important for geospatial data. Example: find all restaurants within a rectangular map area.
    SELECT * FROM restaurants WHERE latitude  > 51.4946 AND latitude  < 51.5079
                                AND longitude > -0.1162 AND longitude < -0.1004;
    
  • A standard B-tree or LSM-tree can give you all restaurants in a range of latitudes OR longitudes, but not both simultaneously.
  • Options:
    • Translate a 2D location into a single number using a space-filling curve, then use a regular B-tree index.
    • Specialized spatial indexes such as R-trees (e.g., PostGIS implements geospatial indexes as R-trees using PostgreSQL's Generalized Search Tree).
  • Multi-dimensional indexes are not just for geography: e.g., searching ecommerce products by a range of colors (red, green, blue), or weather observations by (date, temperature). HyperDex uses multi-dimensional indexes.
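A sketch of the space-filling-curve option using a Morton (Z-order) code; this assumes coordinates have already been quantized to small non-negative integers, which real latitude/longitude values would require first:

```python
def z_order(x, y, bits=16):
    """Interleave the bits of x and y into one integer (Morton code).
    Points close together in 2D tend to get close codes, so a regular
    B-tree over the code can approximate a 2D range query."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # x bits in even positions
        code |= ((y >> i) & 1) << (2 * i + 1)    # y bits in odd positions
    return code
```

A range query then scans a contiguous run of codes and filters out the few false matches that fall outside the rectangle.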

Full-Text Search and Fuzzy Indexes

  • Standard indexes assume exact data. Fuzzy querying (e.g., misspelled words) requires different techniques.
  • Full-text search engines allow: synonym expansion, ignoring grammatical variations, searching for words near each other, and other linguistic analysis.
  • Lucene can search for words within a certain edit distance (an edit distance of 1 means one letter has been added, removed, or replaced).
  • Lucene uses a finite state automaton over characters in keys (similar to a trie) as its in-memory index, which can be transformed into a Levenshtein automaton for efficient fuzzy search.

Keeping Everything in Memory

In-Memory Databases

  • As RAM becomes cheaper, it's feasible to keep entire datasets in memory (potentially distributed across several machines).
  • Some are caching-only (e.g., Memcached — acceptable to lose data on restart).
  • Others aim for durability: battery-powered RAM, writing a log of changes to disk, periodic snapshots, or replicating to other machines.
  • Examples: VoltDB, MemSQL, Oracle TimesTen (relational), RAMCloud (key-value with durability), Redis and Couchbase (weak durability via async disk writes).

Why In-Memory Databases Are Fast

  • Counterintuitively, not because they don't read from disk (the OS caches recently used disk blocks in memory anyway).
  • Rather, they avoid the overheads of encoding in-memory data structures in a form that can be written to disk.
  • They can also provide data models difficult to implement with disk-based indexes (e.g., Redis offers priority queues and sets).

Anti-Caching

  • An in-memory database architecture extended to support datasets larger than available memory.
  • Evicts least recently used data from memory to disk, loads it back when accessed again.
  • Similar to OS virtual memory/swap, but the database manages memory more efficiently (at the granularity of individual records, not entire memory pages).
  • Still requires indexes to fit entirely in memory.

Section 7: Transaction Processing or Analytics?

OLTP vs OLAP

  Property               OLTP (Transaction Processing)                        OLAP (Analytic Processing)
  Main read pattern      Small number of records per query, fetched by key    Aggregate over large number of records
  Main write pattern     Random-access, low-latency writes from user input    Bulk import (ETL) or event stream
  Primarily used by      End user/customer, via web application               Internal analyst, for decision support
  What data represents   Latest state of data (current point in time)         History of events that happened over time
  Dataset size           Gigabytes to terabytes                               Terabytes to petabytes
  Bottleneck             Disk seek time                                       Disk bandwidth
  • A transaction doesn't necessarily have ACID properties — it just means allowing clients to make low-latency reads and writes (as opposed to batch processing).

Data Warehousing

  • OLTP systems are critical to business operations → database administrators are reluctant to let analysts run expensive ad hoc queries on them.
  • A data warehouse is a separate database that analysts can query without affecting OLTP operations.
  • Contains a read-only copy of data from all the various OLTP systems in the company.

ETL (Extract–Transform–Load)

Data is extracted from OLTP databases, transformed into an analysis-friendly schema, cleaned up, and loaded into the data warehouse.

  • The data warehouse can be optimized for analytic access patterns (the indexing algorithms for OLTP are not good at answering analytic queries).
  • On the surface, a data warehouse and a relational OLTP database look similar (both have SQL interface), but the internals are quite different.
  • Data warehouse vendors: Teradata, Vertica, SAP HANA, ParAccel (Amazon Redshift is a hosted version). Open source SQL-on-Hadoop: Apache Hive, Spark SQL, Cloudera Impala, Facebook Presto, Apache Drill (some based on Google's Dremel).

Stars and Snowflakes: Schemas for Analytics

Star Schema (Dimensional Modeling)

At the center is a fact table — each row represents an event that occurred at a particular time (e.g., a customer's purchase). Fact tables can become extremely large (tens of petabytes at companies like Apple, Walmart, eBay).

  • Some columns are attributes (e.g., price, cost).
  • Other columns are foreign key references to dimension tables — representing the who, what, where, when, how, and why of the event.
  • Even date and time are often represented using dimension tables (to encode additional info like public holidays).
  • The name "star schema" comes from the visualization: fact table in the middle, dimension tables around it like rays of a star.

Snowflake Schema

  • A variation where dimensions are further broken down into subdimensions (e.g., separate tables for brands and product categories).
  • More normalized than star schemas, but star schemas are often preferred because they are simpler for analysts.

Table Width

  • Fact tables often have over 100 columns, sometimes several hundred.
  • Dimension tables can also be very wide (all metadata relevant for analysis).

Section 8: Column-Oriented Storage

With trillions of rows and petabytes of data in fact tables, storing and querying them efficiently is a challenging problem.

  • A typical data warehouse query only accesses 4 or 5 columns at a time (SELECT * is rarely needed for analytics).
  • In row-oriented storage (most OLTP databases), all values from one row are stored next to each other. A query must load all rows (each with 100+ attributes) from disk, parse them, and filter — very slow.

The Column-Oriented Idea

Don't store all values from one row together — store all values from each column together instead. If each column is stored in a separate file, a query only needs to read and parse the columns used in that query.

  • The column-oriented layout relies on each column file containing rows in the same order. To reassemble a row, take the kth entry from each column file.
  • Applies equally to nonrelational data (e.g., Parquet is a columnar storage format supporting a document data model, based on Google's Dremel).

Column Compression

Column-oriented storage lends itself well to compression because column values are often repetitive.

Bitmap Encoding

  • Take a column with n distinct values and turn it into n separate bitmaps: one bitmap per distinct value, with one bit per row (1 if the row has that value, 0 if not).
  • If n is small (e.g., ~200 countries), bitmaps can be stored with one bit per row.
  • If n is bigger, most bitmaps will be sparse (mostly zeros); applying run-length encoding on top makes them remarkably compact.

Bitmap Operations for Queries

  • WHERE product_sk IN (30, 68, 69) → Load three bitmaps, calculate bitwise OR.
  • WHERE product_sk = 31 AND store_sk = 3 → Load two bitmaps, calculate bitwise AND.
  • Works because columns contain rows in the same order — the kth bit in one column's bitmap corresponds to the same row as the kth bit in another column's bitmap.
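A sketch of bitmap encoding and the two queries above, using Python integers as bit arrays (the example columns are made up):

```python
def bitmap(column, value):
    """One bitmap per distinct value: bit k is set iff row k has that value."""
    bits = 0
    for row, v in enumerate(column):
        if v == value:
            bits |= 1 << row
    return bits

def matching_rows(bits):
    """Decode a bitmap back into a list of row numbers."""
    return [k for k in range(bits.bit_length()) if (bits >> k) & 1]

# Hypothetical fact-table columns; both list rows in the same order.
product_sk = [69, 69, 69, 74, 31, 31, 31, 29, 30, 30, 31, 31, 31, 68, 69, 69]
store_sk   = [4, 5, 5, 3, 2, 3, 3, 3, 3, 3, 2, 1, 2, 2, 3, 4]

# WHERE product_sk IN (30, 68, 69): bitwise OR of three bitmaps.
rows_in = bitmap(product_sk, 30) | bitmap(product_sk, 68) | bitmap(product_sk, 69)

# WHERE product_sk = 31 AND store_sk = 3: bitwise AND of two bitmaps.
rows_and = bitmap(product_sk, 31) & bitmap(store_sk, 3)
```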

Column Families ≠ Column-Oriented

  • Cassandra and HBase have column families (inherited from Bigtable), but they store all columns from a row together within each column family and don't use column compression. The Bigtable model is still mostly row-oriented.

Memory Bandwidth and Vectorized Processing

  • For queries scanning millions of rows, bottlenecks include disk-to-memory bandwidth and main memory to CPU cache bandwidth.
  • Column-oriented storage is good for efficient CPU usage: the query engine can take a chunk of compressed column data that fits in the CPU's L1 cache and iterate through it in a tight loop (no function calls).
  • Operators like bitwise AND/OR can operate on compressed column data directly.
  • This technique is called vectorized processing.

Sort Order in Column Storage

  • Data can be sorted (entire row at a time, even though stored by column) to serve as an indexing mechanism.
  • The administrator chooses sort columns based on common queries (e.g., date_key as first sort key, product_sk as second).
  • Sorted order helps with compression: the first sort column will have long sequences of repeated values → simple run-length encoding can compress it to a few kilobytes even with billions of rows.
  • Compression effect is strongest on the first sort key; subsequent columns are more jumbled.

Several Different Sort Orders

  • Introduced in C-Store and adopted by Vertica: store the same data sorted in several different ways (since data is replicated to multiple machines anyway).
  • When processing a query, use the version that best fits the query pattern.
  • Different from secondary indexes in row-oriented stores: in a column store, there are no pointers to data elsewhere, only columns containing values.

Writing to Column-Oriented Storage

  • Column-oriented storage, compression, and sorting make reads faster but writes more difficult.
  • Update-in-place (like B-trees) is not possible with compressed columns.
  • Solution: LSM-trees. All writes first go to an in-memory store (sorted structure), then merged with column files on disk in bulk when enough writes accumulate (essentially what Vertica does).
  • Queries examine both column data on disk and recent writes in memory, combining the two. The query optimizer hides this distinction from the user.

Aggregation: Data Cubes and Materialized Views

Materialized Views

  • A materialized view is an actual copy of query results written to disk (vs. a virtual view which is just a query shortcut).
  • When underlying data changes, the materialized view needs to be updated → makes writes more expensive.
  • Not often used in OLTP databases, but can make sense in read-heavy data warehouses.

Data Cubes (OLAP Cubes)

A common special case of a materialized view: a grid of aggregates grouped by different dimensions.

  • Example: dates along one axis, products along the other. Each cell contains the aggregate (e.g., SUM of net_price) for that date-product combination.
  • Can apply the same aggregate along each row or column to get summaries reduced by one dimension.
  • With 5 dimensions (date, product, store, promotion, customer), it's a five-dimensional hypercube.
  • Advantage: Certain queries become very fast (effectively precomputed).
  • Disadvantage: Doesn't have the same flexibility as querying raw data (e.g., can't calculate proportion of sales for items costing more than $100 if price isn't a dimension).
  • Most data warehouses keep as much raw data as possible and use data cubes only as a performance boost for certain queries.
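A sketch of a two-dimensional data cube built from hypothetical fact rows (dates, product keys, and prices are made up):

```python
from collections import defaultdict

# Fact rows: (date_key, product_sk, net_price).
facts = [
    ("2024-01-01", 31, 2.49),
    ("2024-01-01", 68, 14.99),
    ("2024-01-02", 31, 2.49),
    ("2024-01-02", 31, 2.49),
]

# Two-dimensional cube: SUM(net_price) per (date, product) cell.
cube = defaultdict(float)
for date, product, price in facts:
    cube[(date, product)] += price

# Collapsing one dimension gives the one-dimensional summaries.
sales_by_date = defaultdict(float)
for (date, _product), total in cube.items():
    sales_by_date[date] += total
```

Queries along either precomputed dimension are then instant, but anything not captured as a dimension (e.g., price thresholds) still needs the raw data.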

Summary

Key Takeaways

Storage engines fall into two broad categories:

1. OLTP (Transaction Processing)

  • User-facing, huge volume of requests, each touching a small number of records.
  • Application requests records using some key; the storage engine uses an index to find the data.
  • Disk seek time is often the bottleneck.
  • Two main schools of thought:
    • Log-structured school — Only appends to files, never updates in place. Bitcask, SSTables, LSM-trees, LevelDB, Cassandra, HBase, Lucene.
    • Update-in-place school — Treats disk as fixed-size pages that can be overwritten. B-trees (used in all major relational databases).
  • Log-structured engines systematically turn random-access writes into sequential writes → higher write throughput.

2. OLAP (Analytics)

  • Used by business analysts, much lower volume of queries, but each query scans millions of records.
  • Disk bandwidth (not seek time) is the bottleneck.
  • Column-oriented storage is an increasingly popular solution: encode data compactly, minimize data read from disk.
  • Techniques: column compression, bitmap encoding, run-length encoding, vectorized processing, sort order optimization, materialized views, data cubes.

Important References

  1. Justin Sheehy and David Smith: "Bitcask: A Log-Structured Hash Table for Fast Key/Value Data," Basho Technologies, April 2010.
  2. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, et al.: "Bigtable: A Distributed Storage System for Structured Data," at 7th USENIX OSDI, November 2006.
  3. Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil: "The Log-Structured Merge-Tree (LSM-Tree)," Acta Informatica, volume 33, number 4, pages 351–385, June 1996.
  4. Rudolf Bayer and Edward M. McCreight: "Organization and Maintenance of Large Ordered Indices," Boeing Scientific Research Laboratories, report no. 20, July 1970.
  5. Douglas Comer: "The Ubiquitous B-Tree," ACM Computing Surveys, volume 11, number 2, pages 121–137, June 1979.
  6. Goetz Graefe: "Modern B-Tree Techniques," Foundations and Trends in Databases, volume 3, number 4, pages 203–402, August 2011.
  7. Burton H. Bloom: "Space/Time Trade-offs in Hash Coding with Allowable Errors," Communications of the ACM, volume 13, number 7, pages 422–426, July 1970.
  8. Manos Athanassoulis, Michael S. Kester, et al.: "Designing Access Methods: The RUM Conjecture," at 19th EDBT, March 2016.
  9. Sergey Melnik, Andrey Gubarev, Jing Jing Long, et al.: "Dremel: Interactive Analysis of Web-Scale Datasets," at 36th VLDB, pages 330–339, September 2010.
  10. Ralph Kimball and Margy Ross: The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd edition, John Wiley & Sons, July 2013.
  11. Daniel J. Abadi, Peter Boncz, Stavros Harizopoulos, et al.: "The Design and Implementation of Modern Column-Oriented Database Systems," Foundations and Trends in Databases, volume 5, number 3, pages 197–280, December 2013.
  12. Michael Stonebraker, Daniel J. Abadi, Adam Batkin, et al.: "C-Store: A Column-oriented DBMS," at 31st VLDB, pages 553–564, September 2005.
  13. Andrew Lamb, Matt Fuller, Ramakrishna Varadarajan, et al.: "The Vertica Analytic Database: C-Store 7 Years Later," Proceedings of the VLDB Endowment, volume 5, number 12, pages 1790–1801, August 2012.
  14. Jim Gray, Surajit Chaudhuri, Adam Bosworth, et al.: "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals," Data Mining and Knowledge Discovery, volume 1, number 1, pages 29–53, March 1997.
  15. James C. Corbett, Jeffrey Dean, Michael Epstein, et al.: "Spanner: Google's Globally-Distributed Database," at 10th USENIX OSDI, October 2012.