Pagefy

Designing Data-Intensive Applications by Martin Kleppmann

Chapter 2: Data Models and Query Languages

Introduction

Data models are perhaps the most important part of developing software, because they have such a profound effect — not only on how the software is written, but also on how we think about the problem that we are solving.

Most applications are built by layering one data model on top of another. For each layer, the key question is: how is it represented in terms of the next-lower layer?

  1. Application developer — Models the real world (people, organizations, goods, actions) in terms of objects or data structures and APIs. These structures are often specific to the application.
  2. General-purpose data model — When storing those data structures, express them as JSON/XML documents, tables in a relational database, or a graph model.
  3. Database engineers — Decide how to represent that JSON/relational/graph data in terms of bytes in memory, on disk, or on a network.
  4. Hardware engineers — Represent bytes in terms of electrical currents, pulses of light, magnetic fields, etc.

Each layer hides the complexity of the layers below it by providing a clean data model. These abstractions allow different groups of people to work together effectively.

Every data model embodies assumptions about how it is going to be used. Some kinds of usage are easy and some are not supported; some operations are fast and some perform badly.


Section 1: Relational Model Versus Document Model

The best-known data model today is probably that of SQL, based on the relational model proposed by Edgar Codd in 1970: data is organized into relations (called tables in SQL), where each relation is an unordered collection of tuples (rows in SQL).

  • The relational model was a theoretical proposal, and many doubted whether it could be implemented efficiently. However, by the mid-1980s, RDBMSes and SQL became the tools of choice.
  • The dominance of relational databases has lasted around 25–30 years — an eternity in computing history.
  • Roots lie in business data processing on mainframes in the 1960s–70s: transaction processing (sales, banking, airline reservations) and batch processing (invoicing, payroll, reporting).
  • The goal of the relational model was to hide implementation detail behind a cleaner interface.
  • Over the years, competitors emerged: the network model and hierarchical model (1970s–80s), object databases (late 1980s–early 1990s), XML databases (early 2000s) — each generated hype but never lasted.
  • Relational databases generalized remarkably well beyond their original scope to a broad variety of use cases (web publishing, social networking, ecommerce, games, SaaS, etc.).

1.1 The Birth of NoSQL

NoSQL is the latest attempt to overthrow the relational model's dominance. The name was originally just a catchy Twitter hashtag for a meetup on open source, distributed, nonrelational databases in 2009. It has been retroactively reinterpreted as "Not Only SQL".

Driving Forces Behind NoSQL Adoption

  1. A need for greater scalability than relational databases can easily achieve — very large datasets or very high write throughput.
  2. A widespread preference for free and open source software over commercial database products.
  3. Specialized query operations not well supported by the relational model.
  4. Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model.

Different applications have different requirements. It seems likely that relational databases will continue to be used alongside nonrelational datastores — an idea called polyglot persistence.


1.2 The Object-Relational Mismatch

Most application development today is done in object-oriented programming languages, which leads to a common criticism: if data is stored in relational tables, an awkward translation layer is required between the objects in application code and the database model of tables, rows, and columns. This disconnect is called an impedance mismatch.

  • ORM frameworks like ActiveRecord and Hibernate reduce boilerplate code but can't completely hide the differences between the two models.

LinkedIn Profile Example

Consider how a résumé (LinkedIn profile) could be expressed in a relational schema:

  • Fields like first_name and last_name appear exactly once per user → columns on the users table.
  • But most people have had more than one job, varying education, and multiple contact info items → one-to-many relationship.

Ways to Represent One-to-Many Relationships

  1. Traditional SQL (normalized) — Put positions, education, and contact info in separate tables with a foreign key reference to the users table.
  2. Structured datatypes / XML / JSON columns — Later SQL standards added support for multi-valued data within a single row (supported by Oracle, IBM DB2, MS SQL Server, PostgreSQL, MySQL).
  3. Encode as JSON/XML in a text column — Let the application interpret the structure. Cannot use the database to query inside that column.
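Option 3 can be illustrated in plain JavaScript (a sketch with made-up data; the `row` object stands in for a database row): the application serializes and parses the JSON itself, and the database sees only opaque text.

```javascript
// Sketch of option 3: the application encodes the structure as text and
// must decode it itself; the database cannot query inside the column.
// (Hypothetical in-memory "row" standing in for a database row.)
const positions = [
  { job_title: "Co-chair", organization: "Bill & Melinda Gates Foundation" }
];

// What gets written: an opaque text column.
const row = { user_id: 251, positions_json: JSON.stringify(positions) };

// What the application must do on every read.
const decoded = JSON.parse(row.positions_json);
```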

JSON Representation

For a self-contained document like a résumé, JSON is quite appropriate. Document-oriented databases like MongoDB, RethinkDB, CouchDB, and Espresso support this model.

{
  "user_id": 251,
  "first_name": "Bill",
  "last_name": "Gates",
  "summary": "Co-chair of the Bill & Melinda Gates... Active blogger.",
  "region_id": "us:91",
  "industry_id": 131,
  "positions": [
    {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
    {"job_title": "Co-founder, Chairman", "organization": "Microsoft"}
  ],
  "education": [
    {"school_name": "Harvard University", "start": 1973, "end": 1975},
    {"school_name": "Lakeside School, Seattle", "start": null, "end": null}
  ],
  "contact_info": {
    "blog": "http://thegatesnotes.com",
    "twitter": "http://twitter.com/BillGates"
  }
}
  • JSON reduces the impedance mismatch between application code and storage layer.
  • JSON has better locality than the multi-table schema — all relevant information is in one place, one query is sufficient (vs. multiple queries or messy multi-way joins).
  • The one-to-many relationships imply a tree structure in the data, and JSON makes this explicit.

1.3 Many-to-One and Many-to-Many Relationships

In the JSON example, region_id and industry_id are given as IDs, not plain-text strings. Why?

Advantages of Using IDs (Standardized Lists)

  • Consistent style and spelling across profiles.
  • Avoiding ambiguity (e.g., several cities with the same name).
  • Ease of updating — the name is stored in only one place.
  • Localization support — standardized lists can be translated.
  • Better search — e.g., a search for philanthropists in Washington can match Seattle profiles.

Normalization

  • When you use an ID, the human-meaningful information is stored in only one place, and everything refers to it by ID.
  • When you store text directly, you are duplicating the human-meaningful information in every record.
  • The advantage of an ID: it has no meaning to humans, so it never needs to change. Anything meaningful to humans may need to change, and if duplicated, all copies need updating → risks inconsistencies.
  • Removing such duplication is the key idea behind normalization in databases.
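The update-once property of IDs can be seen in a few lines of JavaScript (hypothetical in-memory data, following the industry_id example above):

```javascript
// Sketch: the human-meaningful name lives in one place; every record
// refers to it by ID. (Hypothetical in-memory lookup table.)
const industries = { 131: "Philanthropy" };
const profiles = [
  { user_id: 251, industry_id: 131 },
  { user_id: 252, industry_id: 131 }
];

// Renaming the industry is a single update; no profile can become stale.
industries[131] = "Philanthropy & Nonprofits";
const industryNames = profiles.map(p => industries[p.industry_id]);
```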

The Problem with Document Databases

  • Normalizing data requires many-to-one relationships (many people live in one region, many people work in one industry), which don't fit nicely into the document model.
  • In relational databases, referring to rows by ID and joining is easy. In document databases, joins are often weak or unsupported.
  • If the database doesn't support joins, you have to emulate joins in application code by making multiple queries.
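Emulating a join in application code looks roughly like this (a sketch over hypothetical in-memory collections standing in for two separate database queries):

```javascript
// Sketch: two "queries" instead of one join, because the (hypothetical)
// document store cannot join for us.
const users = [
  { user_id: 251, name: "Bill Gates", region_id: "us:91" }
];
const regions = [
  { region_id: "us:91", region_name: "Greater Seattle Area" }
];

// Query 1: fetch the user. Query 2: resolve the region ID ourselves.
function getUserWithRegion(userId) {
  const user = users.find(u => u.user_id === userId);
  const region = regions.find(r => r.region_id === user.region_id);
  return { ...user, region_name: region.region_name };
}
```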

Data Becomes More Interconnected Over Time

Even if the initial version fits a join-free document model, data tends to become more interconnected as features are added:

  • Organizations and schools as entities — Instead of just strings, they become references to entities with their own web pages, logos, news feeds.
  • Recommendations — One user writes a recommendation for another. The recommendation should reference the author's profile (so if they update their photo, it reflects everywhere).

1.4 Are Document Databases Repeating History?

The debate about many-to-many relationships is much older than NoSQL — it goes back to the earliest computerized database systems.

The Hierarchical Model (IMS)

  • The most popular database for business data processing in the 1970s was IBM's Information Management System (IMS), originally developed for stock-keeping in the Apollo space program (first commercially released in 1968, still in use today on IBM mainframes).
  • IMS used the hierarchical model: all data as a tree of records nested within records — remarkably similar to JSON used by document databases.
  • Worked well for one-to-many relationships, but made many-to-many relationships difficult and didn't support joins.
  • Developers had to decide whether to duplicate (denormalize) data or manually resolve references — the same problems developers face with document databases today.

The Network Model (CODASYL)

  • Standardized by the Conference on Data Systems Languages (CODASYL).
  • A generalization of the hierarchical model: a record could have multiple parents (allowing many-to-one and many-to-many relationships).
  • Links between records were like pointers in a programming language (stored on disk). The only way to access a record was to follow a path from a root record along chains of links — called an access path.
  • A query was performed by moving a cursor through the database, iterating over lists and following access paths.
  • Even CODASYL committee members admitted this was like "navigating around an n-dimensional data space".
  • The code for querying and updating was complicated and inflexible. If you didn't have a path to the data you wanted, you were stuck. Changing access paths required rewriting handwritten query code.

The Relational Model

  • Laid out all data in the open: a relation (table) is simply a collection of tuples (rows), and that's it.
  • No labyrinthine nested structures, no complicated access paths.
  • You can read any or all rows, selecting those that match an arbitrary condition.
  • The query optimizer automatically decides which parts of the query to execute in which order, and which indexes to use — effectively the "access path," but made automatically, not by the developer.
  • If you want to query data in new ways, just declare a new index — queries automatically use the most appropriate indexes.
  • Key insight: build a query optimizer once, and all applications benefit. A general-purpose optimizer wins in the long run over handcoded access paths.

Comparison to Document Databases

  • Document databases reverted to the hierarchical model in one aspect: storing nested records within their parent record rather than in a separate table.
  • But for many-to-one and many-to-many relationships, relational and document databases are not fundamentally different: both use a unique identifier (foreign key vs. document reference), resolved at read time by a join or follow-up queries.
  • Document databases have not followed the path of CODASYL.

1.5 Relational Versus Document Databases Today

The main arguments:

  • Document model strengths — schema flexibility, better performance due to locality, and data structures closer to those used by the application.
  • Relational model strengths — better support for joins, and for many-to-one and many-to-many relationships.

Which Data Model Leads to Simpler Application Code?

  • If data has a document-like structure (tree of one-to-many relationships, entire tree loaded at once) → use a document model.
  • The relational technique of shredding (splitting a document-like structure into multiple tables) can lead to cumbersome schemas and complicated application code.
  • Document model limitation: you cannot refer directly to a nested item within a document — you need to say something like "the second item in the list of positions for user 251" (like an access path in the hierarchical model).
  • If your application uses many-to-many relationships, the document model becomes less appealing. Joins can be emulated in application code, but that moves complexity into the application and is usually slower than a join inside the database.
  • For highly interconnected data: document model is awkward, relational model is acceptable, graph models are the most natural.

Schema Flexibility in the Document Model

  • Document databases are sometimes called schemaless, but that's misleading — the code that reads the data usually assumes some structure (an implicit schema).
  • More accurate terms:
    • Schema-on-read — The structure is implicit, only interpreted when data is read (like dynamic type checking).
    • Schema-on-write — The schema is explicit, the database ensures all written data conforms to it (like static type checking).
Example: Changing Data Format
  • Document database (schema-on-read):
    if (user && user.name && !user.first_name) {
        // Documents written before Dec 8, 2013 don't have first_name
        user.first_name = user.name.split(" ")[0];
    }
    
  • Relational database (schema-on-write):
    ALTER TABLE users ADD COLUMN first_name text;
    UPDATE users SET first_name = split_part(name, ' ', 1);      -- PostgreSQL
    UPDATE users SET first_name = substring_index(name, ' ', 1);  -- MySQL
    
  • Schema changes have a bad reputation of being slow, but most relational databases execute ALTER TABLE in a few milliseconds. MySQL is a notable exception — it copies the entire table, which can mean minutes or hours of downtime (tools like pt-online-schema-change and gh-ost exist to work around this).
When Schema-on-Read is Advantageous
  • Items in the collection don't all have the same structure (heterogeneous data):
    • Many different types of objects, not practical to put each in its own table.
    • Structure determined by external systems you don't control, which may change at any time.

Data Locality for Queries

  • A document is usually stored as a single continuous string (JSON, XML, or binary variant like MongoDB's BSON).
  • If your application often needs the entire document, there is a performance advantage to this storage locality (vs. multiple index lookups across multiple tables).
  • The locality advantage only applies if you need large parts of the document at the same time. The database typically loads the entire document even if you access only a small portion.
  • On updates, the entire document usually needs to be rewritten — only modifications that don't change the encoded size can be performed in place.
  • Recommendation: keep documents fairly small and avoid writes that increase document size.
  • Locality is not limited to the document model:
    • Google's Spanner — allows schema to declare that a table's rows should be interleaved (nested) within a parent table.
    • Oracle — multi-table index cluster tables.
    • Bigtable model (Cassandra, HBase) — column-family concept for managing locality.

Convergence of Document and Relational Databases

  • Most relational databases have supported XML since the mid-2000s, and JSON more recently (PostgreSQL 9.3+, MySQL 5.7+, IBM DB2 10.5+).
  • On the document side, RethinkDB supports relational-like joins; some MongoDB drivers automatically resolve database references (client-side join).
  • Relational and document databases are becoming more similar over time — the data models complement each other.
  • A hybrid of the relational and document models is a good route for databases to take in the future.

Section 2: Query Languages for Data

When the relational model was introduced, it included a new way of querying data: SQL is a declarative query language, whereas IMS and CODASYL queried the database using imperative code.

2.1 Declarative vs Imperative

Imperative Example (JavaScript)

function getSharks() {
    var sharks = [];
    for (var i = 0; i < animals.length; i++) {
        if (animals[i].family === "Sharks") {
            sharks.push(animals[i]);
        }
    }
    return sharks;
}

Relational Algebra

sharks = σ_{family = "Sharks"}(animals)

Declarative (SQL)

SELECT * FROM animals WHERE family = 'Sharks';

Why Declarative is Better

  1. More concise and easier to work with than imperative APIs.
  2. Hides implementation details of the database engine — the database can introduce performance improvements without requiring changes to queries.
  3. Order independence — SQL doesn't guarantee any particular ordering, so the database is free to rearrange data (e.g., reclaim disk space) without breaking queries. Imperative code may rely on ordering.
  4. Better for parallel execution — Declarative languages specify only the pattern of results, not the algorithm. The database is free to use a parallel implementation. Imperative code specifies instructions in a particular order, making it very hard to parallelize.

2.2 Declarative Queries on the Web

The advantages of declarative languages are not limited to databases. Consider CSS vs imperative JavaScript for styling:

Declarative (CSS)

li.selected > p {
    background-color: blue;
}

Imperative (JavaScript DOM)

var liElements = document.getElementsByTagName("li");
for (var i = 0; i < liElements.length; i++) {
    if (liElements[i].className === "selected") {
        var children = liElements[i].childNodes;
        for (var j = 0; j < children.length; j++) {
            var child = children[j];
            if (child.nodeType === Node.ELEMENT_NODE && child.tagName === "P") {
                child.setAttribute("style", "background-color: blue");
            }
        }
    }
}

Problems with the imperative approach:

  • If the selected class is removed, the blue color won't be removed (even if code is rerun) — CSS handles this automatically.
  • If you want to use a new API for better performance, you have to rewrite the code. Browser vendors can improve CSS/XPath performance without breaking compatibility.

Similarly, in databases, declarative query languages like SQL turned out to be much better than imperative query APIs (like those used by IMS and CODASYL).


2.3 MapReduce Querying

MapReduce is a programming model for processing large amounts of data in bulk across many machines, popularized by Google. A limited form is supported by some NoSQL datastores (MongoDB, CouchDB).

  • MapReduce is neither fully declarative nor fully imperative — somewhere in between.
  • Based on the map (also known as collect) and reduce (also known as fold or inject) functions from functional programming.

Example: Counting Shark Sightings Per Month

PostgreSQL (declarative):

SELECT date_trunc('month', observation_timestamp) AS observation_month,
       sum(num_animals) AS total_animals
FROM observations
WHERE family = 'Sharks'
GROUP BY observation_month;

MongoDB MapReduce:

db.observations.mapReduce(
    function map() {
        var year  = this.observationTimestamp.getFullYear();
        var month = this.observationTimestamp.getMonth() + 1;
        emit(year + "-" + month, this.numAnimals);
    },
    function reduce(key, values) {
        return Array.sum(values);
    },
    {
        query: { family: "Sharks" },
        out: "monthlySharkReport"
    }
);
  • The map and reduce functions must be pure functions (no side effects, no additional database queries) — this allows the database to run them anywhere, in any order, and rerun on failure.
  • MapReduce is a fairly low-level programming model for distributed execution. Higher-level query languages like SQL can be implemented as a pipeline of MapReduce operations.
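The grouping that mapReduce performs can be emulated with ordinary Array methods — a sketch over hypothetical in-memory observations, not MongoDB's API:

```javascript
// Sketch: filter (the query), map (emit key-value pairs), then reduce
// (sum per key), over hypothetical in-memory observations.
const observations = [
  { family: "Sharks", observationTimestamp: new Date("1995-12-25"), numAnimals: 3 },
  { family: "Sharks", observationTimestamp: new Date("1995-12-12"), numAnimals: 4 },
  { family: "Lions",  observationTimestamp: new Date("1995-12-01"), numAnimals: 1 }
];

const monthlySharkReport = observations
  .filter(o => o.family === "Sharks")                 // the query filter
  .map(o => {                                         // the map function
    const d = o.observationTimestamp;
    return { key: d.getFullYear() + "-" + (d.getMonth() + 1), value: o.numAnimals };
  })
  .reduce((report, { key, value }) => {               // the reduce step
    report[key] = (report[key] || 0) + value;         // sum sightings per month
    return report;
  }, {});
```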

Usability Problem

  • Writing two carefully coordinated JavaScript functions is often harder than writing a single query.
  • A declarative query language offers more opportunities for a query optimizer to improve performance.

MongoDB Aggregation Pipeline

For these reasons, MongoDB 2.2 added the aggregation pipeline — a declarative query language:

db.observations.aggregate([
    { $match: { family: "Sharks" } },
    { $group: {
        _id: {
            year:  { $year:  "$observationTimestamp" },
            month: { $month: "$observationTimestamp" }
        },
        totalAnimals: { $sum: "$numAnimals" }
    } }
]);
  • Similar in expressiveness to a subset of SQL, but uses JSON-based syntax.
  • The moral: a NoSQL system may find itself accidentally reinventing SQL, albeit in disguise.

Section 3: Graph-Like Data Models

If your application has mostly one-to-many relationships (tree-structured data) or no relationships between records, the document model is appropriate. But if many-to-many relationships are very common, it becomes more natural to model data as a graph.

A graph consists of two kinds of objects:

  • Vertices (also known as nodes or entities)
  • Edges (also known as relationships or arcs)

Typical Examples

  • Social graphs — Vertices are people, edges indicate who knows whom.
  • The web graph — Vertices are web pages, edges are HTML links.
  • Road or rail networks — Vertices are junctions, edges are roads or railway lines.

Well-known algorithms operate on graphs: car navigation systems search for the shortest path, PageRank determines web page popularity for search rankings.

Graphs are not limited to homogeneous data. An equally powerful use is storing completely different types of objects in a single datastore. For example, Facebook maintains a single graph with vertices representing people, locations, events, checkins, and comments; edges represent friendships, checkin locations, comments on posts, event attendance, etc.

The example in Figure 2-5 shows two people, Lucy from Idaho and Alain from Beaune, France. They are married and living in London. It illustrates things difficult to express in a traditional relational schema: different kinds of regional structures in different countries (France has départements and régions, the US has counties and states), varying granularity of data (Lucy's residence is a city, her birthplace is a state).

Graphs are good for evolvability: as you add features, a graph can easily be extended to accommodate changes in your application's data structures.


3.1 Property Graphs

In the property graph model (implemented by Neo4j, Titan, InfiniteGraph), each vertex consists of:

  • A unique identifier
  • A set of outgoing edges
  • A set of incoming edges
  • A collection of properties (key-value pairs)

Each edge consists of:

  • A unique identifier
  • The vertex at which the edge starts (the tail vertex)
  • The vertex at which the edge ends (the head vertex)
  • A label to describe the kind of relationship
  • A collection of properties (key-value pairs)

Relational Representation of a Property Graph

CREATE TABLE vertices (
    vertex_id   integer PRIMARY KEY,
    properties  json
);

CREATE TABLE edges (
    edge_id     integer PRIMARY KEY,
    tail_vertex integer REFERENCES vertices (vertex_id),
    head_vertex integer REFERENCES vertices (vertex_id),
    label       text,
    properties  json
);

CREATE INDEX edges_tails ON edges (tail_vertex);
CREATE INDEX edges_heads ON edges (head_vertex);

Important Aspects

  1. Any vertex can have an edge to any other vertex — no schema restricts which kinds of things can be associated.
  2. Given any vertex, you can efficiently find both its incoming and outgoing edges, and thus traverse the graph forward and backward (hence indexes on both tail_vertex and head_vertex).
  3. By using different labels for different kinds of relationships, you can store several different kinds of information in a single graph while maintaining a clean data model.

3.2 The Cypher Query Language

Cypher is a declarative query language for property graphs, created for the Neo4j graph database. (Named after a character in The Matrix, not related to cryptography.)

Inserting Data

CREATE
  (NAmerica:Location {name:'North America', type:'continent'}),
  (USA:Location      {name:'United States', type:'country'}),
  (Idaho:Location    {name:'Idaho',         type:'state'}),
  (Lucy:Person       {name:'Lucy'}),
  (Idaho) -[:WITHIN]->  (USA)  -[:WITHIN]-> (NAmerica),
  (Lucy)  -[:BORN_IN]-> (Idaho)

Arrow notation: (Idaho) -[:WITHIN]-> (USA) creates an edge labeled WITHIN, with Idaho as the tail node and USA as the head node.

Querying: Find People Who Emigrated from the US to Europe

MATCH
  (person) -[:BORN_IN]->  () -[:WITHIN*0..]-> (us:Location {name:'United States'}),
  (person) -[:LIVES_IN]-> () -[:WITHIN*0..]-> (eu:Location {name:'Europe'})
RETURN person.name
  • :WITHIN*0.. means "follow a WITHIN edge, zero or more times" (like * in a regular expression).
  • The query optimizer automatically chooses the most efficient execution strategy — you could start by scanning all people, or start from the Location vertices and work backward.
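The variable-length traversal can be sketched in plain JavaScript (hypothetical edge list; this sketch assumes each location has at most one WITHIN parent, so containment forms a simple chain):

```javascript
// Sketch of ":WITHIN*0..": follow "within" edges zero or more times
// from a starting vertex. (Hypothetical in-memory edge list.)
const withinEdges = [
  { tail: "Idaho", head: "USA" },
  { tail: "USA",   head: "NorthAmerica" }
];

// Returns true if `start` is contained in `target`, in zero or more hops.
function isWithin(start, target) {
  let current = start;
  const seen = new Set();          // guard against cycles in the edge list
  while (current !== undefined && !seen.has(current)) {
    if (current === target) return true;  // zero hops: the start itself counts
    seen.add(current);
    const edge = withinEdges.find(e => e.tail === current);
    current = edge && edge.head;
  }
  return false;
}
```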

3.3 Graph Queries in SQL

Graph data can be represented in a relational database, but querying it with SQL is difficult because the number of joins is not fixed in advance (you may need to traverse a variable number of edges).

Since SQL:1999, this can be expressed using recursive common table expressions (WITH RECURSIVE). The same emigration query in SQL:

WITH RECURSIVE
  in_usa(vertex_id) AS (
      SELECT vertex_id FROM vertices WHERE properties->>'name' = 'United States'
    UNION
      SELECT edges.tail_vertex FROM edges
        JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
        WHERE edges.label = 'within'
  ),
  in_europe(vertex_id) AS (
      SELECT vertex_id FROM vertices WHERE properties->>'name' = 'Europe'
    UNION
      SELECT edges.tail_vertex FROM edges
        JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
        WHERE edges.label = 'within'
  ),
  born_in_usa(vertex_id) AS (
    SELECT edges.tail_vertex FROM edges
      JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
      WHERE edges.label = 'born_in'
  ),
  lives_in_europe(vertex_id) AS (
    SELECT edges.tail_vertex FROM edges
      JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
      WHERE edges.label = 'lives_in'
  )
SELECT vertices.properties->>'name'
FROM vertices
JOIN born_in_usa     ON vertices.vertex_id = born_in_usa.vertex_id
JOIN lives_in_europe ON vertices.vertex_id = lives_in_europe.vertex_id;

If the same query can be written in 4 lines in Cypher but requires 29 lines in SQL, that shows different data models are designed to satisfy different use cases. Pick a data model suitable for your application.


3.4 Triple-Stores and SPARQL

The triple-store model is mostly equivalent to the property graph model, using different words for the same ideas.

In a triple-store, all information is stored as three-part statements: (subject, predicate, object).

  • Example: (Jim, likes, bananas) — Jim is the subject, likes is the predicate, bananas is the object.

The subject is equivalent to a vertex. The object is one of two things:

  1. A primitive value (string, number) — the predicate and object are equivalent to a property's key and value on the subject vertex. E.g., (lucy, age, 33) → vertex lucy with {"age": 33}.
  2. Another vertex — the predicate is an edge, the subject is the tail vertex, the object is the head vertex. E.g., (lucy, marriedTo, alain).
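The two cases can be sketched as a small classifier (a hypothetical helper, not any real triple-store API; vertex references are marked with an `id` field to distinguish them from primitive values):

```javascript
// Sketch: turning a triple into either a vertex property (case 1) or an
// edge (case 2). Vertex references are objects like { id: "alain" }.
function classifyTriple(subject, predicate, object) {
  if (typeof object === "string" || typeof object === "number") {
    // Case 1: primitive object → a property on the subject vertex
    return { kind: "property", vertex: subject, key: predicate, value: object };
  }
  // Case 2: object is another vertex → an edge from subject to object
  return { kind: "edge", tail: subject, label: predicate, head: object.id };
}
```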

Turtle Format (Example)

@prefix : <urn:example:>.
_:lucy     a :Person;   :name "Lucy";          :bornIn _:idaho.
_:idaho    a :Location; :name "Idaho";         :type "state";   :within _:usa.
_:usa      a :Location; :name "United States"; :type "country"; :within _:namerica.
_:namerica a :Location; :name "North America"; :type "continent".

The Semantic Web and RDF

  • The semantic web idea: websites publish machine-readable data (not just text/pictures for humans), allowing data from different websites to be combined into a "web of data".
  • The Resource Description Framework (RDF) was intended as the mechanism for this.
  • The semantic web was overhyped in the early 2000s and hasn't been realized in practice, but triples can be a good internal data model for applications regardless.
  • RDF uses URIs for subjects, predicates, and objects to avoid naming conflicts when combining data from different sources.

The SPARQL Query Language

SPARQL is a query language for triple-stores using the RDF data model (pronounced "sparkle"). It predates Cypher, and Cypher's pattern matching is borrowed from SPARQL.

PREFIX : <urn:example:>
SELECT ?personName WHERE {
  ?person :name ?personName.
  ?person :bornIn  / :within* / :name "United States".
  ?person :livesIn / :within* / :name "Europe".
}

Comparison with Cypher:

(person) -[:BORN_IN]-> () -[:WITHIN*0..]-> (location)   # Cypher
?person :bornIn / :within* ?location.                    # SPARQL

SPARQL is a powerful tool for applications to use internally, even if the semantic web never happens.


3.5 Graph Databases vs. the Network Model (CODASYL)

At first glance, CODASYL's network model looks similar to graph databases. But they differ in important ways:

  • Schema — In CODASYL, the schema specified which record types could be nested within which; in a graph database, any vertex can have an edge to any other vertex, with no restrictions.
  • Access — In CODASYL, the only way to reach a record was to traverse an access path; in a graph database, you can refer directly to any vertex by its unique ID, or use an index.
  • Ordering — In CODASYL, the children of a record were an ordered set; in a graph database, vertices and edges are not ordered (results are sorted only at query time).
  • Query style — In CODASYL, all queries were imperative, difficult to write, and easily broken by schema changes; graph databases support high-level declarative languages (Cypher, SPARQL).

3.6 The Foundation: Datalog

Datalog is a much older language than SPARQL or Cypher, studied extensively by academics in the 1980s. It provides the foundation that later query languages build upon. Used in Datomic and Cascalog (Datalog implementation for querying large datasets in Hadoop).

Datalog's data model is similar to the triple-store model, but instead of (subject, predicate, object), we write predicate(subject, object).

Data as Datalog Facts

name(namerica, 'North America').
type(namerica, continent).
name(usa, 'United States').
type(usa, country).
within(usa, namerica).
name(idaho, 'Idaho').
type(idaho, state).
within(idaho, usa).
name(lucy, 'Lucy').
born_in(lucy, idaho).

Query Using Rules

within_recursive(Location, Name) :- name(Location, Name).     /* Rule 1 */
within_recursive(Location, Name) :- within(Location, Via),     /* Rule 2 */
                                    within_recursive(Via, Name).

migrated(Name, BornIn, LivingIn) :- name(Person, Name),       /* Rule 3 */
                                    born_in(Person, BornLoc),
                                    within_recursive(BornLoc, BornIn),
                                    lives_in(Person, LivingLoc),
                                    within_recursive(LivingLoc, LivingIn).

?- migrated(Who, 'United States', 'Europe').
/* Who = 'Lucy'. */

How Rules Work

  • We define rules that tell the database about new predicates (within_recursive and migrated). These are derived from data or from other rules.
  • Rules can refer to other rules, just like functions calling other functions or recursively calling themselves.
  • A rule applies if the system can find a match for all predicates on the righthand side of the :- operator.
  • By repeated application of rules 1 and 2, within_recursive can tell us all locations contained within any named location.

Step-by-step Application

  1. name(namerica, 'North America') exists → Rule 1 generates within_recursive(namerica, 'North America').
  2. within(usa, namerica) exists + previous step → Rule 2 generates within_recursive(usa, 'North America').
  3. within(idaho, usa) exists + previous step → Rule 2 generates within_recursive(idaho, 'North America').
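The repeated rule application above can be sketched as naive forward chaining to a fixed point (hypothetical in-memory facts, not a real Datalog engine; derived facts are encoded as "location|name" strings for simplicity):

```javascript
// Sketch: derive all within_recursive(Location, Name) facts by applying
// rules 1 and 2 until no new facts appear. (Hypothetical in-memory facts.)
const names  = { namerica: "North America", usa: "United States", idaho: "Idaho" };
const within = { usa: "namerica", idaho: "usa" };   // within(child, parent)

function withinRecursive() {
  const facts = new Set();
  // Rule 1: name(Location, Name) → within_recursive(Location, Name)
  for (const loc in names) facts.add(loc + "|" + names[loc]);
  // Rule 2: within(Location, Via) ∧ within_recursive(Via, Name)
  //         → within_recursive(Location, Name); repeat to a fixed point.
  let changed = true;
  while (changed) {
    changed = false;
    for (const loc in within) {
      const via = within[loc];
      for (const fact of [...facts]) {
        const [l, n] = fact.split("|");
        if (l === via && !facts.has(loc + "|" + n)) {
          facts.add(loc + "|" + n);
          changed = true;
        }
      }
    }
  }
  return facts;
}
```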

The Datalog approach requires a different kind of thinking, but it's very powerful because rules can be combined and reused in different queries. Less convenient for simple one-off queries, but copes better with complex data.


Summary

Key Takeaways

  1. Historically, data started as one big tree (hierarchical model), but that wasn't good for many-to-many relationships, so the relational model was invented.
  2. More recently, NoSQL datastores diverged in two directions:
    • Document databases — target use cases where data comes in self-contained documents and relationships between documents are rare.
    • Graph databases — target use cases where anything is potentially related to everything.
  3. All three models (document, relational, graph) are widely used today, each good in its respective domain. One model can be emulated in another, but the result is often awkward — that's why we have different systems for different purposes.
  4. Document and graph databases typically don't enforce a schema, making it easier to adapt to changing requirements. But the application still assumes structure — it's a question of whether the schema is explicit (enforced on write) or implicit (handled on read).
  5. Each data model comes with its own query language: SQL, MapReduce, MongoDB's aggregation pipeline, Cypher, SPARQL, and Datalog. CSS and XSL/XPath have interesting parallels.
  6. Other specialized data models exist: genome databases (GenBank), particle physics (LHC), full-text search indexes.

Important References

  1. Edgar F. Codd: "A Relational Model of Data for Large Shared Data Banks," Communications of the ACM, volume 13, number 6, pages 377–387, June 1970.
  2. Michael Stonebraker and Joseph M. Hellerstein: "What Goes Around Comes Around," in Readings in Database Systems, 4th edition, MIT Press, 2005.
  3. Pramod J. Sadalage and Martin Fowler: NoSQL Distilled, Addison-Wesley, August 2012.
  4. Jeffrey Dean and Sanjay Ghemawat: "MapReduce: Simplified Data Processing on Large Clusters," at 6th USENIX Symposium on Operating System Design and Implementation (OSDI), December 2004.
  5. Lin Qiao, Kapil Surlaker, Shirshanka Das, et al.: "On Brewing Fresh Espresso: LinkedIn's Distributed Data Serving Platform," at ACM SIGMOD, June 2013.
  6. Charles W. Bachman: "The Programmer as Navigator," Communications of the ACM, volume 16, number 11, pages 653–658, November 1973.
  7. Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton: "Architecture of a Database System," Foundations and Trends in Databases, volume 1, number 2, pages 141–259, November 2007.
  8. James C. Corbett, Jeffrey Dean, Michael Epstein, et al.: "Spanner: Google's Globally-Distributed Database," at 10th USENIX Symposium on Operating System Design and Implementation (OSDI), October 2012.
  9. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, et al.: "Bigtable: A Distributed Storage System for Structured Data," at 7th USENIX OSDI, November 2006.
  10. Nathan Bronson, Zach Amsden, George Cabrera, et al.: "TAO: Facebook's Distributed Data Store for the Social Graph," at USENIX ATC, June 2013.
  11. Todd J. Green, Shan Shan Huang, Boon Thau Loo, and Wenchao Zhou: "Datalog and Recursive Query Processing," Foundations and Trends in Databases, volume 5, number 2, pages 105–195, November 2013.
  12. Herb Sutter: "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software," Dr. Dobb's Journal, volume 30, number 3, March 2005.
  13. Steve Harris, Andy Seaborne, and Eric Prud'hommeaux: "SPARQL 1.1 Query Language," W3C Recommendation, March 2013.
  14. Sarah Mei: "Why You Should Never Use MongoDB," sarahmei.com, November 11, 2013.
  15. Shlomi Noach: "gh-ost: GitHub's Online Schema Migration Tool for MySQL," githubengineering.com, August 1, 2016.