Google Spanner - 2012 paper overview
- Paper link: https://research.google/pubs/pub39966/
- Spanner, TrueTime and the CAP Theorem: https://research.google/pubs/pub45855/
- Related to Google F1
Commentary:
“It is the most advanced database in the world right now.”
Novel Idea:
Google Spanner is a globally-distributed, synchronously-replicated database. Spanner offers a level of consistency it calls “external consistency”, which subsumes the serializable isolation level: the commit order that clients observe matches real-time order, while clients can still tune reads (for example, snapshot reads at a client-chosen timestamp).
Spanner is made possible by TrueTime, a novel clock API that exposes clock uncertainty, and it builds several powerful features on top: non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes.
Since 2012, Google Spanner’s feature set has continued to evolve. This overview focuses on the original 2012 paper.
Impact:
Google’s business is serving advertisements online, across the world, with high availability. F1 serves this business, and F1 is built on top of Spanner as its physical database layer. The most significant impact of Spanner is perhaps that it powers Google’s revenue.
Since its introduction, Spanner has inspired several other distributed databases, such as TiDB and CockroachDB. Spanner remains a unique product of Google Cloud because of its custom TrueTime clock infrastructure and Google’s vast private network spanning the world.
Spanner is designed for global scale: millions of machines across hundreds of datacenters and trillions of database rows. Spanner provides all of this with strong consistency guarantees and five 9s of availability.
Content:
At the highest level of abstraction, Spanner is a database that shards data across many sets of Paxos state machines in datacenters spread all over the world. Spanner automatically reshards data across machines as the amount of data or the number of servers changes, and it automatically migrates data across machines to balance load and in response to failures.
F1, which is Google’s advertisement backend, is a client of Spanner. F1 prioritizes high availability, so its Spanner replication configuration uses 5 replicas.
Spanner was built in response to consistent complaints from clients about BigTable’s shortcomings: applications wanted schemas that could evolve, and strong consistency with wide-area replication.
First, Spanner offers such replication configurations to applications. Second, because it can use TrueTime to assign globally meaningful timestamps at transaction commit, Spanner provides “externally consistent” reads and writes, as well as globally consistent snapshot reads at a timestamp.
S2. Implementation
- Universe: a Spanner deployment; global in scope
- Placement driver: moves data across zones
- Zone: the unit of administrative deployment; the set of zones is the set of locations across which data can be replicated. Each zone has:
  - One zonemaster
  - Thousands of spanservers, each serving:
    - Tablets, similar to BigTable tablets
      - State stored as B-tree-like (SST) files plus a write-ahead log, on a distributed file system (Colossus)
      - One Paxos state machine per tablet
        - Writes initiate the Paxos protocol at the leader
        - Reads can be served from any sufficiently up-to-date replica
    - For each Paxos group, the leader:
      - Holds 10-second leases
      - Maintains a lock table for two-phase locking (2PL)
      - Runs a transaction manager for two-phase commit (2PC)
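To make the shape of the stack concrete, here is a rough sketch of the per-spanserver state as Go types. The type names and fields are my own reading of the outline above, not Spanner's actual code.

```go
// A rough sketch (types and names are mine, not Spanner's) of the per-spanserver
// state outlined above: each spanserver serves many tablets, each tablet sits on
// top of one Paxos state machine, and each Paxos leader additionally keeps a lock
// table and a transaction manager.
package spanserver

import "time"

// KeyRange is a contiguous slice of the row space.
type KeyRange struct{ Start, End string }

// Tablet is a bag of (key, timestamp) -> value mappings, persisted as B-tree-like
// (SST) files plus a write-ahead log on a distributed file system (Colossus).
type Tablet struct {
	Directories []KeyRange // directories (bucketed key ranges) assigned to this tablet
}

// PaxosGroup replicates one tablet's state across zones.
type PaxosGroup struct {
	Tablet   *Tablet
	Replicas []string // replica addresses, one per zone
}

// LeaderState exists only on the current Paxos leader of a group.
type LeaderState struct {
	LeaseExpiry time.Time           // ~10 s leases, renewed before expiry
	LockTable   map[string]string   // key range -> lock holder, for two-phase locking
	TxnManager  *TransactionManager // participates in (or coordinates) two-phase commit
}

// TransactionManager tracks distributed transactions this group participates in.
type TransactionManager struct {
	Prepared map[string]time.Time // transaction id -> prepare timestamp
}
```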
Data placement
Spanner groups contiguous buckets of key-value pairs that share a common prefix into a directory. A Spanner Paxos group may contain multiple directories that are frequently accessed together. In contrast to BigTable tablets, Spanner tablets are not lexicographically contiguous partitions of the key/row space: a tablet may contain multiple discrete partitions of the row space, colocated by access pattern.
As a unit of data placement, a directory is loosely comparable to HDFS blocks, TiDB data regions, or the BigTable tablets mentioned above.
Application needs
There were 3 key features that application developers wanted:
- Stronger consistency than eventual consistency
- Data model that was easy to evolve
- SQL-like query language
Examples of well-known Google applications that use Megastore are Gmail, Picasa, Calendar, Android Market, and AppEngine. The need to support a SQL-like query language in Spanner was also clear, given the popularity of Dremel [28] as an interactive data-analysis tool. Finally, the lack of cross-row transactions in Bigtable led to frequent complaints.
Google Percolator addressed transaction needs, but it suffered in availability and performance due to its implementation of general two-phase commit. Spanner addresses these concerns in two ways:
- Support transactions as a feature, but let application developers deal with performance problems.
- Run 2PC over Paxos to increase availability.
Data model
Spanner rows must have primary key column(s).
Spanner offers the distinctive INTERLEAVE IN declaration to state that a table’s content is hierarchically related to, and frequently accessed with, another table’s content. At the physical data level, this is a hint to the data placement driver to colocate the data.
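A toy sketch of why this matters physically: if interleaved child rows get keys prefixed with their parent's key, the parent row and its children form one contiguous key range (a directory) that can be placed as a unit. The Users/Albums example follows the paper's data-model section; the key encoding here is invented.

```go
// A toy illustration (not Spanner's actual key encoding) of why INTERLEAVE IN
// helps placement: if a child row's key is prefixed with its parent's key, the
// parent and all of its children form one contiguous key range, i.e. a directory
// that the placement driver can move and replicate as a unit. A real encoding
// would use order-preserving, fixed-width integer encodings rather than Sprintf.
package interleave

import "fmt"

// userKey builds the key of a Users row.
func userKey(uid int64) string {
	return fmt.Sprintf("Users(%d)", uid)
}

// albumKey builds the key of an Albums row interleaved in Users: it shares the
// parent prefix, so it sorts (and is stored) next to its parent row.
func albumKey(uid, aid int64) string {
	return fmt.Sprintf("%s/Albums(%d)", userKey(uid), aid)
}
```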
S3. TrueTime
We leave most of the details for another paper
I’m still waiting on that paper.
Let’s define the following:
- Endpoints (earliest and latest) of a TTinterval are of type TTstamp
- TT.now() returns a TTinterval that contains the absolute time at which the call was invoked
- epsilon is the error bound, equal to half of the TTinterval width
- We say that any event e happened at physical absolute time t_abs(e)
In formal terms, for an invocation tt = TT.now() whose invocation event is e, TrueTime guarantees:
tt.earliest <= t_abs(e) <= tt.latest
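A minimal Go sketch of the API shape. The Interval/Clock types and the fixed uncertainty field are mine; the real implementation derives epsilon from the time masters, as described next.

```go
// A minimal sketch of the TrueTime API shape (TT.now()/TT.after()/TT.before()).
package truetime

import "time"

// Interval is the paper's TTinterval: [Earliest, Latest] is guaranteed to
// contain the absolute time at which Now() was invoked.
type Interval struct {
	Earliest time.Time
	Latest   time.Time
}

// Clock wraps a local clock plus a current uncertainty bound epsilon.
type Clock struct {
	epsilon time.Duration // half of the interval width
}

// Now is TT.now(): an interval containing absolute "now".
func (c *Clock) Now() Interval {
	t := time.Now() // placeholder for the daemon-disciplined local clock
	return Interval{Earliest: t.Add(-c.epsilon), Latest: t.Add(c.epsilon)}
}

// After is TT.after(t): true only if t has definitely passed.
func (c *Clock) After(t time.Time) bool {
	return c.Now().Earliest.After(t)
}

// Before is TT.before(t): true only if t has definitely not arrived yet.
func (c *Clock) Before(t time.Time) bool {
	return c.Now().Latest.Before(t)
}
```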
TrueTime is implemented by a set of time master machines per datacenter and a timeslave daemon per machine. The majority of masters have GPS receivers with dedicated antennas.
Every daemon polls a variety of masters [29] to reduce vulnerability to errors from any one master. Between synchronizations, a daemon advertises a slowly increasing time uncertainty; epsilon is derived from a conservatively applied worst-case local clock drift.
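A small sketch of that slowly increasing uncertainty, assuming the ~200 μs/s applied drift rate and roughly 1 ms of base uncertainty that the paper cites; over a 30 s poll interval this yields the 1-7 ms sawtooth the paper describes.

```go
// Sketch of the advertised uncertainty between synchronizations: epsilon grows
// linearly from an assumed base value at an assumed worst-case drift rate.
package truetime

import "time"

const (
	baseEpsilon = 1 * time.Millisecond // assumed base: communication delay to the masters
	driftRate   = 200e-6               // applied worst-case drift: 200 microseconds per second
)

// epsilonAt returns the uncertainty a daemon would advertise at time now,
// given the time of its last successful synchronization with the masters.
func epsilonAt(lastSync, now time.Time) time.Duration {
	elapsed := now.Sub(lastSync).Seconds()
	return baseEpsilon + time.Duration(driftRate*elapsed*float64(time.Second))
}
```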
S4. Concurrency Control
Spanner guarantees that a whole-database transactional read at timestamp t will see the effects of all transactions committed as of t. This is one aspect of external consistency.
Spanner read-only transactions can execute without locking and without blocking writes, because Spanner versions data by timestamp (in the Paxos state machines): it executes the reads at an optimal, system-chosen timestamp and can query any Paxos replica that is up-to-date for that timestamp. For example, in Section 4.2.2, if a read-only transaction is scoped to a single Paxos group, the Paxos leader can generally assign it the timestamp of the last committed write, which satisfies external consistency. For read-only transactions that span multiple groups, Spanner avoids a negotiation round and simply uses TT.now().latest (which may have to wait for safe time to advance); a sketch follows.
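A hedged sketch of that read-timestamp choice. The function parameters are placeholders for state the paper describes, not a real API.

```go
// A hedged sketch of the read-timestamp choice for read-only transactions
// (Section 4.2.2) and the safe-time check for serving the read (Section 4.1.3).
package rotxn

import "time"

// chooseReadTimestamp picks s_read: a single-group read can use that group
// leader's last committed write timestamp; a read that spans groups falls back
// to TT.now().latest rather than negotiating a timestamp across groups.
func chooseReadTimestamp(groups []string,
	lastCommittedTS func(group string) time.Time,
	ttNowLatest func() time.Time) time.Time {
	if len(groups) == 1 {
		return lastCommittedTS(groups[0])
	}
	return ttNowLatest()
}

// canServeRead reports whether a replica may serve a read at sRead: it must
// wait until its safe time t_safe has advanced to at least sRead.
func canServeRead(sRead, tSafe time.Time) bool {
	return !tSafe.Before(sRead)
}
```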
Paxos Leader Leases
Spanner’s implementation of Paxos leader lease renewal has some similarity to Raft leader election, with the leader’s TrueTime-based lease intervals playing a role loosely analogous to Raft election terms.
Timestamps for RW Transactions
In this paper, Spanner implements RW transactions with two-phase locking. Spanner assigns the transaction timestamp to be the time of the Paxos write that represents the transaction commit.
Within each Paxos group, Spanner assigns timestamps to Paxos writes in monotonically increasing order, even across leaders.
- Start: the transaction coordinator leader assigns a commit timestamp s_i >= TT.now().latest.
- Commit wait: the transaction coordinator leader ensures clients cannot see data committed by T_i until TT.after(s_i) is true, meaning the absolute physical time of this transaction has definitely passed.
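A minimal sketch of these two rules, with TT.now() stubbed out as a local clock widened by a fixed, assumed epsilon.

```go
// A minimal sketch (not Spanner's code) of the start and commit-wait rules.
package main

import (
	"fmt"
	"time"
)

const epsilon = 5 * time.Millisecond // assumed uncertainty bound for this sketch

// nowInterval returns (earliest, latest), a stand-in for TT.now().
func nowInterval() (earliest, latest time.Time) {
	t := time.Now()
	return t.Add(-epsilon), t.Add(epsilon)
}

// commitTimestamp applies the start rule: s_i >= TT.now().latest at the time the
// commit request is received (per-group monotonicity is handled elsewhere).
func commitTimestamp() time.Time {
	_, latest := nowInterval()
	return latest
}

// commitWait applies the commit-wait rule: block until TT.after(s_i), i.e. until
// the earliest possible current time is past s_i, before the transaction's
// effects become visible to any client.
func commitWait(s time.Time) {
	for {
		earliest, _ := nowInterval()
		if earliest.After(s) {
			return
		}
		time.Sleep(100 * time.Microsecond)
	}
}

func main() {
	s := commitTimestamp()
	commitWait(s) // the expected wait is on the order of 2 * epsilon
	fmt.Println("transaction becomes visible at", s)
}
```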
Serving Reads at a Timestamp
Every replica tracks a value called safe time t_safe, the maximum timestamp at which this replica is up-to-date. This is loosely analogous to the commit (last-applied) index in Raft. At the Paxos level, this is the timestamp of the highest-applied Paxos write.
t_safe = min( t_paxos_safe, t_TM_safe )
The tricky one is t_TM_safe: when there are zero prepared (but not committed) transactions, it is infinity. Participant slaves refer to the leader’s transaction manager for this timestamp. But as soon as a transaction is prepared, this timestamp can be assigned as follows:
As we discuss in Section 4.2.1, the commit protocol ensures that every participant knows a lower bound on a prepared transaction’s timestamp. Every participant leader (for a group g) for a transaction T assigns a prepare timestamp s_prepare_g to its prepare record. The coordinator leader ensures that the transaction’s commit timestamp s_i >= s_prepare_g over all participant groups.
Therefore in such a case, t_TM_safe = min_across_transactions( s_prepare_g ) - 1
As described in Section 4.2.4, this leads to a weakness: a single prepared transaction can prevent t_safe from advancing (at any replica). This is a false conflict if later reads do not conflict with the transaction. Spanner works around this by keeping a finer-grained mapping from key ranges to s_prepare_g timestamps.
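A sketch of the safe-time computation with this refinement in mind. Names are mine; the prepare timestamps passed in would be only those whose key ranges conflict with the read in question.

```go
// A sketch of t_safe = min(t_paxos_safe, t_TM_safe) from Section 4.1.3, with the
// per-key-range refinement of Section 4.2.4 applied by the caller.
package safetime

import "time"

// farFuture stands in for "+infinity" (no prepared-but-uncommitted transactions).
var farFuture = time.Date(9999, 1, 1, 0, 0, 0, 0, time.UTC)

func minTime(a, b time.Time) time.Time {
	if a.Before(b) {
		return a
	}
	return b
}

// tTMSafe is infinity if nothing is prepared, otherwise the minimum prepare
// timestamp among prepared transactions, minus one (here, one nanosecond).
func tTMSafe(prepareTimestamps []time.Time) time.Time {
	t := farFuture
	for _, p := range prepareTimestamps {
		t = minTime(t, p.Add(-time.Nanosecond))
	}
	return t
}

// tSafe combines the Paxos safe time (timestamp of the highest-applied Paxos
// write at this replica) with the transaction-manager safe time.
func tSafe(tPaxosSafe time.Time, prepareTimestamps []time.Time) time.Time {
	return minTime(tPaxosSafe, tTMSafe(prepareTimestamps))
}
```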
Read-Write Transactions
Spanner uses two-phase commit to perform its transactions across Paxos groups. Spanner clients actually drive the following work of a transaction:
- Buffer writes
- Issue reads to the leaders of the appropriate group(s)
- Send keepalive messages to participant leaders
- Perform reads, then kick off two-phase commit
- Choose a coordinator group and send a commit message to each participant group leader
A non-coordinator participant leader first acquires write locks. It then chooses a prepare timestamp (greater than any it has assigned to previous transactions) and logs a prepare record through Paxos. Each participant then notifies the coordinator of its prepare timestamp.
Thus, the prepare phase at each Paxos group is replicated and highly available, compared to non-Paxos 2PC.
The coordinator leader also acquires write locks, but skips the prepare phase. It chooses a timestamp for the entire transaction after hearing from all other participant leaders.
The commit timestamp s must be:
- Greater than or equal to all prepare timestamps from the participants
- Greater than or equal to TT.now().latest at the time the coordinator received its commit message (from the client)
- Greater than any timestamps the leader has assigned to previous transactions
The coordinator leader then logs a commit record through Paxos, preserving the same replication and high availability.
The coordinator leader can skip the prepare phase because it must wait until all other participants reply with their prepare notifications anyway, at which point this coordinator will know if it is ready to commit.
Before allowing any coordinator replica to apply the commit record, the coordinator leader waits until TT.after(s), so as to obey the commit-wait rule described in Section 4.1.2. Because the coordinator leader chose s based on TT.now().latest, and now waits until that timestamp is guaranteed to be in the past, the expected wait is at least 2 * epsilon_avg.
After commit wait, the coordinator sends the commit timestamp to the client and all other participant leaders. Each participant leader logs the transaction’s outcome through Paxos. All participants apply at the same timestamp and then release locks.
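A condensed sketch of the coordinator leader's commit path described above; the Paxos and TrueTime machinery are passed in as placeholder functions, not a real API.

```go
// A condensed sketch (not Spanner's implementation) of the coordinator leader's
// commit path: gather prepare timestamps, pick s under the three constraints,
// replicate the commit record, commit-wait, then let participants apply at s.
package twopc

import "time"

// chooseCommitTimestamp picks s: at least every participant prepare timestamp,
// at least TT.now().latest when the commit message arrived, and greater than any
// timestamp this leader previously assigned.
func chooseCommitTimestamp(prepareTS []time.Time, ttNowLatestAtCommit, lastAssigned time.Time) time.Time {
	s := ttNowLatestAtCommit
	for _, p := range prepareTS {
		if p.After(s) {
			s = p
		}
	}
	if !s.After(lastAssigned) {
		s = lastAssigned.Add(time.Nanosecond)
	}
	return s
}

// coordinatorCommit strings the steps together.
func coordinatorCommit(prepareTS []time.Time, ttNowLatestAtCommit, lastAssigned time.Time,
	logCommit func(s time.Time), ttAfter func(s time.Time) bool, notifyParticipants func(s time.Time)) time.Time {

	s := chooseCommitTimestamp(prepareTS, ttNowLatestAtCommit, lastAssigned)
	logCommit(s) // replicate the commit record through Paxos
	for !ttAfter(s) {
		time.Sleep(100 * time.Microsecond) // commit wait
	}
	notifyParticipants(s) // participants log the outcome, apply at s, release locks
	return s
}
```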
The application of the writes at a particular timestamp is similar to applying entries in a Raft replicated state machine, but instead of log indexes, TrueTime assigns meaningful, monotonically increasing timestamps to the state changes.
More details about Spanner’s transaction capabilities here: document
Refinements
Another interesting mention: t_paxos_safe cannot advance in the absence of Paxos writes. Instead of taking the easy route of doing zero-fill writes, this happens:
Spanner addresses this problem by taking advantage of the disjointness of leader-lease intervals. Each Paxos leader advances t_paxos_safe by keeping a threshold above which future writes’ timestamps will occur: it maintains a mapping MinNextTS(n) from Paxos sequence number n to the minimum timestamp that may be assigned to Paxos sequence number n + 1.
This is possible because leaders know their leases are 10 seconds long, and they have access to TrueTime to assign such timestamps to Paxos sequence numbers with confidence. Leaders can also tell when the end of their lease is approaching; to promise a MinNextTS beyond the end of the lease, a leader must first extend its lease, preserving lease disjointness. A sketch follows.
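A sketch of the MinNextTS bookkeeping under that lease constraint; the types and names are invented for illustration.

```go
// A sketch of the MinNextTS idea: the leader promises that the write at Paxos
// sequence number n+1 will receive a timestamp of at least MinNextTS(n), which
// lets t_paxos_safe advance to MinNextTS(n) - 1 even when no writes arrive.
// A promise must not outlive the leader's lease.
package paxossafe

import "time"

type leader struct {
	minNextTS   map[int64]time.Time // Paxos sequence number n -> MinNextTS(n)
	leaseExpiry time.Time
}

// advance records a promise for sequence number n, but only if it stays within
// the current lease; otherwise the leader must extend its lease first, which
// preserves lease disjointness across leaders.
func (l *leader) advance(n int64, promise time.Time) bool {
	if promise.After(l.leaseExpiry) {
		return false
	}
	l.minNextTS[n] = promise
	return true
}

// paxosSafeTime returns how far t_paxos_safe may advance, given the last applied
// sequence number n and the timestamp of that applied write.
func (l *leader) paxosSafeTime(n int64, lastAppliedTS time.Time) time.Time {
	if p, ok := l.minNextTS[n]; ok && p.After(lastAppliedTS) {
		return p.Add(-time.Nanosecond)
	}
	return lastAppliedTS
}
```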
Relationship with F1
Google’s advertisement backend, F1, was based on manually sharded MySQL with some interesting details:
This backend was originally based on a MySQL database that was manually sharded many ways. The uncompressed dataset is tens of terabytes, which is small compared to many NoSQL instances, but was large enough to cause difficulties with sharded MySQL. The MySQL sharding scheme assigned each customer and all related data to a fixed shard. This layout enabled the use of indexes and complex query processing on a per-customer basis, but required some knowledge of the sharding in application business logic. Resharding this revenue-critical database as it grew in the number of customers and their data was extremely costly.
And this bit:
The last resharding took over two years of intense effort, and involved coordination and testing across dozens of teams to minimize risk. This operation was too complex to do regularly.
The reasons to switch to Spanner were clear:
First, Spanner removes the need to manually re-shard. Second, Spanner provides synchronous replication and automatic failover. With MySQL master-slave replication, failover was difficult, and risked data loss and downtime. Third, F1 requires strong transactional semantics, which made using other NoSQL systems impractical. Application semantics requires transactions across arbitrary data, and consistent reads. The F1 team also needed secondary indexes on their data (since Spanner does not yet provide automatic support for secondary indexes), and was able to implement their own consistent global indexes using Spanner transactions.
As a result of Spanner:
Spanner’s automatic failover has been nearly invisible to them.
Spanner’s timestamp semantics made it efficient for F1 to maintain in-memory data structures computed from the database state. F1 maintains a logical history log of all changes, which is written into Spanner itself as part of every transaction. F1 takes full snapshots of data at a timestamp to initialize its data structures, and then reads incremental changes to update them.