Commentary:

Facebook overcomes its scale and performance challenges by pushing the concept of an eventually consistent, distributed hash table to its limits, via a geographically distributed, hierarchical system of caches. This kind of system is innovative when the use case and scale are appropriate. I think Facebook TAO, first published in 2013 but referenced by many later papers, was at the beginning of the trend of “building a specific DB for your data and problem”.

“A million dollars isn’t cool, Mark. You know what’s cool? A BILLION dollars.”

Paper link: https://www.usenix.org/system/files/conference/atc13/atc13-bronson.pdf

Date: 2013

Novel Idea

Before TAO, applications at Facebook aggressively used Memcache as a lookaside cache in front of MySQL to serve the social graph. TAO optimizes for the graph semantics directly while continuing to use MySQL as its source of truth, and it favors availability and performance over strong consistency.

Main Result(s):

TAO as described here in 2013 serves Facebook’s social graph at roughly a billion reads and millions of writes per second, for more than 1 billion active users on a constantly changing graph. It is a single, multi-tenant instance.

Prior Work:

Some influential prior work that built up to TAO, at least from industry, includes Google BigTable and Spanner, Amazon Dynamo, LinkedIn Voldemort, Twitter FlockDB, Redis, and Yahoo PNUTS. From academia, there were the usual suspects: consistent hashing and distributed data stores such as Chord and Tapestry.

Competitive work:

Facebook’s TAO system is unique in its scale and aggressive performance optimizations, because in this period, circa 2009-2014, Facebook saw an incredibly high rate of growth in its worldwide active user base. There were competing distributed KV stores, e.g. Dynamo, but TAO leverages the specific needs of its application to avoid the application-side conflict resolution that Dynamo requires (while also not outsourcing Facebook’s critical infrastructure to Amazon). Twitter’s FlockDB did not provide the scale Facebook needed either, and the same goes for other similar databases.

Content:

S2. Background

Facebook pages operate on a pull model, which is read heavy. Every page aggregates and filters a large number of items specific to the viewing user (viewer).

S2.1. Graph structure and Memcache

Facebook had legacy systems accessing the social graph in MySQL through client-side Memcache, and this caching logic was tightly coupled to the applications.

The structure of the social graph is as follows: objects are stored as nodes, and associations are stored as edges.

A KV cache alone is not efficient when an edge list changes: we need something that supports concurrent incremental updates to edge lists, and queries need to do better than always fetching the entire edge list, which forces a complete reload upon every change.

Client-side caching logic makes failure modes complex: we need something to centralize the graph access API and cache control logic.

Inter-regional communication for read-after-write: when we restrict the data model and understand the graph semantics, concurrent cache invalidation messages can be processed more efficiently. This gives Facebook read-after-write consistency for clients sharing a cache within a region, in normal operation, without waiting on inter-region communication. Sections 4.4-4.5 describe this.

S3. Data Model and API

Consider the social networking example in Figure 1a. The social graph includes the users (Alice, Bob, Cathy, and David), their relationships, their actions (checking in, commenting, and liking), and a physical location (the Golden Gate Bridge).

Consider when Facebook application servers render the page: the individual nodes and edges can be cached and reused, but the aggregated content and privacy checks cannot be reused.

S3.1. Objects and Associations

Objects in TAO are keyed by globally unique 64-bit integers and structured as follows:

(id) -> (otype, (key -> value)*)

Associations are keyed by the triple (source_id, association_type, destination_id):

(id1, atype, id2) -> (time_int, (key -> value)*)

Notice the time_int field on association edges; it plays a central role in queries.

Both objects and associations may contain data as key/value pairs. Within this data, a per-type schema enumerates the possible keys, their value types, and default values.
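To make these two shapes concrete, here is a minimal Python sketch of the data model; the class and field names simply mirror the notation above and are not TAO’s real types:

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class TAOObject:
    id: int                                   # globally unique 64-bit id
    otype: str                                # object type; its schema governs the data keys
    data: Dict[str, str] = field(default_factory=dict)

@dataclass
class TAOAssociation:
    id1: int                                  # source object id (also keys the hosting shard)
    atype: str                                # association type
    id2: int                                  # destination object id
    time_int: int                             # creation time; central to range queries
    data: Dict[str, str] = field(default_factory=dict)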

S3.2. Objects

In Facebook usage, entities and repeatable actions are represented as objects, such as people, places, or comments. A “Like” action is represented as an association rather than an object (this is why you cannot Like a Like: an association can only connect two objects, not another association). Associations model actions that happen at most once or record state transitions. Hence, there is at most one edge of any given type between any two objects, and this edge may be bidirectional.

Self-edges are allowed.

S3.3. Associations

Bidirectional edges are modeled as two separate association entries, and TAO’s write API updates both when an inverse type is configured.

# upserts this association and its inverse if defined
assoc_add(id1, atype, id2, timestamp, (k->v)*)
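A minimal sketch of what that upsert might do when an inverse type is configured; the INVERSE_TYPES map and the store interface are hypothetical stand-ins for illustration, not TAO’s implementation:

# Hypothetical inverse-type configuration, e.g. LIKES <-> LIKED_BY.
INVERSE_TYPES = {"LIKES": "LIKED_BY", "LIKED_BY": "LIKES"}

def assoc_add(store, id1, atype, id2, timestamp, data):
    # Upsert the forward edge; it lives on id1's shard.
    store.upsert(id1, atype, id2, timestamp, data)
    inverse = INVERSE_TYPES.get(atype)
    if inverse is not None:
        # Upsert the inverse edge on id2's shard. The two shards usually
        # differ, so the pair is not written atomically (see S4.2).
        store.upsert(id2, inverse, id1, timestamp, data)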

S3.4. Query API

As expected, a query needs an originating object id and the association type atype.

In principle an application might need to enumerate all destination ids, but this is where TAO can make optimizations specific to Facebook’s workload:

A characteristic of the social graph is that … many of the queries are for the newest subset. This creation-time locality arises whenever an application focuses on recent items. If the Alice in Figure 1 is a famous celebrity then there might be thousands of comments attached to her checkin, but only the most recent ones will be rendered by default.

TAO defines the association list as all associations with a given “id1” and “atype”, in descending order of the “time_int” field:

Association List: (id1, atype) -> [ a_new ... a_old ]

Now the API can be described:

  • assoc_range(id1, atype, pos, limit): returns elements of the (id1, atype) association list with pagination. “The 50 most recent comments on Alice’s checkin” is a call to assoc_range(632, COMMENT, 0, 50) (see the usage sketch after this list).

  • assoc_time_range(id1, atype, high, low, limit): a time-bounded version of the above.

  • assoc_get(id1, atype, id2set, high, low): returns all associations (id1, atype, id2), with their time and data, under the restriction that id2 is in id2set. The parameters “high” and “low” are time bounds to optimize for large lists.

  • assoc_count(id1, atype): returns the size of the association list for (id1, atype), i.e. the number of edges of type atype originating at id1.
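To pin down the semantics of these calls (newest-first association lists, membership checks, counts), here is a toy in-memory sketch. Only the id 632 and the COMMENT type come from the paper’s example; the ToyTAO class and all other ids are made up for illustration.

from collections import defaultdict

class ToyTAO:
    def __init__(self):
        # (id1, atype) -> list of (time_int, id2, data), kept newest-first
        self.lists = defaultdict(list)

    def add(self, id1, atype, id2, time_int, data=None):
        self.lists[(id1, atype)].append((time_int, id2, data or {}))
        self.lists[(id1, atype)].sort(key=lambda a: a[0], reverse=True)

    def assoc_range(self, id1, atype, pos, limit):
        return self.lists[(id1, atype)][pos:pos + limit]

    def assoc_time_range(self, id1, atype, high, low, limit):
        hits = [a for a in self.lists[(id1, atype)] if low <= a[0] <= high]
        return hits[:limit]

    def assoc_get(self, id1, atype, id2set, high=None, low=None):
        hits = self.lists[(id1, atype)]
        if high is not None and low is not None:
            hits = [a for a in hits if low <= a[0] <= high]
        return [a for a in hits if a[1] in id2set]

    def assoc_count(self, id1, atype):
        return len(self.lists[(id1, atype)])

tao = ToyTAO()
tao.add(632, "COMMENT", id2=1001, time_int=100)   # 632 is Alice's checkin; other ids are made up
tao.add(632, "COMMENT", id2=1002, time_int=200)
print(tao.assoc_range(632, "COMMENT", 0, 50))     # newest comment (1002) first
print(tao.assoc_count(632, "COMMENT"))            # 2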

S4.1. Architecture - Storage

TAO relies on MySQL as its backing source of truth. It splits its data across logical shards mapped onto MySQL database servers in a typical primary-replica configuration. In each shard’s database, one table holds all objects and another holds all associations.

Each object’s id embeds its shard_id, which ties it to a hosting shard permanently. An association is stored on the shard of its originating object id, id1 - this means writes to an association are concentrated on a single shard.
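The paper does not spell out the bit layout of the embedded shard id, so here is a minimal sketch assuming (purely for illustration) that the high 16 bits of the 64-bit object id hold it; the important properties are only that the shard is recoverable from the id and that an association follows id1’s shard.

SHARD_BITS = 16                     # hypothetical layout: high 16 bits hold the shard id
LOCAL_BITS = 64 - SHARD_BITS

def shard_of_object(object_id: int) -> int:
    # Recover the shard id embedded in the object id.
    return object_id >> LOCAL_BITS

def shard_of_association(id1: int, atype: str, id2: int) -> int:
    # An association lives on the shard of its originating object id1,
    # so all writes for (id1, atype, *) land on a single shard.
    return shard_of_object(id1)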

S4.2. Caching

I think there are two high-level points here:

  1. Cache server fleets, called tiers, make up the majority of TAO machines; clients interact directly with the caching servers, which implement the complete API.

  2. “Cache servers understand the semantics of their content” and use this fact to aggressively optimize in case of misses, e.g. “a cached count of zero is sufficient to answer a range query” - so we can allow stale reads and issue an async cache refresh.
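Point 2 might look roughly like this on a follower’s read path; the cache and db interfaces here are hypothetical stand-ins, and the only grounded detail is that a cached count of zero can answer a range query without touching storage:

def assoc_range_on_follower(cache, db, id1, atype, pos, limit):
    count = cache.get_count(id1, atype)          # cached assoc_count, or None
    if count == 0:
        # The cache knows the list is empty: answer without hitting storage.
        return []
    items = cache.get_list(id1, atype)           # cached newest-first prefix
    if items is not None and len(items) >= pos + limit:
        return items[pos:pos + limit]
    # Otherwise read through to storage and refill the cache (stale reads are
    # allowed, so in practice the refresh could also be issued asynchronously).
    rows = db.assoc_range(id1, atype, pos, limit)
    cache.refill(id1, atype, rows)
    return rows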

Bidirectional association updates are not atomic because they almost always span two different shards, and TAO leaves “hanging” half-written bidirectional associations for a background job to eventually repair.

S4.4. Leaders and Followers

This is where Facebook’s unique scale leads to an interesting design: to accommodate the sheer volume of requests, TAO takes one step beyond the typical “primary-replica-cache” setup and applies the same model within its cache layer. This also alleviates the quadratic all-to-all communication that arises when many small shards are all served out of a single large fleet of cache machines.

TAO introduces the notion of Leader cache tiers and Follower cache tiers. Clients talk to Follower cache servers only. Follower cache machines either answer the requests or forward read misses and writes to the Leader. The Leader cache servers talk to storage, which, depending on the region, is either the MySQL primary or a replica.

TAO implements consistency messages within its Leader-Follower cache tiers. When a Follower cache handles a write, it sends a synchronous request to the Leader for that shard, then lets the Leader asynchronously update the other Follower tiers later.

As an optimization, instead of an invalidation that would truncate or discard association data at the Follower caches, “the Leader sends a refill message to notify followers about an association write. If a follower has cached the association, then the refill request triggers a query to the leader to update the follower’s now-stale association list.” This eagerly refreshes followers so that subsequent read queries still hit in cache.
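A hedged sketch of this write path: only the shape of the flow (synchronous follower-to-leader write, asynchronous refill to the other followers) comes from the paper; the function and queue interfaces here are illustrative.

def follower_handle_write(leader, write):
    # The follower forwards the write synchronously; the caller is not
    # acknowledged until the leader (and the storage behind it) accepts it.
    return leader.handle_write(write)

def leader_handle_assoc_write(db, other_followers, write):
    db.apply(write)                              # write through to MySQL
    for follower in other_followers:
        # Asynchronously notify the other follower tiers. A follower that
        # caches this (id1, atype) list will query the leader again to
        # replace its now-stale copy ("refill" rather than a plain invalidate).
        follower.enqueue_refill(write.id1, write.atype)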

S4.5. Geographical Scale

In this case it is best to refer to the picture in the paper: Figure 2: Multi-region TAO configuration.

The master region sends read misses, writes, and embedded consistency messages to the master database (A).

This is expected, as in normal “primary-replica-cache” setups.

Consistency messages are delivered to the slave leader (B) as the replication stream updates the slave database.

This is to invalidate caches in the Slave Region.

Slave leader sends writes to the master leader (C) and read misses to the replica DB (D).

Part C is expected given a single global master for writes. Part D serves read misses from the local region’s replica DB, avoiding a cross-region round trip.

S5.2. Implementation - MySQL Mapping

Associations are stored similarly to objects, but to support range queries, their tables have an additional index based on id1, atype, and time. To avoid potentially expensive SELECT COUNT queries, association counts are stored in a separate table.

You can almost write out a good guess of what the query and schema look like: SELECT id2 FROM assoc WHERE id1 = ... AND atype = ... ORDER BY time DESC LIMIT 6000
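Continuing that guess, here is a hypothetical sketch of how the storage layer might keep the separate count table in step with the association table; the table names, columns, and the DB-API connection are all guesses for illustration, not the paper’s schema.

def add_assoc(conn, id1, atype, id2, time_int, data):
    cur = conn.cursor()
    cur.execute(
        "SELECT 1 FROM assoc WHERE id1 = %s AND atype = %s AND id2 = %s",
        (id1, atype, id2),
    )
    is_new = cur.fetchone() is None
    # Upsert the edge row (assoc_add is an upsert in TAO's API).
    cur.execute(
        "REPLACE INTO assoc (id1, atype, id2, time, data) VALUES (%s, %s, %s, %s, %s)",
        (id1, atype, id2, time_int, data),
    )
    if is_new:
        # Maintain the count in its own table to avoid SELECT COUNT(*) scans.
        cur.execute(
            "INSERT INTO assoc_count (id1, atype, cnt) VALUES (%s, %s, 1) "
            "ON DUPLICATE KEY UPDATE cnt = cnt + 1",
            (id1, atype),
        )
    conn.commit()    # both tables live on the same shard, so they commit together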

S5.3. Implementation - Cache Sharding and Hot Spot Relief

Shards within TAO are mapped to servers using consistent hashing. However, as with the rest of Facebook’s unique scale, this alone is not enough, so TAO takes the additional step of shard cloning, reminiscent of CDN cache trees:

TAO rebalances load among followers with shard cloning, in which reads to a shard are served by multiple followers in a tier. Consistency management messages for a cloned shard are sent to all followers hosting that shard.

TAO goes a step further with client-side caching when access rates exceed a threshold, caching both the data and a version; this fits TAO’s API by eliminating data transfer in replies when the data has not changed since the client’s previous request.
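The paper does not give the wire format, but the version-based client cache could look roughly like this, assuming a hypothetical reply that can say “unchanged” instead of carrying the data:

def read_with_client_cache(server, local_cache, key):
    cached = local_cache.get(key)                # (version, data) or None
    known_version = cached[0] if cached else None
    reply = server.read(key, known_version=known_version)
    if reply.unchanged:
        # The server confirmed our version is still current, so the reply
        # carried no data payload - the whole point of this optimization.
        return cached[1]
    local_cache[key] = (reply.version, reply.data)
    return reply.data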

S5.4. High-Degree Objects

TAO was designed to cache at most 6,000 associations of a given type per object. This has implications for assoc_get queries where the caller specifies both id1 and id2: the queried id2 could be in the uncached tail of the association list, so this kind of query on a high-degree id1 may go directly to the database. However, one optimization available at the application layer is to invert the query direction if the association is bidirectional and the object id2 has a smaller degree for this association.

I suspect this limitation is also why Facebook cannot easily show me only the comments my friends made on a popular post (in a large group, for example).
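That inversion trick might look roughly like this at the application layer, assuming (as illustration only) the 6,000-item cache limit as the threshold, a configured inverse type, and assoc_get used as an existence check:

CACHE_LIMIT = 6000   # per-(id1, atype) cached prefix size from S5.4

def edge_exists(tao, id1, atype, id2, inverse_atype=None):
    # "Does this edge exist?" on a very high-degree id1 (e.g. a celebrity)
    # may fall off the cached prefix and hit the database, so query from
    # the lower-degree side when a bidirectional inverse is configured.
    if inverse_atype and tao.assoc_count(id1, atype) > CACHE_LIMIT:
        if tao.assoc_count(id2, inverse_atype) <= CACHE_LIMIT:
            return bool(tao.assoc_get(id2, inverse_atype, {id1}))
    return bool(tao.assoc_get(id1, atype, {id2}))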

S6.1. Consistency design

In normal operation, TAO provides read-after-write consistency for successful writes when everything is isolated to a single tier, and asynchronous eventual consistency everywhere else. I note this is quite a loose level of consistency, as the problems mentioned below show. TAO propagates its own changesets between the Master and Slave regions, which raises an interesting problem in Slave regions.

The changeset cannot always be safely applied to the follower’s cache contents, because the follower’s cache may be stale if the refill or invalidate from a second follower’s update has not yet been delivered. We resolve this race condition in most cases with a version number that is present in the persistent store and the cache. The version number is incremented during each update, so the follower can safely invalidate its local copy of the data if the changeset indicates that its pre-update value was stale.
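A minimal sketch of that version check as a follower might apply a changeset; the changeset fields (pre- and post-update versions) are my reading of the paragraph above, not a documented format:

def apply_changeset(cache, changeset_items):
    for item in changeset_items:
        cached = cache.get(item.key)             # (version, value) or None
        if cached is None:
            continue                              # nothing cached, nothing to fix
        cached_version, _ = cached
        if cached_version == item.pre_update_version:
            # Our copy is exactly the value the write was applied to, so the
            # changeset can be applied in place and the version bumped.
            cache.put(item.key, (item.post_update_version, item.new_value))
        else:
            # Our copy is stale (a refill/invalidate from another follower's
            # write has not arrived yet); invalidate and refill on a later read.
            cache.invalidate(item.key)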

So in the paragraph above, Facebook touches on the problem of conflicting concurrent writes; resolving such conflicts is not easy and is very application specific. Otherwise a branch forms in the timeline of an object or association.

In slave regions, this scheme is vulnerable to a rare race condition between cache eviction and storage server update propagation. The slave storage server may hold an older version of a piece of data than what is cached by the caching server, so if the post-changeset entry is evicted from cache and then reloaded from the database, a client may observe a value go back in time in a single follower tier.

This reads like a temporary condition: the write is not lost, so it will eventually become visible to the user once the data in this follower tier is repaired.

Such a situation can only occur if it takes longer for the slave region’s storage server to receive an update than it does for a cached item to be evicted from cache, which is rare in practice.

One of the ways to work around this in TAO is to mark reads as critical, which proxies the read directly to the master region to avoid reusing stale data.

Ideas for further work:

Facebook has shown here how they have pushed the eventually consistent model to a very aggressive limit, and it works for them. Some key design choices in TAO, such as tiered caching, geographical distribution, client-side caching above access thresholds, and leveraging application-specific logic to invert the query direction for associations, are all good techniques to try in other projects when appropriate.

I also think further work is possible to resolve the race condition between Follower caches and Slave-region storage. The paper mentions how Facebook resolves most of the problem with a version number, but it does not say how the application should handle merge conflicts, beyond perhaps showing an older version to the user. Maybe if Facebook had a central source of global timestamps in its changesets, like TrueTime, this problem could be solved.