Google F1: A Distributed SQL Database That Scales - reading notes
Original paper: https://research.google/pubs/pub41344/
Summary
This was a satisfying paper to read because it builds upon previous Google infrastructure papers, and brings them together from a large-scale application’s point of view. F1 powers AdWords, so it is a critical system in the sense that AdWords is responsible for 90%+ of Google’s revenue.
Themes:
- In query execution, simplicity brings performance;
- Build on strong foundational pieces;
- Rather than minimizing disk seeks and contention on a single disk, the F1 system minimizes network round trips; each round trip issues as many parallel requests to the remote Spanner storage as possible, and Spanner in turn makes optimal use of its local resources, recursively building on strong, simple foundations;
F1’s additions to Spanner:
- A full SQL layer and distributed SQL query execution, including joins against (Protobuf-native) data from external sources. This is largely the bridge between Spanner’s data storage and consistency functionality and the AdWords application developers, who are familiar with the legacy MySQL system.
- Transactionally consistent local and global secondary indexes - this is for the AdWords application only.
- Async schema changes: a dedicated subsystem validates and coordinates continuous schema changes - this guards against data corruption and maintains availability.
- Optimistic Transactions: built on top of Spanner’s capabilities, effectively a compare-and-swap (CAS) combining Spanner’s lock-free snapshot reads with its consistent transactions.
- Automatic history recording and publishing - another user-facing feature, which also enables a creative and effective caching application.
Key design choices underpinning F1:
- The data schema is explicit in the sense that it exposes data clustering to the application programmer. This is a concept shared with Spanner. Concretely, F1 stores data in a hierarchical structure (outlined later). This layout is designed with Spanner storage in mind, and the result is a reduction in RPC round trips.
- Heavy use of batching, parallel processing, and async reads
- A new ORM library for application developers that surfaces these costs, making it easier to spot and debug slow queries.
Scale:
In 2012: roughly 100 TB of data, hundreds of applications sending queries, and thousands of users; >100K QPS; 99.999% availability. Median latency is comparable to the sharded MySQL system, with much lower tail latency.
Architecture overview:
The F1 system consists of (1) the Client, (2) Load Balancers, (3) the F1 Server (the component clients most commonly talk to), (4) the F1 Master, and (5) F1 Slaves.
F1 Servers (3) are mostly stateless, except for the duration of a pessimistic transaction, during which the F1 Server acts as a Spanner client and holds the transaction’s write leases/locks.
F1 Masters and Slaves (4 and 5) come into play during distributed SQL query execution. Masters track membership of the Slave pool. Slave nodes process parts of distributed queries on behalf of the coordinating F1 Server that initiated the query.
Scaling the F1 system is trivial - since F1 Servers hold no data, adding more of them is quick. Adding new Spanner nodes results in data re-distribution (copying of Spanner directories).
Because data is stored remotely, minimum commit latency is relatively high (50-150 ms) compared to MySQL. The F1 ORM client library exposes this cost and pushes application developers to minimize serial round trips.
Spanner review:
- Spanner provides consistent transactions using two-phase locking: multiple reads taking shared or exclusive locks, followed by a single write that upgrades locks and atomically commits (a minimal sketch follows this list).
- Spanner transactions are most efficient when updating data co-located in a single group of directories. This is relevant for local indexes and Optimistic Transactions.
- Spanner multi-group transactions involve two-phase commit (2PC) on top of Paxos. This scales well to tens of “participant groups”, not more. This is important for global indexes.
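To make the read-then-write shape of a pessimistic transaction concrete, here is a minimal sketch. The class and method names are my own stand-ins (not Spanner's actual API), and a plain dict stands in for the storage layer:

```python
# Toy sketch of Spanner-style two-phase locking (not the real API).
# Reads take shared locks; commit upgrades to exclusive locks on written
# keys and applies all writes atomically, then releases the locks.

class PessimisticTxn:
    def __init__(self, storage):
        self.storage = storage          # dict: key -> value, stand-in for Spanner
        self.shared_locks = set()       # keys read under shared locks
        self.pending_writes = {}        # writes buffered until commit

    def read(self, key):
        self.shared_locks.add(key)      # shared lock: readers OK, writers blocked
        return self.storage.get(key)

    def write(self, key, value):
        self.pending_writes[key] = value

    def commit(self):
        # Upgrade to exclusive locks on written keys, then apply all writes
        # atomically (in Spanner, a single commit within a participant group).
        self.storage.update(self.pending_writes)
        self.shared_locks.clear()
        self.pending_writes = {}

store = {"customer/1/budget": 100}
txn = PessimisticTxn(store)
budget = txn.read("customer/1/budget")
txn.write("customer/1/budget", budget - 10)
txn.commit()
print(store)   # {'customer/1/budget': 90}
```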
F1 Data model:
Hierarchical schema
- F1 makes data clustering explicit, mirroring the data layout of Spanner storage. There are two types of tables, Root and Child, and all have primary keys. A child table’s primary key must be prefixed with its parent table’s primary key. Rows are stored in primary-key order (see the key-layout sketch after the note below). This enables the query processing optimizations discussed later (Section 8 of the paper).
- This interleaving of rows from root and child tables means Spanner storage retains data locality. The paper contrasts this with a traditional schema, where reading from two related tables incurs twice the read (hence RPC) requests.
Note the relationship between F1’s optimization and the properties of the actual data: advertising campaigns cluster under a finite set of Customers/Advertisers, and the size distribution is very different from, say, a social network’s “users” data set.
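Here is a small sketch of the hierarchical key layout, using the paper’s Customer/Campaign/AdGroup example; the tuple-based key encoding and the helper function are my own simplification of what Spanner actually stores:

```python
# Child primary keys are prefixed by the parent's primary key, so a plain
# sort by key interleaves each Customer's Campaigns and AdGroups directly
# under the Customer row - one contiguous key range per root row.
rows = [
    (("Customer", 1), {"name": "acme"}),
    (("Customer", 2), {"name": "zeta"}),
    (("Customer", 1, "Campaign", 3), {"budget": 100}),
    (("Customer", 1, "Campaign", 3, "AdGroup", 6), {"keywords": 2}),
    (("Customer", 2, "Campaign", 5), {"budget": 50}),
    (("Customer", 1, "Campaign", 4), {"budget": 20}),
]

def subtree(rows, root_key):
    """All rows whose key starts with root_key: in Spanner terms, a single
    range read over one clustered directory instead of one RPC per table."""
    return [(k, v) for k, v in sorted(rows) if k[: len(root_key)] == root_key]

for key, value in subtree(rows, ("Customer", 1)):
    print(key, value)
# Prints ('Customer', 1), then Campaign 3, Campaign 3's AdGroup 6, then Campaign 4.
```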
Indexing:
- Indexes are stored as separate tables in Spanner, keyed by (index key -> indexed table's primary key); see the sketch after this list.
- There are 2 types of physical index storage: Local and Global.
- Local index: like child tables, local indexes make use of Spanner’s directory-local storage, so local index updates are fast.
- Global index: maintaining these with global consistency is still a hard scalability problem, but it is mitigated because Spanner has already auto-sharded the index across many directories and stores them consistently on multiple servers; writing a single global-index row adds a single extra participant to the initiating transaction, incurring a 2PC, which is fine for up to tens of participants.
- F1 still encourages application developers to use global indexes sparingly, and to avoid bulk global index updates.
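A sketch of what an index-as-a-table might look like; the KeywordIndex here is a made-up example (not from the paper), and the key encoding is simplified:

```python
# Index rows are ordinary rows in another Spanner table whose key is the
# index key concatenated with the indexed row's full primary key; a lookup
# is a prefix range scan and the "value" is the primary key to fetch.
def index_key(keyword, customer_id, campaign_id):
    # Hypothetical global index on a Campaign keyword column.
    return ("KeywordIndex", keyword, "Customer", customer_id, "Campaign", campaign_id)

index_rows = sorted([
    index_key("shoes", 1, 3),
    index_key("shoes", 2, 5),
    index_key("hats", 1, 4),
])

def lookup(keyword):
    prefix = ("KeywordIndex", keyword)
    return [key[2:] for key in index_rows if key[:2] == prefix]

print(lookup("shoes"))
# [('Customer', 1, 'Campaign', 3), ('Customer', 2, 'Campaign', 5)]
# A local index would instead lead with the root row's key (CustomerId), so
# its entries live in the same Spanner directory as the data they index
# (my reading of the paper), making updates directory-local and cheap.
```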
Schema changes:
- F1 leverages Spanner’s non-blocking schema change feature and adds features on top. Schema changes happen often and are business critical.
- The full design of F1’s schema change subsystem is described in a separate paper.
- The key ideas of the subsystem are (1) enforcing that at most two schema versions are active at any time, and (2) dividing a schema change into mutually compatible phases, with an automatic MapReduce backfill job running concurrently with new writes.
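A rough sketch of the phased idea for adding an index; the phase names follow my recollection of the companion schema-change paper (absent, delete-only, write-only, public) and should be treated as an assumption rather than a quotation:

```python
# Assumed state progression for adding an index; any two adjacent states are
# mutually compatible, so servers running on either of the two active schema
# versions cannot corrupt data:
#   absent -> delete-only -> write-only -> (MapReduce backfill) -> public
PHASES = ["absent", "delete-only", "write-only", "public"]

def run_backfill():
    # Stand-in for the automatic MapReduce that writes index entries for
    # pre-existing rows, concurrent with new writes (which already maintain
    # the index once it is in the write-only state).
    pass

def advance(current):
    """Move the index element one compatible step forward."""
    nxt = PHASES[PHASES.index(current) + 1]
    if nxt == "public":
        run_backfill()
    return nxt

state = "absent"
while state != "public":
    state = advance(state)
    print("index is now", state)
```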
Transactions:
- F1’s transactions build on top of Spanner features, most notably Optimistic Transactions.
- Optimistic transaction: (1) read phase: collect rows and their last-modified timestamps; (2) commit phase: F1 creates a short-lived Spanner pessimistic transaction that re-reads the timestamps and, if nothing changed, performs the final write (sketched in code after this list).
- F1 clients default to optimistic transactions. The highlighted benefits:
- Tolerance for misbehaving clients: Spanner snapshot reads do not block and do not conflict, so long-running clients do not slow down the system.
- Long-running transactions as a feature: some F1 transactions can legitimately wait on user input mid-transaction.
- Easy to retry transparently in F1 Server, hiding transient Spanner errors from user.
- Speculative writes: a client (e.g., a MapReduce job) can read data outside a transaction, remember the read timestamps, and later write speculatively in an optimistic transaction that only commits if those rows are unchanged (think of cache or execution optimizations based on history). This is a very sophisticated optimization.
- Optimistic transaction drawbacks:
- Low throughput under high contention: imagine many clients incrementing the same counter; most optimistic commits will conflict and have to retry.
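A minimal sketch of the optimistic read/validate/write flow described above, with a toy in-memory dict standing in for Spanner; the class and exception names are invented for illustration:

```python
import time

# Toy store standing in for Spanner: key -> (value, last_modified_timestamp).
store = {"counter": (0, 0.0)}

class ConflictError(Exception):
    """Raised when a row changed between the optimistic read and the commit."""

class OptimisticTxn:
    def __init__(self):
        self.read_timestamps = {}   # key -> timestamp observed at read time
        self.writes = {}            # writes buffered until commit

    def read(self, key):
        value, ts = store[key]      # lock-free snapshot read, no locks held
        self.read_timestamps[key] = ts
        return value

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # The short-lived pessimistic phase: re-read the timestamps, abort if
        # anything changed, otherwise apply the writes with a new timestamp.
        for key, ts in self.read_timestamps.items():
            if store[key][1] != ts:
                raise ConflictError(key)
        now = time.time()
        for key, value in self.writes.items():
            store[key] = (value, now)

txn = OptimisticTxn()
txn.write("counter", txn.read("counter") + 1)
txn.commit()
print(store["counter"][0])   # 1; a concurrent writer would have caused a retry
```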
Change history recording
- Change history recording is another top-level feature in F1. Every transaction creates one or more “ChangeBatch” protobufs, each including (1) the primary key and (2) the full column values before and after the change.
- ChangeBatches are written to a child table under each root table. For transactions spanning multiple root rows, one ChangeBatch is written per root row, and the batches include references to each other.
- Change history is propagated to consumers in a Pub/Sub system, and consumers sometimes complain of flooding.
- Change history is also used creatively for frontend caching: after a user commits an update, the F1 client opportunistically consults two sources: its own cache (which requires invalidation and a re-read) and the change-history notifications. If the notification arrives before the cache catches up, this is a win (see the sketch below).
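A sketch of that cache trick under my own naming; the idea is simply that whichever source first reflects the just-committed timestamp, the local cache or the change-history notification, can serve the read:

```python
# After committing at commit_ts, the frontend wants data at least that fresh.
# Either its own cache (once invalidated and re-read) or the change-history
# notification (which carries the new column values) can satisfy the read;
# whichever catches up first wins.
def read_after_commit(key, commit_ts, cache, notifications):
    entry = cache.get(key)                   # (value, as_of_ts) or None
    if entry and entry[1] >= commit_ts:
        return entry[0]                      # cache already reflects the commit
    note = notifications.get(key)            # (new_value, ts) from a ChangeBatch
    if note and note[1] >= commit_ts:
        cache[key] = note                    # patch the cache from history
        return note[0]
    return None                              # fall back to a database read

cache = {"campaign/3": ("budget=100", 10)}          # stale, as of ts 10
notifications = {"campaign/3": ("budget=90", 12)}   # notification for ts 12
print(read_after_commit("campaign/3", 12, cache, notifications))  # budget=90
```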
Client ORM
- F1 replaces a legacy MySQL DB, which application developers typically accessed via an ORM. Traditional ORMs have many drawbacks, most of which make them unsuitable for a dataset of AdWords’ size and scale. The traditional MySQL ORM was also a poor match for Protobufs.
- The F1 client library aims to resolve most of these challenges. It further extends SQL to work with nested Protobuf fields (see the sketch below).
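A sketch of the style the client library pushes developers toward, as I understand it: explicit, batched, asynchronous reads instead of an ORM lazily issuing one serial RPC per traversed relationship. The function names and API here are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_campaigns(customer_id):
    # Stand-in for an explicit batched read against F1; in reality this would
    # be an RPC with tens of milliseconds of latency, which is why serial
    # relationship-by-relationship loops hurt so much.
    return [f"campaign-of-{customer_id}"]

def load_dashboard(customer_ids):
    # Kick off all reads up front and wait for them together, instead of
    # lazily traversing one relationship (one RPC) at a time.
    with ThreadPoolExecutor() as pool:
        futures = {cid: pool.submit(fetch_campaigns, cid) for cid in customer_ids}
        return {cid: f.result() for cid, f in futures.items()}

print(load_dashboard([1, 2, 3]))
```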
Query processing
- F1 executes a query either (1) centrally on one node for low latency, or (2) in distributed, batched/parallel mode for long-running queries.
- All input data and internal data flows are hash-partitioned and do not rely on ordering. As I understand it, this keeps the implementation simple.
- Intermediate operators stream data onward as soon as possible, mirroring the GFS/CFS implementations; this maximizes pipelining, minimizes buffering, and is another reason not to sort or preserve data order in intermediate steps.
- F1 bets on the speed and reliability of Google’s DC networks: F1 performance is theoretically bounded by network switches, but it has not been a problem in practice at Google’s DCs.
- F1 (as of 2012) does not checkpoint queries, so long-running queries may fail and require a full restart. Implementing intermediate checkpointing in distributed query processing is difficult without hurting performance in the common, non-failing case.
- Hierarchical joins across two related tables are fast, requiring only a single buffered row from each table (see the merge sketch at the end of these notes).
- Queries interact natively with Protobufs
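A sketch of why the hierarchical join is so cheap; this is my own simplification of the cluster-join idea: because parent and child rows arrive in primary-key order, the join is a streaming merge that buffers only the current row from each input:

```python
# Streams of (key, row) pairs in primary-key order, as Spanner returns them
# when the tables are interleaved.
customers = iter([((1,), "cust-1"), ((2,), "cust-2")])
campaigns = iter([((1, 3), "camp-3"), ((1, 4), "camp-4"), ((2, 5), "camp-5")])

def cluster_join(parents, children):
    """Merge child rows onto their parent via the shared key prefix.
    Only the current parent and the current child are ever buffered."""
    parent = next(parents, None)
    child = next(children, None)
    while parent and child:
        pkey, ckey = parent[0], child[0]
        if ckey[: len(pkey)] == pkey:      # child belongs to this parent
            yield parent[1], child[1]
            child = next(children, None)
        elif ckey < pkey:                  # child sorts before current parent
            child = next(children, None)
        else:                              # advance to the next parent
            parent = next(parents, None)

print(list(cluster_join(customers, campaigns)))
# [('cust-1', 'camp-3'), ('cust-1', 'camp-4'), ('cust-2', 'camp-5')]
```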