Keynote #4 by David Nichols from Microsoft. Published abstract and speaker bio.
In his talk, David shared some stories about his experience of using SQL in a data center environment to provide cloud services. The speaker was a bit fast while talking, I captured most of his message and the important parts, but i had to skip some parts.
In Windows Live, when building a new service they prefer to use off-the-shelf products such as SQL. Why SQL? familiar tested programming model (real queries, real transactions, good data modeling, excellent at OLTP, easy to find developers that know it). Solid systems software (used often and fine tuned many times and updated). Challenges with using SQL, living without single-image database model (no global transactions or global indexes). Administration and maintenance overhead. Breaking things at scale.
DB partitioned by user, many users per instance DB because it is easy and self contained. User info are small enough that you can place multiple users on single location. Front ends send requests to proper DB. Location is determined by lookup (a Lookup Partition Service – LPS – maps users to partitions). DBs are partitioned by hash to avoid hotspots.
Architecture: Three stages of scale out: bigger server, functional division, and data division.
A problem with scaling out: updates to multiple sevices and users (e.g., add messenger buddy, upload a photo which is writen to file store and recent activity store). Two-phase commit is out (because the risk of having the crash that locks your data out is too high), instead us ad hoc methods: for example: write A intent, write B, write A; another example, write A and work item, let work item write B; another example: write A, then B, tolerate inconsistency.
Another problem is how do you read data about multiple users or even all users. Example scenario, user updates his status, his friends need to know. The old way (inefficient) to do that is to write a change about the users into the profile of all affected users, easy to query, but heavy write load.
Data Availability and Reliability. Replication is used for all user data using SQL replication. Front ends have library (WebStore) to notice failures and switch to secondary. Original scheme was one-to-one which was too slow because parallel transactions vs. single replication stream. Next try was to have four DBs talking to four DBs which fixed most speed problems, but too much load on secondaries after failure. Current approach uses 8-host pods, 25% load increase for secondaries on failure (8×8 matrix, and the replication was done on the transpose of the matrix). However, still not fast enough for key tables (100’s of write threads vs. 5 replication streams). Manual replication (FE’s run SProcs at both primary and secondary, but small probability of inconsistent data). Replication runs a few seconds behind (ops reluctant to auto-promote secondary due to potential data in replication stream), new SQL tech should fix this.
Data loss causes: above the app (external application and old data); in the app (software bugs especially migration logic bugs); below the app (controller failure, disk failure). Mitigation techniques: audit trails and soft deletes for above app problems; per-user backup for software bugs; tape backup, sql replication, and RAID for below app problems (however these are expensive).
Managing Replication: fail safe set, a set of databases in some sort of replication membership. Typical fail safe set is two to four DBs (most are two). Fail safe are the true target of partition schemes.
Upgrade options: upgrade partitions: run DDL in each partition (via WebStore), this is complicated by replication, after all DBs are done, upgrade FEs (SProcs are compatible; changed APIs get new names). Migrate users: can take various forms (between servers, within a server, or even within services), and migrating users can be complex, slow, error-prone, and nobody’s likes it.
Some Operation stories.
Capacity management: growth is in units of servers. when to buy more? test teams provides one opinion, ops team aims to find max resource and stay below limit, two kinds of limits, graceful and catastrophic. Interesting thing about graceful vs catastrophic limits .. if you back off from graceful limits, you can usually go back to your original state (good state), however for catastrophic limits, even if you back off you can remain in a bad situation.
Ops lessons: 1) never do the same thing to all machines at once -stats queries, re-indexing have all crashed clusters in the past. 2) Smaller DBs are better, already coping with many DBs, plus re-indexing backups, upgrades are all faster for small DBs. 3) Read-only mode is powerful (failure maintenance and migration all use it). 4) Use the the live site to try things out (new code new SQL settings etc) “taste vs test”.
Conclusions: SQL can be tamed, it has some real issues but mostly manageable with some infrastructure, and its ops cost not out of line. It is hard to do better than SQL, it keeps improving, each time we go to design something, we find that SQL already design it, perhaps not in the form we want exactly, but close enough and not worth the effort probably. However SQL is not always the best solution.
SQL wish list. Easy ones: partitioned data support, easy migration/placement control, reporting, jobs; supporting aggregated data pattern, improved manageability. Hard ones: taming DB schema evolution, soft delete/versioning support of some kind, and A–D transactions (Atomic & Durable).