Archive for the ‘Systems’ Category

[LADIS 2009] Keynote #4 – Life on the Farm: Using SQL for Fun and Profit in Windows Live

October 11th, 2009 No comments

Keynote #4 by David Nichols from Microsoft. Published abstract and speaker bio.

In his talk, David shared some stories about his experience of using SQL in a data center environment to provide cloud services. The speaker was a bit fast while talking, I captured most of his message and the important parts, but i had to skip some parts.

In Windows Live, when building a new service they prefer to use off-the-shelf products such as SQL. Why SQL? familiar tested programming model (real queries, real transactions, good data modeling, excellent at OLTP, easy to find developers that know it). Solid systems software (used often and fine tuned many times and updated). Challenges with using SQL, living without single-image database model (no global transactions or global indexes). Administration and maintenance overhead. Breaking things at scale.

DB partitioned by user, many users per instance DB because it is easy and self contained. User info are small enough that you can place multiple users on single location. Front ends send requests to proper DB. Location is determined by lookup (a Lookup Partition Service – LPS – maps users to partitions). DBs are partitioned by hash to avoid hotspots.

Architecture: Three stages of scale out: bigger server, functional division, and data division.

A problem with scaling out: updates to multiple sevices and users (e.g., add messenger buddy, upload a photo which is writen to file store and recent activity store). Two-phase commit is out (because the risk of having the crash that locks your data out is too high), instead us ad hoc methods: for example: write A intent, write B, write A; another example, write A and work item, let work item write B; another example: write A, then B, tolerate inconsistency.

Another problem is how do you read data about multiple users or even all users. Example scenario, user updates his status, his friends need to know. The old way (inefficient) to do that is to write a change about the users into the profile of all affected users, easy to query, but heavy write load.

Data Availability and Reliability. Replication is used for all user data using SQL replication. Front ends have library (WebStore) to notice failures and switch to secondary. Original scheme was one-to-one which was too slow because parallel transactions vs. single replication stream. Next try was to have four DBs talking to four DBs which fixed most speed problems, but too much load on secondaries after failure. Current approach uses 8-host pods, 25% load increase for secondaries on failure (8×8 matrix, and the replication was done on the transpose of the matrix). However, still not fast enough for key tables (100′s of write threads vs. 5 replication streams). Manual replication (FE’s run SProcs at both primary and secondary, but small probability of inconsistent data). Replication runs a few seconds behind (ops reluctant to auto-promote secondary due to potential data in replication stream), new SQL tech should fix this.

Data loss causes: above the app (external application and old data); in the app (software bugs especially migration logic bugs); below the app (controller failure, disk failure). Mitigation techniques: audit trails and soft deletes for above app problems; per-user backup for software bugs; tape backup, sql replication, and RAID for below app problems (however these are expensive).

Managing Replication: fail safe set, a set of databases in some sort of replication membership. Typical fail safe set is two to four DBs (most are two). Fail safe are the true target of partition schemes.

Upgrade options: upgrade partitions: run DDL in each partition (via WebStore), this is complicated by replication, after all DBs are done, upgrade FEs (SProcs are compatible; changed APIs get new names). Migrate users: can take various forms (between servers, within a server, or even within services), and migrating users can be complex, slow, error-prone, and nobody’s likes it.

Some Operation stories.

Capacity management: growth is in units of servers. when to buy more? test teams provides one opinion, ops team aims to find max resource and stay below limit, two kinds of limits, graceful and catastrophic. Interesting thing about graceful vs catastrophic limits .. if you back off from graceful limits, you can usually go back to your original state (good state), however for catastrophic limits, even if you back off you can remain in a bad situation.

Ops lessons: 1) never do the same thing to all machines at once -stats queries, re-indexing have all crashed clusters in the past. 2) Smaller DBs are better, already coping with many DBs, plus re-indexing backups, upgrades are all faster for small DBs. 3) Read-only mode is powerful (failure maintenance and migration all use it). 4) Use the the live site to try things out (new code new SQL settings etc) “taste vs test”.

Conclusions: SQL can be tamed, it has some real issues but mostly manageable with some infrastructure, and its ops cost not out of line. It is hard to do better than SQL, it keeps improving, each time we go to design something, we find that SQL already design it, perhaps not in the form we want exactly, but close enough and not worth the effort probably. However SQL is not always the best solution.

SQL wish list. Easy ones: partitioned data support, easy migration/placement control, reporting, jobs; supporting aggregated data pattern, improved manageability. Hard ones: taming DB schema evolution, soft delete/versioning support of some kind, and A–D transactions (Atomic & Durable).

Categories: Conferences, Systems Tags:

[LADIS 2009] Technical session #3 – Storage

October 11th, 2009 No comments

First Talk: Consistency without concurrency control by Marc Shapiro

This seemed like an interesting piece of work. Unfortunately i came in a bit late from the break and so my writing is sloppy and doesn’t do it much justice. However the paper about CRDTs and TreeDoc has been published in ICDCS.

Problem motivation: TreeDoc is a storage structure that uses binary tree encoding to address and store data. Inserting data is done by adding leaves to the tree. Reading the document consists of reading the binary tree using an “In Order” traversal. Deleting portions of the tree involves marking nodes with tombstones. However, trees can grow very badly, so removing deleted nodes and “rebalancing” the tree is needed. However, now after the rebalancing the tree addresses do not have the same meaning as before, so incoming updates might be inserted in the wrong location. So how can we agree on current addresses without concurrency control.

Tree located at two types of sites: Core and Nebula. The core is a smaller group that runs 2-phase commit to manage updates. the Nebula is a larger set of remote sites that do not run a consistency protocol. Catch-up protocol: if a core and nebula are networked partitioned, core proceeds with updates and buffers operations, let’s say that the nebula also gets some updates and buffers them. Then when the nebula gets the updates from the core, and replays it and the replays its own operations.

main point: There is a need for useful data structures that support operations that commute. The commutativity gives us convergence between multiple sites without concurrency control. TreeDoc is an example of such data structure. The main point with such data structures is that we should take care of garbage collection because it becomes a big issue.

Second Talk: Provenance as First Class Cloud Data by Kiran-Kumar Muniswamy-Reddy

This talk gave motivation for why would provenance be useful in cloud computing services. The speaker argued that provenance can allow us to reason better about the data from cloud services. The speaker argued that native support for provenance in cloud services will be beneficial.

Provenance tells us where did the data come from, its dependencies, and origins. Provenance is essentially a DAG that captures links between objects. Motivating example applications: web-search vs. cloud-search: both have tons of resources, however web search uses hyperlinks to infer dependencies, while no such thing exists for cloud-search. Provenance can provide a solution for that, and this has been argued for in a previous paper by Shah in usenix ’07. Another example, pre-fetching. Provenance can tell us which documents are related to each other, and this allows you to pre-fetch related items for performance. Other examples include ACLs and auditing apps.

Requirements for provenance: consistency, long-term persistence, queryable, security, coordinate compute facilities and storage facilities.

Third Talk: Cassandra – A Decentralized Structured Storage System by Prashant Malik

Why Cassandra? Lots of data (copies of messages, reverse indices of messages, per user data ..etc), random queries ..etc.

Design goals: high availability, eventual consistency (trade-off strong consistency in favor of high availability), incremental scalability, optimistic replication, “knobs” to tune trade-offs between consistency durability and latency, low total cost of ownership, minimal administration.

Data model: similar to the BigTable data model. columns are indexed by key, data is stored in column families, and the columns are sorted by value or by timestamp. Super columns allow columns to be added dynamically.

Write operations, a client issues a write request to a random node in the Cassandra cluster. The “partitioner” determines the nodes responsible for the data. Locally, write operations are logged and then applied to an in-memory version. Commit log is stored on a dedicated disk local to the machine.

Write properties: there are no locks in the critical path, we have sequential disk access. It behaves like a write back cache, and we have append support without read ahead. Atomicity guarantee for a key per replica. “Always Writable”, writes accepted even during failures, in that case the write is handed-off to some other node and loaded back to the correct place when node comes back up.

Reads are sent from the client to any node in the cassandra cluster, and then depending about the knobs the reads either get the most recent value or a quorrum.

Gossip is used between replicas using the Scuttlebutt protocol which has low overhead. Failure detection assigns a failure suspicion to nodes that increases with time until you hear again from users.

Lessons learned: add fancy features only when absolutely necessary. Failures are the norm not the exception. You need system-level monitoring. Value simple designs.

Fourth Talk: Towards Decoupling Storage and Computation in Hadoop with SuperDataNodes by George Porter

Hadoop is growing, gaining adopting, and used in production (Facebook,, linked in). E.g., facebook imports 25/day to 1k hadoop nodes. A key to that growth and efficiency relies on coupling compute and storage: benefits of moving computation to data, scheduling, locality reduce traffic, map parallelism (“grep” type workload).

So, when to couple storage with computation? This is a critical and complicated design decision, and this is not always done right. Examples, Emerging best practices with dedicated clusters. Your data center design may not be based on the needs for Hadoop (adding map/reduce to existing cluster, or a small workgroup who like the programming model such as Pig, Hive, and Mahout).

Goal is to support late binding between storage and computation. Explore alternative balances between the two (specifically explore the extreme point of decoupling storage and compute nodes). An observation from the Facebook deployment is that the scheduler is really good at scheduling nodes to local nodes for small tasks and bad for scheduling them in rack-locality for large tasks.

SuperDataNode approach: key features include a stateless worker tier, and storage node with shared pool of disks under single O/S, and a high bisection bandwidth worker tier.

There has been alot of talk about advantages of coupling storage and computation, what are the advantages of decoupling them. Advantages include, decoupling amount of storage from number of worker nodes. More intra-rack bandwidth than inter-rack bandwidth. Support for “archival” data, subset of data with low probability of access. Increased uniformity for job scheduling and block placement. Ease of management, workers become stateless; SDN management similar to that of a regular storage node. Replication only for node failures.

Limitations of SDN, scarce storage bandwidth between workers and SDN. Effective throughput with N disks in SDN (@ 100MB/sec each) 1:N ration of bandwidth between local and remote disks. Effect on fault -tolerance. Disk vs Node vs Link failure model. Cost. Performance depends on the work loads.

Evaluation compared a baseline hadoop cluster and an SDN cluster with 10 servers. The results showed that SDN performed better for grep and sort like workloads, and a bad case was random writers were hadoop performed better (workload was just each worker write to disk as fast as possible .. 100% parallelism).

Categories: Conferences, Systems Tags:

[LADIS 2009] Keynote #2 – Some Lessons Learned from Running Amazon Web Services

October 10th, 2009 No comments

Keynote #2 by Marvin Theimer from Amazon. Published abstract and speaker bio.

In his talk, Marvin reflected on experiences building and maintaining applications in data centers. He stressed the point that each of these issues are non-surprising individually by themselves, but the very large scale makes all of the possible all at once, and this is the surprising point! I really liked this talk.

A nice analogy he gave for building and running data center and cloud services is: Evolving a Cessna prop-plane into a 747 jump jet in-flight :-)

Start with a Cessna prop-plane for cost and timeliness reasons. 4-9′s availability means that you get to land for 52 minutes every year (including scheduled maintenance, refueling, and crash landings). Success implies growth and evolution and rebuilding the plane mid-flight: Passenger capacity goes from 4-person cabin to 747 jumbo wide-body cabin, support for “scale out” means you add jet engines and remove the propellers while flying, testing and safety have to happen while flying!

Here are the lessons learned:

The unexpected happens! A fuse blows and darkens a set of racks, chillers die in a datacenter and a fraction of servers are down, an electrical plug bursts into flames, tornadoes or lightening hits datacenter, datacenter floods from the roof down, a telco connectivity goes down, the DNS provider creates black holes, simultaneous infant mortality occurs of servers newly-deployed in multiple datacenters, power generation doesn’t start because the ambient temperature is too high, load ..etc

Networking challenges. The IP protocol is deeply embedded in systems that you de-facto have to use it. IP networks can have lost packets, duplicate packets, and corrupted packets. Even if you use TCP your applications still need to worry about lost packets, duplicate packets, and corrupted packets. Software (and hardware) bugs can result in consistent loss or corruption of some packets. You have to be prepared for message storms. Client software is sometimes written without a notion of backing off on retries. One might expect that CRCs and the design of TCP can catch most of these issues. However, we are running in such a large scale that there are enough rare events that can give multiple errors. For example, if a switch or some network hardware erroneously flips the 8th bit of every 64 packets, with the large running scale, these rare events can happen repeatedly!

Things you should be able to do without causing outages: adding new hardware, deploying a new version of software, rolling back to a previous version of software, recovering from the absence, loss, or corruption of non-critical data. Losing a mirror of a DBMS, recovering from having lost a mirror of a DBMS, losing a host in its fleet, losing a datacenter, losing network connectivity between data centers. Can we roll back some parts in the middle of upgrading other parts ?

System resources/objects have lives of their own! Resources/objects in a service may live longer than the accounts used to create them. You have to be able to remap them between accounts. Resources/objects may live longer than versions of the service! You have to be able to migrate them forward with minimal or no disruptions. For example, EC2 instances were designed to run for short periods on demand, but customers start using them and keeping instances up for a long time, and this happens often enough such that shooting down long-lived instances will upset the clients. So how can you deal with that ?

Downstream dependencies fail. It’s a service-oriented architecture. The good news is that your service has the ability to keep going even if other services become unavailable, and the challenge is how to keep going and/or degrade gracefully if you depend on the functionality of downstream services at low levels. Suppose all services are 4-9′s available, if a downstream service fails for 52 minutes, how will you meet your own SLA of failing no more than 52 minutes ? Cascading outages happen, if multiple downstream services fail, how will you handle it? For example, if a storage service fails, 2 services depending on it can also fail, then more services depending on them fail, and so on and so forth. Services need to defend against that.

You must be prepared to deal with data corruption. Data corruption happens: flakey hardware, IO sub-systems can lie, software can be wrong, system evolution happen, people can screw up. End-to-end integrity checks are a must, straight-forward data corruption checking, how do you know if your system is operating correctly? Can your design do fsck in < 52 minutes ?

Keep it simple. It’s 4am on Sunday morning and the service has gone down, can you explain the corner cases of your design to the front-line on-call team over the phone? can you figure out what’s going on in under 52 minutes? Can you make sure it is not a corner case of using your code that did not result in that crash, or how to fix it if it is ? Simple brute force is sometimes preferable to elegant complexity: examples: eventual consistency considered painful (but sometimes necessary), P2P can be harder to debug than centralized approaches (but may be necessary). Is it necessary to build your system to handle situations after product growth when it is more likely that your system will actually change and be replaced by the time that it is big enough to require handling that issue.

Scale: will your design envelope scale far enough? Do you understand your components well enough? Cloud computing has global reach, services may grow at an astonishing pace, the overall scale is HUGE!. The scale of cloud computing tends to push systems outside their standard design envelopes. The rule of thumb that you must redesign your system every time it grows by 10x implies you must be prepared to redesign early and often.

**CAE Trade-Off for Resources. CAE: cost-efficient, available, elastic. You can only pick two of them!

Do not Ignore the Business Model or your TCO. Do you know all the sources of cost? can you accurately measure them? Do you know all the “dimensions of cost” that will be used in pricing? Can you meter them? Have you thought about ways the system can be abused? How will you resolve billing disputes? All these may affect the design of the service in fundamental ways. This is important to measure even if you think that your revenue will come from adds. For example, some customer figured out that if they store large names in the key part of the key/value store rather than the name, they can reduce their cost by 1000x times because S3 only charges for the size of the value not key! So you have to think about what you are not charging people for and how can they abuse it.

Elastic Resources: What boundaries to expose? High availability apps require the notion of independent failure zones –> introduce the notion of availability zones (AZ). Concurrent apps want bounded, preferably low message latency and high bandwidths –> introduce notion of cluster affinity to an AZ. The challenges of AZ clustering, clumping effect since everyone will want to be near everyone else (for example, if you ask people to pick an AZ and they don’t care, everyone will end up in AZ1 !!), makes elastic scheduling harder. Fine-tuned applications are the enemy of elasticity, customers will try to divine your intra-AZ topology (co-location on the same rack, etc.) Eventual evolution to different network infrastructures and topologies means you don’t want to expose more than you have to.

Summary and Conclusions: The unexpected happens, in large systems even extremely rare events occur with a non-negligible frequency; what’s your story on handling them? Keep it simple: it’s 4′am and the clock is ticking- can you debug what’s going on in your system? Cloud computing is a business: you have to think about cost-efficiency as well as availability and elasticity.


Mike Freedman, Princeton University: What things of these issues are specific for infrastructure provider (such as Amazon) compared to web service providers such as or hotmail?

Answer: Many things are common such as hazards and load. As for other things such as accounting and billing, this is still useful for service providers, then this can at least minimize your running costs and allow you to know where you are spending your money.

Ken Birman, Cornell University: What makes you feel consistency, is it the 4am call or is latency and competitiveness and the added complexity?

Answer: It is the 4am call. When you have systems at large scale, you have to work out all the possible cases in your system and you can not cheat out of it. These corner cases make it hard. Remember that this has to be developed in a timely manner and it is developed by junior developers that are evolving their knowledge and expertise in this.

—: How do you test the resilience of your data centers? Are there people who go and turn off part of your datacenter?

Answer: Essentially yes! You test as much as you can, then you roll out.

Dough Terry, MSR-SV: Shouldn’t the analogy be that you start with a fleet of Cessnas and you want to evolve them into a fleet of Jumbo jets in flight without losing all of them together.

Answer: the problem is that you can not parallelize everything. There is some percentage of your code that does not get fixed.

Hakim Weatherspoon, Cornell University: What about embracing failure? running your systems hot and expect that nodes will fail ?

Answer: That solves some of the existing problems, but newer problems that we don’t know about yet can rise. For example, we never thought that the boot temperature on backup power generators will ever be an issue but it was! So you can never enumerate all problems.

Categories: Conferences, Systems Tags: