[LADIS 2009] Keynote #2 – Some Lessons Learned from Running Amazon Web Services
Keynote #2 by Marvin Theimer from Amazon. Published abstract and speaker bio.
In his talk, Marvin reflected on experiences building and maintaining applications in data centers. He stressed the point that each of these issues are non-surprising individually by themselves, but the very large scale makes all of the possible all at once, and this is the surprising point! I really liked this talk.
A nice analogy he gave for building and running data center and cloud services is: Evolving a Cessna prop-plane into a 747 jump jet in-flight 🙂
Start with a Cessna prop-plane for cost and timeliness reasons. 4-9’s availability means that you get to land for 52 minutes every year (including scheduled maintenance, refueling, and crash landings). Success implies growth and evolution and rebuilding the plane mid-flight: Passenger capacity goes from 4-person cabin to 747 jumbo wide-body cabin, support for “scale out” means you add jet engines and remove the propellers while flying, testing and safety have to happen while flying!
Here are the lessons learned:
The unexpected happens! A fuse blows and darkens a set of racks, chillers die in a datacenter and a fraction of servers are down, an electrical plug bursts into flames, tornadoes or lightening hits datacenter, datacenter floods from the roof down, a telco connectivity goes down, the DNS provider creates black holes, simultaneous infant mortality occurs of servers newly-deployed in multiple datacenters, power generation doesn’t start because the ambient temperature is too high, load ..etc
Networking challenges. The IP protocol is deeply embedded in systems that you de-facto have to use it. IP networks can have lost packets, duplicate packets, and corrupted packets. Even if you use TCP your applications still need to worry about lost packets, duplicate packets, and corrupted packets. Software (and hardware) bugs can result in consistent loss or corruption of some packets. You have to be prepared for message storms. Client software is sometimes written without a notion of backing off on retries. One might expect that CRCs and the design of TCP can catch most of these issues. However, we are running in such a large scale that there are enough rare events that can give multiple errors. For example, if a switch or some network hardware erroneously flips the 8th bit of every 64 packets, with the large running scale, these rare events can happen repeatedly!
Things you should be able to do without causing outages: adding new hardware, deploying a new version of software, rolling back to a previous version of software, recovering from the absence, loss, or corruption of non-critical data. Losing a mirror of a DBMS, recovering from having lost a mirror of a DBMS, losing a host in its fleet, losing a datacenter, losing network connectivity between data centers. Can we roll back some parts in the middle of upgrading other parts ?
System resources/objects have lives of their own! Resources/objects in a service may live longer than the accounts used to create them. You have to be able to remap them between accounts. Resources/objects may live longer than versions of the service! You have to be able to migrate them forward with minimal or no disruptions. For example, EC2 instances were designed to run for short periods on demand, but customers start using them and keeping instances up for a long time, and this happens often enough such that shooting down long-lived instances will upset the clients. So how can you deal with that ?
Downstream dependencies fail. It’s a service-oriented architecture. The good news is that your service has the ability to keep going even if other services become unavailable, and the challenge is how to keep going and/or degrade gracefully if you depend on the functionality of downstream services at low levels. Suppose all services are 4-9’s available, if a downstream service fails for 52 minutes, how will you meet your own SLA of failing no more than 52 minutes ? Cascading outages happen, if multiple downstream services fail, how will you handle it? For example, if a storage service fails, 2 services depending on it can also fail, then more services depending on them fail, and so on and so forth. Services need to defend against that.
You must be prepared to deal with data corruption. Data corruption happens: flakey hardware, IO sub-systems can lie, software can be wrong, system evolution happen, people can screw up. End-to-end integrity checks are a must, straight-forward data corruption checking, how do you know if your system is operating correctly? Can your design do fsck in < 52 minutes ?
Keep it simple. It’s 4am on Sunday morning and the service has gone down, can you explain the corner cases of your design to the front-line on-call team over the phone? can you figure out what’s going on in under 52 minutes? Can you make sure it is not a corner case of using your code that did not result in that crash, or how to fix it if it is ? Simple brute force is sometimes preferable to elegant complexity: examples: eventual consistency considered painful (but sometimes necessary), P2P can be harder to debug than centralized approaches (but may be necessary). Is it necessary to build your system to handle situations after product growth when it is more likely that your system will actually change and be replaced by the time that it is big enough to require handling that issue.
Scale: will your design envelope scale far enough? Do you understand your components well enough? Cloud computing has global reach, services may grow at an astonishing pace, the overall scale is HUGE!. The scale of cloud computing tends to push systems outside their standard design envelopes. The rule of thumb that you must redesign your system every time it grows by 10x implies you must be prepared to redesign early and often.
**CAE Trade-Off for Resources. CAE: cost-efficient, available, elastic. You can only pick two of them!
Do not Ignore the Business Model or your TCO. Do you know all the sources of cost? can you accurately measure them? Do you know all the “dimensions of cost” that will be used in pricing? Can you meter them? Have you thought about ways the system can be abused? How will you resolve billing disputes? All these may affect the design of the service in fundamental ways. This is important to measure even if you think that your revenue will come from adds. For example, some customer figured out that if they store large names in the key part of the key/value store rather than the name, they can reduce their cost by 1000x times because S3 only charges for the size of the value not key! So you have to think about what you are not charging people for and how can they abuse it.
Elastic Resources: What boundaries to expose? High availability apps require the notion of independent failure zones –> introduce the notion of availability zones (AZ). Concurrent apps want bounded, preferably low message latency and high bandwidths –> introduce notion of cluster affinity to an AZ. The challenges of AZ clustering, clumping effect since everyone will want to be near everyone else (for example, if you ask people to pick an AZ and they don’t care, everyone will end up in AZ1 !!), makes elastic scheduling harder. Fine-tuned applications are the enemy of elasticity, customers will try to divine your intra-AZ topology (co-location on the same rack, etc.) Eventual evolution to different network infrastructures and topologies means you don’t want to expose more than you have to.
Summary and Conclusions: The unexpected happens, in large systems even extremely rare events occur with a non-negligible frequency; what’s your story on handling them? Keep it simple: it’s 4’am and the clock is ticking- can you debug what’s going on in your system? Cloud computing is a business: you have to think about cost-efficiency as well as availability and elasticity.
Mike Freedman, Princeton University: What things of these issues are specific for infrastructure provider (such as Amazon) compared to web service providers such as walmart.com or hotmail?
Answer: Many things are common such as hazards and load. As for other things such as accounting and billing, this is still useful for service providers, then this can at least minimize your running costs and allow you to know where you are spending your money.
Ken Birman, Cornell University: What makes you feel consistency, is it the 4am call or is latency and competitiveness and the added complexity?
Answer: It is the 4am call. When you have systems at large scale, you have to work out all the possible cases in your system and you can not cheat out of it. These corner cases make it hard. Remember that this has to be developed in a timely manner and it is developed by junior developers that are evolving their knowledge and expertise in this.
—: How do you test the resilience of your data centers? Are there people who go and turn off part of your datacenter?
Answer: Essentially yes! You test as much as you can, then you roll out.
Dough Terry, MSR-SV: Shouldn’t the analogy be that you start with a fleet of Cessnas and you want to evolve them into a fleet of Jumbo jets in flight without losing all of them together.
Answer: the problem is that you can not parallelize everything. There is some percentage of your code that does not get fixed.
Hakim Weatherspoon, Cornell University: What about embracing failure? running your systems hot and expect that nodes will fail ?
Answer: That solves some of the existing problems, but newer problems that we don’t know about yet can rise. For example, we never thought that the boot temperature on backup power generators will ever be an issue but it was! So you can never enumerate all problems.