
[LADIS 2009] Technical session #3 – Storage

October 11th, 2009

First Talk: Consistency without concurrency control by Marc Shapiro

This seemed like an interesting piece of work. Unfortunately, I came in a bit late from the break, so my notes are sloppy and don't do it much justice. However, the paper about CRDTs and TreeDoc has been published at ICDCS.

Problem motivation: TreeDoc is a storage structure that uses a binary tree encoding to address and store data. Inserting data is done by adding leaves to the tree, and reading the document consists of an in-order traversal of the tree. Deleting portions of the document involves marking nodes with tombstones. However, trees can grow badly unbalanced, so removing deleted nodes and rebalancing the tree is needed. After rebalancing, though, the tree addresses no longer have the same meaning as before, so incoming updates might be inserted in the wrong location. So how can we agree on current addresses without concurrency control?
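To make the address/tombstone idea concrete, here is a toy sketch of a TreeDoc-like buffer (my own illustrative names, not the paper's code): atoms live at tree nodes, the document is the in-order traversal, and deletion just sets a tombstone so node addresses stay stable.

```python
# Toy TreeDoc-like structure: the document is the in-order traversal
# of a binary tree; deletes leave tombstones so addresses stay valid.

class Node:
    def __init__(self, atom):
        self.atom = atom
        self.left = None
        self.right = None
        self.tombstone = False

def read(node):
    """In-order traversal, skipping tombstoned nodes."""
    if node is None:
        return ""
    text = read(node.left)
    if not node.tombstone:
        text += node.atom
    return text + read(node.right)

# Build "abc": root 'b', left child 'a', right child 'c'.
root = Node("b")
root.left = Node("a")
root.right = Node("c")
assert read(root) == "abc"

# Insert 'x' between 'a' and 'b' by adding a leaf under 'a'.
root.left.right = Node("x")
assert read(root) == "axbc"

# Delete 'b' with a tombstone; its children keep their addresses.
root.tombstone = True
assert read(root) == "axc"
```

The rebalancing problem in the talk is exactly that a rebalance would move these nodes and invalidate the tree-path addresses concurrent sites are still using.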

The tree is located at two types of sites: the Core and the Nebula. The Core is a small group of sites that runs 2-phase commit to manage updates; the Nebula is a larger set of remote sites that do not run a consistency protocol. Catch-up protocol: if the Core and a Nebula site are network-partitioned, the Core proceeds with updates and buffers its operations, while the Nebula site also takes updates and buffers them. When the Nebula site later receives the buffered updates from the Core, it replays them and then replays its own operations on top.
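As I understood it, the catch-up step is roughly the following (a minimal sketch with made-up operation names; the real protocol transforms operations so they commute):

```python
# Sketch of the catch-up protocol: during a partition both sides
# buffer operations; on reconnection the nebula site first replays
# the core's committed ops, then re-applies its own on top.

core_buf = ["core-op-1", "core-op-2"]      # committed by the Core
nebula_buf = ["nebula-op-1"]               # buffered locally

def catch_up(local_state, core_ops, local_ops):
    state = list(local_state)
    for op in core_ops:                    # 1. apply the Core's ops
        state.append(op)
    for op in local_ops:                   # 2. replay local ops on top
        state.append(op)
    return state

state = catch_up([], core_buf, nebula_buf)
assert state == ["core-op-1", "core-op-2", "nebula-op-1"]
```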

Main point: there is a need for useful data structures whose operations commute. Commutativity gives us convergence between multiple sites without concurrency control, and TreeDoc is an example of such a data structure. The caveat with such data structures is that garbage collection must be taken care of, because it becomes a big issue.
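The simplest example of this commutativity idea that I know of is a grow-only counter (this is my own illustration, not from the talk): each replica increments its own slot, and merge is element-wise max, which commutes, so replicas converge in any merge order without locks.

```python
# A grow-only counter CRDT: per-replica slots, merge = pointwise max.
# Because merge commutes, replicas converge without concurrency control.

def increment(state, replica):
    state = dict(state)
    state[replica] = state.get(replica, 0) + 1
    return state

def merge(a, b):
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}

def value(state):
    return sum(state.values())

r1 = increment({}, "r1")                    # replica 1 counts one event
r2 = increment(increment({}, "r2"), "r2")   # replica 2 counts two

# Merging in either order yields the same state: that's commutativity.
assert merge(r1, r2) == merge(r2, r1)
assert value(merge(r1, r2)) == 3
```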


Second Talk: Provenance as First Class Cloud Data by Kiran-Kumar Muniswamy-Reddy

This talk motivated why provenance would be useful in cloud computing services. The speaker argued that provenance allows us to reason better about data from cloud services, and that native support for provenance in cloud services would be beneficial.

Provenance tells us where data came from, along with its dependencies and origins. Provenance is essentially a DAG that captures the links between objects. A motivating example is web search vs. cloud search: both have tons of resources, but web search can use hyperlinks to infer dependencies, while no such thing exists for cloud search. Provenance can provide a solution for that, as argued in an earlier paper by Shah at USENIX '07. Another example is pre-fetching: provenance can tell us which documents are related to each other, which allows you to pre-fetch related items for performance. Other examples include ACLs and auditing applications.
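The "provenance is a DAG" point is easy to picture with a toy lineage query (all object names below are made up for illustration):

```python
# A provenance graph as a plain DAG of "derived-from" edges,
# with a transitive ancestor query of the kind pre-fetching
# or auditing would need.

provenance = {                       # object -> objects it was derived from
    "report.pdf": ["model.out"],
    "model.out": ["raw.csv", "params.json"],
    "raw.csv": [],
    "params.json": [],
}

def ancestors(obj, graph):
    """Everything obj transitively depends on."""
    seen = set()
    stack = list(graph.get(obj, []))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(graph.get(cur, []))
    return seen

assert ancestors("report.pdf", provenance) == {"model.out", "raw.csv", "params.json"}
```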

Requirements for provenance: consistency, long-term persistence, queryable, security, coordinate compute facilities and storage facilities.


Third Talk: Cassandra – A Decentralized Structured Storage System by Prashant Malik

Why Cassandra? Lots of data (copies of messages, reverse indices of messages, per-user data, etc.) and lots of random queries.

Design goals: high availability; eventual consistency (trading off strong consistency in favor of high availability); incremental scalability; optimistic replication; "knobs" to tune trade-offs between consistency, durability, and latency; low total cost of ownership; minimal administration.

Data model: similar to the BigTable data model. Columns are indexed by key, data is stored in column families, and the columns within a family are sorted by name or by timestamp. Super columns allow columns to be added dynamically.
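Roughly, the data model looks like nested maps (this is just my illustration of the shape, not Cassandra's API; the inbox example is hypothetical):

```python
# Sketch of the model: row key -> column family -> column -> (value, ts).

inbox = {
    "user42": {                       # row key
        "messages": {                 # column family
            "msg-001": ("hello", 1255000000),
            "msg-002": ("world", 1255000050),
        },
    },
}

# Columns in a family can be kept sorted, e.g. by timestamp:
cols = inbox["user42"]["messages"]
by_time = sorted(cols.items(), key=lambda kv: kv[1][1])
assert [name for name, _ in by_time] == ["msg-001", "msg-002"]
```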

Write operations: a client issues a write request to a random node in the Cassandra cluster, and the "partitioner" determines the nodes responsible for the data. Locally, write operations are logged and then applied to an in-memory version. The commit log is stored on a dedicated disk local to the machine.
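The local write path (log first, then memtable) can be sketched like this; it is a single-node toy of the idea, not Cassandra's code, and the file handling is simplified:

```python
# Commit-log-then-memtable write path: append to a sequential log
# for durability, apply to an in-memory table, replay on restart.

import json, os, tempfile

class ToyStore:
    def __init__(self, log_path):
        self.log_path = log_path
        self.memtable = {}                 # in-memory version of the data

    def write(self, key, value):
        # 1. Append to the commit log first (sequential disk access).
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"k": key, "v": value}) + "\n")
        # 2. Then apply to the in-memory table.
        self.memtable[key] = value

    def recover(self):
        # Replay the log to rebuild the memtable after a crash.
        self.memtable = {}
        with open(self.log_path) as log:
            for line in log:
                rec = json.loads(line)
                self.memtable[rec["k"]] = rec["v"]

path = os.path.join(tempfile.mkdtemp(), "commit.log")
store = ToyStore(path)
store.write("user42", "hello")
store.write("user42", "world")

crashed = ToyStore(path)                   # simulate a restart
crashed.recover()
assert crashed.memtable == {"user42": "world"}
```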

Write properties: there are no locks in the critical path, and disk access is sequential. It behaves like a write-back cache, with append support and no read-ahead. Atomicity is guaranteed per key per replica. Writes are "always writable": they are accepted even during failures, in which case the write is handed off to some other node and loaded back to the correct place when the failed node comes back up.

Reads are sent from the client to any node in the Cassandra cluster; depending on the knobs, the read either returns a single replica's value quickly or consults a quorum to get the most recent value.
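A minimal sketch of that consistency knob (my simplification: replicas hold timestamped values, and a quorum read takes the freshest one):

```python
# Read-consistency knob: read one replica for speed, or a quorum
# and take the value with the newest timestamp.

replicas = [
    {"key": ("old", 100)},    # a stale replica
    {"key": ("new", 200)},
    {"key": ("new", 200)},
]

def read(key, level):
    # Simplification: just read the first n replicas.
    n = 1 if level == "one" else len(replicas) // 2 + 1   # quorum size
    answers = [r[key] for r in replicas[:n]]
    return max(answers, key=lambda vt: vt[1])[0]          # freshest wins

assert read("key", "one") == "old"       # a single read may be stale
assert read("key", "quorum") == "new"    # a quorum sees the newest write
```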

Gossip between replicas uses the Scuttlebutt protocol, which has low overhead. Failure detection assigns each node a suspicion level that increases with time until you hear from that node again.
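The "suspicion that increases with time" idea, in a minimal sketch (the linear formula and threshold are my simplification of accrual-style detection, which really uses heartbeat-interval statistics):

```python
# Accrual-style failure detection: instead of a binary timeout,
# suspicion grows with the silence since the last heartbeat.

def suspicion(now, last_heartbeat, mean_interval):
    """Zero right after a heartbeat; grows as silence lengthens."""
    return (now - last_heartbeat) / mean_interval

mean = 1.0                         # average heartbeat gap in seconds
assert suspicion(10.0, 10.0, mean) == 0.0
assert suspicion(15.0, 10.0, mean) > suspicion(11.0, 10.0, mean)

# A node is only declared suspect once suspicion crosses a threshold:
THRESHOLD = 8.0
assert suspicion(11.0, 10.0, mean) < THRESHOLD    # still trusted
assert suspicion(30.0, 10.0, mean) > THRESHOLD    # likely failed
```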

Lessons learned: add fancy features only when absolutely necessary. Failures are the norm not the exception. You need system-level monitoring. Value simple designs.


Fourth Talk: Towards Decoupling Storage and Computation in Hadoop with SuperDataNodes by George Porter

Hadoop is growing, gaining adoption, and used in production (Facebook, Last.fm, LinkedIn). For example, Facebook imports 25/day into a 1k-node Hadoop cluster. A key to that growth and efficiency is coupling compute and storage: the benefits of moving computation to the data, scheduling for locality to reduce traffic, and map parallelism ("grep"-type workloads).

So, when should storage be coupled with computation? This is a critical and complicated design decision, and it is not always done right. For example, best practices for dedicated clusters are still emerging, and your data center design may not be based on the needs of Hadoop (e.g., adding map/reduce to an existing cluster, or a small workgroup that likes the programming model, using tools such as Pig, Hive, and Mahout).

The goal is to support late binding between storage and computation, and to explore alternative balances between the two (specifically the extreme point of fully decoupling storage and compute nodes). An observation from the Facebook deployment is that the scheduler is really good at placing small tasks on nodes local to their data, but bad at scheduling large tasks with rack locality.

SuperDataNode approach: the key features include a stateless worker tier, a storage node with a shared pool of disks under a single OS, and high bisection bandwidth in the worker tier.

There has been a lot of talk about the advantages of coupling storage and computation, so what are the advantages of decoupling them? They include: decoupling the amount of storage from the number of worker nodes; more intra-rack bandwidth than inter-rack bandwidth; support for "archival" data, i.e., the subset of data with a low probability of access; increased uniformity for job scheduling and block placement; and ease of management, since workers become stateless and managing the SuperDataNode is similar to managing a regular storage node. Replication is needed only for node failures.

Limitations of the SDN: storage bandwidth between workers and the SDN is scarce. With N disks in the SDN (at 100 MB/s each), there is a 1:N ratio of bandwidth between local and remote disks. There is also an effect on fault tolerance (disk vs. node vs. link failure models), on cost, and on performance, which depends on the workload.
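Back-of-the-envelope on that bandwidth limitation, using the 100 MB/s per-disk figure from the talk (the 10 Gb/s link speed is my assumption for illustration):

```python
# N local disks vs. one shared link to the SuperDataNode.

disks = 12
disk_mb_s = 100                      # per-disk throughput from the talk
aggregate = disks * disk_mb_s        # what N local disks could deliver

link_mb_s = 10_000 / 8               # assumed 10 Gb/s link ~= 1250 MB/s

assert aggregate == 1200             # MB/s across all 12 disks
# Here the link roughly keeps up, but doubling the disk count would
# make the remote link the bottleneck:
assert 2 * aggregate > link_mb_s
```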

The evaluation compared a baseline Hadoop cluster and an SDN cluster, each with 10 servers. The results showed that the SDN performed better for grep- and sort-like workloads; a bad case was random writers, where Hadoop performed better (that workload was just each worker writing to disk as fast as possible, i.e., 100% parallelism).

Categories: Conferences, Systems Tags:

[LADIS 2009] Keynote #2 – Some Lessons Learned from Running Amazon Web Services

October 10th, 2009

Keynote #2 by Marvin Theimer from Amazon. Published abstract and speaker bio.

In his talk, Marvin reflected on experiences building and maintaining applications in data centers. He stressed that none of these issues is surprising individually, but the very large scale makes all of them possible all at once, and that is the surprising part! I really liked this talk.

A nice analogy he gave for building and running data center and cloud services is: Evolving a Cessna prop-plane into a 747 jump jet in-flight 🙂

Start with a Cessna prop-plane for cost and timeliness reasons. Four-nines availability means you get to land for 52 minutes every year (including scheduled maintenance, refueling, and crash landings). Success implies growth, evolution, and rebuilding the plane mid-flight: passenger capacity goes from a 4-person cabin to a 747 jumbo wide-body cabin, support for "scale out" means you add jet engines and remove the propellers while flying, and testing and safety checks have to happen while flying!
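The 52-minutes figure that recurs through the talk is just the four-nines arithmetic:

```python
# Four nines = 99.99% availability; the rest is allowed downtime.
minutes_per_year = 365 * 24 * 60          # 525,600
downtime = minutes_per_year * (1 - 0.9999)
assert 52 < downtime < 53                 # about 52.6 minutes per year
```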

Here are the lessons learned:

The unexpected happens! A fuse blows and darkens a set of racks; chillers die in a datacenter and a fraction of the servers go down; an electrical plug bursts into flames; tornadoes or lightning hit a datacenter; a datacenter floods from the roof down; telco connectivity goes down; the DNS provider creates black holes; simultaneous infant mortality occurs among servers newly deployed in multiple datacenters; power generation doesn't start because the ambient temperature is too high; and so on.

Networking challenges. The IP protocol is so deeply embedded in systems that you de facto have to use it. IP networks can have lost, duplicated, and corrupted packets, and even if you use TCP your applications still need to worry about all three. Software (and hardware) bugs can result in consistent loss or corruption of some packets. You have to be prepared for message storms; client software is sometimes written without any notion of backing off on retries. One might expect that CRCs and the design of TCP would catch most of these issues. However, at such a large scale there are enough rare events to produce multiple errors. For example, if a switch or some other network hardware erroneously flips the 8th bit of every 64th packet, then at this scale that rare event will happen repeatedly!
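To see why "rare" stops meaning "never" at this scale, here is a rough estimate (the rates below are my illustrative assumptions, not Amazon's numbers):

```python
# If one packet in a billion is corrupted in a way the 16-bit TCP
# checksum fails to catch, a big fleet still sees it constantly.

packets_per_sec_per_host = 50_000
hosts = 10_000
miss_rate = 1e-9                     # assumed undetected-corruption rate

bad_per_day = packets_per_sec_per_host * hosts * 86_400 * miss_rate
assert 43000 < bad_per_day < 43500   # tens of thousands of bad packets/day
```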

Things you should be able to do without causing outages: add new hardware; deploy a new version of software; roll back to a previous version of software; recover from the absence, loss, or corruption of non-critical data; lose a mirror of a DBMS and recover from having lost it; lose a host in the fleet; lose a datacenter; lose network connectivity between datacenters. Can we roll back some parts in the middle of upgrading other parts?

System resources/objects have lives of their own! Resources/objects in a service may live longer than the accounts used to create them, so you have to be able to remap them between accounts. They may also live longer than versions of the service, so you have to be able to migrate them forward with minimal or no disruption. For example, EC2 instances were designed to run for short periods on demand, but customers keep instances up for a long time, and this happens often enough that shutting down long-lived instances will upset the clients. So how do you deal with that?

Downstream dependencies fail. It's a service-oriented architecture. The good news is that your service has the ability to keep going even if other services become unavailable; the challenge is how to keep going and/or degrade gracefully when you depend on the functionality of downstream services at low levels. Suppose all services are four-nines available: if a downstream service fails for 52 minutes, how will you meet your own SLA of failing no more than 52 minutes? Cascading outages happen; if multiple downstream services fail, how will you handle it? For example, if a storage service fails, two services depending on it can also fail, then more services depending on them fail, and so on. Services need to defend against that.
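The SLA point is just that availabilities multiply. A crude model (assuming independent failures and naive pass-through of downstream outages, which is exactly what graceful degradation is supposed to avoid):

```python
# A four-nines service that naively depends on four four-nines
# services can only promise the product of the availabilities.

a = 0.9999 ** 5                           # itself plus 4 dependencies
downtime_min = 365 * 24 * 60 * (1 - a)
assert 260 < downtime_min < 265           # ~5x the 52-minute budget
```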

You must be prepared to deal with data corruption. Data corruption happens: hardware is flaky, I/O subsystems can lie, software can be wrong, systems evolve, and people screw up. End-to-end integrity checks are a must, along with straightforward data-corruption checking: how do you know your system is operating correctly? Can your design do an fsck in under 52 minutes?
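A minimal end-to-end integrity check of the kind argued for here (my sketch: checksum at write time, verify at read time, trusting nothing underneath):

```python
# Application-level checksums catch corruption even when the I/O
# stack "lies" and reports success.

import hashlib

def put(store, key, data):
    store[key] = (data, hashlib.sha256(data).hexdigest())

def get(store, key):
    data, digest = store[key]
    if hashlib.sha256(data).hexdigest() != digest:
        raise IOError("corruption detected for %r" % key)
    return data

store = {}
put(store, "blob", b"payload")
assert get(store, "blob") == b"payload"

# Simulate silent corruption underneath the application:
data, digest = store["blob"]
store["blob"] = (b"paXload", digest)
try:
    get(store, "blob")
    assert False, "corruption should have been caught"
except IOError:
    pass                                  # caught, as intended
```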

Keep it simple. It's 4am on Sunday morning and the service has gone down: can you explain the corner cases of your design to the front-line on-call team over the phone? Can you figure out what's going on in under 52 minutes? Can you determine whether a corner case in your code caused the crash, and how to fix it if it did? Simple brute force is sometimes preferable to elegant complexity. Examples: eventual consistency is considered painful (but sometimes necessary), and P2P can be harder to debug than centralized approaches (but may be necessary). Is it necessary to build your system now to handle situations that arise only after massive growth, when it is likely the system will have changed or been replaced by the time it is big enough to face them?

Scale: will your design envelope scale far enough? Do you understand your components well enough? Cloud computing has global reach, services may grow at an astonishing pace, and the overall scale is HUGE. The scale of cloud computing tends to push systems outside their standard design envelopes. The rule of thumb that you must redesign your system every time it grows by 10x implies you must be prepared to redesign early and often.

CAE trade-off for resources. CAE: cost-efficient, available, elastic. You can only pick two of them!

Do not ignore the business model or your TCO. Do you know all the sources of cost? Can you accurately measure them? Do you know all the "dimensions of cost" that will be used in pricing? Can you meter them? Have you thought about ways the system can be abused? How will you resolve billing disputes? All of these may affect the design of the service in fundamental ways, and they are important to measure even if you think your revenue will come from ads. For example, some customers figured out that if they store large data in the key part of the key/value store rather than in the value, they can reduce their cost by 1000x, because S3 only charges for the size of the value, not the key! So you have to think about what you are not charging people for and how they can abuse it.
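The S3 anecdote, in rough numbers (the price and sizes are illustrative assumptions, not Amazon's actual billing formula):

```python
# If only value bytes are billed, moving the data into the key
# makes storage essentially free.

def monthly_cost(objects, key_bytes, value_bytes, price_per_gb=0.15):
    billed = objects * value_bytes        # keys not billed: the loophole
    return billed / 2**30 * price_per_gb

honest = monthly_cost(10**6, key_bytes=16, value_bytes=1024)
abusive = monthly_cost(10**6, key_bytes=1024, value_bytes=1)

assert honest > 1000 * abusive            # ~1000x cheaper by abusing keys
```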

Elastic resources: what boundaries do you expose? High-availability apps require the notion of independent failure zones, which introduces the notion of availability zones (AZs). Concurrent apps want bounded, preferably low, message latency and high bandwidth, which introduces the notion of cluster affinity to an AZ. The challenge of AZ clustering is the clumping effect, since everyone will want to be near everyone else (for example, if you ask people to pick an AZ and they don't care, everyone ends up in AZ1!), which makes elastic scheduling harder. Fine-tuned applications are the enemy of elasticity: customers will try to divine your intra-AZ topology (co-location on the same rack, etc.). Eventual evolution to different network infrastructures and topologies means you don't want to expose more than you have to.

Summary and conclusions: The unexpected happens; in large systems even extremely rare events occur with non-negligible frequency, so what's your story for handling them? Keep it simple: it's 4am and the clock is ticking; can you debug what's going on in your system? Cloud computing is a business: you have to think about cost-efficiency as well as availability and elasticity.

Questions:

Mike Freedman, Princeton University: Which of these issues are specific to an infrastructure provider (such as Amazon) compared to web service providers such as walmart.com or Hotmail?

Answer: Many things are common, such as hazards and load. As for other things like accounting and billing, these are still useful for service providers; at the least they can minimize your running costs and let you know where you are spending your money.

Ken Birman, Cornell University: What makes you wary of consistency: is it the 4am call, or is it latency, competitiveness, and the added complexity?

Answer: It is the 4am call. When you have systems at large scale, you have to work out all the possible cases in your system, and you cannot cheat your way out of it. These corner cases make it hard. Remember that this has to be developed in a timely manner, and it is developed by junior developers who are still evolving their knowledge and expertise.

—: How do you test the resilience of your data centers? Are there people who go and turn off part of your datacenter?

Answer: Essentially yes! You test as much as you can, then you roll out.

Doug Terry, MSR-SV: Shouldn't the analogy be that you start with a fleet of Cessnas and you want to evolve them into a fleet of jumbo jets in flight without losing all of them together?

Answer: The problem is that you cannot parallelize everything. Some percentage of your code does not get fixed.

Hakim Weatherspoon, Cornell University: What about embracing failure, running your systems hot and expecting that nodes will fail?

Answer: That solves some of the existing problems, but newer problems that we don't know about yet can arise. For example, we never thought the boot temperature of the backup power generators would ever be an issue, but it was! So you can never enumerate all problems.

Categories: Conferences, Systems Tags:

[LADIS 2009] Technical Session #2 – Applications and Services

October 10th, 2009

First Talk: Are Clouds Ready for Large Distributed Applications? by Kay Sripanidkulchai

This talk essentially focused on how enterprise applications can be moved to cloud computing settings. The first issue is deployment, which is more complex than just booting up VMs due to data and functionality dependencies. The second issue is availability: enterprise apps are heavily engineered to maximize uptime, and according to a published study, current cloud services can expect up to 5 hours of downtime per year, while enterprise customers really expect 1 hour of downtime per year. How can this gap be bridged? The third issue is problem resolution.

Ideas for bridging the availability gap include: 1) implementing scaling architectures in the cloud, 2) developing APIs that allow multiple clouds to interact with each other so as to build failover techniques, and 3) live VM migration to mask failures.

As for problem resolution: among the categories of issues raised on the EC2 boards, 10% of topics are feature requests and 56% are user how-tos; of the actual problems, 25% are cloud errors, 64% user errors, and 11% unknown. One of the important things enterprise customers want is to know, when something is not running correctly, whether the issue is with the cloud platform, the VM, faulty hardware, or something else. Techniques and tools have to be developed in that regard.


Second Talk: Cloudifying Source Code Repositories: How Much Does it Cost? by Michael Siegenthaler

Cloud computing used to be available mainly to large companies that had the resources to build and maintain datacenters. Now it is accessible to people outside these companies at low cost.

Why move source control to the cloud? Resilient storage, no physical server to administer, and scaling to large communities. The system uses SVN, which is very popular, stores data on S3 (which raises a problem with eventual consistency), and uses Yahoo!'s ZooKeeper (a coordination service) as a lock service. How do you measure the cost of SVN on S3? By measuring the cost of diff files and stored files. A back-of-the-envelope analysis shows it is inexpensive even for large projects such as Debian and KDE. A trend to notice is that code repositories are getting larger, but the price of storing a GB is decreasing with time.

Architecture: client machines talk to front-end servers on EC2, and storage is on S3. The front-ends need not be on EC2; the cloud is there mainly for storage. The problem with a naive implementation is that eventual consistency in S3 means multiple revision numbers can be issued for conflicting updates, so locking is required. The commit process has a hook that acquires a lock from ZooKeeper and polls for the most recent version number. The most recent version is retrieved from S3 (retrying if it is not found due to eventual consistency); then the commit is made, the lock is released, and ZooKeeper increments the version number.
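The commit hook flow, as I understood it, sketched with in-memory stand-ins (a local lock standing in for the ZooKeeper lock and a dict for S3, so this stays runnable; the helper names are mine):

```python
# Serialize commits through a lock so each commit gets a unique,
# monotonically increasing revision number despite S3's eventual
# consistency.

import threading

lock = threading.Lock()              # stand-in for a ZooKeeper lock
s3 = {}                              # stand-in for the S3 bucket
head = 0                             # revision counter ZooKeeper would hold

def commit(diff):
    global head
    with lock:                       # serialize committers
        rev = head + 1               # next revision number
        # In the real system: fetch the latest revision from S3,
        # retrying until it appears (eventual consistency), then
        # write the new diff.
        s3[rev] = diff
        head = rev                   # ZooKeeper increments the counter
        return rev

assert commit("r1 diff") == 1
assert commit("r2 diff") == 2        # no duplicate revision numbers
```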

Performance evaluation: as for usage patterns, the Apache foundation has 1 repository for 74 projects with an average of 1.10 commits per minute and a maximum of 7 per minute, and the Debian community has 506 repositories with 1.12 commits per minute in aggregate and a maximum of 6. These patterns were used as experiment traces. The results showed that as you add more front-end servers on EC2, performance does not collapse from lock contention, and this held for varying numbers of clients.


Third Talk: Cloud9: A Software Testing Service by Stefan Bucur

There is a need to facilitate automatic testing of programs, and cloud computing can make such testing perform better. Testing frameworks should provide autonomy (no human intervention), usability, and performance. Cloud9 (http://cloud9.epfl.ch/) is a web service for testing cloud applications.

Symbolic execution: when testing a function, instead of feeding it input values, send it an input abstraction (say, a symbolic lambda), and whenever control flow branches (such as at an if statement) create a subtree of execution. One idea is to send each of these subtrees to a separate machine and test all possible execution paths at once. A naive approach has many problems; for example, trees can expand exponentially, so incrementally acquiring new resources can be problematic. One solution is to pre-allocate all the needed machines. There are many challenges in parallel symbolic execution in the cloud, such as dynamically load-balancing trees among workers and transferring state, along with other problems such as picking the right strategy portfolios.
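A toy of the branching idea: at each `if` over a symbolic input we fork the path condition instead of picking one concrete value. Real engines like KLEE and Cloud9 use an SMT solver over constraints; intervals stand in here, and the two-branch program is my invented example.

```python
# Symbolically run: `if x < 10: A else: B; if x < 3: C else: D`
# over x in [lo, hi), forking a subtree at every feasible branch.

def explore(lo, hi, paths=None, trail=()):
    paths = [] if paths is None else paths
    # First branch: is x < 10 feasible? Is x >= 10 feasible?
    if lo < 10:
        explore2(lo, min(hi, 10), paths, trail + ("A",))
    if hi > 10:
        explore2(max(lo, 10), hi, paths, trail + ("B",))
    return paths

def explore2(lo, hi, paths, trail):
    # Second branch: is x < 3 feasible? Is x >= 3 feasible?
    if lo < 3:
        paths.append(trail + ("C",))
    if hi > 3:
        paths.append(trail + ("D",))

# Starting from x in [0, 100), three paths are feasible; branch B
# can never reach C, since x >= 10 implies x >= 3.
paths = explore(0, 100)
assert ("A", "C") in paths and ("A", "D") in paths
assert ("B", "D") in paths and ("B", "C") not in paths
```

Shipping each recursive `explore` call to a different machine is exactly the parallelization (and load-balancing headache) the talk describes.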

Preliminary results show that parallel symbolic execution in the cloud can give super-linear improvement over conventional methods and KLEE.

Categories: Conferences, Systems Tags: