Conferences – Big Red Bits

[LADIS 2009] Technical Session #4 – Monitoring and Repair

hussam — Sun, 11 Oct 2009 22:33:57 +0000

Second Talk: A case for the accountable cloud by Andreas Haeberlen

Three cloud stories .. a threatening cloud, a promising cloud, and a nice cloud

The problem with current clouds is that the customer does not know what the cloud service provider is doing with the customer’s code and data. Likewise, from the provider’s perspective, the operator does not know what the code it runs on behalf of customers is supposed to do.

Alice is the customer running a service on the cloud owned and operated by Bob.

A solution: what if we had an oracle that Alice and Bob could ask about cloud problems? We want completeness (if something is faulty, we will know), accuracy (no false positives), and verifiability (the oracle can prove its diagnosis is correct).

Idea: make the cloud accountable to Alice and Bob. The cloud records its actions in a tamper-evident log; Alice and Bob can audit it and use the log to construct evidence that a fault does or does not exist.
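To make the tamper-evident log idea concrete, here is a minimal sketch of one common way to build such a log: a hash chain, where each entry commits to the one before it. This is an illustration only, not the design from the talk; the names `append`, `audit`, and `GENESIS` are made up for this example. Note that a hash chain alone only detects edits relative to a head hash the auditor already trusts, so a real system would also need signed commitments.

```python
import hashlib
import json

# Hypothetical starting value for the chain (an assumption of this sketch).
GENESIS = "0" * 64


def _digest(prev_hash, action):
    """Hash an action together with the hash of the previous entry."""
    payload = json.dumps({"prev": prev_hash, "action": action}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(log, action):
    """Append an action; its hash commits to the whole history so far."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"action": action, "hash": _digest(prev_hash, action)})


def audit(log):
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["hash"] != _digest(prev_hash, entry["action"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
for action in ["start vm", "read block 7", "send reply"]:
    append(log, action)
print(audit(log))  # the untouched log passes the audit
```

If someone later rewrites an entry in the middle of the log, `audit` fails, which is the kind of evidence Alice and Bob would want.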

Discussion: 1) Isn’t this too pessimistic? Bob isn’t malicious. Maybe, but Bob can get hacked, or things can just go wrong. 2) Shouldn’t Bob use fault tolerance instead? Yes, whenever we can, but masking faults is never perfect, so we still need to check. 3) Why would a provider want to deploy this? The feature will be attractive to prospective customers and helpful for support. 4) Are these the right guarantees? Completeness (no false negatives) could be relaxed to probabilistic completeness; verifiability could be relaxed to provide only some evidence; accuracy (no false positives) cannot be relaxed, because we need to have confidence when we rule out problems.

A call to action: cloud accountability should deliver provable guarantees, work for most cloud apps, require no changes to application code, cover a wide spectrum of properties, and have low overhead.

Work in progress: Accountable Virtual Machines (AVMs). Goal: provide accountability for arbitrary unmodified software. The cloud records enough data to enable deterministic replay; Alice can replay the log with a known-good copy of the software and can audit any part of the original execution.

Conclusion: current cloud designs carry risks for both customers and providers (mainly because of split administration problem). Proposed solution: accountable cloud. Lots of research opportunities.

Third Talk: Learning from the Past for Resolving Dilemmas of Asynchrony by Paul Ezhilchelvan

In an asynchronous model you cannot bound message delivery time, or even the time a machine takes to process a message. However, in a probabilistic synchronous model, we can bound these times with a certain probability via proactive measurements. The central hypothesis of the new model is that, most of the time, past performance is indicative of performance in the near future (i.e., delay in the past is indicative of delay in the future).

Design steps: take proactive measurements, use them to establish synchrony bounds, assign time bounds based on those, try them to see how well they work, and enable exceptions.
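As a rough illustration of turning past measurements into a probabilistic bound, the sketch below takes the nearest-rank percentile of recently observed delays and treats anything slower as an exceptional case. The function names and the choice of a simple percentile are assumptions for this example, not the method presented in the talk.

```python
import math


def delay_bound(past_delays_ms, coverage=0.99):
    """Smallest observed delay that covers `coverage` of past samples."""
    if not past_delays_ms:
        raise ValueError("need at least one measurement")
    ordered = sorted(past_delays_ms)
    rank = max(1, math.ceil(coverage * len(ordered)))  # nearest-rank percentile
    return ordered[rank - 1]


def is_exception(delay_ms, bound_ms):
    """A delay beyond the bound is a case the exception path must handle."""
    return delay_ms > bound_ms


measurements = [12, 15, 11, 14, 13, 90, 12, 16, 13, 14]
bound = delay_bound(measurements, coverage=0.9)
print(bound, is_exception(90, bound))
```

The bound only holds "most of the time", which is why the design needs the exception mechanism in addition to the measurements.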

On-going work: development of exceptions (to deal with exceptional cases when mistakes are detected). Open environments are asynchronous, use crash signals for notification of extreme unexpected behavior.

[LADIS 2009] Technical Session #5 – Communication

hussam — Sun, 11 Oct 2009 20:37:11 +0000

First Talk: Bulletin Board: A Scalable and Robust Eventually Consistent Shared Memory over a Peer-to-Peer Overlay by Gregory Chockler

WebSphere Virtual Enterprise (WVE) is a product for managing resources in a data center. The product is a distributed system whose nodes and controllers need to communicate and share information, and BulletinBoard (BB) is used for that. BB is a platform service for facilitating group-based information sharing in a data center. It is a critical component of WVE, and its primary application is monitoring and control, but the designers believe that it could be useful for other weakly consistent services.

Motivation & Contribution: The prior implementation of group communication, built internally, was not designed to grow tenfold. It was based on virtually synchronous group communication and suffered from robustness and stability problems and high runtime overheads as the system grew beyond several hundred processes; its static hierarchy also introduced configuration problems. So the goal was to provide a new implementation that resolves the scaling and stability issues of the prior one (and to implement it in a short time, a constraint that had important implications for the design decisions).

BB supports a write-sub (write-subscribe) service model. It is a cross between pub-sub systems, shared memory systems, and traditional group communication systems. As in pub-sub, communication is asynchronous and done through topics. As in shared memory, writes have overwrite semantics, there is a single writer per topic and process, and notifications are snapshots of state.

Consistency semantics (single topic): PRAM consistency. Notified snapshots are consistent with the order in which each other process issued its writes. It was noted that developers who built services on top of BB turned out to understand these consistency semantics.

Liveness semantics (single topic): eventual inclusion means that each write by a correct and connected process is eventually included in the notified snapshot; eventual exclusion means that failed processes are eventually excluded from updates.

Performance and Scalability Goals: adequate latency, scalable runtime costs, throughput is less of an issue (management load is fixed and low). Low overhead. Robustness, scalability in the presence of large number of processes and topics (2883 topics in a system of 127 processes, note that the initial target was around 1000 processes).

Approach: BB was built on the Service Overlay Network (SON). SON is a semi-structured P2P overlay that was already in the product; it is self-* (recovers from changes quickly without problems), resilient, and supports peer membership and broadcast. The research question was whether BB could be implemented efficiently on top of a P2P overlay like SON.

Architecture: SON with IAM (interest aware membership) built on top of it and BB on top of that (but BB can interact directly with SON).

Reliable Shared State Maintenance in SON for BB: is made fully decentralized, and update propagation is optimized for bimodal topic popularity. Overlay broadcast or iterative unicast over direct TCP connections if # subscribers of a topic is less than a certain threshold (and group broadcast otherwise). For Reliability, periodic refresh of the latest written value (on a long cycle) if not overwritten (this was a bad decision in retrospect) with state transfer to new/reconnected subscribers.

Experimental study on different topologies showed low cpu overhead and latency, but these numbers increased as the topology increased in size. Analysis of that revealed that this was because the periodic refreshes were stacked and caused increased CPU & latency overheads. An additional problem was in broadcast flooding, and when that was removed cpu & latency overheads stayed flat as the topology increased in size.

Lessons learned: communication cost is the major factor affecting scalability of overlay based implementations, and that anti-entropy techniques are best fit for such services.

Second Talk: Optimizing Information Flow in the Gossip Objects Platform by Ymir Vigfusson

In gossip, nodes exchange information with a random peer periodically in rounds. Gossip has appealing properties such as bounded network traffic, scalability in group size, robustness against failures, coding simplicity. This is nice when gossip is considered individually per application. In cloud computing with nodes joining many groups, the traffic is no longer bounded per node (but per topic).
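A small simulation may help show why per-application gossip has bounded traffic: in each round, every informed node sends exactly one message to a random peer. This is a generic push-gossip sketch with made-up names (`gossip_rounds`), not code from the GO platform.

```python
import random


def gossip_rounds(n_nodes, seed=0):
    """Count rounds until a rumor starting at node 0 reaches every node."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n_nodes:
        rounds += 1
        # Each informed node sends one message per round (bounded traffic).
        for _sender in list(informed):
            informed.add(rng.randrange(n_nodes))
    return rounds


print(gossip_rounds(64))
```

Because the informed set can at most double per round, spreading to 64 nodes takes at least 6 rounds; the catch described in the talk is that a node in many groups runs many such protocols at once.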

The Gossip Objects (GO) platform is a general platform for running gossip for multiple applications on a single node. It bounds the gossip traffic going out of a particular node. The talk focused on how to select rumors to publish out from multiple applications on a single node such that we reduce number of messages. This is possible because rumor messages are small and have a short destination. An observation made is that rumors can be delivered indirectly, uninterested nodes can forward rumors to interested nodes.

The GO heuristic: recipient selection is biased towards higher group traffic. The content is selected by computing the utility of a rumor, defined as the probability that the rumor will add information to a host that did not already know it.

Simulation: a first simulation looked at an extreme example with only two nodes joining many groups, and the GO heuristic showed promising results. Then a real-world evaluation was conducted based on a 55-minute trace of the IBM WebSphere Virtual Enterprise Bulletin Board layer. The trace had 127 nodes and 1364 groups, and the evaluation showed that GO placed a cap on traffic compared to the random and random-with-stacking heuristics. Additionally, the GO heuristic delivered rumors faster than the other heuristics, and in terms of the number of messages needed to deliver rumors to interested nodes, it showed a reduction of multiple orders of magnitude over the other heuristics and traditional rumor spreading.

Conclusion: GO implemented novel ideas such as per-node gossip, rumor stacking (pushing the rumor to the MTU size), utility based rumor dissemination, and adapting to traffic rates. GO gives per-node guarantees even when the # of groups scales up. Experimental results were compelling.

Questions:

Mike Spreitzer, IBM Research: What would happen if the number of groups increases?

Answer: Study of available real-world traces showed a pattern of overlap. We also conducted simulation with other group membership patterns and the results were similar.

—: What was the normal rumor size? And what would happen if that increased?

Answer: The average rumor size was 100 bytes. If the message size increased we would stack fewer rumors, but our platform can also reject really large rumors.

—: Have you thought about network-level encoding?

Answer: Not yet, but we plan to in the future.

—: Have you thought of leveraging other dissemination techniques to run under GO?

Answer: Actually, we thought about the opposite direction where we would run other communication protocols and map them under the hood to GO. Results are pending.

[LADIS 2009] Keynote #4 – Life on the Farm: Using SQL for Fun and Profit in Windows Live

hussam — Sun, 11 Oct 2009 19:38:15 +0000

Keynote #4 by David Nichols from Microsoft. Published abstract and speaker bio.

In his talk, David shared stories from his experience using SQL in a data center environment to provide cloud services. The speaker talked quite fast; I captured most of his message and the important parts, but I had to skip some.

In Windows Live, when building a new service they prefer to use off-the-shelf products such as SQL. Why SQL? A familiar, tested programming model (real queries, real transactions, good data modeling, excellent at OLTP, easy to find developers who know it) and solid systems software (widely used, fine-tuned many times, and regularly updated). Challenges with using SQL: living without a single-image database model (no global transactions or global indexes), administration and maintenance overhead, and things breaking at scale.

The DB is partitioned by user, with many users per DB instance, because this is easy and self-contained. Per-user data is small enough that multiple users can be placed in a single location. Front ends send requests to the proper DB. The location is determined by lookup (a Lookup Partition Service – LPS – maps users to partitions). DBs are partitioned by hash to avoid hotspots.
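A minimal sketch of hash-based user partitioning with a lookup step might look like the following. The class and function names, the use of SHA-256, and the server names are all assumptions made for illustration; the talk did not describe how the LPS is implemented.

```python
import hashlib


def partition_for(user_id, num_partitions):
    """Map a user to a partition with a stable hash to spread load evenly."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class LookupPartitionSketch:
    """Hypothetical stand-in for a lookup service: partition -> database."""

    def __init__(self, databases):
        self.databases = list(databases)

    def database_for(self, user_id):
        return self.databases[partition_for(user_id, len(self.databases))]


lps = LookupPartitionSketch(["db-0", "db-1", "db-2", "db-3"])
print(lps.database_for("alice@example.com"))
```

A front end would call `database_for` on each request and then talk only to that database, which keeps every user's data self-contained.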

Architecture: Three stages of scale out: bigger server, functional division, and data division.

A problem with scaling out: updates that span multiple services and users (e.g., adding a messenger buddy, or uploading a photo, which is written to both the file store and the recent activity store). Two-phase commit is out (because the risk of a crash that locks your data out is too high); instead, use ad hoc methods. For example: write A intent, write B, write A. Another example: write A and a work item, and let the work item write B. Another example: write A, then B, and tolerate inconsistency.
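The first ad hoc pattern (write A intent, write B, write A) can be sketched with two in-memory dictionaries standing in for the two stores. Everything here is hypothetical: the function names, the `"intent"`/`"done"` markers, and especially the choice to roll forward during repair, which the talk did not specify.

```python
def add_buddy(store_a, store_b, user, buddy, crash_after_intent=False):
    """Write A intent, write B, then finalize A."""
    store_a.setdefault(user, {})[buddy] = "intent"   # write A intent
    if crash_after_intent:
        return  # simulate a crash between the first and second write
    store_b.setdefault(buddy, {})[user] = "done"     # write B
    store_a[user][buddy] = "done"                    # write A


def repair(store_a, store_b):
    """Roll lingering intents forward (an assumed recovery policy)."""
    for user, buddies in store_a.items():
        for buddy, state in buddies.items():
            if state == "intent":
                store_b.setdefault(buddy, {})[user] = "done"
                buddies[buddy] = "done"


a, b = {}, {}
add_buddy(a, b, "alice", "bob", crash_after_intent=True)
repair(a, b)
print(a, b)
```

The intent record is what lets a later pass notice the half-finished update without holding locks across both stores.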

Another problem is how to read data about multiple users, or even all users. Example scenario: a user updates his status and his friends need to know. The old (inefficient) way to do that is to write the change into the profiles of all affected users, which is easy to query but imposes a heavy write load.

Data availability and reliability: all user data is replicated using SQL replication. Front ends use a library (WebStore) to notice failures and switch to the secondary. The original scheme was one-to-one, which was too slow because parallel transactions on the primary fed a single replication stream. The next try had four DBs talking to four DBs, which fixed most speed problems but put too much load on the secondaries after a failure. The current approach uses 8-host pods, with a 25% load increase for secondaries on failure (an 8×8 matrix, with replication done on the transpose of the matrix). However, this is still not fast enough for key tables (100’s of write threads vs. 5 replication streams). For those, manual replication is used (FEs run SProcs at both primary and secondary, with a small probability of inconsistent data). Replication runs a few seconds behind (ops are reluctant to auto-promote a secondary because of data potentially still in the replication stream); new SQL tech should fix this.

Data loss causes: above the app (external application and old data); in the app (software bugs especially migration logic bugs); below the app (controller failure, disk failure). Mitigation techniques: audit trails and soft deletes for above app problems; per-user backup for software bugs; tape backup, sql replication, and RAID for below app problems (however these are expensive).

Managing replication: a fail-safe set is a set of databases in some sort of replication membership. A typical fail-safe set is two to four DBs (most are two). Fail-safe sets are the true target of partition schemes.

Upgrade options: upgrade partitions: run DDL in each partition (via WebStore), which is complicated by replication; after all DBs are done, upgrade the FEs (SProcs are compatible; changed APIs get new names). Migrate users: this can take various forms (between servers, within a server, or even within services), and migrating users can be complex, slow, and error-prone, and nobody likes it.

Some Operation stories.

Capacity management: growth is in units of servers. When to buy more? The test team provides one opinion; the ops team aims to find the maximum resource usage and stay below the limit. There are two kinds of limits, graceful and catastrophic. The interesting thing about graceful vs. catastrophic limits: if you back off from a graceful limit, you can usually return to your original (good) state; with a catastrophic limit, however, you can remain in a bad situation even after you back off.

Ops lessons: 1) Never do the same thing to all machines at once: stats queries and re-indexing have both crashed clusters in the past. 2) Smaller DBs are better: we are already coping with many DBs, plus re-indexing, backups, and upgrades are all faster for small DBs. 3) Read-only mode is powerful (failure, maintenance, and migration all use it). 4) Use the live site to try things out (new code, new SQL settings, etc.): “taste vs. test”.

Conclusions: SQL can be tamed. It has some real issues, but they are mostly manageable with some infrastructure, and its ops cost is not out of line. It is hard to do better than SQL, and it keeps improving: each time we go to design something, we find that SQL has already designed it, perhaps not exactly in the form we want, but close enough that building our own is probably not worth the effort. However, SQL is not always the best solution.

SQL wish list. Easy ones: partitioned data support, easy migration/placement control, reporting, jobs; supporting aggregated data pattern, improved manageability. Hard ones: taming DB schema evolution, soft delete/versioning support of some kind, and A–D transactions (Atomic & Durable).

[LADIS 2009] Technical session #3 – Storage

hussam — Sun, 11 Oct 2009 17:56:58 +0000

First Talk: Consistency without concurrency control by Marc Shapiro

This seemed like an interesting piece of work. Unfortunately I came in a bit late from the break, so my notes are sloppy and don’t do it much justice. However, the paper about CRDTs and TreeDoc has been published in ICDCS.

Problem motivation: TreeDoc is a storage structure that uses binary tree encoding to address and store data. Inserting data is done by adding leaves to the tree. Reading the document consists of reading the binary tree using an in-order traversal. Deleting portions of the tree involves marking nodes with tombstones. However, the tree can grow badly, so removing deleted nodes and “rebalancing” the tree is needed. After the rebalancing, the tree addresses no longer have the same meaning as before, so incoming updates might be inserted in the wrong location. So how can we agree on current addresses without concurrency control?

The tree is located at two types of sites: core and nebula. The core is a smaller group that runs 2-phase commit to manage updates. The nebula is a larger set of remote sites that do not run a consistency protocol. Catch-up protocol: if the core and a nebula site are partitioned by the network, the core proceeds with updates and buffers its operations; let’s say the nebula site also gets some updates and buffers them. When the nebula site then receives the updates from the core, it replays them and then replays its own operations.

Main point: there is a need for useful data structures that support operations that commute. Commutativity gives us convergence between multiple sites without concurrency control. TreeDoc is an example of such a data structure. With such data structures we have to take care of garbage collection, because it becomes a big issue.
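To illustrate how commuting operations give convergence, here is a toy structure loosely inspired by the description of TreeDoc above: atoms are addressed by binary paths, read back in in-order, and deleted via tombstones. It is a simplified sketch under my own assumptions (unique paths per insert, no rebalancing, names like `TreeDocSketch` invented here), not the algorithm from the paper.

```python
def inorder_key(path):
    """Sort key giving in-order position: left subtree < node < right subtree."""
    return tuple(0 if bit == "0" else 2 for bit in path) + (1,)


class TreeDocSketch:
    def __init__(self):
        self.atoms = {}          # binary path (e.g. "01") -> atom
        self.tombstones = set()  # paths marked as deleted

    def insert(self, path, atom):
        self.atoms[path] = atom

    def delete(self, path):
        self.tombstones.add(path)

    def read(self):
        live = [p for p in self.atoms if p not in self.tombstones]
        return "".join(self.atoms[p] for p in sorted(live, key=inorder_key))


ops = [("insert", "", "b"), ("insert", "0", "a"), ("insert", "1", "c"),
       ("insert", "01", "x"), ("delete", "01", None)]

site1, site2 = TreeDocSketch(), TreeDocSketch()
for kind, path, atom in ops:                 # one delivery order
    site1.insert(path, atom) if kind == "insert" else site1.delete(path)
for kind, path, atom in reversed(ops):       # the opposite delivery order
    site2.insert(path, atom) if kind == "insert" else site2.delete(path)
print(site1.read(), site2.read())
```

Both sites end up with the same document even though the operations arrived in opposite orders. The tombstone set only ever grows, which hints at why garbage collection becomes the hard part.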

Second Talk: Provenance as First Class Cloud Data by Kiran-Kumar Muniswamy-Reddy

This talk gave motivation for why provenance would be useful in cloud computing services. The speaker argued that provenance allows us to reason better about the data from cloud services, and that native support for provenance in cloud services would be beneficial.

Provenance tells us where the data came from, its dependencies, and its origins. Provenance is essentially a DAG that captures links between objects. Motivating example applications: web search vs. cloud search: both have tons of resources, but web search uses hyperlinks to infer dependencies, while no such thing exists for cloud search. Provenance can provide a solution for that, as argued in a previous paper by Shah in USENIX ’07. Another example is pre-fetching: provenance can tell us which documents are related to each other, which allows you to pre-fetch related items for performance. Other examples include ACLs and auditing apps.

Requirements for provenance: consistency, long-term persistence, queryable, security, coordinate compute facilities and storage facilities.

Third Talk: Cassandra – A Decentralized Structured Storage System by Prashant Malik

Why Cassandra? Lots of data (copies of messages, reverse indices of messages, per user data ..etc), random queries ..etc.

Design goals: high availability, eventual consistency (trade-off strong consistency in favor of high availability), incremental scalability, optimistic replication, “knobs” to tune trade-offs between consistency durability and latency, low total cost of ownership, minimal administration.

Data model: similar to the BigTable data model. Columns are indexed by key, data is stored in column families, and the columns are sorted by value or by timestamp. Super columns allow columns to be added dynamically.

Write operations, a client issues a write request to a random node in the Cassandra cluster. The “partitioner” determines the nodes responsible for the data. Locally, write operations are logged and then applied to an in-memory version. Commit log is stored on a dedicated disk local to the machine.

Write properties: there are no locks in the critical path, we have sequential disk access. It behaves like a write back cache, and we have append support without read ahead. Atomicity guarantee for a key per replica. “Always Writable”, writes accepted even during failures, in that case the write is handed-off to some other node and loaded back to the correct place when node comes back up.

Reads are sent from the client to any node in the Cassandra cluster; then, depending on the knobs, the reads get either the most recent value or a quorum.
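One way to picture such a read knob is a function that consults a configurable number of replicas and returns the value with the newest timestamp among the replies. This is a generic sketch of tunable reads with hypothetical names (`read`, `read_count`), not Cassandra's actual read path or API.

```python
def read(replicas, key, read_count):
    """Ask `read_count` replicas and return the newest value they hold."""
    replies = [replica[key] for replica in replicas[:read_count] if key in replica]
    if not replies:
        return None
    # Each reply is a (timestamp, value) pair; the newest timestamp wins.
    return max(replies, key=lambda reply: reply[0])[1]


# Three replicas; the first one has not yet seen the latest write.
replicas = [{"k": (1, "old")}, {"k": (2, "new")}, {"k": (2, "new")}]
print(read(replicas, "k", read_count=1))  # cheap read, may be stale
print(read(replicas, "k", read_count=2))  # majority of 3, sees the newer write
```

Turning the knob up costs latency but lowers the chance of returning a stale value, which is the trade-off the design goals describe.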

Gossip is used between replicas, using the Scuttlebutt protocol, which has low overhead. Failure detection assigns each node a failure suspicion that increases with time until the node is heard from again.
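The idea of a suspicion level that grows with silence can be sketched as below. The linear formula and all names are assumptions chosen for simplicity; this is not the formula Cassandra uses.

```python
class SuspicionDetector:
    """Suspicion grows with the time since a node was last heard from."""

    def __init__(self, expected_interval=1.0):
        self.expected_interval = expected_interval
        self.last_heard = {}

    def heartbeat(self, node, now):
        """Record that gossip from `node` arrived at time `now`."""
        self.last_heard[node] = now

    def suspicion(self, node, now):
        """Silence measured in units of the expected heartbeat interval."""
        return (now - self.last_heard[node]) / self.expected_interval


detector = SuspicionDetector(expected_interval=1.0)
detector.heartbeat("node-a", now=0.0)
print(detector.suspicion("node-a", now=1.0), detector.suspicion("node-a", now=5.0))
```

Callers can then pick their own threshold on the suspicion value instead of relying on a single hard-coded timeout.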

Lessons learned: add fancy features only when absolutely necessary. Failures are the norm not the exception. You need system-level monitoring. Value simple designs.

Fourth Talk: Towards Decoupling Storage and Computation in Hadoop with SuperDataNodes by George Porter

Hadoop is growing, gaining adoption, and being used in production (Facebook, last.fm, LinkedIn). E.g., Facebook imports 25/day to 1k Hadoop nodes. A key to that growth and efficiency is coupling compute and storage: the benefits include moving computation to data, scheduling, locality that reduces traffic, and map parallelism (“grep”-type workloads).

So, when to couple storage with computation? This is a critical and complicated design decision, and this is not always done right. Examples, Emerging best practices with dedicated clusters. Your data center design may not be based on the needs for Hadoop (adding map/reduce to existing cluster, or a small workgroup who like the programming model such as Pig, Hive, and Mahout).

The goal is to support late binding between storage and computation, and to explore alternative balances between the two (specifically, the extreme point of decoupling storage and compute nodes). An observation from the Facebook deployment is that the scheduler is really good at scheduling tasks onto local nodes for small tasks and bad at achieving rack locality for large tasks.

SuperDataNode approach: key features include a stateless worker tier, and storage node with shared pool of disks under single O/S, and a high bisection bandwidth worker tier.

There has been a lot of talk about the advantages of coupling storage and computation, so what are the advantages of decoupling them? Advantages include: decoupling the amount of storage from the number of worker nodes; more intra-rack bandwidth than inter-rack bandwidth; support for “archival” data, i.e., a subset of data with a low probability of access; increased uniformity for job scheduling and block placement; ease of management, since workers become stateless and SDN management is similar to that of a regular storage node; and replication only for node failures.

Limitations of SDN: scarce storage bandwidth between workers and the SDN; effective throughput with N disks in the SDN (@ 100MB/sec each) gives a 1:N ratio of bandwidth between local and remote disks; the effect on fault tolerance (disk vs. node vs. link failure model); cost; and performance that depends on the workload.

Evaluation compared a baseline Hadoop cluster and an SDN cluster with 10 servers. The results showed that SDN performed better for grep- and sort-like workloads; a bad case was random writers, where Hadoop performed better (the workload was just each worker writing to disk as fast as possible, i.e., 100% parallelism).

[LADIS 2009] Keynote #2 – Some Lessons Learned from Running Amazon Web Services

hussam — Sun, 11 Oct 2009 00:46:16 +0000

Keynote #2 by Marvin Theimer from Amazon. Published abstract and speaker bio.

In his talk, Marvin reflected on experiences building and maintaining applications in data centers. He stressed that none of these issues is surprising by itself, but the very large scale makes all of them possible at once, and that is the surprising point! I really liked this talk.

A nice analogy he gave for building and running data center and cloud services: evolving a Cessna prop-plane into a 747 jumbo jet in flight.

Start with a Cessna prop-plane for cost and timeliness reasons. 4-9’s availability means that you get to land for 52 minutes every year (including scheduled maintenance, refueling, and crash landings). Success implies growth and evolution and rebuilding the plane mid-flight: Passenger capacity goes from 4-person cabin to 747 jumbo wide-body cabin, support for “scale out” means you add jet engines and remove the propellers while flying, testing and safety have to happen while flying!
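The 52-minute figure follows directly from the availability target: four nines leaves 0.01% of the year as the downtime budget. A quick check, assuming a 365-day year (the function name is mine):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(nines):
    """Yearly downtime budget for an availability of `nines` nines."""
    unavailability = 10 ** -nines  # e.g. 4 nines -> 0.0001
    return unavailability * MINUTES_PER_YEAR


for n in (3, 4, 5):
    print(n, "nines:", round(downtime_minutes_per_year(n), 2), "minutes/year")
```

For four nines this comes out to roughly 52.6 minutes per year, which matches the number quoted in the talk.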

Here are the lessons learned:

The unexpected happens! A fuse blows and darkens a set of racks, chillers die in a datacenter and a fraction of the servers go down, an electrical plug bursts into flames, tornadoes or lightning hit a datacenter, a datacenter floods from the roof down, telco connectivity goes down, the DNS provider creates black holes, simultaneous infant mortality strikes newly deployed servers in multiple datacenters, power generation doesn’t start because the ambient temperature is too high, load, etc.

Networking challenges. IP is so deeply embedded in systems that you de facto have to use it. IP networks can have lost packets, duplicate packets, and corrupted packets. Even if you use TCP, your applications still need to worry about lost, duplicate, and corrupted packets. Software (and hardware) bugs can result in consistent loss or corruption of some packets. You have to be prepared for message storms. Client software is sometimes written without a notion of backing off on retries. One might expect that CRCs and the design of TCP catch most of these issues. However, we are running at such a large scale that rare events happen often enough to produce multiple errors. For example, if a switch or some other network hardware erroneously flips the 8th bit of every 64th packet, then at this scale such rare events can happen repeatedly!
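On the point about clients that retry without backing off: a common remedy is exponential backoff with random jitter, so that many clients do not retry in lockstep and create a message storm. The sketch below is a generic version of that technique with parameter names of my choosing; it was not shown in the talk.

```python
import random


def backoff_delays(attempts, base=0.1, cap=10.0, rng=None):
    """Retry delays in seconds: exponential growth, capped, fully jittered."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)   # 0.1, 0.2, 0.4, ... up to cap
        delays.append(rng.uniform(0.0, ceiling))  # jitter spreads clients out
    return delays


print(backoff_delays(5, rng=random.Random(42)))
```

A client would sleep for each delay in turn between retries and give up once the list is exhausted.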

Things you should be able to do without causing outages: adding new hardware, deploying a new version of software, rolling back to a previous version of software, recovering from the absence, loss, or corruption of non-critical data. Losing a mirror of a DBMS, recovering from having lost a mirror of a DBMS, losing a host in its fleet, losing a datacenter, losing network connectivity between data centers. Can we roll back some parts in the middle of upgrading other parts ?

System resources/objects have lives of their own! Resources/objects in a service may live longer than the accounts used to create them. You have to be able to remap them between accounts. Resources/objects may live longer than versions of the service! You have to be able to migrate them forward with minimal or no disruptions. For example, EC2 instances were designed to run for short periods on demand, but customers start using them and keeping instances up for a long time, and this happens often enough such that shooting down long-lived instances will upset the clients. So how can you deal with that ?

Downstream dependencies fail. It’s a service-oriented architecture. The good news is that your service can keep going even if other services become unavailable; the challenge is how to keep going and/or degrade gracefully if you depend on the functionality of downstream services at low levels. Suppose all services are 4-9’s available: if a downstream service fails for 52 minutes, how will you meet your own SLA of failing for no more than 52 minutes? Cascading outages happen: if multiple downstream services fail, how will you handle it? For example, if a storage service fails, two services depending on it can also fail, then more services depending on them fail, and so on. Services need to defend against that.
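The SLA question can be made concrete with a little arithmetic. Under the simplifying assumptions that failures are independent and that your service is down whenever any hard dependency is down, availabilities multiply, so a four-nines service with one four-nines dependency already falls short of four nines. The function names are mine.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def composite_availability(own, downstream):
    """Availability when every listed dependency must be up (independent failures)."""
    result = own
    for availability in downstream:
        result *= availability
    return result


def downtime_minutes(availability):
    """Expected yearly downtime for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


alone = composite_availability(0.9999, [])
with_storage = composite_availability(0.9999, [0.9999])
print(round(downtime_minutes(alone), 1), round(downtime_minutes(with_storage), 1))
```

With a single hard dependency the expected downtime roughly doubles, to about 105 minutes per year, which is why services need to degrade gracefully rather than simply inherit every downstream outage.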

You must be prepared to deal with data corruption. Data corruption happens: flaky hardware, I/O subsystems can lie, software can be wrong, systems evolve, people can screw up. End-to-end integrity checks are a must, as is straightforward data corruption checking. How do you know if your system is operating correctly? Can your design do fsck in < 52 minutes?
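A minimal sketch of an end-to-end integrity check: the writer stores a checksum next to the data, and the reader recomputes it before trusting what comes back, so corruption anywhere in between is detected at the edge. The dictionary store and the names `put`, `get`, and `CorruptionError` are illustrative assumptions.

```python
import hashlib


class CorruptionError(Exception):
    """Raised when stored data no longer matches its checksum."""


def put(store, key, data):
    """Store the bytes together with a checksum computed by the writer."""
    store[key] = (data, hashlib.sha256(data).hexdigest())


def get(store, key):
    """Return the bytes only if they still match the writer's checksum."""
    data, checksum = store[key]
    if hashlib.sha256(data).hexdigest() != checksum:
        raise CorruptionError("checksum mismatch for key %r" % key)
    return data


store = {}
put(store, "photo-1", b"hello world")
print(get(store, "photo-1"))
```

The important design choice is that the check happens at the two ends, not inside any one layer that might itself be the component that lies.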

Keep it simple. It’s 4am on Sunday morning and the service has gone down: can you explain the corner cases of your design to the front-line on-call team over the phone? Can you figure out what’s going on in under 52 minutes? Can you tell whether a corner case in your code caused the crash, and how to fix it if so? Simple brute force is sometimes preferable to elegant complexity. Examples: eventual consistency is considered painful (but sometimes necessary), and P2P can be harder to debug than centralized approaches (but may be necessary). Is it necessary to build your system to handle situations that arise only after product growth, when it is more likely that your system will have changed and been replaced by the time it is big enough to need that handling?

Scale: will your design envelope scale far enough? Do you understand your components well enough? Cloud computing has global reach, services may grow at an astonishing pace, and the overall scale is HUGE! The scale of cloud computing tends to push systems outside their standard design envelopes. The rule of thumb that you must redesign your system every time it grows by 10x implies you must be prepared to redesign early and often.

CAE Trade-Off for Resources. CAE: cost-efficient, available, elastic. You can only pick two of them!

Do not ignore the business model or your TCO. Do you know all the sources of cost? Can you accurately measure them? Do you know all the “dimensions of cost” that will be used in pricing? Can you meter them? Have you thought about ways the system can be abused? How will you resolve billing disputes? All these may affect the design of the service in fundamental ways. This is important to measure even if you think that your revenue will come from ads. For example, some customer figured out that if they store large names in the key part of the key/value store rather than in the value part, they can reduce their cost by 1000x, because S3 only charges for the size of the value, not the key! So you have to think about what you are not charging people for and how they can abuse it.

Elastic Resources: What boundaries to expose? High availability apps require the notion of independent failure zones –> introduce the notion of availability zones (AZ). Concurrent apps want bounded, preferably low message latency and high bandwidths –> introduce notion of cluster affinity to an AZ. The challenges of AZ clustering, clumping effect since everyone will want to be near everyone else (for example, if you ask people to pick an AZ and they don’t care, everyone will end up in AZ1 !!), makes elastic scheduling harder. Fine-tuned applications are the enemy of elasticity, customers will try to divine your intra-AZ topology (co-location on the same rack, etc.) Eventual evolution to different network infrastructures and topologies means you don’t want to expose more than you have to.

Summary and conclusions: The unexpected happens; in large systems even extremely rare events occur with a non-negligible frequency. What’s your story on handling them? Keep it simple: it’s 4am and the clock is ticking. Can you debug what’s going on in your system? Cloud computing is a business: you have to think about cost-efficiency as well as availability and elasticity.

Questions:

Mike Freedman, Princeton University: Which of these issues are specific to infrastructure providers (such as Amazon) as opposed to web service providers such as walmart.com or hotmail?

Answer: Many things are common such as hazards and load. As for other things such as accounting and billing, this is still useful for service providers, then this can at least minimize your running costs and allow you to know where you are spending your money.

Ken Birman, Cornell University: What makes you feel consistency, is it the 4am call or is latency and competitiveness and the added complexity?

Answer: It is the 4am call. When you have systems at large scale, you have to work out all the possible cases in your system and you can not cheat out of it. These corner cases make it hard. Remember that this has to be developed in a timely manner and it is developed by junior developers that are evolving their knowledge and expertise in this.

—: How do you test the resilience of your data centers? Are there people who go and turn off part of your datacenter?

Answer: Essentially yes! You test as much as you can, then you roll out.

Doug Terry, MSR-SV: Shouldn’t the analogy be that you start with a fleet of Cessnas and you want to evolve them into a fleet of jumbo jets in flight without losing all of them together?

Answer: the problem is that you can not parallelize everything. There is some percentage of your code that does not get fixed.

Hakim Weatherspoon, Cornell University: What about embracing failure? Running your systems hot and expecting that nodes will fail?

Answer: That solves some of the existing problems, but newer problems that we don’t know about yet can arise. For example, we never thought that the boot temperature of backup power generators would ever be an issue, but it was! So you can never enumerate all problems.

[LADIS 2009] Technical Session #2 – Applications and Services

hussam — Sat, 10 Oct 2009 22:14:21 +0000

First Talk: Are Clouds Ready for Large Distributed Applications? by Kay Sripanidkulchai

This talk essentially focused on how enterprise applications can be moved to cloud computing settings. The first issue is deployment, which is more complex than just booting up VMs because of data and functionality dependencies. The second issue is availability: enterprise apps are heavily engineered to maximize uptime. According to a published study, current cloud services can expect up to 5 hours of downtime per year. Enterprise customers, however, really expect 1 hour of downtime per year. So how can this gap be bridged? The third issue is that of problem resolution.
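To put the downtime figures in the more familiar "percentage availability" terms, here is a small worked calculation. It assumes a 365-day year; the helper name `availability` is just for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def availability(downtime_hours_per_year):
    """Fraction of the year a service is up, given its yearly downtime in hours."""
    return 1.0 - downtime_hours_per_year / HOURS_PER_YEAR


cloud = availability(5)       # reported figure for current cloud services
enterprise = availability(1)  # what enterprise customers expect

print(f"cloud:      {cloud:.5%}")
print(f"enterprise: {enterprise:.5%}")
```

Under these assumptions, 5 hours per year is roughly 99.94% availability and 1 hour per year is roughly 99.99%, so the gap is about one extra "nine".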

Bridging the availability gap: ideas include: 1) implementing scaling architectures in the cloud, 2) developing APIs to allow multiple clouds to interact with each other so as to develop failover techniques, 3) Live VM migration to mask failures.

As for problem resolution: categories of issues raised about EC2 on the EC2 boards: 10% of topics discussed are feature requests and 56% are user how-to questions. As for problems: 25% cloud error, 64% user error, 11% unknown error. One of the important things enterprise customers want is to know, when something is not running correctly, whether the issue is with the cloud platform, the VM, faulty hardware, or something else. So techniques and tools have to be developed in that regard.

Second Talk: Cloudifying Source Code Repositories: How Much Does it Cost? by Michael Siegenthaler

Cloud computing used to be mainly used by large companies that have the resources that enabled them to build and maintain the datacenters. Now this is accessible to people outside these companies for low costs.

Why move source control to the cloud? Resilient storage, no physical server to administer, and the ability to scale to large communities. They used SVN, which is very popular, stored data on S3 (which has the problem of eventual consistency), and used Yahoo's ZooKeeper (a coordination service) as a lock service. How to measure costs for SVN on S3? Measure the cost of diff files and stored files. A back-of-the-envelope cost analysis shows it is inexpensive even for large projects such as Debian and KDE. A trend to notice is that code repos are getting larger, but the price of storing a GB is decreasing with time.

Architecture: machines talk to front-end servers on EC2, and storage is on S3. The front-end need not be on EC2; the cloud is there mainly for storage. A problem with a naive implementation is that eventual consistency in S3 means that multiple revision numbers can be issued for conflicting updates. For this reason locking is required. The commit process essentially has a hook that acquires a lock from ZooKeeper and fetches the most recent version number. The most recent version is retrieved from S3 (retrying if it is not found due to eventual consistency), then the commit is made, the lock is released, and ZooKeeper increments the version number.
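The commit sequence above can be sketched as follows. This is a toy model, not the authors' implementation: `EventuallyConsistentStore` is a hypothetical in-memory stand-in for S3 that hides freshly written objects for a couple of reads, and a `threading.Lock` plus a counter stand in for the ZooKeeper lock and version number.

```python
import threading


class EventuallyConsistentStore:
    """Toy stand-in for S3: a new object is invisible for the first few reads."""

    def __init__(self, stale_reads=2):
        self._data = {}
        self._hidden = {}  # key -> number of reads that will still miss
        self._stale_reads = stale_reads

    def put(self, key, value):
        self._data[key] = value
        self._hidden[key] = self._stale_reads

    def get(self, key):
        if self._hidden.get(key, 0) > 0:
            self._hidden[key] -= 1
            return None  # not yet visible
        return self._data.get(key)


class Repo:
    def __init__(self, store):
        self.store = store
        self.lock = threading.Lock()  # stand-in for the ZooKeeper lock
        self.latest = 0               # stand-in for the version number in ZooKeeper
        store.put("rev/0", "initial")

    def commit(self, diff, max_retries=10):
        with self.lock:  # one committer at a time, so revision numbers are unique
            base = self.latest
            # Retry until the latest revision is visible in the store.
            for _ in range(max_retries):
                if self.store.get(f"rev/{base}") is not None:
                    break
            else:
                raise RuntimeError("latest revision never became visible")
            new_rev = base + 1
            self.store.put(f"rev/{new_rev}", diff)
            self.latest = new_rev  # bump the version before releasing the lock
            return new_rev


repo = Repo(EventuallyConsistentStore())
print(repo.commit("first change"), repo.commit("second change"))
```

A real deployment would back off between retries and use a distributed lock; the sketch only shows why the lock and the retry loop are both needed.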

Performance evaluation: usage patterns: the Apache foundation has 1 repo for 74 projects, with an average of 1.10 commits per minute and a max of 7 per minute. The Debian community has 506 repos with 1.12 commits per minute in aggregate and a max of 6. These were used as experiment traces. The results showed that as you add more front-end servers on EC2, performance does not suffer from possible lock contention, and this was tried with differing numbers of clients.

Third Talk: Cloud9: A Software Testing Service by Stefan Bucur

There is a need to facilitate automatic testing of programs. Cloud computing can make this have better performance. Testing frameworks should provide autonomy (no human intervention), usability, performance. Cloud9 (http://cloud9.epfl.ch/) is a web service for testing cloud applications.

Symbolic Execution: when testing a function, instead of feeding it input values, send it an input abstraction (say, lambda), and whenever we see a control flow branch (such as an if statement), create a subtree of execution. One idea is to send each of these subtrees to a separate machine and test all possible execution paths at once. A naive approach can have many problems. For example, trees can expand exponentially, so incrementally getting new resources to run on can be problematic. A solution to that is to pre-allocate all needed machines. There are many challenges in parallel symbolic execution in the cloud, such as dynamically load balancing trees among workers and state transfers, along with other problems such as picking the right strategy portfolios.

Preliminary results show that parallel symbolic execution on the cloud can give a better-than-linear improvement over conventional methods and KLEE.
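The branching idea can be illustrated with a toy path explorer. This is a simplified sketch, not Cloud9 or KLEE: it has no constraint solver, so it enumerates every combination of branch outcomes without checking feasibility. The names `program`, `run_with_prefix`, and `explore` are made up for the example. Each decision prefix on the work queue corresponds to a subtree that could be handed to a separate worker.

```python
from collections import deque


def program(branch):
    """Program under test; branch(label) stands in for a test on symbolic input."""
    if branch("x > 0"):
        if branch("x > 100"):
            return "big"
        return "small"
    if branch("x == 0"):
        return "zero"
    return "negative"


class NeedDecision(Exception):
    """Raised when execution reaches a branch the current prefix does not cover."""


def run_with_prefix(prefix):
    """Replay the branch decisions in prefix; stop at the first undecided branch."""
    taken = []

    def branch(label):
        i = len(taken)
        if i >= len(prefix):
            raise NeedDecision(label)
        taken.append((label, prefix[i]))
        return prefix[i]

    result = program(branch)
    return taken, result


def explore():
    """Breadth-first exploration of the execution tree."""
    work = deque([()])  # each item is a tuple of branch decisions (a subtree root)
    paths = []
    while work:
        prefix = work.popleft()
        try:
            taken, result = run_with_prefix(prefix)
        except NeedDecision:
            # Fork: one child subtree per branch outcome.
            work.append(prefix + (True,))
            work.append(prefix + (False,))
        else:
            paths.append((taken, result))
    return paths


for taken, result in explore():
    print(result, taken)
```

In a parallel setting the `work` queue would be distributed, which is where the load-balancing and state-transfer challenges come from.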

[LADIS 2009] Technical Session #1 – Programming Models

hussam — Sat, 10 Oct 2009 14:56:30 +0000

First talk: Cloud-TM: Harnessing the Cloud with Distributed Transactional Memories given by Luis Rodrigues

MapReduce is nice if your data and task fit the model. However, it is unnatural for many scenarios. Another model for programming in the cloud is PGAS (Partitioned Global Address Space). It masks machines as different addresses, but it falls short because programmers do not know ahead of time how many machines their program will run on. D-STM (Distributed Software Transactional Memory) extends the TM abstraction across the boundaries of a single machine.

Research challenges:

Automatic parallelization: extremely hard, but transactional support makes it easier to implement strategies based on the speculative execution portions of the code.
Fault tolerance: only started to be considered by recent D-STMs
Coping with Workload Heterogeneity. STM performance is heavily dependent on the workload, different algorithms exist optimized for different workloads, but this needs to be automated.
Automatic Resource Provisioning.
Persistence ACI vs ACID

Dependable Distributed STM (D2-STM) is a distributed, fully replicated STM that uses atomic broadcast to coordinate replicas. Bloom filters are used to reduce messages. Some preliminary results: speculative replication, a technique that runs potentially conflicting transactions speculatively in order to hide the inter-replica coordination latency; stochastic techniques, under development, for identifying and predicting data access patterns; and thread-level speculation.

They built an application based on their techniques: FenixEDU manages on-line campus activities, is used in production by the Technical University of Lisbon, and is being installed at other sites. It serves 1000s of students. It is a web app with an OO domain model, a relational DBMS to store data, and an object/relational mapping tool to store objects in the DB, and it runs on an STM implementation.
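For readers unfamiliar with Bloom filters, here is a minimal sketch. The notes do not say exactly what D2-STM encodes in them; compactly summarising a set of accessed item identifiers is an assumption made for the example, and the sizes and hash construction are arbitrary. A Bloom filter never gives false negatives, but it can give false positives.

```python
import hashlib


class BloomFilter:
    """Fixed-size set summary: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array stored as one Python integer

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


# Hypothetical use: summarise the items a transaction read in 1024 bits.
read_set = BloomFilter()
for item_id in ["account:17", "account:42", "order:9"]:
    read_set.add(item_id)

print(read_set.might_contain("account:42"))
```

The message carries the fixed-size bit array instead of the full list of identifiers, trading a small false-positive rate (here, possible spurious conflicts) for smaller messages.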

FenixEDU can be run in the cloud. We want programmers to be able to use the OO model they are familiar with, resource management has to be automatic, and consistency is crucial.

Conclusions: D-STMs have many good properties that make them a promising technology to support distributed applications with consistency requirements in the cloud.

Second Talk: Storing and Accessing Live Mashup Content in the Cloud by Krzysztof Ostrowski

Interactive collaborative cloud services currently go through a “centralized” cloud service and users just poll for the updates. In the Cornell Live Objects model, clients collaborate together on the “edge”.

Cloud vs. edge? If everything is on the edge, we have persistence but potentially no consistency. If everything is on the cloud, we have consistency from the clients’ view but no scalability. For example, in Second Life, we can only handle 40 clients/server. The edge is much larger than the cloud: there are many more underutilized machines on the edge than in the cloud.

In Live Objects, we use checkpointed channels, which are abstractions for communication that have proxies on machines and reside partially on the edge and partially in the cloud. Data is in the channel. Channels have a proxy and a network-facing component. An event on a channel is considered delivered only if the proxy declares that and notifies the application above. If an update is received on a checkpointed channel (CC), we get a new CC'. For the programming model, CCs are given types in the programming language that reflect what kind of data is contained in each channel. Channels can have references to other channels, which allows users to build complex structures and enables applications that subscribe to many channels representing many different objects. For example, a shared desktop can have references to the different objects on that desktop.

Conclusion: Checkpointed Channels are a new storage abstraction where users express interest in data regardless of where it resides or how it is stored (in a cloud service or a P2P system). Tremendous opportunity for scaling by splitting data between the cloud and edge.

Questions: questions mostly focused on the synchronized timing of correlated channels. For example, if one channel contains video data and another channel contains audio data, how can the two be synchronized?

Third Talk: A Unified Execution Model for Cloud Computing by Eric Van Hensbergen

In current cloud programming settings, we have two models: infrastructure as a service and platform as a service. That is, either we are given unmanaged machines (EC2) that we are free to use as we want, or we are given a managed platform (Google App Engine and Microsoft Azure) that manages the underlying complexity but limits what we can do. The idea: can we break down this barrier? This is similar to past studies on distributed operating systems. However, the difference here is that they have to be more flexible, elastic, and work at scale.

Many techniques were discussed, such as synthetic file systems, execution and control mechanisms, and aggregation techniques. There is support for BlueGene/P and preliminary support for EC2.

[LADIS 2009] Keynote #1 – Data Serving in the Cloud

hussam — Sat, 10 Oct 2009 13:56:30 +0000

Keynote #1 by: Raghu Ramakrishnan from Yahoo! Research. Published abstract and Speaker bio .

Raghu’s talk focused mostly on how data is stored at large scale for cloud services. His talk was in three parts: the first part discussed general principles, the second part discussed PNUTS (internally called Sherpa) at Yahoo, and in the last part he proposed having a community-driven benchmark targeted at what he called VLSD Data Stores: Very Large Scale Distributed Data Stores.

Here are the “raw” notes I took while he was giving his presentation. Sorry about the roughness.

Two types of cloud services at Yahoo!:

Horizontal (Platform) Clouds Services: e.g., storage, web front, ..etc
Functional Services: e.g., Content Optimization, Search Index, Ads Optimization, ML for spam detection ..etc

Yahoo!’s Cloud: massive user base (> 500M unique users per month) and very high requests per second.

VLSD DS: Very Large Scale Distributed Data Stores. Three types of data stores used in the cloud categorized by focus:

Large data analysis (e.g., Hadoop). Data warehousing, scan oriented workloads, focus on sequential disk I/O. focus on cpu cycles.
Structured record storage (e.g., PNUTS/Sherpa). CRUD, point lookups and short scans, index table, opt for latency.
Blob storage (e.g., MObStore). object retrieval and streaming, scalable file storage, opt for GB storage

In the last 30 years, the world has changed significantly. Things have become more elastic. Customers need scalability, flexible schemas, geographic distribution, high availability, reliable storage. Web serving apps can do without complicated queries and strong transactions. Some consistency is desirable, but not necessarily full ACID.

Typical applications for Yahoo VLSD:

user logins and profiles (changes must not be lost)
events (news alerts, social network activity, ad clicks)
app-specific data (flickr photo edits ..etc)

VLSD data serving stores must address:

Partitioning data across the store. How are partitions determined? Can they be changed easily?
Availability and failure tolerance. What failures are handled?
Replication. How is data replicated? Sync or async, geo or non-geo?

Brewer’s CAP theorem: Consistency, Availability, Partition tolerance. You cannot have all three; you must forgo one. Approaches to handling CAP: “BASE”, no ACID; use a single version of the DB and reconcile later; defer transaction commitment.

PNUTS/Sherpa:

Environment:

Small records (<= 100KB)
Structured records (lots of fields and adding)
Extreme data scale (tens of TB)
Extreme request scale (tens of thousands of reqs/sec)
Low latency globally (20+ datacenters worldwide)
High availability and reliability

What is PNUTS/Sherpa: a parallel database (sharded), with geographic replication and a structured, flexible schema (NO schema evolution; at any point in time each table has a set of fields, but not all records have values for all fields). Hosted and managed. This is PNUTS today. In the future it will add support for indexes and views, to be maintained asynchronously. PNUTS is built on other cloud services such as Tribble for pub/sub messaging and ZooKeeper for consistency.

The actual data is stored on commodity boxes called storage units, and data is broken into tablets. Storage units hold tablets from multiple tables, and a table’s tablets can be spread across multiple storage units. The routers hold maps from tablets to storage units. The same architecture is replicated in many regions, using Tribble for message passing.

Data Model. Per-record ops: get/set/delete, multi-record ops: multiget/scan/getrange, Web service RESTful API

Tablets are hashed for load distribution. Tables can also be ordered, which allows better scans and frequent range queries. In ordered tables the data is ordered inside each tablet, but tablets can be shuffled across storage units. Index maintenance: how to have lots of interesting indexes and views without killing performance? The solution is asynchrony.

Processing reads and updates. An update goes to a router, which routes it to a storage unit (SU). The write is then sent to two message brokers, after which SUCCESS is returned to the SU and the result is passed back to the router. The two message brokers provide persistence (just like a write-ahead log), but their data is garbage collected, so availability and fault tolerance are provided by replication. Reads and multi-reads are straightforward lookups through the router. For bulk loads, pre-allocate tablets to avoid hotspots.
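The difference between hashed and ordered placement can be sketched as follows. This is a toy router, not PNUTS code: the names `hashed_tablet` and `OrderedRouter`, the split points, and the storage unit labels are all made up. It shows why ordered tablets help range queries (a range touches a contiguous run of tablets) while hashing scatters neighbouring keys.

```python
import bisect
import hashlib


def hashed_tablet(key, num_tablets):
    """Hashed placement: spreads load, but adjacent keys land on unrelated tablets."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_tablets


class OrderedRouter:
    """Ordered placement: tablet i holds keys in [split_points[i-1], split_points[i])."""

    def __init__(self, split_points, storage_units):
        assert len(storage_units) == len(split_points) + 1
        self.split_points = sorted(split_points)
        self.storage_units = storage_units  # tablet index -> storage unit (any order)

    def tablet_for(self, key):
        return bisect.bisect_right(self.split_points, key)

    def storage_unit_for(self, key):
        return self.storage_units[self.tablet_for(key)]

    def tablets_for_range(self, low, high):
        """Tablets that must be scanned for keys in [low, high]."""
        return list(range(self.tablet_for(low), self.tablet_for(high) + 1))


# Tablets are ordered by key, but their storage units are shuffled.
router = OrderedRouter(["g", "n", "t"], ["su3", "su1", "su2", "su1"])
print(router.storage_unit_for("alice"))
print(router.tablets_for_range("h", "p"))
```

With hashed placement the same range query would have to contact every tablet, since there is no relationship between key order and tablet number.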

Asynchrony, replication, consistency

Replication from one datacenter to another happens eventually (on the order of 1 or 2 seconds). If copies are updated asynchronously, what can we say about stale copies? ACID guarantees require synchronous updates, which are very expensive. Eventual consistency: copies can drift apart but will eventually converge if the system is allowed to quiesce. Is there a middle ground? Eventual consistency might not be enough.

If a user makes an update in one region, then there is a network partition, then the same user makes an update in another region, what will the final value be? Eventual consistency will give one of the two values, but we want a specific last value.

PNUTS consistency model. Goal: make it easier for apps to reason about updates and cope with asynchrony. What happens to a record with primary key “alice”? Each record has a master, and each record has a version number that changes with updates. Masters can change.

Writes always go to the current version. Non-master locations may hold stale versions. Test-and-set writes are supported as per-record transactions. Reads can return stale versions, but “read-uptodate” returns the most recent version. Among other variations, “read forward” returns records with non-decreasing versions across sequential reads.
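The per-record operations described above can be sketched with a single record that has one master copy and one asynchronously updated replica. This is an illustration under stated assumptions, not the PNUTS API: the class and method names are hypothetical, and replication is triggered by hand.

```python
class Record:
    """One record with a master copy and one (possibly stale) replica copy."""

    def __init__(self, value):
        self.master_value, self.master_version = value, 1
        self.replica_value, self.replica_version = value, 1

    def write(self, value):
        """Writes go to the master and bump the version number."""
        self.master_value = value
        self.master_version += 1
        return self.master_version

    def test_and_set_write(self, expected_version, value):
        """Write only if the record is still at expected_version; else return None."""
        if self.master_version != expected_version:
            return None
        return self.write(value)

    def replicate(self):
        """Asynchronous replication, triggered manually in this sketch."""
        self.replica_value = self.master_value
        self.replica_version = self.master_version

    def read_any(self):
        """Cheap read from the local replica; may be stale."""
        return self.replica_value, self.replica_version

    def read_up_to_date(self):
        """Read the most recent version from the master."""
        return self.master_value, self.master_version

    def read_at_least(self, min_version):
        """Never go backwards: use the replica only if it is new enough."""
        if self.replica_version >= min_version:
            return self.read_any()
        return self.read_up_to_date()


alice = Record("v1")
alice.write("v2")
print(alice.read_any())         # stale replica
print(alice.read_up_to_date())  # master copy
```

Passing the last version a client saw as `min_version` gives the non-decreasing behaviour of sequential reads, while most reads can still be served from the nearby replica.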

Operability: tablets are initially assigned to some storage units. An SU can get hot, so tablets are moved to other storage units. A tablet master (tablet controller) will always know about tablet moves. Consistency techniques: per-record mastering and per-tablet mastering.

Mastering: Alice makes changes mostly on the west coast, so the master for her records is on the west coast. When Alice moves to the east coast, the first few updates are bounced to the west coast, then the tablet master is moved to the east coast. Coping with failures: when a failure happens, mastership is moved to another location; after recovery, mastership can stay in the new location or move back.
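The mastership hand-off can be sketched as a simple policy: forward writes to the current master, and move the master once several consecutive writes have come from the same other region. The threshold of 3 and the names below are assumptions for illustration; the notes only say "first few updates".

```python
class RecordMaster:
    """Tracks which region masters a record and migrates it to follow the writer."""

    def __init__(self, region, move_after=3):
        self.region = region          # region currently holding mastership
        self.move_after = move_after  # assumed threshold, not from the talk
        self._streak_region = None
        self._streak = 0

    def update_from(self, origin):
        """Apply a write issued in `origin`; return the region that applied it."""
        applied_at = self.region
        if origin == self.region:
            # Local write: reset any pending migration.
            self._streak_region, self._streak = None, 0
            return applied_at
        # Remote write: it is bounced to the current master.
        if origin == self._streak_region:
            self._streak += 1
        else:
            self._streak_region, self._streak = origin, 1
        if self._streak >= self.move_after:
            # Enough consecutive remote writes: move mastership to the writer.
            self.region = origin
            self._streak_region, self._streak = None, 0
        return applied_at


master = RecordMaster("west")
print([master.update_from("east") for _ in range(4)], master.region)
```

After the move, Alice's writes are applied locally again, which is the point of the scheme: most writes for a record see local latency.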

Comparing Some Cloud Serving Stores

There are many cloud DB (and NoSQL) systems out there: PNUTS, BigTable, Azure, Cassandra, Megastore, Amazon. How do they compare? Can we have a community-driven benchmark for comparing them?

Baseline: Sharded MySQL, PNUTS, Cassandra, BigTable

Shard Server: the server is Apache + plugin + MySQL. MySQL schema: key varchar(255), value mediumtext. Flexible schema: the value is a blob of key/value pairs; this gives dynamic schemas to compare against.

Pros of sharding: simple, infinitely scalable, low latency, geo-replication. Cons: not elastic (resharding is hard), poor support for load balancing and failover, replication is unreliable (async log shipping).

Azure SDS: a cloud of SQL Server instances. The app partitions data into instance-sized pieces; transactions and queries run within an instance (SDS instance = storage + per-field indexing).

Google MegaStore: transactions across entity groups. An entity group is a set of hierarchically linked records; multiple records within an entity group can be updated transactionally. Built on BigTable.

PNUTS pros and cons: Pros: reliable geo-replication, scalable consistency model, elastic scaling, easy load balancing. Cons: system complexity relative to sharded MySQL (to support geo-replication, consistency, etc.), latency added by the router.

HBase: HBase is like BigTable on top of Hadoop. When you write a record, it is sent to an HRegion server: records are partitioned by column family into HStores, and each HStore contains many MapFiles. All writes to an HStore are applied to a single memcache. Reads consult the MapFiles and the memcache. Memcaches are flushed as MapFiles (HDFS files) when full. Compactions limit the number of MapFiles. Pros: log-based storage for high write throughput, elastic scaling, easy load balancing, column storage for OLAP workloads. Cons: writes are not immediately persisted to disk, reads span multiple disk and memory locations, no geo-replication, latency.
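The write and read paths described above follow the general log-structured pattern, which can be sketched in a few lines. This is a toy in-memory model, not HBase code: the class name `HStoreSketch`, the thresholds, and the use of dicts in place of on-disk MapFiles are all assumptions.

```python
class HStoreSketch:
    """Toy store: writes hit an in-memory buffer, which is flushed to immutable files."""

    def __init__(self, flush_threshold=2, max_files=2):
        self.flush_threshold = flush_threshold
        self.max_files = max_files
        self.memcache = {}   # in-memory buffer for recent writes
        self.map_files = []  # flushed files, oldest first (dicts stand in for files)

    def put(self, key, value):
        self.memcache[key] = value
        if len(self.memcache) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        """Write the full memcache out as a new immutable file."""
        self.map_files.append(dict(self.memcache))
        self.memcache = {}
        if len(self.map_files) > self.max_files:
            self._compact()

    def _compact(self):
        """Merge all files into one; newer files win on duplicate keys."""
        merged = {}
        for f in self.map_files:
            merged.update(f)
        self.map_files = [merged]

    def get(self, key):
        """Check the memcache first, then files from newest to oldest."""
        if key in self.memcache:
            return self.memcache[key]
        for f in reversed(self.map_files):
            if key in f:
                return f[key]
        return None


store = HStoreSketch()
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5), ("e", 6)]:
    store.put(k, v)
print(store.get("a"), len(store.map_files))
```

The sketch makes the listed trade-offs visible: writes are cheap appends to memory (and are lost on a crash before a flush, absent a log), while a read may have to consult several files until compaction merges them.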

Cassandra: Facebook’s storage system. It uses the BigTable data model and uses Dynamo to locate records. Pros: elastic scalability, easy management (peer-to-peer), the BigTable model is nice, flexible schema, columns, etc. Cons: does not support geo-replication and consistency.

The numbers comparing the storage systems:

Setup: 8 cores 2x quad core, 8gb ram, workloads 120 million 1kb records 20 gb per server. Write heavy loads 50/50 read update, read heavy 95/5.

Read latency vs. actual throughput for the read-heavy workload: sharded MySQL is best. HBase did badly (died after 100 ops/sec). Cassandra and PNUTS did well and died at 4500.

Qualitative comparisons: Storage layer: file-based: HBase and Cassandra; MySQL-based: PNUTS and sharded MySQL. Write persistence: writes committed synchronously to disk: PNUTS, Cassandra, and sharded MySQL; writes async: HBase. Read pattern: find record in MySQL (disk or buffer pool): PNUTS, sharded MySQL. Replication: intra-region: HBase and Cassandra; inter- and intra-region: PNUTS (MySQL: not guaranteed). Mapping record to server: router: PNUTS and HBase. Cloud

Main point: push for a community-based benchmark for cloud storage systems, the YCS Benchmark (Yahoo! Cloud Serving Benchmark). Send mail to Raghu and Brian Cooper.

Shadoop: Sherpa + Hadoop. Sherpa is optimized for low-latency, record-level access: B-trees. HDFS is optimized for batch-oriented access: file system.

Questions:

Ken Birman, Cornell University: Have you considered design techniques that will express stability and predictability along with scalability?

Answer: We have not thought about it explicitly, and we have not examined design techniques that avoid behaving wildly at scale. There are some interesting possible techniques, such as in-memory systems. Most developers were worried about availability and performance at scale. Issues of stability at scale are still at an early stage and have not been explored yet.

—: You described how the master moves over. What happens to the records on the master when a crash happens?

Answer: The protocols ensure that when a failure hits the data center holding the record master, one of two things happens: either the master moves cleanly, or writes block. So if you try to write with timeline consistency, you will not make progress. An example of when this could happen: you write in the west, a failure happens, and you move to the east. It depends on the cause of the failure. If it is just because of disk issues, the message bus still contains the data, and when that data reaches the new location it is made the master. The real problem is if the failure happens to the messaging system or to the link. It takes a lot of work to find out whether the failure happened in the messaging system or not.

Doug Terry, MSR: How do you decide what is acceptable to give up to get CAP properties?

Answer: it is crude. Developers ask for what they want and they implement it.

Blogging at SOSP and LADIS 2009

hussam — Sat, 10 Oct 2009 13:38:40 +0000

I arrived last night at Big Sky, Montana, a nice mountain resort covered in snow. I will be attending the 22nd ACM Symposium on Operating Systems Principles (SOSP 2009) as well as the 3rd ACM SIGOPS International Workshop on Large-Scale Distributed Systems and Middleware (LADIS 2009). The programs (LADIS, SOSP) are very promising. I am certainly looking forward to the keynotes and technical sessions. I will try to live-blog the events as much as possible. I haven’t done that before so I don’t know how it’ll go. However, my plan is to have one post per session and update that post after each talk in that session. We’ll see how it goes.