[LADIS 2009] Technical session #3 – Storage
First Talk: Consistency without concurrency control by Marc Shapiro
This seemed like an interesting piece of work. Unfortunately i came in a bit late from the break and so my writing is sloppy and doesn’t do it much justice. However the paper about CRDTs and TreeDoc has been published in ICDCS.
Problem motivation: TreeDoc is a storage structure that uses binary tree encoding to address and store data. Inserting data is done by adding leaves to the tree. Reading the document consists of reading the binary tree using an “In Order” traversal. Deleting portions of the tree involves marking nodes with tombstones. However, trees can grow very badly, so removing deleted nodes and “rebalancing” the tree is needed. However, now after the rebalancing the tree addresses do not have the same meaning as before, so incoming updates might be inserted in the wrong location. So how can we agree on current addresses without concurrency control.
Tree located at two types of sites: Core and Nebula. The core is a smaller group that runs 2-phase commit to manage updates. the Nebula is a larger set of remote sites that do not run a consistency protocol. Catch-up protocol: if a core and nebula are networked partitioned, core proceeds with updates and buffers operations, let’s say that the nebula also gets some updates and buffers them. Then when the nebula gets the updates from the core, and replays it and the replays its own operations.
main point: There is a need for useful data structures that support operations that commute. The commutativity gives us convergence between multiple sites without concurrency control. TreeDoc is an example of such data structure. The main point with such data structures is that we should take care of garbage collection because it becomes a big issue.
Second Talk: Provenance as First Class Cloud Data by Kiran-Kumar Muniswamy-Reddy
This talk gave motivation for why would provenance be useful in cloud computing services. The speaker argued that provenance can allow us to reason better about the data from cloud services. The speaker argued that native support for provenance in cloud services will be beneficial.
Provenance tells us where did the data come from, its dependencies, and origins. Provenance is essentially a DAG that captures links between objects. Motivating example applications: web-search vs. cloud-search: both have tons of resources, however web search uses hyperlinks to infer dependencies, while no such thing exists for cloud-search. Provenance can provide a solution for that, and this has been argued for in a previous paper by Shah in usenix ’07. Another example, pre-fetching. Provenance can tell us which documents are related to each other, and this allows you to pre-fetch related items for performance. Other examples include ACLs and auditing apps.
Requirements for provenance: consistency, long-term persistence, queryable, security, coordinate compute facilities and storage facilities.
Third Talk: Cassandra – A Decentralized Structured Storage System by Prashant Malik
Why Cassandra? Lots of data (copies of messages, reverse indices of messages, per user data ..etc), random queries ..etc.
Design goals: high availability, eventual consistency (trade-off strong consistency in favor of high availability), incremental scalability, optimistic replication, “knobs” to tune trade-offs between consistency durability and latency, low total cost of ownership, minimal administration.
Data model: similar to the BigTable data model. columns are indexed by key, data is stored in column families, and the columns are sorted by value or by timestamp. Super columns allow columns to be added dynamically.
Write operations, a client issues a write request to a random node in the Cassandra cluster. The “partitioner” determines the nodes responsible for the data. Locally, write operations are logged and then applied to an in-memory version. Commit log is stored on a dedicated disk local to the machine.
Write properties: there are no locks in the critical path, we have sequential disk access. It behaves like a write back cache, and we have append support without read ahead. Atomicity guarantee for a key per replica. “Always Writable”, writes accepted even during failures, in that case the write is handed-off to some other node and loaded back to the correct place when node comes back up.
Reads are sent from the client to any node in the cassandra cluster, and then depending about the knobs the reads either get the most recent value or a quorrum.
Gossip is used between replicas using the Scuttlebutt protocol which has low overhead. Failure detection assigns a failure suspicion to nodes that increases with time until you hear again from users.
Lessons learned: add fancy features only when absolutely necessary. Failures are the norm not the exception. You need system-level monitoring. Value simple designs.
Fourth Talk: Towards Decoupling Storage and Computation in Hadoop with SuperDataNodes by George Porter
Hadoop is growing, gaining adopting, and used in production (Facebook, last.fm, linked in). E.g., facebook imports 25/day to 1k hadoop nodes. A key to that growth and efficiency relies on coupling compute and storage: benefits of moving computation to data, scheduling, locality reduce traffic, map parallelism (“grep” type workload).
So, when to couple storage with computation? This is a critical and complicated design decision, and this is not always done right. Examples, Emerging best practices with dedicated clusters. Your data center design may not be based on the needs for Hadoop (adding map/reduce to existing cluster, or a small workgroup who like the programming model such as Pig, Hive, and Mahout).
Goal is to support late binding between storage and computation. Explore alternative balances between the two (specifically explore the extreme point of decoupling storage and compute nodes). An observation from the Facebook deployment is that the scheduler is really good at scheduling nodes to local nodes for small tasks and bad for scheduling them in rack-locality for large tasks.
SuperDataNode approach: key features include a stateless worker tier, and storage node with shared pool of disks under single O/S, and a high bisection bandwidth worker tier.
There has been alot of talk about advantages of coupling storage and computation, what are the advantages of decoupling them. Advantages include, decoupling amount of storage from number of worker nodes. More intra-rack bandwidth than inter-rack bandwidth. Support for “archival” data, subset of data with low probability of access. Increased uniformity for job scheduling and block placement. Ease of management, workers become stateless; SDN management similar to that of a regular storage node. Replication only for node failures.
Limitations of SDN, scarce storage bandwidth between workers and SDN. Effective throughput with N disks in SDN (@ 100MB/sec each) 1:N ration of bandwidth between local and remote disks. Effect on fault -tolerance. Disk vs Node vs Link failure model. Cost. Performance depends on the work loads.
Evaluation compared a baseline hadoop cluster and an SDN cluster with 10 servers. The results showed that SDN performed better for grep and sort like workloads, and a bad case was random writers were hadoop performed better (workload was just each worker write to disk as fast as possible .. 100% parallelism).