Archive for the ‘Systems’ Category

[LADIS 2009] Technical Session #2 – Applications and Services

October 10th, 2009

First Talk: Are Clouds Ready for Large Distributed Applications? by Kay Sripanidkulchai

This talk focused on how enterprise applications can be moved to cloud computing settings. Three issues were covered. First, deployment: this is more complex than just booting up VMs because of data and functionality dependencies. Second, availability: enterprise apps are heavily engineered to maximize uptime. According to a published study, current cloud services can expect up to 5 hours of downtime per year, whereas enterprise customers really expect 1 hour of downtime per year. How can this gap be bridged? The third issue is problem resolution.

Bridging the availability gap: ideas include 1) implementing scaling architectures in the cloud, 2) developing APIs that let multiple clouds interact with each other so failover techniques can be built, and 3) live VM migration to mask failures.

As for problem resolution: a categorization of issues raised on the EC2 forums shows that 10% of topics are feature requests and 56% are user how-tos. Of reported problems, 25% are cloud errors, 64% user errors, and 11% unknown errors. One important thing enterprise customers want is to know, when something is not running correctly, whether the issue lies with the cloud platform, the VM, faulty hardware, or something else. Techniques and tools have to be developed in that regard.

Second Talk: Cloudifying Source Code Repositories: How Much Does it Cost? by Michael Siegenthaler

Cloud computing used to be accessible mainly to large companies with the resources to build and maintain datacenters. Now it is available to people outside these companies at low cost.

Why move source control to the cloud? Resilient storage, no physical server to administer, and scaling to large communities. The system uses SVN, which is very popular, stores data on S3 (which poses a problem because of eventual consistency), and uses Yahoo!'s ZooKeeper (a coordination service) as a lock service. How do you measure the cost of SVN on S3? Measure the cost per diff file and per stored file. A back-of-the-envelope cost analysis shows it is inexpensive even for large projects such as Debian and KDE. A trend to notice is that code repositories are growing in size, but the price of storing a GB is decreasing over time.

Architecture: client machines talk to front-end servers on EC2, and storage is on S3. The front end need not be on EC2; the cloud is there mainly for storage. A problem with a naive implementation is that eventual consistency in S3 means multiple revision numbers can be issued for conflicting updates, so locking is required. The commit process has a hook that acquires a lock from ZooKeeper and polls for the most recent version number. The most recent revision is retrieved from S3 (retrying if it is not found due to eventual consistency); the commit is then made, the lock is released, and the version number is incremented.
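The commit hook can be sketched as follows. This is a minimal single-process sketch in which a `threading.Lock` stands in for the ZooKeeper distributed lock and an in-memory dict stands in for S3; `CommitCoordinator`, `fetch_latest`, and `commit` are illustrative names, not the system's actual API:

```python
import threading

class CommitCoordinator:
    """Serializes commits the way the described ZooKeeper hook does.
    A threading.Lock stands in for a ZooKeeper distributed lock."""

    def __init__(self):
        self.lock = threading.Lock()       # stand-in for the ZooKeeper lock
        self.store = {}                    # stand-in for S3: revision -> diff
        self.latest = 0                    # latest committed revision number

    def fetch_latest(self):
        # In the real system this reads from S3 and retries until the
        # newest revision actually appears (eventual consistency).
        return self.latest

    def commit(self, diff):
        with self.lock:                    # acquire the (distributed) lock
            rev = self.fetch_latest() + 1  # next revision number
            self.store[rev] = diff         # upload the diff (S3 PUT)
            self.latest = rev              # bump the version counter
        return rev
```

Holding the lock across fetch-latest, upload, and counter bump is what prevents two front ends from issuing the same revision number for conflicting updates.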

Performance evaluation: for usage patterns, the Apache foundation has 1 repo for 74 projects with an average of 1.10 commits per minute and a max of 7 per minute. The Debian community has 506 repos with 1.12 commits per minute in aggregate and a max of 6. These were used as experiment traces. The results showed that as more EC2 front-end servers are added, performance does not degrade due to lock contention; this was tried with differing numbers of clients.

Third Talk: Cloud9: A Software Testing Service by Stefan Bucur

There is a need to facilitate automatic testing of programs, and cloud computing can improve its performance. Testing frameworks should provide autonomy (no human intervention), usability, and performance. Cloud9 is a web service for testing cloud applications.

Symbolic execution: when testing a function, instead of feeding it concrete input values, feed it an input abstraction (say, λ), and whenever control flow branches (e.g., at an if statement), create a subtree of execution. One idea is to send each of these subtrees to a separate machine and test all possible execution paths at once. A naive approach has many problems; for example, the trees can expand exponentially, so incrementally acquiring new resources to run them can be problematic. One solution is to pre-allocate all needed machines. There are many challenges in parallel symbolic execution in the cloud, such as dynamically load-balancing trees among workers and state transfers, along with other problems such as picking the right strategy portfolios.
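A toy sketch of why the execution tree explodes: every branch point forks the current path in two, so n branches yield up to 2^n root-to-leaf paths to distribute among machines (the branch labels below are hypothetical, not from any real program):

```python
def explore(branches, path=()):
    """Enumerate every execution path of a program with the given
    branch points; each branch forks the current path in two."""
    if not branches:
        return [path]                      # a complete root-to-leaf path
    first, rest = branches[0], branches[1:]
    paths = []
    for outcome in (True, False):          # take the branch / fall through
        paths += explore(rest, path + ((first, outcome),))
    return paths

# Three if-statements already give 2**3 = 8 paths to test.
paths = explore(["x > 0", "y < x", "z == y"])
```

In a parallel setting, each recursive call on a subtree is a candidate unit of work to ship to another worker, which is exactly where the load-balancing and state-transfer challenges come from.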

Preliminary results show that parallel symbolic execution in the cloud can give superlinear improvement over conventional methods and KLEE.

Categories: Conferences, Systems Tags:

[LADIS 2009] Technical Session #1 – Programming Models

October 10th, 2009

First talk: Cloud-TM: Harnessing the Cloud with Distributed Transactional Memories given by Luis Rodrigues

MapReduce is nice if your data and task fit the model; however, it is unnatural for many scenarios. Another model for programming in the cloud is PGAS (Partitioned Global Address Space), which masks machines as different addresses, but it falls short because programmers do not know ahead of time how many machines their program will run on. D-STM (Distributed Software Transactional Memory) extends the TM abstraction across the boundaries of a single machine.

Research challenges:

  • Automatic parallelization: extremely hard, but transactional support makes it easier to implement strategies based on speculative execution of portions of the code.
  • Fault tolerance: only recently considered by D-STM designs.
  • Coping with workload heterogeneity: STM performance is heavily workload-dependent, and different algorithms are optimized for different workloads, but choosing among them needs to be automated.
  • Automatic resource provisioning.
  • Persistence: ACI vs. ACID.

Dependable Distributed STM (D2-STM) is a distributed, fully replicated STM that uses atomic broadcast to coordinate replicas, with Bloom filters used to reduce messages. Some preliminary results: speculative replication, a technique that runs potentially conflicting transactions speculatively in order to hide inter-replica coordination latency; stochastic techniques for identifying and predicting data access patterns; and thread-level speculation.
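To make the STM abstraction concrete, here is a minimal single-process sketch of optimistic concurrency control: reads record the version they saw, and commit validates those versions before applying writes. This illustrates the general TM idea the talk builds on, not the D2-STM protocol itself; the `Ref`/`Transaction` names are illustrative:

```python
class Ref:
    """A transactional cell: a value plus a version number."""
    def __init__(self, value):
        self.value, self.version = value, 0

class Transaction:
    def __init__(self):
        self.reads = {}    # ref -> version observed at read time
        self.writes = {}   # ref -> tentative new value

    def read(self, ref):
        if ref in self.writes:             # read-your-own-writes
            return self.writes[ref]
        self.reads[ref] = ref.version
        return ref.value

    def write(self, ref, value):
        self.writes[ref] = value

    def commit(self):
        # Validate: every ref we read must be unchanged since we read it.
        if any(ref.version != v for ref, v in self.reads.items()):
            return False                   # conflict: caller must retry
        for ref, value in self.writes.items():
            ref.value, ref.version = value, ref.version + 1
        return True

def atomically(fn, *refs):
    """Run fn transactionally, retrying until it commits cleanly."""
    while True:
        tx = Transaction()
        fn(tx, *refs)
        if tx.commit():
            return
```

In a D-STM, the commit-time validation would additionally be coordinated across replicas (e.g., via atomic broadcast), which is the latency the speculative techniques above try to hide.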

An application they built based on their techniques: FenixEDU manages the online campus activities of the Technical University of Lisbon, is used there in production, and is being installed at other institutions, serving thousands of students. It is a web app with an OO domain model, a relational DBMS to store data, an object/relational mapping tool to store objects in the database, and it runs on an STM implementation.

FenixEDU can be run in the cloud. We want programmers to be able to use the OO model they are familiar with, resource management has to be automatic, and consistency is crucial.

Conclusions: D-STMs have many good properties that make them a promising technology for supporting distributed applications with consistency requirements in the cloud.

Second Talk: Storing and Accessing Live Mashup Content in the Cloud by Krzysztof Ostrowski

Interactive collaborative cloud services currently go through a “centralized” cloud service, and users just poll for updates. In the Cornell Live Objects model, clients collaborate together on the “edge”.

Cloud vs. edge? If everything is on the edge, we have persistence but potentially no consistency. If everything is in the cloud, we have consistency from the clients’ view but no scalability; for example, Second Life can only handle 40 clients per server. The edge is much larger than the cloud: there are many more under-utilized machines on the edge than in the cloud.

In Live Objects, checkpointed channels are abstractions for communication that have proxies on machines and partially reside on the edge and in the cloud; the data lives in the channel. Channels have a proxy and a network-facing component. An event on a channel is considered delivered only when the proxy declares it so and notifies the application above. If an update is received on a checkpointed channel (CC), we get a new CC`. For the programming model, CCs are given types in the programming language that reflect what kind of data each channel contains. Channels can hold references to other channels, which lets users build complex structures and enables applications that subscribe to many channels representing many different objects. For example, a shared desktop can have references to the different objects on that desktop.
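A minimal sketch of the checkpointed-channel idea: a late subscriber first receives a checkpoint of the current state, then every subsequent event in order (class and method names here are illustrative, not the Live Objects API):

```python
class CheckpointedChannel:
    """Sketch of a checkpointed channel: new subscribers are caught up
    with a checkpoint of the current state, then receive later events."""

    def __init__(self, initial_state):
        self.state = initial_state
        self.subscribers = []

    def subscribe(self, deliver):
        deliver(("checkpoint", self.state))       # catch the client up first
        self.subscribers.append(deliver)

    def update(self, apply_fn, event):
        self.state = apply_fn(self.state, event)  # produces the "new CC`"
        for deliver in self.subscribers:          # then fan the event out
            deliver(("event", event))
```

Delivering the checkpoint before any events is what lets a client join late without replaying the channel's whole history, regardless of whether the state lives in the cloud or on edge peers.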

Conclusion: checkpointed channels are a new storage abstraction in which users express interest in data regardless of where it resides or how it is stored (in a cloud service or a P2P system). There is a tremendous opportunity for scaling by splitting data between the cloud and the edge.

Questions: questions mostly focused on the synchronized timing of correlated channels. For example, if one channel contains video data and another contains audio data, how can the two be synchronized?

Third Talk: A Unified Execution Model for Cloud Computing by Eric Van Hensbergen

In current cloud programming settings we have two models: platform as a service and infrastructure as a service. That is, either we are given unmanaged machines (EC2) that we are free to use as we want, or we are given a managed platform (Google App Engine, Microsoft Azure) that hides the underlying complexity but limits what we can do. The idea: can we break down this barrier? This is similar to past work on distributed operating systems; the difference here is that the system has to be more flexible, elastic, and work at scale.

Many techniques were discussed, such as synthetic file systems, execution and control mechanisms, and aggregation techniques. There is support for BlueGene/P and preliminary support for EC2.

Categories: Conferences, Systems Tags:

[LADIS 2009] Keynote #1 – Data Serving in the Cloud

October 10th, 2009

Keynote #1 by Raghu Ramakrishnan from Yahoo! Research. Published abstract and speaker bio.

Raghu’s talk focused mostly on how data is stored at large scale for cloud services. The talk was in three parts: the first discussed general principles, the second discussed PNUTS (internally called Sherpa) at Yahoo!, and in the last part he proposed a community-driven benchmark targeted at what he called VLSD data stores: Very Large Scale Distributed data stores.

Here are the “raw” notes I took while he was giving his presentation. Sorry about the roughness.

Two types of cloud services at Yahoo!:

  1. Horizontal (Platform) Cloud Services: e.g., storage, web front ends, etc.
  2. Functional Services: e.g., content optimization, search index, ads optimization, ML for spam detection, etc.

Yahoo!’s Cloud: massive user base (> 500M unique users per month) and very high requests per second.

VLSD DS: Very Large Scale Distributed Data Stores. Three types of data stores are used in the cloud, categorized by focus:

  • Large data analysis (e.g., Hadoop): data warehousing, scan-oriented workloads; focus on sequential disk I/O and CPU cycles.
  • Structured record storage (e.g., PNUTS/Sherpa): CRUD, point lookups and short scans, index tables; optimized for latency.
  • Blob storage (e.g., MObStore): object retrieval and streaming, scalable file storage; optimized for GB-scale storage.

In the last 30 years the world has changed significantly; things have become more elastic. Customers need scalability, flexible schemas, geographic distribution, high availability, and reliable storage. Web-serving apps can do without complicated queries and strong transactions; some consistency is desirable, but not necessarily full ACID.

Typical applications for Yahoo VLSD:

  • user logins and profiles (changes must not be lost)
  • events (news alerts, social network activity, ad clicks)
  • app-specific data (Flickr photo edits, etc.)

A VLSD data-serving store must address:

  • Partitioning data across the store: how are partitions determined, and can they be changed easily?
  • Availability and failure tolerance: which failures are handled?
  • Replication: how is data replicated? Sync or async, geo or non-geo.

Brewer’s CAP theorem: Consistency, Availability, Partition tolerance. You cannot have all three and must forgo one. One approach to handling CAP is “BASE”: no ACID; use a single version of the DB and reconcile later; defer transaction commitment.



What the PNUTS/Sherpa workload looks like:

  • Small records (<= 100 KB)
  • Structured records (lots of fields, with new ones being added)
  • Extreme data scale (tens of TB)
  • Extreme request scale (tens of thousands of requests/sec)
  • Low latency globally (20+ datacenters worldwide)
  • High availability and reliability

What is PNUTS/Sherpa? A parallel (sharded) database with geographic replication and a structured, flexible schema (no schema evolution: at any point in time each table has a set of fields, but not all records have values for all fields). It is hosted and managed. This is PNUTS today; in the future it will add support for indexes and views maintained asynchronously. PNUTS is built on other cloud services, such as Tribble for pub/sub messaging and ZooKeeper for consistency.

The actual data is stored on commodity boxes called storage units (SUs), with the data broken into tablets. A storage unit holds tablets from multiple tables, and a table’s tablets can be split across multiple storage units. Routers keep maps from tablets to storage units. The same architecture is replicated in many regions, using Tribble for message passing.

Data model: per-record ops (get/set/delete), multi-record ops (multiget/scan/getrange), and a RESTful web service API.
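An in-memory sketch of this data model with the per-record and multi-record operations. This is a simplification for illustration; the real system exposes these as a RESTful web service over partitioned, replicated storage:

```python
class RecordStore:
    """In-memory sketch of the PNUTS-style record operations."""

    def __init__(self):
        self.records = {}

    # per-record ops
    def get(self, key):
        return self.records.get(key)

    def set(self, key, value):
        self.records[key] = value

    def delete(self, key):
        self.records.pop(key, None)

    # multi-record ops
    def multiget(self, keys):
        return {k: self.records[k] for k in keys if k in self.records}

    def getrange(self, lo, hi):
        # Only cheap on ordered tables; sorted() here keeps the sketch simple.
        return [(k, self.records[k])
                for k in sorted(self.records) if lo <= k <= hi]
```

The deliberately small surface (no joins, no multi-record transactions) is what makes the latency and scaling targets above achievable.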

Tablets can be hashed for load distribution. Tables can also be ordered, which allows better scans and efficient range queries; in ordered tables the data is ordered inside each tablet, but the tablets themselves can be shuffled across storage units. Index maintenance: how do you have lots of interesting indexes and views without killing performance? The solution is asynchrony.
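A quick sketch of why ordered tables make range queries cheap: keeping keys sorted lets a scan do two binary searches and return one contiguous run, instead of touching every record the way a hash-partitioned layout would (`OrderedTablet` is an illustrative simplification of the tablet layout):

```python
import bisect

class OrderedTablet:
    """Keys kept sorted, so a range query touches only the matching run."""

    def __init__(self):
        self.keys, self.values = [], []

    def put(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value          # overwrite existing key
        else:
            self.keys.insert(i, key)        # insert keeping sort order
            self.values.insert(i, value)

    def scan(self, lo, hi):
        # Two binary searches bound the contiguous run of matching keys.
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_right(self.keys, hi)
        return list(zip(self.keys[i:j], self.values[i:j]))
```

Hashed tables give the opposite trade-off: perfectly even load distribution, but a range scan must visit every tablet.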

Processing reads and updates: an update goes to a router, which routes it to a storage unit; the write is then sent to two message brokers, SUCCESS is returned to the SU, and an acknowledgment goes back through the router. The two message brokers provide persistence (just like a write-ahead log), but the data is garbage collected, so availability and fault tolerance are provided by replication. Reads and multireads are straightforward lookups through the router. For bulk loads, tablets are pre-allocated to avoid hotspots.

Asynchrony, replication, consistency

Replication from one datacenter to another happens eventually (on the order of 1 or 2 seconds). If copies are updated asynchronously, what can we say about stale copies? ACID guarantees require synchronous updates, which are very expensive. Eventual consistency: copies can drift apart but will eventually converge if the system is allowed to quiesce. Is there a middle ground? Eventual consistency might not be enough.

If a user updates in one region, then a network partition occurs, then the same user updates in another region, what will the final value be? Eventual consistency will give one of the two values, but we want a specific last value.

The PNUTS consistency model. The goal is to make it easier for apps to reason about updates and cope with asynchrony: what happens to a record with primary key “alice”? Each record has a master, and each record has a version number that changes with updates. Masters can change.

Writes always go to the current master, so non-master locations may hold stale versions. A test-and-set write gives a per-record transaction. Reads can return stale versions, but “read-uptodate” gets the most recent version. Another variation, read-forward, gives you records with non-decreasing versions across sequential reads.
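The per-record version numbers and test-and-set writes can be sketched as follows. This is a single-copy simplification of the master replica; `MasteredRecord` is an illustrative name, not the PNUTS API:

```python
class MasteredRecord:
    """Sketch of PNUTS-style per-record timeline consistency: all writes
    go through one master copy, and each write bumps a version number."""

    def __init__(self, value):
        self.value, self.version = value, 1

    def write(self, value):
        # Ordinary write at the master: install value, advance timeline.
        self.value, self.version = value, self.version + 1
        return self.version

    def test_and_set(self, expected_version, value):
        # Per-record transaction: apply only if nobody wrote in between.
        if self.version != expected_version:
            return False
        self.write(value)
        return True
```

A replica that lags simply reports an older version; “read-uptodate” corresponds to asking the master, while a plain read may return any version already on the local replica’s timeline.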

Operability: tablets are initially assigned to some storage units. An SU can get hot, so its tablets are moved to other storage units. The tablet master (tablet controller) always knows about tablet moves. Consistency techniques: per-record mastering and per-tablet mastering.

Mastering: if Alice makes changes mostly on the west coast, the master for her records is on the west coast. When Alice moves to the east coast, the first few updates are bounced to the west coast, and then the master is moved to the east coast. Coping with failures: when a failure happens, mastership is moved to another location; after recovery, mastership can stay in the new location or move back.

Comparing Some Cloud Serving Stores

There are many cloud DB (and NoSQL) systems out there: PNUTS, BigTable, Azure, Cassandra, Megastore, Amazon’s offerings. How do they compare? Can we have a community-driven benchmark for comparing them?

Baseline: Sharded MySQL, PNUTS, Cassandra, BigTable

Shard server: each server is Apache + plugin + MySQL; the MySQL schema is key varchar(255), value mediumtext. For a flexible schema, the value is a blob of key/value pairs; this provides the dynamic schemas needed for the comparison.
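The “value is a blob of key/value pairs” trick can be sketched like this, with SQLite standing in for a MySQL shard and JSON as one plausible serialization (the talk did not specify the encoding):

```python
import json
import sqlite3

# SQLite stands in for a sharded MySQL server; the schema mirrors the
# talk's `key varchar(255), value mediumtext`.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE records (key VARCHAR(255) PRIMARY KEY, value TEXT)")

def put(key, fields):
    # Any set of fields per record: serialize them all into the value blob.
    db.execute("REPLACE INTO records VALUES (?, ?)", (key, json.dumps(fields)))

def get(key):
    row = db.execute("SELECT value FROM records WHERE key = ?",
                     (key,)).fetchone()
    return json.loads(row[0]) if row else None
```

Because the RDBMS only sees an opaque blob, records in the same table can carry completely different fields, at the cost of losing per-field indexing and queries inside the blob.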

Pros of sharding: simple, infinitely scalable, low latency, geo-replication. Cons: not elastic (resharding is hard), poor support for load balancing and failover, and replication is unreliable (async log shipping).

Azure SDS: a cloud of SQL Server instances. The app partitions data into instance-sized pieces; transactions and queries work within an instance (an SDS instance = storage + per-field indexing).

Google Megastore: transactions across groups. An entity group is a set of hierarchically linked records, and multiple records within an entity group can be updated transactionally. Built on BigTable.

PNUTS pros: reliable geo-replication, scalability, the consistency model, elastic scaling, easy load balancing. Cons: system complexity relative to sharded MySQL (to support geo-replication, consistency, etc.) and latency added by the router.

HBase: HBase is like BigTable on top of Hadoop. When you write a record, it goes to an HRegionServer; records are partitioned by column family into HStores, and each HStore contains many MapFiles. All writes to an HStore are applied to a single memcache; reads consult the MapFiles and the memcache; memcaches are flushed as MapFiles (HDFS files) when full; compactions limit the number of MapFiles. Pros: log-based storage for high write throughput, elastic scaling, easy load balancing, column storage for OLAP workloads. Cons: writes are not immediately persisted to disk, reads go across multiple disks and memory locations, no geo-replication, latency.

Cassandra: Facebook’s storage system. It uses the BigTable data model and Dynamo techniques to locate records. Pros: elastic scalability, easy management (peer-to-peer), and the BigTable model is nice (flexible schema, columns, etc.). Cons: does not support geo-replication or the consistency model.

The numbers comparing the storage systems:

Setup: 8 cores (2x quad-core), 8 GB RAM; workload of 120 million 1 KB records, 20 GB per server. The write-heavy workload is 50/50 read/update; the read-heavy one is 95/5.

Read latency vs. actual throughput for the read-heavy workload: sharded MySQL is best. PNUTS and Cassandra did well, dying at 4500 ops/sec; HBase did badly, dying after 100 ops/sec.

Qualitative comparisons. Storage layer: file-based (HBase and Cassandra) vs. MySQL-based (PNUTS and sharded MySQL). Write persistence: writes committed synchronously to disk in PNUTS, Cassandra, and sharded MySQL; async writes in HBase. Read pattern: find the record in MySQL (disk or buffer pool) for PNUTS and sharded MySQL. Replication: intra-region only (HBase and Cassandra) vs. inter- and intra-region (PNUTS; MySQL not guaranteed). Mapping a record to a server: via a router for PNUTS and HBase.

The main point: a push for a community-based benchmark for cloud storage systems, the YCS Benchmark. Send mail to Raghu and Brian Cooper.

Shadoop: Sherpa + Hadoop. Sherpa is optimized for low-latency record-level access (B-trees); HDFS is optimized for batch-oriented access (file system).


Ken Birman, Cornell University: Have you considered design techniques that will express stability and predictability along with scalability?

Answer: we have not thought about it explicitly, but we have avoided design techniques that behave wildly at scale. There are some interesting possible approaches, such as in-memory systems. Most developers were worried about availability and performance at scale; issues of stability at scale are still at an early stage and have not been fully explored yet.

—: You described how the master moves over. What happens to the records on the master when a crash happens?

Answer: the protocols ensure that when a failure occurs in a datacenter holding a record master, one of two things happens: either mastership moves cleanly, or writes block. So if you try to write with timeline consistency, you will not make progress. An example of when this can happen: you write on the west coast, a failure happens, and you move to the east coast. It depends on the cause of the failure. If it is just a disk issue, the message bus still contains the data, and when that data reaches the new master, it is made the master. The real problem is if the failure happens in the messaging system or in the link; it takes a lot of work to find out whether the failure happened in the messaging system or not.

Doug Terry, MSR: how do you decide what is acceptable to give up to get CAP properties?

Answer: it is crude. Developers ask for what they want and they implement it.

Categories: Conferences, Systems Tags: