Keynote #1 by Raghu Ramakrishnan from Yahoo! Research. Published abstract and speaker bio.
Raghu’s talk focused mostly on how data is stored at large scale for cloud services. His talk was in three parts: the first discussed general principles, the second discussed PNUTS (internally called Sherpa) at Yahoo!, and in the last he proposed a community-driven benchmark targeted at what he called VLSD Data Stores: Very Large Scale Distributed Data Stores.
Here are the “raw” notes I took while he was giving his presentation. Sorry about the roughness.
Two types of cloud services at Yahoo!:
- Horizontal (Platform) Cloud Services: e.g., storage, web front end, etc.
- Functional Services: e.g., content optimization, search index, ads optimization, ML for spam detection, etc.
Yahoo!’s Cloud: massive user base (> 500M unique users per month) and very high requests per second.
VLSD DS: Very Large Scale Distributed Data Stores. Three types of data stores are used in the cloud, categorized by focus:
- Large data analysis (e.g., Hadoop): data warehousing, scan-oriented workloads; focus on sequential disk I/O and CPU cycles.
- Structured record storage (e.g., PNUTS/Sherpa): CRUD, point lookups and short scans, indexed tables; optimized for latency.
- Blob storage (e.g., MObStore): object retrieval and streaming, scalable file storage; optimized for storage per GB.
In the last 30 years, the world has changed significantly. Things have become more elastic. Customers need scalability, flexible schemas, geographic distribution, high availability, reliable storage. Web serving apps can do without complicated queries and strong transactions. Some consistency is desirable, but not necessarily full ACID.
Typical applications for Yahoo VLSD:
- user logins and profiles (changes must not be lost)
- events (news alerts, social network activity, ad clicks)
- app-specific data (Flickr photo edits, etc.)
A VLSD data serving store must address:
- Partitioning data across the store. How are partitions determined? Can they be changed easily?
- Availability and failure tolerance. Which failures are handled?
- Replication. How is data replicated: sync or async, geo or not?
Brewer’s CAP theorem: Consistency, Availability, Partition tolerance. You cannot have all three; you must forgo one. One approach to cope with CAP is “BASE”: no ACID, use a single version of the DB and reconcile later, deferring transaction commitment.
The target workload requirements:
- Small records (<= 100KB)
- Structured records (lots of fields, with fields frequently added)
- Extreme data scale (tens of TB)
- Extreme request scale (tens of thousands of requests/sec)
- Low latency globally (20+ datacenters worldwide)
- High availability and reliability
What is PNUTS/Sherpa: a parallel (sharded) database with geographic replication and a structured, flexible schema (no schema evolution: at any point in time each table has a set of fields, but not all records have values for all fields). It is hosted and managed. This is PNUTS today; in the future it will add support for indexes and views maintained asynchronously. PNUTS is built on other cloud services, such as Tribble for pub/sub messaging and ZooKeeper for consistency.
The actual data is stored on commodity boxes called storage units; data is broken into tablets. A storage unit holds tablets from multiple tables, and a table’s tablets can be split across multiple storage units. Routers keep maps from tablets to storage units. The same architecture is replicated across many regions, using Tribble for message passing.
Data model. Per-record ops: get/set/delete; multi-record ops: multiget/scan/getrange; web service (RESTful) API.
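The per-record and multi-record operations above can be sketched as a tiny in-memory table. The class and method names (`PnutsTable`, `multiget`, `getrange`) are illustrative, not the real client API:

```python
# Hypothetical in-memory sketch of a PNUTS-style record API (names assumed).

class PnutsTable:
    """A single table: flexible schema, per-record get/set/delete."""

    def __init__(self):
        self._records = {}  # key -> dict of fields

    def set(self, key, record):
        self._records[key] = dict(record)

    def get(self, key):
        return self._records.get(key)

    def delete(self, key):
        self._records.pop(key, None)

    def multiget(self, keys):
        # Multi-record read: missing keys come back as None.
        return {k: self._records.get(k) for k in keys}

    def getrange(self, start, end):
        # Range scan over keys in [start, end); in PNUTS this needs an
        # ordered (not hashed) table to be efficient.
        return {k: v for k, v in sorted(self._records.items())
                if start <= k < end}
```

Records are just dicts of fields, matching the flexible-schema idea that not all records carry all fields.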
Tablets are hash-partitioned for load distribution. Tables can also be ordered, which allows better scans and frequent range queries; in an ordered table the data is ordered inside each tablet, but tablets can be shuffled across storage units. Index maintenance: how do you have lots of interesting indexes and views without killing performance? The solution is asynchrony.
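A minimal sketch of the two partitioning schemes, assuming MD5 for hash partitioning and simple split-point boundaries for ordered tables (both choices are illustrative, not what PNUTS actually uses):

```python
import bisect
import hashlib

def hash_tablet(key, num_tablets):
    # Hash-partitioned table: spreads load evenly, but adjacent keys
    # land on unrelated tablets, so range scans are impractical.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_tablets

class OrderedRouter:
    # Ordered table: tablet i holds the keys between split points
    # boundaries[i-1] and boundaries[i]; the router keeps this map.
    def __init__(self, boundaries):
        self.boundaries = sorted(boundaries)

    def tablet_for(self, key):
        return bisect.bisect_right(self.boundaries, key)

    def tablets_for_range(self, start, end):
        # A range scan touches a contiguous run of tablets.
        return list(range(self.tablet_for(start), self.tablet_for(end) + 1))
```

The contiguous-run property in `tablets_for_range` is exactly what makes ordered tables good for scans and range queries.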
Processing reads and updates: an update goes to a router, which routes it to a storage unit; the write is then sent to two message brokers, after which SUCCESS is returned to the storage unit and an acknowledgment goes back to the router. The two message brokers provide persistence (just like a write-ahead log), but their data is garbage collected, so long-term availability and fault tolerance are provided by replication. Reads and multireads are straightforward lookups through the router. For bulk loads, pre-allocate tablets to avoid hotspots.
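The write path can be sketched roughly as follows; `Broker` and `StorageUnit` are assumed names, and the broker logs stand in for the garbage-collected, WAL-like persistence layer:

```python
# Toy sketch of the PNUTS write path (class names assumed).

class Broker:
    """One message broker; its log acts like a write-ahead log."""
    def __init__(self):
        self.log = []

    def publish(self, msg):
        self.log.append(msg)

class StorageUnit:
    def __init__(self, brokers):
        self.brokers = brokers  # two brokers for redundancy
        self.data = {}

    def write(self, key, value):
        # Persist to both message brokers first (WAL-style), then apply
        # locally and acknowledge success back toward the router.
        for b in self.brokers:
            b.publish((key, value))
        self.data[key] = value
        return "SUCCESS"
```

Reads skip the brokers entirely: the router just looks up the tablet's storage unit and reads `data` there.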
Asynchrony, replication, consistency
Replication from one datacenter to another happens eventually (on the order of 1 or 2 seconds). If copies are asynchronously updated, what can we say about stale copies? ACID guarantees require synchronous updates, which are very expensive. Eventual consistency: copies can drift apart but will eventually converge if the system is allowed to quiesce. Is there a middle ground? Eventual consistency might not be enough.
If a user updates in one region, then a network partition occurs, then the same user updates in another region, what will the final value be? Eventual consistency will give one of the two values, but we want a specific last value.
PNUTS consistency model. Goal: make it easier for apps to reason about updates and cope with asynchrony. What happens to a record with primary key “alice”? Each record has a master, and each record has a version number that changes with updates. Masters can change.
Writes always go to the current master copy. Non-master locations may hold stale versions. Test-and-set writes are supported as per-record transactions. Reads can return stale versions, but “read-uptodate” gets the most recent version. Other variations: “read-forward” gives you non-decreasing versions across sequential reads.
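A toy model of this per-record timeline consistency, with assumed class names; in the real system propagation happens via the message brokers rather than an explicit `replicate()` call:

```python
# Sketch of PNUTS-style per-record timeline consistency (names assumed).

class Replica:
    def __init__(self):
        self.value, self.version = None, 0

class Record:
    def __init__(self, n_replicas=3, master=0):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.master = master

    def write(self, value):
        # All writes go through the master, which bumps the version.
        m = self.replicas[self.master]
        m.value, m.version = value, m.version + 1
        return m.version

    def test_and_set(self, expected_version, value):
        # Single-record transaction: write only if nobody wrote in between.
        m = self.replicas[self.master]
        if m.version != expected_version:
            return None
        return self.write(value)

    def read_any(self, i):
        # May return a stale version from a non-master replica.
        r = self.replicas[i]
        return r.value, r.version

    def read_uptodate(self):
        # Served by the master: always the most recent version.
        m = self.replicas[self.master]
        return m.value, m.version

    def replicate(self):
        # Async propagation: copy master state to the other replicas.
        m = self.replicas[self.master]
        for r in self.replicas:
            r.value, r.version = m.value, m.version
```

The version numbers give each record a single timeline of updates, which is exactly what lets an app reason about “how stale” a read was.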
Operability: tablets are initially assigned to some storage units. A storage unit can get hot, so tablets are moved to other storage units. The tablet controller always knows about tablet moves. Consistency techniques: per-record mastering and per-tablet mastering.
Mastering: Alice makes changes mostly on the west coast, so the master for her records is on the west coast. When Alice moves to the east coast, the first few updates are bounced to the west coast, then the record master is moved to the east coast. Coping with failures: when a failure happens, mastership is moved to another location; after recovery, mastership can stay in the new location or move back.
Comparing Some Cloud Serving Stores
Many cloud DB (and NoSQL) systems are out there: PNUTS, BigTable, Azure, Cassandra, Megastore, Amazon’s offerings. How do they compare? Can we have a community-driven benchmark for comparing them?
Baseline: Sharded MySQL, PNUTS, Cassandra, BigTable
Shard server: each server is Apache + plugin + MySQL; the MySQL schema is key varchar(255), value mediumtext. Flexible schema: the value is a blob of key/value pairs, so dynamic schemas can be compared.
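The blob-of-key/value-pairs trick can be sketched like this, using JSON as a stand-in for the actual blob encoding (an assumption) and a dict for the (key, value) MySQL table. Note that updating one field requires a read-modify-write of the whole blob:

```python
import json

# Stand-in for the MySQL table: key varchar(255) -> value mediumtext.
table = {}

def put_field(key, field, val):
    # Read the existing blob, modify one field, write the blob back.
    fields = json.loads(table.get(key, "{}"))
    fields[field] = val
    table[key] = json.dumps(fields, sort_keys=True)

def get_field(key, field):
    return json.loads(table.get(key, "{}")).get(field)
```

This is what makes the sharded-MySQL baseline comparable to the flexible-schema stores: the database only sees an opaque text value, so fields can be added without schema changes.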
Pros of sharding: simple, infinitely scalable, low latency, geo-replication. Cons: not elastic (resharding is hard), poor support for load balancing and failover, replication unreliable (async log shipping).
Azure SDS: a cloud of SQL Server instances. The app partitions data into instance-sized pieces; transactions and queries work within an instance (an SDS instance = storage + per-field indexing).
Google Megastore: transactions across groups; an entity group is a set of hierarchically linked records, and you can transactionally update multiple records within an entity group. Built on BigTable.
PNUTS pros: reliable geo-replication, scalability, the consistency model, elastic scaling, easy load balancing. Cons: system complexity relative to sharded MySQL (to support geo-replication, consistency, etc.), and latency added by the router.
HBase: HBase is like BigTable on top of Hadoop. When you write a record, it goes to an HRegionServer: records are partitioned by column family into HStores, and each HStore contains many MapFiles. All writes to an HStore are applied to a single memcache; reads consult the MapFiles and the memcache; memcaches are flushed as MapFiles (HDFS files) when full; compactions limit the number of MapFiles. Pros: log-based storage for high write throughput, elastic scaling, easy load balancing, column storage for OLAP workloads. Cons: writes are not immediately persisted to disk, reads go across multiple disks and memory locations, no geo-replication, latency.
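That HStore write/read/compaction cycle is essentially a log-structured merge design; a toy sketch, with the flush threshold and class name assumed:

```python
# Toy sketch of an HBase HStore (names and flush threshold assumed).

class HStore:
    def __init__(self, flush_at=3):
        self.memcache = {}   # in-memory write buffer
        self.mapfiles = []   # immutable flushed files, newest last
        self.flush_at = flush_at

    def put(self, key, value):
        self.memcache[key] = value
        if len(self.memcache) >= self.flush_at:
            # Flush the buffer as a new immutable MapFile ("HDFS file").
            self.mapfiles.append(dict(self.memcache))
            self.memcache = {}

    def get(self, key):
        # Reads consult the memcache first, then MapFiles newest-first.
        if key in self.memcache:
            return self.memcache[key]
        for mf in reversed(self.mapfiles):
            if key in mf:
                return mf[key]
        return None

    def compact(self):
        # Merge all MapFiles into one to bound per-read file lookups.
        merged = {}
        for mf in self.mapfiles:
            merged.update(mf)
        self.mapfiles = [merged]
```

The sketch also shows the stated cons: a crash before flush loses the memcache, and without compaction a read may touch many files.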
Cassandra: Facebook’s storage system. It uses the BigTable data model and uses Dynamo to locate records. Pros: elastic scalability, easy management (peer-to-peer), the BigTable model is nice (flexible schema, columns, etc.). Cons: does not support geo-replication or the consistency model.
The numbers comparing the storage systems:
Setup: 8 cores (2x quad core), 8 GB RAM; workloads of 120 million 1 KB records, 20 GB per server. Write-heavy workload: 50/50 read/update; read-heavy: 95/5.
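Workload mixes like these can be generated YCSB-style; a sketch with assumed function names and a uniform key distribution (the real benchmark also supports skewed distributions):

```python
import random

def make_workload(n_ops, read_fraction, n_keys, seed=0):
    # Each operation is a read or an update, drawn by the given ratio,
    # against a uniformly chosen key.
    rng = random.Random(seed)
    ops = []
    for _ in range(n_ops):
        kind = "read" if rng.random() < read_fraction else "update"
        ops.append((kind, "user%d" % rng.randrange(n_keys)))
    return ops

read_heavy = make_workload(10000, 0.95, 1000)   # 95/5 read/update
write_heavy = make_workload(10000, 0.50, 1000)  # 50/50 read/update
```

Replaying the same seeded operation list against each store is what makes latency-vs-throughput numbers comparable across systems.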
Read latency vs. actual throughput for the read-heavy workload: sharded MySQL was best; PNUTS and Cassandra did well, topping out around 4500 ops/sec; HBase did badly (died after 100 ops/sec).
Qualitative comparisons:
- Storage layer: file-based (HBase, Cassandra); MySQL-based (PNUTS, sharded MySQL).
- Write persistence: writes committed synchronously to disk (PNUTS, Cassandra, sharded MySQL); async writes (HBase).
- Read pattern: find the record in MySQL, on disk or in the buffer pool (PNUTS, sharded MySQL).
- Replication: intra-region (HBase, Cassandra); inter- and intra-region (PNUTS; sharded MySQL, but not guaranteed).
- Mapping records to servers: via router (PNUTS, HBase).
The main point: a push for a community-driven benchmark for cloud storage systems, the YCS Benchmark (Yahoo! Cloud Serving Benchmark). Send mail to Raghu or Brian Cooper.
Shadoop: Sherpa + Hadoop. Sherpa is optimized for low-latency record-level access (B-trees); HDFS is optimized for batch-oriented access (file system).
Ken Birman, Cornell University: Have you considered design techniques that express stability and predictability along with scalability?
Answer: we have not thought about it explicitly, but we have avoided design techniques that behave wildly at scale. There are some interesting possible techniques, such as in-memory systems. Most developers were worried about availability and performance at scale. Issues of stability at scale are still at an early stage and have not been explored yet.
—: You described how the master moves over. What happens to the records on the master when a crash happens?
Answer: the protocols ensure that when a failure happens in the datacenter holding a record’s master, one of two things happens: either mastership moves cleanly, or writes block. So if you try to write with timeline consistency, you will not make progress. An example of when this could happen: you write on the west coast, a failure happens, and you move to the east coast. It depends on the cause of the failure. If it is just a disk issue, the message bus still contains the data, and when that data reaches the new master, it becomes the master. The real problem is if the failure hits the messaging system or the link. It takes a lot of work to determine whether the failure happened in the messaging system or not.
Doug Terry, MSR: how do you decide what is acceptable to give up among the CAP properties?
Answer: it is crude. Developers ask for what they want and they implement it.