Archive for the ‘Systems’ Category

Erlang or Stackless Python

May 31st, 2010

I recently wrote a simple, event-driven simulation framework to let me quickly prototype and test distributed protocols. Unfortunately, it did not scale in the number of messages simulated, so I am planning to re-implement it.

To keep with the spirit of rapid prototyping, I want to use a dynamically typed language. My final two candidates are Erlang and Stackless Python (or rather PyPy). There certainly is a lot of buzz around Erlang these days: My advisor is very enthusiastic about it, and after reading through the tutorial and dummy protocols, I can see why. On the other hand, Stackless Python has the familiar syntax and the huge library of modules.

I read through many posts comparing the two and I finally decided to stick with Python for now. I will code up the framework (hopefully this week) and report back on my findings. However, I am personally still interested in coding something in Erlang, so who knows :-)
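The post doesn't show the framework itself, but the core of an event-driven simulator of the kind described is small enough to sketch. Below is a minimal, hypothetical discrete-event core in plain Python (names like `Simulator`, `ping`, and `pong` are my own illustration, not the author's code): events sit in a heap ordered by timestamp, and the loop pops and fires them in order.

```python
import heapq

class Simulator:
    """Minimal discrete-event core: schedule (time, callback) pairs on a heap."""
    def __init__(self):
        self.now = 0.0
        self._queue = []
        self._seq = 0  # tie-breaker so heapq never has to compare callbacks

    def schedule(self, delay, callback, *args):
        heapq.heappush(self._queue, (self.now + delay, self._seq, callback, args))
        self._seq += 1

    def run(self, until=float("inf")):
        while self._queue and self._queue[0][0] <= until:
            self.now, _, callback, args = heapq.heappop(self._queue)
            callback(*args)

# Toy "protocol": node 0 pings node 1, which pongs back after 1 time unit.
log = []
sim = Simulator()

def ping(src, dst):
    log.append((sim.now, "ping", src, dst))
    sim.schedule(1.0, pong, dst, src)

def pong(src, dst):
    log.append((sim.now, "pong", src, dst))

sim.schedule(0.0, ping, 0, 1)
sim.run()
```

The scaling trouble mentioned above typically comes from the per-event overhead of a design like this, which is exactly where lightweight processes (Erlang actors or Stackless tasklets) become attractive.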

Categories: Systems Tags:

[LADIS 2009] Technical Session #4 – Monitoring and Repair

October 11th, 2009

Second Talk: A case for the accountable cloud by Andreas Haeberlen

Three cloud stories .. a threatening cloud, a promising cloud, and a nice cloud :-)

The problem with current clouds is that the user does not know what the cloud service provider is doing with the customer’s code and data. Conversely, from the cloud service provider’s perspective, the operator does not know what the code it runs for customers is supposed to do.

Alice is the customer running a service on the cloud owned and operated by Bob.

A solution: what if we had an oracle that Alice and Bob could ask about cloud problems? We want completeness (if something is faulty, we will know), accuracy (no false positives), and verifiability (the oracle can prove its diagnosis is correct).

Idea: make the cloud accountable to Alice and Bob. The cloud records its actions in a tamper-evident log; Alice and Bob can audit the log and use it to construct evidence that a fault does or does not exist.
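The standard construction behind a tamper-evident log is a hash chain, where each entry's hash covers the previous entry's hash, so altering any recorded action breaks every hash after it. A minimal sketch (my own illustration of the general idea, not the talk's actual design; `chain_append`, `audit`, and the `b"genesis"` seed are hypothetical names):

```python
import hashlib

def chain_append(log, action, prev_hash):
    """Append an action; the entry's hash covers the previous entry's hash."""
    h = hashlib.sha256(prev_hash + action.encode()).digest()
    log.append((action, h))
    return h

def audit(log, genesis=b"genesis"):
    """Recompute the chain; any tampered entry invalidates all later hashes."""
    prev = genesis
    for action, h in log:
        if hashlib.sha256(prev + action.encode()).digest() != h:
            return False
        prev = h
    return True

log = []
prev = b"genesis"
for action in ["write x=1", "read x", "write x=2"]:
    prev = chain_append(log, action, prev)

ok_before = audit(log)
log[1] = ("read y", log[1][1])  # tamper with a recorded action
ok_after = audit(log)
```

In the accountability setting, Alice periodically fetches the latest chain hash as a commitment, which prevents the cloud from silently rewriting history.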

Discussion: 1) Isn’t this too pessimistic? Bob isn’t malicious .. maybe, but Bob can get hacked, or things can just go wrong. 2) Shouldn’t Bob use fault tolerance instead? Yes, whenever we can, but masking faults is never perfect; we still need to check. 3) Why would a provider want to deploy this? The feature would be attractive to prospective customers, and helpful for support. 4) Are these the right guarantees? Completeness (no false negatives) could be relaxed to probabilistic completeness; verifiability could be relaxed to only providing some evidence; accuracy (no false positives) cannot be relaxed, because we need confidence when we rule out problems.

A call to action: cloud accountability should deliver provable guarantees, work for most cloud apps, require no changes to application code, cover a wide spectrum of properties, and keep overhead low.

Work in progress: Accountable Virtual Machines (AVM). Goal: provide accountability for arbitrary unmodified software. The cloud records enough data to enable deterministic replay; Alice can replay the log with a known-good copy of the software and audit any part of the original execution.

Conclusion: current cloud designs carry risks for both customers and providers (mainly because of the split-administration problem). Proposed solution: the accountable cloud. Lots of research opportunities.

Third Talk: Learning from the Past for Resolving Dilemmas of Asynchrony by Paul Ezhilchelvan

In an asynchronous model you cannot bound message delivery time, or even message processing time at a machine. In a probabilistic synchronous model, however, we can bound these times within a certain probability via proactive measurements. The central hypothesis of the new model is that, most of the time, past performance is indicative of near-future performance (i.e. delay in the past is indicative of delay in the future).

Design steps: do proactive measurements, use them to establish synchrony bounds, assign time bounds accordingly, try those bounds in practice, and enable exceptions for when they are violated.
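The measurement-to-bound step above can be sketched very simply: collect recent delay samples and pick a bound that covered a target fraction of them; under the model's hypothesis that the recent past predicts the near future, the bound then holds with roughly that probability. This is my own minimal illustration (the `delay_bound` function and the percentile approach are assumptions, not the talk's actual algorithm):

```python
def delay_bound(samples, coverage):
    """Pick a timeout that covered `coverage` of past delay samples.

    Under the model's hypothesis (past delay predicts near-future delay),
    the returned bound should hold with ~`coverage` probability.
    """
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(coverage * len(ordered)))
    return ordered[index]

past = [10, 12, 11, 15, 13, 200, 12, 11, 14, 12]  # ms; one outlier
bound = delay_bound(past, coverage=0.8)
```

Messages that arrive later than the chosen bound are exactly the "exceptions" the on-going work below is about.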

On-going work: development of exceptions (to deal with the exceptional cases where mistakes are detected). Open environments are asynchronous; use crash signals for notification of extreme unexpected behavior.

Categories: Conferences, Systems Tags:

[LADIS 2009] Technical Session #5 – Communication

October 11th, 2009

First Talk: Bulletin Board: A Scalable and Robust Eventually Consistent Shared Memory over a Peer-to-Peer Overlay by Gregory Chockler

WebSphere Virtual Enterprise (WVE) is a product for managing resources in a data center. The product is a distributed system whose nodes and controllers need to communicate and share information, and BulletinBoard (BB) is used for that. BB is a platform service for facilitating group-based information sharing in a data center. It is a critical component of WVE; its primary application is monitoring and control, but the designers believe it could be useful for other weakly consistent services.

Motivation & contribution: the prior implementation of group communication was built internally and was not designed to grow tenfold. It was based on Virtually Synchronous group communication and suffered robustness, stability, and high runtime overheads as the system grew beyond several hundreds of processes; its static hierarchy also introduced configuration problems. The goal was to provide a new implementation resolving the scaling and stability issues of the prior one (and to implement it in a short time, a constraint with important implications on the design decisions).

BB supports a write-sub (write-subscribe) service model, a cross between pub-sub systems, shared memory systems, and traditional group communication systems. From pub-sub it takes asynchronous communication through topics. From shared memory it takes overwrite semantics, a single writer per topic and process, and notifications that are snapshots of state.

Consistency semantics (single topic): PRAM consistency, meaning notified snapshots are consistent with each writing process's order of writes. A note made was that the developers who built services on top of BB turned out to understand these consistency semantics.

Liveness semantics (single topic). Eventual inclusion: eventually, each write by a correct and connected process is included in the notified snapshot. Eventual exclusion: failed processes are eventually excluded from updates.
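The write-sub model described above fits in a few lines: one value per (topic, writer) with overwrite semantics, and subscribers notified with snapshots of the topic's state. A minimal single-process sketch, purely my own illustration of the service model (the class and method names are hypothetical; the real BB is distributed over SON):

```python
from collections import defaultdict

class BulletinBoard:
    """Write-sub sketch: one value per (topic, writer), overwrite semantics."""
    def __init__(self):
        self._state = defaultdict(dict)   # topic -> {writer: latest value}
        self._subs = defaultdict(list)    # topic -> subscriber callbacks

    def subscribe(self, topic, callback):
        self._subs[topic].append(callback)

    def write(self, topic, writer, value):
        self._state[topic][writer] = value   # overwrite: no history kept
        snapshot = dict(self._state[topic])  # notifications are snapshots
        for callback in self._subs[topic]:
            callback(snapshot)

bb = BulletinBoard()
seen = []
bb.subscribe("load", seen.append)
bb.write("load", "node1", 0.3)
bb.write("load", "node1", 0.7)  # overwrites node1's earlier value
bb.write("load", "node2", 0.5)
```

Overwrite semantics are what make the monitoring use case cheap: only the latest value per writer matters, so missed intermediate writes are harmless.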

Performance and scalability goals: adequate latency, scalable runtime costs, and low overhead; throughput is less of an issue (management load is fixed and low). Also robustness and scalability in the presence of a large number of processes and topics (2883 topics in a system of 127 processes; note that the initial target was around 1000 processes).

Approach: build on an existing overlay network called SON (Service Overlay Network). SON is a semi-structured P2P overlay, already in the product; it is self-* (recovers from changes quickly without problems), resilient, and supports peer membership and broadcast. The research question here was whether BB could be implemented efficiently on top of a P2P overlay like SON.

Architecture: SON with IAM (interest aware membership) built on top of it and BB on top of that (but BB can interact directly with SON).

Reliable shared-state maintenance in SON for BB: it is fully decentralized, and update propagation is optimized for bimodal topic popularity: iterative unicast over direct TCP connections if the number of subscribers to a topic is below a certain threshold, and overlay broadcast otherwise. For reliability, the latest written value is periodically refreshed (on a long cycle) if not overwritten (a bad decision in retrospect), with state transfer to new/reconnected subscribers.
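The bimodal propagation rule is easy to sketch. The threshold value and function names below are my own illustration (the talk did not give the actual threshold or API):

```python
BROADCAST_THRESHOLD = 8  # illustrative value only; not from the talk

def propagate(subscribers, update, unicast, broadcast):
    """Iterative unicast for unpopular topics, overlay broadcast for popular ones."""
    if len(subscribers) < BROADCAST_THRESHOLD:
        for peer in subscribers:
            unicast(peer, update)  # one direct TCP send per subscriber
        return "unicast"
    broadcast(update)              # one overlay broadcast reaches everyone
    return "broadcast"

sent = []
small = propagate(["n1", "n2"], "cpu=0.4",
                  lambda peer, u: sent.append((peer, u)),
                  lambda u: sent.append(("<overlay>", u)))
large = propagate([f"n{i}" for i in range(20)], "cpu=0.4",
                  lambda peer, u: sent.append((peer, u)),
                  lambda u: sent.append(("<overlay>", u)))
```

The rationale: with few subscribers, per-peer unicast costs less than flooding the whole overlay; past the threshold, a single broadcast wins.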

An experimental study on different topologies showed low CPU overhead and latency, but these numbers increased with topology size. Analysis revealed that the periodic refreshes stacked up and caused the increased CPU and latency overheads. An additional problem was broadcast flooding; when that was removed, CPU and latency overheads stayed flat as the topology grew.

Lessons learned: communication cost is the major factor affecting the scalability of overlay-based implementations, and anti-entropy techniques are the best fit for such services.

Second Talk: Optimizing Information Flow in the Gossip Objects Platform by Ymir Vigfusson

In gossip, nodes exchange information with a random peer periodically, in rounds. Gossip has appealing properties: bounded network traffic, scalability in group size, robustness against failures, and coding simplicity. This is nice when gossip is considered individually per application, but in cloud computing, with nodes joining many groups, the traffic is no longer bounded per node (only per group).

The Gossip Objects (GO) platform is a general platform for running gossip for multiple applications on a single node. It bounds the gossip traffic going out of a particular node. The talk focused on how to select which rumors from the multiple applications on a node to send out, so as to reduce the number of messages. This is possible because rumor messages are small and short-lived. An observation made is that rumors can be delivered indirectly: uninterested nodes can forward rumors to interested nodes.

The GO heuristic: recipient selection is biased towards higher group traffic. Content is selected by computing a utility for each rumor, defined as the probability that the rumor will add information to a host that did not already know it.
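The content-selection step can be sketched as a utility-ranked packing of rumors into one outgoing message. This is a simplified illustration of the idea as described above, not GO's actual algorithm (in particular, it ignores indirect delivery through uninterested nodes, and the function names and rumor fields are hypothetical):

```python
def utility(p_unknown, recipient_interested):
    """P(the rumor adds information): recipient must care and not already know it."""
    return p_unknown if recipient_interested else 0.0

def select_rumors(rumors, recipient_groups, capacity):
    """Stack the highest-utility rumors that fit into one gossip message."""
    scored = [(utility(r["p_unknown"], r["group"] in recipient_groups), r)
              for r in rumors]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for score, r in scored[:capacity] if score > 0]

rumors = [
    {"id": 1, "group": "a", "p_unknown": 0.9},
    {"id": 2, "group": "b", "p_unknown": 0.5},  # recipient is not in group b
    {"id": 3, "group": "a", "p_unknown": 0.2},
]
picked = select_rumors(rumors, recipient_groups={"a"}, capacity=2)
```

Because rumors are small, several can be stacked into a single message up to the MTU, which is where the traffic savings come from.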

Simulation: a first simulation of an extreme example, with only two nodes joining many groups, showed promising results for the GO heuristic. A real-world evaluation was then conducted on a 55-minute trace of the IBM WebSphere Virtual Enterprise Bulletin Board layer. The trace had 127 nodes and 1364 groups, and the evaluation showed that GO capped traffic compared to the random and random-with-stacking heuristics. Additionally, the GO heuristic delivered rumors faster than the other heuristics, and in the number of messages needed to deliver rumors to interested nodes, GO achieved multiple orders of magnitude of reduction over the other heuristics and traditional rumor spreading.

Conclusion: GO implements novel ideas such as per-node gossip, rumor stacking (packing rumors up to the MTU size), utility-based rumor dissemination, and adaptation to traffic rates. GO gives per-node guarantees even as the number of groups scales up. Experimental results were compelling.


Mike Spreitzer, IBM Research: What would happen if the number of groups increases?

Answer: A study of available real-world traces showed a pattern of overlap. We also ran simulations with other group membership patterns, and the results were similar.

—: What was the normal rumor size? And what would happen if that increased?

Answer: The average rumor size was 100 bytes. If the message size increased, we would stack fewer rumors, but our platform can also reject really large rumors.

—: Have you thought about network-level encoding?

Answer: Not yet, but we plan to in the future.

—: Have you thought of leveraging other dissemination techniques to run under GO?

Answer: Actually, we thought about the opposite direction where we would run other communication protocols and map them under the hood to GO. Results are pending.

Categories: Conferences, Systems Tags: