Happy Perihelion
Happy Perihelion from the Big Red Bits team.

Happy Perihelion from the Big Red Bits team.

This started as a joke over dinner with some friends
(click on the picture for more legible fonts)
Coming to think about it … why stop at quintuple blindness? how about automated reviewers or mixing reviewers from different eras? Ok, I’ll stop here
I’ve been reading about the Bounded Degree Spanning Tree problem and I thought of writing some of what I am learning here. It illustrates a beautiful techique called Iterated Rounding and uses the combinatorial idea of uncrossing. I’ll try to give a high-level idea of the argument and give references on the details. The first result of this kind was given by Goemans (although there were previous results with weaker guarantees) by Goemans in Minimum Bounded Degree Spanning Trees, but the result based on iterated rounding and a subsequent improvement are due to Singh and Lau in a serie of papers. A main reference is Approximating minimum bounded degree spanning trees to within one of optimal.
The problem of bounded degree spanning tree is as follows: consider a graph with edge weights and we for some nodes
a degree bound
. We want to find, among the spanning trees for which the degree of
is
the one with minimum cost. It is clearly a hard problem, since taking all weights equal to
and
for all nodes is the Hamiltonian Path problem, which is NP-complete. We will get a different kind of approximation. Let OPT be the optimal solution: we will show an algorithm that gives a spanning tree of cost
such that each node
has degree
(this can be improved to
with a more sofisticated algorithm, also based on Iterated Rounding).
As always, the first step to design an approximation algorithm is to relax it to an LP. We consider the following LP:
The first constraint expresses that in a spanning tree, there are at most edges, the second prevent the formation of cycles and the third guarantees the degree bounds. For
we have the standard Minimal Spanning Tree problem and for this problem the polytope is integral. With the degree bounds, we lose this nice property. We can solve this LP using the Ellipsoid Method. The separation oracle for the
is done by a flow computation.
Iterated Rounding
Now, let’s go ahead and solve the LP. It would be great if we had an integral solution: we would be done. It is unfortunately not the case, but we can still hope it is almost integral in some sense: for example, some edges are integral and we can take them to the final solution and recurse the algorithm on a smaller graph. This is not far from truth and that’s the main idea of the iterated rounding. We will show that the support of the optimal solution has some nice structure. Consider the following lemma:
Lemma 1 For any basic solution
of the LP, either there is
with just one incident edge in the support
or there is one
such that that at most
edges are incident to it.
If we can prove this lemma, we can solve the problem in the following way: we begin with an empty tree: then we solve the LP and look at the support . There are two possibilities according to the lemma:
The algorithm eventually stops, since in each iteration we have less edges or less nodes in and the solution is as desired. The main effort is therefore to prove the lemma. But before, let’s look at the lemma: it is of the following kind: “any basic solution of the LP has some nice properties, which envolve having a not too big (at least in some point) support”. So, it involves proving that the support is not too large. That is our next task as we are trying to prove the lemma. And we will be done with:
Theorem 2 The algorithm described above produces a spanning tree of cost
(the LP values and therefore
)in which each node
has degree
.
Bounding the size of the support
We would like now to prove some result like the Lemma above: that in the solution of the LP we have either one with degree
in
or we have a node in
with degree
. First, we suppose the opposite, that
has all the nodes with degree
and all the nodes in
have degree
. This implies that we have a large number of edges in the support. From the degrees, we know that:
We want to prove that the support of the LP can’t be too large. The first question is: how to estimate the size of the support of a basic solution. The constraints look like that:

A basic solution can be represented by picking rows of the matrix and making them tight. So, if we have a general
LP, we pick some submatrix
of
which is
and the basic solution is just
. The lines of matrix
can be of three types: they can be
, which are corresponding to
,
that correspond to
or
corresponding to
. There are
vectors in total. The size of the support
is smaller or equal the number of rows of the form
in the basic solution. Therefore the idea to bound the size of the support is to prove that “all basic solutions can be represented by a small number of rows in the form
. And this is done using the following:
Lemma 3 Assuming
, for any basic solution
, there is
and a family
of sets such that:
- The restrictions correspondent to
and
are tight for
![]()
is an independent set
![]()
is a laminar family
The first 3 items are straightfoward properties of basic solutions. The fourth one, means that for two sets , one of three things happen:
,
or
. Now, we based on the previous lemma and in the following result that can be easily proved by induction, we will prove Lemma 1.
Lemma 4 If
is a laminar family over the set
where each set
contains at least
elements, then
.
Now, the proof of Lemma 1 is easy. Let’s do it and then we come back to prove Lemma 3. Simply see that what contradicts
.
Uncrossing argument
And now we arrive in the technical heart of the proof, which is proving Lemma 3. This says that given any basic solution, given any feasible solution, we can write it as a “structured” basic solution. We start with any basic feasible solution. This already satifies (1)-(3), then we need to change that solution to satisfy condition (4) as well. We need to get rid crossing elements, i.e., in the form:

We do that by the means of the:
Lemma 5 (Uncrossing Lemma) If
and
are intersecting and tight (tight in the sense that their respective constraint is tight), then
and
are also tight and:
Which corresponds to that picture:

Proof: First, we note that is a supermodular function, i.e.:
We can see that by case analysis. Every edge appearing in the left side appears in the right side with at least the same multiplicity. Notice also that it holds with strict inequality iff there are edges from to
. Now, we have:
where the first relation is trivial, the second is by feasibility, the third is by supermodularity and the lastone is by tightness. So, all hold with equality and therefore and
are tight. We also proved that:
so there can be no edge from to
in
and therefore, thinking just of edges in
we have:
Uncrossing arguments are found everywhere in combinatorics. Now, we show how the Uncrossing Lemma can be used to prove Lemma 1:
Proof: Let be any basic solution. It can be represented by a pair
where
and
is a family of sets. We will show that the same basic solution can be represented by
where
is a laminar family and has the same size of
.
Let be all sets that are tight under
and
a maximal laminar family of tights sets in
, such that
are independent. I claim that
.
In fact, suppose , then there are sets of
we could add to
without violating independence – the problem is that those sets would cross some set. Pick such
intersecting fewer possible sets in
. The set
intersects some
. Since both are tight we can use the Uncrossing Lemma and we get:
since , we can’t have simultaneously
and
in
. Let’s consider two cases:

In either case we have a contradiction, so we proved that . So we can generate all the space of tight sets with a laminar family.
And this finishes the proof. Let’s go over all that we’ve done: we started with an LP and we wanted to prove that the support of each solution was not too large. We wanted that because we wanted to prove that there was one node with degree one in the support or a node in with small (
) degree. To prove that the degree of the support is small, we show that any basic solution has a representation in terms of a laminar family. Then we use the fact that laminar families can’t be very large families of sets. For that, we use the celebrated Uncrossing Lemma.
Note: Most of this is based on my notes on David Williamson’s Approximation Algorithms class. I spent some time thinking about this algorithm and therefore I decided o post it here.
I’ve been taking two classes in probability this semester and in those I saw the proofs of a lot of interesting theorems which I knew about previously but I have never seen the proof, as the Central Limit Theorem, the Laws of Large Numbers and so on… Also, some theory which is looks somewhat ugly in the undergrad courses becomes very clear with the proper formal treatment. Today I was thinking what was the main take-home message that a computer scientist could take from those classes and. at ;east for me, this message is the various ways of looking to probability distributions. I’ve heard about moments, Laplace transform, Fourier transform and other tools like that, but I never realized before their true power. Probably still today, most of their true power is hidden from me, but I am starting to look at them in a different way. Let me try to go over a few examples of different ways we can look at probability distributions and show cases where they are interesting.
Most of ways of looking at probability distributions are associated with multiplicative system: a multiplicative system is a set of real-valued functions with the property that if
then
. Those kinds of sets are powerful because of the Multiplicative Systems Theorem:
Theorem 1 (Multiplicative Systems Theorem) If
is a multiplicative system,
is a linear space containing
(the constant function
) and is closed under bounded convergence, then
implies that
contains all bounded
-measurable functions.
The theorem might look a bit cryptic if you are not familiar with the definitions, but it boils down to the following translation:
Theorem 2 (Translation of the Multiplicative Systems Theorem) If
is “general” multiplicative system,
and
are random variable such that
for all
then
and
have the same distribution.
where general excludes some troublesome cases like or all constant functions, for example. In technical terms, we wanted
to be the Borel
-algebra. But let’s not worry about those technical details and just look at the translated version. We now, discuss several kinds of multiplicative systems:
If we know moment generating functions, we can calculate expectation very easily, since . For example, suppose we have a process like that: there is one bacteria in time
. In each timestep, either this bacteria dies (with probability
), continues alive without reproducing (with probability
or has
offsprings (with probability
). In that case
. Each time, the same happens, independently with each of the bacteria alive in that moment. The question is, what is the expected number of bacteria in time
?
It looks like a complicated problem with just elementary tools, but it is a simple problem if we have moment generating functions. Just let be the variable associated with the
bacteria of time
. It is zero if it dies,
if it stays the same and
if it has
offsprings. Let also
be the number of bacteria in time
. We want to know
. First, see that:
Now, let’s write that in terms of moment generating functions:
which is just:
since the variables are all independent and identically distributed. Now, notice that:
by the definition of moment generating function, so we effectively proved that:
We proved that is just
iterated
times. Now, calculating the expectation is easy, using the fact that
and
. Just see that:
. Then, clearly
. Using similar technique we can prove a lot more things about this process, just by analyzing the behavior of the moment generating function.
where we also used Markov Inequality: . Passing to the Laplace transform is the main ingredient in the Chernoff bound and it allows us to sort of “decouple” the random variables in the sum. There are several other cases where the Laplace transform proves itsself very useful and turns things that looked very complicated when we saw in undergrad courses into simple and clear things. One clear example of that is the motivation for the Poisson random variable:
If are independend exponentially distributed random variables with mean
, then
. An elementary calculation shows that its laplace transform is
. Let
, i.e., the time of the
arrival. We want to know what is the distribution of
. How to do that?
Now, we need to find such that
. Now it is just a matter of solving this equation and we get:
. Now, the Poisson varible
measures the number of arrivals in
and therefore:
One fact that always puzzled me was: why is the normal distribution so important? What does it have in special to be the limiting distribution in the Central Limit Theorem, i.e., if
is a sequence of independent random variables,
then
under some natural conditions on the variables. The reason the normal is so special is because it is a “fixed point” for the Fourier Transform. We can see that
. And there we have something special about it that makes me believe the Central Limit Theorem.
————————-
This blog post was based on lectures by Professor Dynkin at Cornell.
BigRedBits is again pleased to have Igor Gorodezky as a guest blogger directly from UCLA. I leave you with his excelent post on the Wilson’s algorithm.
——————————————
Igor again, with another mathematical dispatch from UCLA, where I’m spending the semester eating and breathing combinatorics as part of the 2009 program on combinatorics and its applications at IPAM. In the course of some reading related to a problem with which I’ve been occupying myself, I ran across a neat algorithmic result – Wilson’s algorithm for uniformly generating spanning trees of a graph. With Renato’s kind permission, let me once again make myself at home here at Big Red Bits and tell you all about this little gem.
The problem is straightforward, and I’ve essentially already stated it: given an undirected, connected graph , we want an algorithm that outputs uniformly random spanning trees of
. In the early ’90s, Aldous and Broder independently discovered an algorithm for accomplishing this task. This algorithm generates a tree
by, roughly speaking, performing a random walk on
and adding the edge
to
every time that the walk steps from
to
and
is a vertex that has not been seen before.
Wilson’s algorithm (D. B. Wilson, “Generating random spanning trees more quickly than the cover time,” STOC ‘96) takes a slightly different approach. Let us fix a root vertex . Wilson’s algorithm can be stated as a loop-erased random walk on
as follows.
Algorithm 1 (Loop-erased random walk) Maintain a tree
, initialized to consist of
alone. While there remains a vertex
not in
: perform a random walk starting at
, erasing loops as they are created, until the walk encounters a vertex in
, then add to
the cycle-erased simple path from
to
.
We observe that the algorithm halts with probability 1 (its expected running time is actually polynomial, but let’s not concern ourselves with these issues here), and outputs a random directed spanning tree oriented towards . It is a minor miracle that this tree is in fact sampled uniformly from the set of all such trees. Let us note that this offers a solution to the original problem, as sampling
randomly and then running the algorithm will produce a uniformly generated spanning tree of
.
It remains, then, to prove that the algorithm produces uniform spanning trees rooted at (by which we mean directed spanning trees oriented towards
). To this we dedicate the remainder of this post.
1. A “different” algorithm
Wilson’s proof is delightfully sneaky: we begin by stating and analyzing a seemingly different algorithm, the cycle-popping algorithm. We will prove that this algorithm has the desired properties, and then argue that it is equivalent to the loop-erased random walk (henceforth LERW).
The cycle-popping algorithm works as follows. Given and
, associate with each non-root vertex
an infinite stack of neighbors. More formally, to each
we associate
where each is uniformly (and independently) sampled from the set of neighbors of
. Note that each stack is not a random walk, just a list of neighbors. We refer to the left-most element above as the top of
, and by popping the stack
we mean removing this top vertex from
.
Define the stack graph to be the directed graph on
that has an edge from
to
if
is at the top of the stack
. Clearly, if
has
vertices then
is an oriented subgraph of
with
edges. The following lemma follows immediately.
Lemma 1 Either
is a directed spanning tree oriented towards
or it contains a directed cycle.
If there is a directed cycle in
we may pop it by popping
for every
. This eliminates
, but of course might create other directed cycles. Without resolving this tension quite yet, let us go ahead and formally state the cycle-popping algorithm.
Algorithm 2 (Cycle-popping algorithm) Create a stack
for every
. While
contains any directed cycles, pop a cycle from the stacks. If this process ever terminates, output
.
Note that by the lemma, if the algorithm ever terminates then its output is a spanning tree rooted at . We claim that the algorithm terminates with probability 1, and moreover generates spanning trees rooted at
uniformly.
To this end, some more definitions: let us say that given a stack , the vertex
is at level
. The level of a vertex in a stack is static, and is defined when the stack is created. That is, the level of
does not change even if
advances to the top of the stack as a result of the stack getting popped.
We regard the sequence of stack graphs produced by the algorithm as leveled stack graphs: each non-root vertex is assigned the level of its stack. Observe that the level of
in
is the number of times that
has been popped. In the same way, we regard cycles encountered by the algorithm as leveled cycles, and we can regard the tree produced by the algorithm (if indeed one is produced) as a leveled tree.
The analysis of the algorithm relies on the following key lemma (Theorem 4 in Wilson’s paper), which tells us that the order in which the algorithm pops cycles is irrelevant.
Lemma 2 For a given set of stacks, either the cycle-popping algorithm never terminates, or there exists a unique leveled spanning tree
rooted at
such that the algorithm outputs
irrespective of the order in which cycles are popped.
Proof: Fix a set of stacks . Consider a leveled cycle
that is pop-able, i.e.~there exist leveled cycles
that can be popped in sequence. We claim that if the algorithm pops any cycle not equal to
, then there still must exist a series of cycles that ends in
and that can be popped in sequence. In other words, if
is pop-able then it remains pop-able, no matter which cycles are popped, until
itself is actually popped.
Let be a cycle popped by the algorithm. If
then the claim is clearly true. Also, if
shares no vertices with
, then the claim is true again. So assume otherwise, and let
be the first in the series
to share a vertex with
. Let us show that
by contradiction.
If , then
and
must share a vertex
that has different successors in
and
. But by definition of
, none of the
contain
, and this implies that
has the same level in
and
. Therefore its successor in both cycles is the same, a contradiction. This proves
.
Moreover, the argument above proves that and
are equal as leveled cycles (i.e.~every vertex has the same level in both cycles). Hence
is a series of cycles that can be popped in sequence, which proves the original claim about .
We conclude that given a set of stacks, either there is an infinite number of pop-able cycles, in which case there will always be an infinite number and the algorithm will never terminate, or there is a finite number of such cycles. In the latter case, every one of these cycles is eventually popped, and the algorithm produces a spanning tree rooted at
. The level of each non-root vertex in
is given by (one plus) the number of popped cycles that contained
.
Wilson summarizes the cycle-popping algorithm thusly: “[T]he stacks uniquely define a tree together with a partially ordered set of cycles layered on top of it. The algorithm peels off these cycles to find the tree.”
Theorem 3 The cycle-popping algorithm terminates with probability 1, and the tree that it outputs is a uniformly sampled spanning tree rooted at
.
Proof: The first claim is easy: has a spanning tree, therefore it has a directed spanning tree oriented towards
. The stacks generated in the first step of the algorithm will contain such a tree, and hence the algorithm will terminate, with probability 1.
Now, consider a spanning tree rooted at
. We’ll abuse notation and let
be the event that
is produced by the algorithm. Similarly, given a collection of leveled cycles
, we will write
for the event that
is the set of leveled cycles popped by the algorithm before it terminates. Finally, let
be the event that the algorithm popped the leveled cycles in
and terminated, with the resulting leveled tree being equal to
.
By the independence of the stack entries, we have , where
is the probability that the algorithm’s output is a leveled version of
, a quantity which a moment’s reflection will reveal is independent of
. Now,
which, as desired, is independent of .
2. Conclusion
We have shown that the cycle-popping algorithm generates spanning trees rooted at uniformly. It remains to observe that the LERW algorithm is nothing more than an implementation of the cycle-popping algorithm! Instead of initially generating the (infinitely long) stacks and then looking for cycles to pop, the LERW generates stack elements as necessary via random walk (computer scientists might recognize this as the Principle of Deferred Decisions). If the LERW encounters a loop, then it has found a cycle in the stack graph
induced by the stacks that the LERW has been generating. Erasing the loop is equivalent to popping this cycle. We conclude that the LERW algorithm generates spanning trees rooted at
uniformly.
In my last post about hats, I told I’ll soon post another version with some more problems, which I ended up not doing and would talk a bit more about those kind of problems. I ended up not doing, but here are a few nice problems:
Those
people are again a room, each with a hat which is either black or white (picked with probability
at random) and they can see the color of the other people’s hats but they can’t see their own color. They write in a piece of paper either “BLACK” or “WHITE”. The whole team wins if all of them get their colors right. The whole team loses, if at least one writes the wrong color. Before entering the room and getting the hats, they can strategyze. What is a strategy that makes them win with
probability?
If they all choose their colors at random, the probability of winning is very small: . So we should try to correlate them somehow. The solution is again related with error correcting codes. We can think of the hats as a string of bits. How to correct one bit if it is lost? The simple engineering solution is to add a parity check. We append to the string
a bit
. So, if bit
is lost, we know it is
. We can use this idea to solve the puzzle above: if hats are places with
probability, the parity check will be
with probability
and
with probability
. They can decide before hand that everyone will use
and with probability
they are right and everyone gets his hat color right. Now, let’s extend this problem in some ways:
The same problem, but there are
hat colors, they are choosen independently with probability
and they win if everyone gets his color right. Find a strategy that wins with probability
.
There are again
hat colors, they are choosen independently with probability
and they win if at least a fraction
(
) of the people guesses the right color. Find a strategy that wins with probability
.
Again to the problem where we just have BLACK and WHITE colors, they are chosen with probability
and everyone needs to find the right color to win, can you prove that
is the best one can do? And what about the two other problems above?
The first two use variations of the parity check idea in the solution. For the second case, given any strategy of the players, for each string they have probability
. Therefore the total probability of winning is
. Let
, i.e., the same input but with the bit
flipped. Notice that the answer of player
is the same (or at least has the same probabilities) in both
and
, since he can’t distinguish between
and
. Therefore,
. So,
. This way, no strategy can have more than probability of winning.
Another variation of it:
Suppose now we have two colors BLACK and WHITE and the hats are drawn from one distribution
, i.e., we have a probability distribution over
and we draw the colors from that distribution. Notice that now the hats are not uncorrelated. How to win again with probability
(to win, everyone needs the right answer).
I like a lot those hat problems. A friend of mine just pointed out to me that there is a very nice paper by Bobby Kleinberg generalizing several aspects of hat problems, for example, when players have limited visibility of other players hats.

I began being interested by this sort of problem after reading the Derandomization of Auctions paper. Hat guessing games are not just a good model for error correcting codes, but they are also a good model for truthful auctions. Consider an auction with a set single parameter agents, i.e., an auction where each player gives one bid
indicating how much he is willing to pay to win. We have a set of constraints:
of all feasible allocations. Based on the bids
we choose an allocation
and we charge payments to the bidders. An example of a problem like this is the Digital Goods Auction, where
.
In this blog post, I discussed the concept of truthful auction. If an auction is randomized, an universal truthful auction is an auction that is truthful even if all the random bits in the mechanism are revealed to the bidders. Consider the Digital Goods Auction. We can characterize universal truthful digital goods auction as bid-independent auctions. A bid-independent auction is given by function , which associated for each
a random variable
. In that auction, we offer the service to player
at price
. If
we allocate to
and charge him
. Otherwise, we don’t allocate and we charge nothing.
It is not hard to see that all universal truthful mechanisms are like that: if is the probability that player
gets the item bidding
let
be an uniform random variable on
and define
. Notice that here
, but we are inverting with respect to
. It is a simple exercise to prove that.
With this characterization, universal truthful auctions suddenly look very much like hat guessing games: we need to design a function that looks at everyone else’s bid but not on our own and in some sense, “guesses” what we probably have and with that calculated the price we offer. It would be great to be able to design a function that returns . That is unfortunately impossible. But how to approximate
nicely? Some papers, like the Derandomization of Auctions and Competitiveness via Consensus use this idea.
I was discussing last week with my officemates Hu Fu and Ashwin about the Cayley-Hamilton Theorem. The theorem is the following, given an matrix
we can define its characteristic polynomial by
. The Cayley-Hamilton Theorem says that
. The polynomiale is something like:
so we can just see it as a formal polynomial and think of:
which is an matrix. The theorem says it is the zero matrix. We thought for a while, looked in the Wikipedia, and there there were a few proofs, but not the one-line proof I was looking for. Later, I got this proof that I sent to Hu Fu:
Write the matrix in the basis of its eigenvectors, then we can write
where
is the diagonal matrix with the eigenvalues in the main diagonal.
and since we have
. Now, it is simple to see that:
and therefore:
And that was the one-line proof. One even simpler proof is: let be the eigenvectors, then
, so
must be
since it returns zero for all elements of a basis. Well, I sent that to Hu Fu and he told me the proof had a bug. Not really a bug, but I was proving only for symmetric matrices. More generally, I was proving for diagonalizable matrices. He showed me, for example, the matrix:
which has only one eigenvalue and the the eigenvectors are all of the form
for
. So, the dimension of the space spanned by the eigenvectors is
, less than the dimension of the matrix. This never happens for symmetric matrices, and I guess after some time as a computer scientist, I got used to work only with symmetric matrices for almost everything I use: metrics, quadratic forms, correlation matrices, … but there is more out there then only symmetric matrices. The good news is that this proof is not hard to fix for the general case.
First, it is easy to prove that for each root of the characteristic polynomial there is one eigenvector associated to it (just see that and therefore there must be
, so if all the roots are distinct, then there is a basis of eigenvalues, and therefore the matrix is diagonalizable (notice that maybe we will need to use complex eigenvalues, but it is ok). The good thing is that a matrix having two identical eigenvalues is a “coincidence”. We can identify matrices with
. The matrices with identical eigenvalues form a zero measure subset of
, they are in fact the roots of a polynomial in
. This polynomial is the resultant polynomial
. Therefore, we proved Cayley-Hamilton theorem in the complement of a zero-measure set in
. Since
is a continuous function, it extends naturally to all matrices
.
We can also interpret that probabilistically: get a matrix where
is taken uniformly at random from
. Then
has with probability
all different eigenvalues. So,
with probability
. Now, just make
.
Ok, this proves the Theorem for real and complex matrices, but what about a matrix defined over a general field where we can’t use those continuity arguments. A way to get around it is by using Jordan Canonical Form, which is a generalization of eigenvector decomposition. Not all matrices have eigenvector decomposition, but all matrices over an algebraic closed field can be written in Jordan Canonical Form. Given any there is a matrix
so that:
where are blocks of the form:
By the same argument as above, we just need to prove Cayley Hamilton for each block in separate. So we need to prove that . If the block has size
, then it is exacly the proof above. If the block is bigger, then we need to look at how does
looks like. By inspection:
Tipically, for we have in each row, starting in column
the sequence
, i.e.,
. So, we have
If block has size
, then
has multiplicity
in
and therefore
and therefore,
as we wanted to prove.
It turned out not to be a very very short proof, but it is still short, since it uses mostly elementary stuff and the proof is really intuitive in some sense. I took some lessons from that: (i) first it reinforces my idea that, if I need to say something about a matrix, the first thing I do is to look at its eigenvectors decomposition. A lot of Linear Algebra problems are very simple when we consider things in the right basis. Normally the right basis is the eigenvector basis. (ii) not all matrices are diagonalizable. But in those cases, Jordan Canonical Form comes in our help and we can do almost the same as we did with eigenvalue decomposition.
Second Talk: A case for the accountable cloud by Andreas Haeberlen
Three cloud stories .. a threatening cloud, a promising cloud, and a nice cloud
The problem with current clouds is that the user does not know what the cloud service provider is doing with the customer’s code and data. Also, from the cloud service provider’s perspective, the operator does not know what is the code that they are running for customers supposed to do.
Alice is the customer running a service on the cloud owned and operated by Bob.
A solution: what if we had an oracle that Alice and Bob could ask about cloud problems? We want completeness, (if something is faulty, we will know) accuracy (no false positives), verifiability (the oracle can prove its diagnoses is correct).
Idea: make clud accountable to alice+bob. Cloud records its actions in a tamper-evident log, alice and bob can audit, use log to construct evidence that a fault does or does not exist.
Discussion: 1) Isn’t this too pessimistic? bob isn’t malicious ..maybe, but bob can get hacked, or things can just go wrong. 2) shouldn’t bob use fault tolerance instead? yes whenever we can, but masking faults is never perfect, we still need to check. 3) why would a provider want to deploy this? this feature will be attractive to prospective customers, and helpful for support. 4) Are these the right guarantees? completeness (no false negatives), could be relaxed with probabilistic completeness; verifiability could be relaxed only provide some evidence; accuracy (no false positives) can not be relaxed because we need to have confidence when we rule out problems.
A call to action: cloud accountability should do; deliverable provable guarantees, work for most cloud apps, require no changes to application code, cover a wide spectrum of properties, low overhead.
Work in Progress: Accountable Virtual Machines (AVM), goal: provide accountability for arbitrary unmodified software. cloud records enough data to enable deterministic replay, alice can replay log with a known-good copy of the software, can audit any part of the original execution.
Conclusion: current cloud designs carry risks for both customers and providers (mainly because of split administration problem). Proposed solution: accountable cloud. Lots of research opportunities.
Third Talk: Learning from the Past for Resolving Dilemmas of Asynchrony by Paul Ezhilchelvan
In an asynchronous model you can not bound message delivery time or even message processing time by a machine. However, in a probabilistic synchronous model, we can bound times within a certain probability via proactive measurements. The new central hypothesis of the new model is that most of the time, performance of the past is indicative of the performance of the near future (i.e. delay in the past is the indicative of delay in the future).
Design steps include doing proactive measurements, using them to establish synchrony bounds, and assign time bounds based on that, try that and see how it works and enable exceptions.
On-going work: development of exceptions (to deal with exceptional cases when mistakes are detected). Open environments are asynchronous, use crash signals for notification of extreme unexpected behavior.
First Talk: Bulletin Board: A Scalable and Robust Eventually Consistent Shared Memory over a Peer-to-Peer Overlay by Gregory Chockler
WebSphere Virtual Enterprise (WVE) is a product for managing resources in a data center. The product is a distributed system whose nodes and controllers need to communicate and share information, and BulletinBoard (BB) is used for that. BB is a platform service for facilitating group-based information sharing in a data center. It is critical component of WVE, and its primary application is monitoring and control, but the designers believe that it could be useful for other weakly consistent services.
Motivation & Contribution: Prior implementation of group communication implemented internall as not designed to grow 10 folds, and that was based on Virtual Synchronous group communication; robustness, stability, high runtime overheads as the system grew beyond several 100s of processes; static hierarchy introduced configuration problems. So the goal was to provide a new implementation to resolve the scaling and stability issues of the prior implementation (and implement this in a short time! so this constraint had important implications on the design decisions).
BB supports a write-sub (write subscribe) service model. It is a cross between pub-sub systems, shared memory systems, and traditional group communication systems. In pub-sub communication is async and done through topics. In shared memory we have overwrite semantics, singe writer per topic and process, and notifications are snapshots of state.
Consistency Semantics (single topic). PRAM Consistency: notified snapshots are consistent with the other process order of writes. A note made was that developers who built services on top of that turned out to understand this semantics of consistency.
Liveness Semantics (single topics). Uses Eventual inclusion: eventually each write by a correct and connected process is included into the notified snapshot. Eventual exclusion means that failed processes will be eventually excluded from updates.
Performance and Scalability Goals: adequate latency, scalable runtime costs, throughput is less of an issue (management load is fixed and low). Low overhead. Robustness, scalability in the presence of large number of processes and topics (2883 topics in a system of 127 processes, note that the initial target was around 1000 processes).
Approach: decided to build this on an overlay network called SON. Service Overlay Network (SON). SON is a semi-structured P2P overlay, already in the product, and self-* (recover from changes quickly without problems), resilient, and supports peer membership and broadcast. The research question here was whether if BB can be implemented efficiently on top of a P2P overlay like SON?
Architecture: SON with IAM (interest aware membership) built on top of it and BB on top of that (but BB can interact directly with SON).
Reliable Shared State Maintenance in SON for BB: is made fully decentralized, and update propagation is optimized for bimodal topic popularity. Overlay broadcast or iterative unicast over direct TCP connections if # subscribers of a topic is less than a certain threshold (and group broadcast otherwise). For Reliability, periodic refresh of the latest written value (on a long cycle) if not overwritten (this was a bad decision in retrospect) with state transfer to new/reconnected subscribers.
Experimental study on different topologies showed low cpu overhead and latency, but these numbers increased as the topology increased in size. Analysis of that revealed that this was because the periodic refreshes were stacked and caused increased CPU & latency overheads. An additional problem was in broadcast flooding, and when that was removed cpu & latency overheads stayed flat as the topology increased in size.
Lessons learned: communication cost is the major factor affecting scalability of overlay based implementations, and that anti-entropy techniques are best fit for such services.
Second Talk: Optimizing Information Flow in the Gossip Objects Platform by Ymir Vigfusson
In gossip, nodes exchange information with a random peer periodically in rounds. Gossip has appealing properties such as bounded network traffic, scalability in group size, robustness against failures, coding simplicity. This is nice when gossip is considered individually per application. In cloud computing with nodes joining many groups, the traffic is no longer bounded per node (but per topic).
The Gossip Objects (GO) platform is a general platform for running gossip for multiple applications on a single node. It bounds the gossip traffic going out of a particular node. The talk focused on how to select rumors to publish out from multiple applications on a single node such that we reduce number of messages. This is possible because rumor messages are small and have a short destination. An observation made is that rumors can be delivered indirectly, uninterested nodes can forward rumors to interested nodes.
The GO heuristic: recipient selection is biased towards higher group traffic. The content is selected by computing a utility of a rumor which is defined as the probability of that rumor will add information to a host that didn’t know that info.
Simulation, a first simulation of an extreme example with only two nodes joining many groups. The GO heuristic showed promising results. Then a real-world evaluation was conducted based on a 55 minute trace of the IBM WebSphere Virtual Enterprise Bulletin Board layer. The trace had 127 nodes and 1364 groups, and the evaluation showed that GO had placed a cap on traffic compared to random and random with stacking heuristics for GO. Additionally, the GO heuristic was able to deliver rumors faster than the other heuristic, and the number of messages needed to deliver the messages to interested nodes, and the GO heuristic had multiple orders of reduction over other heuristics and traditional rumor spreading.
Conclusion: GO implemented novel ideas such as per-node gossip, rumor stacking (pushing the rumor to the MTU size), utility based rumor dissemination, and adapting to traffic rates. GO gives per-node guarantees even when the # of groups scales up. Experimental results were compelling.
Questions:
Mike Spreitzer, IBM Research: What would happen if the number of groups increases?
Answer: Study of available real-world traces showed a pattern of overlap. We also conducted simulation with other group membership patterns and the results were similar.
—: What was the normal rumor size? And what would happen if that increased?
Answer: The average rumor size was 100Bytes. If the message size increased we will stack less rumors, but our platform can also reject really large rumors.
—: Have you thought about network-level encoding?
Answer: Not yet, but we plan to in the future.
—: Have you thought of leveraging other dissemination techniques to run under GO?
Answer: Actually, we thought about the opposite direction where we would run other communication protocols and map them under the hood to GO. Results are pending.
Keynote #4 by David Nichols from Microsoft. Published abstract and speaker bio.
In his talk, David shared some stories about his experience of using SQL in a data center environment to provide cloud services. The speaker was a bit fast while talking, I captured most of his message and the important parts, but i had to skip some parts.
In Windows Live, when building a new service they prefer to use off-the-shelf products such as SQL. Why SQL? familiar tested programming model (real queries, real transactions, good data modeling, excellent at OLTP, easy to find developers that know it). Solid systems software (used often and fine tuned many times and updated). Challenges with using SQL, living without single-image database model (no global transactions or global indexes). Administration and maintenance overhead. Breaking things at scale.
DB partitioned by user, many users per instance DB because it is easy and self contained. User info are small enough that you can place multiple users on single location. Front ends send requests to proper DB. Location is determined by lookup (a Lookup Partition Service – LPS – maps users to partitions). DBs are partitioned by hash to avoid hotspots.
Architecture: Three stages of scale out: bigger server, functional division, and data division.
A problem with scaling out: updates to multiple sevices and users (e.g., add messenger buddy, upload a photo which is writen to file store and recent activity store). Two-phase commit is out (because the risk of having the crash that locks your data out is too high), instead us ad hoc methods: for example: write A intent, write B, write A; another example, write A and work item, let work item write B; another example: write A, then B, tolerate inconsistency.
Another problem is how do you read data about multiple users or even all users. Example scenario, user updates his status, his friends need to know. The old way (inefficient) to do that is to write a change about the users into the profile of all affected users, easy to query, but heavy write load.
Data Availability and Reliability. Replication is used for all user data using SQL replication. Front ends have library (WebStore) to notice failures and switch to secondary. Original scheme was one-to-one which was too slow because parallel transactions vs. single replication stream. Next try was to have four DBs talking to four DBs which fixed most speed problems, but too much load on secondaries after failure. Current approach uses 8-host pods, 25% load increase for secondaries on failure (8×8 matrix, and the replication was done on the transpose of the matrix). However, still not fast enough for key tables (100’s of write threads vs. 5 replication streams). Manual replication (FE’s run SProcs at both primary and secondary, but small probability of inconsistent data). Replication runs a few seconds behind (ops reluctant to auto-promote secondary due to potential data in replication stream), new SQL tech should fix this.
Data loss causes: above the app (external application and old data); in the app (software bugs especially migration logic bugs); below the app (controller failure, disk failure). Mitigation techniques: audit trails and soft deletes for above app problems; per-user backup for software bugs; tape backup, sql replication, and RAID for below app problems (however these are expensive).
Managing Replication: fail safe set, a set of databases in some sort of replication membership. Typical fail safe set is two to four DBs (most are two). Fail safe are the true target of partition schemes.
Upgrade options: upgrade partitions: run DDL in each partition (via WebStore), this is complicated by replication, after all DBs are done, upgrade FEs (SProcs are compatible; changed APIs get new names). Migrate users: can take various forms (between servers, within a server, or even within services), and migrating users can be complex, slow, error-prone, and nobody’s likes it.
Some Operation stories.
Capacity management: growth is in units of servers. when to buy more? test teams provides one opinion, ops team aims to find max resource and stay below limit, two kinds of limits, graceful and catastrophic. Interesting thing about graceful vs catastrophic limits .. if you back off from graceful limits, you can usually go back to your original state (good state), however for catastrophic limits, even if you back off you can remain in a bad situation.
Ops lessons: 1) never do the same thing to all machines at once -stats queries, re-indexing have all crashed clusters in the past. 2) Smaller DBs are better, already coping with many DBs, plus re-indexing backups, upgrades are all faster for small DBs. 3) Read-only mode is powerful (failure maintenance and migration all use it). 4) Use the the live site to try things out (new code new SQL settings etc) “taste vs test”.
Conclusions: SQL can be tamed, it has some real issues but mostly manageable with some infrastructure, and its ops cost not out of line. It is hard to do better than SQL, it keeps improving, each time we go to design something, we find that SQL already design it, perhaps not in the form we want exactly, but close enough and not worth the effort probably. However SQL is not always the best solution.
SQL wish list. Easy ones: partitioned data support, easy migration/placement control, reporting, jobs; supporting aggregated data pattern, improved manageability. Hard ones: taming DB schema evolution, soft delete/versioning support of some kind, and A–D transactions (Atomic & Durable).