Probability Puzzles

February 16th, 2010 renatoppl No comments

Today in a dinner with Thanh, Hu and Joel I heard about a paradox I haven’t heard so far. Probability is full of cute problems that challenge our understanding of the basic concepts. The most famous of them is the Monty Hall Problem, which asks:

You are on a TV game show and there are {3} doors – one of them contains a prize, say a car and the other two door contain things you don’t care about, say goats. You choose a door. Then the TV host, who knows where the prize is, opens one door you haven’t chosen and that he knows has a goat. Then he asks if you want to stick to the door you have chosen or if you want to change to the other door. What should you do?

Probably you’ve already came across this question in some moment of your life and the answer is that changing doors would double your probability of getting the price. There are several ways of convincing your intuitions:

  • Do the math: when you chose the door, there were three options so the prize is in the door you chose with {\frac{1}{3}} probability and in the other door with probability {\frac{2}{3}} (note that the presenter can always open some door with a goat, so conditioning on that event doesn’t give you any new information).
  • Do the actual experiment (computationally) as done here. One can always ask a friend to help, get some goats and perform the actual experiment.
  • To convince yourself that “it doesn’t matter” is not correct, think {100} doors. You choose one and the TV host open {98} of them and asks if you want to change or stick with your first choice. Wouldn’t you change?

I’ve seen TV shows where this happened and I acknowledge that other things may be involved: there might be behavioral and psychologic issues associated with the Monty Hall problem – and possibly those would interest Dan Ariely, whose book I began reading today – and looks quite fun. But the problem they told me about today in dinner was another: the envelope problem:

There are two envelopes and you are told that in one of them there is twice the amount that there is in the other. You choose one of the envelopes at random and open it: it contains {100} bucks. Now, you don’t know if the other envelope has {50} bucks or {200} bucks. Then someone asks you if you wanted to pay {10} bucks and change to the other envelope. Should you change?

Now, consider two different solutions to this problem: the first is fallacious and the second is correct:

  1. If I don’t change, I get {100} bucks, if I change I pay a penalty of {10} and I get either {50} or {200} with equal probability, so my expected prize if I change is {\frac{200+50}{2}-10 = 115 > 100}, so I should change.
  2. I know there is one envelope with {x} and one with {2x}, then my expected prize if I don’t change is {\frac{x + 2x}{2} = \frac{3}{2}x}. If I change, my expected prize is {\frac{x + 2x}{2} - 10 < \frac{3}{2}x}, so I should not change.

The fallacy in the first argument is perceiving a probability distribution where there is no one. Either the other envelope contains {50} bucks or it contains {200} bucks – we just don’t know, but there is no probability distribution there – it is a deterministic choice by the game designer. Most of those paradoxes are a result of either an ill-defined probability space, as Bertrand’s Paradox or a wrong comprehension of the probability space, as in Monty Hall or in several paradoxes exploring the same idea as: Three Prisioners, Sleeping Beauty, Boy or Girl Paradox, …

73a_humpty-dumpty

There was very recently a thrilling discussion about a variant on the envelope paradox in the xkcd blag – which is the blog accompaning that amazing webcomic. There was a recent blog post with a very intriguing problem. A better idea is to go there and read the discussion, but if you are not doing so, let me summarize it here. The problem is:

There are two envelopes containing each of them a distinct real number. You pick one envelope at random, open it and see the number, then you are asked to guess if the number in the other envelope is larger or smaller then the previous one. Can you guess correctly with more than {\frac{1}{2}} probability?

A related problem is: given that you are playing the envelope game and there are number {A} and {B} (with {A < B}). You pick one envelope at random and then you are able to look at the content of the first envelope you open and then decide to switch or not. Is there a strategy that gives you expected earnings greater than {\frac{A+B}{2}} ?

The very unexpected answers is yes !!! The strategy that Randall presents in the blog and there is a link to the source here is: let {X} be a random variable on {{\mathbb R}} such that for each {a<b} we have {P(a < X < b) > 0}, for example, the normal distribution or the logistic distribution.

Sample {X} then open the envelope and find a number {S} now, if {X < S} say the other number is lower and if {X > S} say the other number is higher. You get it right with probability

\displaystyle  P(\text{picked }A) P(X > A) + P(\text{picked }B) P(X < B) = \frac{1}{2} (1 + P(A < X < B))

which is impressive. If you follow your guess, your expected earning {Y} is:

\displaystyle \begin{aligned} &P(\text{picked }A) \mathop{\mathbb E}[Y \vert \text{picked }A] + P(\text{picked }B) \mathop{\mathbb E}[Y \vert \text{picked }B] = \\ & = \frac{1}{2} [P(X<A) A + P(X>A) B] + \frac{1}{2} [P(X<B) B + P(X>B) A] \\ &= \frac{1}{2}[A [P(X<A) + P(X>B)] + B [P(X>A) + P(X<B)]] > \frac{A+B}{2} \\ \end{aligned}

The xkcd pointed to this cool archive of puzzles and riddles. I was also told that the xkcd puzzle forum is also a source of excellent puzzles, as this:

You are the most eligible bachelor in the kingdom, and as such the King has invited you to his castle so that you may choose one of his three daughters to marry. The eldest princess is honest and always tells the truth. The youngest princess is dishonest and always lies. The middle princess is mischievous and tells the truth sometimes and lies the rest of the time. As you will be forever married to one of the princesses, you want to marry the eldest (truth-teller) or the youngest (liar) because at least you know where you stand with them. The problem is that you cannot tell which sister is which just by their appearance, and the King will only grant you ONE yes or no question which you may only address to ONE of the sisters. What yes or no question can you ask which will ensure you do not marry the middle sister?

copied from here.

Categories: puzzles Tags: ,

Walrasian Equilibrium I

February 15th, 2010 renatoppl No comments

Currently I’ve been trying to understand more about the dynamics of markets and basic concepts of microeconomic theory and, as always, writing a blog post will help me to keep my ideas clear. First, why are markets interesting from a computer scientist/mathematician point of view?

  • Markets are multi-objective optimization problems: one can think of the possible state of a market some point in a space of possible {x = (x_1, \hdots, x_N) \in \Omega}. Each player of the market controls one variable, say {x_i} and is interested in maximizing one objective function {f_i(x)}. So, player {i} is trying to set {x_i = \text{argmax}_{x_i} f_i(x_i, x_{-i})}.
  • Markets are a computational model: one can think of a market as a way of performing a certain computation – as extracting some kind of information, as a prediction market, stock exchanges, … If we think of it as a computational device, we are asking the same questions: given those preferences which are implicit functions to each of the agents, calculate “fair” prices of items.
  • Markets are distributed systems where each part of the system has a selfish interest.

A market is composed by a set {L} of commodities, {I} of consumers and {J} of producers. Now, we describe how to characterize each of them:

  • Each consumer is defined by a set {X_i \subseteq {\mathbb R}^L} of commodities combinations he is interested (typically we take {X_i = {\mathbb R}^L_+}) and an utility function {u_i : X_i \rightarrow {\mathbb R}} expressing his interest for this bundle of commodities. Consumer {i} will try to maximize {u_i(\cdot)} in a further restricted {X_i}.
  • Each producer is define by a set {Y_j \subset {\mathbb R}^L} it has the capacity to produce.
  • Endowments: Each consumer comes to the market with an initial endowment {\omega_i \in {\mathbb R}^L}, so for {j \in L}, {\omega_{ij}} is the amount of commodity {j} that consumer {i} originally has. The initial total endowment of the market is given by {\overline{\omega} = \sum_{i \in I} \omega_i}, which is a vector indicating how much of each commodity originally exists in the market.
  • Shares: consumers have shares in the companies, so for {i \in I}, {j \in J}, consumer {i} has {\theta_{ij}} shares of company {j}, such that {\sum_i \theta_{ij} = 1}.

Something very crucial is missing in this picture: a way to compare commodities and something that makes exchanges possible: the answer to that is to attribute prices to the items. How to attribute prices to the items so that the market works fine? A price vector is a vector {p \in {\mathbb R}^L_+}. Consider the following scenario after prices {p_\ell, \ell \in L} are established to commodities:

  • by producing {y_j \in Y_j}, company {j \in J} gets profit {p \cdot y_j}, so each company will try to maximize its profit producing {y_j^* \in \text{argmax}_{y_j \in Y_j} p \cdot y_j}.
  • each consumer sells its initial endowment and gets the profit respective to the companies he owns. So, consumer {i} gets {w_i = p \cdot \omega_i + \sum_{j \in J} \theta_{ij} p \cdot y_j^*}.
  • now consumer {i} uses the money he has to buy the best bundle he can afford, which is {x_i^* = \text{argmax} \{u_i(x_i); x_i \in X_i, p \cdot x_i \leq w_i\}}.

The amount of commodities in the market must conserve, so that is possible only if we get:

\displaystyle \sum_{i \in I} x_i^* = \sum_{i \in I} \omega_i + \sum_{j \in J} y_j^*

First, it is not clear if such a price vector exists. If it exists, is it unique? If this is an equilibrium, is it the best thing for the consumers? How those prices can be set in practice without a centralized authority? Can people lie? Below, let’s collect a couple of questions I’ll try to answer (yes, no or unknown) in this and the following posts.

Question 1: Does a price vector {p \geq 0} always exist that generates an equilibrium?

Question 2: If it exists, is it unique?

Question 3: Can we describe an efficent method to find {p} ?

Question 4: Is it the best thing for the consumers in the following sense: if {(x^*,y^*,p)} is an equilibrium, are there feasible {(x,y)} such that {u_i(x_i) \geq u_i(x_i^*)} and for at least one consumer {u_i(x_i) > u_i(x_i^*)}? (This is called Pareto improvement)

Question 5: A central authority could use the knowledge about functions {u_i} and endowments {\omega_i} to calculate the price vector {p} using some method. Can consumers {i} be better off by lieing about their utility and endowments?

Question 6: How prices get defined without a central authority? Is there a dynamic/game-theoretical model to that?

For simplicity, let’s think of Exchange Economies, which are economies with no producers. Let’s define it formally:

Definition 1 An exchange economy is composed by a set of commodities {L} and a set of consumers {I} each with an utility {u_i : X_i \rightarrow {\mathbb R}} and an initial endowment {\omega_i \in {\mathbb R}^L_+}.

Definition 2 A price vector {p \in {\mathbb R}^L_+} is a Walrasian equilibrium for an exchange economy {(L, I, \{X_i, u_i, \omega_i\}_{i \in I})} if there is {x_i^* \in X_i} such that:

  1. {x_i^*} s.t. {u_i(x_i^*) = \max \{u_i(x_i); x_i \in X_i, p \cdot x_i \leq p \cdot \omega_i\}}
  2. {\sum_i x_i^* \leq \sum_i \omega_i}
  3. {p \cdot (\sum_i x_i^* - \omega_i)= 0}

The first condition says that each consumer is maximizing his utility given his prices, the second says that we can’t buy more commodities than what is available in the market and the third, called Walras’ Law, says that if there is surplus of a certain product, it should have price zero. It is by far the most unnatural of those, but it can be easily justifiable in some circumnstances: suppose we say that utilities are non-satiated if for each {x_i \in X_i} and {\epsilon > 0}, there is {x'_i \in X_i}, {\Vert x_i - x'_i \Vert < \epsilon} such that {u_i(x'_i) > u_i(x_i)}. If {u_i} are differentiable, that would mean {\nabla u_i \neq 0}, for example a linear function {u_i(x_i) = \sum_\ell u_{i\ell} x_{i\ell}} with some {u_{i\ell} > 0}. In that case, {\sum_i (p \cdot x_i^* - p \cdot \omega_i) < 0} and some player has money surplus and therefore he could increase his utility.

Now, we define for each price vector {p} the excess demand function {z_i(p) = x_i(p) - \omega_i} and {z(p) = \sum_i z_i(p)}. Now, under non-satiated utilities, by the last argument, we have that {p} is an equilibrium vector iff {z(p) \leq 0}. Actually, if {u_i} are also strong monotone, i.e., {u_i(x_i + \xi) \geq u_i(x_i)} for each {\xi \geq 0}, then it becomes: {p} is an equilibrium iff {z(p) = 0}, which means that the market clears:

\displaystyle \sum_i x_i^* = \sum_i \omega_i

The question that is easier to answer is Question 4 and it is sometimes refered as the First Fundamental Theorem of Welfare Economics:

Theorem 3 Given non-satiated preferences, each equilibrium {(x,p)} is Pareto, i.e. there is no other feasible allocation {x'} such that for all {i}, {u_i(x'_i) \geq u_i(x_i)} with the inequality strict for at least one component.

Proof: Suppose there were, since {u_i(x'_i) \geq u_i(x_i)} then {p \cdot x'_i \geq p \cdot \omega_i}, because if {p \cdot x'_i < p \cdot \omega_i} then we could improve the utility of {x'_i} still within the budget, contradicting the optimality of {u_i(x_i)} for that budget. And clearly {u_i(x'_i) > u_i(x_i)} implies {p \cdot x'_i > p \cdot \omega_i}.

Summing over {i}, we get {\sum_i p x'_i > \sum_i p \omega_i}, what is a contradiction, because since {x'} is feasible, {\sum_i x'_i \leq \sum_i \omega_i} and therefore {\sum_i p \cdot x'_i \leq \sum_i p \cdot \omega_i}. \Box

Now, let’s tackle Question 1. We assume linearly of utility: {u_i(x_i) = \sum_{\ell \in L} u_{i \ell} x_{i \ell}} for {u_{i \ell} > 0}. This gives us strong monotonicity and local nonsatiated preferences.

Theorem 4 Under linear utilities, there is always an equilibrium price vector {p}.

Consider the function {z} defined above: {z(p) = \sum_i (x_i(p) - \omega_i)} where {x_i(p)} is the bundle of best possible utility. Now, since we are using linear utilities we can’t guarantee there will be only one such bundle, so instead of considering a function, consider {x_i} and {z} as being correspondences: {x_i, z: {\mathbb R}^L_+ \rightarrow \mathcal{P}({\mathbb R}^L)}, i.e., {x_i(p) \subseteq {\mathbb R}^L_+} is the set of all allocations that maximize {u_i(x_i)} subject to {p\cdot x_i \leq p \cdot \omega_i}. Since {u_i} are linear functionals, we can calculate {x_i(p)} by a Fractional Knapsack algorithm: we sort commodities {\ell \in L} by {\frac{p_\ell}{u_{i\ell}}} and start buying in the cost-benefit order (the ones that provide more utility per buck spent). Most of the time there will be just one solution, but in points where {\frac{p_\ell}{u_{i\ell}} = \frac{p_k}{u_{i k}}}, then {x_i(p)} might be a convex region. This correpondence is upper hemicontinuous, which is the correspondence analogue to continuity for functions. As Wikipedia defines:

Definition 5 A correspondence {\Gamma : A \rightarrow B} is said to be upper hemicontinuous at the point {a \in A} if for any open neighbourhood {V} of {\Gamma(a)} there exists a neighbourhood {U} of a such that {\Gamma(x)} is a subset of {V} for all {x} in {U}.

It is not hard to see that {z} is upper hemicontinuous according to that definition. Our goal is to prove that there is one price vector {p} for which {\overline{\omega} \in x_i(p)} or: {0 \in z(p)}. To prove that we use Kakutani’s Fixed Point Theorem. Before we go into that, we’ll explore some other properties of {z}:

  • 0-Homogeneous: {z(\alpha p) = z(p), \forall \alpha > 0}
  • Walras’ Law: {p \cdot z(p) = 0}. For any {\sum_i (x_i - \omega_i) \in z(p)} we know {\sum_i (p \cdot x_i - p \cdot \omega_i) \leq 0} by the definition of {x_i}. So, if it not zero, some {i} has money surplus what is absurd given that preferences are strongly monotone.
  • Bounded: {z(p)} is bounded from below, i.e., {z_\ell(p) \geq -s, \forall p, \ell} for some {s > 0}. Simply take {s = \max \omega_i}
  • Boundary behavior: if {p^k \rightarrow p \neq 0} with {p_\ell = 0}, then {z_\ell(p^k) \rightarrow \infty}. That is clear from the fractional knapsack algorithm when one desirable item gets price zero.

Now, we are in shape for applying Kakutani’s Fixed Point Theorem:

Theorem 6 (Kakutani, 1941) If {f:A \rightarrow \mathcal{P}(A)} is an upper hemicontinuous correspondence such that {f(a)} is a convex non-empty set for all {a \in A} then {f} has a fixed point, i.e., {\exists x \in A} s.t. {x \in f(x)}.

Since prices are {0}-homogeneous, consider the simplex {\Delta = \{p \geq 0; \sum_\ell p_\ell = 1\}}, its relative interior {\Delta^0 = \{p > 0; \sum_\ell p_\ell = 1\}} and the boundary {\partial \Delta = \Delta \setminus \Delta^0}. Now we define the following price correcting correspondence {f:\Delta \rightarrow \mathcal{P}(\Delta)}.

If some price {p \in \Delta^0} is set, it generates demand {d \in z(p)}. For that demand, the price that would maximize profit would be {q}, i.e. {q\cdot d \geq q' \cdot d} for all {q' \in \Delta}. It is natural to re-adjust the prices to {q}. So we define for {p \in \Delta^0}:

\displaystyle f(p) = \{q \in \Delta; q \text{ is a best response price to some } d \in z(p)\}

and for {p \in \partial \Delta}:

\displaystyle f(p) = \{q \in \Delta; q \cdot p = 0\}

Now, I claim that this correspondence satisfies the conditions in Kakutani’s Theorem. We skip a formal proof of this fact, but this is intuitive for the interior {\Delta^0} – let’s give the intuition why this is true as we approach the boundary: if {\Delta^0 \ni p^k \rightarrow p \in \partial \Delta}, then {p^k_\ell \rightarrow 0}, therefore the demans explodes: {z_\ell(p^k) \rightarrow \infty} and as a result the best thing to do is to set the prices of those commodities much higher than the rest. Therefore, the price of the commodities whose demand explode are positive while the prices of the commodities where the price doesn’t get value zero.

Now, after waiving our hands about the upper continuity of {f}, we have by Kakutani’s Theorem a point {p \in \Delta} such that {p \in f(p)}. By the definition of {f} we must have {p \in \Delta^0} (because for {p \in \partial \Delta}, {p \notin f(p)}. Now, I claim {z(p) = 0}. In fact if {z(p) \neq 0}, still {p z(p) = 0} by Walras’ Law. So, if {z(p) \neq 0} then there is {\ell} with {z_\ell (p) < 0} and therefore {q_\ell = 0} for all {q \in f(p)}, and {f(p) \subseteq \partial \Delta}. For this reason {z(p) = 0}.

In the next blog post (or serie of blog posts, let’s see) we discuss issues related to the other questions: uniqueness, dynamics, game-theoretical considerations, …

Happy Perihelion

January 2nd, 2010 hussam No comments

Happy Perihelion from the Big Red Bits team.

Happy Perihelion

Categories: Comics Tags:

On Peer Review Blindness

November 22nd, 2009 hussam No comments

This started as a joke over dinner with some friends

Peer Review Overload(click on the picture for more legible fonts)

Coming to think about it … why stop at quintuple blindness? how about automated reviewers or mixing reviewers from different eras? Ok, I’ll stop here :-)

Categories: Comics Tags:

Bounded Degree Spanning Tree and an Uncrossing Lemma

November 17th, 2009 renatoppl 1 comment

I’ve been reading about the Bounded Degree Spanning Tree problem and I thought of writing some of what I am learning here. It illustrates a beautiful techique called Iterated Rounding and uses the combinatorial idea of uncrossing. I’ll try to give a high-level idea of the argument and give references on the details. The first result of this kind was given by Goemans (although there were previous results with weaker guarantees) by Goemans in Minimum Bounded Degree Spanning Trees, but the result based on iterated rounding and a subsequent improvement are due to Singh and Lau in a serie of papers. A main reference is Approximating minimum bounded degree spanning trees to within one of optimal.

The problem of bounded degree spanning tree is as follows: consider a graph {G = (V,E)} with edge weights and we for some nodes {W \subseteq V} a degree bound {b_v, v \in W}. We want to find, among the spanning trees for which the degree of {v} is {\leq b_v} the one with minimum cost. It is clearly a hard problem, since taking all weights equal to {1} and {b_v = 2} for all nodes is the Hamiltonian Path problem, which is NP-complete. We will get a different kind of approximation. Let OPT be the optimal solution: we will show an algorithm that gives a spanning tree of cost {\leq OPT} such that each node {v} has degree {\leq b_v + 2} (this can be improved to {b_v + 1} with a more sofisticated algorithm, also based on Iterated Rounding).

As always, the first step to design an approximation algorithm is to relax it to an LP. We consider the following LP:

\displaystyle \left. \begin{aligned} & \min \sum_{e\in E} c_e x_e \text{ s.t. } \\ & \qquad \left. \begin{aligned} & \sum_{e \in E} x_e = \vert V \vert - 1 \\ & \sum_{e \in E(S)} x_e \leq \vert S \vert - 1 & \forall S \subseteq V, \vert S \vert \geq 2 \\ & \sum_{e \in \delta(v)} x_e \leq b_v & \forall v \in W\\ & x_e \geq 0 & \forall e \in E \end{aligned} \right. \end{aligned} \right.

The first constraint expresses that in a spanning tree, there are at most {\vert V \vert - 1} edges, the second prevent the formation of cycles and the third guarantees the degree bounds. For {W = \emptyset} we have the standard Minimal Spanning Tree problem and for this problem the polytope is integral. With the degree bounds, we lose this nice property. We can solve this LP using the Ellipsoid Method. The separation oracle for the {\sum_{e \in E(S)} x_e \leq \vert S \vert - 1} is done by a flow computation.

Iterated Rounding

Now, let’s go ahead and solve the LP. It would be great if we had an integral solution: we would be done. It is unfortunately not the case, but we can still hope it is almost integral in some sense: for example, some edges are integral and we can take them to the final solution and recurse the algorithm on a smaller graph. This is not far from truth and that’s the main idea of the iterated rounding. We will show that the support of the optimal solution {E(x) = \{e \in E; x_e > 0\}} has some nice structure. Consider the following lemma:

Lemma 1 For any basic solution {x} of the LP, either there is {v} with just one incident edge in the support {E(x)} or there is one {v \in W} such that that at most {3} edges are incident to it.

If we can prove this lemma, we can solve the problem in the following way: we begin with an empty tree: then we solve the LP and look at the support {E(x)}. There are two possibilities according to the lemma:

  • If there is one node {v} with just one edge {e = (u,v)} incident to it in the support, we add it to the tree, remove {v} from {V}, decrease {b_u}, make {E = E(x)} (the trick is to remove in each iteration edges from {E} that are not in the support. Clearly, removing those edges doesn’t hurt the objective value) and run the algorithm again. Notice that the LP called in the recursion has value less or equal then the actual LP {- c_e}. So if by induction we get a spanning tree respecting the new degree bounds plus two and value less or equal than the new LP value, we can just add {e} and we have a solution with value less or equal than the one of the original LP respecting the degree bounds plus two.
  • Otherwise, there is one node {v \in W} that has degree {3} in the support. So, we just remove that degree bound on that vertex (i.e. remove {v} from {W}), make {E = E(x)} (again,eliminate the edges not in the support) and run the algorithm again. Clearly, if one node is still in {W}, it has {b_v \geq 1}, since there are only three edges in the support, there will be for the rest of the computation, just three edges incident to it, so there will be at most three edges more incident to it. So it will exceed its original {b_v} by at most {2}.

The algorithm eventually stops, since in each iteration we have less edges or less nodes in {W} and the solution is as desired. The main effort is therefore to prove the lemma. But before, let’s look at the lemma: it is of the following kind: “any basic solution of the LP has some nice properties, which envolve having a not too big (at least in some point) support”. So, it involves proving that the support is not too large. That is our next task as we are trying to prove the lemma. And we will be done with:

Theorem 2 The algorithm described above produces a spanning tree of cost {\leq Z^*} (the LP values and therefore {\leq OPT})in which each node {v \in W} has degree {\leq b_v + 2}.

Bounding the size of the support

We would like now to prove some result like the Lemma above: that in the solution of the LP we have either one {v} with degree {1} in {E(x)} or we have a node in {W} with degree {\leq 3}. First, we suppose the opposite, that {(V,E(x))} has all the nodes with degree {\geq 2} and all the nodes in {W} have degree {\geq 4}. This implies that we have a large number of edges in the support. From the degrees, we know that:

\displaystyle \vert E(x) \vert \geq \frac{1}{2} \left( 2( \vert V \vert - \vert W \vert ) + 4 \vert W \vert \right) = \vert V \vert + \vert W \vert

We want to prove that the support {\vert E(x) \vert} of the LP can’t be too large. The first question is: how to estimate the size of the support of a basic solution. The constraints look like that:

bst_fig1

A basic solution can be represented by picking {\vert E \vert} rows of the matrix and making them tight. So, if we have a general {Ax \leq b} LP, we pick some submatrix {A'} of {A} which is {n \times n} and the basic solution is just {x = A'^{-1} b'}. The lines of matrix {A'} can be of three types: they can be {\chi_S}, which are corresponding to {\sum_{e \in E(S)} x_e \leq \vert S \vert - 1}, {\chi_v} that correspond to {\sum_{e \in \delta(v)} x_e \leq b_v} or {\delta_e} corresponding to {x_e \geq 0}. There are {\vert E \vert} vectors in total. The size of the support {E(x)} is smaller or equal the number of rows of the form {\chi_S, \chi_v} in the basic solution. Therefore the idea to bound the size of the support is to prove that “all basic solutions can be represented by a small number of rows in the form {\chi_S, \chi_v}. And this is done using the following:

Lemma 3 Assuming {E = E(x)}, for any basic solution {x}, there is {Z \subseteq W} and a family {\mathcal{S}} of sets such that:

  1. The restrictions correspondent to {S \in \mathcal{S}} and {v \in Z} are tight for {x}
  2. {\{ \chi_S; S \in \mathcal{S} \} \cup \{ \chi_v; v \in Z \}} is an independent set
  3. {\vert \mathcal{S} \vert + \vert Z \vert = \vert E(x) \vert}
  4. {\mathcal{S}} is a laminar family

The first 3 items are straightfoward properties of basic solutions. The fourth one, means that for two sets {S_1, S_2 \in \mathcal{S}}, one of three things happen: {S_1 \subseteq S_2}, {S_2 \subseteq S_1} or {S_1 \cap S_2 = \emptyset}. Now, we based on the previous lemma and in the following result that can be easily proved by induction, we will prove Lemma 1.

Lemma 4 If {\mathcal{S}} is a laminar family over the set {V} where each set {S \in \mathcal{S}} contains at least {2} elements, then {\vert \mathcal{S} \vert \leq \vert V \vert - 1}.

Now, the proof of Lemma 1 is easy. Let’s do it and then we come back to prove Lemma 3. Simply see that {\vert E(x) \vert = \vert \mathcal{S} \vert + \vert Z \vert \leq \vert V \vert - 1 + \vert W \vert} what contradicts {\vert E(x) \vert \geq \vert V \vert + \vert W \vert}.

Uncrossing argument

And now we arrive in the technical heart of the proof, which is proving Lemma 3. This says that given any basic solution, given any feasible solution, we can write it as a “structured” basic solution. We start with any basic feasible solution. This already satifies (1)-(3), then we need to change that solution to satisfy condition (4) as well. We need to get rid crossing elements, i.e., {S,T \in \mathcal{S}} in the form:

bst_fig2

We do that by the means of the:

Lemma 5 (Uncrossing Lemma) If {S} and {T} are intersecting and tight (tight in the sense that their respective constraint is tight), then {S \cup T} and {S \cap T} are also tight and:

\displaystyle \chi_{S \cap T} + \chi_{S \cup T} = \chi_S + \chi_T

Which corresponds to that picture:

bst_fig3

Proof: First, we note that {x(E(S))} is a supermodular function, i.e.:

\displaystyle x(E(S)) + x(E(T)) \leq x(E(S \cap T)) + x(E(S \cup T))

We can see that by case analysis. Every edge appearing in the left side appears in the right side with at least the same multiplicity. Notice also that it holds with strict inequality iff there are edges from {S\setminus T} to {T \setminus S}. Now, we have:

\displaystyle \begin{aligned} (\vert S \vert - 1) + (\vert T \vert - 1) & = (\vert S \cap T \vert - 1) + (\vert S \cup T \vert - 1) \geq \\ & \geq x(E(S \cap T)) + x(E(S \cup T)) \geq \\ & \geq x(E(S)) + x(E(T)) = (\vert S \vert - 1) + (\vert T \vert - 1) \end{aligned}

where the first relation is trivial, the second is by feasibility, the third is by supermodularity and the lastone is by tightness. So, all hold with equality and therefore {S \cap T} and {S \cup T} are tight. We also proved that:

\displaystyle x(E(S \cap T)) + x(E(S \cup T)) = x(E(S)) + x(E(T))

so there can be no edge from {S\setminus T} to {T \setminus S} in {E(x)} and therefore, thinking just of edges in {E(x)} we have:

\displaystyle \chi_{S \cap T} + \chi_{S \cup T} = \chi_S + \chi_T

\Box

Uncrossing arguments are found everywhere in combinatorics. Now, we show how the Uncrossing Lemma can be used to prove Lemma 1:

Proof: Let {x} be any basic solution. It can be represented by a pair {(Y, \mathcal{C})} where {Y \subseteq W} and {\mathcal{C}} is a family of sets. We will show that the same basic solution can be represented by {(Y, \mathcal{L})} where {\mathcal{L}} is a laminar family and has the same size of {\mathcal{C}}.

Let {\mathcal{S}} be all sets that are tight under {x} and {\mathcal{L}} a maximal laminar family of tights sets in {\mathcal{S}}, such that {\{\chi_S; S \in \mathcal{L} \} \cup \{\chi_v; v \in Z \}} are independent. I claim that {\vert \mathcal{L} \vert = dim(span(\mathcal{S}))}.

In fact, suppose {\vert \mathcal{L} \vert < dim(span(\mathcal{S}))}, then there are sets of {\mathcal{S}} we could add to {\mathcal{L}} without violating independence – the problem is that those sets would cross some set. Pick such {S \in \mathcal{S}} intersecting fewer possible sets in {\mathcal{L}}. The set {S} intersects some {T \in \mathcal{L}}. Since both are tight we can use the Uncrossing Lemma and we get:

\displaystyle \chi_{S \cap T} + \chi_{S \cup T} = \chi_S + \chi_T

since {\chi_S \notin span(\mathcal{L})}, we can’t have simultaneously {\chi_{S \cap T}} and {\chi_{S \cup T}} in {span(\mathcal{L})}. Let’s consider two cases:

  1. {\chi_{S \cap T} \notin span(\mathcal{L})}, then {S \cap T \in span(\mathcal{S})} is in {span(\mathcal{S}) \setminus span(\mathcal{L})} and intersects fewer sets of {\mathcal{L}} than {S}, since all sets that intersect {S \cap T} in {\mathcal{L}} must intersect {S} as well (since no set can cross {T}).bst_fig4
  2. {\chi_{S \cup T} \notin span(\mathcal{L})}, then {S \cup T \in span(\mathcal{S})} is in {span(\mathcal{S}) \setminus span(\mathcal{L})} and intersects fewer sets of {\mathcal{L}} than {S}, since all sets that intersect {S \cup T} in {\mathcal{L}} must intersect {S}.

In either case we have a contradiction, so we proved that {\vert \mathcal{L} \vert = dim(span(\mathcal{S}))}. So we can generate all the space of tight sets with a laminar family. \Box

And this finishes the proof. Let’s go over all that we’ve done: we started with an LP and we wanted to prove that the support of each solution was not too large. We wanted that because we wanted to prove that there was one node with degree one in the support or a node in {W} with small ({\leq 3}) degree. To prove that the degree of the support is small, we show that any basic solution has a representation in terms of a laminar family. Then we use the fact that laminar families can’t be very large families of sets. For that, we use the celebrated Uncrossing Lemma.

Note: Most of this is based on my notes on David Williamson’s Approximation Algorithms class. I spent some time thinking about this algorithm and therefore I decided o post it here.

Looking at probability distributions

November 12th, 2009 renatoppl No comments

I’ve been taking two classes in probability this semester and in those I saw the proofs of a lot of interesting theorems which I knew about previously but I have never seen the proof, as the Central Limit Theorem, the Laws of Large Numbers and so on… Also, some theory which is looks somewhat ugly in the undergrad courses becomes very clear with the proper formal treatment. Today I was thinking what was the main take-home message that a computer scientist could take from those classes and. at ;east for me, this message is the various ways of looking to probability distributions. I’ve heard about moments, Laplace transform, Fourier transform and other tools like that, but I never realized before their true power. Probably still today, most of their true power is hidden from me, but I am starting to look at them in a different way. Let me try to go over a few examples of different ways we can look at probability distributions and show cases where they are interesting.

Most of ways of looking at probability distributions are associated with multiplicative system: a multiplicative system {Q} is a set of real-valued functions with the property that if {f_1, f_2 \in Q} then {f_1 \dot f_2 \in Q}. Those kinds of sets are powerful because of the Multiplicative Systems Theorem:

Theorem 1 (Multiplicative Systems Theorem) If {Q} is a multiplicative system, {H} is a linear space containing {1} (the constant function {1}) and is closed under bounded convergence, then {Q \subseteq H} implies that {H} contains all bounded {\sigma(Q)}-measurable functions.

The theorem might look a bit cryptic if you are not familiar with the definitions, but it boils down to the following translation:

Theorem 2 (Translation of the Multiplicative Systems Theorem) If {Q} is “general” multiplicative system, {X} and {Y} are random variable such that {\mathop{\mathbb E}{f(X)} = \mathop{\mathbb E}{f(Y)}} for all {f \in Q} then {X} and {Y} have the same distribution.

where general excludes some troublesome cases like {Q = \{1\}} or all constant functions, for example. In technical terms, we wanted {\sigma(Q) = \sigma\{f^{-1}((-\infty,a]); a\in {\mathbb R}, f \in Q \}} to be the Borel {\sigma}-algebra. But let’s not worry about those technical details and just look at the translated version. We now, discuss several kinds of multiplicative systems:

  1. The most common description of the a random variable is by the cummulative distribution function {F(u) = P\{X \leq u\}}. This is associated with {Q = \{ 1_{(-\infty,u]}(x); u \in {\mathbb R} \}} notice that simply {F(u) = \mathop{\mathbb E}{1_{(-\infty,u]}}}.
  2. We can characterize a random variable by its moments: the variable {X} is characterized by the set {M_n = \int_{\mathbb R} x^n \mu_X(dx)}. Given the moemnts {M_1, M_2, \hdots}, the variable is totally characterized, i.e., if two variables have the same moments, then they have the same distribution by the Multiplicative Systems Theorem. This description is associated with the system {Q = \{x^n; n = 1, 2, \hdots\}}
  3. Moment Generating Function: If {X} is a variable that assumes only integer values, we can describe the it as {p_0, p_1, p_2, \hdots}, where {p_n = P\{ X = n \}}. An interesting way of representing those probabilities is as the moment generating function {\psi_X(z) = \mathop{\mathbb E}{z^X} = \sum_{n=0}^\infty p_n z^n}. This is associated with the multiplicative system {Q = \{z^x, 0 \leq z < 1\}}.Now suppose we are given two discrete independent variables {X} and {Y}. What do we know about {X+Y}. It is easy to know its expectation, its variance, … but what about more complicated things? What is the distribution of {X+Y} ? Moment generating functions answer this question very easily, since:

    \displaystyle \psi_{X+Y} (z) = \mathop{\mathbb E} [z^{X+Y}] = \mathop{\mathbb E} [z^{X}] \cdot \mathop{\mathbb E}[z^{Y}] = \psi_X(z) \cdot \psi_Y(z)

    If we know moment generating functions, we can calculate expectation very easily, since {\mathop{\mathbb E}[X] = \psi'_X(1)}. For example, suppose we have a process like that: there is one bacteria in time {t = 0}. In each timestep, either this bacteria dies (with probability {p_0}), continues alive without reproducing (with probability {p_1} or has {k} offsprings (with probability {p_{1+k}}). In that case {\sum_0^\infty p_n = 1}. Each time, the same happens, independently with each of the bacteria alive in that moment. The question is, what is the expected number of bacteria in time {t} ?

    It looks like a complicated problem with just elementary tools, but it is a simple problem if we have moment generating functions. Just let {X_{ti}} be the variable associated with the {i^{th}} bacteria of time {t}. It is zero if it dies, {1} if it stays the same and {k+1} if it has {k} offsprings. Let also {N_t} be the number of bacteria in time {t}. We want to know {\mathop{\mathbb E} N_t}. First, see that:

    \displaystyle N_t = \sum_{i=1}^{N_{t-1}} X_{ti}

    Now, let’s write that in terms of moment generating functions:

    \displaystyle \psi_{N_t}(z) = \mathop{\mathbb E}{z^{X_{t1} + \hdots + X_{tN_t}}} = \sum_{k=0}^\infty P\{N_{t-1} = k\} \cdot \mathop{\mathbb E}[z^{X_{t1} + \hdots + X_{tN_t}} \vert N_t = k]

    which is just:

    \displaystyle \psi_{N_t}(z) = \sum_{k=0}^\infty P\{N_{t-1} = k\} \cdot \mathop{\mathbb E}[z^{X_{t1} + \hdots + X_{tk}}] = \sum_{k=0}^\infty P\{N_{t-1} = k \} \cdot \psi_X (z)^k

    since the variables are all independent and identically distributed. Now, notice that:

    \displaystyle \psi_{N_{t-1}}(z) = \sum_{k=0}^\infty P\{N_{t-1} = k\} \cdot z^k

    by the definition of moment generating function, so we effectively proved that:

    \displaystyle \psi_{N_{t}}(z) = \psi_{N_{t-1}}(\psi_X(z)) = \psi_X \psi_X \hdots \psi_X(z)

    We proved that {\psi_{N_{t}}(z)} is just {\psi(z)} iterated {t} times. Now, calculating the expectation is easy, using the fact that {\psi_X(1) = 1} and {\psi'_X(1) = \mathop{\mathbb E} X}. Just see that: {\mathop{\mathbb E} N_t = \psi_{N_{t}}'(1) = \psi_{N_{t-1}}'(\psi_X(1)) \cdot \psi'_X(1) = \psi_{N_{t-1}}'(1) \cdot \mathop{\mathbb E} X }. Then, clearly {\mathop{\mathbb E} N_t = (\mathop{\mathbb E} X)^t}. Using similar technique we can prove a lot more things about this process, just by analyzing the behavior of the moment generating function.

  4. Laplace Tranform: Now, moving to continuous variables, if {X} is a continuous non-negative variable we can define its Laplace tranform as: {\phi_X(u) = \mathop{\mathbb E} [e^{u X}] = \int_0^\infty e^{ux} \mu_X(dx)}, where {\mu_X(dx)} stands for the distribution of {X}, for example, {\rho_X(x) dx}. This is associated with the multiplicative system {Q = \{e^{ux}; u \geq 0\}}. Again, by the Multiplicative Systems Theorem, if {\phi_X = \phi_Y}, then the two variables have the same distribution. The Laplace tranform has the same nice properties as the Moment Generating Function, for example, {\phi_{X+Y}(u) = \mathop{\mathbb E}[e^{(X+Y)u}] = \mathop{\mathbb E}[e^{Xu}] \cdot \mathop{\mathbb E}[e^{Yu}] = \phi_X(u) \cdot \phi_Y(u)}.And it allows us to do similar tricks than the one I just showed for Moment Generating Functions. One common trick that is used, for example, in the proof of Chernoff bounds is, given independent non-negative random variables:

    \displaystyle P\left\{\sum_i X_i > u\right\} = P\left\{e^{\sum_i X_i} > e^u\right\} \leq \frac{\mathop{\mathbb E}[e^{\sum_i X_i} ]}{e^u} = \frac{\prod_i \mathop{\mathbb E}[e^{X_i} ]}{e^u}

    where we also used Markov Inequality: {\mathop{\mathbb E}[X] \geq \beta P\{X \geq \beta\}}. Passing to the Laplace transform is the main ingredient in the Chernoff bound and it allows us to sort of “decouple” the random variables in the sum. There are several other cases where the Laplace transform proves itsself very useful and turns things that looked very complicated when we saw in undergrad courses into simple and clear things. One clear example of that is the motivation for the Poisson random variable:

    If {T_i} are independend exponentially distributed random variables with mean {\lambda}, then {\rho_{T_i}(t) = \lambda e^{- \lambda t}}. An elementary calculation shows that its laplace transform is {\phi_{T_i}(u) = \frac{\lambda}{\lambda+u}}. Let {S_n = T-0 + T_1 + T_2 + \hdots + T_n}, i.e., the time of the {(n+1)^{th}} arrival. We want to know what is the distribution of {S_n}. How to do that?

    \displaystyle \phi_{S_n}(u) = \mathop{\mathbb E}[e^{u(T_0 + \hdots + T_n)}] = \phi_T(u)^{n+1} = \left( \frac{\lambda}{\lambda+u} \right)^{n+1}

    Now, we need to find {\rho_{S_n}} such that {\int_0^\infty \rho_{S_n}(t) e^{iu} dt = \left( \frac{\lambda}{\lambda+u} \right)^{n+1}}. Now it is just a matter of solving this equation and we get: {\rho_{S_n}(t) = \frac{\lambda (\lambda t)^n}{n!} e^{-\lambda t}}. Now, the Poisson varible {N_t} measures the number of arrivals in {[0,t]} and therefore:

    \displaystyle \begin{aligned} P\{N_t = n\} & = P\{S_{n-1} < t < S_n\} = P\{S_n > t\} - P\{S_{n-1} \geq t\} \\ & = \int_t^\infty \rho_{S_n}(t) dt - \int_t^\infty \rho_{S_{n-1}}(t) dt = \frac{(\lambda t)^n}{n!} e^{-\lambda t} \end{aligned}

  5. Characteristic Function or Fourier Tranform: Taking {Q = \{e^{i\lambda x}; x \in \mathbb{C}\}} we get the Fourier Transform: {f_X(\lambda) = \mathop{\mathbb E} e^{i \lambda x}} which also has some of the nice properties of the previous ones and some additional ones. The characteristic functions were the main actors in the development of all the probability techniques that lead to the main result of 19th century Probability Theory: the Central Limit Theorem. We know that moment generating functions and Laplace transforms completely characterize the distributions, but it is not clear how to recover a distribution once we have a transform. For Fourier Transform there is a cleas and simple way of doing that by means of the Inversion Formula:

    \displaystyle \rho_X(x) = \frac{1}{2 \pi} \int_{\mathbb R} e^{-i \lambda x} f_X(\lambda) d\lambda

    One fact that always puzzled me was: why is the normal distribution {\rho_N (x) = \frac{1}{\sqrt{2 \pi}} e^{-x^2/2} } so important? What does it have in special to be the limiting distribution in the Central Limit Theorem, i.e., if {X_1, X_2, \hdots, X_n} is a sequence of independent random variables, {S_n = \sum_1^n X_i - \sum_1^n \mathop{\mathbb E} X_i} then {S_n / \sqrt{var S_n} \rightarrow N(0,1)} under some natural conditions on the variables. The reason the normal is so special is because it is a “fixed point” for the Fourier Transform. We can see that {f_N(\lambda) = e^{-\lambda_2/2}}. And there we have something special about it that makes me believe the Central Limit Theorem.

————————-

This blog post was based on lectures by Professor Dynkin at Cornell.

Categories: theory Tags: ,

Random Spanning Trees

November 3rd, 2009 renatoppl No comments

BigRedBits is again pleased to have Igor Gorodezky as a guest blogger directly from UCLA. I leave you with his excelent post on the Wilson’s algorithm.

——————————————

Igor again, with another mathematical dispatch from UCLA, where I’m spending the semester eating and breathing combinatorics as part of the 2009 program on combinatorics and its applications at IPAM. In the course of some reading related to a problem with which I’ve been occupying myself, I ran across a neat algorithmic result – Wilson’s algorithm for uniformly generating spanning trees of a graph. With Renato’s kind permission, let me once again make myself at home here at Big Red Bits and tell you all about this little gem.

The problem is straightforward, and I’ve essentially already stated it: given an undirected, connected graph {G}, we want an algorithm that outputs uniformly random spanning trees of {G}. In the early ’90s, Aldous and Broder independently discovered an algorithm for accomplishing this task. This algorithm generates a tree {T} by, roughly speaking, performing a random walk on {G} and adding the edge {(u,v)} to {T} every time that the walk steps from {u} to {v} and {v} is a vertex that has not been seen before.

Wilson’s algorithm (D. B. Wilson, “Generating random spanning trees more quickly than the cover time,” STOC ‘96) takes a slightly different approach. Let us fix a root vertex {r}. Wilson’s algorithm can be stated as a loop-erased random walk on {G} as follows.

Algorithm 1 (Loop-erased random walk) Maintain a tree {T}, initialized to consist of {r} alone. While there remains a vertex {v} not in {T}: perform a random walk starting at {v}, erasing loops as they are created, until the walk encounters a vertex in {T}, then add to {T} the cycle-erased simple path from {v} to {T}.

We observe that the algorithm halts with probability 1 (its expected running time is actually polynomial, but let’s not concern ourselves with these issues here), and outputs a random directed spanning tree oriented towards {r}. It is a minor miracle that this tree is in fact sampled uniformly from the set of all such trees. Let us note that this offers a solution to the original problem, as sampling {r} randomly and then running the algorithm will produce a uniformly generated spanning tree of {G}.

It remains, then, to prove that the algorithm produces uniform spanning trees rooted at {r} (by which we mean directed spanning trees oriented towards {r}). To this we dedicate the remainder of this post.

1. A “different” algorithm

Wilson’s proof is delightfully sneaky: we begin by stating and analyzing a seemingly different algorithm, the cycle-popping algorithm. We will prove that this algorithm has the desired properties, and then argue that it is equivalent to the loop-erased random walk (henceforth LERW).

The cycle-popping algorithm works as follows. Given {G} and {r}, associate with each non-root vertex {v} an infinite stack of neighbors. More formally, to each {v \neq r} we associate

\displaystyle  \mathcal S_v = [u_{0}, u_{1}, \dots ]

where each {u_{i}} is uniformly (and independently) sampled from the set of neighbors of {v}. Note that each stack is not a random walk, just a list of neighbors. We refer to the left-most element above as the top of {\mathcal S_v}, and by popping the stack {\mathcal S_v} we mean removing this top vertex from {\mathcal S_v}.

Define the stack graph {G_{\mathcal S}} to be the directed graph on {V} that has an edge from {v} to {u} if {u} is at the top of the stack {\mathcal S_v}. Clearly, if {G} has {n} vertices then {G_{\mathcal S}} is an oriented subgraph of {G} with {n-1} edges. The following lemma follows immediately.

Lemma 1 Either {G_{\mathcal S}} is a directed spanning tree oriented towards {r} or it contains a directed cycle.

If there is a directed cycle {C} in {G_{\mathcal S}} we may pop it by popping {\mathcal S_v} for every {v \in C}. This eliminates {C}, but of course might create other directed cycles. Without resolving this tension quite yet, let us go ahead and formally state the cycle-popping algorithm.

Algorithm 2 (Cycle-popping algorithm) Create a stack {\mathcal S_v} for every {v \neq r}. While {G_{\mathcal S}} contains any directed cycles, pop a cycle from the stacks. If this process ever terminates, output {G_{\mathcal S}}.

Note that by the lemma, if the algorithm ever terminates then its output is a spanning tree rooted at {r}. We claim that the algorithm terminates with probability 1, and moreover generates spanning trees rooted at {r} uniformly.

To this end, some more definitions: let us say that given a stack {\mathcal S_v}, the vertex {u_i} is at level {i}. The level of a vertex in a stack is static, and is defined when the stack is created. That is, the level of {u_i} does not change even if {u_i} advances to the top of the stack as a result of the stack getting popped.

We regard the sequence of stack graphs produced by the algorithm as leveled stack graphs: each non-root vertex {v} is assigned the level of its stack. Observe that the level of {v} in {G_{\mathcal S}} is the number of times that {\mathcal S_v} has been popped. In the same way, we regard cycles encountered by the algorithm as leveled cycles, and we can regard the tree produced by the algorithm (if indeed one is produced) as a leveled tree.

The analysis of the algorithm relies on the following key lemma (Theorem 4 in Wilson’s paper), which tells us that the order in which the algorithm pops cycles is irrelevant.

Lemma 2 For a given set of stacks, either the cycle-popping algorithm never terminates, or there exists a unique leveled spanning tree {T} rooted at {r} such that the algorithm outputs {T} irrespective of the order in which cycles are popped.

Proof: Fix a set of stacks {\{ \mathcal S_v \}_{v \neq r}}. Consider a leveled cycle {C} that is pop-able, i.e.~there exist leveled cycles {C_1, C_2, \dots, C_k=C} that can be popped in sequence. We claim that if the algorithm pops any cycle not equal to {C}, then there still must exist a series of cycles that ends in {C} and that can be popped in sequence. In other words, if {C} is pop-able then it remains pop-able, no matter which cycles are popped, until {C} itself is actually popped.

Let {C'} be a cycle popped by the algorithm. If {C'=C_1} then the claim is clearly true. Also, if {C'} shares no vertices with {C_1, \dots, C_k}, then the claim is true again. So assume otherwise, and let {C_i} be the first in the series {C_1, \dots, C_k} to share a vertex with {C'}. Let us show that {C'=C_i} by contradiction.

If {C_i \neq C'}, then {C_i} and {C'} must share a vertex {w} that has different successors in {C'} and {C_i}. But by definition of {C_i}, none of the {C_1, \dots, C_{i-1}} contain {w}, and this implies that {w} has the same level in {C'} and {C_i}. Therefore its successor in both cycles is the same, a contradiction. This proves {C_i=C'}.

Moreover, the argument above proves that {C_i} and {C'} are equal as leveled cycles (i.e.~every vertex has the same level in both cycles). Hence

\displaystyle  C'=C_i, C_1, C_2, \dots, C_{i-1}, C_{i+1}, \dots, C_k=C

is a series of cycles that can be popped in sequence, which proves the original claim about {C}.

We conclude that given a set of stacks, either there is an infinite number of pop-able cycles, in which case there will always be an infinite number and the algorithm will never terminate, or there is a finite number of such cycles. In the latter case, every one of these cycles is eventually popped, and the algorithm produces a spanning tree {T} rooted at {r}. The level of each non-root vertex in {T} is given by (one plus) the number of popped cycles that contained {v}. \Box

Wilson summarizes the cycle-popping algorithm thusly: “[T]he stacks uniquely define a tree together with a partially ordered set of cycles layered on top of it. The algorithm peels off these cycles to find the tree.”

Theorem 3 The cycle-popping algorithm terminates with probability 1, and the tree that it outputs is a uniformly sampled spanning tree rooted at {r}.

Proof: The first claim is easy: {G} has a spanning tree, therefore it has a directed spanning tree oriented towards {r}. The stacks generated in the first step of the algorithm will contain such a tree, and hence the algorithm will terminate, with probability 1.

Now, consider a spanning tree {T} rooted at {r}. We’ll abuse notation and let {T} be the event that {T} is produced by the algorithm. Similarly, given a collection of leveled cycles {\mathcal C}, we will write {\mathcal C} for the event that {\mathcal C} is the set of leveled cycles popped by the algorithm before it terminates. Finally, let {\mathcal C \wedge T} be the event that the algorithm popped the leveled cycles in {\mathcal C} and terminated, with the resulting leveled tree being equal to {T}.

By the independence of the stack entries, we have {\Pr[\mathcal C \wedge T] = \Pr[\mathcal C] \cdot p}, where {p} is the probability that the algorithm’s output is a leveled version of {T}, a quantity which a moment’s reflection will reveal is independent of {T}. Now,

\displaystyle  \Pr[T] = \sum_{\mathcal C} \Pr[\mathcal C \wedge T] = p \sum_{\mathcal C} \Pr[\mathcal C]

which, as desired, is independent of {T}. \Box

2. Conclusion

We have shown that the cycle-popping algorithm generates spanning trees rooted at {r} uniformly. It remains to observe that the LERW algorithm is nothing more than an implementation of the cycle-popping algorithm! Instead of initially generating the (infinitely long) stacks and then looking for cycles to pop, the LERW generates stack elements as necessary via random walk (computer scientists might recognize this as the Principle of Deferred Decisions). If the LERW encounters a loop, then it has found a cycle in the stack graph {G_{\mathcal S}} induced by the stacks that the LERW has been generating. Erasing the loop is equivalent to popping this cycle. We conclude that the LERW algorithm generates spanning trees rooted at {r} uniformly.

Categories: theory Tags: , ,

More about hats and auctions

October 29th, 2009 renatoppl No comments

In my last post about hats, I told I’ll soon post another version with some more problems, which I ended up not doing and would talk a bit more about those kind of problems. I ended up not doing, but here are a few nice problems:

Those {n} people are again a room, each with a hat which is either black or white (picked with probability {\frac{1}{2}} at random) and they can see the color of the other people’s hats but they can’t see their own color. They write in a piece of paper either “BLACK” or “WHITE”. The whole team wins if all of them get their colors right. The whole team loses, if at least one writes the wrong color. Before entering the room and getting the hats, they can strategyze. What is a strategy that makes them win with {\frac{1}{2}} probability?

If they all choose their colors at random, the probability of winning is very small: {\frac{1}{2^n}}. So we should try to correlate them somehow. The solution is again related with error correcting codes. We can think of the hats as a string of bits. How to correct one bit if it is lost? The simple engineering solution is to add a parity check. We append to the string {x_0, x_1, \hdots, x_n} a bit {y = \sum_i x_i \mod 2}. So, if bit {i} is lost, we know it is {x_i = (y + \sum_{j \neq i} x_j) \mod 2}. We can use this idea to solve the puzzle above: if hats are places with {\frac{1}{2}} probability, the parity check will be {0} with probability {\frac{1}{2}} and {1} with probability {\frac{1}{2}}. They can decide before hand that everyone will use {y = 0} and with probability {\frac{1}{2}} they are right and everyone gets his hat color right. Now, let’s extend this problem in some ways:

The same problem, but there are {k} hat colors, they are choosen independently with probability {\frac{1}{k}} and they win if everyone gets his color right. Find a strategy that wins with probability {\frac{1}{k}}.

There are again {k} hat colors, they are choosen independently with probability {\frac{1}{k}} and they win if at least a fraction {f} ({0 < f < 1}) of the people guesses the right color. Find a strategy that wins with probability {\frac{1}{fk}}.

Again to the problem where we just have BLACK and WHITE colors, they are chosen with probability {\frac{1}{2}} and everyone needs to find the right color to win, can you prove that {\frac{1}{2}} is the best one can do? And what about the two other problems above?

The first two use variations of the parity check idea in the solution. For the second case, given any strategy of the players, for each string {x \in \{0,1\}^n} they have probability {p_x}. Therefore the total probability of winning is {\frac{1}{2^n}\sum_{x \in \{0,1\}^n} p_x}. Let {x' = (1-x_1, x_2, \hdots, x_n)}, i.e., the same input but with the bit {1} flipped. Notice that the answer of player {1} is the same (or at least has the same probabilities) in both {x} and {x'}, since he can’t distinguish between {x} and {x'}. Therefore, {p_{x} + p_{x'} \leq 1}. So,

\displaystyle 2 \frac{1}{2^n}\sum_{x \in \{0,1\}^n} p_x = \frac{1}{2^n}\sum_{x \in \{0,1\}^n} p_x + \frac{1}{2^n}\sum_{x \in \{0,1\}^n} p_x' \leq 1

. This way, no strategy can have more than {\frac{1}{2}} probability of winning.

Another variation of it:

Suppose now we have two colors BLACK and WHITE and the hats are drawn from one distribution {D}, i.e., we have a probability distribution over {x \in \{0,1\}^n} and we draw the colors from that distribution. Notice that now the hats are not uncorrelated. How to win again with probability {\frac{1}{2}} (to win, everyone needs the right answer).

I like a lot those hat problems. A friend of mine just pointed out to me that there is a very nice paper by Bobby Kleinberg generalizing several aspects of hat problems, for example, when players have limited visibility of other players hats.

hats1

I began being interested by this sort of problem after reading the Derandomization of Auctions paper. Hat guessing games are not just a good model for error correcting codes, but they are also a good model for truthful auctions. Consider an auction with a set {N} single parameter agents, i.e., an auction where each player gives one bid {b_i} indicating how much he is willing to pay to win. We have a set of constraints: {\mathcal{X} \subseteq 2^N} of all feasible allocations. Based on the bids {(b_i)_{i \in N}} we choose an allocation {S \in \mathcal{X}} and we charge payments to the bidders. An example of a problem like this is the Digital Goods Auction, where {\mathcal{X} = 2^N}.

In this blog post, I discussed the concept of truthful auction. If an auction is randomized, an universal truthful auction is an auction that is truthful even if all the random bits in the mechanism are revealed to the bidders. Consider the Digital Goods Auction. We can characterize universal truthful digital goods auction as bid-independent auctions. A bid-independent auction is given by function {f_i(b_{-i})}, which associated for each {b_{-i}} a random variable {f_i(b_{-i})}. In that auction, we offer the service to player {i} at price {f_i(b_{-i})}. If {b_i \geq f_i(b_{-i})} we allocate to {i} and charge him {f_i(b_{-i})}. Otherwise, we don’t allocate and we charge nothing.

It is not hard to see that all universal truthful mechanisms are like that: if {x_i(b_i)} is the probability that player {i} gets the item bidding {b_i} let {U} be an uniform random variable on {[0,1]} and define {f_i(b_{-i}) = x_i^{-1}(U)}. Notice that here {x_i(.) = x_i(., b_{-i})}, but we are inverting with respect to {b_i}. It is a simple exercise to prove that.

With this characterization, universal truthful auctions suddenly look very much like hat guessing games: we need to design a function that looks at everyone else’s bid but not on our own and in some sense, “guesses” what we probably have and with that calculated the price we offer. It would be great to be able to design a function that returns {f(b_{-i}) = b_i}. That is unfortunately impossible. But how to approximate {b_i} nicely? Some papers, like the Derandomization of Auctions and Competitiveness via Consensus use this idea.

Cayley-Hamilton Theorem and Jordan Canonical Form

October 28th, 2009 renatoppl 2 comments

I was discussing last week with my officemates Hu Fu and Ashwin about the Cayley-Hamilton Theorem. The theorem is the following, given an {n \times n} matrix {A} we can define its characteristic polynomial by {p_A(\lambda) = \det(A - I\lambda)}. The Cayley-Hamilton Theorem says that {p_A(A) = 0}. The polynomiale is something like:

\displaystyle p_A(x) = a_k x^k + a_{k-1} x^{k-1} + \hdots + a_1 x^1 + a_0

so we can just see it as a formal polynomial and think of:

\displaystyle p_A(A) = a_k A^k + a_{k-1} A^{k-1} + \hdots + a_1 A + a_0 I

which is an {n \times n} matrix. The theorem says it is the zero matrix. We thought for a while, looked in the Wikipedia, and there there were a few proofs, but not the one-line proof I was looking for. Later, I got this proof that I sent to Hu Fu:

Write the matrix {A} in the basis of its eigenvectors, then we can write {A = \Gamma^t D \Gamma} where {D} is the diagonal matrix with the eigenvalues in the main diagonal.

\displaystyle A^k = (\Gamma^t D \Gamma) \hdots (\Gamma^t D \Gamma) = \Gamma^t D^k \Gamma

and since {D = \text{diag}(\lambda_1, \hdots, \lambda_n)} we have {D^k = \text{diag}(\lambda_1^k, \hdots, \lambda_n^k)}. Now, it is simple to see that:

\displaystyle p_A(A) = \Gamma^t p(D) \Gamma

and therefore:

\displaystyle p(D) = \begin{bmatrix}& p_A(\lambda_1) & & & \\ & & p_A(\lambda_2) & & & \\ & & & \ddots & \\ & & & & p_A(\lambda_n) \end{bmatrix} = \begin{bmatrix} & 0 & & & \\ & & 0 & & & \\ & & & \ddots & \\ & & & & 0 \end{bmatrix} = 0

And that was the one-line proof. One even simpler proof is: let {v_i} be the eigenvectors, then {p_A(A)v_i = p_A(\lambda_i)A = 0}, so {p_A(A)} must be {0} since it returns zero for all elements of a basis. Well, I sent that to Hu Fu and he told me the proof had a bug. Not really a bug, but I was proving only for symmetric matrices. More generally, I was proving for diagonalizable matrices. He showed me, for example, the matrix:

\displaystyle  \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}

which has only one eigenvalue {0} and the the eigenvectors are all of the form {(x,0)} for {x \in {\mathbb R}}. So, the dimension of the space spanned by the eigenvectors is {1}, less than the dimension of the matrix. This never happens for symmetric matrices, and I guess after some time as a computer scientist, I got used to work only with symmetric matrices for almost everything I use: metrics, quadratic forms, correlation matrices, … but there is more out there then only symmetric matrices. The good news is that this proof is not hard to fix for the general case.

First, it is easy to prove that for each root of the characteristic polynomial there is one eigenvector associated to it (just see that {\det(A - \lambda I) = 0} and therefore there must be {v \in \ker(A - \lambda I) \setminus \{ 0 \}}, so if all the roots are distinct, then there is a basis of eigenvalues, and therefore the matrix is diagonalizable (notice that maybe we will need to use complex eigenvalues, but it is ok). The good thing is that a matrix having two identical eigenvalues is a “coincidence”. We can identify matrices with {{\mathbb R}^{n^2}}. The matrices with identical eigenvalues form a zero measure subset of {{\mathbb R}^{n^2}}, they are in fact the roots of a polynomial in {x_{ij}}. This polynomial is the resultant polynomial {R_{p,p'} = 0}. Therefore, we proved Cayley-Hamilton theorem in the complement of a zero-measure set in {{\mathbb R}^{n^2}}. Since {A \mapsto p_A(A)} is a continuous function, it extends naturally to all matrices {A \in {\mathbb R}^{n^2}}.

We can also interpret that probabilistically: get a matrix {U} where {U_{ij}} is taken uniformly at random from {[0,1]}. Then {A + \epsilon U} has with probability {1} all different eigenvalues. So, {p_{A+\epsilon U} (A+\epsilon U) = 0} with probability {1}. Now, just make {\epsilon \rightarrow 0}.

Ok, this proves the Theorem for real and complex matrices, but what about a matrix defined over a general field where we can’t use those continuity arguments. A way to get around it is by using Jordan Canonical Form, which is a generalization of eigenvector decomposition. Not all matrices have eigenvector decomposition, but all matrices over an algebraic closed field can be written in Jordan Canonical Form. Given any {A \in \overline{K}^{n^2}} there is a matrix {\Gamma \in \overline{K}^{n^2}} so that:

\displaystyle A = \Gamma^{-1} \begin{bmatrix}& B_1 & & & \\ & & B_2 & & & \\ & & & \ddots & \\ & & & & B_p \end{bmatrix}\Gamma

where {B_i} are blocks of the form:

\displaystyle B_i = \begin{bmatrix}& \lambda & 1 & & \\ & & \lambda & 1 & & \\ & & & \ddots & \\ & & & & \lambda \end{bmatrix}

By the same argument as above, we just need to prove Cayley Hamilton for each block in separate. So we need to prove that {p_A(B_i) = 0}. If the block has size {1}, then it is exacly the proof above. If the block is bigger, then we need to look at how does {B_i^k} looks like. By inspection:

\displaystyle B_i^2 = \begin{bmatrix}& \lambda^2 & 2 \lambda & 1 & & & \\ & & \lambda^2 & 2 \lambda & 1 & & & \\ & & & & & \ddots & \\ & & & & & \lambda^2 \end{bmatrix}

Tipically, for {B_i^k} we have in each row, starting in column {k} the sequence {\lambda^k, k \lambda^{k-1}, k(k-1) \lambda^{k-1}, \hdots}, i.e., {\frac{d^0}{d\lambda^0} \lambda^k, \frac{d^1}{d\lambda^1} \lambda^k , \frac{d^2}{d\lambda^2} \lambda^k \hdots}. So, we have

\displaystyle p(B_i) = \begin{bmatrix} p(\lambda) & p'(\lambda) & p''(\lambda) & p'''(\lambda) & \hdots \\ & p(\lambda) & p'(\lambda) & p''(\lambda) & \hdots \\ 				& & p(\lambda) & p'(\lambda) & \hdots \\ 				& & & p(\lambda) & \hdots \\ 				& & & & \ddots \\ \end{bmatrix}

If block {B_i} has size {k}, then {\lambda_i} has multiplicity {k} in {p(.)} and therefore {p(\lambda_i) = p'(\lambda_i) = \hdots = p^{(k-1)}(\lambda_i)} and therefore, {p(B_i) = 0} as we wanted to prove.

It turned out not to be a very very short proof, but it is still short, since it uses mostly elementary stuff and the proof is really intuitive in some sense. I took some lessons from that: (i) first it reinforces my idea that, if I need to say something about a matrix, the first thing I do is to look at its eigenvectors decomposition. A lot of Linear Algebra problems are very simple when we consider things in the right basis. Normally the right basis is the eigenvector basis. (ii) not all matrices are diagonalizable. But in those cases, Jordan Canonical Form comes in our help and we can do almost the same as we did with eigenvalue decomposition.

[LADIS 2009] Technical Session #4 – Monitoring and Repair

October 11th, 2009 hussam No comments

Second Talk: A case for the accountable cloud by Andreas Haeberlen

Three cloud stories .. a threatening cloud, a promising cloud, and a nice cloud :-)

The problem with current clouds is that the user does not know what the cloud service provider is doing with the customer’s code and data. Also, from the cloud service provider’s perspective, the operator does not know what is the code that they are running for customers supposed to do.

Alice is the customer running a service on the cloud owned and operated by Bob.

A solution: what if we had an oracle that Alice and Bob could ask about cloud problems? We want completeness, (if something is faulty, we will know) accuracy (no false positives), verifiability (the oracle can prove its diagnoses is correct).

Idea: make clud accountable to alice+bob. Cloud records its actions in a tamper-evident log, alice and bob can audit, use log to construct evidence that a fault does or does not exist.

Discussion: 1) Isn’t this too pessimistic? bob isn’t malicious ..maybe, but bob can get hacked, or things can just go wrong. 2) shouldn’t bob use fault tolerance instead? yes whenever we can, but masking faults is never perfect, we still need to check. 3) why would a provider want to deploy this? this feature will be attractive to prospective customers, and helpful for support. 4) Are these the right guarantees? completeness (no false negatives), could be relaxed with probabilistic completeness; verifiability could be relaxed only provide some evidence; accuracy (no false positives) can not be relaxed because we need to have confidence when we rule out problems.

A call to action: cloud accountability should do; deliverable provable guarantees, work for most cloud apps, require no changes to application code, cover a wide spectrum of properties, low overhead.

Work in Progress: Accountable Virtual Machines (AVM), goal: provide accountability for arbitrary unmodified software. cloud records enough data to enable deterministic replay, alice can replay log with a known-good copy of the software, can audit any part of the original execution.

Conclusion: current cloud designs carry risks for both customers and providers (mainly because of split administration problem). Proposed solution: accountable cloud. Lots of research opportunities.

Third Talk: Learning from the Past for Resolving Dilemmas of Asynchrony by Paul Ezhilchelvan

In an asynchronous model you can not bound message delivery time or even message processing time by a machine. However, in a probabilistic synchronous model, we can bound times within a certain probability via proactive measurements. The new central hypothesis of the new model is that most of the time, performance of the past is indicative of the performance of the near future (i.e. delay in the past is the indicative of delay in the future).

Design steps include doing proactive measurements, using them to establish synchrony bounds, and assign time bounds based on that, try that and see how it works and enable exceptions.

On-going work: development of exceptions (to deal with exceptional cases when mistakes are detected). Open environments are asynchronous, use crash signals for notification of extreme unexpected behavior.

Categories: Conferences, Systems Tags: