### Archive

Posts Tagged ‘theory’

## Random Spanning Trees

BigRedBits is again pleased to have Igor Gorodezky as a guest blogger directly from UCLA. I leave you with his excelent post on the Wilson’s algorithm.

——————————————

Igor again, with another mathematical dispatch from UCLA, where I’m spending the semester eating and breathing combinatorics as part of the 2009 program on combinatorics and its applications at IPAM. In the course of some reading related to a problem with which I’ve been occupying myself, I ran across a neat algorithmic result – Wilson’s algorithm for uniformly generating spanning trees of a graph. With Renato’s kind permission, let me once again make myself at home here at Big Red Bits and tell you all about this little gem.

The problem is straightforward, and I’ve essentially already stated it: given an undirected, connected graph ${G}$, we want an algorithm that outputs uniformly random spanning trees of ${G}$. In the early ’90s, Aldous and Broder independently discovered an algorithm for accomplishing this task. This algorithm generates a tree ${T}$ by, roughly speaking, performing a random walk on ${G}$ and adding the edge ${(u,v)}$ to ${T}$ every time that the walk steps from ${u}$ to ${v}$ and ${v}$ is a vertex that has not been seen before.

Wilson’s algorithm (D. B. Wilson, “Generating random spanning trees more quickly than the cover time,” STOC ’96) takes a slightly different approach. Let us fix a root vertex ${r}$. Wilson’s algorithm can be stated as a loop-erased random walk on ${G}$ as follows.

Algorithm 1 (Loop-erased random walk) Maintain a tree ${T}$, initialized to consist of ${r}$ alone. While there remains a vertex ${v}$ not in ${T}$: perform a random walk starting at ${v}$, erasing loops as they are created, until the walk encounters a vertex in ${T}$, then add to ${T}$ the cycle-erased simple path from ${v}$ to ${T}$.

We observe that the algorithm halts with probability 1 (its expected running time is actually polynomial, but let’s not concern ourselves with these issues here), and outputs a random directed spanning tree oriented towards ${r}$. It is a minor miracle that this tree is in fact sampled uniformly from the set of all such trees. Let us note that this offers a solution to the original problem, as sampling ${r}$ randomly and then running the algorithm will produce a uniformly generated spanning tree of ${G}$.

It remains, then, to prove that the algorithm produces uniform spanning trees rooted at ${r}$ (by which we mean directed spanning trees oriented towards ${r}$). To this we dedicate the remainder of this post.

1. A “different” algorithm

Wilson’s proof is delightfully sneaky: we begin by stating and analyzing a seemingly different algorithm, the cycle-popping algorithm. We will prove that this algorithm has the desired properties, and then argue that it is equivalent to the loop-erased random walk (henceforth LERW).

The cycle-popping algorithm works as follows. Given ${G}$ and ${r}$, associate with each non-root vertex ${v}$ an infinite stack of neighbors. More formally, to each ${v \neq r}$ we associate

$\displaystyle \mathcal S_v = [u_{0}, u_{1}, \dots ]$

where each ${u_{i}}$ is uniformly (and independently) sampled from the set of neighbors of ${v}$. Note that each stack is not a random walk, just a list of neighbors. We refer to the left-most element above as the top of ${\mathcal S_v}$, and by popping the stack ${\mathcal S_v}$ we mean removing this top vertex from ${\mathcal S_v}$.

Define the stack graph ${G_{\mathcal S}}$ to be the directed graph on ${V}$ that has an edge from ${v}$ to ${u}$ if ${u}$ is at the top of the stack ${\mathcal S_v}$. Clearly, if ${G}$ has ${n}$ vertices then ${G_{\mathcal S}}$ is an oriented subgraph of ${G}$ with ${n-1}$ edges. The following lemma follows immediately.

Lemma 1 Either ${G_{\mathcal S}}$ is a directed spanning tree oriented towards ${r}$ or it contains a directed cycle.

If there is a directed cycle ${C}$ in ${G_{\mathcal S}}$ we may pop it by popping ${\mathcal S_v}$ for every ${v \in C}$. This eliminates ${C}$, but of course might create other directed cycles. Without resolving this tension quite yet, let us go ahead and formally state the cycle-popping algorithm.

Algorithm 2 (Cycle-popping algorithm) Create a stack ${\mathcal S_v}$ for every ${v \neq r}$. While ${G_{\mathcal S}}$ contains any directed cycles, pop a cycle from the stacks. If this process ever terminates, output ${G_{\mathcal S}}$.

Note that by the lemma, if the algorithm ever terminates then its output is a spanning tree rooted at ${r}$. We claim that the algorithm terminates with probability 1, and moreover generates spanning trees rooted at ${r}$ uniformly.

To this end, some more definitions: let us say that given a stack ${\mathcal S_v}$, the vertex ${u_i}$ is at level ${i}$. The level of a vertex in a stack is static, and is defined when the stack is created. That is, the level of ${u_i}$ does not change even if ${u_i}$ advances to the top of the stack as a result of the stack getting popped.

We regard the sequence of stack graphs produced by the algorithm as leveled stack graphs: each non-root vertex ${v}$ is assigned the level of its stack. Observe that the level of ${v}$ in ${G_{\mathcal S}}$ is the number of times that ${\mathcal S_v}$ has been popped. In the same way, we regard cycles encountered by the algorithm as leveled cycles, and we can regard the tree produced by the algorithm (if indeed one is produced) as a leveled tree.

The analysis of the algorithm relies on the following key lemma (Theorem 4 in Wilson’s paper), which tells us that the order in which the algorithm pops cycles is irrelevant.

Lemma 2 For a given set of stacks, either the cycle-popping algorithm never terminates, or there exists a unique leveled spanning tree ${T}$ rooted at ${r}$ such that the algorithm outputs ${T}$ irrespective of the order in which cycles are popped.

Proof: Fix a set of stacks ${\{ \mathcal S_v \}_{v \neq r}}$. Consider a leveled cycle ${C}$ that is pop-able, i.e.~there exist leveled cycles ${C_1, C_2, \dots, C_k=C}$ that can be popped in sequence. We claim that if the algorithm pops any cycle not equal to ${C}$, then there still must exist a series of cycles that ends in ${C}$ and that can be popped in sequence. In other words, if ${C}$ is pop-able then it remains pop-able, no matter which cycles are popped, until ${C}$ itself is actually popped.

Let ${C'}$ be a cycle popped by the algorithm. If ${C'=C_1}$ then the claim is clearly true. Also, if ${C'}$ shares no vertices with ${C_1, \dots, C_k}$, then the claim is true again. So assume otherwise, and let ${C_i}$ be the first in the series ${C_1, \dots, C_k}$ to share a vertex with ${C'}$. Let us show that ${C'=C_i}$ by contradiction.

If ${C_i \neq C'}$, then ${C_i}$ and ${C'}$ must share a vertex ${w}$ that has different successors in ${C'}$ and ${C_i}$. But by definition of ${C_i}$, none of the ${C_1, \dots, C_{i-1}}$ contain ${w}$, and this implies that ${w}$ has the same level in ${C'}$ and ${C_i}$. Therefore its successor in both cycles is the same, a contradiction. This proves ${C_i=C'}$.

Moreover, the argument above proves that ${C_i}$ and ${C'}$ are equal as leveled cycles (i.e.~every vertex has the same level in both cycles). Hence

$\displaystyle C'=C_i, C_1, C_2, \dots, C_{i-1}, C_{i+1}, \dots, C_k=C$

is a series of cycles that can be popped in sequence, which proves the original claim about ${C}$.

We conclude that given a set of stacks, either there is an infinite number of pop-able cycles, in which case there will always be an infinite number and the algorithm will never terminate, or there is a finite number of such cycles. In the latter case, every one of these cycles is eventually popped, and the algorithm produces a spanning tree ${T}$ rooted at ${r}$. The level of each non-root vertex in ${T}$ is given by (one plus) the number of popped cycles that contained ${v}$. $\Box$

Wilson summarizes the cycle-popping algorithm thusly: “[T]he stacks uniquely define a tree together with a partially ordered set of cycles layered on top of it. The algorithm peels off these cycles to find the tree.”

Theorem 3 The cycle-popping algorithm terminates with probability 1, and the tree that it outputs is a uniformly sampled spanning tree rooted at ${r}$.

Proof: The first claim is easy: ${G}$ has a spanning tree, therefore it has a directed spanning tree oriented towards ${r}$. The stacks generated in the first step of the algorithm will contain such a tree, and hence the algorithm will terminate, with probability 1.

Now, consider a spanning tree ${T}$ rooted at ${r}$. We’ll abuse notation and let ${T}$ be the event that ${T}$ is produced by the algorithm. Similarly, given a collection of leveled cycles ${\mathcal C}$, we will write ${\mathcal C}$ for the event that ${\mathcal C}$ is the set of leveled cycles popped by the algorithm before it terminates. Finally, let ${\mathcal C \wedge T}$ be the event that the algorithm popped the leveled cycles in ${\mathcal C}$ and terminated, with the resulting leveled tree being equal to ${T}$.

By the independence of the stack entries, we have ${\Pr[\mathcal C \wedge T] = \Pr[\mathcal C] \cdot p}$, where ${p}$ is the probability that the algorithm’s output is a leveled version of ${T}$, a quantity which a moment’s reflection will reveal is independent of ${T}$. Now,

$\displaystyle \Pr[T] = \sum_{\mathcal C} \Pr[\mathcal C \wedge T] = p \sum_{\mathcal C} \Pr[\mathcal C]$

which, as desired, is independent of ${T}$. $\Box$

2. Conclusion

We have shown that the cycle-popping algorithm generates spanning trees rooted at ${r}$ uniformly. It remains to observe that the LERW algorithm is nothing more than an implementation of the cycle-popping algorithm! Instead of initially generating the (infinitely long) stacks and then looking for cycles to pop, the LERW generates stack elements as necessary via random walk (computer scientists might recognize this as the Principle of Deferred Decisions). If the LERW encounters a loop, then it has found a cycle in the stack graph ${G_{\mathcal S}}$ induced by the stacks that the LERW has been generating. Erasing the loop is equivalent to popping this cycle. We conclude that the LERW algorithm generates spanning trees rooted at ${r}$ uniformly.

Categories: theory Tags:

## More about hats and auctions

In my last post about hats, I told I’ll soon post another version with some more problems, which I ended up not doing and would talk a bit more about those kind of problems. I ended up not doing, but here are a few nice problems:

Those ${n}$ people are again a room, each with a hat which is either black or white (picked with probability ${\frac{1}{2}}$ at random) and they can see the color of the other people’s hats but they can’t see their own color. They write in a piece of paper either “BLACK” or “WHITE”. The whole team wins if all of them get their colors right. The whole team loses, if at least one writes the wrong color. Before entering the room and getting the hats, they can strategyze. What is a strategy that makes them win with ${\frac{1}{2}}$ probability?

If they all choose their colors at random, the probability of winning is very small: ${\frac{1}{2^n}}$. So we should try to correlate them somehow. The solution is again related with error correcting codes. We can think of the hats as a string of bits. How to correct one bit if it is lost? The simple engineering solution is to add a parity check. We append to the string ${x_0, x_1, \hdots, x_n}$ a bit ${y = \sum_i x_i \mod 2}$. So, if bit ${i}$ is lost, we know it is ${x_i = (y + \sum_{j \neq i} x_j) \mod 2}$. We can use this idea to solve the puzzle above: if hats are places with ${\frac{1}{2}}$ probability, the parity check will be ${0}$ with probability ${\frac{1}{2}}$ and ${1}$ with probability ${\frac{1}{2}}$. They can decide before hand that everyone will use ${y = 0}$ and with probability ${\frac{1}{2}}$ they are right and everyone gets his hat color right. Now, let’s extend this problem in some ways:

The same problem, but there are ${k}$ hat colors, they are choosen independently with probability ${\frac{1}{k}}$ and they win if everyone gets his color right. Find a strategy that wins with probability ${\frac{1}{k}}$.

There are again ${k}$ hat colors, they are choosen independently with probability ${\frac{1}{k}}$ and they win if at least a fraction ${f}$ (${0 < f < 1}$) of the people guesses the right color. Find a strategy that wins with probability ${\frac{1}{fk}}$.

Again to the problem where we just have BLACK and WHITE colors, they are chosen with probability ${\frac{1}{2}}$ and everyone needs to find the right color to win, can you prove that ${\frac{1}{2}}$ is the best one can do? And what about the two other problems above?

The first two use variations of the parity check idea in the solution. For the second case, given any strategy of the players, for each string ${x \in \{0,1\}^n}$ they have probability ${p_x}$. Therefore the total probability of winning is ${\frac{1}{2^n}\sum_{x \in \{0,1\}^n} p_x}$. Let ${x' = (1-x_1, x_2, \hdots, x_n)}$, i.e., the same input but with the bit ${1}$ flipped. Notice that the answer of player ${1}$ is the same (or at least has the same probabilities) in both ${x}$ and ${x'}$, since he can’t distinguish between ${x}$ and ${x'}$. Therefore, ${p_{x} + p_{x'} \leq 1}$. So,

$\displaystyle 2 \frac{1}{2^n}\sum_{x \in \{0,1\}^n} p_x = \frac{1}{2^n}\sum_{x \in \{0,1\}^n} p_x + \frac{1}{2^n}\sum_{x \in \{0,1\}^n} p_x' \leq 1$

. This way, no strategy can have more than ${\frac{1}{2}}$ probability of winning.

Another variation of it:

Suppose now we have two colors BLACK and WHITE and the hats are drawn from one distribution ${D}$, i.e., we have a probability distribution over ${x \in \{0,1\}^n}$ and we draw the colors from that distribution. Notice that now the hats are not uncorrelated. How to win again with probability ${\frac{1}{2}}$ (to win, everyone needs the right answer).

I like a lot those hat problems. A friend of mine just pointed out to me that there is a very nice paper by Bobby Kleinberg generalizing several aspects of hat problems, for example, when players have limited visibility of other players hats.

I began being interested by this sort of problem after reading the Derandomization of Auctions paper. Hat guessing games are not just a good model for error correcting codes, but they are also a good model for truthful auctions. Consider an auction with a set ${N}$ single parameter agents, i.e., an auction where each player gives one bid ${b_i}$ indicating how much he is willing to pay to win. We have a set of constraints: ${\mathcal{X} \subseteq 2^N}$ of all feasible allocations. Based on the bids ${(b_i)_{i \in N}}$ we choose an allocation ${S \in \mathcal{X}}$ and we charge payments to the bidders. An example of a problem like this is the Digital Goods Auction, where ${\mathcal{X} = 2^N}$.

In this blog post, I discussed the concept of truthful auction. If an auction is randomized, an universal truthful auction is an auction that is truthful even if all the random bits in the mechanism are revealed to the bidders. Consider the Digital Goods Auction. We can characterize universal truthful digital goods auction as bid-independent auctions. A bid-independent auction is given by function ${f_i(b_{-i})}$, which associated for each ${b_{-i}}$ a random variable ${f_i(b_{-i})}$. In that auction, we offer the service to player ${i}$ at price ${f_i(b_{-i})}$. If ${b_i \geq f_i(b_{-i})}$ we allocate to ${i}$ and charge him ${f_i(b_{-i})}$. Otherwise, we don’t allocate and we charge nothing.

It is not hard to see that all universal truthful mechanisms are like that: if ${x_i(b_i)}$ is the probability that player ${i}$ gets the item bidding ${b_i}$ let ${U}$ be an uniform random variable on ${[0,1]}$ and define ${f_i(b_{-i}) = x_i^{-1}(U)}$. Notice that here ${x_i(.) = x_i(., b_{-i})}$, but we are inverting with respect to ${b_i}$. It is a simple exercise to prove that.

With this characterization, universal truthful auctions suddenly look very much like hat guessing games: we need to design a function that looks at everyone else’s bid but not on our own and in some sense, “guesses” what we probably have and with that calculated the price we offer. It would be great to be able to design a function that returns ${f(b_{-i}) = b_i}$. That is unfortunately impossible. But how to approximate ${b_i}$ nicely? Some papers, like the Derandomization of Auctions and Competitiveness via Consensus use this idea.

Categories: Tags:

## Cayley-Hamilton Theorem and Jordan Canonical Form

I was discussing last week with my officemates Hu Fu and Ashwin about the Cayley-Hamilton Theorem. The theorem is the following, given an ${n \times n}$ matrix ${A}$ we can define its characteristic polynomial by ${p_A(\lambda) = \det(A - I\lambda)}$. The Cayley-Hamilton Theorem says that ${p_A(A) = 0}$. The polynomiale is something like:

$\displaystyle p_A(x) = a_k x^k + a_{k-1} x^{k-1} + \hdots + a_1 x^1 + a_0$

so we can just see it as a formal polynomial and think of:

$\displaystyle p_A(A) = a_k A^k + a_{k-1} A^{k-1} + \hdots + a_1 A + a_0 I$

which is an ${n \times n}$ matrix. The theorem says it is the zero matrix. We thought for a while, looked in the Wikipedia, and there there were a few proofs, but not the one-line proof I was looking for. Later, I got this proof that I sent to Hu Fu:

Write the matrix ${A}$ in the basis of its eigenvectors, then we can write ${A = \Gamma^t D \Gamma}$ where ${D}$ is the diagonal matrix with the eigenvalues in the main diagonal.

$\displaystyle A^k = (\Gamma^t D \Gamma) \hdots (\Gamma^t D \Gamma) = \Gamma^t D^k \Gamma$

and since ${D = \text{diag}(\lambda_1, \hdots, \lambda_n)}$ we have ${D^k = \text{diag}(\lambda_1^k, \hdots, \lambda_n^k)}$. Now, it is simple to see that:

$\displaystyle p_A(A) = \Gamma^t p(D) \Gamma$

and therefore:

$\displaystyle p(D) = \begin{bmatrix}& p_A(\lambda_1) & & & \\ & & p_A(\lambda_2) & & & \\ & & & \ddots & \\ & & & & p_A(\lambda_n) \end{bmatrix} = \begin{bmatrix} & 0 & & & \\ & & 0 & & & \\ & & & \ddots & \\ & & & & 0 \end{bmatrix} = 0$

And that was the one-line proof. One even simpler proof is: let ${v_i}$ be the eigenvectors, then ${p_A(A)v_i = p_A(\lambda_i)A = 0}$, so ${p_A(A)}$ must be ${0}$ since it returns zero for all elements of a basis. Well, I sent that to Hu Fu and he told me the proof had a bug. Not really a bug, but I was proving only for symmetric matrices. More generally, I was proving for diagonalizable matrices. He showed me, for example, the matrix:

$\displaystyle \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}$

which has only one eigenvalue ${0}$ and the the eigenvectors are all of the form ${(x,0)}$ for ${x \in {\mathbb R}}$. So, the dimension of the space spanned by the eigenvectors is ${1}$, less than the dimension of the matrix. This never happens for symmetric matrices, and I guess after some time as a computer scientist, I got used to work only with symmetric matrices for almost everything I use: metrics, quadratic forms, correlation matrices, … but there is more out there then only symmetric matrices. The good news is that this proof is not hard to fix for the general case.

First, it is easy to prove that for each root of the characteristic polynomial there is one eigenvector associated to it (just see that ${\det(A - \lambda I) = 0}$ and therefore there must be ${v \in \ker(A - \lambda I) \setminus \{ 0 \}}$, so if all the roots are distinct, then there is a basis of eigenvalues, and therefore the matrix is diagonalizable (notice that maybe we will need to use complex eigenvalues, but it is ok). The good thing is that a matrix having two identical eigenvalues is a “coincidence”. We can identify matrices with ${{\mathbb R}^{n^2}}$. The matrices with identical eigenvalues form a zero measure subset of ${{\mathbb R}^{n^2}}$, they are in fact the roots of a polynomial in ${x_{ij}}$. This polynomial is the resultant polynomial ${R_{p,p'} = 0}$. Therefore, we proved Cayley-Hamilton theorem in the complement of a zero-measure set in ${{\mathbb R}^{n^2}}$. Since ${A \mapsto p_A(A)}$ is a continuous function, it extends naturally to all matrices ${A \in {\mathbb R}^{n^2}}$.

We can also interpret that probabilistically: get a matrix ${U}$ where ${U_{ij}}$ is taken uniformly at random from ${[0,1]}$. Then ${A + \epsilon U}$ has with probability ${1}$ all different eigenvalues. So, ${p_{A+\epsilon U} (A+\epsilon U) = 0}$ with probability ${1}$. Now, just make ${\epsilon \rightarrow 0}$.

Ok, this proves the Theorem for real and complex matrices, but what about a matrix defined over a general field where we can’t use those continuity arguments. A way to get around it is by using Jordan Canonical Form, which is a generalization of eigenvector decomposition. Not all matrices have eigenvector decomposition, but all matrices over an algebraic closed field can be written in Jordan Canonical Form. Given any ${A \in \overline{K}^{n^2}}$ there is a matrix ${\Gamma \in \overline{K}^{n^2}}$ so that:

$\displaystyle A = \Gamma^{-1} \begin{bmatrix}& B_1 & & & \\ & & B_2 & & & \\ & & & \ddots & \\ & & & & B_p \end{bmatrix}\Gamma$

where ${B_i}$ are blocks of the form:

$\displaystyle B_i = \begin{bmatrix}& \lambda & 1 & & \\ & & \lambda & 1 & & \\ & & & \ddots & \\ & & & & \lambda \end{bmatrix}$

By the same argument as above, we just need to prove Cayley Hamilton for each block in separate. So we need to prove that ${p_A(B_i) = 0}$. If the block has size ${1}$, then it is exacly the proof above. If the block is bigger, then we need to look at how does ${B_i^k}$ looks like. By inspection:

$\displaystyle B_i^2 = \begin{bmatrix}& \lambda^2 & 2 \lambda & 1 & & & \\ & & \lambda^2 & 2 \lambda & 1 & & & \\ & & & & & \ddots & \\ & & & & & \lambda^2 \end{bmatrix}$

Tipically, for ${B_i^k}$ we have in each row, starting in column ${k}$ the sequence ${\lambda^k, k \lambda^{k-1}, k(k-1) \lambda^{k-1}, \hdots}$, i.e., ${\frac{d^0}{d\lambda^0} \lambda^k, \frac{d^1}{d\lambda^1} \lambda^k , \frac{d^2}{d\lambda^2} \lambda^k \hdots}$. So, we have

$\displaystyle p(B_i) = \begin{bmatrix} p(\lambda) & p'(\lambda) & p''(\lambda) & p'''(\lambda) & \hdots \\ & p(\lambda) & p'(\lambda) & p''(\lambda) & \hdots \\ & & p(\lambda) & p'(\lambda) & \hdots \\ & & & p(\lambda) & \hdots \\ & & & & \ddots \\ \end{bmatrix}$

If block ${B_i}$ has size ${k}$, then ${\lambda_i}$ has multiplicity ${k}$ in ${p(.)}$ and therefore ${p(\lambda_i) = p'(\lambda_i) = \hdots = p^{(k-1)}(\lambda_i)}$ and therefore, ${p(B_i) = 0}$ as we wanted to prove.

It turned out not to be a very very short proof, but it is still short, since it uses mostly elementary stuff and the proof is really intuitive in some sense. I took some lessons from that: (i) first it reinforces my idea that, if I need to say something about a matrix, the first thing I do is to look at its eigenvectors decomposition. A lot of Linear Algebra problems are very simple when we consider things in the right basis. Normally the right basis is the eigenvector basis. (ii) not all matrices are diagonalizable. But in those cases, Jordan Canonical Form comes in our help and we can do almost the same as we did with eigenvalue decomposition.

Categories: theory Tags: