probability – Big Red Bits

Bayesian updates and the Lake Wobegon effect

renatoppl — Mon, 26 Sep 2011 01:48:27 +0000

We seem to have a good mathematical understanding of Bayesian updates, but somehow a very poor understanding of their practical implications. There are many situations in practice that we easily perceive as irrational. One of the most famous is the so-called Lake Wobegon effect, named after the fictional town in Minnesota where “all the women are strong, all the men are good looking, and all the children are above average”. It is described as a cognitive bias where individuals tend to overestimate their own capabilities. For instance, when drivers are asked to rate their own skill relative to the average by placing themselves in one of three groups (low-skilled, medium-skilled and high-skilled), most rate themselves above the average.

In fact, the behavioral economics literature is full of examples like this, where the observed data is far from what you would expect to observe if all agents were rational – and those are normally attributed to cognitive biases. I was always a bit suspicious of such arguments: it was never clear if agents were simply not being rational or whether their true objective wasn’t being captured by the model. I always thought the second was a lot more likely.

One of the main problems of the irrationality argument is that it ignores the fact that agents live in a world whose state is not completely observed. In a beautiful paper in Econometrica called “Apparent Overconfidence”, Benoit and Dubra argue that:

“But the simple truism that most people cannot be better than the median does not imply that most people cannot rationally rate themselves above the median.”

The authors show that it is possible to reverse engineer a signaling scheme such that rational behavior is consistent with the observed data. Let me try to give a simple example they give in the introduction: consider that each driver has one of three types of skill, low, medium or high: $L$, $M$ and $H$. However, they can’t observe this. They can only observe some sample of their driving. Let’s say for simplicity that they can observe a signal $A$ that says if they caused an accident or not. Assume also that the larger the skill of a driver, the lower his probability of causing an accident, say:

$$\mathbb{P}(A \vert L) = \frac{47}{80}, \quad \mathbb{P}(A \vert M) = \frac{9}{16}, \quad \mathbb{P}(A \vert H) = \frac{1}{20}$$

Before observing $A$, each driver thinks of himself as having probability $\frac{1}{3}$ of having each type of skill. Now, after observing $A$, they update their belief according to Bayes’ rule, i.e.,

$$\mathbb{P}(\theta \vert \neg A) = \frac{\mathbb{P}(\neg A \vert \theta) \mathbb{P}(\theta)}{\mathbb{P}(\neg A)} \quad \text{for } \theta \in \{L, M, H\}$$

Doing the calculations, we have that $\mathbb{P}(A) = \frac{2}{5}$ and for the $\frac{3}{5}$ of the drivers that didn’t suffer an accident, they’ll evaluate $\mathbb{P}(L \vert \neg A) = \frac{11}{48}$, $\mathbb{P}(M \vert \neg A) = \frac{35}{144}$, $\mathbb{P}(H \vert \neg A) = \frac{19}{36}$, so:

$$\mathbb{P}(H \vert \neg A) > \mathbb{P}(L \cup M \vert \neg A)$$

and therefore will report high-skill. Notice this is totally consistent with rational Bayesian updaters. The main question in the paper is: “when is it possible to reverse engineer a signaling scheme?”. More formally, let $\Theta$ be a set of types of users and let $p \in \Delta(\Theta)$, i.e., $p$ is a distribution on the types which is common knowledge. Now, if we ask agents to report their type, the distribution of their reports is some $q \in \Delta(\Theta)$. Is there a signaling scheme, which can be interpreted as a random variable $S$ correlated with the type $\theta$, such that $q$ is the distribution rational Bayesian updaters would report based on what they observed from $S$? The authors give necessary and sufficient conditions on when this is possible given $p$ and $q$.
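The Bayesian update in the driver example can be checked with exact arithmetic. The sketch below is only an illustration: the accident probabilities are an assumption (they are the values from the paper's introductory example as I recall them), and the names `P_ACCIDENT`, `PRIOR` and `posterior_given_no_accident` are made up for this snippet.

```python
from fractions import Fraction

# Assumed accident probabilities per skill type (paper's introductory example).
P_ACCIDENT = {"L": Fraction(47, 80), "M": Fraction(9, 16), "H": Fraction(1, 20)}
# Uniform prior: each driver thinks each skill type is equally likely.
PRIOR = {skill: Fraction(1, 3) for skill in P_ACCIDENT}


def posterior_given_no_accident(p_accident, prior):
    """Bayes' rule: return (P(skill | no accident) per skill, P(no accident))."""
    joint = {s: (1 - p_accident[s]) * prior[s] for s in prior}
    evidence = sum(joint.values())  # P(no accident)
    return {s: joint[s] / evidence for s in joint}, evidence


posterior, p_no_accident = posterior_given_no_accident(P_ACCIDENT, PRIOR)
print(p_no_accident, posterior)
```

Under these assumed numbers, a majority of drivers see no accident, and each of them rationally puts more than half of the posterior mass on being high-skilled.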

—————————–

A note also related to the Lake Wobegon effect: I started reading a very nice book by Duncan Watts called “Everything Is Obvious: *Once You Know the Answer” about the traps of common sense. The discussion is different than the one above, but it also talks about the dangers of applying our usual common sense, which is very useful in our daily life, to scientific results. I highly recommend reading the intro of the book, which is open on Amazon. He gives examples of social phenomena where, once you are told about them, you think: “oh yeah, this is obvious”. But then if you were told the exact opposite (in fact, he begins the example by telling you the opposite of what is observed in the data), you’d also think “yes, yes, this is obvious” and come up with very natural explanations. His point is that common sense is very useful for explaining observations, especially observations of social data. On the other hand, it performs very poorly at predicting what the data will look like before actually seeing it.

MHR, Regular Distributions and Myerson’s Lemma

renatoppl — Mon, 30 May 2011 10:46:08 +0000

Monotone Hazard Rate (MHR) distributions and their superclass, regular distributions, keep appearing in the Mechanism Design literature, and this is due to a very good reason: they are the classes of distributions for which Myerson’s Optimal Auction is simple and natural. Let’s briefly discuss some properties of those distributions. First, two definitions:

Hazard rate of a distribution $f$: $h(z) = \frac{f(z)}{1-F(z)}$
Myerson virtual value of a distribution $f$: $\phi(z) = z - \frac{1-F(z)}{f(z)}$

We can interpret the hazard rate in the following way: think of $T \sim f$ as a random variable that indicates the time that a light bulb will take to extinguish. If we are at time $t$ and the light bulb hasn’t extinguished so far, what is the probability it will extinguish in the next $\delta$ time:

$$\mathbb{P}[T \leq t+\delta \vert T > t] \approx \frac{f(t) \delta}{1-F(t)}$$

We say that a distribution is monotone hazard rate if $h(z)$ is non-decreasing. This is very natural for light bulbs, for example. Many of the distributions that we are used to are MHR, for example, uniform, exponential and normal. The way that I like to think about MHR distributions is the following: if some distribution has hazard rate $h(z)$, then it means that $F'(z) = (1-F(z)) h(z)$. If we define $G(z) = 1-F(z)$, then $G'(z) = -G(z) h(z)$, so:

$$F(z) = 1 - e^{-\int_0^z h(u) du}$$

From this characterization, it is simple to see that the extremal distributions for this class, i.e. the distributions that are on the edge between MHR and non-MHR, are those with constant hazard rate, which correspond to the exponential distribution $F(z) = 1-e^{-\lambda z}$ for $z \geq 0$. The way I like to think about those distributions is that whenever you are able to prove something about the exponential distribution, then you can prove a similar statement about MHR distributions. Consider those three examples:

Example 1: $\mathbb{P}[\phi(z) \geq 0] \geq \frac{1}{e}$ for MHR distributions. This fact is straightforward for the exponential distribution. For the exponential distribution $\phi(z) = z - \lambda^{-1}$ and therefore

$$\mathbb{P}[\phi(z) \geq 0] \geq \mathbb{P}[z > \lambda^{-1}] = 1-F(\lambda^{-1}) = e^{-1}$$

but the proof for MHR is equally simple: let $r = \inf \{z; \phi(z) \geq 0\}$, so that $h(u) \leq \frac{1}{r}$ for $u < r$; therefore $\mathbb{P}[\phi(z) \geq 0] \geq 1 - F(r) = e^{-\int_0^r h(u) du} \geq e^{-1}$.

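Example 2: Given $z_1, z_2 \sim F$ iid where $F$ is MHR, $y_1 = \min \{z_1, z_2\}$ and $y_2 = \max \{z_1, z_2\}$, then $\mathbb{E}[y_2] \leq 3 \mathbb{E}[y_1]$. The proof for the exponential distribution is trivial, and in fact, this is tight for the exponential. The trick is to use the convexity of $H(z) = \int_0^z h(u) du$. We use that in the following way:

$$\mathbb{E}[y_1] = \int_0^\infty (1-F(z))^2 dz = \int_0^\infty e^{-2H(z)} dz, \qquad \mathbb{E}[y_2] = \int_0^\infty (1 - F(z)^2) dz = \int_0^\infty \left( 2e^{-H(z)} - e^{-2H(z)} \right) dz$$

Since $H$ is convex and $H(0) = 0$, we have that $H(\frac{z}{2}) \leq \frac{1}{2} H(z)$. This way, we get:

$$\int_0^\infty e^{-H(z)} dz \leq \int_0^\infty e^{-2H(z/2)} dz = 2 \int_0^\infty e^{-2H(u)} du = 2 \mathbb{E}[y_1], \quad \text{so} \quad \mathbb{E}[y_2] \leq 4 \mathbb{E}[y_1] - \mathbb{E}[y_1] = 3 \mathbb{E}[y_1]$$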

Example 3: For MHR distributions, there is a simple lemma that relates the virtual value and the real value, and this lemma is quite useful in various settings: let $r = \inf \{z; \phi(z) > 0 \}$, then for $z \geq r$, $\phi(z) \geq z - r$. Again, this is tight for the exponential distribution. The proof is quite trivial, using that $h$ is non-decreasing and that $\phi(r) = 0$, i.e. $\frac{1}{h(r)} = r$:

$$\phi(z) = z - \frac{1}{h(z)} \geq z - \frac{1}{h(r)} = z - r$$
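The bound of Example 2 (the expected maximum of two iid MHR draws is at most three times the expected minimum) can be sanity-checked numerically. This is a minimal sketch, not part of the original argument: it uses the identity $\mathbb{E}[Y] = \int_0^\infty \mathbb{P}[Y > z] dz$ and a plain midpoint rule, and the helper names are invented for the illustration.

```python
import math


def integrate(fn, upper, steps=200_000):
    """Midpoint-rule approximation of the integral of fn over [0, upper]."""
    width = upper / steps
    return sum(fn((i + 0.5) * width) for i in range(steps)) * width


def expected_min_max(cdf, upper):
    """E[min] and E[max] of two iid draws with the given CDF on [0, upper]."""
    e_min = integrate(lambda z: (1 - cdf(z)) ** 2, upper)  # P[min > z] = (1-F)^2
    e_max = integrate(lambda z: 1 - cdf(z) ** 2, upper)    # P[max > z] = 1-F^2
    return e_min, e_max


def exponential_cdf(z):
    """Exponential distribution with rate 1 (constant hazard rate)."""
    return 1 - math.exp(-z)


def uniform_cdf(z):
    """Uniform distribution on [0, 1] (increasing hazard rate)."""
    return min(z, 1.0)
```

For the exponential the ratio comes out at 3 (the tight case, truncating the integral at a large upper limit), while for the uniform distribution on $[0,1]$ it is 2.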

Now, MHR distributions are a subclass of regular distributions, which are the distributions for which Myerson’s virtual value is a monotone function. I usually find it harder to think about regular distributions than about MHR ones (in fact, I don’t know so many examples that are regular but not MHR). Here is one, though, called the equal-revenue distribution. Consider $z \in [1, \infty)$ distributed according to $f(z) = \frac{1}{z^2}$. The cumulative distribution is given by $F(z) = 1 - \frac{1}{z}$. The interesting thing about this distribution is that posted prices get the same revenue regardless of the price. For example, if we post any price $r \in [1, \infty)$, then a customer with valuation $z \sim F$ buys the item if $z > r$ at price $r$, so the revenue is $r (1-F(r)) = 1$. This can be expressed by the fact that $\phi(z) = 0$. I was a bit puzzled by this fact, because of Myerson’s Lemma:

Myerson Lemma: If a mechanism sells to some player that has valuation $z \sim F$ with probability $x(z)$ when he has value $z$, then the revenue is $\mathbb{E}[x(z) \phi(z)]$.

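And it seemed that the auctioneer was doomed to get zero revenue, since $\phi(z) = 0$. For example, suppose we fix some price $r$ and we sell the item if $z \geq r$ at price $r$. Then it seems that Myerson’s Lemma should go through by a derivation like this (for this special case, although the general proof is quite similar):

$$\mathbb{E}[x(z) \phi(z)] = \int_r^\infty \left( z - \frac{1-F(z)}{f(z)} \right) f(z) dz = \int_r^\infty z f(z) dz - \int_r^\infty (1-F(z)) dz = r (1-F(r))$$

but those don’t seem to match, since one side is zero and the other is 1. The mistake we made above is a classic one, which is to calculate $\infty - \infty$. We wrote:

$$\int_r^\infty z f(z) dz - \int_r^\infty (1-F(z)) dz$$

but both are infinity! This made me realize that Myerson’s Lemma needs the condition that $\mathbb{E}[z] < \infty$, which is quite a natural condition on a distribution over valuations of a good. So, one of the bugs of the equal-revenue distribution is that $\mathbb{E}[z] = \infty$. A family that is close to this, but doesn’t suffer from this bug is: $f(z) = \frac{\alpha - 1}{z^\alpha}$ for $z \geq 1$, then $F(z) = 1 - z^{1-\alpha}$. For $\alpha > 2$ we have $\mathbb{E}[z] < \infty$, then we get $\phi(z) = \frac{\alpha-2}{\alpha-1} z$.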
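A tiny sketch of the two facts about the equal-revenue distribution used above: every posted price yields revenue 1, and the virtual value is identically zero. The function names are hypothetical and only serve this illustration.

```python
def equal_revenue_cdf(z):
    """F(z) = 1 - 1/z on [1, infinity)."""
    return 1.0 - 1.0 / z if z >= 1.0 else 0.0


def equal_revenue_pdf(z):
    """f(z) = 1/z^2 on [1, infinity)."""
    return 1.0 / z**2 if z >= 1.0 else 0.0


def posted_price_revenue(price, cdf):
    """Expected revenue of posting `price`: price times probability of a sale."""
    return price * (1.0 - cdf(price))


def virtual_value(z, pdf, cdf):
    """Myerson virtual value z - (1 - F(z)) / f(z)."""
    return z - (1.0 - cdf(z)) / pdf(z)


for price in (1.0, 2.0, 10.0, 1000.0):
    print(price, posted_price_revenue(price, equal_revenue_cdf))
```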

DP and the Erdős–Rényi model

renatoppl — Mon, 16 May 2011 21:41:28 +0000

Yesterday I was in a pub with Vasilis Syrgkanis and Elisa Celis and we were discussing how to calculate the expected size of a connected component in $G(n,p)$, the Erdős–Rényi model. $G(n,p)$ is the classical random graph obtained by considering $n$ nodes and adding each edge $(i,j)$ independently with probability $p$. A lot is known about its properties, which very interestingly change qualitatively as the value of $p$ changes relative to $n$. For example, for $p = \frac{c}{n}$ with $c < 1$ there is no component greater than $O(\log n)$ with high probability. When $p = \frac{c}{n}$, $c>1$ and $n \rightarrow \infty$, then the graph has a giant component. All those phenomena are very well studied in the context of probabilistic combinatorics and also in social networks. I remember learning about them in Jon Kleinberg’s Structure of Information Networks class.

So, coming back to our conversation, we were thinking about how to calculate the size of a connected component. Fix some node $u$ in $G(n,p)$ – it doesn’t matter which node, since all nodes are equivalent before we start tossing the random coins. Now, let $C_u$ be the size of the connected component of node $u$. The question is how to calculate $\mathbb{E}[C_u]$.

Recently I’ve been learning MATLAB (actually, I am learning Octave, but it is the same) and I am amazed by it and wonder why I haven’t learned it before. It is a programming language that somehow knows exactly how mathematicians think, and the syntax is very intuitive. All the operations that you think of performing when doing mathematics are already implemented. Not that you can’t do that in C++ or Python (in fact, I’ve been doing that all my life), but in Octave things are so simple. So, I thought this was a nice opportunity to play a bit with it.

We can calculate $\mathbb{E}[C_u]$ using a dynamic programming algorithm in time $O(n^3)$ – well, maybe we can do it more efficiently, but the DP I thought of was the following: let’s calculate $C(n,s)$, the expected size of the connected component of $u$ in a random graph with $n$ nodes where the edges between $u$ and the other nodes have probability $1-(1-p)^s$ and an edge between $v_1$ and $v_2$ (both different from $u$) has probability $p$. What we want to compute is $C(n,1)$.

What we can do is to use the Principle of Deferred Decisions, and toss the coins for the edges between $u$ and the other nodes. With probability $\text{bin}(k, n-1, 1-(1-p)^s)$, there are $k$ edges between $u$ and the other nodes, say nodes $w_1, \ldots, w_k$. If we collapse those nodes into $u$ we end up with a graph of $n-k$ nodes and the problem is equivalent to $k$ plus the size of the connected component of $u$ in the collapsed graph.

One difference, however, is that the probability that the collapsed node $u$ is connected to a node $v$ of the $n-k-1$ remaining nodes is the probability that at least one of $w_1, \ldots, w_k$ is connected to $v$, which is $1-(1-p)^k$. In this way, we can write:

$$C(n,s) = \text{bin}(0,n-1,q_s) + \sum_{k=1}^{n-1} \text{bin}(k,n-1,q_s) \left[ k + C(n-k,k) \right], \qquad q_s = 1-(1-p)^s$$

where $\text{bin}(k,n,q) = {n \choose k} q^k (1-q)^{n-k}$. Now, we can calculate $C(n,1)$ by using DP, simply by filling an $n \times n$ table. In Octave, we can do it this way:


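function component = C(N,p)
  % C_table(n,s): expected size of the component of u in a graph with n nodes,
  % where edges at u have probability q = 1-(1-p)^s and all other edges have probability p.
  % Note: binopdf may require the statistics package in recent Octave versions.
  C_table = zeros(N,N);
  for n = 1:N
    for s = 1:N
      q = 1-((1-p)^s);
      % u has no neighbors: its component is u alone
      C_table(n,s) = binopdf(0,n-1,q);
      % u has k neighbors: collapse them into u and recurse on n-k nodes
      for k = 1:n-1
        C_table(n,s) += binopdf(k,n-1,q) * (k + C_table(n-k,k));
      end
    end
  end
  component = C_table(N,1);
endfunction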

And in fact we can call $C(n,p)$ for a fixed $n$ and a range of values of $p$ and see how $\mathbb{E}[C_u]$ varies. This allows us, for example, to observe the sharp transition that happens before the giant component is formed. The plot we get is:

[Figure: Erdős–Rényi model]
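For readers without Octave, here is a rough Python transcription of the same dynamic program. It is a sketch under the same recurrence, with names (`binom_pmf`, `expected_component_size`) chosen for this illustration, and it makes no attempt to be faster than the cubic table fill.

```python
from math import comb


def binom_pmf(k, n, q):
    """Probability of exactly k successes in n independent trials of bias q."""
    return comb(n, k) * q**k * (1 - q) ** (n - k)


def expected_component_size(N, p):
    """Expected size of the component of a fixed node u in G(N, p)."""
    # table[n][s]: expected component size of u with n nodes, where edges
    # at u have probability 1-(1-p)^s and all other edges have probability p.
    table = [[0.0] * (N + 1) for _ in range(N + 1)]
    for n in range(1, N + 1):
        for s in range(1, N + 1):
            q = 1 - (1 - p) ** s
            total = binom_pmf(0, n - 1, q)  # no neighbors: u is alone
            for k in range(1, n):
                # k neighbors get collapsed into u, leaving n-k nodes
                total += binom_pmf(k, n - 1, q) * (k + table[n - k][k])
            table[n][s] = total
    return table[N][1]
```

Small cases can be checked by hand: with two nodes the expected size is $1 + p$, and with three nodes it is $1 + 2p + 2p^2 - 2p^3$.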

Probability Puzzles

renatoppl — Wed, 17 Feb 2010 02:54:53 +0000

Today at a dinner with Thanh, Hu and Joel I heard about a paradox I hadn’t heard of before. Probability is full of cute problems that challenge our understanding of the basic concepts. The most famous of them is the Monty Hall Problem, which asks:

You are on a TV game show and there are $3$ doors – one of them contains a prize, say a car, and the other two doors contain things you don’t care about, say goats. You choose a door. Then the TV host, who knows where the prize is, opens one door you haven’t chosen and that he knows has a goat. Then he asks if you want to stick to the door you have chosen or if you want to change to the other door. What should you do?

Probably you’ve already come across this question at some moment of your life, and the answer is that changing doors doubles your probability of getting the prize. There are several ways of convincing your intuition:

Do the math: when you chose the door, there were three options, so the prize is behind the door you chose with probability $\frac{1}{3}$ and behind the other door with probability $\frac{2}{3}$ (note that the presenter can always open some door with a goat, so conditioning on that event doesn’t give you any new information about your own door).
Do the actual experiment (computationally) as done here. One can always ask a friend to help, get some goats and perform the actual experiment.
To convince yourself that “it doesn’t matter” is not correct, think of $n$ doors for some large $n$. You choose one and the TV host opens $n-2$ of the others and asks if you want to change or stick with your first choice. Wouldn’t you change?
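The computational experiment mentioned in the list is easy to sketch. The following simulation is one possible version (not the one linked in the post); `play` and `win_rate` are hypothetical names, and the host is assumed to pick uniformly among the goat doors he is allowed to open.

```python
import random


def play(switch, rng):
    """Play one round; return True if the player ends up with the prize."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    choice = rng.choice(doors)
    # The host opens a goat door that the player did not choose.
    opened = rng.choice([d for d in doors if d != choice and d != prize])
    if switch:
        # Move to the only door that is neither chosen nor opened.
        choice = next(d for d in doors if d != choice and d != opened)
    return choice == prize


def win_rate(switch, trials=20_000, seed=0):
    """Fraction of rounds won when always switching (or always staying)."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials
```

Calling `win_rate(True)` and `win_rate(False)` should give values close to $\frac{2}{3}$ and $\frac{1}{3}$ respectively.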

I’ve seen TV shows where this happened and I acknowledge that other things may be involved: there might be behavioral and psychological issues associated with the Monty Hall problem – and possibly those would interest Dan Ariely, whose book I began reading today and which looks quite fun. But the problem they told me about today at dinner was another one, the envelope problem:

There are two envelopes and you are told that in one of them there is twice the amount that there is in the other. You choose one of the envelopes at random and open it: it contains $100$ bucks. Now, you don’t know if the other envelope has $50$ bucks or $200$ bucks. Then someone asks you if you want to pay $10$ bucks and change to the other envelope. Should you change?

Now, consider two different solutions to this problem: the first is fallacious and the second is correct:

If I don’t change, I get $100$ bucks, if I change I pay a penalty of $10$ and I get either $50$ or $200$ with equal probability, so my expected prize if I change is ${\frac{200+50}{2}-10 = 115 > 100}$, so I should change.
I know there is one envelope with $x$ and one with $2x$, then my expected prize if I don’t change is $\frac{x+2x}{2} = \frac{3}{2}x$. If I change, my expected prize is $\frac{3}{2}x - 10$, so I should not change.

The fallacy in the first argument is perceiving a probability distribution where there is none. Either the other envelope contains $50$ bucks or it contains $200$ bucks – we just don’t know, but there is no probability distribution there – it is a deterministic choice by the game designer. Most of those paradoxes are a result of either an ill-defined probability space, as in Bertrand’s Paradox, or a wrong comprehension of the probability space, as in Monty Hall or in several paradoxes exploring the same idea, such as: Three Prisoners, Sleeping Beauty, Boy or Girl Paradox, …

There was very recently a thrilling discussion about a variant of the envelope paradox in the xkcd blag, which is the blog accompanying that amazing webcomic. There was a recent blog post with a very intriguing problem. A better idea is to go there and read the discussion, but if you are not doing so, let me summarize it here. The problem is:

There are two envelopes, each containing a distinct real number. You pick one envelope at random, open it and see the number, then you are asked to guess if the number in the other envelope is larger or smaller than the one you saw. Can you guess correctly with more than $\frac{1}{2}$ probability?

A related problem is: suppose that you are playing the envelope game and there are numbers $A$ and $B$ (with $A < B$). You pick one envelope at random, look at its content, and then decide whether to switch or not. Is there a strategy that gives you expected earnings greater than $\frac{A+B}{2}$?

The very unexpected answer is yes! The strategy that Randall presents in the blog (there is a link to the source here) is: let $X$ be a random variable on $\mathbb{R}$ such that for each $a < b$ we have ${P(a < X < b) > 0}$, for example, the normal distribution or the logistic distribution.

Sample $X$, then open the envelope and find a number $S$. Now, if $X < S$ say the other number is lower and if ${X > S}$ say the other number is higher. You get it right with probability

$$\displaystyle P(\text{picked }A) P(X > A) + P(\text{picked }B) P(X < B) = \frac{1}{2} (1 + P(A < X < B))$$

which is impressive. If you follow your guess, your expected earning is:

$$\displaystyle \begin{aligned} &P(\text{picked }A) \mathop{\mathbb E}[Y \vert \text{picked }A] + P(\text{picked }B) \mathop{\mathbb E}[Y \vert \text{picked }B] = \\ & = \frac{1}{2} [P(X<A) A + P(X>A) B] + \frac{1}{2} [P(X<B) B + P(X>B) A] \\ &= \frac{1}{2}[A [P(X<A) + P(X>B)] + B [P(X>A) + P(X<B)]] > \frac{A+B}{2} \end{aligned}$$
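The random-threshold strategy can be tried out directly. The sketch below assumes $X$ is a standard normal (one of the distributions the text allows) and compares a simulation against the closed form $\frac{1}{2}(1 + P(A < X < B))$; the function names are made up for the example.

```python
import math
import random


def normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def exact_success_probability(a, b):
    """(1 + P(a < X < b)) / 2 for X ~ N(0, 1) and a < b."""
    return 0.5 * (1 + normal_cdf(b) - normal_cdf(a))


def simulate(a, b, trials=50_000, seed=0):
    """Empirical success rate of the threshold strategy on envelopes a < b."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        seen, hidden = (a, b) if rng.random() < 0.5 else (b, a)
        threshold = rng.gauss(0.0, 1.0)
        guess_hidden_is_higher = threshold > seen
        correct += guess_hidden_is_higher == (hidden > seen)
    return correct / trials
```

The advantage over $\frac{1}{2}$ is always strictly positive, but it can be tiny when both numbers sit far out in the tail of $X$.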

The xkcd post pointed to this cool archive of puzzles and riddles. I was also told that the xkcd puzzle forum is a source of excellent puzzles, such as this one:

You are the most eligible bachelor in the kingdom, and as such the King has invited you to his castle so that you may choose one of his three daughters to marry. The eldest princess is honest and always tells the truth. The youngest princess is dishonest and always lies. The middle princess is mischievous and tells the truth sometimes and lies the rest of the time. As you will be forever married to one of the princesses, you want to marry the eldest (truth-teller) or the youngest (liar) because at least you know where you stand with them. The problem is that you cannot tell which sister is which just by their appearance, and the King will only grant you ONE yes or no question which you may only address to ONE of the sisters. What yes or no question can you ask which will ensure you do not marry the middle sister?

copied from here.

Looking at probability distributions

renatoppl — Fri, 13 Nov 2009 03:16:27 +0000

I’ve been taking two classes in probability this semester and in those I saw the proofs of a lot of interesting theorems which I knew about previously but whose proofs I had never seen, such as the Central Limit Theorem, the Laws of Large Numbers and so on… Also, some theory which looks somewhat ugly in the undergrad courses becomes very clear with the proper formal treatment. Today I was thinking about what was the main take-home message that a computer scientist could take from those classes and, at least for me, this message is the various ways of looking at probability distributions. I’ve heard about moments, the Laplace transform, the Fourier transform and other tools like that, but I never realized before their true power. Probably still today, most of their true power is hidden from me, but I am starting to look at them in a different way. Let me try to go over a few examples of different ways we can look at probability distributions and show cases where they are interesting.

Most ways of looking at probability distributions are associated with multiplicative systems: a multiplicative system $Q$ is a set of real-valued functions with the property that if $f_1, f_2 \in Q$ then $f_1 f_2 \in Q$. Those kinds of sets are powerful because of the Multiplicative Systems Theorem:

Theorem 1 (Multiplicative Systems Theorem) If $Q$ is a multiplicative system, $H$ is a linear space containing $1$ (the constant function $1$) and is closed under bounded convergence, then $Q \subseteq H$ implies that $H$ contains all bounded $\sigma(Q)$-measurable functions.

The theorem might look a bit cryptic if you are not familiar with the definitions, but it boils down to the following translation:

Theorem 2 (Translation of the Multiplicative Systems Theorem) If $Q$ is a “general” multiplicative system, and $X$ and $Y$ are random variables such that $\mathbb{E}[f(X)] = \mathbb{E}[f(Y)]$ for all $f \in Q$, then $X$ and $Y$ have the same distribution.

where “general” excludes some troublesome cases, such as a system consisting only of constant functions. In technical terms, we want $\sigma(Q)$ to be the Borel $\sigma$-algebra. But let’s not worry about those technical details and just look at the translated version. We now discuss several kinds of multiplicative systems:

The most common description of a random variable $X$ is by the cumulative distribution function $F(u) = P\{X \leq u\}$. This is associated with $Q = \{1_{(-\infty, u]}; u \in \mathbb{R}\}$; notice that simply $F(u) = \mathbb{E}[1_{(-\infty, u]}(X)]$.
We can characterize a random variable by its moments: the variable $X$ is characterized by the set $\{M_n = \mathbb{E}[X^n]; n = 1, 2, \ldots\}$. Given the moments $M_1, M_2, \ldots$, the variable is totally characterized, i.e., if two variables have the same moments, then they have the same distribution by the Multiplicative Systems Theorem. This description is associated with the system $Q = \{x^n; n = 1, 2, \ldots\}$.
Moment Generating Function: If $X$ is a variable that assumes only non-negative integer values, we can describe it as $(p_0, p_1, p_2, \ldots)$, where $p_n = P\{X = n\}$. An interesting way of representing those probabilities is as the moment generating function $\phi_X(z) = \mathbb{E}[z^X] = \sum_n p_n z^n$. This is associated with the multiplicative system $Q = \{z^x; 0 \leq z \leq 1\}$. Now suppose we are given two discrete independent variables $X$ and $Y$. What do we know about $X+Y$? It is easy to know its expectation, its variance, … but what about more complicated things? What is the distribution of $X+Y$? Moment generating functions answer this question very easily, since:

$$\phi_{X+Y}(z) = \mathbb{E}[z^{X+Y}] = \mathbb{E}[z^X] \mathbb{E}[z^Y] = \phi_X(z) \phi_Y(z)$$

If we know moment generating functions, we can calculate expectation very easily, since $\mathbb{E}[X] = \phi_X'(1)$. For example, suppose we have a process like this: there is one bacterium at time $0$. In each timestep, either this bacterium dies (with probability $p_0$), continues alive without reproducing (with probability $p_1$) or has $k$ offspring (with probability $p_{k+1}$). These probabilities satisfy $\sum_k p_k = 1$. Each time, the same happens, independently, with each of the bacteria alive at that moment. The question is, what is the expected number of bacteria at time $n$?

It looks like a complicated problem with just elementary tools, but it is a simple problem if we have moment generating functions. Just let $X_{in}$ be the variable associated with the $i$-th bacterium of time $n$. It is zero if it dies, $1$ if it stays the same and $k+1$ if it has $k$ offspring. Let also $N_n$ be the number of bacteria at time $n$. We want to know $\mathbb{E}[N_n]$. First, see that:

$$N_{n+1} = \sum_{i=1}^{N_n} X_{in}$$

Now, let’s write that in terms of moment generating functions:

$$\phi_{N_{n+1}}(z) = \mathbb{E}[z^{N_{n+1}}] = \sum_k P\{N_n = k\} \mathbb{E}\left[z^{\sum_{i=1}^k X_{in}}\right]$$

which is just:

$$\phi_{N_{n+1}}(z) = \sum_k P\{N_n = k\} \phi_X(z)^k$$

since the variables are all independent and identically distributed. Now, notice that:

$$\sum_k P\{N_n = k\} \phi_X(z)^k = \phi_{N_n}(\phi_X(z))$$

by the definition of moment generating function, so we effectively proved that:

$$\phi_{N_{n+1}}(z) = \phi_{N_n}(\phi_X(z))$$

We proved that $\phi_{N_n}$ is just $\phi_X$ iterated $n$ times. Now, calculating the expectation is easy, using the fact that $\phi_X(1) = 1$ and $\phi_X'(1) = \mathbb{E}[X]$. Just see that: $\mathbb{E}[N_{n+1}] = \phi_{N_{n+1}}'(1) = \phi_{N_n}'(\phi_X(1)) \phi_X'(1) = \mathbb{E}[N_n] \mathbb{E}[X]$. Then, clearly $\mathbb{E}[N_n] = \mathbb{E}[X]^n$. Using a similar technique we can prove a lot more things about this process, just by analyzing the behavior of the moment generating function.
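The iteration of generating functions in the bacteria example can be carried out exactly on a computer by treating each generating function as a polynomial (a list of coefficients) and composing polynomials. The sketch below assumes an offspring distribution with finite support; the example distribution and all function names are invented for the illustration.

```python
from fractions import Fraction


def poly_mul(a, b):
    """Product of two polynomials given as coefficient lists."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def poly_add(a, b):
    """Sum of two polynomials given as coefficient lists."""
    size = max(len(a), len(b))
    return [(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)
            for i in range(size)]


def compose(outer, inner):
    """Coefficients of outer(inner(z)), computed with Horner's scheme."""
    result = [Fraction(0)]
    for coeff in reversed(outer):
        result = poly_add(poly_mul(result, inner), [coeff])
    return result


def population_pgf(offspring, generations):
    """Generating function of N_n, starting from a single bacterium."""
    pgf = [Fraction(0), Fraction(1)]  # N_0 = 1 has generating function z
    for _ in range(generations):
        pgf = compose(pgf, offspring)  # phi_{N_{n+1}} = phi_{N_n} o phi_X
    return pgf


def mean(pgf):
    """Expectation, i.e. the derivative of the generating function at 1."""
    return sum(k * c for k, c in enumerate(pgf))


# Hypothetical example: die w.p. 1/4, survive w.p. 1/4, split in two w.p. 1/2.
OFFSPRING = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]
```

With this example distribution $\mathbb{E}[X] = \frac{5}{4}$, so the mean of the population after $n$ steps should be exactly $(\frac{5}{4})^n$.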
Laplace Transform: Now, moving to continuous variables, if $X$ is a continuous non-negative variable we can define its Laplace transform as: $\phi_X(s) = \mathbb{E}[e^{-sX}] = \int_0^\infty e^{-sx} \rho_X(x) dx$, where $\rho_X$ stands for the distribution of $X$, for example, $P\{X \leq u\} = \int_0^u \rho_X(x) dx$. This is associated with the multiplicative system $Q = \{e^{-sx}; s \geq 0\}$. Again, by the Multiplicative Systems Theorem, if $\phi_X = \phi_Y$, then the two variables have the same distribution. The Laplace transform has the same nice properties as the Moment Generating Function, for example, $\phi_{X+Y}(s) = \phi_X(s) \phi_Y(s)$ for independent $X$ and $Y$. And it allows us to do tricks similar to the one I just showed for Moment Generating Functions. One common trick that is used, for example, in the proof of Chernoff bounds is, given independent non-negative random variables $X_1, \ldots, X_n$:
$$\displaystyle P\left\{\sum_i X_i > u\right\} = P\left\{e^{\sum_i X_i} > e^u\right\} \leq \frac{\mathop{\mathbb E}[e^{\sum_i X_i} ]}{e^u} = \frac{\prod_i \mathop{\mathbb E}[e^{X_i} ]}{e^u}$$

where we also used Markov’s Inequality: $P\{X > a\} \leq \frac{\mathbb{E}[X]}{a}$. Passing to the Laplace transform is the main ingredient in the Chernoff bound and it allows us to sort of “decouple” the random variables in the sum. There are several other cases where the Laplace transform proves itself very useful and turns things that looked very complicated when we saw them in undergrad courses into simple and clear things. One clear example of that is the motivation for the Poisson random variable:

If $X_1, X_2, \ldots$ are independent exponentially distributed random variables with mean $\frac{1}{\lambda}$, then $\rho_{X_i}(x) = \lambda e^{-\lambda x}$. An elementary calculation shows that its Laplace transform is $\phi_{X_i}(s) = \frac{\lambda}{\lambda + s}$. Let $S_n = X_1 + \ldots + X_n$, i.e., the time of the $n$-th arrival. We want to know what is the distribution of $S_n$. How to do that?

$$\phi_{S_n}(s) = \prod_{i=1}^n \phi_{X_i}(s) = \left( \frac{\lambda}{\lambda+s} \right)^n$$

Now, we need to find $\rho_{S_n}$ such that $\int_0^\infty \rho_{S_n}(x) e^{-sx} dx = \left( \frac{\lambda}{\lambda+s} \right)^n$. Now it is just a matter of solving this equation and we get: $\rho_{S_n}(x) = \frac{\lambda (\lambda x)^{n-1}}{(n-1)!} e^{-\lambda x}$. Now, the Poisson variable $N_t$ measures the number of arrivals in $[0,t]$ and therefore:

$$\displaystyle \begin{aligned} P\{N_t = n\} & = P\{S_n \leq t < S_{n+1}\} = P\{S_{n+1} > t\} - P\{S_n > t\} \\ & = \int_t^\infty \rho_{S_{n+1}}(x) dx - \int_t^\infty \rho_{S_n}(x) dx = \frac{(\lambda t)^n}{n!} e^{-\lambda t} \end{aligned}$$
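The link between exponential inter-arrival times and the Poisson distribution can also be observed empirically. This is a small simulation sketch with made-up function names; it counts how many arrivals with exponential gaps fall in $[0,t]$ and compares the frequency with the Poisson formula.

```python
import math
import random


def arrivals_before(t, rate, rng):
    """Number of arrivals in [0, t] when gaps are exponential with this rate."""
    count, clock = 0, rng.expovariate(rate)
    while clock <= t:
        count += 1
        clock += rng.expovariate(rate)
    return count


def empirical_pmf(n, t, rate, trials=50_000, seed=0):
    """Observed frequency of seeing exactly n arrivals in [0, t]."""
    rng = random.Random(seed)
    hits = sum(arrivals_before(t, rate, rng) == n for _ in range(trials))
    return hits / trials


def poisson_pmf(n, t, rate):
    """(rate*t)^n / n! * exp(-rate*t)."""
    return (rate * t) ** n / math.factorial(n) * math.exp(-rate * t)
```

For instance, `empirical_pmf(3, 1.5, 2.0)` and `poisson_pmf(3, 1.5, 2.0)` should agree up to sampling noise.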
Characteristic Function or Fourier Transform: Taking a purely imaginary argument in the Laplace transform we get the Fourier Transform: $\psi_X(t) = \mathbb{E}[e^{itX}]$, which also has some of the nice properties of the previous ones and some additional ones. The characteristic functions were the main actors in the development of all the probability techniques that led to the main result of 19th century Probability Theory: the Central Limit Theorem. We know that moment generating functions and Laplace transforms completely characterize the distributions, but it is not clear how to recover a distribution once we have a transform. For the Fourier Transform there is a clean and simple way of doing that by means of the Inversion Formula:

$$\rho_X(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-itx} \psi_X(t) dt$$

One fact that always puzzled me was: why is the normal distribution so important? What is so special about it that makes it the limiting distribution in the Central Limit Theorem, i.e., if $X_1, X_2, \ldots$ is a sequence of independent random variables and $S_n = X_1 + \ldots + X_n$, then $\frac{S_n - \mathbb{E}[S_n]}{\sqrt{\mathrm{Var}(S_n)}} \rightarrow N(0,1)$ under some natural conditions on the variables. The reason the normal is so special is because it is a “fixed point” for the Fourier Transform. We can see that $\mathbb{E}[e^{itX}] = e^{-t^2/2}$ for $X \sim N(0,1)$. And there we have something special about it that makes me believe the Central Limit Theorem.

————————-

This blog post was based on lectures by Professor Dynkin at Cornell.

Random Spanning Trees

renatoppl — Wed, 04 Nov 2009 04:51:59 +0000

BigRedBits is again pleased to have Igor Gorodezky as a guest blogger directly from UCLA. I leave you with his excellent post on Wilson’s algorithm.

——————————————

Igor again, with another mathematical dispatch from UCLA, where I’m spending the semester eating and breathing combinatorics as part of the 2009 program on combinatorics and its applications at IPAM. In the course of some reading related to a problem with which I’ve been occupying myself, I ran across a neat algorithmic result – Wilson’s algorithm for uniformly generating spanning trees of a graph. With Renato’s kind permission, let me once again make myself at home here at Big Red Bits and tell you all about this little gem.

The problem is straightforward, and I’ve essentially already stated it: given an undirected, connected graph $G = (V,E)$, we want an algorithm that outputs uniformly random spanning trees of $G$. In the early ’90s, Aldous and Broder independently discovered an algorithm for accomplishing this task. This algorithm generates a tree $T$ by, roughly speaking, performing a random walk on $G$ and adding the edge $(u,v)$ to $T$ every time that the walk steps from $u$ to $v$ and $v$ is a vertex that has not been seen before.

Wilson’s algorithm (D. B. Wilson, “Generating random spanning trees more quickly than the cover time,” STOC ’96) takes a slightly different approach. Let us fix a root vertex $r$. Wilson’s algorithm can be stated as a loop-erased random walk on $G$ as follows.

Algorithm 1 (Loop-erased random walk) Maintain a tree $T$, initialized to consist of $r$ alone. While there remains a vertex $v$ not in $T$: perform a random walk starting at $v$, erasing loops as they are created, until the walk encounters a vertex in $T$, then add to $T$ the cycle-erased simple path from $v$ to $T$.

We observe that the algorithm halts with probability 1 (its expected running time is actually polynomial, but let’s not concern ourselves with these issues here), and outputs a random directed spanning tree oriented towards $r$. It is a minor miracle that this tree is in fact sampled uniformly from the set of all such trees. Let us note that this offers a solution to the original problem, as sampling $r$ randomly and then running the algorithm will produce a uniformly generated spanning tree of $G$.

It remains, then, to prove that the algorithm produces uniform spanning trees rooted at $r$ (by which we mean directed spanning trees oriented towards $r$). To this we dedicate the remainder of this post.
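For concreteness, here is a short sketch of the loop-erased random walk. It uses a common implementation trick that is an assumption of this snippet rather than something stated above: instead of erasing loops explicitly, each vertex remembers only the last step the walk took out of it, and retracing those pointers from the start yields the loop-erased path. The graph is a dictionary of adjacency lists, and the function names are hypothetical.

```python
import random


def wilson_spanning_tree(neighbors, root, rng):
    """Return parent pointers of a random spanning tree oriented towards root.

    neighbors maps each vertex to the list of its adjacent vertices.
    """
    parent = {root: None}
    for start in neighbors:
        # Random walk from start until it hits the current tree. Overwriting
        # successor[v] on every visit is what erases the loops.
        successor = {}
        v = start
        while v not in parent:
            successor[v] = rng.choice(neighbors[v])
            v = successor[v]
        # Retrace the loop-erased path and attach it to the tree.
        v = start
        while v not in parent:
            parent[v] = successor[v]
            v = successor[v]
    return parent


def tree_edges(parent):
    """Undirected edge set of the tree described by the parent pointers."""
    return frozenset(frozenset((v, p)) for v, p in parent.items() if p is not None)
```

On a triangle, for example, repeated calls with `rng = random.Random(0)` should produce each of the three spanning trees about one third of the time.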

1. A “different” algorithm

Wilson’s proof is delightfully sneaky: we begin by stating and analyzing a seemingly different algorithm, the cycle-popping algorithm. We will prove that this algorithm has the desired properties, and then argue that it is equivalent to the loop-erased random walk (henceforth LERW).

The cycle-popping algorithm works as follows. Given $G$ and $r$, associate with each non-root vertex $v$ an infinite stack of neighbors. More formally, to each $v \neq r$ we associate

$$S_v = [u_{v,1}, u_{v,2}, u_{v,3}, \ldots]$$

where each $u_{v,i}$ is uniformly (and independently) sampled from the set of neighbors of $v$. Note that each stack is not a random walk, just a list of neighbors. We refer to the left-most element above as the top of $S_v$, and by popping the stack we mean removing this top vertex from $S_v$.

Define the stack graph $G_S$ to be the directed graph on $V$ that has an edge from $v$ to $u$ if $u$ is at the top of the stack $S_v$. Clearly, if $G$ has $n$ vertices then $G_S$ is an oriented subgraph of $G$ with $n-1$ edges. The following lemma follows immediately.

Lemma 1 Either $G_S$ is a directed spanning tree oriented towards $r$ or it contains a directed cycle.

If there is a directed cycle $C$ in $G_S$ we may pop it by popping $S_v$ for every $v \in C$. This eliminates $C$, but of course might create other directed cycles. Without resolving this tension quite yet, let us go ahead and formally state the cycle-popping algorithm.

Algorithm 2 (Cycle-popping algorithm) Create a stack $S_v$ for every $v \neq r$. While $G_S$ contains any directed cycles, pop a cycle from the stacks. If this process ever terminates, output $G_S$.

Note that by the lemma, if the algorithm ever terminates then its output is a spanning tree rooted at $r$. We claim that the algorithm terminates with probability 1, and moreover generates spanning trees rooted at $r$ uniformly.
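The cycle-popping algorithm can be sketched in a few lines. One simplification here is an assumption of the snippet: since stack entries are independent uniform samples, the infinite stacks are generated lazily, and only the current top of each stack is stored. The names `find_cycle` and `cycle_popping` are invented for the illustration.

```python
import random


def find_cycle(top, root):
    """Return the vertices of some directed cycle in the stack graph, or None."""
    state = {root: "done"}
    for start in top:
        path = []
        v = start
        while v not in state:
            state[v] = "active"
            path.append(v)
            v = top[v]
        if state[v] == "active":  # the walk ran into its own path: a cycle
            return path[path.index(v):]
        for u in path:
            state[u] = "done"  # everything on this path leads to the root
    return None


def cycle_popping(neighbors, root, rng):
    """Pop cycles until the stack graph is a tree; return its parent pointers."""
    # top[v] is the current top of the stack of v (lazily sampled).
    top = {v: rng.choice(neighbors[v]) for v in neighbors if v != root}
    while True:
        cycle = find_cycle(top, root)
        if cycle is None:
            return top
        for v in cycle:
            top[v] = rng.choice(neighbors[v])  # pop: reveal the next entry
```

On a small graph such as a triangle, the output frequencies can be compared with the uniform distribution over spanning trees.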

To this end, some more definitions: let us say that given a stack $S_v = [u_{v,1}, u_{v,2}, \ldots]$, the vertex $u_{v,i}$ is at level $i$. The level of a vertex in a stack is static, and is defined when the stack is created. That is, the level of $u_{v,i}$ does not change even if $u_{v,i}$ advances to the top of the stack as a result of the stack getting popped.

We regard the sequence of stack graphs produced by the algorithm as leveled stack graphs: each non-root vertex $v$ is assigned the level of the current top of its stack. Observe that the level of $v$ in $G_S$ is (one plus) the number of times that $S_v$ has been popped. In the same way, we regard cycles encountered by the algorithm as leveled cycles, and we can regard the tree produced by the algorithm (if indeed one is produced) as a leveled tree.

The analysis of the algorithm relies on the following key lemma (Theorem 4 in Wilson’s paper), which tells us that the order in which the algorithm pops cycles is irrelevant.

Lemma 2 For a given set of stacks, either the cycle-popping algorithm never terminates, or there exists a unique leveled spanning tree rooted at such that the algorithm outputs irrespective of the order in which cycles are popped.

Proof: Fix a set of stacks $\{S_u\}$. Consider a leveled cycle $C$ that is pop-able, i.e. there exist leveled cycles $C_1, C_2, \ldots, C_k = C$ that can be popped in sequence. We claim that if the algorithm pops any cycle not equal to $C$, then there still must exist a series of cycles that ends in $C$ and that can be popped in sequence. In other words, if $C$ is pop-able then it remains pop-able, no matter which cycles are popped, until $C$ itself is actually popped.

Let $C'$ be a cycle popped by the algorithm. If $C' = C_1$ then the claim is clearly true. Also, if $C'$ shares no vertices with $C_1, \ldots, C_k$, then the claim is true again. So assume otherwise, and let $C_i$ be the first in the series $C_1, \ldots, C_k$ to share a vertex with $C'$. Let us show that $C' = C_i$ by contradiction.

If $C' \neq C_i$, then $C'$ and $C_i$ must share a vertex $w$ that has different successors in $C'$ and $C_i$. But by definition of $C_i$, none of the $C_1, \ldots, C_{i-1}$ contain $w$, and this implies that $w$ has the same level in $C'$ and $C_i$. Therefore its successor in both cycles is the same, a contradiction. This proves $C' = C_i$.

Moreover, the argument above proves that $C'$ and $C_i$ are equal as leveled cycles (i.e. every vertex has the same level in both cycles). Hence

$$C' = C_i,\; C_1,\; C_2,\; \ldots,\; C_{i-1},\; C_{i+1},\; \ldots,\; C_k = C$$

is a series of cycles that can be popped in sequence, which proves the original claim about $C$.

We conclude that given a set of stacks, either there is an infinite number of pop-able cycles, in which case there will always be an infinite number and the algorithm will never terminate, or there is a finite number of such cycles. In the latter case, every one of these cycles is eventually popped, and the algorithm produces a spanning tree $T$ rooted at $r$. The level of each non-root vertex $u$ in $T$ is given by (one plus) the number of popped cycles that contained $u$.

Wilson summarizes the cycle-popping algorithm thusly: “[T]he stacks uniquely define a tree together with a partially ordered set of cycles layered on top of it. The algorithm peels off these cycles to find the tree.”

Theorem 3 The cycle-popping algorithm terminates with probability 1, and the tree that it outputs is a uniformly sampled spanning tree rooted at $r$.

Proof: The first claim is easy: $G$ has a spanning tree, therefore it has a directed spanning tree oriented towards $r$. The stacks generated in the first step of the algorithm will contain such a tree, and hence the algorithm will terminate, with probability 1.

Now, consider a spanning tree $T$ rooted at $r$. We’ll abuse notation and let $T$ be the event that $T$ is produced by the algorithm. Similarly, given a collection of leveled cycles $\mathcal{C}$, we will write $\mathcal{C}$ for the event that $\mathcal{C}$ is the set of leveled cycles popped by the algorithm before it terminates. Finally, let $\mathcal{C} \wedge T$ be the event that the algorithm popped the leveled cycles in $\mathcal{C}$ and terminated, with the resulting leveled tree being equal to $T$.

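By the independence of the stack entries, we have $\Pr[\mathcal{C} \wedge T] = \Pr[\mathcal{C}] \cdot p$, where $p$ is the probability that the algorithm’s output is a leveled version of $T$, a quantity which a moment’s reflection will reveal is independent of $T$. Now,

$$\Pr[T] = \sum_{\mathcal{C}} \Pr[\mathcal{C} \wedge T] = p \sum_{\mathcal{C}} \Pr[\mathcal{C}]$$

which, as desired, is independent of $T$.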

2. Conclusion

We have shown that the cycle-popping algorithm generates spanning trees rooted at $r$ uniformly. It remains to observe that the LERW algorithm is nothing more than an implementation of the cycle-popping algorithm! Instead of initially generating the (infinitely long) stacks and then looking for cycles to pop, the LERW generates stack elements as necessary via random walk (computer scientists might recognize this as the Principle of Deferred Decisions). If the LERW encounters a loop, then it has found a cycle in the stack graph $G_S$ induced by the stacks that the LERW has been generating. Erasing the loop is equivalent to popping this cycle. We conclude that the LERW algorithm generates spanning trees rooted at $r$ uniformly.
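For concreteness, here is a minimal Python sketch of the loop-erased random walk viewed this way. It assumes a connected graph given as an adjacency dictionary `adj` and a simple random walk (uniform choice among neighbours); the function name `wilson` and the variable names are illustrative only. Overwriting `parent[u]` when the walk revisits `u` is exactly what pops the stack at `u`, so loops are erased implicitly.

```python
import random

def wilson(adj, root, rng=None):
    """Sample a spanning tree rooted at `root` by loop-erased random walks.

    Returns a dict mapping each non-root vertex to its parent.
    """
    rng = rng or random.Random()
    parent = {}
    in_tree = {root}
    for start in adj:
        # Random walk from `start` until it hits the current tree.
        # Revisiting a vertex overwrites its pointer, which erases the loop.
        u = start
        while u not in in_tree:
            parent[u] = rng.choice(adj[u])
            u = parent[u]
        # Retrace the loop-erased path and attach it to the tree.
        u = start
        while u not in in_tree:
            in_tree.add(u)
            u = parent[u]
    return {u: parent[u] for u in in_tree if u != root}
```

On a triangle there are only three spanning trees, so sampling a few thousand times and counting the outcomes gives a quick empirical check that the frequencies are roughly equal.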

Entropy

renatoppl — Fri, 28 Aug 2009 01:28:54 +0000

Today was the first day of classes here at Cornell and, as usual, I attended a lot of different classes to try to decide which ones to take. I usually feel like I want to take them all, but there is this constant struggle: if I take too many classes I have no time to do research and to read random things that happen to catch my attention at that moment, and if I don’t take many classes I feel like I am not learning a lot of the interesting stuff I want to be learning. The middle-ground solution is to audit a lot of classes and start dropping them as I start needing more time, which usually happens quickly. This particular fall I decided that I need to build a stronger background in probability, since I am finding a lot of probabilistic stuff in my way and I have nothing more than my undergrad course and things I learned on demand. I attended at least three probability classes with different flavours today and I decided to blog about a simple, yet very impressive result I saw in one of them.

Ever since I took a class on “Principles of Telecommunications” in my undergrad, I have been impressed by Shannon’s Information Theory and the concept of entropy. There was one theorem that I always heard about but whose proof I had never seen. I thought it was a somewhat complicated proof, but it turned out not to be that complicated.

Consider an alphabet $A$ and a probability distribution $p$ over it. I want to associate to each $a \in A$ a string $c(a)$ of $\{0,1\}$-digits to represent each symbol of the alphabet. One way of making the code decodable is to make it a proper code. A proper code is a code such that given any distinct $a, b \in A$, $c(a)$ is not a prefix of $c(b)$. There are several codes like this, but some are more efficient than others. Since the letters have different frequencies, it makes sense to code a frequent letter (say ‘e’ in English) with few bits and a letter that doesn’t appear much, say ‘q’, with more bits. We want to find a proper code to minimize the expected codeword length:

$$E|c| = \sum_{a \in A} p(a) \, |c(a)|$$

The celebrated theorem by Shannon shows that for any proper code (actually it holds more generally for any decodable code), we have $E|c| \geq H(p)$, where $H(p)$ is the entropy of the alphabet, defined as:

$$H(p) = -\sum_{a \in A} p(a) \log_2 p(a)$$

Even more impressive is that we can achieve something very close to it:

Theorem 1 There is a code such that $E|c| \leq H(p) + 1$.

With an additional trick we can get $H(p) + \epsilon$ for any $\epsilon > 0$. The first part (Shannon’s lower bound) is trickier and I won’t do it here (but again, it is not as hard as I thought it would be). For proving that there is a code with average length at most $H(p) + 1$ we use the following lemma:

Lemma 2 There is a proper code for $A = \{a_1, \ldots, a_n\}$ with code-lengths $\ell_1, \ldots, \ell_n$ if and only if

$$\sum_{i=1}^{n} 2^{-\ell_i} \leq 1$$

Proof: Let $L = \max_i \ell_i$ and imagine all the possible codewords of length at most $L$ as a complete binary tree of depth $L$. Since it is a proper code, no two codewords $c(a_i)$ and $c(a_j)$ are on the same path to the root. So, picking one node as a codeword means that we can’t pick any node in the subtree below it. Also, for each leaf, there is at most one codeword on its path to the root. Therefore we can assign each leaf of the tree to a single codeword or to no codeword at all. It is easy to see that a codeword of size $\ell_i$ has associated with it $2^{L - \ell_i}$ leaves. Since there are $2^L$ leaves in total, we have that:

$$\sum_{i=1}^{n} 2^{L - \ell_i} \leq 2^L$$

which proves one direction of the result (divide both sides by $2^L$). Now, to prove the converse direction, we can propose a greedy algorithm: given $\ell_1, \ldots, \ell_n$ such that $\sum_i 2^{-\ell_i} \leq 1$, let $L = \max_i \ell_i$. Now, suppose $\ell_1 \leq \ell_2 \leq \ldots \leq \ell_n$. Start with the $2^L$ leaves in a whole block. Start dividing them into $2^{\ell_1}$ blocks and assign one to $a_1$. Now we define the recursive step: when we analyze $a_i$, the leaves are divided into $2^{\ell_{i-1}}$ blocks, some occupied, some not. Divide each free block into $2^{\ell_i - \ell_{i-1}}$ blocks and assign one of them to $a_i$. It is not hard to see that each block corresponds to one node in the tree (the common ancestor of all the leaves in that block) and that the assignment corresponds to a proper code.
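The greedy construction can be sketched in Python as follows. This is an illustration rather than the construction from the lecture: it assumes all lengths are positive integers, tracks the "next free block" as an integer counter instead of an explicit tree, and the name `kraft_code` is made up for the sketch.

```python
from fractions import Fraction

def kraft_code(lengths):
    """Return binary codewords with the given lengths (in input order),
    or None if the lengths violate the inequality of the lemma.

    Assumes every length is a positive integer.
    """
    # Exact arithmetic avoids rounding issues in the sum of 2^(-l).
    if sum(Fraction(1, 2 ** l) for l in lengths) > 1:
        return None
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codes = [None] * len(lengths)
    value, prev = 0, 0  # index of the next free block at depth `prev`
    for i in order:
        # Split every block at depth `prev` into 2^(l_i - prev) blocks.
        value <<= lengths[i] - prev
        codes[i] = format(value, "b").zfill(lengths[i])
        value += 1  # the block just used is now occupied
        prev = lengths[i]
    return codes
```

For example, `kraft_code([1, 2, 3, 3])` yields four codewords of lengths 1, 2, 3 and 3, none of which is a prefix of another, while `kraft_code([1, 1, 2])` returns `None` because the sum of $2^{-\ell_i}$ exceeds 1.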

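Now, using this we show how to find a code with $E|c| \leq H(p) + 1$. For each $a \in A$ (with $p(a) > 0$), since $p(a) \leq 1$ we can always find an integer $\ell_a \geq 0$ such that $2^{-\ell_a} \leq p(a) < 2^{-\ell_a + 1}$. Now, clearly:

$$\sum_{a \in A} 2^{-\ell_a} \leq \sum_{a \in A} p(a) = 1$$

so by Lemma 2 there is a proper code $c$ with $|c(a)| = \ell_a$, and:

$$E|c| = \sum_{a \in A} p(a) \, \ell_a < \sum_{a \in A} p(a) \left( 1 - \log_2 p(a) \right) = H(p) + 1$$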

Cool, but now how to bring it to $H(p) + \epsilon$? The idea is to code multiple symbols at the same time as a block (even though the symbols are independent, so we are not taking advantage of any correlation between them). Consider $A^k$ and the probability function $p_k$ induced on it, i.e.:

$$p_k(a_1, \ldots, a_k) = \prod_{i=1}^{k} p(a_i)$$

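It is not hard to see that $A^k$ with $p_k$ has entropy $k H(p)$ because:

$$H(p_k) = -\sum_{a_1, \ldots, a_k} p_k(a_1, \ldots, a_k) \sum_{i=1}^{k} \log_2 p(a_i) = -\sum_{i=1}^{k} \sum_{a_i \in A} p(a_i) \log_2 p(a_i) = k H(p)$$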

and then we can just apply the last theorem to that: we can find a function $c_k : A^k \to \{0,1\}^*$ that codifies blocks of $k$ symbols as binary strings such that:

$$E|c_k| \leq H(p_k) + 1 = k H(p) + 1$$

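since $c_k$ codifies $k$ symbols, we are actually interested in $\frac{1}{k} E|c_k|$, the expected number of bits per symbol, and therefore we get:

$$\frac{1}{k} E|c_k| \leq H(p) + \frac{1}{k}$$

which is at most $H(p) + \epsilon$ as soon as $k \geq 1/\epsilon$.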
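The effect of coding in blocks is easy to check numerically. The sketch below assumes a small made-up distribution and uses the code-lengths from the proof of Theorem 1 (the smallest integer $\ell$ with $2^{-\ell} \leq p$); the function names are illustrative. It only computes lengths and their average, not the codewords themselves.

```python
import math
from itertools import product

def entropy(p):
    """Entropy in bits of a distribution given as {symbol: probability}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def shannon_lengths(p):
    """Smallest integer l with 2**-l <= p(a), for every symbol a."""
    return {a: math.ceil(-math.log2(q)) for a, q in p.items()}

def bits_per_symbol(p, k):
    """Average bits per source symbol when coding blocks of k symbols."""
    # Product distribution on A^k (symbols assumed independent).
    blocks = {w: math.prod(p[a] for a in w) for w in product(p, repeat=k)}
    lengths = shannon_lengths(blocks)
    return sum(blocks[w] * lengths[w] for w in blocks) / k

if __name__ == "__main__":
    p = {"a": 0.7, "b": 0.2, "c": 0.1}  # hypothetical example distribution
    print("entropy:", entropy(p))
    for k in (1, 2, 4, 8):
        print(k, bits_per_symbol(p, k))
```

For every block size $k$ the printed value should lie between $H(p)$ and $H(p) + 1/k$. The price is that the number of blocks grows as $|A|^k$, so the table of codewords becomes exponentially large.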