Posts Tagged ‘probability’

Bayesian updates and the Lake Wobegon effect

September 25th, 2011 No comments

We seem to have a good mathematical understanding of Bayesian updates, but somehow a very poor understanding of its practical implications. There are many situations in practice that we easily perceive as irrational, one of the most famous is the so called Lake Wobegon effect, named after the fictional town in Minnesota, where “all the women are strong, all the men are good looking, and all the children are above average”. It is described as a cognitive bias where individuals tend to overestimate their own capabilities. In fact, when drivers are asked to rate their own skilled compared to the average in three groups: low-skilled, medium-skilled and high-skilled, most rate themselves above the average.

In fact, the behavioral economics literate is full of examples like this where the observed data is far from what you would expect to observe if all agents were rational – and those are normally attributed to cognitive biases. I was always a bit suspicious of such arguments: it was never clear if agents were simply not being rational or whether their true objective wasn’t being captured by the model. I always thought the second was a lot more likely.

One of the main problems of the irrationality argument is that they ignore the fact that agents live in a world where its states are not completely observed. In a beautiful paper in Econometrica called “Apparent Overconfidence“, Benoit and Dubra argue that:

“But the simple truism that most people cannot be better than the median does not imply that most people cannot rationally rate themselves above the median.”

The authors show that it is possible to reverse engineer a signaling scheme such that the data is mostly consistent with the observation. Let me try to give a simple example they give in the introduction: consider that each driver has one of three types of skill: low, medium or high: \{L,M,H\} and \mathbb{P}(L) = \mathbb{P}(M) = \mathbb{P}(H) = \frac{1}{3}. However, they can’t observe this. They can only observe some sample of their driving. Let’s say for simplicity that they can observe a signal A that says if they caused an accident or not. Assume also that the larger that skill of a driver, the higher it is his probability of causing an accident, say:

\mathbb{P}(A \vert L) = \frac{47}{80}, \mathbb{P}(A \vert L) = \frac{9}{16}, \mathbb{P}(A \vert L) = \frac{1}{20}

Before observing A each driver things of himself as having probability $\frac{1}{3}$ of having each type of skill. Now, after observing A, they update their belief according to Bayes rule, i.e.,

\mathbb{P}(s \vert A) = \frac{ \mathbb{P}(A \vert s)   \mathbb{P}(s)  }{ \sum_{s'}  \mathbb{P}(A \vert s')   \mathbb{P}(s')  }

doing the calculations, we have that \mathbb{P}(A) = \frac{2}{5} and for the \frac{3}{5} of the drivers that didn’t suffer an accident, they’ll evaluate \mathbb{P}(L \vert \neg A) = \frac{11}{48}, \mathbb{P}(M \vert \neg A) = \frac{35}{144}, \mathbb{P}(H \vert \neg A) = \frac{19}{36}, so:

\mathbb{P}(H \vert \neg A) > \mathbb{P}(L \cup M \vert \neg A)

and therefore will report high-skill. Notice this is totally consistent with rational Bayesian-updaters. The main question in the paper is: “when it is possible to reverse engineer a signaling scheme ?”. More formally, let \Theta be a set of types of users and let \theta \sim H \in \Delta(\Theta), i.e., H is a distribution on the types which is common knowledge. Now, if we ask agents to report their type, their report is some H' \in \Delta(\Theta), H' \neq H. Is there a signaling scheme S which can be interpreted as a random variable correlated with \theta such that H' is the distribution rational Bayesian updaters would report based on what they observed from S ? The authors give necessary and sufficient condition on when this is possible given H,H'.


A note also related to the Lake Wobegon effect: I started reading a very nice book by Duncan Watts called “Everything Is Obvious: *Once You Know the Answer” about traps of the common-sense. The discussion is different then above, but it also talks about the dangers of applying our usual common sense, which is very useful to our daily life, to scientific results. I highly recommend reading the intro of the book, which is open in Amazon. He gives examples of social phenomena where, once you are told them, you think: “oh yeah, this is obvious”. But then if you were told the exact opposite (in fact, he begins the example by telling you the opposite from the observed in data), you’d also think “yes, yes, this is obvious” and come up with very natural explanations. His point is that common sense is very useful to explaining data observations, specially observations of social data. On the other hand, it is performs very poorly on predicting how the data will look like before actually seeing it.

Categories: theory Tags: ,

MHR, Regular Distributions and Myerson’s Lemma

May 30th, 2011 No comments

Monotone Hazard Rate (MHR) distributions and its superclass regular distributions keep appearing in the Mechanism Design literature and this is due to a very good reason: they are the class of distributions for which Myerson’s Optimal Auction is simple and natural. Let’s brief discuss some properties of those distributions. First, two definitions:

  1. Hazard rate of a distribution f : h(z) = \frac{f(z)}{1-F(z)}
  2. Myerson virtual value of a distribution f : \phi(z) = z - \frac{1-F(z)}{f(z)}

We can interpret the hazard rate in the following way: think of T \sim f as a random variable that indicates the time that a light bulb will take to extinguish. If we are in time t and the light bulb hasn’t extinguished so far, what is the probability it will extinguish in the next \delta time:

\mathbb{P}[T \leq t+\delta \vert T > t] \approx \frac{f(t) \delta}{1-F(t)}

We say that a distribution is monotone hazard rate, if h(z) is non-decreasing. This is very natural for light bulbs, for example. Many of the distributions that we are used to are MHR, for example, uniform, exponential and normal. The way that I like to think about MHR distributions is the following: if some distribution has hazard rate h(z), then it means that F'(z) = (1-F(z)) h(z). If we define G(z) = 1-F(z), then (log G(z))' = \frac{G'(z)}{G(z)} = -h(z), so:

F(z) = 1-\text{exp}(-\int_0^z h(u) du)

From this characterization, it is simple to see that the extremal distributions for this class, i.e. the distributions that are in the edge of being MHR and non-MHR are constant hazard rate, which correspond to the exponential distribution F(z) = 1-e^{-\lambda z} for z \in [0,\infty). They way I like to think about those distributions is that whenever you are able to prove something about the exponential distribution, then you can prove a similar statement about MHR distributions. Consider those three examples:

Example 1: \mathbb{P}[\phi(z) \geq 0] \geq \frac{1}{e} for MHR distributions. This fact is straightforward for the exponential distribution. For the exponential distribution \phi(z) = z-\lambda^{-1} and therefore

\mathbb{P}[\phi(z) \geq 0] \geq \mathbb{P}[z > \lambda^{-1}] = 1-F(\lambda^{-1}) = e^{-1}

but the proof for MHR is equally simple: Let r = \inf \{z; \phi(z) \geq 0\}, therefore r h(r) \leq 1.
P\{ \phi(v) \geq 0\} = P\{ v \geq r \} = 1 - F(r) = e^{-\int_0^r h(u) du} \geq e^{-r h(r)} \geq e^{-1}

Example 2: Given z_1, z_2 \sim f iid where f is MHR and v_1 = \max \{z_1, z_2\} and v_2 = \min \{z_1, z_2\}, then \mathbb{E}[v_2] \geq \frac{1}{3} \mathbb{E}[v_1]. The proof for the exponential distribution is trivial, and in fact, this is tight for the exponential, the trick is to use the convexity of z \mapsto \int_0^z h(u) du. We use that \int_0^{2z} h \geq 2 \int_0^z h in the following way:

\mathbb{E} [v_2] = \int_0^\infty (1 - F(z))^2 dz = \int_0^\infty e^{-2 \int_0^z h} dz

\geq \int_0^\infty e^{-\int_0^{2z} h} dz= \frac{1}{2} \int_0^\infty 1 - F(z) dz = \frac{1}{2} \mathbb{E} [z]

Since \mathbb{E} [v_1 + v_2] = \mathbb{E} [z_1 + z_2] = 2 \mathbb{E} [z], we have that \mathbb{E}[v_1] = 2 \mathbb{E}[z] - \mathbb{E}[v_2] \leq \frac{3}{2} \mathbb{E}[z]. This way, we get: \mathbb{E}[v_2] \geq \frac{1}{2}\mathbb{E}[z] \geq \frac{1}{2} \cdot \frac{2}{3} \mathbb{E}[v_1] = \frac{1}{3} \mathbb{E}[v_1]

Example 3: For MHR distributions, there is a simple lemma that relates the virtual value and the real value and this lemma is quite useful in various settings: let r = \inf \{z; \phi(z) > 0 \}, then for z \geq r, \phi(z) \geq z - r. Again, this is tight for exponential distribution. The proof is quite trivial:

x - \phi(x) = \frac{1-F(x)}{f(x)} \leq \frac{1-F(r)}{f(r)} = r

Now, MHR distributions are a subclass of regular distributions, which are the distributions for which Myerson’s virtual value \phi(z) is a monotone function. I usually find harder to think about regular distributions than to think about MHR (in fact, I don’t know so many examples that are regular, but not MHR. Here is one, though, called the equal-revenue-distribution. Consider z \in [1, \infty) distributed according to f(z) = 1/z^2. The cumulative distribution is given by F(z) = 1-1/z. The interesting thing of this distribution is that posted prices get the same revenue regardless of the price. For example, if we post any price r \in [1,\infty), then a customer with valuations z \sim f buys the item if z > r by price r, getting  revenue is r (1-F(r)) = 1. This can be expressed by the fact that \phi(z) = 0. I was a bit puzzled by this fact, because of Myerson’s Lemma:

Myerson Lemma: If a mechanism sells to some player that has valuation v \sim f with probability x(v) when he has value v, then the revenue is \mathbb{E} [x(v) \phi(v)].

And it seemed that the auctioneers was doomed to get zero revenue, since \phi(z) = 0. For example, suppose we fix some price r and we sell the item if v \geq r by price r. Then it seems that Myerson’s Lemma should go through by a derivation like that (for this special case, although the general proof is quite similar):

\mathbb{E} [x(v) \phi(v)] = \int_r^\infty \phi(z) f(z) dz = \int_r^\infty z f(z) - (1-F(z)) dz =

= \int_r^\infty [ z f(z) - \int_z^\infty f(u) du ] dz = \int_r^\infty z f(z) dz - \int_r^\infty \int_r^u f(u) dz du

= r (1-F(r))

but those don’t seem to match, since one side is zero and the other is 1. The mistake we did above is classic, which is to calculate \infty - \infty. We wrote:

\mathbb{E}[\phi(v)] = \int_r^\infty z f(z) dz - \int_r^\infty 1-F(z) dz

but both are infinity! This made me realize that Myerson’s Lemma needs the condition that \mathbb{E}[z] < \infty, which is quite a natural a distribution over valuations of a good. So, one of the bugs of the the equal-revenue-distribution is that \mathbb{E}[z] = \infty. A family that is close to this, but doesn’t suffer this bug is: f(z) = \frac{\alpha-1}{z^\alpha} for z \in [1,\infty), then F(z) = 1 - z^{1-\alpha}. For \alpha > 2 we have \mathbb{E}[v] < \infty, then we get \phi(z) = \frac{\alpha-2}{\alpha-1} z.

DP and the Erdős–Rényi model

May 16th, 2011 No comments

Yesterday I was in a pub with Vasilis Syrgkanis and Elisa Celis and we were discussing about how to calculate the expected size of a connected component in G(n,p), the ErdÅ‘s–Rényi model. G(n,p) is the classical random graph obtained by considering n nodes and adding each edge (i,j) independently with probability p. A lot is known about its properties, which very interestingly change qualitatively as the value of p changes relativeto n. For example, for p <\frac{1}{n} then there is no component greater than O(\log n) with high probability. When p = \frac{c}{n}, c>1 and n \rightarrow \infty, then the graph has a giant component. All those phenomena are very well studied in the context of probabilistic combinatorics and also in social networks. I remember learning about them in Jon Kleinberg’s Structure of Information Networks class.

So, coming back to our conversation, we were thinking on how to calculate the size of a connected component. Fix some node u in G(n,p) – it doesn’t matter which node, since all nodes are equivalent before we start tossing the random coins. Now, let C_u be the size of the connected component of node u. The question is how to calculate C(n,p) = \mathbb{E} [C_u].

Recently I’ve been learning MATLAB (actually, I am learning Octave, but it is the same) and I am very amazed by it and impressed about why I haven’t learned it before. It is a programming language that somehow knows exactly how mathematicians think and the syntax is very intuitive. All the operations that you think of performing when doing mathematics, they have implemented. Not that you can’t do that in C++ or Python, in fact, I’ve been doing that all my life, but in Octave, things are so simple. So, I thought this was a nice opportunity for playing a bit with it.

We can calculate C(n,p) using a dynamic programming algorithm in time O(n^2) – well, maybe we can do it more efficiently, but the DP I thought was the following: let’s calculate \mathcal{C}(n,s,p) where it is the expected size of the u-connected component of a random graph with n nodes where the edges between u and other nodes have probability p' = 1 - (1-p)^s and an edge between v_1 and v_2 have probability p. What we want to compute is C(n,p) = \mathcal{C}(n,1,p).

What we can do is to use the Principle of Deferred Decisions,  and toss the coins for the n-1 edges between u and the other nodes. With probability bin(k,n,p') = {n \choose k} (p')^k (1-‘p)^{n-k}, there are k edges between u and the other nodes, say nodes w_1, \hdots, w_k. If we collapse those nodes to u we end up with a graph of n-k nodes and the problem is equivalent to k plus the size of the connected component of u in the collapsed graph.

One difference, however is that the probability that the collapsed node u is connected to a node v of the n-1-k nodes is the probability that at least one of w_i is connected to v, which is 1-(1-p)^k. In this way, we can write:

\mathcal{C}(n,s,p) = 1 \cdot bin(0,n-1,p') + \sum_{k=1}^{n-1} bin(k,n-1,p') [ k +  \mathcal{C}(n-k,k,p)]

where p' = 1-(1-p)^s. Now, we can calculate C(n,p) by using DP, simply by filling an n \times n table. In Octave, we can do it this way:

function component = C(N,p)
  C_table = zeros(N,N);
  for n = 1:N for s =1:N
    C_table(n,s) = binopdf(0,n-1,1-((1-p)^s)) ;
    for k = 1:n-1
      C_table(n,s) += binopdf(k,n-1,1-((1-p)^s)) * (k + C_table(n-k,k));
  end end
  component = C_table(N,1);

And in fact we can call C(n,p) for say n = 200 and p = 0.01 .. 0.3 and see how C(n,p) varies. This allows us, for example, to observe the sharp transition that happens before the giant component is formed. The plot we get is:

Erdős–Rényi model

Categories: theory Tags: