Archive

Posts Tagged ‘mathematics’

Looking at probability distributions

November 12th, 2009 No comments

I’ve been taking two classes in probability this semester, and in them I saw the proofs of a lot of interesting theorems I knew about previously but had never seen proved, such as the Central Limit Theorem, the Laws of Large Numbers and so on. Also, some theory that looks somewhat ugly in undergrad courses becomes very clear with the proper formal treatment. Today I was thinking about what the main take-home message for a computer scientist could be and, at least for me, this message is the various ways of looking at probability distributions. I had heard about moments, the Laplace transform, the Fourier transform and other tools like that, but I never realized their true power before. Probably most of their true power is still hidden from me, but I am starting to look at them in a different way. Let me try to go over a few examples of different ways we can look at probability distributions and show cases where they are interesting.

Most ways of looking at probability distributions are associated with a multiplicative system: a multiplicative system {Q} is a set of real-valued functions with the property that if {f_1, f_2 \in Q} then {f_1 \cdot f_2 \in Q}. These kinds of sets are powerful because of the Multiplicative Systems Theorem:

Theorem 1 (Multiplicative Systems Theorem) If {Q} is a multiplicative system and {H} is a linear space containing {1} (the constant function {1}) and closed under bounded convergence, then {Q \subseteq H} implies that {H} contains all bounded {\sigma(Q)}-measurable functions.

The theorem might look a bit cryptic if you are not familiar with the definitions, but it boils down to the following translation:

Theorem 2 (Translation of the Multiplicative Systems Theorem) If {Q} is a “general” multiplicative system and {X} and {Y} are random variables such that {\mathop{\mathbb E}[f(X)] = \mathop{\mathbb E}[f(Y)]} for all {f \in Q}, then {X} and {Y} have the same distribution,

where “general” excludes some troublesome cases like {Q = \{1\}} or the set of all constant functions, for example. In technical terms, we want {\sigma(Q) = \sigma\{f^{-1}((-\infty,a]); a\in {\mathbb R}, f \in Q \}} to be the Borel {\sigma}-algebra. But let’s not worry about those technical details and just look at the translated version. We now discuss several kinds of multiplicative systems:

  1. The most common description of a random variable is by the cumulative distribution function {F(u) = P\{X \leq u\}}. This is associated with {Q = \{ 1_{(-\infty,u]}(x); u \in {\mathbb R} \}}; notice that simply {F(u) = \mathop{\mathbb E}[1_{(-\infty,u]}(X)]}.
  2. We can characterize a random variable by its moments: the variable {X} is characterized by the numbers {M_n = \int_{\mathbb R} x^n \mu_X(dx)}. Given the moments {M_1, M_2, \hdots}, the variable is (under suitable growth conditions on the moments) totally characterized, i.e., if two variables have the same moments, then they have the same distribution by the Multiplicative Systems Theorem. This description is associated with the system {Q = \{x^n; n = 1, 2, \hdots\}}.
  3. Moment Generating Function: If {X} is a variable that assumes only integer values, we can describe it by {p_0, p_1, p_2, \hdots}, where {p_n = P\{ X = n \}}. An interesting way of representing those probabilities is the moment generating function (also known as the probability generating function) {\psi_X(z) = \mathop{\mathbb E}[z^X] = \sum_{n=0}^\infty p_n z^n}. This is associated with the multiplicative system {Q = \{z^x; 0 \leq z < 1\}}. Now suppose we are given two discrete independent variables {X} and {Y}. What do we know about {X+Y}? It is easy to find its expectation, its variance, … but what about more complicated things? What is the distribution of {X+Y}? Moment generating functions answer this question very easily, since:

    \displaystyle \psi_{X+Y} (z) = \mathop{\mathbb E} [z^{X+Y}] = \mathop{\mathbb E} [z^{X}] \cdot \mathop{\mathbb E}[z^{Y}] = \psi_X(z) \cdot \psi_Y(z)

    If we know the moment generating function, we can calculate the expectation very easily, since {\mathop{\mathbb E}[X] = \psi'_X(1)}. For example, suppose we have a process like this: there is one bacterium at time {t = 0}. In each time step, each bacterium either dies (with probability {p_0}), stays alive without reproducing (with probability {p_1}), or has {k} offspring (with probability {p_{1+k}}); in that case {\sum_0^\infty p_n = 1}. The same happens at each time step, independently, for each bacterium alive at that moment. The question is: what is the expected number of bacteria at time {t}?

    It looks like a complicated problem with just elementary tools, but it is simple if we have moment generating functions. Just let {X_{ti}} be the variable associated with the {i^{th}} bacterium at time {t}: it is zero if the bacterium dies, {1} if it stays the same and {k+1} if it has {k} offspring. Let also {N_t} be the number of bacteria at time {t}. We want to know {\mathop{\mathbb E} N_t}. First, see that:

    \displaystyle N_t = \sum_{i=1}^{N_{t-1}} X_{ti}

    Now, let’s write that in terms of moment generating functions:

    \displaystyle \psi_{N_t}(z) = \mathop{\mathbb E}[z^{X_{t1} + \hdots + X_{tN_{t-1}}}] = \sum_{k=0}^\infty P\{N_{t-1} = k\} \cdot \mathop{\mathbb E}[z^{X_{t1} + \hdots + X_{tN_{t-1}}} \vert N_{t-1} = k]

    which is just:

    \displaystyle \psi_{N_t}(z) = \sum_{k=0}^\infty P\{N_{t-1} = k\} \cdot \mathop{\mathbb E}[z^{X_{t1} + \hdots + X_{tk}}] = \sum_{k=0}^\infty P\{N_{t-1} = k \} \cdot \psi_X (z)^k

    since the variables are all independent and identically distributed. Now, notice that:

    \displaystyle \psi_{N_{t-1}}(z) = \sum_{k=0}^\infty P\{N_{t-1} = k\} \cdot z^k

    by the definition of the moment generating function, so we have effectively proved that:

    \displaystyle \psi_{N_{t}}(z) = \psi_{N_{t-1}}(\psi_X(z)) = \psi_X \circ \psi_X \circ \hdots \circ \psi_X(z)

    We proved that {\psi_{N_{t}}(z)} is just {\psi_X(z)} iterated {t} times. Now, calculating the expectation is easy, using the facts that {\psi_X(1) = 1} and {\psi'_X(1) = \mathop{\mathbb E} X}. Just see that {\mathop{\mathbb E} N_t = \psi_{N_{t}}'(1) = \psi_{N_{t-1}}'(\psi_X(1)) \cdot \psi'_X(1) = \psi_{N_{t-1}}'(1) \cdot \mathop{\mathbb E} X }. Then, clearly, {\mathop{\mathbb E} N_t = (\mathop{\mathbb E} X)^t}. Using similar techniques we can prove a lot more about this process, just by analyzing the behavior of the moment generating function (see the numerical sketch after this list).

  4. Laplace Transform: Moving to continuous variables, if {X} is a continuous non-negative variable we can define its Laplace transform as {\phi_X(u) = \mathop{\mathbb E} [e^{-u X}] = \int_0^\infty e^{-ux} \mu_X(dx)}, where {\mu_X(dx)} stands for the distribution of {X}, for example {\rho_X(x) dx} when a density exists. This is associated with the multiplicative system {Q = \{e^{-ux}; u \geq 0\}}. Again, by the Multiplicative Systems Theorem, if {\phi_X = \phi_Y}, then the two variables have the same distribution. The Laplace transform has the same nice properties as the moment generating function; for independent {X} and {Y}, for example, {\phi_{X+Y}(u) = \mathop{\mathbb E}[e^{-(X+Y)u}] = \mathop{\mathbb E}[e^{-Xu}] \cdot \mathop{\mathbb E}[e^{-Yu}] = \phi_X(u) \cdot \phi_Y(u)}. And it allows us to do tricks similar to the one I just showed for moment generating functions. One common trick, used for example in the proof of Chernoff bounds, is that for independent non-negative random variables:

    \displaystyle P\left\{\sum_i X_i > u\right\} = P\left\{e^{\sum_i X_i} > e^u\right\} \leq \frac{\mathop{\mathbb E}[e^{\sum_i X_i} ]}{e^u} = \frac{\prod_i \mathop{\mathbb E}[e^{X_i} ]}{e^u}

    where we also used Markov’s inequality for non-negative variables: {\mathop{\mathbb E}[X] \geq \beta P\{X \geq \beta\}}. Passing to the Laplace transform is the main ingredient of the Chernoff bound, and it allows us to sort of “decouple” the random variables in the sum. There are several other cases where the Laplace transform proves itself very useful and turns things that looked very complicated in undergrad courses into simple and clear ones. One clear example is the motivation for the Poisson random variable:

    If {T_i} are independent exponentially distributed random variables with rate {\lambda} (so with mean {1/\lambda}), then {\rho_{T_i}(t) = \lambda e^{- \lambda t}}. An elementary calculation shows that the Laplace transform is {\phi_{T_i}(u) = \frac{\lambda}{\lambda+u}}. Let {S_n = T_0 + T_1 + T_2 + \hdots + T_n}, i.e., the time of the {(n+1)^{th}} arrival. We want to know the distribution of {S_n}. How do we find it?

    \displaystyle \phi_{S_n}(u) = \mathop{\mathbb E}[e^{-u(T_0 + \hdots + T_n)}] = \phi_T(u)^{n+1} = \left( \frac{\lambda}{\lambda+u} \right)^{n+1}

    Now, we need to find {\rho_{S_n}} such that {\int_0^\infty \rho_{S_n}(t) e^{-ut} dt = \left( \frac{\lambda}{\lambda+u} \right)^{n+1}}. It is just a matter of solving this equation, and we get {\rho_{S_n}(t) = \frac{\lambda (\lambda t)^n}{n!} e^{-\lambda t}}. Now, the Poisson variable {N_t} measures the number of arrivals in {[0,t]} and therefore:

    \displaystyle \begin{aligned} P\{N_t = n\} & = P\{S_{n-1} \leq t < S_n\} = P\{S_n > t\} - P\{S_{n-1} > t\} \\ & = \int_t^\infty \rho_{S_n}(s) ds - \int_t^\infty \rho_{S_{n-1}}(s) ds = \frac{(\lambda t)^n}{n!} e^{-\lambda t} \end{aligned}

  5. Characteristic Function or Fourier Transform: Taking {Q = \{e^{i\lambda x}; \lambda \in \mathbb{R}\}} we get the Fourier transform: {f_X(\lambda) = \mathop{\mathbb E} [e^{i \lambda X}]}, which has some of the nice properties of the previous transforms and some additional ones. Characteristic functions were the main actors in the development of all the probability techniques that led to the main result of 19th century probability theory: the Central Limit Theorem. We know that moment generating functions and Laplace transforms completely characterize distributions, but it is not clear how to recover a distribution once we have its transform. For the Fourier transform there is a clear and simple way of doing that by means of the Inversion Formula:

    \displaystyle \rho_X(x) = \frac{1}{2 \pi} \int_{\mathbb R} e^{-i \lambda x} f_X(\lambda) d\lambda

    One fact that always puzzled me was: why is the normal distribution {\rho_N (x) = \frac{1}{\sqrt{2 \pi}} e^{-x^2/2} } so important? What is so special about it that makes it the limiting distribution in the Central Limit Theorem, i.e., if {X_1, X_2, \hdots, X_n} is a sequence of independent random variables and {S_n = \sum_1^n X_i - \sum_1^n \mathop{\mathbb E} X_i}, then {S_n / \sqrt{var S_n} \rightarrow N(0,1)} under some natural conditions on the variables? The reason the normal is so special is that it is a “fixed point” of the Fourier transform: we can check that {f_N(\lambda) = e^{-\lambda^2/2}}. And that is something special about it that makes me believe the Central Limit Theorem.
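
As a sanity check of the branching-process computation in item 3, here is a small Python sketch, with an arbitrary offspring distribution just for illustration, that iterates the generating function numerically and compares {\psi'_{N_t}(1)} with {(\mathop{\mathbb E} X)^t}:

```python
# Branching process via generating functions: psi_{N_t} is psi_X iterated t times,
# so E[N_t] = psi'_{N_t}(1) should equal (E X)^t.
# The offspring distribution below is an arbitrary example.

p = [0.2, 0.3, 0.3, 0.2]   # p_n = P{X = n}: dies, survives, one offspring, two offspring
EX = sum(n * pn for n, pn in enumerate(p))

def psi_X(z):
    """Generating function psi_X(z) = E[z^X] = sum_n p_n z^n."""
    return sum(pn * z ** n for n, pn in enumerate(p))

def psi_N(t, z):
    """psi_{N_t}(z): psi_X composed with itself t times."""
    for _ in range(t):
        z = psi_X(z)
    return z

def derivative_at_one(f, h=1e-6):
    """One-sided numerical derivative of f at z = 1 (from the left)."""
    return (f(1.0) - f(1.0 - h)) / h

for t in range(1, 6):
    print(t, derivative_at_one(lambda z, t=t: psi_N(t, z)), EX ** t)
    # the last two columns should agree up to numerical error
```

The same iteration also gives, for example, {P\{N_t = 0\} = \psi_{N_t}(0)}, the probability that the population is extinct by time {t}.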

————————-

This blog post was based on lectures by Professor Dynkin at Cornell.

Categories: theory Tags: ,

Entropy

August 27th, 2009 No comments

Today was the first day of classes here at Cornell and, as usual, I attended a lot of different classes to try to decide which ones to take. I usually feel like taking them all, but there is this constant struggle: if I take too many classes I have no time to do research and to read the random things that happen to catch my attention at that moment, and if I don’t take many classes I feel like I am not learning a lot of the interesting stuff I wanted to be learning. The middle-of-the-road solution is to audit a lot of classes and start dropping them as I need more time, which usually happens quickly. This particular fall I decided that I need to build a stronger background in probability, since I keep running into probabilistic material and I have nothing more than my undergrad course and things I learned on demand. I attended at least three probability classes with different flavours today, and I decided to blog about a simple, yet very impressive, result I saw in one of them.

Ever since I took a class on “Principles of Telecommunications” in my undergrad, I have been impressed by Shannon’s Information Theory and the concept of entropy. There was one theorem that I always heard about but never saw proved. I thought the proof was somewhat complicated, but it turned out not to be that bad.

Consider an alphabet {\Omega} and a probability distribution over it. I want to associate to each {\omega \in \Omega} a string {c(\omega)} of {k(\omega)} binary digits to represent each symbol of the alphabet. One way of making the code decodable is to make it a proper code: a proper code is a code such that for any {\omega_1} and {\omega_2}, {c(\omega_1)} is not a prefix of {c(\omega_2)}. There are several codes like this, but some are more efficient than others. Since the letters have different frequencies, it makes sense to code a frequent letter (say ‘e’ in English) with few bits and a letter that doesn’t appear much, say ‘q’, with more bits. We want to find a proper code minimizing:

\displaystyle \mathop{\mathbb E}[k(\omega)] = \sum_{\omega \in \Omega} k(\omega) p(\omega)

The celebrated theorem by Shannon shows that for any proper code (actually, it holds more generally for any uniquely decodable code) we have {\mathop{\mathbb E}[k(\omega)] \geq H}, where {H} is the entropy of the alphabet, defined as:

\displaystyle H = - \sum_{\omega} p(\omega) \log_2 p(\omega)
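
For instance, for a uniform distribution over {2^m} symbols, {H = -\sum_\omega 2^{-m} \log_2 2^{-m} = m}, matching the obvious fixed-length code that spends {m} bits per symbol.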

Even more impressive is that we can achieve something very close to it:

Theorem 1 There is a code such that {\mathop{\mathbb E}[k(\omega)] \leq H + 1}.

With an additional trick we can get {H + \epsilon} for any {\epsilon > 0}. The first part (the lower bound) is trickier and I won’t prove it here (but again, it is not as hard as I thought it would be). To prove that there is a code with average length {\leq H + 1} we use the following lemma:

Lemma 2 There is a proper code for {\Omega} with code lengths {k(\omega)} if and only if {\sum_\omega 2^{-k(\omega)} \leq 1}.

Proof: Let {N = \max_\omega k(\omega)} and imagine all the possible codewords of length {\leq N} arranged as a complete binary tree. Since the code is proper, no two codewords {c(\omega_1)} and {c(\omega_2)} lie on the same path to the root. So picking one node as a codeword means that we can’t pick any node in the subtree below it. Also, for each leaf there is at most one codeword on its path to the root, so we can assign each leaf of the tree to a single codeword or to no codeword at all. It is easy to see that a codeword of length {k(\omega)} has {2^{N - k(\omega)}} leaves associated with it. Since there are {2^N} leaves in total, we have that:

\displaystyle \sum_\omega 2^{N-k(\omega)} \leq 2^N

which proves one direction of the result. To prove the converse direction, we can use a greedy algorithm: given {\Omega} and {k(\omega)} such that {\sum_\omega 2^{-k(\omega)} \leq 1}, let {N = \max_\omega k(\omega)} and suppose {k(\omega_1) \leq k(\omega_2) \leq k(\omega_3) \leq \hdots}. Start with the {2^N} leaves in a single block. Divide them into {2^{k(\omega_1)}} blocks and assign one to {\omega_1}. Now we define the recursive step: when we analyze {\omega_j}, the leaves are divided into {2^{k(\omega_{j-1})}} blocks, some occupied, some not. Divide each free block into {2^{k(\omega_j) - k(\omega_{j-1})}} blocks and assign one of them to {\omega_j}. It is not hard to see that each block corresponds to one node in the tree (the common ancestor of all the leaves in that block) and that this gives a proper code. \Box
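
To make the greedy construction concrete, here is a small Python sketch, with example lengths chosen to satisfy the condition of the lemma; it assigns codewords in order of increasing length, which is just an arithmetic version of the block-splitting argument above:

```python
# Constructive direction of Lemma 2: given lengths k(w) with sum 2^{-k(w)} <= 1,
# assign codewords greedily, shortest lengths first, so that no codeword is a
# prefix of another. "next_word" plays the role of the first free block.

def prefix_code(lengths):
    """Map each symbol to a bitstring of the requested length, prefix-free."""
    assert sum(2.0 ** -k for k in lengths.values()) <= 1.0 + 1e-12
    code, next_word, prev_len = {}, 0, 0
    for sym, k in sorted(lengths.items(), key=lambda item: item[1]):
        next_word <<= (k - prev_len)          # split the remaining blocks
        code[sym] = format(next_word, "0{}b".format(k))
        next_word += 1                        # skip past the block just used
        prev_len = k
    return code

lengths = {"a": 1, "b": 2, "c": 3, "d": 3}    # Kraft sum = 1/2 + 1/4 + 1/8 + 1/8 = 1
code = prefix_code(lengths)
print(code)                                    # e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'}

# Check the proper-code (prefix-free) property explicitly:
words = list(code.values())
assert not any(u != v and v.startswith(u) for u in words for v in words)
```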

Now, using this, we show how to find a code with {\mathop{\mathbb E}[k(\omega)] \leq H + 1}. For each {\omega}, since {p(\omega) \in (0,1]}, we can always find {k(\omega)} such that {\frac{1}{2} p(\omega) \leq 2^{-k(\omega)} \leq p(\omega)}, namely {k(\omega) = \lceil -\log_2 p(\omega) \rceil}. Now, clearly:

\displaystyle \sum_\omega 2^{-k(\omega)} \leq \sum_\omega p(\omega) = 1

and:

\displaystyle \mathop{\mathbb E}[k(\omega)] = \sum_\omega k(\omega) p(\omega) \leq \sum_\omega [1 - \log_2 p(\omega)] p(\omega) = H + 1
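
A quick numerical sanity check of this bound in Python, with {k(\omega) = \lceil -\log_2 p(\omega) \rceil} as above and a random distribution just for illustration:

```python
import math
import random

# Shannon-style lengths k(w) = ceil(-log2 p(w)) satisfy (1/2) p(w) <= 2^{-k(w)} <= p(w),
# hence Kraft's inequality holds and the average length sits between H and H + 1.

weights = [random.random() for _ in range(8)]
p = [w / sum(weights) for w in weights]        # a random distribution on 8 symbols

k = [math.ceil(-math.log2(q)) for q in p]

H = -sum(q * math.log2(q) for q in p)
avg_len = sum(q * ki for q, ki in zip(p, k))
kraft = sum(2.0 ** -ki for ki in k)

print("Kraft sum :", kraft)     # <= 1, so a proper code with these lengths exists (Lemma 2)
print("entropy H :", H)
print("E[k]      :", avg_len)   # should satisfy H <= E[k] <= H + 1
assert kraft <= 1 + 1e-9 and H - 1e-9 <= avg_len <= H + 1 + 1e-9
```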

Cool, but how do we bring this down to {H + \epsilon}? The idea is to code multiple symbols at the same time (note that the symbols here are independent, so we are not even taking advantage of correlations between them). Consider {\Omega^k} and the probability function induced on it, i.e.:

\displaystyle p_k (\omega_1, \hdots, \omega_k) = \prod_{i=1}^k p(\omega_i)

It is not hard to see that {\Omega^k} with {p_k} has entropy {kH} because:

\displaystyle \begin{aligned} - \sum_{\omega_1, \hdots, \omega_k} p_k(\omega_1, \hdots, \omega_k) \log_2 p_k(\omega_1, \hdots, \omega_k) & = - \sum_{\omega_1, \hdots, \omega_k} \left( \prod_i p(\omega_i) \right) \sum_i \log_2 p(\omega_i) \\ & = - \sum_i \sum_\omega p(\omega) \log_2 p(\omega) = kH \end{aligned}

and then we can just apply the theorem above to it: we can find a code that encodes {k} symbols {\omega = (\omega_1, \hdots, \omega_k)} with {l(\omega)} bits such that:

{kH \leq \mathop{\mathbb E}[l(\omega)] \leq kH + 1}. Since {l(\omega)} encodes {k} symbols, we are actually interested in {\mathop{\mathbb E}[l(\omega)/k]}, and therefore we get:

\displaystyle H \leq \mathop{\mathbb E}\left[\frac{l(\omega)}{k}\right] \leq H + \frac{1}{k}
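
Finally, a small sketch of the block-coding trick, with an arbitrary three-letter distribution just for illustration: coding blocks of {k} i.i.d. symbols with lengths {\lceil -\log_2 p_k \rceil} drives the per-symbol rate from {H+1} down towards {H}.

```python
import math
from itertools import product

# Per-symbol rate of Shannon lengths on the product alphabet Omega^k:
# it is squeezed between H and H + 1/k, so it approaches the entropy as k grows.

p = {"a": 0.5, "b": 0.3, "c": 0.2}
H = -sum(q * math.log2(q) for q in p.values())

for k in range(1, 6):
    rate = 0.0
    for block in product(p, repeat=k):
        q = math.prod(p[s] for s in block)      # induced product probability p_k
        rate += q * math.ceil(-math.log2(q))    # length assigned to this block
    print(k, rate / k)                          # per-symbol rate, between H and H + 1/k

print("H =", H)
```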

Programming and Storytelling

August 21st, 2009 No comments

Recently I was looking at Papadimitriou’s website for something else and found this great article called “MythematiCS: In Praise of Storytelling in the Teaching of Computer Science and Math”. He begins by pointing out that in early times knowledge was transferred mostly by storytelling, and that there is much more room for storytelling in contemporary technical teaching than most people realize. He makes several interesting points: one of them is that we can think of writing a computer program as telling a story. For example, the variables are the characters: they have characteristics (data types) and the whole story (program) is about playing around with them. Sometimes they have multiple faces and behaviors depending on the circumstance (polymorphism). Iteration and recursion are common literary devices, used for example in fairy tales: “on the first day this happens, on the second day that happens, then…” or “he will be able to do that only if he first does this…”. He mentions one of my favourite books, “If on a Winter’s Night a Traveler”, as a great example of recursion. This made me think that maybe Italo Calvino is my favourite author because his stories are so beautifully constructed in an almost mathematical fashion, like an Escher painting or a Bach piece. Escher and Bach went very far in showing the beauty of math and showing that it really is an art. For example, this beautiful representation of the hyperbolic plane:

[Image: Escher, Circle Limit III]

Back to programming, there are still a lot of interesting connections: several novels are multi-threaded, in the sense that we look at the story from the perspectives of multiple characters. Stories also need to “compile and run”, which in this case means: make sense and be accepted by people. I was also thinking that there are a lot of books which everyone knows about but very few people have actually read (Ulysses, for example). Are those the NP-complete problems?

Back to Papadimitriou’s article, he talks about a few interesting books that do a good job of mixing math and stories. One that he mentions I read a long time ago, back when I was in high school, and it did a great job of further stimulating my interest in math: The Parrot’s Theorem. Recently I also read another book he mentions: Surreal Numbers, by Don Knuth. Although I am a great fan of almost everything Knuth writes, this book didn’t grab me much. I think it may be because I am not the right audience: if I had read it a couple of years earlier I might have enjoyed it much more.

When I was in Greece last year, I came across a very interesting comic book: Logicomix. It was in Greek, but just by looking through it I figured out it was something about math and it seemed pretty cool. Later I found out it was written by Papadimitriou and Doxiadis, which made me even more curious to read it. Now I am waiting for the English translation. One last pointer: Doxiadis has a webpage with some interesting essays about the relations between mathematical proofs, computer programming and storytelling.
