Archive

Archive for August, 2009

Entropy

August 27th, 2009 No comments

Today was the first day of classes here at Cornell and, as usual, I attended a lot of different classes to try to decide which ones to take. I usually feel like taking them all, but there is a constant struggle: if I take too many classes I have no time to do research or to read the random things that happen to catch my attention, and if I take too few I feel like I am missing a lot of interesting material I wanted to be learning. The middle-ground solution is to audit a lot of classes and start dropping them as I need more time, which usually happens quickly. This particular fall I decided I need to build a stronger background in probability – since I keep finding probabilistic material in my way and I have nothing more than my undergrad course and things I learned on demand. I attended at least three probability classes with different flavours today, and I decided to blog about a simple, yet very impressive, result I saw in one of them.

Ever since I took a class on “Principles of Telecommunications” in my undergrad, I have been impressed by Shannon’s Information Theory and the concept of entropy. There was one theorem I always heard about but never saw proved. I assumed the proof was complicated, but it turned out not to be.

Consider an alphabet {\Omega} and a probability distribution over it. I want to associate to each {\omega \in \Omega} a string {c(\omega)} of {k(\omega)} {\{0,1\}}-digits to represent each symbol of the alphabet. One way of guaranteeing that the code is decodable is to make it a proper code. A proper (prefix-free) code is a code such that for any distinct {\omega_1} and {\omega_2}, {c(\omega_1)} is not a prefix of {c(\omega_2)}. There are several codes like this, but some are more efficient than others. Since the letters have different frequencies, it makes sense to code a frequent letter (say ‘e’ in English) with few bits and a letter that doesn’t appear much, say ‘q’, with more bits. We want to find a proper code minimizing:

\displaystyle \mathop{\mathbb E}[k(\omega)] = \sum_{\omega \in \Omega} k(\omega) p(\omega)

The celebrated theorem by Shannon shows that for any proper code (in fact, more generally, for any uniquely decodable code), we have {\mathop{\mathbb E}[k(\omega)] \geq H}, where {H} is the entropy of the alphabet, defined as:

\displaystyle H = - \sum_{\omega} p(\omega) \log_2 p(\omega)

Even more impressive is that we can achieve something very close to it:

Theorem 1 There is a code such that {\mathop{\mathbb E}[k(\omega)] \leq H + 1}.

With an additional trick we can get {H + \epsilon} for any {\epsilon > 0}. The lower bound is trickier and I won’t prove it here (but again, it is not as hard as I thought it would be). To prove that there is a code with average length {\leq H + 1} we use the following lemma:

Lemma 2 There is a proper code for {\Omega} with code-lengths {k(\omega)} if and only if {\sum_\omega 2^{-k(\omega)} \leq 1}

Proof: Let {N = \max_\omega k(\omega)} and imagine all possible codewords of length {\leq N} as a complete binary tree. Since the code is proper, no two codewords {c(\omega_1)} and {c(\omega_2)} lie on the same path to the root. So picking a node as a codeword means we can’t pick any node in the subtree below it. Also, for each leaf, there is at most one codeword on its path to the root. Therefore we can assign each leaf of the tree to a single codeword or to no codeword at all. It is easy to see that a codeword of length {k(\omega)} has {2^{N - k(\omega)}} leaves associated with it. Since there are {2^N} leaves in total, we have:

\displaystyle \sum_\omega 2^{N-k(\omega)} \leq 2^N

which proves one direction of the result. For the converse direction, we use a greedy algorithm: given {\Omega} and {k(\omega)} such that {\sum_\omega 2^{-k(\omega)} \leq 1}, let {N = \max_\omega k(\omega)} and suppose {k(\omega_1) \leq k(\omega_2) \leq k(\omega_3) \leq \hdots}. Start with all {2^N} leaves in a single block. Divide them into {2^{k(\omega_1)}} blocks and assign one to {\omega_1}. Now the recursive step: when we analyze {\omega_j}, the leaves are divided into {2^{k(\omega_{j-1})}} blocks, some occupied, some not. Divide each free block into {2^{k(\omega_j) - k(\omega_{j-1})}} blocks and assign one of them to {\omega_j}; the Kraft inequality guarantees a free block is always available. It is not hard to see that each block corresponds to one node in the tree (the common ancestor of all the leaves in that block) and that this gives a proper code. \Box
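The greedy construction in this proof can be sketched in a few lines of code. This is my own sketch (the function name and input format are my choices, not from the lecture): sort symbols by code length and hand out codewords left to right, tracking the next free codeword as an integer counter.

```python
# A sketch of the greedy construction from the lemma's proof: assuming the
# Kraft inequality holds, assign codewords in order of increasing length.

def prefix_code(lengths):
    """Build a proper (prefix-free) code from lengths satisfying Kraft."""
    assert sum(2.0 ** -k for k in lengths.values()) <= 1.0 + 1e-12
    code = {}
    next_free = 0   # next unassigned codeword, viewed as an integer
    prev_len = 0
    for w, k in sorted(lengths.items(), key=lambda item: item[1]):
        next_free <<= (k - prev_len)          # split free blocks to length k
        code[w] = format(next_free, "0{}b".format(k))
        next_free += 1                        # this subtree is now occupied
        prev_len = k
    return code

print(prefix_code({"a": 1, "b": 2, "c": 3, "d": 3}))
# {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```

Shifting `next_free` when moving to a longer length is exactly the “divide each free block” step of the proof: each integer at length {k} names one block of leaves.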

Now, using this, we show how to find a code with {\mathop{\mathbb E}[k(\omega)] \leq H + 1}. For each {\omega}, since {p(\omega) \in (0,1]} we can always find {k(\omega)} such that {\frac{1}{2} p(\omega) \leq 2^{-k(\omega)} \leq p(\omega)}, namely {k(\omega) = \lceil - \log_2 p(\omega) \rceil}. Now, clearly:

\displaystyle \sum_\omega 2^{-k(\omega)} \leq \sum_\omega p(\omega) = 1

and:

\displaystyle \mathop{\mathbb E}[k(\omega)] = \sum_\omega k(\omega) p(\omega) \leq \sum_\omega [1 - \log_2 p(\omega)] p(\omega) = H + 1
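Here is a quick numerical check of this construction (a sketch of my own, with a made-up three-letter distribution): take {k(\omega) = \lceil -\log_2 p(\omega) \rceil}, verify the Kraft inequality, and compare the average length with the entropy.

```python
import math

def entropy(p):
    return -sum(q * math.log2(q) for q in p.values())

# Shannon's choice: the smallest k with 2^-k <= p, i.e. k = ceil(-log2 p),
# which guarantees (1/2) p(w) <= 2^-k(w) <= p(w).
def shannon_lengths(p):
    return {w: math.ceil(-math.log2(q)) for w, q in p.items()}

p = {"e": 0.5, "t": 0.3, "q": 0.2}
k = shannon_lengths(p)
assert sum(2.0 ** -kw for kw in k.values()) <= 1.0   # Kraft: a code exists
avg = sum(k[w] * p[w] for w in p)
print(entropy(p), avg)
assert entropy(p) <= avg <= entropy(p) + 1           # H <= E[k] <= H + 1
```

For dyadic probabilities (all powers of 2) the lengths are exact and the average length equals {H}.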

Cool, but how do we bring it down to {H + \epsilon}? The idea is to code multiple symbols at a time (note the symbols here are independent; we are not even taking advantage of correlation between them). Consider {\Omega^k} with the induced product distribution, i.e.:

\displaystyle p_k (\omega_1, \hdots, \omega_k) = \prod_{i=1}^k p(\omega_i)

It is not hard to see that {\Omega^k} with {p_k} has entropy {kH}, because:

\displaystyle \begin{aligned} - \sum_{\omega_1, \hdots, \omega_k} p_k(\omega_1, \hdots, \omega_k) \log_2 p_k(\omega_1, \hdots, \omega_k) & = - \sum_{\omega_1, \hdots, \omega_k} \prod_i p(\omega_i) \sum_i \log_2 p(\omega_i) \\ & = - \sum_i \sum_\omega p(\omega) \log_2 p(\omega) = kH \end{aligned}

and then we can just apply the last theorem to it: we can find a code mapping each block of {k} symbols {\omega = (\omega_1, \hdots, \omega_k)} to {l(\omega)} bits such that:

\displaystyle kH \leq \mathop{\mathbb E}[l(\omega)] \leq kH + 1

Since {l(\omega)} encodes {k} symbols, the quantity we actually care about is the per-symbol length {\mathop{\mathbb E}[l(\omega)/k]}, and therefore we get:

\displaystyle H \leq \mathop{\mathbb E}\left[\frac{l(\omega)}{k}\right] \leq H + \frac{1}{k}
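A quick numerical illustration of the sandwich above (my own sketch, with a made-up two-letter alphabet): coding blocks of {k} symbols with lengths {\lceil -\log_2 p \rceil} drives the per-symbol rate down toward {H}.

```python
import math
from itertools import product

def entropy(p):
    return -sum(q * math.log2(q) for q in p.values())

def block_rate(p, k):
    """Per-symbol expected length of a Shannon code on Omega^k."""
    total = 0.0
    for block in product(p, repeat=k):
        q = math.prod(p[w] for w in block)   # the product distribution p_k
        total += q * math.ceil(-math.log2(q))
    return total / k

p = {"a": 0.9, "b": 0.1}
H = entropy(p)
for k in (1, 2, 4, 8):
    r = block_rate(p, k)
    print(k, r)
    assert H - 1e-9 <= r <= H + 1.0 / k + 1e-9   # H <= rate <= H + 1/k
```

With {p = (0.9, 0.1)} the single-symbol code is wasteful (rate 1.3 bits against {H \approx 0.47}), and the rate shrinks as {k} grows, as the theorem promises.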

Programming and Storytelling

August 21st, 2009 No comments

Recently I was looking at Papadimitriou’s website for something else and found this great article: “MythematiCS: In Praise of Storytelling in the Teaching of Computer Science and Math”. He begins by pointing out that in early times knowledge was transferred mostly through storytelling – and that there is much more room for storytelling in contemporary technical teaching than most people realize. He has several interesting points: one of them is that we can think of writing a computer program as telling a story. For example, the variables are the characters: they have characteristics (data types) and the whole story (program) is about playing around with them. Sometimes they have multiple faces and behaviors depending on the circumstances (polymorphism). Iteration and recursion are common literary tools, used for example in fairy tales: “on the first day this happens, on the second day that happens, then…” or “he might be able to do that only if he first does this…”. He mentions one of my favourite books, “If on a Winter’s Night a Traveler”, as a great example of recursion. This made me think that maybe Italo Calvino is my favourite author because his stories are so beautifully constructed, in an almost mathematical fashion – like an Escher painting or Bach’s music. Escher and Bach went very far in showing the beauty of math, showing that it really is an art. For example, this beautiful representation of the hyperbolic plane:

[Image: Escher’s Circle Limit III]

Back to programming, there are still a lot of interesting parallels: several novels are multi-threaded, where we see the story from the perspectives of multiple characters. Stories also need to “compile and run”, which here means: make sense and be accepted by people. I was also thinking that there are a lot of books which everyone knows about but very few people have actually read (Ulysses, for example). Are those the NP-complete problems?

Back to Papadimitriou’s article: he mentions a few interesting books that do a good job of mixing math and stories. One of them I read a long time ago, back in high school, and it did a great job of further stimulating my interest in math: The Parrot’s Theorem. Recently I also read another book he mentions: Surreal Numbers, by Don Knuth. Although I am a great fan of almost everything Knuth writes, this book didn’t grab me much. I think it may be because I am not the right audience; if I had read it a couple of years earlier I might have enjoyed it much more.

When I was in Greece last year, I came across a very interesting comic book: Logicomix. It was in Greek, but just by looking through it I figured out it was something about math and it seemed pretty cool. Later I found out it was written by Papadimitriou and Doxiadis, which made me even more curious to read it. Now I am waiting for the English translation. One last pointer: Doxiadis has a webpage with some interesting essays about the relations between mathematical proofs, computer programming and storytelling.


Duality Theorem for Semidefinite Programming

August 9th, 2009 No comments

I’ve been reading Luca Trevisan’s blog posts about relaxations of the Sparsest Cut problem. He discusses three relaxations: (i) the spectral relaxation; (ii) Leighton-Rao; and (iii) Arora-Rao-Vazirani. (i) and (iii) are semidefinite programs (SDPs), an incredibly beautiful and useful generalization of linear programs. This technique is at the core of the famous MAX-CUT algorithm of Goemans and Williamson and has also been applied to design approximation algorithms for a large variety of combinatorial problems. As with linear programming, there is a Duality Theorem that allows us to give certificates for lower and upper bounds. I was curious to learn more about it and my friend Ashwin suggested these great lecture notes by László Lovász on semidefinite programming. I was surprised that the proof of the duality theorem for SDP is almost the same as for LP, with just a few steps changed. We begin by using the convex separation theorem to derive a semidefinite version of Farkas’ Lemma, and then use this version to prove the Duality Theorem much as is normally done for LP. Let’s go through the proof, but first, let’s define positive semidefinite matrices:

Definition 1 A symmetric matrix {A} is said to be positive semidefinite (denoted by {A \succeq 0}) if for all {x \in {\mathbb R}^n} we have {x^t A x \geq 0}.

This is the same as saying that all eigenvalues of {A} are non-negative, since the smallest eigenvalue of {A} is given by {\min_{\Vert x \Vert = 1} x^t A x}. We write {A \succ 0} when all eigenvalues of {A} are strictly positive, which is the same as {x^t A x > 0} for all {x \neq 0}. Also, given two matrices {A} and {B} we use the notation {A \bullet B = \sum_{ij} A_{ij} B_{ij}}, which is nothing more than the dot product when we view those matrices as vectors in {{\mathbb R}^{n^2}}. With this notation:
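These definitions are easy to check numerically; a small numpy sketch of my own, with illustrative matrices (the helper names are my choices):

```python
import numpy as np

def is_psd(A, tol=1e-9):
    """A is PSD iff the symmetric matrix A has no eigenvalue below -tol."""
    return bool(np.all(np.linalg.eigvalsh(A) >= -tol))

def bullet(A, B):
    """A . B = sum_ij A_ij B_ij: the dot product of A, B seen as vectors."""
    return float(np.sum(A * B))

A = np.array([[2.0, -1.0], [-1.0, 2.0]])   # eigenvalues 1 and 3: PSD
B = np.array([[1.0, 2.0], [2.0, 1.0]])     # eigenvalues -1 and 3: not PSD
assert is_psd(A) and not is_psd(B)
# For symmetric matrices, A . B coincides with tr(AB):
assert abs(bullet(A, B) - np.trace(A @ B)) < 1e-9
```

The last assertion anticipates Theorem 6 below: for symmetric matrices the {\bullet} product is exactly the trace of the matrix product.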

Definition 2 A semi-definite program (SDP) is an optimization problem where the variable is a matrix {X} and has the form:


\displaystyle  \left. \begin{aligned} & \min C \bullet X \text{ s.t. } \\ & \qquad \left. \begin{aligned} & D_i \bullet X = d_i & \forall i \\ & X \succeq 0 \\ \end{aligned} \right. \end{aligned} \right. \ \ \ \ \ (1)

that is, it is a linear program with the additional restriction that the variable, viewed as a matrix, must be positive semidefinite.

The key fact we will use is that the set {\mathcal{P} \subseteq {\mathbb R}^{n^2}} of positive semidefinite matrices is a convex cone, i.e., if {A, B \in \mathcal{P}} then {c_1 A + c_2 B \in \mathcal{P}} for any {c_1, c_2 \geq 0}; in particular it is a convex set. Now we use the following theorem to prove Farkas’ Lemma:

Theorem 3 (convex separation) Given two disjoint convex sets {C_1, C_2 \subseteq {\mathbb R}^n}, there are {p, v \in {\mathbb R}^n}, {v \neq 0}, such that {v^t (x - p) \leq 0} for all {x \in C_1} and {v^t (x - p) \geq 0} for all {x \in C_2}. Moreover, if one of them is a cone, this holds with {p = 0}.

Theorem 4 (Semidefinite Farkas’ Lemma) Exactly one of the following problems has a solution:

  1. {x_1 A_1 + \hdots + x_k A_k \succ 0}, {x_i \in {\mathbb R}}
  2. {A_1 \bullet Y = 0}, …, {A_k \bullet Y = 0}, {Y \succeq 0}, {Y \neq 0}

Proof: If problem {1} doesn’t have a solution, then the open cone of positive definite matrices is disjoint from the linear subspace {\{ \sum_i x_i A_i ; x_i \in {\mathbb R} \}}, and therefore they can be separated. This means there is a nonzero {Y} ({Y} plays the role of {v} in the convex separation theorem) such that {\left( \sum_i x_i A_i \right) \bullet Y \leq 0} for all {x_1, \hdots, x_k \in {\mathbb R}} and {Y \bullet M \geq 0} for all {M \in \mathcal{P}}. Taking {x_i = 1} with all others {0}, and then {x_i = -1} with all others {0}, we get {A_i \bullet Y = 0}.

It remains to prove that {Y \succeq 0}. Just take {M = x x^t} for {x \in {\mathbb R}^n} and then we have that {Y \bullet M = x^t Y x \geq 0} for all {x \in {\mathbb R}^n} and therefore, {Y \succeq 0}. \Box

Now we can use this to prove the duality theorem. We will use a slightly different version of Farkas’ Lemma together with an elementary result from Linear Algebra. The following version comes from applying the last theorem to the matrices:

\displaystyle \begin{bmatrix} A_1 & \\ & 0 \end{bmatrix}, \hdots, \begin{bmatrix} A_k & \\ & 0 \end{bmatrix}, \begin{bmatrix} -B & \\ & 1 \end{bmatrix}

instead of {A_1, \hdots, A_{k+1}}.

Theorem 5 (Semidefinite Farkas’ Lemma, Inhomogeneous Version) Exactly one of the following problems has a solution:

  1. {x_1 A_1 + \hdots + x_k A_k \succ B}, {x_i \in {\mathbb R}}
  2. {A_1 \bullet Y = 0}, …, {A_k \bullet Y = 0}, {B \bullet Y \geq 0}, {Y \succeq 0}, {Y \neq 0}

where {A \succ B} means {A - B \succ 0}. Now, an elementary Linear Algebra result before we proceed to the proof of the duality theorem:

Theorem 6 Given two positive semidefinite matrices {A} and {B} we have {tr(AB) \geq 0}, where {tr(M) = \sum_i M_{ii}}. Note that for symmetric {B} this is exactly {A \bullet B}.

Proof: The trace of a matrix is invariant under change of basis, i.e. for any non-singular matrix {\Gamma}, we have that {tr(A) = tr(\Gamma^{-1} A \Gamma)}. This is very easy to see, since:

\displaystyle  \begin{aligned} tr(\Gamma^{-1} A \Gamma) & = \sum_i (\Gamma^{-1} A \Gamma)_{ii} = \sum_i \sum_j (\Gamma^{-1})_{ij} (A \Gamma)_{ji} = \\ & = \sum_i \sum_j \sum_k (\Gamma^{-1})_{ij} A_{jk} (\Gamma)_{ki} = \\ & = \sum_j \sum_k A_{jk} \left( \sum_i (\Gamma)_{ki} (\Gamma^{-1})_{ij} \right) = \\ 			 & = \sum_j \sum_k A_{jk} \delta_{jk} = tr(A) \\ \end{aligned}

Now we write {A} in the basis of its eigenvectors: since {A} is symmetric, there is an orthogonal matrix {\Gamma} such that {A = \Gamma D \Gamma^{t}} where {D = diag(\lambda_1, \hdots, \lambda_n)}. Now:

\displaystyle  \begin{aligned} tr(AB) & = tr(\Gamma^{t} AB \Gamma) = tr(\Gamma^{t} A \Gamma \Gamma^{t} B \Gamma) = tr(D \Gamma^{t} B \Gamma) = \\ & = \sum_{i} \lambda_i (\Gamma^{t} B \Gamma)_{ii} \geq 0 \end{aligned}

since {\lambda_i \geq 0} because {A} is positive semidefinite, and:

\displaystyle (\Gamma^{t} B \Gamma)_{ii} = e_i^t \Gamma^{t} B \Gamma e_i = (\Gamma e_i)^t B (\Gamma e_i)

\Box
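We can also sanity-check Theorem 6 numerically. A sketch of my own, using the standard fact that {M^t M} is always positive semidefinite:

```python
import numpy as np

rng = np.random.default_rng(42)

def random_psd(n):
    """M^t M is positive semidefinite for any real matrix M."""
    M = rng.standard_normal((n, n))
    return M.T @ M

# tr(AB) >= 0 for every pair of PSD matrices we try:
traces = [np.trace(random_psd(4) @ random_psd(4)) for _ in range(1000)]
assert min(traces) >= -1e-9
```

The product {AB} itself is generally not symmetric (and not PSD), which is what makes the nonnegativity of its trace a nontrivial little fact.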

Now, we are ready for the Duality Theorem:

Theorem 7 (SDP Duality) The dual of the program 1 is given by:


\displaystyle  \left. \begin{aligned} & \max d^t y \text{ s.t. } \\ & \qquad \left. \begin{aligned} & y_1 D_1 + \hdots + y_k D_k \preceq C \end{aligned} \right. \end{aligned} \right. \ \ \ \ \ (2)

We call 1 the primal problem and 2 the dual. If {X} is feasible for 1 and {y} is feasible for 2, then {d^t y \leq C \bullet X}. Moreover, if the dual has a strictly feasible solution (i.e. with {\sum_i y_i D_i \prec C}) then the primal optimum is attained and both have the same optimal value.

As we can see in the theorem above, the main difference between this duality theorem and the duality theorem of linear programming is the “strictly feasible” condition. Let’s proceed to the proof. As in the LP case, the outline is the following: first we prove weak duality ({d^t y \leq C \bullet X} for feasible {X, y}), and then we add the constraint {d^t y > v^*_{dual}} to the dual restrictions and apply Farkas’ Lemma:

Proof: Weak duality: If {X} is feasible for 1 and {y} is feasible for 2 then:

\displaystyle C\bullet X \geq \left( \sum_i y_i D_i \right) \bullet X = \sum_i y_i D_i\bullet X

because {C - \sum_i y_i D_i \succeq 0} and the {\bullet} product of two positive semidefinite matrices is non-negative (Theorem 6). Now:

\displaystyle d^t y = \sum_i y_i d_i = \sum_i y_i D_i\bullet X

by the constraints of 1. Therefore, {C \bullet X \geq d^t y}.
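Weak duality is easy to watch in action on a tiny instance (my own illustrative example, not from the notes): minimize {C \bullet X} subject to {I \bullet X = tr(X) = 1} and {X \succeq 0}. The dual is {\max y} subject to {y I \preceq C}, so the best dual value is {\lambda_{min}(C)}.

```python
import numpy as np

C = np.array([[2.0, 1.0], [1.0, 3.0]])

# A feasible primal point: PSD with unit trace.
X = np.array([[0.5, 0.0], [0.0, 0.5]])
# The largest dual-feasible y (y I is below C iff y <= lambda_min(C)):
y = float(np.linalg.eigvalsh(C)[0])

primal = float(np.sum(C * X))        # C . X
assert y <= primal                   # weak duality: d^t y <= C . X

# Strong duality also holds here: take X* = v v^t for the unit eigenvector
# v of lambda_min(C); it is PSD with trace 1 and C . X* = lambda_min(C).
w, V = np.linalg.eigh(C)
Xstar = np.outer(V[:, 0], V[:, 0])
assert abs(float(np.sum(C * Xstar)) - y) < 1e-9
```

This instance (minimizing {C \bullet X} over unit-trace PSD matrices) is strictly feasible on the dual side, so the theorem guarantees the gap closes, as the last assertion confirms.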

Strong duality: Clearly, the following system has no solution:

\displaystyle y_1 D_1 + \hdots + y_k D_k \prec C, \qquad y^t d > v^*_{dual}

where {v^*_{dual}} is the optimal value of 2. Now we apply the inhomogeneous Farkas’ Lemma to:

\displaystyle  y_1 \begin{bmatrix} -D_1 & \\ & d_1 \end{bmatrix} + \hdots + y_k \begin{bmatrix} -D_k & \\ & d_k \end{bmatrix} \succ \begin{bmatrix} -C & \\ & v^*_{dual} \end{bmatrix}

and we get a nonzero matrix {X' = \begin{bmatrix} X & x \\ x^t & x_{n+1} \end{bmatrix}} such that:

\displaystyle  \begin{bmatrix} -D_i & \\ & d_i \end{bmatrix} \bullet X' = 0, \quad \begin{bmatrix} -C & \\ & v^*_{dual} \end{bmatrix} \bullet X' \geq 0, \quad X' \succeq 0

which means: {D_i \bullet X = d_i x_{n+1}} and {C \bullet X \leq v^*_{dual} x_{n+1}}. We know that {X \succeq 0} and {x_{n+1} \geq 0}; we just need to prove that {x_{n+1} > 0}, and here we need the hypothesis that the dual has a strictly feasible solution. If {x_{n+1} = 0}, then there would be a nonzero solution to:

\displaystyle X\bullet D_i = 0, \quad X\bullet C \leq 0, \quad X \succeq 0

and by Farkas’ Lemma, the problem:

\displaystyle \sum_i y_i D_i \prec C

would have no solution, but we know it does have one (the strictly feasible dual solution). Therefore, {x_{n+1} > 0}.

Finally, dividing by {x_{n+1}}, the matrix {X / x_{n+1}} is feasible for the primal and satisfies {C \bullet (X / x_{n+1}) \leq v^*_{dual}}; combined with weak duality, this shows both problems have the same optimal value and the primal optimum is attained.

\Box