Archive

Posts Tagged ‘linear algebra’

Bimatrix games

May 25th, 2011 No comments

This week in Israel is the Workshop on Innovations in Algorithmic Game Theory, which has been a collection of amazing talks and I thought of blogging about something new I learned here. First, Paul Goldberg gave an amazing talk about the Lemke-Howson algorithm and Homotopy Methods. Later, during the poster session Jugal Garg presented the impressive work on an exact algorithm for finding Nash equilibria in rank 1 bimatrix games. Both have in common the use of homotopy, which I found I quite cool idea – and has been present in Game Theory for a while, but I didn’t know about. They also have to do with improving the understanding we have on the Lemke-Howson Algorithm – a (potentially exponential time) algorithm to find Nash equilibrium in 2-person bi-matrix games, but that seems to work pretty well in practice. As always, I though that blogging about it would be a good idea in order to understand those concepts well.

Bimatrix Game and How to compute equilibrium

Bimatrix game is the simplest class of games studied. Basically it is a 2 player game, with n strategies for player 1 and m strategies for player 2 which is represented by a pair of n \times m matrices (A,B). Let x \in \mathbb{R}_+^n, \Vert x \Vert_1 = 1 represent a probability distribution that player 1 assigns to his strategies and y \in \mathbb{R}_+^m, \Vert y \Vert_1 = 1 be the same for player 2. This way, the players experience utilities:

[; u_1 (x,y) = x^t A y,  u_2(x,y) = x^t B y;]

The best understood class of those games is the one where A+B = 0, called zero-sum games. For this class, computing a Nash equilibrium is very easy and it is given by the famous min-max theorem: player 1 finds x maximizing \min_j (x^t A e_j) where e_j is the j-th unit vector. Similarly player 2 finds y maximizing \min_i (e_i^t B y). Then the pair of strategies obtained is a Nash equilibrium – and this verifying that is not hard.

When A+B \neq 0, the problems gets a lot more complicated. Proving that equilibrium exist can be done using various fixed point theorems, as Brouwer or Kakutani. There is a very simple exponential time algorithm for finding it and the key observation is the following, if (x,y) is a Nash equilibrium then:

\forall i: x_i = 0 \text{ or } \forall i': e_i^t A y \geq e_{i'}^t A y

\forall j: y_j = 0 \text{ or } \forall j': x^t B e_j \geq x^t B e_{j'}

which means that each strategy for player i is either in the support or is a best response. Proving that is trivial (if some strategy is in the support and is not a best response, then reducing the probability we play it is an improving deviation). Therefore if we just guess the support of x and the support of y we just need to find some strategies with this support satisfying the inequalities above. This can be done using a simple LP. This is clearly not very efficient, since it involves solving 2^{n+m} LPs. A still exponential, but a lot more practical method, is:

Lemke-Howson Algorithm

A good overview of the L-H algorithm can be found in those lecture notes or in a more detailed version in third chapter of the AGT book (a pdf  can be found in Tim’s website). Here I’ll present a quick overview. The main idea is to define the best-response polytopes P and Q:

[; P = \{(x,v) \in \mathbb{R}^{n+1} \vert x_i \geq 0, x^t B e_j \leq v; \textbf{1}^t x = 1 \};]

[; Q = \{(y,u) \in \mathbb{R}^{m+1} \vert e_i^t A y \leq u, y_j \geq 0; \textbf{1}^t y = 1 \} ;]

The intuition is that a point (x,v) \in P represents the fact that the payoff v of player 2 when player 1 plays x is at least v. We could re-write as v \geq \max_j x^t B e_j.

Each of the polytopes has m+n inequalities and 1 equality. Given x we define L(x) as the indices of the tight inequalities in P and similarly we define L(y) as the indices of the tight inequalities in Q. The theorem in the previous section can be rephrased as:

(x,y) is Nash equilibrium iff L(x) \cup L(y) = \{1, 2, \hdots, n+m\}

So we need to look at points in the polytope below that are fully-labeled. And the way that this is done in L-H is quite ingenious – it is a similar idea that is used by the Simplex Method – walking through the vertices of the polytope looking for the desired vertex. In order to do it, let’s define another polytope:

[;P' = \{x \in \mathbb{R}^n \vert x_i \geq 0, x^t B e_j \leq 1\} ;]

[;Q' = \{y \in \mathbb{R}^m \vert e_i^t A y \leq 1, y_j \geq 0\} ;]

Note that there is a clear bijection between P' \setminus 0 \leftrightarrow P and Q' \setminus 0 \leftrightarrow Q. This is a projective transformation given by x \mapsto (\frac{x}{\Vert x \Vert_1},\Vert x \Vert_1). Notice that the labels are preserved by this transformation and the vertex and edges of the polytope are mapped almost 1-1 (except the vertex 0). Notice a couple of details:

1. A vertex in x \in P' corresponds to a point with n labels. A vertex in y \in Q'' corresponds to a point with m labels.

2. The point (0,0) corresponds to a fully-labeled point of P' \times Q', unfortunately this is the only fully-labeled point that doesn’t correspond to a Nash equilibrium.

3. By taking an edge of the polytope (which is define by n-1 labels in P' and m-1 labels in Q'. We can move between two nodes that are almost fully labeled.

The idea is to consider the set of nodes that are almost fully-labeled: fix some label k and consider all the nodes such that L(x) \cup L(y) \supseteq \{1 .. k-1, k+1 .. n+m\}. Those points are either (0,0), a Nash equilibrium or they have one duplicated label, since \vert L(x) \vert + \vert L(y) \vert = n+m. An edge is composed of n+m-1 labels (no duplicated labels). So, one almost fully-labeled point that is not a Nash or (0,0) is only connected to two other vertices via an edge (which correspond to dropping one of the duplicated labels and following the corresponding edge).

This gives us a topology of the space of Nash equilibria. This tells us a couple of facts: (1) we can find a Nash equilibrium by starting in (0,0) and following the L-H path. In the end of the path, we must find a Nash equilibrium. (ii) If the polytope is not degenerate, then there is an odd number of Nash equilibria and they are connected by L-H edges in the following way:

where the blue dots are the Nash equilibria, the white dots are the almost fully-labeled points and the edges are the L-H edges. The number of Nash equilibria are odd by a simple parity argument.

The classic way in which simplex-like methods walk through the vertices of a polytope is by pivoting. The same way we can implement Lemke-Howson. For an explanation on the implementation, please look at the chapter 3 of the AGT book.

One can ask if we could modify the L-H path following to go through the path faster. Recently, Goldberg, Papadimitriou and Savani proved that finding the Nash equilibrium that L-H outputs is PSPACE-complete. So, in principle, finding this specific equilibrium, seems harder than finding any Nash, which is PPAD-complete.

My current plan is on some other blog posts on Bimatrix games. I wanted to discuss two things: first the homotopy methods and homotopy fixed-point-theorems (which is the heart of the two papers mentioned above) and about new techniques for solving zero-matrix where the strategy space is exponential.

Bounded Degree Spanning Tree and an Uncrossing Lemma

November 17th, 2009 No comments

I’ve been reading about the Bounded Degree Spanning Tree problem and I thought of writing some of what I am learning here. It illustrates a beautiful techique called Iterated Rounding and uses the combinatorial idea of uncrossing. I’ll try to give a high-level idea of the argument and give references on the details. The first result of this kind was given by Goemans (although there were previous results with weaker guarantees) by Goemans in Minimum Bounded Degree Spanning Trees, but the result based on iterated rounding and a subsequent improvement are due to Singh and Lau in a serie of papers. A main reference is Approximating minimum bounded degree spanning trees to within one of optimal.

The problem of bounded degree spanning tree is as follows: consider a graph {G = (V,E)} with edge weights and we for some nodes {W \subseteq V} a degree bound {b_v, v \in W}. We want to find, among the spanning trees for which the degree of {v} is {\leq b_v} the one with minimum cost. It is clearly a hard problem, since taking all weights equal to {1} and {b_v = 2} for all nodes is the Hamiltonian Path problem, which is NP-complete. We will get a different kind of approximation. Let OPT be the optimal solution: we will show an algorithm that gives a spanning tree of cost {\leq OPT} such that each node {v} has degree {\leq b_v + 2} (this can be improved to {b_v + 1} with a more sofisticated algorithm, also based on Iterated Rounding).

As always, the first step to design an approximation algorithm is to relax it to an LP. We consider the following LP:

\displaystyle \left. \begin{aligned} & \min \sum_{e\in E} c_e x_e \text{ s.t. } \\ & \qquad \left. \begin{aligned} & \sum_{e \in E} x_e = \vert V \vert - 1 \\ & \sum_{e \in E(S)} x_e \leq \vert S \vert - 1 & \forall S \subseteq V, \vert S \vert \geq 2 \\ & \sum_{e \in \delta(v)} x_e \leq b_v & \forall v \in W\\ & x_e \geq 0 & \forall e \in E \end{aligned} \right. \end{aligned} \right.

The first constraint expresses that in a spanning tree, there are at most {\vert V \vert - 1} edges, the second prevent the formation of cycles and the third guarantees the degree bounds. For {W = \emptyset} we have the standard Minimal Spanning Tree problem and for this problem the polytope is integral. With the degree bounds, we lose this nice property. We can solve this LP using the Ellipsoid Method. The separation oracle for the {\sum_{e \in E(S)} x_e \leq \vert S \vert - 1} is done by a flow computation.

Iterated Rounding

Now, let’s go ahead and solve the LP. It would be great if we had an integral solution: we would be done. It is unfortunately not the case, but we can still hope it is almost integral in some sense: for example, some edges are integral and we can take them to the final solution and recurse the algorithm on a smaller graph. This is not far from truth and that’s the main idea of the iterated rounding. We will show that the support of the optimal solution {E(x) = \{e \in E; x_e > 0\}} has some nice structure. Consider the following lemma:

Lemma 1 For any basic solution {x} of the LP, either there is {v} with just one incident edge in the support {E(x)} or there is one {v \in W} such that that at most {3} edges are incident to it.

If we can prove this lemma, we can solve the problem in the following way: we begin with an empty tree: then we solve the LP and look at the support {E(x)}. There are two possibilities according to the lemma:

  • If there is one node {v} with just one edge {e = (u,v)} incident to it in the support, we add it to the tree, remove {v} from {V}, decrease {b_u}, make {E = E(x)} (the trick is to remove in each iteration edges from {E} that are not in the support. Clearly, removing those edges doesn’t hurt the objective value) and run the algorithm again. Notice that the LP called in the recursion has value less or equal then the actual LP {- c_e}. So if by induction we get a spanning tree respecting the new degree bounds plus two and value less or equal than the new LP value, we can just add {e} and we have a solution with value less or equal than the one of the original LP respecting the degree bounds plus two.
  • Otherwise, there is one node {v \in W} that has degree {3} in the support. So, we just remove that degree bound on that vertex (i.e. remove {v} from {W}), make {E = E(x)} (again,eliminate the edges not in the support) and run the algorithm again. Clearly, if one node is still in {W}, it has {b_v \geq 1}, since there are only three edges in the support, there will be for the rest of the computation, just three edges incident to it, so there will be at most three edges more incident to it. So it will exceed its original {b_v} by at most {2}.

The algorithm eventually stops, since in each iteration we have less edges or less nodes in {W} and the solution is as desired. The main effort is therefore to prove the lemma. But before, let’s look at the lemma: it is of the following kind: “any basic solution of the LP has some nice properties, which envolve having a not too big (at least in some point) support”. So, it involves proving that the support is not too large. That is our next task as we are trying to prove the lemma. And we will be done with:

Theorem 2 The algorithm described above produces a spanning tree of cost {\leq Z^*} (the LP values and therefore {\leq OPT})in which each node {v \in W} has degree {\leq b_v + 2}.

Bounding the size of the support

We would like now to prove some result like the Lemma above: that in the solution of the LP we have either one {v} with degree {1} in {E(x)} or we have a node in {W} with degree {\leq 3}. First, we suppose the opposite, that {(V,E(x))} has all the nodes with degree {\geq 2} and all the nodes in {W} have degree {\geq 4}. This implies that we have a large number of edges in the support. From the degrees, we know that:

\displaystyle \vert E(x) \vert \geq \frac{1}{2} \left( 2( \vert V \vert - \vert W \vert ) + 4 \vert W \vert \right) = \vert V \vert + \vert W \vert

We want to prove that the support {\vert E(x) \vert} of the LP can’t be too large. The first question is: how to estimate the size of the support of a basic solution. The constraints look like that:

bst_fig1

A basic solution can be represented by picking {\vert E \vert} rows of the matrix and making them tight. So, if we have a general {Ax \leq b} LP, we pick some submatrix {A'} of {A} which is {n \times n} and the basic solution is just {x = A'^{-1} b'}. The lines of matrix {A'} can be of three types: they can be {\chi_S}, which are corresponding to {\sum_{e \in E(S)} x_e \leq \vert S \vert - 1}, {\chi_v} that correspond to {\sum_{e \in \delta(v)} x_e \leq b_v} or {\delta_e} corresponding to {x_e \geq 0}. There are {\vert E \vert} vectors in total. The size of the support {E(x)} is smaller or equal the number of rows of the form {\chi_S, \chi_v} in the basic solution. Therefore the idea to bound the size of the support is to prove that “all basic solutions can be represented by a small number of rows in the form {\chi_S, \chi_v}. And this is done using the following:

Lemma 3 Assuming {E = E(x)}, for any basic solution {x}, there is {Z \subseteq W} and a family {\mathcal{S}} of sets such that:

  1. The restrictions correspondent to {S \in \mathcal{S}} and {v \in Z} are tight for {x}
  2. {\{ \chi_S; S \in \mathcal{S} \} \cup \{ \chi_v; v \in Z \}} is an independent set
  3. {\vert \mathcal{S} \vert + \vert Z \vert = \vert E(x) \vert}
  4. {\mathcal{S}} is a laminar family

The first 3 items are straightfoward properties of basic solutions. The fourth one, means that for two sets {S_1, S_2 \in \mathcal{S}}, one of three things happen: {S_1 \subseteq S_2}, {S_2 \subseteq S_1} or {S_1 \cap S_2 = \emptyset}. Now, we based on the previous lemma and in the following result that can be easily proved by induction, we will prove Lemma 1.

Lemma 4 If {\mathcal{S}} is a laminar family over the set {V} where each set {S \in \mathcal{S}} contains at least {2} elements, then {\vert \mathcal{S} \vert \leq \vert V \vert - 1}.

Now, the proof of Lemma 1 is easy. Let’s do it and then we come back to prove Lemma 3. Simply see that {\vert E(x) \vert = \vert \mathcal{S} \vert + \vert Z \vert \leq \vert V \vert - 1 + \vert W \vert} what contradicts {\vert E(x) \vert \geq \vert V \vert + \vert W \vert}.

Uncrossing argument

And now we arrive in the technical heart of the proof, which is proving Lemma 3. This says that given any basic solution, given any feasible solution, we can write it as a “structured” basic solution. We start with any basic feasible solution. This already satifies (1)-(3), then we need to change that solution to satisfy condition (4) as well. We need to get rid crossing elements, i.e., {S,T \in \mathcal{S}} in the form:

bst_fig2

We do that by the means of the:

Lemma 5 (Uncrossing Lemma) If {S} and {T} are intersecting and tight (tight in the sense that their respective constraint is tight), then {S \cup T} and {S \cap T} are also tight and:

\displaystyle \chi_{S \cap T} + \chi_{S \cup T} = \chi_S + \chi_T

Which corresponds to that picture:

bst_fig3

Proof: First, we note that {x(E(S))} is a supermodular function, i.e.:

\displaystyle x(E(S)) + x(E(T)) \leq x(E(S \cap T)) + x(E(S \cup T))

We can see that by case analysis. Every edge appearing in the left side appears in the right side with at least the same multiplicity. Notice also that it holds with strict inequality iff there are edges from {S\setminus T} to {T \setminus S}. Now, we have:

\displaystyle \begin{aligned} (\vert S \vert - 1) + (\vert T \vert - 1) & = (\vert S \cap T \vert - 1) + (\vert S \cup T \vert - 1) \geq \\ & \geq x(E(S \cap T)) + x(E(S \cup T)) \geq \\ & \geq x(E(S)) + x(E(T)) = (\vert S \vert - 1) + (\vert T \vert - 1) \end{aligned}

where the first relation is trivial, the second is by feasibility, the third is by supermodularity and the lastone is by tightness. So, all hold with equality and therefore {S \cap T} and {S \cup T} are tight. We also proved that:

\displaystyle x(E(S \cap T)) + x(E(S \cup T)) = x(E(S)) + x(E(T))

so there can be no edge from {S\setminus T} to {T \setminus S} in {E(x)} and therefore, thinking just of edges in {E(x)} we have:

\displaystyle \chi_{S \cap T} + \chi_{S \cup T} = \chi_S + \chi_T

\Box

Uncrossing arguments are found everywhere in combinatorics. Now, we show how the Uncrossing Lemma can be used to prove Lemma 1:

Proof: Let {x} be any basic solution. It can be represented by a pair {(Y, \mathcal{C})} where {Y \subseteq W} and {\mathcal{C}} is a family of sets. We will show that the same basic solution can be represented by {(Y, \mathcal{L})} where {\mathcal{L}} is a laminar family and has the same size of {\mathcal{C}}.

Let {\mathcal{S}} be all sets that are tight under {x} and {\mathcal{L}} a maximal laminar family of tights sets in {\mathcal{S}}, such that {\{\chi_S; S \in \mathcal{L} \} \cup \{\chi_v; v \in Z \}} are independent. I claim that {\vert \mathcal{L} \vert = dim(span(\mathcal{S}))}.

In fact, suppose {\vert \mathcal{L} \vert < dim(span(\mathcal{S}))}, then there are sets of {\mathcal{S}} we could add to {\mathcal{L}} without violating independence – the problem is that those sets would cross some set. Pick such {S \in \mathcal{S}} intersecting fewer possible sets in {\mathcal{L}}. The set {S} intersects some {T \in \mathcal{L}}. Since both are tight we can use the Uncrossing Lemma and we get:

\displaystyle \chi_{S \cap T} + \chi_{S \cup T} = \chi_S + \chi_T

since {\chi_S \notin span(\mathcal{L})}, we can’t have simultaneously {\chi_{S \cap T}} and {\chi_{S \cup T}} in {span(\mathcal{L})}. Let’s consider two cases:

  1. {\chi_{S \cap T} \notin span(\mathcal{L})}, then {S \cap T \in span(\mathcal{S})} is in {span(\mathcal{S}) \setminus span(\mathcal{L})} and intersects fewer sets of {\mathcal{L}} than {S}, since all sets that intersect {S \cap T} in {\mathcal{L}} must intersect {S} as well (since no set can cross {T}).bst_fig4
  2. {\chi_{S \cup T} \notin span(\mathcal{L})}, then {S \cup T \in span(\mathcal{S})} is in {span(\mathcal{S}) \setminus span(\mathcal{L})} and intersects fewer sets of {\mathcal{L}} than {S}, since all sets that intersect {S \cup T} in {\mathcal{L}} must intersect {S}.

In either case we have a contradiction, so we proved that {\vert \mathcal{L} \vert = dim(span(\mathcal{S}))}. So we can generate all the space of tight sets with a laminar family. \Box

And this finishes the proof. Let’s go over all that we’ve done: we started with an LP and we wanted to prove that the support of each solution was not too large. We wanted that because we wanted to prove that there was one node with degree one in the support or a node in {W} with small ({\leq 3}) degree. To prove that the degree of the support is small, we show that any basic solution has a representation in terms of a laminar family. Then we use the fact that laminar families can’t be very large families of sets. For that, we use the celebrated Uncrossing Lemma.

Note: Most of this is based on my notes on David Williamson’s Approximation Algorithms class. I spent some time thinking about this algorithm and therefore I decided o post it here.

Cayley-Hamilton Theorem and Jordan Canonical Form

October 28th, 2009 4 comments

I was discussing last week with my officemates Hu Fu and Ashwin about the Cayley-Hamilton Theorem. The theorem is the following, given an {n \times n} matrix {A} we can define its characteristic polynomial by {p_A(\lambda) = \det(A - I\lambda)}. The Cayley-Hamilton Theorem says that {p_A(A) = 0}. The polynomiale is something like:

\displaystyle p_A(x) = a_k x^k + a_{k-1} x^{k-1} + \hdots + a_1 x^1 + a_0

so we can just see it as a formal polynomial and think of:

\displaystyle p_A(A) = a_k A^k + a_{k-1} A^{k-1} + \hdots + a_1 A + a_0 I

which is an {n \times n} matrix. The theorem says it is the zero matrix. We thought for a while, looked in the Wikipedia, and there there were a few proofs, but not the one-line proof I was looking for. Later, I got this proof that I sent to Hu Fu:

Write the matrix {A} in the basis of its eigenvectors, then we can write {A = \Gamma^t D \Gamma} where {D} is the diagonal matrix with the eigenvalues in the main diagonal.

\displaystyle A^k = (\Gamma^t D \Gamma) \hdots (\Gamma^t D \Gamma) = \Gamma^t D^k \Gamma

and since {D = \text{diag}(\lambda_1, \hdots, \lambda_n)} we have {D^k = \text{diag}(\lambda_1^k, \hdots, \lambda_n^k)}. Now, it is simple to see that:

\displaystyle p_A(A) = \Gamma^t p(D) \Gamma

and therefore:

\displaystyle p(D) = \begin{bmatrix}& p_A(\lambda_1) & & & \\ & & p_A(\lambda_2) & & & \\ & & & \ddots & \\ & & & & p_A(\lambda_n) \end{bmatrix} = \begin{bmatrix} & 0 & & & \\ & & 0 & & & \\ & & & \ddots & \\ & & & & 0 \end{bmatrix} = 0

And that was the one-line proof. One even simpler proof is: let {v_i} be the eigenvectors, then {p_A(A)v_i = p_A(\lambda_i)A = 0}, so {p_A(A)} must be {0} since it returns zero for all elements of a basis. Well, I sent that to Hu Fu and he told me the proof had a bug. Not really a bug, but I was proving only for symmetric matrices. More generally, I was proving for diagonalizable matrices. He showed me, for example, the matrix:

\displaystyle  \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}

which has only one eigenvalue {0} and the the eigenvectors are all of the form {(x,0)} for {x \in {\mathbb R}}. So, the dimension of the space spanned by the eigenvectors is {1}, less than the dimension of the matrix. This never happens for symmetric matrices, and I guess after some time as a computer scientist, I got used to work only with symmetric matrices for almost everything I use: metrics, quadratic forms, correlation matrices, … but there is more out there then only symmetric matrices. The good news is that this proof is not hard to fix for the general case.

First, it is easy to prove that for each root of the characteristic polynomial there is one eigenvector associated to it (just see that {\det(A - \lambda I) = 0} and therefore there must be {v \in \ker(A - \lambda I) \setminus \{ 0 \}}, so if all the roots are distinct, then there is a basis of eigenvalues, and therefore the matrix is diagonalizable (notice that maybe we will need to use complex eigenvalues, but it is ok). The good thing is that a matrix having two identical eigenvalues is a “coincidence”. We can identify matrices with {{\mathbb R}^{n^2}}. The matrices with identical eigenvalues form a zero measure subset of {{\mathbb R}^{n^2}}, they are in fact the roots of a polynomial in {x_{ij}}. This polynomial is the resultant polynomial {R_{p,p'} = 0}. Therefore, we proved Cayley-Hamilton theorem in the complement of a zero-measure set in {{\mathbb R}^{n^2}}. Since {A \mapsto p_A(A)} is a continuous function, it extends naturally to all matrices {A \in {\mathbb R}^{n^2}}.

We can also interpret that probabilistically: get a matrix {U} where {U_{ij}} is taken uniformly at random from {[0,1]}. Then {A + \epsilon U} has with probability {1} all different eigenvalues. So, {p_{A+\epsilon U} (A+\epsilon U) = 0} with probability {1}. Now, just make {\epsilon \rightarrow 0}.

Ok, this proves the Theorem for real and complex matrices, but what about a matrix defined over a general field where we can’t use those continuity arguments. A way to get around it is by using Jordan Canonical Form, which is a generalization of eigenvector decomposition. Not all matrices have eigenvector decomposition, but all matrices over an algebraic closed field can be written in Jordan Canonical Form. Given any {A \in \overline{K}^{n^2}} there is a matrix {\Gamma \in \overline{K}^{n^2}} so that:

\displaystyle A = \Gamma^{-1} \begin{bmatrix}& B_1 & & & \\ & & B_2 & & & \\ & & & \ddots & \\ & & & & B_p \end{bmatrix}\Gamma

where {B_i} are blocks of the form:

\displaystyle B_i = \begin{bmatrix}& \lambda & 1 & & \\ & & \lambda & 1 & & \\ & & & \ddots & \\ & & & & \lambda \end{bmatrix}

By the same argument as above, we just need to prove Cayley Hamilton for each block in separate. So we need to prove that {p_A(B_i) = 0}. If the block has size {1}, then it is exacly the proof above. If the block is bigger, then we need to look at how does {B_i^k} looks like. By inspection:

\displaystyle B_i^2 = \begin{bmatrix}& \lambda^2 & 2 \lambda & 1 & & & \\ & & \lambda^2 & 2 \lambda & 1 & & & \\ & & & & & \ddots & \\ & & & & & \lambda^2 \end{bmatrix}

Tipically, for {B_i^k} we have in each row, starting in column {k} the sequence {\lambda^k, k \lambda^{k-1}, k(k-1) \lambda^{k-1}, \hdots}, i.e., {\frac{d^0}{d\lambda^0} \lambda^k, \frac{d^1}{d\lambda^1} \lambda^k , \frac{d^2}{d\lambda^2} \lambda^k \hdots}. So, we have

\displaystyle p(B_i) = \begin{bmatrix} p(\lambda) & p'(\lambda) & p''(\lambda) & p'''(\lambda) & \hdots \\ & p(\lambda) & p'(\lambda) & p''(\lambda) & \hdots \\ 				& & p(\lambda) & p'(\lambda) & \hdots \\ 				& & & p(\lambda) & \hdots \\ 				& & & & \ddots \\ \end{bmatrix}

If block {B_i} has size {k}, then {\lambda_i} has multiplicity {k} in {p(.)} and therefore {p(\lambda_i) = p'(\lambda_i) = \hdots = p^{(k-1)}(\lambda_i)} and therefore, {p(B_i) = 0} as we wanted to prove.

It turned out not to be a very very short proof, but it is still short, since it uses mostly elementary stuff and the proof is really intuitive in some sense. I took some lessons from that: (i) first it reinforces my idea that, if I need to say something about a matrix, the first thing I do is to look at its eigenvectors decomposition. A lot of Linear Algebra problems are very simple when we consider things in the right basis. Normally the right basis is the eigenvector basis. (ii) not all matrices are diagonalizable. But in those cases, Jordan Canonical Form comes in our help and we can do almost the same as we did with eigenvalue decomposition.