Home > theory > Submodularity


Last week a heard at least twice about submodularity what made me want to write about. First, it was in the talk by Carlos Guestrin in the first CompSust conference. His very interesting point was: Machine Learning explored Convexity in the core of many machine learning algorithm (SVMs, for example). We should look for other general models, other than convexity, that have interesting properties and so that we could design algorithms based on that. He proposes submodularity as a natural candidate.

The idea of submodular functions generalizes in a certain way the concept of concave functions. The main property captured is the disminishing returns property. Consider a set {S} and the set {2^S} of its subsets. We say that a function {f: 2^S \rightarrow \mathbb R} is submodular if:

\displaystyle f(S \cup \{z\}) - f(S) \geq f(T \cup \{z\}) - f(T) \text{ for } S \subseteq T

that is, adding a new element {z} to a larger set has smaller marginal gain than adding it to a smaller set. This is very natural in a lot of circumstances: for example, let {S} be a set of books (or papers) anf {f(S)} be the knowledge you get by reading each of them. Since the topics overlap, reading an extra book when you already read {100} has a smaller marginal gain (or at least no greater) than reading it when you read only {10}. One can think of it as a natural generalization of concave functions, which also have this disminishing returns propery. Actually if {\phi} is a concave function, {f(S) = \phi(\vert S \vert)} is submodular.

In the Cascading Behavior in Networks, Jon Kleinberg described a very interesting submodular function: considere a graph {G = (V,E)} and a epidemia spreading through the nodes. Suppose an initial set {S \subseteq V} of infected nodes. As soon as one node {u} becomes infected, each node {v}, s.t. {(u,v) \in E} becomes infected with probability {p}. Let {f(S)} be the expected number of infected nodes in the end. It can be proved that {f} is submodular. One can think of this infection as being some information spreading in a network (say, the blogosphere). Then, if we start spreading this information through {k} nodes, which nodes should we pick to make sure most nodes will get the information in the end?

Another example is vertex cover: we can think of {S \subseteq V} as a set of nodes in a graph and {f(S)} the number of edges covered by those nodes. It is nor hard to see that this function is submodular. Set cover and other covering problems can be modelled as submodular functions. Normally, the main question is something like: how can I maximize {f} given that I can pick at most {k} elements for {S}, or, stated in mathematical terms:

\displaystyle \max_{\vert S \vert \leq k} f(S)

The fact that {f} is submodular makes it easy to approximate it using a greedy algorithm, using the following very interesting result:

Theorem 1 (Nemhauser et al, 78) Given a monotonic (i.e {f(S) \leq f(T)} for {S \subseteq T}) submodular function {f : 2^X \rightarrow {\mathbb R}_+} with {f(\emptyset) = 0}, the set {S = S_k} obtained by taking {S_0 = \emptyset} and

\displaystyle S_{i+1} = S_i \cup \{z_i\} \text{ where } z_i \in \text{argmin}_{z \notin S_i} f(S_i + z)

is a {\left( 1 - \frac{1}{e} \right)} approximation to {\max_{\vert S \vert \leq k} f(S)}.

Proof: First, notice that:

\displaystyle \begin{aligned}f(A \cup B) & = \sum_{j=1}^n f(A \cup \{b_1, \hdots, b_j \}) - f(A \cup \{b_1, \hdots, b_{j-1} \}) \\ & \leq \sum_{j=1}^n f(A \cup \{b_j \}) - f(A) \end{aligned}

where the last part follows by submodularity. Let {T} be the optimal solution for the problem, so we have that:

\displaystyle f(T) \leq f(S_i \cup T) \leq f(S_i) + \sum_{t \in T} f(S_i \cup \{ t \}) - f(S_i) \leq f(S_i) + \sum_{t \in T} f(S_{i+1}) - f(S_i)

because of our choice of {z_i}. So, if {\delta_i = f(S_i) - f(S_{i-1})} then we know that:

\displaystyle f(T) \leq f(S_i) + k \delta_{i+1} \Rightarrow \delta_{i+1} \geq \frac{f(T) - f(S_i)}{k}

and therefore:

\displaystyle f(S_{i+1}) = f(S_i) + \delta_{i+1} \geq \frac{1}{k} f(T) + \left( 1 - \frac{1}{k} \right) f(S_i)

Solving that recursion, it is easy to see that:

\displaystyle f(S_i) \geq \left[ 1 - \left( 1 - \frac{1}{k} \right)^i \right] f(T)

what gives the {1-1/e} bound.\Box

The second time I heard about submodularity last week was in a nice talk in the Theory Discussion Group about the paper Approximating submodular functions everywhere. Basically there is a monotone submodular function {f:2^{[n]} \rightarrow \mathbb R}. The values of {f(S)} are calculated through an oracle. The paper shows how to construct a function {\hat{f}(S)} that approximated {f(S)} everywhere just by calling the oracle a {poly(n)} number of times. The approximation is such that:

\displaystyle \hat{f}(S) \leq f(S) \leq \tilde{O}(\sqrt{n}) \hat{f}(S)

The paper is very deep, so I will not comment about it, but about some nice things I learned in the introduction of the talk about submodular functions and polymatroids. The polymatroid is the following polyhedron associated with a submodular function:

\displaystyle P_f = \left\{ x \in {\mathbb R}_+^n; x(S) \leq f(S), \forall S \subseteq [n] \right\}

where {x(S) = \sum_{i \in S} x_i}.

One nice thing about those objects is that we can easily optimize over them using a greedy algorithm:

Theorem 2 Considere a weight function {w:[n] \rightarrow {\mathbb R}_+} and a monotone submodular function {f:2^{[n]} \rightarrow {\mathbb R}_+}. Then the problem:

\displaystyle \min_{x \in P_f} \langle w, x\rangle

can be solved by the following greedy algorithm: order the elements in the set according to the {w}-value, i.e, {[n] = \{ s_1, \hdots, s_n\}} with {w(s_1) \geq w(s_2) \geq \hdots \geq w(s_n) } and define

\displaystyle x(s_i) = f(\{s_1, \hdots, s_i\}) - f(\{s_1, \hdots, s_{i-1}\})

Proof: We need to prove that {x} is a feasible point of {P_f} and then that {x} is optimal. To prove feasibility, we need to prove that for all {U \subseteq [n]} we have {x(U) \leq f(U)}. We prove for all sets {U \subseteq \{ s_1, \hdots, s_i \}} for all {i} and we do that by induction in {i}. For {i = 0}, it is trivial. Then, supposing we proved for {i}, let’s prove for {i+1}. Take {U \subseteq \{ s_1, \hdots, s_{i+1} \}}. If {s_{i+1} \notin U}, we are done. If not:

\displaystyle \begin{aligned} x(U) & = x(U \setminus \{ s_{i+1} \}) + x(s_{i+1}) \\ & \leq f(U \setminus \{ s_{i+1} \}) + f(\{s_1, \hdots, s_{i+1}\}) - f(\{s_1, \hdots, s_{i}\}) \leq f(U) \end{aligned}

Now, we need to prove optimality and this is essentially a primal-dual argument. The dual of the linear program:

\displaystyle \min \langle w, x\rangle \text{ s.t. } x(U) \leq f(U); x(U) \geq 0; \forall U \subseteq [n]

if the program:

\displaystyle \max \sum_{U \subseteq [n]} f(U) y(U) \text{ s.t. } \sum_{U; i \in U} y(U) \geq w_i; \forall i \in [n]; y(U) \geq 0

Consider the following dual solution: {y(\{s_1, \hdots, s_i\}) = w_i - w_{i+1}} (take {w_{n+1} = 0}) and {y(U) = 0} for all other sets. It is clearly feasible and matches the objective value of the primal solution. Both are therefore optimal. \Box

Notice that the proof has a “matroid flavour” and there is where the name poly-matroid comes from. Note that the rank function of a matroid is a submodular function and that the polymatroid of the rank function is something called the independent set polytope of the matroid. More about it in a later blog post.

  1. No comments yet.
  1. No trackbacks yet.