generalized conditional gradient and its applicationsyaoliang/mytalks/kelowna.pdf · generalized...

36
Generalized Conditional Gradient and Its Applications Yaoliang Yu University of Alberta UBC – Kelowna, 04/18/13 Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 1 / 25

Upload: haduong

Post on 28-Apr-2018

223 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

Generalized Conditional Gradient and Its Applications

Yaoliang Yu

University of Alberta

UBC – Kelowna, 04/18/13

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 1 / 25

Page 2: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

1 Introduction

2 Generalized Conditional Gradient

3 Polar Operator

4 Conclusions

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 2 / 25

Page 3: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

Table of Contents

1 Introduction

2 Generalized Conditional Gradient

3 Polar Operator

4 Conclusions

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 3 / 25

Page 4: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

Regularized Loss Minimization

Generic form for many ML problems:

minw

f (w) + λ · h(w), where

f is the loss function;

h is the regularizer;

Assuming f and h to be convex/smooth

Interior point method;

Mirror descent / Proximal gradient;

Averaging gradient;

Conditional gradient.

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 4 / 25

Page 5: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

Machine Learning Examples

Example (Matrix Completion)

minX∈Rm×n

12

∑ij∈O(Xij − Zij)

2 + λ · ‖X‖tr

Netflix problem;

Covariance matrix estimation; etc.

Example (Group Lasso)

minw∈Rd

12‖Aw − b‖2

2 + λ ·∑g∈G ‖w‖g

Statistical estimation;

Inverse problem;

Denoising; etc.

Interesting case: m, n or d are extremely large.

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 5 / 25

Page 6: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

Conditional gradient (Frank-Wolfe’56)

Consider minx∈C

f (x),

C : compact convex;

f : smooth convex.

1 yt ∈ argminx∈C

〈x ,∇f (xt)〉;

2 xt+1 = (1− η)xt + ηyt .

(Frank-Wolfe’56; Canon-Cullum’68) proved that CG converges at Θ(1/t).

Gained much recent attention due to

its simplicity;

the greedy nature in step 1.

Refs: (Zhang’03; Clarkson’10; Hazan’08; Jaggi-Sulovsky’10; Bach’12; etc.)

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 6 / 25

Page 7: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

An Example

mina,b

a2 + (b + 1)2, s.t. |a| ≤ 1, 2 ≥ b ≥ 0

Can show f (xk)− f (x?) = 4/k + o(1/k).

Projected gradient converges in two iterations.

Refs: (Levtin-Polyak’66; Polyak’87; Beck-Teboulle’04) for faster rates.

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 7 / 25

Page 8: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

An Example

mina,b

a2 + (b + 1)2, s.t. |a| ≤ 1, 2 ≥ b ≥ 0

Can show f (xk)− f (x?) = 4/k + o(1/k).

Projected gradient converges in two iterations.

Refs: (Levtin-Polyak’66; Polyak’87; Beck-Teboulle’04) for faster rates.

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 7 / 25

Page 9: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

An Example

mina,b

a2 + (b + 1)2, s.t. |a| ≤ 1, 2 ≥ b ≥ 0

x1 = (1, 1)

Can show f (xk)− f (x?) = 4/k + o(1/k).

Projected gradient converges in two iterations.

Refs: (Levtin-Polyak’66; Polyak’87; Beck-Teboulle’04) for faster rates.

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 7 / 25

Page 10: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

An Example

mina,b

a2 + (b + 1)2, s.t. |a| ≤ 1, 2 ≥ b ≥ 0

x1 = (1, 1)

y1 = (−1, 0)

Can show f (xk)− f (x?) = 4/k + o(1/k).

Projected gradient converges in two iterations.

Refs: (Levtin-Polyak’66; Polyak’87; Beck-Teboulle’04) for faster rates.

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 7 / 25

Page 11: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

An Example

mina,b

a2 + (b + 1)2, s.t. |a| ≤ 1, 2 ≥ b ≥ 0

x1 = (1, 1)

y1 = (−1, 0)

Can show f (xk)− f (x?) = 4/k + o(1/k).

Projected gradient converges in two iterations.

Refs: (Levtin-Polyak’66; Polyak’87; Beck-Teboulle’04) for faster rates.

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 7 / 25

Page 12: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

An Example

mina,b

a2 + (b + 1)2, s.t. |a| ≤ 1, 2 ≥ b ≥ 0

x1 = (1, 1)

y1 = (−1, 0)

x2

Can show f (xk)− f (x?) = 4/k + o(1/k).

Projected gradient converges in two iterations.

Refs: (Levtin-Polyak’66; Polyak’87; Beck-Teboulle’04) for faster rates.

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 7 / 25

Page 13: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

An Example

mina,b

a2 + (b + 1)2, s.t. |a| ≤ 1, 2 ≥ b ≥ 0

x1 = (1, 1)

y1 = (−1, 0)

x2

y2 = (1, 0)

Can show f (xk)− f (x?) = 4/k + o(1/k).

Projected gradient converges in two iterations.

Refs: (Levtin-Polyak’66; Polyak’87; Beck-Teboulle’04) for faster rates.

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 7 / 25

Page 14: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

An Example

mina,b

a2 + (b + 1)2, s.t. |a| ≤ 1, 2 ≥ b ≥ 0

x1 = (1, 1)

y1 = (−1, 0)

x2

y2 = (1, 0)

x3

Can show f (xk)− f (x?) = 4/k + o(1/k).

Projected gradient converges in two iterations.

Refs: (Levtin-Polyak’66; Polyak’87; Beck-Teboulle’04) for faster rates.

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 7 / 25

Page 15: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

An Example

mina,b

a2 + (b + 1)2, s.t. |a| ≤ 1, 2 ≥ b ≥ 0

x1 = (1, 1)

y1 = (−1, 0)

x2

y2 = (1, 0)

x3x4

Can show f (xk)− f (x?) = 4/k + o(1/k).

Projected gradient converges in two iterations.

Refs: (Levtin-Polyak’66; Polyak’87; Beck-Teboulle’04) for faster rates.

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 7 / 25

Page 16: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

An Example

mina,b

a2 + (b + 1)2, s.t. |a| ≤ 1, 2 ≥ b ≥ 0

x1 = (1, 1)

y1 = (−1, 0)

x2

y2 = (1, 0)

x3x4

0 5 10 15 20 25 30 35 401

1.5

2

2.5

3

3.5

4

4.5

5

conditional gradient1/(k+2)

Can show f (xk)− f (x?) = 4/k + o(1/k).

Projected gradient converges in two iterations.

Refs: (Levtin-Polyak’66; Polyak’87; Beck-Teboulle’04) for faster rates.

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 7 / 25

Page 17: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

An Example

mina,b

a2 + (b + 1)2, s.t. |a| ≤ 1, 2 ≥ b ≥ 0

x1 = (1, 1)

y1 = (−1, 0)

x2

y2 = (1, 0)

x3x4

0 5 10 15 20 25 30 35 401

1.5

2

2.5

3

3.5

4

4.5

5

conditional gradient1/(k+2)

Can show f (xk)− f (x?) = 4/k + o(1/k).

Projected gradient converges in two iterations.

Refs: (Levtin-Polyak’66; Polyak’87; Beck-Teboulle’04) for faster rates.

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 7 / 25

Page 18: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

An Example

mina,b

a2 + (b + 1)2, s.t. |a| ≤ 1, 2 ≥ b ≥ 0

x1 = (1, 1)

y1 = (−1, 0)

x2

y2 = (1, 0)

x3x4

0 5 10 15 20 25 30 35 401

1.5

2

2.5

3

3.5

4

4.5

5

conditional gradient1/(k+2)

Can show f (xk)− f (x?) = 4/k + o(1/k).

Projected gradient converges in two iterations.

Refs: (Levtin-Polyak’66; Polyak’87; Beck-Teboulle’04) for faster rates.

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 7 / 25

Page 19: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

The revival of CG: sparsity!

The revived popularity of conditional gradient is due to (Clarkson’10;Shalev-Shwartz-Srebro-Zhang’10), both focusing on

minx : ‖x‖1≤1

f (x).

1 yt ← argmin‖y‖1≤1

〈y ;∇f (xt)〉, card(yt) = 1;

2 xt+1←(1−η)xt + ηyt , card(xt+1)≤card(xt) + 1.

Explicit control of the sparsity. 1/ε vs. 1/√ε.

Sparsity, more generally structure, is the key to the success of ML.

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 8 / 25

Page 20: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

Table of Contents

1 Introduction

2 Generalized Conditional Gradient

3 Polar Operator

4 Conclusions

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 9 / 25

Page 21: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

Generalized conditional gradient

Consider minx

f (x) + λ · κ(x),

f : smooth convex;

κ: gauge (not necessarily smooth).

Important distinction:

composite, with a non-smooth term;

unconstrained, hence unbounded domain.

1 Polar operator: yt ∈ argminx :κ(x)≤1

〈x ,∇f (xt)〉;

2 line search: st ∈ argmins≥0

f ((1− η)xt + ηsyt) + ληs;

3 xt+1 = (1− η)xt + ηstyt .

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 10 / 25

Page 22: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

Convergence Rate

minx

f (x) + λ · κ(x)

Theorem (Zhang-Y-Schuurmans’12)

If f and κ have bounded level sets and f ∈ C1, then GCG converges atrate O(1/t), where the constant is independent of λ.

Proof is simple: Line search is as good as knowing κ(x∗);

Note that we upper bound κ((1− η)xt + ηsyt) ≤ (1− η)κ(xt) + ηs;

Still too slow!

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 11 / 25

Page 23: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

Local improvement

Assume some procedure (say BFGS) that can locally minimize thenonsmooth problem minx f (x) + λ · κ(x), or some variation of it.

Combine this local procedure with some globally convergent routine?

Two conditions:

The local procedure cannot incur big overhead;

Cannot ruin the globally convergent routine.

Both are met by the GCG.

Refs: (Burer-Monteiro’05; Mishra et al’11; Laue’12)

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 12 / 25

Page 24: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

Case study: Matrix completion with trace norm

Consider minX

12

∑ij∈O

(Xij − Zij)2 + λ · ‖X‖tr.

‖ · ‖tr is the convex hull of rank on the unit ball X : ‖X‖sp ≤ 1.

The only nontrivial step in GCG:

Polar operator: Yt ∈ argmin‖Y ‖tr≤1

〈Y ,Gt〉, amounts to the dominating

singular vectors of −Gt .

In contrast, popular gradient methods need the full SVD of −Gt .

Variation: 12 minU,V

∑ij∈O

((UV )ij − Zij)2 + λ · (‖U‖2

F + ‖V ‖2F ).

Not jointly convex in U and V ;

But smooth in U and V ;

Yt in GCG is rank-1 hence Xt = UV is of rank at most t.

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 13 / 25

Page 25: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

Case study: Experiment

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 14 / 25

Page 26: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

Interpretation

Dictionary learning problem:

minD∈Rm×r ,Φ∈Rr×n

L(X ,DΦ).

Many applications: NMF, sparse coding, topic model...

Not jointly convex, in fact NP-hard for fixed r ;

Convexify by not constraining the rank explicitly: relax r !

Refs: (Bengio et al’05; Bach-Mairal-Ponce’08; Zhang-Y-White-Sch’10)

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 15 / 25

Page 27: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

Convexification

minD,Φ

L(X ,DΦ) + λ · Ω(Φ).

Let D:i have unit norm (say `2);

Put row-wise norm on Φ: implicitlyconstraining the rank;

Rewrite X := DΦ =∑

i ‖Φi :‖ · D:iΦi :‖Φi :‖ ;

Reformulate

minX

L(X , X ) + λ · κ(X ) where

κ(X ) = inf∑i σi : X =∑

i σi · D:iΦi :‖Φi :‖;

Can apply GCG now, PO: mind,φ

d>Gtφ‖φ‖ .

Setting both norms to `2, we recover the matrix completion example.

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 16 / 25

Page 28: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

Table of Contents

1 Introduction

2 Generalized Conditional Gradient

3 Polar Operator

4 Conclusions

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 17 / 25

Page 29: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

Computing the Polar

The complexity of GCG is packed into the PO:min

x :κ(x)≤1〈g , x〉

= −κ(−g).

Recall that in the dictionary learning problem:mind,φ

d>Gφ

‖φ‖

= −

maxd‖G>d‖

Can easily become computationally intractable!

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 18 / 25

Page 30: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

Multi-view Learning

Partition d =

[bw

]and constrain their norms respectively.

Harder than single-view, but still doable (White-Y-Zhang-Sch’12):

max‖b‖=1,‖w‖=1

[b> w>

]GG>

[bw

]= tr

(GG>

[bw

] [b> w>

])2(2+1)

2 > 2.Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 19 / 25

Page 31: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

Reducing PO to Proximal

Consider the group regularizer:

Ψ(w) =∑

g‖w‖g .

Its polar

Ψ(u) = inf

maxg‖zg‖g :

∑gzg = u

does not seem to be easy to compute.

Theorem

For any gauge Ω, its polar Ω(y) equals the smallest ζ ≥ 0 s.t.min

Ω(x)≤ζ‖y − x‖2

2

= ‖y‖2

2 − 2 · proxζΩ(y) = 0,

where proxf (y) = minx12‖x− y‖2

2 + f (x) and Proxf (y) denotes the(unique) minimizer.

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 20 / 25

Page 32: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

Proximal Gradient

Considerminx∈C

f (x), where f ∈ C1L.

xt+1 = argminx∈C

f (xt) + 〈x − xt ,∇f (xt)〉+ L2‖x − xt‖2

2.

More generallyminx∈C

f (x) + g(x), where f ∈ C1L.

xt+1 = argminx∈C

f (xt) + 〈x − xt ,∇f (xt)〉+ L2‖x − xt‖2

2 + g(x)

= argminx∈C

g(x) + L2‖x − (xt − 1

L∇f (xt))‖22

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 21 / 25

Page 33: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

Decomposing the Proximal

How to compute the proximal operator for Ψ(w) =∑

g ‖w‖g?

Theorem (NEW?)

ProxΩ+Φ = ProxΦ ProxΩ for all gauges Ω iff Φ = c‖ · ‖2 for some c ≥ 0.

Corollary (Jenatton et al’11)

Let G be a collection of tree-structured groups, thatis, either g ⊆ g ′ or g ′ ⊆ g or g ∩ g ′ = ∅. Then

Prox∑i ‖·‖gi = Prox‖·‖g1

· · · Prox‖·‖gm ,

where we arrange the groups so that

gi ⊂ gj =⇒ i > j .

Prox2Ω = ProxΩ ProxΩ? More generally ProxΩ+Φ = f (ProxΩ,ProxΦ)?

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 22 / 25

Page 34: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

Table of Contents

1 Introduction

2 Generalized Conditional Gradient

3 Polar Operator

4 Conclusions

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 23 / 25

Page 35: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

Conclusions

We have

introduced the GCG;

discussed efficient computations of PO;

applied to matrix completion, group Lasso, etc.

Further questions

when the PO is “hard”?

nonsmooth loss?

online? stochastic?

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 24 / 25

Page 36: Generalized Conditional Gradient and Its Applicationsyaoliang/mytalks/kelowna.pdf · Generalized Conditional Gradient and Its Applications ... Generic form for many ML problems: min

Thank you !

Y-L. Yu (UofA) GCG and Its Apps. UBC – Kelowna, 04/18/13 25 / 25