
Alternating Minimization (and Friends)

Lecture 7: 6.883, Spring 2016

Suvrit Sra

Massachusetts Institute of Technology

Feb 29, 2016


Background: Coordinate Descent

For x ∈ R^n consider

min f(x) = f(x_1, x_2, ..., x_n)

Ancient idea: optimize over individual coordinates


Coordinate descent

For k = 0, 1, ...
    Pick an index i from {1, ..., n}
    Optimize the i-th coordinate:

        x_i^{k+1} ← argmin_{ξ ∈ R} f(x_1^{k+1}, ..., x_{i-1}^{k+1}, ξ, x_{i+1}^k, ..., x_n^k)

    (coordinates 1, ..., i−1 are "done", ξ is the "current" coordinate, coordinates i+1, ..., n are "todo")

Decide when/how to stop; return x^k

(In an implementation, x_i^{k+1} overwrites the value in x_i^k.)
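To make the loop concrete, here is a minimal Python/NumPy sketch; the cyclic index rule, the generic derivative-free 1-D solver from SciPy, and the stopping test are illustrative choices, not prescribed by the slide:

import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_descent(f, x0, sweeps=100, tol=1e-8):
    """Cyclic coordinate descent: repeatedly minimize f over one coordinate at a
    time, overwriting the iterate in place (as on the slide)."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(sweeps):
        x_old = x.copy()
        for i in range(x.size):                 # pick indices cyclically
            def f_i(xi):                        # f as a function of the i-th coordinate only
                y = x.copy()
                y[i] = xi
                return f(y)
            x[i] = minimize_scalar(f_i).x       # derivative-free 1-D minimization
        if np.linalg.norm(x - x_old) < tol:     # a simple stopping rule
            break
    return x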


Coordinate descent

♣ One of the simplest optimization methods
♣ Various ideas for next coordinate to optimize
♣ Old idea: Gauss-Seidel, Jacobi methods for Ax = b
♣ Can be "slow"; sometimes very competitive
♣ Gradient, stochastic gradient also "slow"
♣ But scalable (e.g., libsvm)
♣ Renewed interest; esp. stochastic CD
♣ Notice: in general CD is "derivative free"


Example: Least-squares

Assume A ∈ R^{m×n}

min ‖Ax − b‖_2²

Coordinate descent update

x_j ← ( ∑_{i=1}^m a_ij ( b_i − ∑_{l≠j} a_il x_l ) ) / ( ∑_{i=1}^m a_ij² )

(dropped superscripts, since we overwrite)
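A minimal NumPy sketch of this update (an illustration, not from the slides), maintaining the residual r = b − Ax so each coordinate step costs O(m):

import numpy as np

def cd_least_squares(A, b, sweeps=200):
    """Cyclic coordinate descent for min ||Ax - b||_2^2 using the closed-form
    single-coordinate update above (coordinates overwritten in place)."""
    m, n = A.shape
    x = np.zeros(n)
    r = b - A @ x                        # residual r = b - Ax
    for _ in range(sweeps):
        for j in range(n):
            aj = A[:, j]
            rho = aj @ (r + aj * x[j])   # numerator: a_j . (b - sum_{l!=j} a_l x_l)
            x_new = rho / (aj @ aj)
            r += aj * (x[j] - x_new)     # incremental residual update
            x[j] = x_new
    return x

# quick check against the least-squares solution (assumes A has full column rank)
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5)); b = rng.normal(size=20)
print(np.allclose(cd_least_squares(A, b), np.linalg.lstsq(A, b, rcond=None)[0], atol=1e-6))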


Coordinate descent remarks

Advantages
♦ Each iteration usually cheap (single-variable optimization)
♦ No extra storage vectors needed
♦ No stepsize tuning
♦ No other pesky parameters that must be tuned
♦ Simple to implement
♦ Works well for large-scale problems
♦ Currently quite popular; parallel versions exist

Disadvantages
♠ Tricky if single-variable optimization is hard
♠ Convergence theory can be complicated
♠ Can slow down near optimum
♠ Non-differentiable case more tricky


Block coordinate descent (BCD)

min f(x) := f(x_1, ..., x_n),   x ∈ X_1 × X_2 × · · · × X_m.

Gauss-Seidel update

x_i^{k+1} ← argmin_{ξ ∈ X_i} f(x_1^{k+1}, ..., x_{i-1}^{k+1}, ξ, x_{i+1}^k, ..., x_m^k)

(blocks 1, ..., i−1 are "done", ξ is the "current" block, blocks i+1, ..., m are "todo")

Jacobi update (easy to parallelize)

x_i^{k+1} ← argmin_{ξ ∈ X_i} f(x_1^k, ..., x_{i-1}^k, ξ, x_{i+1}^k, ..., x_m^k)

(earlier blocks stay at iterate k, i.e., "don't clobber"; ξ is the "current" block; the rest are "todo")
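To see the difference in code: a minimal Python sketch (blocks kept as a list of NumPy arrays; argmin_block is a placeholder for the per-block solver, not something defined on the slides). Gauss-Seidel lets later blocks see values updated earlier in the same sweep; Jacobi computes every update from the previous iterate, so the updates are independent and parallelizable.

def gauss_seidel_sweep(f, argmin_block, blocks):
    # later blocks see the freshly updated earlier blocks ("done")
    for i in range(len(blocks)):
        blocks[i] = argmin_block(f, blocks, i)
    return blocks

def jacobi_sweep(f, argmin_block, blocks):
    # freeze the previous iterate so no update clobbers another's input
    old = [b.copy() for b in blocks]
    return [argmin_block(f, old, i) for i in range(len(old))]   # embarrassingly parallel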


Two block BCD

min f(x, y)   s.t. x ∈ X, y ∈ Y.

Theorem (Grippo & Sciandrone, 2000). Let f be continuously differentiable and X, Y be closed, convex sets. Assume both subproblems have solutions and that the sequence (x^k, y^k) has limit points. Then every limit point is stationary.

▸ Subproblems need not have unique solutions
▸ BCD for 2 blocks is aka Alternating Minimization


AltMin


Clustering

[Figure: a data matrix with symbolic entries (a, +, z, −, *) in scrambled order]

[Figure: the same matrix after clustering the columns and permuting: similar columns are grouped ("Clustered matrix, after clustering and permutation")]

[Figure: the same matrix after co-clustering rows and columns and permuting: the matrix becomes block-structured ("Co-clustered matrix, after co-clustering and permutation")]

Clustering

▸ Let X ∈ R^{m×n} be the input matrix
▸ Cluster the columns of X
▸ The well-known k-means clustering problem can be written as

    min_{B,C} ½ ‖X − BC‖_F²   s.t.  C^T C = Diag(sizes)

  where B ∈ R^{m×k} and C ∈ {0,1}^{k×n}.
▸ Optimization problem with 2 blocks; min F(B, C)
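As an illustration of the two-block structure (not from the slides): alternating exact minimization over B (centroids, with assignments fixed) and over C (assignments, with centroids fixed) is exactly Lloyd's k-means iteration. A minimal NumPy sketch:

import numpy as np

def kmeans_altmin(X, k, iters=50, seed=0):
    """Alternating minimization of 0.5*||X - BC||_F^2 over B (centroids) and
    C (one-hot column assignments). X is m x n; columns are the data points."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    B = X[:, rng.choice(n, size=k, replace=False)].astype(float)   # init centroids from data
    for _ in range(iters):
        # C-step: assign each column of X to its nearest centroid
        d = ((X[:, None, :] - B[:, :, None]) ** 2).sum(axis=0)     # shape (k, n)
        labels = d.argmin(axis=0)
        # B-step: each centroid is the mean of its assigned columns (skip empty clusters)
        for j in range(k):
            if np.any(labels == j):
                B[:, j] = X[:, labels == j].mean(axis=1)
    C = np.zeros((k, n)); C[labels, np.arange(n)] = 1.0            # explicit assignment matrix
    return B, C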


Exercise: Write co-clustering in matrix form
Hint: Write it using 3 blocks


Matrix Completion

Given a matrix with missing entries, fill in the rest
Recall the Netflix million-dollar prize problem
Given User-Movie ratings, recommend movies to users


Input: matrix A with missing entries
"Predict" the missing entries to "complete" the matrix
Netflix: movies × users matrix; the available entries were ratings given to movies by users
Task: predict the missing entries
Winning methods were based on low-rank matrix completion

A (m × n)  ≈  B (m × k) · C (k × n)


Task: Recover the matrix A given a sampling of its entries
Theorem: Can recover most low-rank matrices!


Matrix completion

min rank(X)
s.t. X_ij = A_ij,  ∀(i, j) ∈ Ω = {rating pairs}

another formulation

min ∑_{(i,j)∈Ω} (X_ij − A_ij)²
s.t. rank(X) ≤ k.

Both are NP-hard problems.

convex relaxation

rank(X) ≤ k   ↦   ( ‖X‖_* := ∑_{j=1}^m σ_j(X) ) ≤ k

Candès and Recht prove that the convex relaxation solves matrix completion (under assumptions on Ω and A).
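For small instances the nuclear-norm relaxation can be handed directly to a convex solver. A minimal sketch using the cvxpy package (an illustration, not part of the slides; assumes a solver that handles the resulting cone program, e.g., SCS):

import numpy as np
import cvxpy as cp

def complete_nuclear_norm(A, mask):
    """min ||X||_* subject to X_ij = A_ij on the observed set Omega (boolean mask).
    Unobserved entries of A are ignored."""
    M = mask.astype(float)
    X = cp.Variable(A.shape)
    constraints = [cp.multiply(M, X) == np.where(mask, A, 0.0)]   # equality only on observed entries
    prob = cp.Problem(cp.Minimize(cp.normNuc(X)), constraints)
    prob.solve()
    return X.value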


Matrix completion

▸ The convex relaxation does not scale well
▸ A commonly used heuristic is Alternating Minimization
▸ Write X = BC, where B is m × k and C is k × n

min_{B,C} F(B, C) := ‖P_Ω(A) − P_Ω(BC)‖_F²,

where [P_Ω(X)]_ij = X_ij for (i, j) ∈ Ω, and 0 otherwise.

Result:
▸ Initialize B, C using the SVD of P_Ω(A)
▸ Run AltMin iterations to compute B and C
▸ It can be shown (Jain, Netrapalli, Sanghavi 2012), under assumptions on Ω (uniform sampling) and A (incoherence: most entries similar in magnitude), that AltMin generates B and C such that ‖A − BC‖_F ≤ ε after O(log(1/ε)) steps.
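A minimal NumPy sketch of this alternating least-squares heuristic (an illustration: random initialization instead of the SVD-based one on the slide, and a plain per-row/per-column least-squares solve):

import numpy as np

def altmin_completion(A, mask, k, iters=30, seed=0):
    """Alternate exact least-squares solves for B (m x k) and C (k x n) to fit
    the observed entries of A (mask is a boolean m x n array marking Omega)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    B = rng.normal(size=(m, k))
    C = rng.normal(size=(k, n))
    for _ in range(iters):
        # update each column of C from the entries observed in that column
        for j in range(n):
            rows = mask[:, j]
            if rows.any():
                C[:, j] = np.linalg.lstsq(B[rows], A[rows, j], rcond=None)[0]
        # update each row of B from the entries observed in that row
        for i in range(m):
            cols = mask[i, :]
            if cols.any():
                B[i, :] = np.linalg.lstsq(C[:, cols].T, A[i, cols], rcond=None)[0]
    return B, C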


p(x) := ∑_{k=1}^K π_k p_N(x; Σ_k, μ_k)

Gaussian Mixture Model


½ ‖a ∗ x − y‖² + λ Ω(x)

Image deblurring


∑_{i=1}^n ½ ‖y_i − D c_i‖² + Ω_1(c_i) + Ω_2(D)      (Mairal et al., 2010)

Dictionary learning, matrix factorization


Online matrix factorization

time t:   y_t = a_t ∗ x + n_t

[Figure: a sequence of observed frames y_0, y_1, y_2, ..., y_k, each modeled as a blur kernel a_t convolved with a shared sharp image x plus noise n_t]


Non-online formulation

[ y_1 | · · · | y_n ]  ≈  [ a_1 | · · · | a_t ] ∗ x

Rewrite:  a ∗ x = A x = X a

[ y_1  y_2  · · ·  y_t ]  ≈  X [ a_1  a_2  · · ·  a_t ],   i.e.,   Y ≈ X A


Why online?

Example: 5000 frames of size 512 × 512
Y_{262144×5000} ≈ X_{262144×262144} A_{262144×5000}

Without structure: ≈ 70 billion parameters!
With structure: ≈ 4.8 million parameters!

Despite structure, alternating minimization is impractical:
fixing X and solving for A requires updating ≈ 4.5 million params


Online matrix factorization

min_{A_t, x}  ∑_{t=1}^T ½ ‖y_t − A_t x‖² + Ω(x) + Γ(A_t)

Initialize guess x^0
For t = 1, 2, ...
  1. Observe image y_t
  2. Use x^{t−1} to estimate A_t
  3. Solve an optimization subproblem to obtain x^t

Step 2: model and estimate the blur A_t (separate lecture)
Step 3: convex subproblem (reuse convex subroutines)

Doing Steps 2 and 3 online =⇒ realtime processing!

Video
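A schematic of this loop in Python (estimate_blur and update_image are placeholders for Steps 2 and 3, which the slides defer to a separate lecture and to standard convex subroutines; only the control flow is shown):

def online_factorization(frames, x0, estimate_blur, update_image):
    """Process frames one at a time, alternating a blur estimate and an image update."""
    x = x0
    blurs = []
    for y_t in frames:                 # Step 1: observe the next frame y_t
        A_t = estimate_blur(y_t, x)    # Step 2: estimate A_t using x^{t-1}
        x = update_image(y_t, A_t, x)  # Step 3: convex subproblem gives x^t
        blurs.append(A_t)
    return x, blurs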


Some math: NMF


Nonnegative matrix factorization

We want a low-rank approximation A ≈ BC

SVD yields dense B and C
B and C contain negative entries, even if A ≥ 0
SVD factors may not be that easy to interpret

NMF imposes B ≥ 0, C ≥ 0


Algorithms

A ≈ BC   s.t.  B, C ≥ 0

Least-squares NMF

min ½ ‖A − BC‖_F²   s.t.  B, C ≥ 0.

KL-divergence NMF

min ∑_ij  a_ij log( a_ij / (BC)_ij ) − a_ij + (BC)_ij   s.t.  B, C ≥ 0.

♣ NP-hard (Vavasis 2007) – no surprise
♣ Recently, Arora et al. showed that if the matrix A has a special "separable" structure, then globally optimal NMF is approximately solvable. More recent progress too!
♣ We look at only basic methods in this lecture


NMF Algorithms

Hack: Compute the TSVD; "zero out" negative entries
Alternating minimization (AM)
Majorize-Minimize (MM)
Global optimization (not covered)
"Online" algorithms (not covered)


AltMin / AltDesc

min F(B, C)

Alternating Descent

1. Initialize B^0, k ← 0
2. Compute C^{k+1} s.t. F(B^k, C^{k+1}) ≤ F(B^k, C^k)
3. Compute B^{k+1} s.t. F(B^{k+1}, C^{k+1}) ≤ F(B^k, C^{k+1})
4. k ← k + 1, and repeat until a stopping criterion is met.

Hence  F(B^{k+1}, C^{k+1}) ≤ F(B^k, C^{k+1}) ≤ F(B^k, C^k)


AltMin
Alternating Least Squares

C = argmin_C ‖A − B^k C‖_F²;    C^{k+1} ← max(0, C)

B = argmin_B ‖A − B C^{k+1}‖_F²;    B^{k+1} ← max(0, B)


ALS is fast, simple, often effective, but ...

‖A − B^{k+1}C^{k+1}‖_F² ≤ ‖A − B^k C^{k+1}‖_F² ≤ ‖A − B^k C^k‖_F²

... descent need not hold: the max(0, ·) step projects an unconstrained minimizer onto the nonnegative orthant, which can increase the objective.

[Figure: the projection (x_uc)_+ of the unconstrained minimizer x_uc need not be close to the constrained optimum x*]


Alternating Minimization: correctly

Use alternating nonnegative least squares:

C^{k+1} = argmin_C ‖A − B^k C‖_F²    s.t. C ≥ 0
B^{k+1} = argmin_B ‖A − B C^{k+1}‖_F²    s.t. B ≥ 0

Advantages: guaranteed descent; the theory of block-coordinate descent guarantees convergence to a stationary point.

Disadvantages: more complex; slower than ALS.
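A minimal NumPy/SciPy sketch contrasting the two variants (not from the slides): ALS-with-clipping projects an unconstrained least-squares solution onto the nonnegative orthant, while alternating nonnegative least squares (ANLS) solves each subproblem with the constraint, here column by column via scipy.optimize.nnls, and therefore guarantees descent.

import numpy as np
from scipy.optimize import nnls

def nmf_altmin(A, k, iters=50, seed=0, clip_only=False):
    """Alternating updates for min 0.5*||A - BC||_F^2 with B, C >= 0.
    clip_only=True mimics ALS-with-clipping (fast, but descent need not hold);
    otherwise each subproblem is a nonnegative least-squares solve (ANLS)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    B = np.abs(rng.normal(size=(m, k)))
    C = np.abs(rng.normal(size=(k, n)))
    for _ in range(iters):
        if clip_only:
            C = np.maximum(0.0, np.linalg.lstsq(B, A, rcond=None)[0])
            B = np.maximum(0.0, np.linalg.lstsq(C.T, A.T, rcond=None)[0].T)
        else:
            C = np.column_stack([nnls(B, A[:, j])[0] for j in range(n)])
            B = np.vstack([nnls(C.T, A[i, :])[0] for i in range(m)])
    return B, C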


Just Descent


Majorize-Minimize (MM)

Consider F(B, C) = ½ ‖A − BC‖_F²: convex separately in B and C

We use F(C) to denote the function restricted to C.

Since F(C) is separable (over the columns of C), we just illustrate

min_{c≥0} f(c) = ½ ‖a − Bc‖_2²

Recall, our aim is: find C^{k+1} such that F(B^k, C^{k+1}) ≤ F(B^k, C^k)


[Figure: surrogate functions G(·; c^t) and G(·; c^{t+k}) lie above F(c) and touch it at c^t and c^{t+k}; minimizing each surrogate produces the next iterates c^{t+1}, ..., c^{t+k}]


Descent technique

min_{c≥0} f(c) = ½ ‖a − Bc‖_2²

1. Find a function g(c, c̄) that satisfies:

       g(c, c) = f(c),   for all c,
       g(c, c̄) ≥ f(c),   for all c, c̄.

2. Compute c^{t+1} = argmin_{c≥0} g(c, c^t)

3. Then we have descent:

       f(c^{t+1})  ≤  g(c^{t+1}, c^t)     (by definition of g)
                   ≤  g(c^t, c^t)         (since c^{t+1} is the argmin)
                   =  f(c^t)              (by definition of g).


Constructing g for f

We exploit that h(x) = ½ x² is a convex function:

h(∑_i λ_i x_i) ≤ ∑_i λ_i h(x_i),   where λ_i ≥ 0, ∑_i λ_i = 1

f(c) = ½ ∑_i (a_i − b_i^T c)²
     = ½ ∑_i [ a_i² − 2 a_i b_i^T c + (b_i^T c)² ]
     = ½ ∑_i [ a_i² − 2 a_i b_i^T c ] + ½ ∑_i ( ∑_j b_ij c_j )²
     = ½ ∑_i [ a_i² − 2 a_i b_i^T c ] + ½ ∑_i ( ∑_j λ_ij (b_ij c_j / λ_ij) )²
     ≤ ½ ∑_i [ a_i² − 2 a_i b_i^T c ] + ½ ∑_ij λ_ij (b_ij c_j / λ_ij)²      (by convexity)
     =: g(c, c̄),   where the λ_ij are convex coefficients (chosen below as functions of c̄)


Constructing g for f

f(c) = ½ ‖a − Bc‖_2²

g(c, c̄) = ½ ‖a‖_2² − ∑_i a_i b_i^T c + ½ ∑_ij λ_ij (b_ij c_j / λ_ij)².

It only remains to pick the λ_ij as functions of c̄:

λ_ij = b_ij c̄_j / ∑_k b_ik c̄_k = b_ij c̄_j / (b_i^T c̄)

Exercise: Verify that g(c, c) = f(c).
Exercise: Let f(c) = ∑_i a_i log(a_i/(Bc)_i) − a_i + (Bc)_i. Derive an auxiliary function g(c, c̄) for this f(c).


NMF updates

Key step

c^{t+1} = argmin_{c≥0} g(c, c^t).

Exercise: Solve ∂g(c, c^t)/∂c_p = 0 to obtain

c_p = c_p^t · [B^T a]_p / [B^T B c^t]_p

This yields the "multiplicative update" algorithm of Lee/Seung (1999).
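Applying this column-wise update to all of C at once (and the symmetric update to B) gives a compact implementation; a minimal NumPy sketch (A is assumed nonnegative; the small eps guarding against division by zero is an implementation detail, not from the slides):

import numpy as np

def nmf_multiplicative(A, k, iters=200, seed=0, eps=1e-12):
    """Lee/Seung-style multiplicative updates for least-squares NMF.
    Each factor is rescaled entrywise, so nonnegativity is preserved and
    the objective 0.5*||A - BC||_F^2 never increases (A assumed >= 0)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    B = np.abs(rng.normal(size=(m, k)))
    C = np.abs(rng.normal(size=(k, n)))
    for _ in range(iters):
        C *= (B.T @ A) / (B.T @ B @ C + eps)   # c_p <- c_p [B^T a]_p / [B^T B c]_p
        B *= (A @ C.T) / (B @ C @ C.T + eps)   # symmetric update for B
    return B, C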


MM algorithms

We exploited convexity of x²
Expectation Maximization (EM) algorithm exploits convexity of − log x
Other choices possible, e.g., by varying λ_ij
Our technique is one variant of a repertoire of Majorization-Minimization (MM) algorithms
Gradient descent is also an MM algorithm
Related to d.c. programming
MM algorithms are the subject of a separate lecture!


EM algorithm

Assume p(x) = ∑_{j=1}^K π_j p(x; θ_j) is a mixture density.

ℓ(X; Θ) := ∑_{i=1}^n ln( ∑_{j=1}^K π_j p(x_i; θ_j) ).

Use convexity of − log t to compute a lower bound:

ℓ(X; Θ) ≥ ∑_ij β_ij ln( π_j p(x_i; θ_j) / β_ij ).

E-Step: Optimize over the β_ij, setting them to the posterior probabilities:

β_ij := π_j p(x_i; θ_j) / ∑_l π_l p(x_i; θ_l).

M-Step: Optimize the bound over Θ, using the above β values.
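A minimal sketch of these two steps for a Gaussian mixture, using NumPy and scipy.stats (an illustration only: random initialization from the data and no safeguards against degenerate components):

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, iters=100, seed=0):
    """EM for a K-component Gaussian mixture; X has shape (n, d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, size=K, replace=False)]               # init means from the data
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)     # shared initial covariance
    for _ in range(iters):
        # E-step: beta_ij = posterior responsibility of component j for point i
        dens = np.column_stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                                for j in range(K)])            # shape (n, K)
        beta = dens / dens.sum(axis=1, keepdims=True)
        # M-step: maximize the lower bound over (pi, mu, Sigma)
        Nj = beta.sum(axis=0)
        pi = Nj / n
        mu = (beta.T @ X) / Nj[:, None]
        for j in range(K):
            Xc = X - mu[j]
            Sigma[j] = (beta[:, j, None] * Xc).T @ Xc / Nj[j] + 1e-6 * np.eye(d)
    return pi, mu, Sigma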


Other Alternating methods

Alternating Projections
Alternating Reflections
(Nonconvex) ADMM (e.g., arXiv:1410.1390)
(Nonconvex) Douglas-Rachford (e.g., Borwein's webpage!)
AltMin for global optimization (we saw)
BCD with more than 2 blocks
ADMM with more than 2 blocks
Several others...


Alternating Proximal Method

min L(x, y) := f(x, y) + g(x) + h(y).

Assume: ∇f is Lipschitz continuous on bounded subsets of R^m × R^n,
g is lower semicontinuous on R^m,
h is lower semicontinuous on R^n.
Example: f(x, y) = ½ ‖x − y‖²

Alternating Proximal Method

x^{k+1} ∈ argmin_x  L(x, y^k) + (1/(2c_k)) ‖x − x^k‖²
y^{k+1} ∈ argmin_y  L(x^{k+1}, y) + (1/(2c'_k)) ‖y − y^k‖²,

where c_k, c'_k are suitable sequences of positive scalars.

[arXiv:0801.1780. Attouch, Bolte, Redont, Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems.]
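A schematic of the iteration in Python (the two regularized subproblem solvers depend on f, g, h and are left as user-supplied placeholders; this sketches only the control flow):

def alternating_proximal(x0, y0, solve_x, solve_y, c, c_prime, iters=100):
    """solve_x(y, x_prev, ck) should return a minimizer of L(x, y) + (1/(2*ck))*||x - x_prev||^2;
    solve_y(x, y_prev, ck2) the analogous y-subproblem."""
    x, y = x0, y0
    for k in range(iters):
        x = solve_x(y, x, c[k])          # proximally regularized x-update
        y = solve_y(x, y, c_prime[k])    # proximally regularized y-update
    return x, y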
