
Page 1:

SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm

Damek Davis

Department of Mathematics, University of California, Los Angeles /
School of Operations Research and Information Engineering, Cornell University

http://www.math.ucla.edu/~damek

Page 2:

\[ \operatorname*{minimize}_{x \in \mathbb{R}^m} \; f(x) := \frac{1}{n}\sum_{i=1}^n f_i(a_i^T x) \]

• The empirical risk minimization problem (ERM)
• A = (a_1, . . . , a_n) ∈ R^{m×n}
• n = number of training examples
• m = number of features

• Nice properties:
  • f_i : R → R is smooth, one-dimensional, and convex
  • ∇(f_i ∘ a_i^T) : R^m → R^m lies in a one-dimensional space:
    \[ \nabla (f_i \circ a_i^T)(x) = a_i f_i'(a_i^T x) \in \mathrm{Range}(a_i). \]

• So to compute one gradient, we need one inner product and one scalar derivative.

Page 3: SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm · SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm Damek Davis1 Department of Mathematics University

• Gradient Descent: (fast; high per iteration cost; low memory)

xk+1 = xk − γ

n

n∑i=1

aif′i(aTi xk)

• Need to compute AT xk and all scalar gradients, then sum them together.

• Stochastic Gradient (slow; low per iteration cost; low memory)

Sample ik ∈ {1, . . . , n} uniformlyxk+1 = xk − γkaikf

′ik (aTikx)

• Need γk → 0, which can be slow!

2 / 31
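To make the per-iteration cost comparison concrete, here is a minimal NumPy sketch of both updates for least-squares ERM, where f_i(t) = ½(t − b_i)² so that f_i'(t) = t − b_i; the random data A, b and the step sizes are illustrative assumptions, not choices from the talk.

import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 50
A = rng.standard_normal((n, m))      # rows are a_i^T (the sketch stores the data row-wise)
b = rng.standard_normal(n)

def gradient_descent_step(x, gamma=1e-2):
    # full gradient: one matrix-vector product, n scalar derivatives, one sum
    derivs = A @ x - b               # f_i'(a_i^T x) for every i
    return x - gamma * (A.T @ derivs) / n

def sgd_step(x, k, gamma0=1e-2):
    # one inner product and one scalar derivative, but the step size must decay
    i = rng.integers(n)
    return x - (gamma0 / np.sqrt(k + 1)) * (A[i] @ x - b[i]) * A[i]

x_gd, x_sgd = np.zeros(m), np.zeros(m)
for k in range(1000):
    x_gd = gradient_descent_step(x_gd)
    x_sgd = sgd_step(x_sgd, k)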


Page 5:

• Stochastic Variance Reduced Gradient (SVRG): (fast; some high-cost iterations, but mostly low; low memory; see the sketch below)

  Sample i_k ∈ {1, . . . , n} uniformly
  \[ x^{k+1} = x^k - \gamma\left( a_{i_k} f_{i_k}'(a_{i_k}^T x^k) - a_{i_k} f_{i_k}'(a_{i_k}^T \phi^k) + \frac{1}{n}\sum_{i=1}^n a_i f_i'(a_i^T \phi^k) \right) \]
  \[ \phi^{k+1} = \begin{cases} x^k & \text{if } k \equiv 0 \bmod \tau; \\ \phi^k & \text{otherwise.} \end{cases} \]

• Every τ iterations, recompute the full gradient
  \[ \nabla f(x^k) = \frac{1}{n}\sum_{i=1}^n a_i f_i'(a_i^T x^k); \]
  otherwise, reuse the stored ∇f(φ^k).
• Two scalar derivatives computed per iteration.
• ∇f(φ^k) stored ⇒ memory is one m-dimensional vector.
• Strong convexity assumed.
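A minimal NumPy sketch of this loop for the same least-squares ERM as in the previous sketch; the snapshot period τ, the step size γ, and the data are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n, m, tau, gamma = 200, 50, 200, 1e-3
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)

x = np.zeros(m)
phi = x.copy()                               # snapshot phi^k
full_grad = A.T @ (A @ phi - b) / n          # stored nabla f(phi^k): one m-vector
for k in range(2000):
    i = rng.integers(n)
    g = A[i] * (A[i] @ x - b[i]) - A[i] * (A[i] @ phi - b[i]) + full_grad
    x_new = x - gamma * g
    if k % tau == 0:                         # every tau iterations: phi <- x^k, recompute full gradient
        phi = x.copy()
        full_grad = A.T @ (A @ phi - b) / n
    x = x_new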

Page 6:

• Finito: (fast; low per-iteration cost; HIGH memory)

  Sample i_k ∈ {1, . . . , n} uniformly
  \[ x_i^{k+1} = \begin{cases} \dfrac{1}{n}\displaystyle\sum_{l=1}^n \big( x_l^k - \gamma\, a_l f_l'(a_l^T x_l^k) \big) & \text{if } i = i_k; \\ x_i^k & \text{otherwise.} \end{cases} \]

• Need to store the points x_1^k, . . . , x_n^k AND the scalar derivatives f_1'(a_1^T x_1^k), . . . , f_n'(a_n^T x_n^k).
• Strong convexity assumed.

Page 7:

• Stochastic Average Gradient (SAG): (fast; low per-iteration cost; low memory)

  Sample i_k ∈ {1, . . . , n} uniformly
  \[ x^{k+1} = x^k - \frac{\gamma}{n}\left( a_{i_k} f_{i_k}'(a_{i_k}^T x^k) + \sum_{i \ne i_k} a_i z_i^k \right) \]
  \[ z_i^{k+1} = \begin{cases} f_{i_k}'(a_{i_k}^T x^k) & \text{if } i = i_k; \\ z_i^k & \text{otherwise.} \end{cases} \]

• Memory is an n-dimensional vector (z_1^k, . . . , z_n^k).
• Biased gradient estimate:
  \[ \mathbb{E}\left[ \frac{1}{n}\Big( a_{i_k} f_{i_k}'(a_{i_k}^T x^k) + \sum_{i \ne i_k} a_i z_i^k \Big) \,\Big|\, x^k, \dots, x^0 \right] = \frac{1}{n}\nabla f(x^k) + \Big(1 - \frac{1}{n}\Big)\frac{1}{n}\sum_{i=1}^n a_i z_i^k. \]

• COMPLICATED PROOF.
• First incremental method where strong convexity is NOT assumed.

Page 8:

• SAGA: (fast; low per-iteration cost; low memory; see the sketch below)

  Sample i_k ∈ {1, . . . , n} uniformly
  \[ x^{k+1} = x^k - \gamma\left( a_{i_k} f_{i_k}'(a_{i_k}^T x^k) - a_{i_k} z_{i_k}^k + \frac{1}{n}\sum_{i=1}^n a_i z_i^k \right) \]
  \[ z_i^{k+1} = \begin{cases} f_{i_k}'(a_{i_k}^T x^k) & \text{if } i = i_k; \\ z_i^k & \text{otherwise.} \end{cases} \]

• Memory is an n-dimensional vector (z_1^k, . . . , z_n^k).
• Unbiased gradient estimate.
• Relatively simple proof.
• Strong convexity NOT assumed.
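A minimal NumPy sketch of the SAGA update, again for least-squares ERM with f_i'(t) = t − b_i; the data and the step size γ are illustrative assumptions. Only the scalars z_i are stored, so the memory overhead is one n-vector plus the running sum.

import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 200, 50, 1e-3
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)

x = np.zeros(m)
z = np.zeros(n)                  # stored scalar derivatives z_i
avg = np.zeros(m)                # (1/n) * sum_i a_i z_i

for k in range(5000):
    i = rng.integers(n)
    g_new = A[i] @ x - b[i]      # fresh scalar derivative f_i'(a_i^T x^k)
    # unbiased estimate: fresh term - stored term + stored average
    x = x - gamma * (A[i] * g_new - A[i] * z[i] + avg)
    avg += A[i] * (g_new - z[i]) / n     # keep the average in sync with the table
    z[i] = g_new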

Page 9: SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm · SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm Damek Davis1 Department of Mathematics University

• Image stolen from SAGA paper• SDCA only solves `2 regularized problem, so we ignored it.• Point: All perform about the same, besides Finito perm, which isn’t

guaranteed to converge.7 / 31

Page 10: SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm · SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm Damek Davis1 Department of Mathematics University

• Today:

• SMART extends incremental aggregated gradient and coordinate descentmethods.

• SMART solves the ERM problem, and this seems to be its the mosteffective use, but it can go much further.

• In addition, SMART recovers: SAGA, Finito, SVRG, and SDCA.

• SAGA seems to be the catalyst for a lot the other methods, so let’sextend SAGA as much as possible.

8 / 31

Page 11: SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm · SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm Damek Davis1 Department of Mathematics University

• What if data matrix is sparse?

• The gradient aikf ′ik (aTikxk) only has a few nonzero components.

• SAGA requires dense update for xk because the sum is densem∑i=1

aizki

• We should only update components of x that are in the support of aik ,i.e., apply mask to the gradient sum.

• Makes gradient biased, no reason it should work.

9 / 31

Page 12: SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm · SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm Damek Davis1 Department of Mathematics University

• To avoid biased gradients, need to scale components of updates.

• Let Ci ⊆ {1, . . . ,m} be the support of ai.• Let eCi

=∑

j∈Ciei ← component mask.

• Let q and pi be vectors of probabilities (easy to precompute)

• Sparse SAGA:

Sample ik ∈ {1, . . . , n} uniformly

xk+1 = xk − γeCik� q �

(pi � (aikf

′ik (aTikx

k)− aikzkik ) + 1

n

n∑i=1

aizki

)

zk+1i =

{f ′ik (aTikx

k) if i = ik;

zki otherwise.

10 / 31


Page 14:

• The sparse update equation is a block coordinate update equation.

• Block Coordinate SAGA:

  Sample i_k ∈ {1, . . . , n} uniformly and S_k ⊆ {1, . . . ,m} arbitrarily
  \[ x^{k+1} = x^k - \gamma\, e_{S_k} \odot q \odot \left( p_{i_k} \odot \big( a_{i_k} f_{i_k}'(a_{i_k}^T x^k) - a_{i_k} z_{i_k}^k \big) + \frac{1}{n}\sum_{i=1}^n a_i z_i^k \right) \]
  \[ z_i^{k+1} = \begin{cases} f_{i_k}'(a_{i_k}^T x^k) & \text{if } i = i_k; \\ z_i^k & \text{otherwise.} \end{cases} \]

• The coordinates and the gradient can be coupled.
• With only one function, we recover block coordinate descent.

Page 15: SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm · SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm Damek Davis1 Department of Mathematics University

• We only compute one gradient per iteration, but we can gain a bit inperformance if we compute a few more.

• Introducing the trigger graph: G = (V,E).1. V = {1, . . . , n}2. E ⊆ V × V.

• We say that index i in V triggers i in V provided (i, i′) is in E.

• Minibatching SAGA:

Sample ik ∈ {1, . . . , n} uniformly

xk+1 = xk − γ

(aikf

′ik (aTikx

k)− aikzkik + 1

n

n∑i=1

aizki

)

zk+1i =

{f ′ik (aTikx

k) if ik triggers i;

zki otherwise.

• Improves theoretical convergence rate and practical performance.

12 / 31
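A minimal sketch of how a trigger graph drives the dual-table refresh, reusing the least-squares setup of the earlier sketches; the particular graph (each index triggers itself and its successor) and the step size are illustrative assumptions, not choices from the talk.

import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 200, 50, 1e-3
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)

# trigger graph as adjacency sets: index i triggers every index in triggers[i]
triggers = {i: {i, (i + 1) % n} for i in range(n)}

x = np.zeros(m)
z = np.zeros(n)
avg = np.zeros(m)                            # (1/n) * sum_i a_i z_i

for k in range(5000):
    ik = rng.integers(n)
    g_new = A[ik] @ x - b[ik]
    # the primal step is plain SAGA ...
    x_new = x - gamma * (A[ik] * g_new - A[ik] * z[ik] + avg)
    # ... but every index triggered by i_k gets a fresh derivative at x^k
    for i in triggers[ik]:
        gi = A[i] @ x - b[i]
        avg += A[i] * (gi - z[i]) / n
        z[i] = gi
    x = x_new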


Page 17: SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm · SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm Damek Davis1 Department of Mathematics University

• What if we’re not solving ERM problem, but we solve

minn∑i=1

fi(x)

• Then SAGA becomes a high memory method!

Sample ik ∈ {1, . . . , n} uniformly

xk+1 = xk − γ

(∇fik (xk)− ykik + 1

n

n∑i=1

yki

)

yk+1i =

{∇fik (xk) if i = ik;

yki otherwise.

• Question: Instead of saving individual gradients, can we just store thesum 1

n

∑n

i=1 yki , and periodically recompute it?

13 / 31


Page 19: SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm · SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm Damek Davis1 Department of Mathematics University

• Randomized delay εk + complete trigger graph = SVRG clone

Sample ik ∈ {1, . . . , n} uniformly and εk ∈ {0, 1}

xk+1 = xk − γ

(∇fik (xk)− ykik + 1

n

n∑i=1

yki

)yk+1i = yki + εk(∇fik (xk)− yki ).

• The trick: yki = ∇fi(φk) for old iterate xk.

Sample ik ∈ {1, . . . , n} uniformly and εk ∈ {0, 1}

xk+1 = xk − γ

(∇fik (xk)−∇fik (φk) + 1

n

n∑i=1

∇fi(φk)

)φk = xk + εk(xk − φk).

• On average gradient, full gradient computed once every E[εk] iterates (canbe chosen however you want.)

14 / 31
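A minimal sketch of the ε_k mechanism for a generic finite sum: only the snapshot φ and the averaged gradient are stored, not the table of y_i. The Bernoulli probability p = E[ε_k] = 1/n and the quadratic f_i are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 200, 50, 1e-3
p = 1.0 / n                                  # E[eps_k]: expected snapshot frequency
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)

def grad_i(i, x):                            # gradient of f_i(x) = 0.5 * (a_i^T x - b_i)^2
    return A[i] * (A[i] @ x - b[i])

x = np.zeros(m)
phi = x.copy()                               # old iterate: y_i = grad_i(phi) implicitly
full_grad = sum(grad_i(i, phi) for i in range(n)) / n

for k in range(5000):
    ik = rng.integers(n)
    x_new = x - gamma * (grad_i(ik, x) - grad_i(ik, phi) + full_grad)
    if rng.random() < p:                     # eps_k = 1: refresh the snapshot
        phi = x.copy()
        full_grad = sum(grad_i(i, phi) for i in range(n)) / n
    x = x_new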


Page 21: SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm · SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm Damek Davis1 Department of Mathematics University

• Back to ERM problem....

• Can also add importance sampling, which increases the range of step sizeswe can take.

Without importance sampling: γ ≤ (2 max{Li})−1

Without importance sampling: γ ≤

(2n

∑i

Li

)−1

• SAGA with Importance Sampling:

Sample ik ∈ {1, . . . , n} arbitrarily

xk+1 = xk − γ

(pi � (aikf

′ik (aTikx

k)− aikzkik ) + 1

n

n∑i=1

aizki

)

zk+1i =

{f ′ik (aTikx

k) if i = ik;

zki otherwise.

• Improves theoretical convergence rate and practical performance.

15 / 31


Page 23: SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm · SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm Damek Davis1 Department of Mathematics University

• These algorithms are serial; only one gradient is touched per iteration.

• Let’s parallelize: choose dk ∈ {1, . . . , τ}m and eik ∈ {1, . . . , τ}. Set

xk−dk = (xk−dk,11 , . . . , x

k−dk,mm )

• Asynchronous SAGA:

Sample ik ∈ {1, . . . , n} uniformly

xk+1 = xk − γ

(aikf

′ik (aTikx

k−dk )− aikzk−eik

kik

+ 1n

n∑i=1

aizk−eik

ki

)

zk+1i =

{f ′ik (aTikx

k−eikk ) if i = ik;

zki otherwise.

• Can be combined with block coordinate updates.

• It’s like running serial SAGA on n different processors, without eversyncing them up.

16 / 31


Page 25: SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm · SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm Damek Davis1 Department of Mathematics University

• These algorithms are nice, but they’re lacking generality.

• Recall that SAGA and the other algorithms solve

find x ∈ H such that:n∑i=1

∇fi(x) = 0

• SMART solves the following Root-Finding Problem:

find x ∈ H such that: S(x) = 1n

n∑i=1

Si(x) = 0

where Si : H → H are gradient-like.

17 / 31

Page 26: SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm · SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm Damek Davis1 Department of Mathematics University

• The Coherence Condition: (∃βij > 0) : (∀x ∈ H), (∀x∗ ∈ zer(S))

m∑j=1

n∑i=1

βij‖(Si(x))j − (Si(x∗))j‖2j ≤ 〈S(x), x− x∗〉.

• Smooth convex functions satisfy

(∀x, y ∈ H) 1Li‖∇fi(x)−∇fi(y)‖2 ≤ 〈∇fi(x)−∇fi(y), x− y〉.

if ∇fi is Li-Lipschitz.

• =⇒ property can be summed together for multiple smooth functions:n∑i=1

1nLi‖∇fi(x)−∇fi(x∗)‖2 ≤ 〈 1

n

n∑i=1

∇fi(x), x− x∗〉.

if∑n

i=1∇fi(x∗) = 0.

18 / 31


Page 28: SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm · SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm Damek Davis1 Department of Mathematics University

• Proximal operators: (∀x, y ∈ H)

‖(I − proxγf )(x)− (I − proxγf )(y)‖2

≤ 〈(I − proxγf )(x)− (I − proxγf )(y), x− y〉.

(for smooth and nonsmooth f)

• Projection operators: (∀x, y ∈ H)

‖(I − PC)(x)− (I − PC)(y)‖2 ≤ 〈(I − PC)(x)− (I − PC)(y), x− y〉.

(for closed convex sets)

• Subgradient projectors: (∀x ∈ H) , (∀x∗ ∈ [f ≤ 0])∥∥∥∥ [f(x)]+‖g(x)‖2 g(x)

∥∥∥∥2

≤ 〈 [f(x)]+‖g(x)‖2 g(x), x− x∗〉.

where g(x) ∈ ∂f(x) is a subgradient selector.

19 / 31
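These inequalities are easy to sanity-check numerically. Below is a small sketch that verifies the firm nonexpansiveness of I − T for the projection onto a halfspace and for the soft-thresholding prox of the ℓ1 norm at random points; the particular set C, function, and parameter γ are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
m = 20
a, beta = rng.standard_normal(m), 0.3        # halfspace C = {x : <a, x> <= beta}

def proj_halfspace(x):
    viol = max(a @ x - beta, 0.0)
    return x - viol * a / (a @ a)

def prox_l1(x, gamma=0.5):                   # prox of gamma * ||.||_1 (soft-thresholding)
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

for T in (proj_halfspace, prox_l1):
    x, y = rng.standard_normal(m), rng.standard_normal(m)
    u, v = x - T(x), y - T(y)                # (I - T)(x) and (I - T)(y)
    lhs = np.linalg.norm(u - v) ** 2
    rhs = (u - v) @ (x - y)
    assert lhs <= rhs + 1e-12, (lhs, rhs)
print("coherence-type inequality verified for both operators")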


Page 31:

Algorithm (SMART)

Let {λ_k}_{k∈N} be a sequence of step sizes. Choose x^0 ∈ H and y_1^0, . . . , y_n^0 ∈ H arbitrarily, except that y_{i,j}^0 = 0 if S*_{ij} = 0. Then for k ∈ N, perform the following three steps:

1. Sampling: choose a set of coordinates S_k, an operator index i_k, and a dual update decision ε_k.

2. Primal update: set
   \[ (\forall j \in S_k)\quad x_j^{k+1} = x_j^k - \frac{\lambda_k q_j}{mn}\left( \frac{1}{p_{i_k j}}\Big( (S_{i_k}(x^{k-d_k}))_j - y_{i_k,j}^{\,k-e_{i_k}^k} \Big) + \sum_{i=1}^n y_{i,j}^{\,k-e_i^k} \right) \]
   \[ (\forall j \notin S_k)\quad x_j^{k+1} = x_j^k. \]

3. Dual update: if i_k triggers i, set
   \[ (\forall j \in S_k \text{ with } S^*_{ij} \ne 0)\quad y_{i,j}^{k+1} = y_{i,j}^k + \varepsilon_k\Big( (S_i(x^{k-d_k}))_j - y_{i,j}^k \Big), \]
   \[ (\forall j \notin S_k)\quad y_{i,j}^{k+1} = y_{i,j}^k. \]
   Otherwise, set y_{i,j}^{k+1} = y_{i,j}^k.
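To see the algorithm in motion, here is a deliberately simplified serial instance: all coordinates are updated, there are no delays, ε_k = 1, each index triggers only itself, and sampling is uniform, so the primal step collapses to the SAGA-like form with all probability scalings absorbed into γ. The operators S_i(x) = x − c_i used in the usage line are illustrative (each is I − P_C for the singleton C = {c_i}).

import numpy as np

def smart_serial(S_list, x0, gamma, iters, seed=0):
    # Simplified serial SMART: full coordinate set S_k, no delays, eps_k = 1,
    # self-loop trigger graph, uniform sampling; reduces to a SAGA-type iteration.
    rng = np.random.default_rng(seed)
    n = len(S_list)
    x = x0.copy()
    y = [np.zeros_like(x0) for _ in range(n)]    # dual variables y_i
    y_avg = np.zeros_like(x0)                    # (1/n) * sum_i y_i
    for _ in range(iters):
        i = rng.integers(n)
        Si_x = S_list[i](x)
        x = x - gamma * (Si_x - y[i] + y_avg)    # primal update
        y_avg += (Si_x - y[i]) / n               # dual update (i_k triggers only itself)
        y[i] = Si_x
    return x

# usage: find a root of S(x) = (1/n) sum_i S_i(x) with S_i(x) = x - c_i
cs = [np.array([1.0, 2.0]), np.array([3.0, -1.0]), np.array([0.0, 5.0])]
S_list = [lambda x, c=c: x - c for c in cs]
print(smart_serial(S_list, np.zeros(2), gamma=0.3, iters=3000))   # ~ mean of the c_i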

Page 32: SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm · SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm Damek Davis1 Department of Mathematics University

• Linear feasibility problem:

Find x ∈ H such that Ax = b

• Randomized Asynchronous Kaczmarz algorithm:

Sample ik ∈ {1, . . . , n} uniformlyxk+1 = xk + λ(bik − 〈aik , x

k−dk 〉)aik

• Here we used Ci = {x | 〈ai, x〉 = bi} and

Si := (I − PCi ).

• No memory needed precisely because Si(x∗) = 0 at any solution.

21 / 31
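A minimal serial NumPy sketch of this Kaczmarz iteration (asynchronous delays dropped) on a random consistent system; normalizing the rows so that I − P_{C_i} takes exactly the form on the slide, and λ = 1, are assumptions made for the sketch.

import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 300, 50, 1.0
A = rng.standard_normal((n, m))
A /= np.linalg.norm(A, axis=1, keepdims=True)    # unit rows: S_i = I - P_{C_i} as on the slide
x_true = rng.standard_normal(m)
b = A @ x_true                                   # consistent system Ax = b

x = np.zeros(m)
for k in range(20000):
    i = rng.integers(n)
    x = x + lam * (b[i] - A[i] @ x) * A[i]       # (relaxed) projection onto C_i

print(np.linalg.norm(x - x_true))                # small residual; no dual memory needed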

Page 33: SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm · SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm Damek Davis1 Department of Mathematics University

• Nonsmooth regularization of ERM?

minimizex∈H

g(x) + 1N

N∑i=1

fi(aTi x),

• Operators (that satisfy the coherence condition)

Si = 1Li‖ai‖2N

ai∇fi ◦ aTi proxL−1g i = 1, . . . , N ;

SN+1 = (I − proxL−1g),

• Roots x∗ ∈ zer(S) are not minimizers, but proxL−1g(x) is a minimizer.

• Every time we evaluate Si, we have to evaluate proxL−1g, make thetrigger graph a star and we always update the (N + 1)rst dual variable.

22 / 31
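As a concrete instance of these operators, here is a sketch for logistic losses f_i(t) = log(1 + exp(−b_i t)) with the ℓ1 regularizer g = μ‖·‖_1, whose prox is soft-thresholding; the constants L and L_i and the random data are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
N, m, mu = 100, 20, 0.1
A = rng.standard_normal((N, m))
b = rng.choice([-1.0, 1.0], size=N)
Li = 0.25                         # scalar logistic losses satisfy |f_i''| <= 1/4
L = 0.25                          # constant used inside prox_{L^{-1} g} (assumed choice)

def prox_g(x, t):                 # prox of t * mu * ||.||_1, i.e. soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - t * mu, 0.0)

def fprime(i, s):                 # f_i'(s) for the logistic loss log(1 + exp(-b_i s))
    return -b[i] / (1.0 + np.exp(b[i] * s))

def S(i, x):
    # Operators from the slide (0-based: i = N plays the role of S_{N+1}).
    # Each S_i with i < N needs one prox evaluation and one scalar derivative.
    if i < N:
        p = prox_g(x, 1.0 / L)
        return A[i] * fprime(i, A[i] @ p) / (Li * (A[i] @ A[i]) * N)
    return x - prox_g(x, 1.0 / L)

x = rng.standard_normal(m)
print(np.linalg.norm(S(0, x)), np.linalg.norm(S(N, x)))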


Page 36: SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm · SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm Damek Davis1 Department of Mathematics University

• Asynchronous Proximal SAGA:

Sample ik ∈ {1, . . . , N + 1} with P (ik = i) =

{1

2N if i < N ;12 if i = N.

if ik < N

xk+1 = xk − 1λ(N + 1)

(1pik

(Si(xk−dk )− aizk−eik

ki ) + y

k−eN+1k

N+1 +N∑i=1

aizk−ei

ki

);

else

xk+1 = xk − 1λ(N + 1)

(1

pN+1(SN+1(xk−dk )− yk−e

N+1k

N+1 ) + yk−eN+1

kN+1 +

N∑i=1

aizk−ei

ki

);

end

zk+1i =

{1

Li‖ai‖2N∇fi(aTi proxL−1g(xk−dk )) if i = ik;

zki otherwise.

ykN+1 =(I − proxL−1g

)(xk−dk );

23 / 31

Page 37: SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm · SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm Damek Davis1 Department of Mathematics University

• Monotropic Programming?

minimizexj∈Hj

M∑j=1

gj(xj) + f(x1, . . . , xM );

subject to:M∑j=1

Ajxj = b

• Operator S :∏M+1j=1 Hj →

∏M+1j=1 Hj :

(S(x))M+1 := −γM+1

(M∑l=1

Alxl − b

);

(S(x))j

:= xj − proxγjgj

(xj − γjA∗j

(xM+1 + 2γM+1

(M∑l=1

Alxl − b

))− γj∇jf(x)

).

• x∗ ∈ zer(S) =⇒ (x∗1, . . . , x∗m) solves the monotropic programmingproblem. (Why?)

24 / 31


Page 39: SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm · SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm Damek Davis1 Department of Mathematics University

• TropicSMART:

Sample coordinate jk ∈ {1, . . . ,M + 1} uniformly and set Sk = {jk}.

xk+1M+1 = xkM+1 + γM+1

(M∑l=1

Alxkl − b

);

xk+1j = proxγjgj

(xkj − γjA∗j (2xk+1

M+1 − xkM+1)− γj∇jf(xk)

);

(∀j ∈ Sk) xk+1j = xkj − λ

(xkj − xk+1

j

);

(∀j /∈ Sk) xk+1j = xkj .

25 / 31

Page 40: SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm · SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm Damek Davis1 Department of Mathematics University

• More special cases in the paper.

• What about theory?

26 / 31

Page 41: SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm · SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm Damek Davis1 Department of Mathematics University

Theorem (SMART converges)1. The sequence {xk}k∈N weakly converges to a root of S.

2. If S is essentially strongly quasi monotone (ESQM)

(∃µ > 0) : (∀x ∈ H) 〈S(x), x− Pzer(S)(x)〉 ≥ µ‖x− Pzer(S)(x)‖2,

then {xk}k∈N linearly converges to a root of S.

• Examples of ESQM include S(x) =∑

iai∇fi(aTi x) if each fi is strongly

convex.

27 / 31

Page 42: SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm · SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm Damek Davis1 Department of Mathematics University

• Proof is not difficult, but kind of long.

1. Construct a supermartingale sequence

E[Xk+1|Fk] + Yk ≤ Xk

where

Xk = (distance to solution)2 + (asynchrony residual) + (dual variables residual)Yk = (Residuals that force convergence if they are 0).

2. Then through a series of magical steps, show that the sequence weaklyconverges.

• Linear convergence is somewhat more difficult to show.

28 / 31
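The standard tool behind step 1 is (a special case of) the Robbins–Siegmund supermartingale lemma, stated here for reference; the talk does not state it explicitly, so this is supplementary.

% If X_k, Y_k >= 0 are F_k-measurable and E[X_{k+1} | F_k] + Y_k <= X_k for all k,
% then X_k converges almost surely and the residuals Y_k are summable, hence vanish:
\[
  \mathbb{E}[X_{k+1} \mid \mathcal{F}_k] + Y_k \le X_k \ \ (\forall k)
  \quad\Longrightarrow\quad
  X_k \to X_\infty \ \text{a.s.} \ \ \text{and} \ \ \sum_{k=0}^{\infty} Y_k < \infty \ \text{a.s.}
\]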

Page 43: SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm · SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm Damek Davis1 Department of Mathematics University

101 102

Time (s)

10-1

Obje

ctiv

e E

rror

1 core4 cores8 cores16 cores

Figure: `2-regularized Logistic regression with N = 1000, m = 10000, conditionnumber = 10, matrix A random Gaussian, vector b uniformly distributed.

29 / 31

Page 44: SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm · SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm Damek Davis1 Department of Mathematics University

• A lot left to do.

• Nonconvex case• (asynchronous matrix factorization algorithms soon)

• Do more numerical experiments• Brent Edmunds at UCLA making program that takes operators as input and

runs SMART to find roots.

• Characterize sublinear convergence rates.

• Make more operators =⇒ more algorithms.

30 / 31

Page 45: SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm · SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm Damek Davis1 Department of Mathematics University

Thanks!

• Paper available here: http://arxiv.org/abs/1601.00698

• This material is based upon work supported by the National ScienceFoundation under Award No. 1502405.

31 / 31