Lecture 11: Mirror Descent and Variable Metric Methods

DAMC Lecture 11, May 8 - 13, 2020
math.xmu.edu.cn/group/nona/damc/Lecture11.pdf


Definition: Let h be a convex function, differentiable on C. The Bregman divergence associated with h is defined as

    D_h(x, y) = h(x) − h(y) − ⟨∇h(y), x − y⟩.

D_h(x, y) is always non-negative and convex in its first argument x.
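As a quick numerical illustration (not part of the original slides), the divergence is straightforward to evaluate for any differentiable h; the sketch below, in Python/NumPy, checks the squared-Euclidean case that reappears in the gradient-descent example later.

```python
import numpy as np

def bregman(h, grad_h, x, y):
    """D_h(x, y) = h(x) - h(y) - <grad_h(y), x - y>."""
    return h(x) - h(y) - np.dot(grad_h(y), x - y)

# h(x) = 0.5 * ||x||_2^2 gives D_h(x, y) = 0.5 * ||x - y||_2^2.
h = lambda x: 0.5 * np.dot(x, x)
grad_h = lambda x: x
x, y = np.random.randn(4), np.random.randn(4)
assert np.isclose(bregman(h, grad_h, x, y), 0.5 * np.sum((x - y) ** 2))
```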

h is λ-strongly convex over C with respect to the norm ‖·‖ if

    h(x) ≥ h(y) + ⟨∇h(y), x − y⟩ + (λ/2)‖x − y‖²   for all x, y ∈ C.

Note that strong convexity of h is equivalent to

    D_h(x, y) ≥ (λ/2)‖x − y‖²   for all x, y ∈ C.

2. Mirror Descent method

At iteration k:

Compute a subgradient g^k ∈ ∂f(x^k).

Update

    x^{k+1} = argmin_{x ∈ C} { f(x^k) + ⟨g^k, x − x^k⟩ + (1/α_k) D_h(x, x^k) }
            = argmin_{x ∈ C} { ⟨g^k, x⟩ + (1/α_k) D_h(x, x^k) }.

We often attempt to choose h to better match the geometry of the underlying constraint set C. The goal is, roughly, to choose a strongly convex function h so that the diameter of C is small in the norm ‖·‖ (which need not be the ℓ2 or Euclidean norm) with respect to which h is strongly convex.

Example: Gradient descent is mirror descent. Consider

    h(x) = (1/2)‖x‖_2²,   D_h(x, y) = (1/2)‖x − y‖_2².

The update of mirror descent is

    x^{k+1} = argmin_{x ∈ C} { f(x^k) + ⟨g^k, x − x^k⟩ + (1/α_k) D_h(x, x^k) }
            = argmin_{x ∈ C} { ⟨g^k, x⟩ + (1/α_k) D_h(x, x^k) }
            = argmin_{x ∈ C} { ⟨g^k, x⟩ + (1/(2α_k))‖x − x^k‖_2² },

i.e. exactly the projected subgradient step x^{k+1} = Proj_C(x^k − α_k g^k).

Example: Solving problems on the probability simplex with exponentiated gradient methods.

Consider the negative entropy (h is convex)

    h(x) = ∑_{j=1}^n x_j log x_j,

restricting x to the probability simplex {x ≥ 0, ∑_{j=1}^n x_j = 1}. Then

    D_h(x, y) = ∑_{j=1}^n [x_j(log x_j − log y_j) + (y_j − x_j)] = ∑_{j=1}^n x_j log(x_j / y_j),

the entropic or Kullback-Leibler divergence.

The subproblem in mirror descent is

    min_x   ⟨g, x⟩ + ∑_{j=1}^n x_j log(x_j / y_j)
    s.t.    x ≥ 0,  ∑_{j=1}^n x_j = 1.

Introducing Lagrange multipliers τ ∈ R for the equality constraint and λ ∈ R^n_+ for the inequality constraint, we obtain the Lagrangian

    L(x, τ, λ) = ⟨g, x⟩ + ∑_{j=1}^n [ x_j log(x_j / y_j) + τ x_j − λ_j x_j ] − τ.

Taking derivatives with respect to x and setting them to zero yields

    x_j = y_j exp(−g_j − 1 − τ + λ_j).

Taking λ_j = 0, by ∑_{j=1}^n x_j = 1 we have (a special solution)

    x_i = y_i exp(−g_i) / ∑_{j=1}^n y_j exp(−g_j).

The update in mirror descent is therefore

    x^{k+1}_i = x^k_i exp(−α_k g^k_i) / ∑_{j=1}^n x^k_j exp(−α_k g^k_j).

This is the so-called exponentiated gradient update, also known as entropic mirror descent.
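A minimal NumPy sketch of this update (the names x, g, alpha are mine; for large α_k‖g^k‖_∞ one would evaluate the exponent in a numerically safer way, e.g. by subtracting its maximum, which leaves the normalized result unchanged):

```python
import numpy as np

def exponentiated_gradient_step(x, g, alpha):
    """Entropic mirror descent on the simplex: x_i <- x_i * exp(-alpha * g_i), renormalized."""
    w = x * np.exp(-alpha * g)
    return w / np.sum(w)
```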

Example using ℓp norms (p ∈ (1, 2]): Consider

    h(x) = (1/(2(p − 1)))‖x‖_p².

We have

    ∇h(x) = (1/(p − 1))‖x‖_p^{2−p} [sign(x_1)|x_1|^{p−1}, ···, sign(x_n)|x_n|^{p−1}]^T.

Define

    φ(x) = (p − 1)∇h(x)

and

    ψ(y) = ‖y‖_q^{2−q} [sign(y_1)|y_1|^{q−1}, ···, sign(y_n)|y_n|^{q−1}]^T,

where q = p/(p − 1). We have ψ(φ(x)) = x, and hence ψ = φ^{−1} and φ = ψ^{−1}.
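The pair (φ, ψ) is easy to check numerically. The sketch below assumes p ∈ (1, 2] and NumPy; the guard returning 0 at the origin is my own convention (consistent with the limit), and these two functions are reused in the later sketches.

```python
import numpy as np

def phi(x, p):
    # phi(x) = (p - 1) * grad h(x) = ||x||_p^{2-p} * sign(x) * |x|^{p-1}
    nrm = np.linalg.norm(x, ord=p)
    if nrm == 0.0:
        return np.zeros_like(x)
    return nrm ** (2 - p) * np.sign(x) * np.abs(x) ** (p - 1)

def psi(y, p):
    # psi = phi^{-1}, written with the conjugate exponent q = p / (p - 1)
    q = p / (p - 1)
    nrm = np.linalg.norm(y, ord=q)
    if nrm == 0.0:
        return np.zeros_like(y)
    return nrm ** (2 - q) * np.sign(y) * np.abs(y) ** (q - 1)

# Check psi(phi(x)) == x at a random point.
p = 1.5
x = np.random.randn(6)
assert np.allclose(psi(phi(x, p), p), x)
```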

The mirror descent update for C = R^n is

    x^{k+1} = argmin_{x ∈ R^n} { f(x^k) + ⟨g^k, x − x^k⟩ + (1/α_k) D_h(x, x^k) }
            = argmin_{x ∈ R^n} { ⟨g^k, x⟩ + (1/α_k) D_h(x, x^k) }
            = (∇h)^{−1}(∇h(x^k) − α_k g^k)
            = ψ(φ(x^k) − α_k(p − 1) g^k).

The update involving the inverse of the gradient mapping (∇h)^{−1} holds more generally, that is, for any differentiable and strictly convex h.

This is the original form of the mirror descent update, and it justifies the name "mirror descent": the gradient is "mirrored" through the distance-generating function h and back again.
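With φ and ψ as sketched above, the unconstrained update is a single line of code (x, g, alpha assumed given):

```python
def lp_mirror_descent_step(x, g, alpha, p):
    """x^{k+1} = psi(phi(x^k) - alpha_k * (p - 1) * g^k), the C = R^n case."""
    return psi(phi(x, p) - alpha * (p - 1) * g, p)
```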

The case C = {x ∈ R^n_+ : ∑_{j=1}^n x_j = 1}: the subproblem in mirror descent is

    min_{x ∈ C}  ⟨v, x⟩ + (1/2)‖x‖_p²,

where

    v = α_k(p − 1) g^k − φ(x^k).

Exercise: Show by the KKT conditions that the solution is given by x = ψ([τe − v]_+), where τ ∈ R satisfies

    ∑_{j=1}^n ψ_j([τe − v]_+) = 1.

Binary search can be used to find τ.
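A possible bisection sketch for τ, reusing ψ from above (the bracketing strategy and tolerance are my own choices; the map τ ↦ ∑_j ψ_j([τe − v]_+) is nondecreasing, which is what makes bisection valid):

```python
import numpy as np

def solve_simplex_subproblem(v, p, tol=1e-10):
    """Return x = psi([tau*e - v]_+) with tau chosen so that sum_j psi_j([tau*e - v]_+) = 1."""
    s = lambda tau: np.sum(psi(np.maximum(tau - v, 0.0), p))
    lo, hi = np.min(v), np.max(v) + 1.0   # s(lo) = 0 < 1; enlarge hi until s(hi) >= 1
    while s(hi) < 1.0:
        hi += 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if s(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    tau = 0.5 * (lo + hi)
    return psi(np.maximum(tau - v, 0.0), p)
```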

Theorem 1

Let α_k > 0 be any sequence of non-increasing stepsizes. Assume that h is strongly convex with respect to ‖·‖, and let ‖·‖_* denote the dual norm. Let x^k be generated by the mirror descent iteration. If D_h(x, x_⋆) ≤ R² for all x ∈ C, then for all K ≥ 1,

    ∑_{k=1}^K [f(x^k) − f(x_⋆)] ≤ R²/α_K + ∑_{k=1}^K (α_k/2)‖g^k‖_*².

If α_k = α is constant, then for all K ≥ 1,

    ∑_{k=1}^K [f(x^k) − f(x_⋆)] ≤ (1/α) D_h(x_⋆, x^1) + (α/2) ∑_{k=1}^K ‖g^k‖_*².

If x̄^K = (1/K) ∑_{k=1}^K x^k or x̄^K = argmin_{x^k} f(x^k), and ‖g‖_* ≤ M for all g ∈ ∂f(x), x ∈ C, then

    f(x̄^K) − f(x_⋆) ≤ (1/(Kα)) D_h(x_⋆, x^1) + (α/2) M².

Exercise: Prove that the negative entropy

    h(x) = ∑_{j=1}^n x_j log x_j

is 1-strongly convex with respect to the ℓ1-norm.

Example: Let C = {x ∈ R^n_+ : ∑_{j=1}^n x_j = 1} and let x^1 = (1/n)e. The exponentiated gradient method with fixed stepsize α guarantees

    f(x̄^K) − f(x_⋆) ≤ log n/(Kα) + (α/(2K)) ∑_{k=1}^K ‖g^k‖_∞²,

where x̄^K = (1/K) ∑_{k=1}^K x^k.

Example: Let C ⊂ {x ∈ R^n : ‖x‖_1 ≤ R_1} and let h(x) = (1/(2(p − 1)))‖x‖_p² with p = 1 + 1/log(2n). Then

    ∑_{k=1}^K [f(x^k) − f(x_⋆)] ≤ 2R_1² log(2n)/α_K + (e²/2) ∑_{k=1}^K α_k‖g^k‖_∞².

In particular, taking α_k = (R_1/e)√(log(2n)/k) and x̄^K = (1/K) ∑_{k=1}^K x^k gives

    f(x̄^K) − f(x_⋆) ≤ 3eR_1√(log(2n)) / √K.

A simulated mirror-descent example: Let

    A = [a_1 ··· a_m]^T ∈ R^{m×n}

have entries drawn i.i.d. N(0, 1). Let

    b_i = (1/2)(a_{i1} + a_{i2}) + ε_i,

where the ε_i are drawn i.i.d. N(0, 10^{−2}), and n = 3000, m = 20. Define

    f(x) = ‖Ax − b‖_1 = ∑_{i=1}^m |⟨a_i, x⟩ − b_i|,

which has subgradients A^T sign(Ax − b). Consider

    min_x f(x)   s.t.   x ∈ C = {x ∈ R^n_+ : ∑_{j=1}^n x_j = 1}.
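This experiment can be set up along the following lines (a sketch; the random seed and exact conventions are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3000, 20
A = rng.standard_normal((m, n))                                # rows a_i ~ N(0, I)
b = 0.5 * (A[:, 0] + A[:, 1]) + 0.1 * rng.standard_normal(m)   # eps_i ~ N(0, 1e-2)

def f(x):
    return np.sum(np.abs(A @ x - b))        # f(x) = ||Ax - b||_1

def subgradient(x):
    return A.T @ np.sign(A @ x - b)         # an element of the subdifferential of f at x
```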

We compare the subgradient method to exponentiated gradient descent for this problem, noting that the Euclidean projection of a vector v ∈ R^n onto the set C has coordinates

    x_j = [v_j − τ]_+,

where τ ∈ R is chosen so that

    ∑_{j=1}^n x_j = ∑_{j=1}^n [v_j − τ]_+ = 1.

We use stepsize α_k = α_0/√k, where the initial stepsize α_0 is chosen to optimize the convergence guarantee for each of the methods.
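A standard sort-based routine computes this projection; combined with the exponentiated-gradient step sketched earlier, the comparison loop might look as follows (a sketch building on the setup above; the initial stepsizes are placeholders rather than the theoretically tuned values):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto {x >= 0, sum_j x_j = 1}: x_j = [v_j - tau]_+."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - 1.0) / np.arange(1, v.size + 1) > 0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - tau, 0.0)

x_sg = np.ones(n) / n          # projected subgradient iterate
x_eg = np.ones(n) / n          # exponentiated gradient iterate
alpha0_sg = alpha0_eg = 1.0    # placeholder initial stepsizes
for k in range(1, 2001):
    step = 1.0 / np.sqrt(k)
    x_sg = project_simplex(x_sg - alpha0_sg * step * subgradient(x_sg))
    w = x_eg * np.exp(-alpha0_eg * step * subgradient(x_eg))
    x_eg = w / np.sum(w)
```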

[Figure: f^k_best − f_⋆ versus k.]

2.1 Stochastic version

Theorem 2

Let the conditions of Theorem 1 hold, except that instead of receiving a vector g^k ∈ ∂f(x^k) at iteration k, the vector g^k is a stochastic subgradient satisfying

    E[g^k | x^k] ∈ ∂f(x^k).

Then, for any non-increasing stepsize sequence α_k (where α_k may be chosen dependent on g^1, ···, g^k),

    E[ ∑_{k=1}^K (f(x^k) − f(x_⋆)) ] ≤ E[ R²/α_K + ∑_{k=1}^K (α_k/2)‖g^k‖_*² ].

2.2 Adaptive stepsizes

Let x̄^K = (1/K) ∑_{k=1}^K x^k. If D_h(x, x_⋆) ≤ R² for all x ∈ C, then

    E[f(x̄^K) − f(x_⋆)] ≤ E[ R²/(Kα_K) + (1/K) ∑_{k=1}^K (α_k/2)‖g^k‖_*² ].

If α_k = α for all k, we see that the stepsize minimizing

    R²/(Kα) + (α/(2K)) ∑_{k=1}^K ‖g^k‖_*²

is

    α_⋆ = √2 R ( ∑_{k=1}^K ‖g^k‖_*² )^{−1/2},

which yields

    E[f(x̄^K) − f(x_⋆)] ≤ (√2 R/K) E[ ( ∑_{k=1}^K ‖g^k‖_*² )^{1/2} ].

But this α_⋆ is not known in advance, because it depends on all K subgradients g^1, ···, g^K.

Corollary 3

Let the conditions of Theorem 2 hold, and let

    α_k = R / ( ∑_{i=1}^k ‖g^i‖_*² )^{1/2},

which is the "up to now" optimal choice. Then

    E[f(x̄^K) − f(x_⋆)] ≤ (3R/K) E[ ( ∑_{k=1}^K ‖g^k‖_*² )^{1/2} ],

where x̄^K = (1/K) ∑_{k=1}^K x^k.
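In code this stepsize is just a running sum of squared dual norms. A small sketch (the helper dual_norm_sq is hypothetical, standing for ‖·‖_*², e.g. the squared ℓ∞ norm in the simplex/entropy setting):

```python
import numpy as np

def adaptive_stepsize(g, state, R, dual_norm_sq=lambda g: np.max(np.abs(g)) ** 2):
    """alpha_k = R / sqrt(sum_{i<=k} ||g^i||_*^2); `state` accumulates the running sum."""
    state["sum_sq"] = state.get("sum_sq", 0.0) + dual_norm_sq(g)
    return R / np.sqrt(state["sum_sq"])
```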

3. Variable metric methods

The idea is to adjust the metric with which one constructs updates to better reflect problem structure. At iteration k:

1. Choose a subgradient g^k ∈ ∂f(x^k), or a stochastic subgradient g^k satisfying E[g^k | x^k] ∈ ∂f(x^k).

2. Update a positive semidefinite matrix H_k ∈ R^{n×n}.

3. Compute the update

    x^{k+1} = argmin_{x ∈ C} { ⟨g^k, x⟩ + (1/2)⟨x − x^k, H_k(x − x^k)⟩ }.

Special cases: the subgradient method takes H_k = (1/α_k) I, and Newton's method takes H_k = ∇²f(x^k).
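For C = R^n the update has the closed form x^{k+1} = x^k − H_k^{−1} g^k, which in code is a single linear solve (a sketch; a constrained C would additionally require a projection in the H_k-norm):

```python
import numpy as np

def variable_metric_step(x, g, H):
    """Unconstrained variable-metric step: x^{k+1} = x^k - H_k^{-1} g^k."""
    return x - np.linalg.solve(H, g)
```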

Theorem 4

Let H_k be a sequence of positive definite matrices, where H_k is a function of g^1, ···, g^k (and potentially some additional randomness). Let the g^k be (stochastic) subgradients with E[g^k | x^k] ∈ ∂f(x^k). Then

    E[ ∑_{k=1}^K (f(x^k) − f(x_⋆)) ]
      ≤ (1/2) E[ ∑_{k=2}^K ( ‖x^k − x_⋆‖²_{H_k} − ‖x^k − x_⋆‖²_{H_{k−1}} ) + ‖x^1 − x_⋆‖²_{H_1} ]
        + (1/2) E[ ∑_{k=1}^K ‖g^k‖²_{H_k^{−1}} ].

3.1 Adaptive gradient variable metric method

Theorem 5

Let

    R_∞ = sup_{x ∈ C} ‖x − x_⋆‖_∞

be the ℓ∞ radius of the set C, and let the conditions of Theorem 4 hold. Let

    H_k = (1/α) ( diag( ∑_{i=1}^k g^i (g^i)^T ) )^{1/2},

where α > 0 is a pre-specified constant. Then we have

    E[ ∑_{k=1}^K (f(x^k) − f(x_⋆)) ] ≤ ( R_∞²/2 + α² ) E[tr(H_K)].
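Because H_k is diagonal, the update only needs the running sum of squared gradient coordinates. A minimal sketch (the class name, the eps guard, and the optional ℓ∞-ball clipping used in the next example are my additions):

```python
import numpy as np

class DiagonalAdaGrad:
    """H_k = (1/alpha) * diag(sum_{i<=k} g^i (g^i)^T)^{1/2}, stored as a vector."""

    def __init__(self, dim, alpha, eps=1e-12):
        self.alpha = alpha
        self.sq_sum = np.zeros(dim)     # running sum of squared gradient coordinates
        self.eps = eps                  # guards against division by zero

    def step(self, x, g, box_radius=None):
        self.sq_sum += g ** 2
        x_new = x - self.alpha * g / (np.sqrt(self.sq_sum) + self.eps)
        if box_radius is not None:      # for C = {||x||_inf <= r} the H_k-projection is a coordinatewise clip
            x_new = np.clip(x_new, -box_radius, box_radius)
        return x_new
```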

Example: Support vector machine classification problem. Consider

    min_x  f(x) = (1/m) ∑_{i=1}^m [1 − b_i⟨a_i, x⟩]_+   s.t.  ‖x‖_∞ ≤ 1,

where the vectors a_i ∈ {−1, 0, 1}^n, with m = 5000 and n = 1000.

For each coordinate j ∈ [n], we set a_{ij} ∈ {±1} with a random sign with probability 1/j, and a_{ij} = 0 otherwise.

Let u ∈ {−1, 1}^n be drawn uniformly at random. We set b_i = sign(⟨a_i, u⟩) with probability 0.95 and b_i = −sign(⟨a_i, u⟩) otherwise.

We choose a stochastic gradient by selecting i ∈ [m] uniformly at random, then setting

    g^k ∈ ∂[1 − b_i⟨a_i, x^k⟩]_+,

and use α_k = α/√k.
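The data and the stochastic subgradient oracle can be generated roughly as follows (a sketch; the seed and minor conventions are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5000, 1000
j = np.arange(1, n + 1)
nonzero = rng.random((m, n)) < 1.0 / j                      # coordinate j is nonzero w.p. 1/j
A = np.where(nonzero, rng.choice([-1.0, 1.0], size=(m, n)), 0.0)
u = rng.choice([-1.0, 1.0], size=n)
correct = rng.random(m) < 0.95                              # label agrees with sign(<a_i, u>) w.p. 0.95
b = np.where(correct, np.sign(A @ u), -np.sign(A @ u))

def stochastic_subgradient(x):
    """Subgradient of [1 - b_i <a_i, x>]_+ at x, for a uniformly random index i."""
    i = rng.integers(m)
    return -b[i] * A[i] if 1.0 - b[i] * (A[i] @ x) > 0 else np.zeros(n)
```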

[Figure: f(x^k) − f_⋆ versus k/100, for various initial stepsizes.]

[Figure: f(x^k) − f_⋆ versus k/100, for the best initial stepsize.]

Page 2: Lecture 11: Mirror Descent and Variable Metric Methodsmath.xmu.edu.cn/group/nona/damc/Lecture11.pdf · Definition: Let h be a differentiable convex function, differentiable on

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 2 26

Definition Let h be a differentiable convex function differentiableon C The Bregman divergence associated with h is defined as

Dh(xy) = h(x)minus h(y)minus 〈nablah(y)xminus y〉

Dh(xy) is always non-negative and convex in its first argument x

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 3 26

h is λ-strongly convex over C with respect to the norm 983042 middot 983042 if

h(x) ge h(y) + 〈nablah(y)xminus y〉+ λ

2983042xminus y9830422 forall xy isin C

Note that strong convexity of h is equivalent to

Dh(xy) geλ

2983042xminus y9830422 forall xy isin C

2 Mirror Descent method

Compute subgradient gk isin partf(xk)

Update

xk+1 = argminxisinC

983069f(xk) + 〈gkxminus xk〉+ 1

αkDh(xx

k)

983070

= argminxisinC

983069〈gkx〉+ 1

αkDh(xx

k)

983070

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 4 26

We often attempt to choose h to better match the geometry of theunderlying constraint set C The goal is roughly to choose astrongly convex function h so that the diameter of C is small in thenorm 983042 middot 983042 (need not be ℓ2 or Euclidean norm) with respect towhich h is strongly convex

Example Gradient descent is mirror descent Consider

h(x) =1

2983042x98304222 Dh(xy) =

1

2983042xminus y98304222

The update of mirror descent is

xk+1 = argminxisinC

983069f(xk) + 〈gkxminus xk〉+ 1

αkDh(xx

k)

983070

= argminxisinC

983069〈gkx〉+ 1

αkDh(xx

k)

983070

= argminxisinC

983069〈gkx〉+ 1

2αk983042xminus xk98304222

983070

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 5 26

Example Solving problems on the probability simplex withexponentiated gradient methods

Consider the negative entropy (h is convex)

h(x) =

n983131

j=1

xj log xj

restricting x to the probability simplex x ge 0

n983131

j=1

xj = 1 Then

Dh(xy) =

n983131

j=1

[xj(log xj minus log yj) + (yj minus xj)] =

n983131

j=1

xj logxjyj

the entropic or Kullback-Leibler divergence

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 6 26

The subproblem in mirror descent is

minx

〈gx〉+n983131

j=1

xj logxjyj

st x ge 0

n983131

j=1

xj = 1

By introducing Lagrange multipliers τ isin R for the equalityconstraint and λ isin Rn

+ for the inequality constraint Then weobtain Lagrangian

L(x τλ) = 〈gx〉+n983131

j=1

983063xj log

xjyj

+ τxj minus λjxj

983064minus τ

Taking derivatives with respect to x and setting them to zero yield

xj = yj exp(minusgj minus 1minus τ + λj)

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 7 26

Taking λj = 0 by

n983131

j=1

xj = 1 we have (a special solution)

xi =yi exp(minusgi)

n983131

j=1

yj exp(minusgj)

The update in mirror descent is

xk+1i =

xki exp(minusαkgki )

n983131

j=1

xkj exp(minusαkgkj )

This is the so-called exponentiated gradient update also known asentropic mirror descent

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 8 26

Example using ℓp norms (p isin (1 2]) Consider

h(x) =1

2(pminus 1)983042x9830422p

We have

nablah(x) =1

pminus 1983042x9830422minusp

p

983045sign(x1)|x1|pminus1 middot middot middot sign(xn)|xn|pminus1

983046T

Defineφ(x) = (pminus 1)nablah(x)

and

ψ(y) = 983042y9830422minusqq

983045sign(y1)|y1|qminus1 middot middot middot sign(yn)|yn|qminus1

983046T

where q =p

pminus 1 We have ψ(φ(x)) = x and hence ψ = φminus1

and φ = ψminus1

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 9 26

The mirror descent for C = Rn is

xk+1 = argminxisinRn

983069f(xk) + 〈gkxminus xk〉+ 1

αkDh(xx

k)

983070

= argminxisinRn

983069〈gkx〉+ 1

αkDh(xx

k)

983070

= (nablah)minus1(nablah(xk)minus αkgk)

= ψ(φ(xk)minus αk(pminus 1)gk)

The update involving the inverse of the gradient mapping (nablah)minus1holds more generally that is for any differentiable and strictlyconvex h

This is the original form of the mirror descent update and itjustifies the name mirror descent as the gradient is ldquomirroredrdquothrough the distance-generating function h and back again

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 10 26

The case C = x isin Rn+

n983131

j=1

xj = 1

The subproblem in mirror descent is

minxisinC

〈vx〉+ 1

2983042x9830422p

wherev = αk(pminus 1)gk minus φ(xk)

Exercise Show by KKT conditions that the solution is given byx = ψ([τeminus v]+) where τ isin R satisfies

n983131

j=1

ψj([τeminus v]+) = 1

Binary search can be used to find τ

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 11 26

Theorem 1

Let αk gt 0 be any sequence of non-increasing stepsizes Assume that his strongly convex with respect to 983042 middot 983042 Let 983042 middot 983042lowast denote the dual normLet xk be generated by the mirror descent iteration If Dh(xx983183) le R2

for all x isin C then for all K ge 1

K983131

k=1

[f(xk)minus f(x983183)] leR2

αK+

K983131

k=1

αk

2983042gk9830422lowast

If αk = α is constant then for all K ge 1K983131

k=1

[f(xk)minus f(x983183)] le1

αDh(x983183x

1) +α

2

K983131

k=1

983042gk9830422lowast

If xK =1

K

K983131

k=1

xk or xK = argminxk

f(xk) 983042g983042lowast le M for all

g isin partf(x) for x isin C then f(xK)minus f(x983183) le1

KαDh(x983183x

1) +α

2M2

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 12 26

Exercise Prove that the negative entropy

h(x) =

n983131

j=1

xj log xj

is 1-strongly convex with respect to the ℓ1-norm

Example Let C = x isin Rn+

983123nj=1 xj = 1 Let x1 =

1

ne The

exponentiated gradient method with fixed stepsize α guarantees

f(xK)minus f(x983183) lelog n

Kα+

α

2K

K983131

k=1

983042gk9830422infin

where xK =1

K

K983131

k=1

xk

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 13 26

Example Let C sub x isin Rn 983042x9830421 le R1 Let h(x) =1

2(pminus 1)983042x9830422p

with p = 1 +1

log(2n) Then

K983131

k=1

[f(xk)minus f(x983183)] le2R2

1 log(2n)

αK+

e2

2

K983131

k=1

αk983042gk9830422infin

In particular taking αk =R1

e

983157log(2n)

kand xK =

1

K

K983131

k=1

xk gives

f(xK)minus f(x983183) le3eR1

983155log(2n)radicK

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 14 26

A simulated mirror-descent example Let

A =983045a1 middot middot middot am

983046T isin Rmtimesn

have entries drawn iid N (0 1) Let

bi =1

2(ai1 + ai2) + εi

where εi drawn iid N (0 10minus2) and n = 3000 m = 20 Define

f(x) = 983042Axminus b9830421 =m983131

i=1

|〈aix〉 minus bi|

which has subgradients ATsign(Axminus b) Consider

minx

f(x) st x isin C =

983099983103

983101x isin Rn+

n983131

j=1

xj = 1

983100983104

983102

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 15 26

We compare the subgradient method to exponentiated gradientdescent for this problem noting that the Euclidean projection of avector v isin Rn to the set C has coordinates

xj = [vj minus τ ]+

where τ isin R is chosen so that

n983131

j=1

xj =

n983131

j=1

[vj minus τ ]+ = 1

We use stepsize αk = α0radick where the initial stepsize α0 is chosen

to optimize the convergence guarantee for each of the methods

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 16 26

fkbest minus f983183 versus k

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 17 26

21 Stochastic version

Theorem 2

Let the conditions of Theorem 1 hold except that instead of receiving avector

gk isin partf(xk)

at iteration k the vector gk is a stochastic subgradient satisfying

E[gk|xk] isin partf(xk)

Then for any non-increasing stepsize sequence αk (where αk may bechosen dependent on g1 middot middot middot gk)

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078le E

983077R2

αK+

K983131

k=1

αk

2983042gk9830422lowast

983078

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 18 26

22 Adaptive stepsizes

Let xK =1

K

K983131

k=1

xk If Dh(xx983183) le R2 for all x isin C then

E[f(xK)minus f(x983183)] le E

983077R2

KαK+

1

K

K983131

k=1

αk

2983042gk9830422lowast

983078

If αk = α for all k we see that the choice of stepsize minimizing

R2

Kα+

α

2K

K983131

k=1

983042gk9830422lowast is α983183 =radic2R

983075K983131

k=1

983042gk9830422lowast

983076minus 12

which yields

E[f(xK)minus f(x983183)] leradic2R

KE

983093

983095983075

K983131

k=1

983042gk9830422lowast

983076 12

983094

983096

But this α983183 is not known in advance because

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 19 26

Corollary 3

Let the conditions of Theorem 2 hold Let

αk =R983161983160983160983159

k983131

i=1

983042gi9830422lowast

which is the ldquoup to nowrdquo optimal choice Then

E[f(xK)minus f(x983183)] le3R

KE

983093

983095983075

K983131

k=1

983042gk9830422lowast

983076 12

983094

983096

where

xK =1

K

K983131

k=1

xk

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 20 26

3 Variable metric methods

The idea is to adjust the metric with which one constructs updatesto better reflect problem structure

1 Choose subgradient gk isin partf(xk) or stochastic subgradient gk

satisfyingE[gk|xk] isin partf(xk)

2 Update positive semidefinite matrix Hk isin Rntimesn

3 Compute update

xk+1 = argminxisinC

983069〈gkx〉+ 1

2〈xminus xkHk(xminus xk)〉

983070

Special cases Subgradient method Hk =1

αkI

Newton method Hk = nabla2f(xk)

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 21 26

Theorem 4

Let Hk be a sequence of positive definite matrices where Hk is afunction of g1 middot middot middot gk (and potentially some additional randomness)Let gk be (stochastic) subgradients with

E[gk|xk] isin partf(xk)

Then

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078

le 1

2E

983077K983131

k=2

983059983042xk minus x9831839830422Hk

minus 983042xk minus x9831839830422Hkminus1

983060+ 983042x1 minus x9831839830422H1

983078

+1

2E

983077K983131

k=1

983042gk9830422Hminus1

k

983078

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 22 26

31 Adaptive gradient variable metric method

Theorem 5

LetRinfin = sup

xisinC983042xminus x983183983042infin

be the ℓinfin radius of the set C and the conditions of Theorem 4 hold Let

Hk =1

α

983075diag

983075k983131

i=1

gigiT

98307698307612

where α gt 0 is a pre-specified constant Then we have

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078le

983061R2

infin2

+ α2

983062E[tr(HK)]

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 23 26

Example Support vector machine classification problem

Consider

minx

f(x) =1

m

m983131

i=1

[1minus bi〈aix〉]+ st 983042x983042infin le 1

where the vectors ai isin minus1 0 1n with m = 5000 and n = 1000

For each coordinate j isin [n] we set aij isin plusmn1 to have a randomsign with probability 1j and aij = 0 otherwise

Let u isin minus1 1n uniformly at random We set bi = sign(〈aiu〉)with probability 095 and bi = minussign(〈aiu〉) otherwiseWe choose a stochastic gradient by selecting i isin [m] uniformly atrandom then set

gk isin part[1minus bi〈aixk〉]+

and αk = αradick

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 24 26

f(xk)minus f983183 versus k100 various initial stepsizes

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 25 26

f(xk)minus f983183 versus k100 best initial stepsize

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 26 26

Page 3: Lecture 11: Mirror Descent and Variable Metric Methodsmath.xmu.edu.cn/group/nona/damc/Lecture11.pdf · Definition: Let h be a differentiable convex function, differentiable on

Definition Let h be a differentiable convex function differentiableon C The Bregman divergence associated with h is defined as

Dh(xy) = h(x)minus h(y)minus 〈nablah(y)xminus y〉

Dh(xy) is always non-negative and convex in its first argument x

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 3 26

h is λ-strongly convex over C with respect to the norm 983042 middot 983042 if

h(x) ge h(y) + 〈nablah(y)xminus y〉+ λ

2983042xminus y9830422 forall xy isin C

Note that strong convexity of h is equivalent to

Dh(xy) geλ

2983042xminus y9830422 forall xy isin C

2 Mirror Descent method

Compute subgradient gk isin partf(xk)

Update

xk+1 = argminxisinC

983069f(xk) + 〈gkxminus xk〉+ 1

αkDh(xx

k)

983070

= argminxisinC

983069〈gkx〉+ 1

αkDh(xx

k)

983070

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 4 26

We often attempt to choose h to better match the geometry of theunderlying constraint set C The goal is roughly to choose astrongly convex function h so that the diameter of C is small in thenorm 983042 middot 983042 (need not be ℓ2 or Euclidean norm) with respect towhich h is strongly convex

Example Gradient descent is mirror descent Consider

h(x) =1

2983042x98304222 Dh(xy) =

1

2983042xminus y98304222

The update of mirror descent is

xk+1 = argminxisinC

983069f(xk) + 〈gkxminus xk〉+ 1

αkDh(xx

k)

983070

= argminxisinC

983069〈gkx〉+ 1

αkDh(xx

k)

983070

= argminxisinC

983069〈gkx〉+ 1

2αk983042xminus xk98304222

983070

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 5 26

Example Solving problems on the probability simplex withexponentiated gradient methods

Consider the negative entropy (h is convex)

h(x) =

n983131

j=1

xj log xj

restricting x to the probability simplex x ge 0

n983131

j=1

xj = 1 Then

Dh(xy) =

n983131

j=1

[xj(log xj minus log yj) + (yj minus xj)] =

n983131

j=1

xj logxjyj

the entropic or Kullback-Leibler divergence

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 6 26

The subproblem in mirror descent is

minx

〈gx〉+n983131

j=1

xj logxjyj

st x ge 0

n983131

j=1

xj = 1

By introducing Lagrange multipliers τ isin R for the equalityconstraint and λ isin Rn

+ for the inequality constraint Then weobtain Lagrangian

L(x τλ) = 〈gx〉+n983131

j=1

983063xj log

xjyj

+ τxj minus λjxj

983064minus τ

Taking derivatives with respect to x and setting them to zero yield

xj = yj exp(minusgj minus 1minus τ + λj)

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 7 26

Taking λj = 0 by

n983131

j=1

xj = 1 we have (a special solution)

xi =yi exp(minusgi)

n983131

j=1

yj exp(minusgj)

The update in mirror descent is

xk+1i =

xki exp(minusαkgki )

n983131

j=1

xkj exp(minusαkgkj )

This is the so-called exponentiated gradient update also known asentropic mirror descent

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 8 26

Example using ℓp norms (p isin (1 2]) Consider

h(x) =1

2(pminus 1)983042x9830422p

We have

nablah(x) =1

pminus 1983042x9830422minusp

p

983045sign(x1)|x1|pminus1 middot middot middot sign(xn)|xn|pminus1

983046T

Defineφ(x) = (pminus 1)nablah(x)

and

ψ(y) = 983042y9830422minusqq

983045sign(y1)|y1|qminus1 middot middot middot sign(yn)|yn|qminus1

983046T

where q =p

pminus 1 We have ψ(φ(x)) = x and hence ψ = φminus1

and φ = ψminus1

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 9 26

The mirror descent for C = Rn is

xk+1 = argminxisinRn

983069f(xk) + 〈gkxminus xk〉+ 1

αkDh(xx

k)

983070

= argminxisinRn

983069〈gkx〉+ 1

αkDh(xx

k)

983070

= (nablah)minus1(nablah(xk)minus αkgk)

= ψ(φ(xk)minus αk(pminus 1)gk)

The update involving the inverse of the gradient mapping (nablah)minus1holds more generally that is for any differentiable and strictlyconvex h

This is the original form of the mirror descent update and itjustifies the name mirror descent as the gradient is ldquomirroredrdquothrough the distance-generating function h and back again

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 10 26

The case C = x isin Rn+

n983131

j=1

xj = 1

The subproblem in mirror descent is

minxisinC

〈vx〉+ 1

2983042x9830422p

wherev = αk(pminus 1)gk minus φ(xk)

Exercise Show by KKT conditions that the solution is given byx = ψ([τeminus v]+) where τ isin R satisfies

n983131

j=1

ψj([τeminus v]+) = 1

Binary search can be used to find τ

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 11 26

Theorem 1

Let αk gt 0 be any sequence of non-increasing stepsizes Assume that his strongly convex with respect to 983042 middot 983042 Let 983042 middot 983042lowast denote the dual normLet xk be generated by the mirror descent iteration If Dh(xx983183) le R2

for all x isin C then for all K ge 1

K983131

k=1

[f(xk)minus f(x983183)] leR2

αK+

K983131

k=1

αk

2983042gk9830422lowast

If αk = α is constant then for all K ge 1K983131

k=1

[f(xk)minus f(x983183)] le1

αDh(x983183x

1) +α

2

K983131

k=1

983042gk9830422lowast

If xK =1

K

K983131

k=1

xk or xK = argminxk

f(xk) 983042g983042lowast le M for all

g isin partf(x) for x isin C then f(xK)minus f(x983183) le1

KαDh(x983183x

1) +α

2M2

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 12 26

Exercise Prove that the negative entropy

h(x) =

n983131

j=1

xj log xj

is 1-strongly convex with respect to the ℓ1-norm

Example Let C = x isin Rn+

983123nj=1 xj = 1 Let x1 =

1

ne The

exponentiated gradient method with fixed stepsize α guarantees

f(xK)minus f(x983183) lelog n

Kα+

α

2K

K983131

k=1

983042gk9830422infin

where xK =1

K

K983131

k=1

xk

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 13 26

Example Let C sub x isin Rn 983042x9830421 le R1 Let h(x) =1

2(pminus 1)983042x9830422p

with p = 1 +1

log(2n) Then

K983131

k=1

[f(xk)minus f(x983183)] le2R2

1 log(2n)

αK+

e2

2

K983131

k=1

αk983042gk9830422infin

In particular taking αk =R1

e

983157log(2n)

kand xK =

1

K

K983131

k=1

xk gives

f(xK)minus f(x983183) le3eR1

983155log(2n)radicK

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 14 26

A simulated mirror-descent example Let

A =983045a1 middot middot middot am

983046T isin Rmtimesn

have entries drawn iid N (0 1) Let

bi =1

2(ai1 + ai2) + εi

where εi drawn iid N (0 10minus2) and n = 3000 m = 20 Define

f(x) = 983042Axminus b9830421 =m983131

i=1

|〈aix〉 minus bi|

which has subgradients ATsign(Axminus b) Consider

minx

f(x) st x isin C =

983099983103

983101x isin Rn+

n983131

j=1

xj = 1

983100983104

983102

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 15 26

We compare the subgradient method to exponentiated gradientdescent for this problem noting that the Euclidean projection of avector v isin Rn to the set C has coordinates

xj = [vj minus τ ]+

where τ isin R is chosen so that

n983131

j=1

xj =

n983131

j=1

[vj minus τ ]+ = 1

We use stepsize αk = α0radick where the initial stepsize α0 is chosen

to optimize the convergence guarantee for each of the methods

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 16 26

fkbest minus f983183 versus k

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 17 26

21 Stochastic version

Theorem 2

Let the conditions of Theorem 1 hold except that instead of receiving avector

gk isin partf(xk)

at iteration k the vector gk is a stochastic subgradient satisfying

E[gk|xk] isin partf(xk)

Then for any non-increasing stepsize sequence αk (where αk may bechosen dependent on g1 middot middot middot gk)

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078le E

983077R2

αK+

K983131

k=1

αk

2983042gk9830422lowast

983078

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 18 26

22 Adaptive stepsizes

Let xK =1

K

K983131

k=1

xk If Dh(xx983183) le R2 for all x isin C then

E[f(xK)minus f(x983183)] le E

983077R2

KαK+

1

K

K983131

k=1

αk

2983042gk9830422lowast

983078

If αk = α for all k we see that the choice of stepsize minimizing

R2

Kα+

α

2K

K983131

k=1

983042gk9830422lowast is α983183 =radic2R

983075K983131

k=1

983042gk9830422lowast

983076minus 12

which yields

E[f(xK)minus f(x983183)] leradic2R

KE

983093

983095983075

K983131

k=1

983042gk9830422lowast

983076 12

983094

983096

But this α983183 is not known in advance because

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 19 26

Corollary 3

Let the conditions of Theorem 2 hold Let

αk =R983161983160983160983159

k983131

i=1

983042gi9830422lowast

which is the ldquoup to nowrdquo optimal choice Then

E[f(xK)minus f(x983183)] le3R

KE

983093

983095983075

K983131

k=1

983042gk9830422lowast

983076 12

983094

983096

where

xK =1

K

K983131

k=1

xk

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 20 26

3 Variable metric methods

The idea is to adjust the metric with which one constructs updatesto better reflect problem structure

1 Choose subgradient gk isin partf(xk) or stochastic subgradient gk

satisfyingE[gk|xk] isin partf(xk)

2 Update positive semidefinite matrix Hk isin Rntimesn

3 Compute update

xk+1 = argminxisinC

983069〈gkx〉+ 1

2〈xminus xkHk(xminus xk)〉

983070

Special cases Subgradient method Hk =1

αkI

Newton method Hk = nabla2f(xk)

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 21 26

Theorem 4

Let Hk be a sequence of positive definite matrices where Hk is afunction of g1 middot middot middot gk (and potentially some additional randomness)Let gk be (stochastic) subgradients with

E[gk|xk] isin partf(xk)

Then

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078

le 1

2E

983077K983131

k=2

983059983042xk minus x9831839830422Hk

minus 983042xk minus x9831839830422Hkminus1

983060+ 983042x1 minus x9831839830422H1

983078

+1

2E

983077K983131

k=1

983042gk9830422Hminus1

k

983078

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 22 26

31 Adaptive gradient variable metric method

Theorem 5

LetRinfin = sup

xisinC983042xminus x983183983042infin

be the ℓinfin radius of the set C and the conditions of Theorem 4 hold Let

Hk =1

α

983075diag

983075k983131

i=1

gigiT

98307698307612

where α gt 0 is a pre-specified constant Then we have

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078le

983061R2

infin2

+ α2

983062E[tr(HK)]

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 23 26

Example Support vector machine classification problem

Consider

minx

f(x) =1

m

m983131

i=1

[1minus bi〈aix〉]+ st 983042x983042infin le 1

where the vectors ai isin minus1 0 1n with m = 5000 and n = 1000

For each coordinate j isin [n] we set aij isin plusmn1 to have a randomsign with probability 1j and aij = 0 otherwise

Let u isin minus1 1n uniformly at random We set bi = sign(〈aiu〉)with probability 095 and bi = minussign(〈aiu〉) otherwiseWe choose a stochastic gradient by selecting i isin [m] uniformly atrandom then set

gk isin part[1minus bi〈aixk〉]+

and αk = αradick

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 24 26

f(xk)minus f983183 versus k100 various initial stepsizes

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 25 26

f(xk)minus f983183 versus k100 best initial stepsize

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 26 26

Page 4: Lecture 11: Mirror Descent and Variable Metric Methodsmath.xmu.edu.cn/group/nona/damc/Lecture11.pdf · Definition: Let h be a differentiable convex function, differentiable on

h is λ-strongly convex over C with respect to the norm 983042 middot 983042 if

h(x) ge h(y) + 〈nablah(y)xminus y〉+ λ

2983042xminus y9830422 forall xy isin C

Note that strong convexity of h is equivalent to

Dh(xy) geλ

2983042xminus y9830422 forall xy isin C

2 Mirror Descent method

Compute subgradient gk isin partf(xk)

Update

xk+1 = argminxisinC

983069f(xk) + 〈gkxminus xk〉+ 1

αkDh(xx

k)

983070

= argminxisinC

983069〈gkx〉+ 1

αkDh(xx

k)

983070

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 4 26

We often attempt to choose h to better match the geometry of theunderlying constraint set C The goal is roughly to choose astrongly convex function h so that the diameter of C is small in thenorm 983042 middot 983042 (need not be ℓ2 or Euclidean norm) with respect towhich h is strongly convex

Example Gradient descent is mirror descent Consider

h(x) =1

2983042x98304222 Dh(xy) =

1

2983042xminus y98304222

The update of mirror descent is

xk+1 = argminxisinC

983069f(xk) + 〈gkxminus xk〉+ 1

αkDh(xx

k)

983070

= argminxisinC

983069〈gkx〉+ 1

αkDh(xx

k)

983070

= argminxisinC

983069〈gkx〉+ 1

2αk983042xminus xk98304222

983070

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 5 26

Example Solving problems on the probability simplex withexponentiated gradient methods

Consider the negative entropy (h is convex)

h(x) =

n983131

j=1

xj log xj

restricting x to the probability simplex x ge 0

n983131

j=1

xj = 1 Then

Dh(xy) =

n983131

j=1

[xj(log xj minus log yj) + (yj minus xj)] =

n983131

j=1

xj logxjyj

the entropic or Kullback-Leibler divergence

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 6 26

The subproblem in mirror descent is

minx

〈gx〉+n983131

j=1

xj logxjyj

st x ge 0

n983131

j=1

xj = 1

By introducing Lagrange multipliers τ isin R for the equalityconstraint and λ isin Rn

+ for the inequality constraint Then weobtain Lagrangian

L(x τλ) = 〈gx〉+n983131

j=1

983063xj log

xjyj

+ τxj minus λjxj

983064minus τ

Taking derivatives with respect to x and setting them to zero yield

xj = yj exp(minusgj minus 1minus τ + λj)

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 7 26

Taking λj = 0 by

n983131

j=1

xj = 1 we have (a special solution)

xi =yi exp(minusgi)

n983131

j=1

yj exp(minusgj)

The update in mirror descent is

xk+1i =

xki exp(minusαkgki )

n983131

j=1

xkj exp(minusαkgkj )

This is the so-called exponentiated gradient update also known asentropic mirror descent

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 8 26

Example using ℓp norms (p isin (1 2]) Consider

h(x) =1

2(pminus 1)983042x9830422p

We have

nablah(x) =1

pminus 1983042x9830422minusp

p

983045sign(x1)|x1|pminus1 middot middot middot sign(xn)|xn|pminus1

983046T

Defineφ(x) = (pminus 1)nablah(x)

and

ψ(y) = 983042y9830422minusqq

983045sign(y1)|y1|qminus1 middot middot middot sign(yn)|yn|qminus1

983046T

where q =p

pminus 1 We have ψ(φ(x)) = x and hence ψ = φminus1

and φ = ψminus1

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 9 26

The mirror descent for C = Rn is

xk+1 = argminxisinRn

983069f(xk) + 〈gkxminus xk〉+ 1

αkDh(xx

k)

983070

= argminxisinRn

983069〈gkx〉+ 1

αkDh(xx

k)

983070

= (nablah)minus1(nablah(xk)minus αkgk)

= ψ(φ(xk)minus αk(pminus 1)gk)

The update involving the inverse of the gradient mapping (nablah)minus1holds more generally that is for any differentiable and strictlyconvex h

This is the original form of the mirror descent update and itjustifies the name mirror descent as the gradient is ldquomirroredrdquothrough the distance-generating function h and back again

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 10 26

The case C = x isin Rn+

n983131

j=1

xj = 1

The subproblem in mirror descent is

minxisinC

〈vx〉+ 1

2983042x9830422p

wherev = αk(pminus 1)gk minus φ(xk)

Exercise Show by KKT conditions that the solution is given byx = ψ([τeminus v]+) where τ isin R satisfies

n983131

j=1

ψj([τeminus v]+) = 1

Binary search can be used to find τ

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 11 26

Theorem 1

Let αk gt 0 be any sequence of non-increasing stepsizes Assume that his strongly convex with respect to 983042 middot 983042 Let 983042 middot 983042lowast denote the dual normLet xk be generated by the mirror descent iteration If Dh(xx983183) le R2

for all x isin C then for all K ge 1

K983131

k=1

[f(xk)minus f(x983183)] leR2

αK+

K983131

k=1

αk

2983042gk9830422lowast

If αk = α is constant then for all K ge 1K983131

k=1

[f(xk)minus f(x983183)] le1

αDh(x983183x

1) +α

2

K983131

k=1

983042gk9830422lowast

If xK =1

K

K983131

k=1

xk or xK = argminxk

f(xk) 983042g983042lowast le M for all

g isin partf(x) for x isin C then f(xK)minus f(x983183) le1

KαDh(x983183x

1) +α

2M2

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 12 26

Exercise Prove that the negative entropy

h(x) =

n983131

j=1

xj log xj

is 1-strongly convex with respect to the ℓ1-norm

Example Let C = x isin Rn+

983123nj=1 xj = 1 Let x1 =

1

ne The

exponentiated gradient method with fixed stepsize α guarantees

f(xK)minus f(x983183) lelog n

Kα+

α

2K

K983131

k=1

983042gk9830422infin

where xK =1

K

K983131

k=1

xk

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 13 26

Example Let C sub x isin Rn 983042x9830421 le R1 Let h(x) =1

2(pminus 1)983042x9830422p

with p = 1 +1

log(2n) Then

K983131

k=1

[f(xk)minus f(x983183)] le2R2

1 log(2n)

αK+

e2

2

K983131

k=1

αk983042gk9830422infin

In particular taking αk =R1

e

983157log(2n)

kand xK =

1

K

K983131

k=1

xk gives

f(xK)minus f(x983183) le3eR1

983155log(2n)radicK

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 14 26

A simulated mirror-descent example Let

A =983045a1 middot middot middot am

983046T isin Rmtimesn

have entries drawn iid N (0 1) Let

bi =1

2(ai1 + ai2) + εi

where εi drawn iid N (0 10minus2) and n = 3000 m = 20 Define

f(x) = 983042Axminus b9830421 =m983131

i=1

|〈aix〉 minus bi|

which has subgradients ATsign(Axminus b) Consider

minx

f(x) st x isin C =

983099983103

983101x isin Rn+

n983131

j=1

xj = 1

983100983104

983102

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 15 26

We compare the subgradient method to exponentiated gradientdescent for this problem noting that the Euclidean projection of avector v isin Rn to the set C has coordinates

xj = [vj minus τ ]+

where τ isin R is chosen so that

n983131

j=1

xj =

n983131

j=1

[vj minus τ ]+ = 1

We use stepsize αk = α0radick where the initial stepsize α0 is chosen

to optimize the convergence guarantee for each of the methods

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 16 26

fkbest minus f983183 versus k

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 17 26

21 Stochastic version

Theorem 2

Let the conditions of Theorem 1 hold except that instead of receiving avector

gk isin partf(xk)

at iteration k the vector gk is a stochastic subgradient satisfying

E[gk|xk] isin partf(xk)

Then for any non-increasing stepsize sequence αk (where αk may bechosen dependent on g1 middot middot middot gk)

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078le E

983077R2

αK+

K983131

k=1

αk

2983042gk9830422lowast

983078

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 18 26

22 Adaptive stepsizes

Let xK =1

K

K983131

k=1

xk If Dh(xx983183) le R2 for all x isin C then

E[f(xK)minus f(x983183)] le E

983077R2

KαK+

1

K

K983131

k=1

αk

2983042gk9830422lowast

983078

If αk = α for all k we see that the choice of stepsize minimizing

R2

Kα+

α

2K

K983131

k=1

983042gk9830422lowast is α983183 =radic2R

983075K983131

k=1

983042gk9830422lowast

983076minus 12

which yields

E[f(xK)minus f(x983183)] leradic2R

KE

983093

983095983075

K983131

k=1

983042gk9830422lowast

983076 12

983094

983096

But this α983183 is not known in advance because

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 19 26

Corollary 3

Let the conditions of Theorem 2 hold Let

αk =R983161983160983160983159

k983131

i=1

983042gi9830422lowast

which is the ldquoup to nowrdquo optimal choice Then

E[f(xK)minus f(x983183)] le3R

KE

983093

983095983075

K983131

k=1

983042gk9830422lowast

983076 12

983094

983096

where

xK =1

K

K983131

k=1

xk

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 20 26

3 Variable metric methods

The idea is to adjust the metric with which one constructs updatesto better reflect problem structure

1 Choose subgradient gk isin partf(xk) or stochastic subgradient gk

satisfyingE[gk|xk] isin partf(xk)

2 Update positive semidefinite matrix Hk isin Rntimesn

3 Compute update

xk+1 = argminxisinC

983069〈gkx〉+ 1

2〈xminus xkHk(xminus xk)〉

983070

Special cases Subgradient method Hk =1

αkI

Newton method Hk = nabla2f(xk)

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 21 26

Theorem 4

Let Hk be a sequence of positive definite matrices where Hk is afunction of g1 middot middot middot gk (and potentially some additional randomness)Let gk be (stochastic) subgradients with

E[gk|xk] isin partf(xk)

Then

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078

le 1

2E

983077K983131

k=2

983059983042xk minus x9831839830422Hk

minus 983042xk minus x9831839830422Hkminus1

983060+ 983042x1 minus x9831839830422H1

983078

+1

2E

983077K983131

k=1

983042gk9830422Hminus1

k

983078

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 22 26

31 Adaptive gradient variable metric method

Theorem 5

LetRinfin = sup

xisinC983042xminus x983183983042infin

be the ℓinfin radius of the set C and the conditions of Theorem 4 hold Let

Hk =1

α

983075diag

983075k983131

i=1

gigiT

98307698307612

where α gt 0 is a pre-specified constant Then we have

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078le

983061R2

infin2

+ α2

983062E[tr(HK)]

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 23 26

Example Support vector machine classification problem

Consider

minx

f(x) =1

m

m983131

i=1

[1minus bi〈aix〉]+ st 983042x983042infin le 1

where the vectors ai isin minus1 0 1n with m = 5000 and n = 1000

For each coordinate j isin [n] we set aij isin plusmn1 to have a randomsign with probability 1j and aij = 0 otherwise

Let u isin minus1 1n uniformly at random We set bi = sign(〈aiu〉)with probability 095 and bi = minussign(〈aiu〉) otherwiseWe choose a stochastic gradient by selecting i isin [m] uniformly atrandom then set

gk isin part[1minus bi〈aixk〉]+

and αk = αradick

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 24 26

f(xk)minus f983183 versus k100 various initial stepsizes

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 25 26

f(xk)minus f983183 versus k100 best initial stepsize

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 26 26

Page 5: Lecture 11: Mirror Descent and Variable Metric Methodsmath.xmu.edu.cn/group/nona/damc/Lecture11.pdf · Definition: Let h be a differentiable convex function, differentiable on

We often attempt to choose h to better match the geometry of theunderlying constraint set C The goal is roughly to choose astrongly convex function h so that the diameter of C is small in thenorm 983042 middot 983042 (need not be ℓ2 or Euclidean norm) with respect towhich h is strongly convex

Example Gradient descent is mirror descent Consider

h(x) =1

2983042x98304222 Dh(xy) =

1

2983042xminus y98304222

The update of mirror descent is

xk+1 = argminxisinC

983069f(xk) + 〈gkxminus xk〉+ 1

αkDh(xx

k)

983070

= argminxisinC

983069〈gkx〉+ 1

αkDh(xx

k)

983070

= argminxisinC

983069〈gkx〉+ 1

2αk983042xminus xk98304222

983070

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 5 26

Example Solving problems on the probability simplex withexponentiated gradient methods

Consider the negative entropy (h is convex)

h(x) =

n983131

j=1

xj log xj

restricting x to the probability simplex x ge 0

n983131

j=1

xj = 1 Then

Dh(xy) =

n983131

j=1

[xj(log xj minus log yj) + (yj minus xj)] =

n983131

j=1

xj logxjyj

the entropic or Kullback-Leibler divergence

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 6 26

The subproblem in mirror descent is

minx

〈gx〉+n983131

j=1

xj logxjyj

st x ge 0

n983131

j=1

xj = 1

By introducing Lagrange multipliers τ isin R for the equalityconstraint and λ isin Rn

+ for the inequality constraint Then weobtain Lagrangian

L(x τλ) = 〈gx〉+n983131

j=1

983063xj log

xjyj

+ τxj minus λjxj

983064minus τ

Taking derivatives with respect to x and setting them to zero yield

xj = yj exp(minusgj minus 1minus τ + λj)

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 7 26

Taking λj = 0 by

n983131

j=1

xj = 1 we have (a special solution)

xi =yi exp(minusgi)

n983131

j=1

yj exp(minusgj)

The update in mirror descent is

xk+1i =

xki exp(minusαkgki )

n983131

j=1

xkj exp(minusαkgkj )

This is the so-called exponentiated gradient update also known asentropic mirror descent

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 8 26

Example using ℓp norms (p isin (1 2]) Consider

h(x) =1

2(pminus 1)983042x9830422p

We have

nablah(x) =1

pminus 1983042x9830422minusp

p

983045sign(x1)|x1|pminus1 middot middot middot sign(xn)|xn|pminus1

983046T

Defineφ(x) = (pminus 1)nablah(x)

and

ψ(y) = 983042y9830422minusqq

983045sign(y1)|y1|qminus1 middot middot middot sign(yn)|yn|qminus1

983046T

where q =p

pminus 1 We have ψ(φ(x)) = x and hence ψ = φminus1

and φ = ψminus1

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 9 26

The mirror descent for C = Rn is

xk+1 = argminxisinRn

983069f(xk) + 〈gkxminus xk〉+ 1

αkDh(xx

k)

983070

= argminxisinRn

983069〈gkx〉+ 1

αkDh(xx

k)

983070

= (nablah)minus1(nablah(xk)minus αkgk)

= ψ(φ(xk)minus αk(pminus 1)gk)

The update involving the inverse of the gradient mapping (nablah)minus1holds more generally that is for any differentiable and strictlyconvex h

This is the original form of the mirror descent update and itjustifies the name mirror descent as the gradient is ldquomirroredrdquothrough the distance-generating function h and back again

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 10 26

The case C = x isin Rn+

n983131

j=1

xj = 1

The subproblem in mirror descent is

minxisinC

〈vx〉+ 1

2983042x9830422p

wherev = αk(pminus 1)gk minus φ(xk)

Exercise Show by KKT conditions that the solution is given byx = ψ([τeminus v]+) where τ isin R satisfies

n983131

j=1

ψj([τeminus v]+) = 1

Binary search can be used to find τ

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 11 26

Theorem 1

Let αk gt 0 be any sequence of non-increasing stepsizes Assume that his strongly convex with respect to 983042 middot 983042 Let 983042 middot 983042lowast denote the dual normLet xk be generated by the mirror descent iteration If Dh(xx983183) le R2

for all x isin C then for all K ge 1

K983131

k=1

[f(xk)minus f(x983183)] leR2

αK+

K983131

k=1

αk

2983042gk9830422lowast

If αk = α is constant then for all K ge 1K983131

k=1

[f(xk)minus f(x983183)] le1

αDh(x983183x

1) +α

2

K983131

k=1

983042gk9830422lowast

If xK =1

K

K983131

k=1

xk or xK = argminxk

f(xk) 983042g983042lowast le M for all

g isin partf(x) for x isin C then f(xK)minus f(x983183) le1

KαDh(x983183x

1) +α

2M2

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 12 26

Exercise Prove that the negative entropy

h(x) =

n983131

j=1

xj log xj

is 1-strongly convex with respect to the ℓ1-norm

Example Let C = x isin Rn+

983123nj=1 xj = 1 Let x1 =

1

ne The

exponentiated gradient method with fixed stepsize α guarantees

f(xK)minus f(x983183) lelog n

Kα+

α

2K

K983131

k=1

983042gk9830422infin

where xK =1

K

K983131

k=1

xk

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 13 26

Example Let C sub x isin Rn 983042x9830421 le R1 Let h(x) =1

2(pminus 1)983042x9830422p

with p = 1 +1

log(2n) Then

K983131

k=1

[f(xk)minus f(x983183)] le2R2

1 log(2n)

αK+

e2

2

K983131

k=1

αk983042gk9830422infin

In particular taking αk =R1

e

983157log(2n)

kand xK =

1

K

K983131

k=1

xk gives

f(xK)minus f(x983183) le3eR1

983155log(2n)radicK

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 14 26

A simulated mirror-descent example Let

A =983045a1 middot middot middot am

983046T isin Rmtimesn

have entries drawn iid N (0 1) Let

bi =1

2(ai1 + ai2) + εi

where εi drawn iid N (0 10minus2) and n = 3000 m = 20 Define

f(x) = 983042Axminus b9830421 =m983131

i=1

|〈aix〉 minus bi|

which has subgradients ATsign(Axminus b) Consider

minx

f(x) st x isin C =

983099983103

983101x isin Rn+

n983131

j=1

xj = 1

983100983104

983102

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 15 26

We compare the subgradient method to exponentiated gradientdescent for this problem noting that the Euclidean projection of avector v isin Rn to the set C has coordinates

xj = [vj minus τ ]+

where τ isin R is chosen so that

n983131

j=1

xj =

n983131

j=1

[vj minus τ ]+ = 1

We use stepsize αk = α0radick where the initial stepsize α0 is chosen

to optimize the convergence guarantee for each of the methods

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 16 26

fkbest minus f983183 versus k

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 17 26

21 Stochastic version

Theorem 2

Let the conditions of Theorem 1 hold except that instead of receiving avector

gk isin partf(xk)

at iteration k the vector gk is a stochastic subgradient satisfying

E[gk|xk] isin partf(xk)

Then for any non-increasing stepsize sequence αk (where αk may bechosen dependent on g1 middot middot middot gk)

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078le E

983077R2

αK+

K983131

k=1

αk

2983042gk9830422lowast

983078

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 18 26

22 Adaptive stepsizes

Let xK =1

K

K983131

k=1

xk If Dh(xx983183) le R2 for all x isin C then

E[f(xK)minus f(x983183)] le E

983077R2

KαK+

1

K

K983131

k=1

αk

2983042gk9830422lowast

983078

If αk = α for all k we see that the choice of stepsize minimizing

R2

Kα+

α

2K

K983131

k=1

983042gk9830422lowast is α983183 =radic2R

983075K983131

k=1

983042gk9830422lowast

983076minus 12

which yields

E[f(xK)minus f(x983183)] leradic2R

KE

983093

983095983075

K983131

k=1

983042gk9830422lowast

983076 12

983094

983096

But this α983183 is not known in advance because

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 19 26

Corollary 3

Let the conditions of Theorem 2 hold Let

αk =R983161983160983160983159

k983131

i=1

983042gi9830422lowast

which is the ldquoup to nowrdquo optimal choice Then

E[f(xK)minus f(x983183)] le3R

KE

983093

983095983075

K983131

k=1

983042gk9830422lowast

983076 12

983094

983096

where

xK =1

K

K983131

k=1

xk

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 20 26

3 Variable metric methods

The idea is to adjust the metric with which one constructs updatesto better reflect problem structure

1 Choose subgradient gk isin partf(xk) or stochastic subgradient gk

satisfyingE[gk|xk] isin partf(xk)

2 Update positive semidefinite matrix Hk isin Rntimesn

3 Compute update

xk+1 = argminxisinC

983069〈gkx〉+ 1

2〈xminus xkHk(xminus xk)〉

983070

Special cases Subgradient method Hk =1

αkI

Newton method Hk = nabla2f(xk)

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 21 26

Theorem 4

Let Hk be a sequence of positive definite matrices where Hk is afunction of g1 middot middot middot gk (and potentially some additional randomness)Let gk be (stochastic) subgradients with

E[gk|xk] isin partf(xk)

Then

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078

le 1

2E

983077K983131

k=2

983059983042xk minus x9831839830422Hk

minus 983042xk minus x9831839830422Hkminus1

983060+ 983042x1 minus x9831839830422H1

983078

+1

2E

983077K983131

k=1

983042gk9830422Hminus1

k

983078

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 22 26

31 Adaptive gradient variable metric method

Theorem 5

LetRinfin = sup

xisinC983042xminus x983183983042infin

be the ℓinfin radius of the set C and the conditions of Theorem 4 hold Let

Hk =1

α

983075diag

983075k983131

i=1

gigiT

98307698307612

where α gt 0 is a pre-specified constant Then we have

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078le

983061R2

infin2

+ α2

983062E[tr(HK)]

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 23 26

Example: Support vector machine classification problem.

Consider

    min_x  f(x) = \frac{1}{m} \sum_{i=1}^m [1 - b_i \langle a_i, x \rangle]_+   s.t.  \|x\|_\infty \le 1,

where the vectors a_i \in \{-1, 0, 1\}^n, with m = 5000 and n = 1000.

For each coordinate j \in [n], we set a_{ij} \in \{\pm 1\} to have a random sign with probability 1/j, and a_{ij} = 0 otherwise.

Let u \in \{-1, 1\}^n be uniform at random. We set b_i = sign(\langle a_i, u \rangle) with probability 0.95 and b_i = -sign(\langle a_i, u \rangle) otherwise.

We choose a stochastic gradient by selecting i \in [m] uniformly at random, then setting

    g^k \in \partial [1 - b_i \langle a_i, x^k \rangle]_+

and \alpha_k = \alpha / \sqrt{k}.

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 24 26
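A sketch of one way to generate this synthetic data and form the stochastic subgradient; the function names and the random seed are illustrative assumptions, and the label noise is written as a 5% flip to match the 0.95 probability above:

    import numpy as np

    def make_svm_data(m=5000, n=1000, flip=0.05, seed=0):
        # a_ij is +/-1 with probability 1/j (for column j = 1, ..., n) and 0
        # otherwise; b_i agrees with sign(<a_i, u>) except for a `flip` fraction.
        rng = np.random.default_rng(seed)
        j = np.arange(1, n + 1)
        active = rng.random((m, n)) < 1.0 / j
        signs = rng.choice([-1.0, 1.0], size=(m, n))
        A = active * signs
        u = rng.choice([-1.0, 1.0], size=n)
        b = np.sign(A @ u)
        b[b == 0] = 1.0                        # break the rare zero inner products
        b[rng.random(m) < flip] *= -1.0        # mislabel about 5% of the examples
        return A, b

    def hinge_stochastic_subgrad(x, A, b, rng):
        # Pick i uniformly at random; -b_i * a_i is a subgradient of
        # [1 - b_i <a_i, x>]_+ when the hinge is active, and 0 is valid otherwise.
        i = rng.integers(A.shape[0])
        if 1.0 - b[i] * (A[i] @ x) > 0:
            return -b[i] * A[i]
        return np.zeros_like(x)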

f(x^k) - f_\star versus k/100, various initial stepsizes

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 25 26

f(x^k) - f_\star versus k/100, best initial stepsize

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 26 26

Page 6: Lecture 11: Mirror Descent and Variable Metric Methodsmath.xmu.edu.cn/group/nona/damc/Lecture11.pdf · Definition: Let h be a differentiable convex function, differentiable on

Example Solving problems on the probability simplex withexponentiated gradient methods

Consider the negative entropy (h is convex)

h(x) =

n983131

j=1

xj log xj

restricting x to the probability simplex x ge 0

n983131

j=1

xj = 1 Then

Dh(xy) =

n983131

j=1

[xj(log xj minus log yj) + (yj minus xj)] =

n983131

j=1

xj logxjyj

the entropic or Kullback-Leibler divergence

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 6 26

The subproblem in mirror descent is

minx

〈gx〉+n983131

j=1

xj logxjyj

st x ge 0

n983131

j=1

xj = 1

By introducing Lagrange multipliers τ isin R for the equalityconstraint and λ isin Rn

+ for the inequality constraint Then weobtain Lagrangian

L(x τλ) = 〈gx〉+n983131

j=1

983063xj log

xjyj

+ τxj minus λjxj

983064minus τ

Taking derivatives with respect to x and setting them to zero yield

xj = yj exp(minusgj minus 1minus τ + λj)

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 7 26

Taking λj = 0 by

n983131

j=1

xj = 1 we have (a special solution)

xi =yi exp(minusgi)

n983131

j=1

yj exp(minusgj)

The update in mirror descent is

xk+1i =

xki exp(minusαkgki )

n983131

j=1

xkj exp(minusαkgkj )

This is the so-called exponentiated gradient update also known asentropic mirror descent

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 8 26

Example using ℓp norms (p isin (1 2]) Consider

h(x) =1

2(pminus 1)983042x9830422p

We have

nablah(x) =1

pminus 1983042x9830422minusp

p

983045sign(x1)|x1|pminus1 middot middot middot sign(xn)|xn|pminus1

983046T

Defineφ(x) = (pminus 1)nablah(x)

and

ψ(y) = 983042y9830422minusqq

983045sign(y1)|y1|qminus1 middot middot middot sign(yn)|yn|qminus1

983046T

where q =p

pminus 1 We have ψ(φ(x)) = x and hence ψ = φminus1

and φ = ψminus1

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 9 26

The mirror descent for C = Rn is

xk+1 = argminxisinRn

983069f(xk) + 〈gkxminus xk〉+ 1

αkDh(xx

k)

983070

= argminxisinRn

983069〈gkx〉+ 1

αkDh(xx

k)

983070

= (nablah)minus1(nablah(xk)minus αkgk)

= ψ(φ(xk)minus αk(pminus 1)gk)

The update involving the inverse of the gradient mapping (nablah)minus1holds more generally that is for any differentiable and strictlyconvex h

This is the original form of the mirror descent update and itjustifies the name mirror descent as the gradient is ldquomirroredrdquothrough the distance-generating function h and back again

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 10 26

The case C = x isin Rn+

n983131

j=1

xj = 1

The subproblem in mirror descent is

minxisinC

〈vx〉+ 1

2983042x9830422p

wherev = αk(pminus 1)gk minus φ(xk)

Exercise Show by KKT conditions that the solution is given byx = ψ([τeminus v]+) where τ isin R satisfies

n983131

j=1

ψj([τeminus v]+) = 1

Binary search can be used to find τ

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 11 26

Theorem 1

Let αk gt 0 be any sequence of non-increasing stepsizes Assume that his strongly convex with respect to 983042 middot 983042 Let 983042 middot 983042lowast denote the dual normLet xk be generated by the mirror descent iteration If Dh(xx983183) le R2

for all x isin C then for all K ge 1

K983131

k=1

[f(xk)minus f(x983183)] leR2

αK+

K983131

k=1

αk

2983042gk9830422lowast

If αk = α is constant then for all K ge 1K983131

k=1

[f(xk)minus f(x983183)] le1

αDh(x983183x

1) +α

2

K983131

k=1

983042gk9830422lowast

If xK =1

K

K983131

k=1

xk or xK = argminxk

f(xk) 983042g983042lowast le M for all

g isin partf(x) for x isin C then f(xK)minus f(x983183) le1

KαDh(x983183x

1) +α

2M2

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 12 26

Exercise Prove that the negative entropy

h(x) =

n983131

j=1

xj log xj

is 1-strongly convex with respect to the ℓ1-norm

Example Let C = x isin Rn+

983123nj=1 xj = 1 Let x1 =

1

ne The

exponentiated gradient method with fixed stepsize α guarantees

f(xK)minus f(x983183) lelog n

Kα+

α

2K

K983131

k=1

983042gk9830422infin

where xK =1

K

K983131

k=1

xk

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 13 26

Example Let C sub x isin Rn 983042x9830421 le R1 Let h(x) =1

2(pminus 1)983042x9830422p

with p = 1 +1

log(2n) Then

K983131

k=1

[f(xk)minus f(x983183)] le2R2

1 log(2n)

αK+

e2

2

K983131

k=1

αk983042gk9830422infin

In particular taking αk =R1

e

983157log(2n)

kand xK =

1

K

K983131

k=1

xk gives

f(xK)minus f(x983183) le3eR1

983155log(2n)radicK

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 14 26

A simulated mirror-descent example Let

A =983045a1 middot middot middot am

983046T isin Rmtimesn

have entries drawn iid N (0 1) Let

bi =1

2(ai1 + ai2) + εi

where εi drawn iid N (0 10minus2) and n = 3000 m = 20 Define

f(x) = 983042Axminus b9830421 =m983131

i=1

|〈aix〉 minus bi|

which has subgradients ATsign(Axminus b) Consider

minx

f(x) st x isin C =

983099983103

983101x isin Rn+

n983131

j=1

xj = 1

983100983104

983102

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 15 26

We compare the subgradient method to exponentiated gradientdescent for this problem noting that the Euclidean projection of avector v isin Rn to the set C has coordinates

xj = [vj minus τ ]+

where τ isin R is chosen so that

n983131

j=1

xj =

n983131

j=1

[vj minus τ ]+ = 1

We use stepsize αk = α0radick where the initial stepsize α0 is chosen

to optimize the convergence guarantee for each of the methods

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 16 26

fkbest minus f983183 versus k

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 17 26

21 Stochastic version

Theorem 2

Let the conditions of Theorem 1 hold except that instead of receiving avector

gk isin partf(xk)

at iteration k the vector gk is a stochastic subgradient satisfying

E[gk|xk] isin partf(xk)

Then for any non-increasing stepsize sequence αk (where αk may bechosen dependent on g1 middot middot middot gk)

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078le E

983077R2

αK+

K983131

k=1

αk

2983042gk9830422lowast

983078

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 18 26

22 Adaptive stepsizes

Let xK =1

K

K983131

k=1

xk If Dh(xx983183) le R2 for all x isin C then

E[f(xK)minus f(x983183)] le E

983077R2

KαK+

1

K

K983131

k=1

αk

2983042gk9830422lowast

983078

If αk = α for all k we see that the choice of stepsize minimizing

R2

Kα+

α

2K

K983131

k=1

983042gk9830422lowast is α983183 =radic2R

983075K983131

k=1

983042gk9830422lowast

983076minus 12

which yields

E[f(xK)minus f(x983183)] leradic2R

KE

983093

983095983075

K983131

k=1

983042gk9830422lowast

983076 12

983094

983096

But this α983183 is not known in advance because

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 19 26

Corollary 3

Let the conditions of Theorem 2 hold Let

αk =R983161983160983160983159

k983131

i=1

983042gi9830422lowast

which is the ldquoup to nowrdquo optimal choice Then

E[f(xK)minus f(x983183)] le3R

KE

983093

983095983075

K983131

k=1

983042gk9830422lowast

983076 12

983094

983096

where

xK =1

K

K983131

k=1

xk

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 20 26

3 Variable metric methods

The idea is to adjust the metric with which one constructs updatesto better reflect problem structure

1 Choose subgradient gk isin partf(xk) or stochastic subgradient gk

satisfyingE[gk|xk] isin partf(xk)

2 Update positive semidefinite matrix Hk isin Rntimesn

3 Compute update

xk+1 = argminxisinC

983069〈gkx〉+ 1

2〈xminus xkHk(xminus xk)〉

983070

Special cases Subgradient method Hk =1

αkI

Newton method Hk = nabla2f(xk)

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 21 26

Theorem 4

Let Hk be a sequence of positive definite matrices where Hk is afunction of g1 middot middot middot gk (and potentially some additional randomness)Let gk be (stochastic) subgradients with

E[gk|xk] isin partf(xk)

Then

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078

le 1

2E

983077K983131

k=2

983059983042xk minus x9831839830422Hk

minus 983042xk minus x9831839830422Hkminus1

983060+ 983042x1 minus x9831839830422H1

983078

+1

2E

983077K983131

k=1

983042gk9830422Hminus1

k

983078

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 22 26

31 Adaptive gradient variable metric method

Theorem 5

LetRinfin = sup

xisinC983042xminus x983183983042infin

be the ℓinfin radius of the set C and the conditions of Theorem 4 hold Let

Hk =1

α

983075diag

983075k983131

i=1

gigiT

98307698307612

where α gt 0 is a pre-specified constant Then we have

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078le

983061R2

infin2

+ α2

983062E[tr(HK)]

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 23 26

Example Support vector machine classification problem

Consider

minx

f(x) =1

m

m983131

i=1

[1minus bi〈aix〉]+ st 983042x983042infin le 1

where the vectors ai isin minus1 0 1n with m = 5000 and n = 1000

For each coordinate j isin [n] we set aij isin plusmn1 to have a randomsign with probability 1j and aij = 0 otherwise

Let u isin minus1 1n uniformly at random We set bi = sign(〈aiu〉)with probability 095 and bi = minussign(〈aiu〉) otherwiseWe choose a stochastic gradient by selecting i isin [m] uniformly atrandom then set

gk isin part[1minus bi〈aixk〉]+

and αk = αradick

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 24 26

f(xk)minus f983183 versus k100 various initial stepsizes

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 25 26

f(xk)minus f983183 versus k100 best initial stepsize

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 26 26

Page 7: Lecture 11: Mirror Descent and Variable Metric Methodsmath.xmu.edu.cn/group/nona/damc/Lecture11.pdf · Definition: Let h be a differentiable convex function, differentiable on

The subproblem in mirror descent is

minx

〈gx〉+n983131

j=1

xj logxjyj

st x ge 0

n983131

j=1

xj = 1

By introducing Lagrange multipliers τ isin R for the equalityconstraint and λ isin Rn

+ for the inequality constraint Then weobtain Lagrangian

L(x τλ) = 〈gx〉+n983131

j=1

983063xj log

xjyj

+ τxj minus λjxj

983064minus τ

Taking derivatives with respect to x and setting them to zero yield

xj = yj exp(minusgj minus 1minus τ + λj)

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 7 26

Taking λj = 0 by

n983131

j=1

xj = 1 we have (a special solution)

xi =yi exp(minusgi)

n983131

j=1

yj exp(minusgj)

The update in mirror descent is

xk+1i =

xki exp(minusαkgki )

n983131

j=1

xkj exp(minusαkgkj )

This is the so-called exponentiated gradient update also known asentropic mirror descent

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 8 26

Example using ℓp norms (p isin (1 2]) Consider

h(x) =1

2(pminus 1)983042x9830422p

We have

nablah(x) =1

pminus 1983042x9830422minusp

p

983045sign(x1)|x1|pminus1 middot middot middot sign(xn)|xn|pminus1

983046T

Defineφ(x) = (pminus 1)nablah(x)

and

ψ(y) = 983042y9830422minusqq

983045sign(y1)|y1|qminus1 middot middot middot sign(yn)|yn|qminus1

983046T

where q =p

pminus 1 We have ψ(φ(x)) = x and hence ψ = φminus1

and φ = ψminus1

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 9 26

The mirror descent for C = Rn is

xk+1 = argminxisinRn

983069f(xk) + 〈gkxminus xk〉+ 1

αkDh(xx

k)

983070

= argminxisinRn

983069〈gkx〉+ 1

αkDh(xx

k)

983070

= (nablah)minus1(nablah(xk)minus αkgk)

= ψ(φ(xk)minus αk(pminus 1)gk)

The update involving the inverse of the gradient mapping (nablah)minus1holds more generally that is for any differentiable and strictlyconvex h

This is the original form of the mirror descent update and itjustifies the name mirror descent as the gradient is ldquomirroredrdquothrough the distance-generating function h and back again

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 10 26

The case C = x isin Rn+

n983131

j=1

xj = 1

The subproblem in mirror descent is

minxisinC

〈vx〉+ 1

2983042x9830422p

wherev = αk(pminus 1)gk minus φ(xk)

Exercise Show by KKT conditions that the solution is given byx = ψ([τeminus v]+) where τ isin R satisfies

n983131

j=1

ψj([τeminus v]+) = 1

Binary search can be used to find τ

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 11 26

Theorem 1

Let αk gt 0 be any sequence of non-increasing stepsizes Assume that his strongly convex with respect to 983042 middot 983042 Let 983042 middot 983042lowast denote the dual normLet xk be generated by the mirror descent iteration If Dh(xx983183) le R2

for all x isin C then for all K ge 1

K983131

k=1

[f(xk)minus f(x983183)] leR2

αK+

K983131

k=1

αk

2983042gk9830422lowast

If αk = α is constant then for all K ge 1K983131

k=1

[f(xk)minus f(x983183)] le1

αDh(x983183x

1) +α

2

K983131

k=1

983042gk9830422lowast

If xK =1

K

K983131

k=1

xk or xK = argminxk

f(xk) 983042g983042lowast le M for all

g isin partf(x) for x isin C then f(xK)minus f(x983183) le1

KαDh(x983183x

1) +α

2M2

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 12 26

Exercise Prove that the negative entropy

h(x) =

n983131

j=1

xj log xj

is 1-strongly convex with respect to the ℓ1-norm

Example Let C = x isin Rn+

983123nj=1 xj = 1 Let x1 =

1

ne The

exponentiated gradient method with fixed stepsize α guarantees

f(xK)minus f(x983183) lelog n

Kα+

α

2K

K983131

k=1

983042gk9830422infin

where xK =1

K

K983131

k=1

xk

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 13 26

Example Let C sub x isin Rn 983042x9830421 le R1 Let h(x) =1

2(pminus 1)983042x9830422p

with p = 1 +1

log(2n) Then

K983131

k=1

[f(xk)minus f(x983183)] le2R2

1 log(2n)

αK+

e2

2

K983131

k=1

αk983042gk9830422infin

In particular taking αk =R1

e

983157log(2n)

kand xK =

1

K

K983131

k=1

xk gives

f(xK)minus f(x983183) le3eR1

983155log(2n)radicK

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 14 26

A simulated mirror-descent example Let

A =983045a1 middot middot middot am

983046T isin Rmtimesn

have entries drawn iid N (0 1) Let

bi =1

2(ai1 + ai2) + εi

where εi drawn iid N (0 10minus2) and n = 3000 m = 20 Define

f(x) = 983042Axminus b9830421 =m983131

i=1

|〈aix〉 minus bi|

which has subgradients ATsign(Axminus b) Consider

minx

f(x) st x isin C =

983099983103

983101x isin Rn+

n983131

j=1

xj = 1

983100983104

983102

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 15 26

We compare the subgradient method to exponentiated gradientdescent for this problem noting that the Euclidean projection of avector v isin Rn to the set C has coordinates

xj = [vj minus τ ]+

where τ isin R is chosen so that

n983131

j=1

xj =

n983131

j=1

[vj minus τ ]+ = 1

We use stepsize αk = α0radick where the initial stepsize α0 is chosen

to optimize the convergence guarantee for each of the methods

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 16 26

fkbest minus f983183 versus k

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 17 26

21 Stochastic version

Theorem 2

Let the conditions of Theorem 1 hold except that instead of receiving avector

gk isin partf(xk)

at iteration k the vector gk is a stochastic subgradient satisfying

E[gk|xk] isin partf(xk)

Then for any non-increasing stepsize sequence αk (where αk may bechosen dependent on g1 middot middot middot gk)

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078le E

983077R2

αK+

K983131

k=1

αk

2983042gk9830422lowast

983078

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 18 26

22 Adaptive stepsizes

Let xK =1

K

K983131

k=1

xk If Dh(xx983183) le R2 for all x isin C then

E[f(xK)minus f(x983183)] le E

983077R2

KαK+

1

K

K983131

k=1

αk

2983042gk9830422lowast

983078

If αk = α for all k we see that the choice of stepsize minimizing

R2

Kα+

α

2K

K983131

k=1

983042gk9830422lowast is α983183 =radic2R

983075K983131

k=1

983042gk9830422lowast

983076minus 12

which yields

E[f(xK)minus f(x983183)] leradic2R

KE

983093

983095983075

K983131

k=1

983042gk9830422lowast

983076 12

983094

983096

But this α983183 is not known in advance because

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 19 26

Corollary 3

Let the conditions of Theorem 2 hold Let

αk =R983161983160983160983159

k983131

i=1

983042gi9830422lowast

which is the ldquoup to nowrdquo optimal choice Then

E[f(xK)minus f(x983183)] le3R

KE

983093

983095983075

K983131

k=1

983042gk9830422lowast

983076 12

983094

983096

where

xK =1

K

K983131

k=1

xk

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 20 26

3 Variable metric methods

The idea is to adjust the metric with which one constructs updatesto better reflect problem structure

1 Choose subgradient gk isin partf(xk) or stochastic subgradient gk

satisfyingE[gk|xk] isin partf(xk)

2 Update positive semidefinite matrix Hk isin Rntimesn

3 Compute update

xk+1 = argminxisinC

983069〈gkx〉+ 1

2〈xminus xkHk(xminus xk)〉

983070

Special cases Subgradient method Hk =1

αkI

Newton method Hk = nabla2f(xk)

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 21 26

Theorem 4

Let Hk be a sequence of positive definite matrices where Hk is afunction of g1 middot middot middot gk (and potentially some additional randomness)Let gk be (stochastic) subgradients with

E[gk|xk] isin partf(xk)

Then

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078

le 1

2E

983077K983131

k=2

983059983042xk minus x9831839830422Hk

minus 983042xk minus x9831839830422Hkminus1

983060+ 983042x1 minus x9831839830422H1

983078

+1

2E

983077K983131

k=1

983042gk9830422Hminus1

k

983078

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 22 26

31 Adaptive gradient variable metric method

Theorem 5

LetRinfin = sup

xisinC983042xminus x983183983042infin

be the ℓinfin radius of the set C and the conditions of Theorem 4 hold Let

Hk =1

α

983075diag

983075k983131

i=1

gigiT

98307698307612

where α gt 0 is a pre-specified constant Then we have

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078le

983061R2

infin2

+ α2

983062E[tr(HK)]

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 23 26

Example Support vector machine classification problem

Consider

minx

f(x) =1

m

m983131

i=1

[1minus bi〈aix〉]+ st 983042x983042infin le 1

where the vectors ai isin minus1 0 1n with m = 5000 and n = 1000

For each coordinate j isin [n] we set aij isin plusmn1 to have a randomsign with probability 1j and aij = 0 otherwise

Let u isin minus1 1n uniformly at random We set bi = sign(〈aiu〉)with probability 095 and bi = minussign(〈aiu〉) otherwiseWe choose a stochastic gradient by selecting i isin [m] uniformly atrandom then set

gk isin part[1minus bi〈aixk〉]+

and αk = αradick

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 24 26

f(xk)minus f983183 versus k100 various initial stepsizes

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 25 26

f(xk)minus f983183 versus k100 best initial stepsize

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 26 26

Page 8: Lecture 11: Mirror Descent and Variable Metric Methodsmath.xmu.edu.cn/group/nona/damc/Lecture11.pdf · Definition: Let h be a differentiable convex function, differentiable on

Taking λj = 0 by

n983131

j=1

xj = 1 we have (a special solution)

xi =yi exp(minusgi)

n983131

j=1

yj exp(minusgj)

The update in mirror descent is

xk+1i =

xki exp(minusαkgki )

n983131

j=1

xkj exp(minusαkgkj )

This is the so-called exponentiated gradient update also known asentropic mirror descent

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 8 26

Example using ℓp norms (p isin (1 2]) Consider

h(x) =1

2(pminus 1)983042x9830422p

We have

nablah(x) =1

pminus 1983042x9830422minusp

p

983045sign(x1)|x1|pminus1 middot middot middot sign(xn)|xn|pminus1

983046T

Defineφ(x) = (pminus 1)nablah(x)

and

ψ(y) = 983042y9830422minusqq

983045sign(y1)|y1|qminus1 middot middot middot sign(yn)|yn|qminus1

983046T

where q =p

pminus 1 We have ψ(φ(x)) = x and hence ψ = φminus1

and φ = ψminus1

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 9 26

The mirror descent for C = Rn is

xk+1 = argminxisinRn

983069f(xk) + 〈gkxminus xk〉+ 1

αkDh(xx

k)

983070

= argminxisinRn

983069〈gkx〉+ 1

αkDh(xx

k)

983070

= (nablah)minus1(nablah(xk)minus αkgk)

= ψ(φ(xk)minus αk(pminus 1)gk)

The update involving the inverse of the gradient mapping (nablah)minus1holds more generally that is for any differentiable and strictlyconvex h

This is the original form of the mirror descent update and itjustifies the name mirror descent as the gradient is ldquomirroredrdquothrough the distance-generating function h and back again

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 10 26

The case C = x isin Rn+

n983131

j=1

xj = 1

The subproblem in mirror descent is

minxisinC

〈vx〉+ 1

2983042x9830422p

wherev = αk(pminus 1)gk minus φ(xk)

Exercise Show by KKT conditions that the solution is given byx = ψ([τeminus v]+) where τ isin R satisfies

n983131

j=1

ψj([τeminus v]+) = 1

Binary search can be used to find τ

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 11 26

Theorem 1

Let αk gt 0 be any sequence of non-increasing stepsizes Assume that his strongly convex with respect to 983042 middot 983042 Let 983042 middot 983042lowast denote the dual normLet xk be generated by the mirror descent iteration If Dh(xx983183) le R2

for all x isin C then for all K ge 1

K983131

k=1

[f(xk)minus f(x983183)] leR2

αK+

K983131

k=1

αk

2983042gk9830422lowast

If αk = α is constant then for all K ge 1K983131

k=1

[f(xk)minus f(x983183)] le1

αDh(x983183x

1) +α

2

K983131

k=1

983042gk9830422lowast

If xK =1

K

K983131

k=1

xk or xK = argminxk

f(xk) 983042g983042lowast le M for all

g isin partf(x) for x isin C then f(xK)minus f(x983183) le1

KαDh(x983183x

1) +α

2M2

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 12 26

Exercise Prove that the negative entropy

h(x) =

n983131

j=1

xj log xj

is 1-strongly convex with respect to the ℓ1-norm

Example Let C = x isin Rn+

983123nj=1 xj = 1 Let x1 =

1

ne The

exponentiated gradient method with fixed stepsize α guarantees

f(xK)minus f(x983183) lelog n

Kα+

α

2K

K983131

k=1

983042gk9830422infin

where xK =1

K

K983131

k=1

xk

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 13 26

Example Let C sub x isin Rn 983042x9830421 le R1 Let h(x) =1

2(pminus 1)983042x9830422p

with p = 1 +1

log(2n) Then

K983131

k=1

[f(xk)minus f(x983183)] le2R2

1 log(2n)

αK+

e2

2

K983131

k=1

αk983042gk9830422infin

In particular taking αk =R1

e

983157log(2n)

kand xK =

1

K

K983131

k=1

xk gives

f(xK)minus f(x983183) le3eR1

983155log(2n)radicK

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 14 26

A simulated mirror-descent example Let

A =983045a1 middot middot middot am

983046T isin Rmtimesn

have entries drawn iid N (0 1) Let

bi =1

2(ai1 + ai2) + εi

where εi drawn iid N (0 10minus2) and n = 3000 m = 20 Define

f(x) = 983042Axminus b9830421 =m983131

i=1

|〈aix〉 minus bi|

which has subgradients ATsign(Axminus b) Consider

minx

f(x) st x isin C =

983099983103

983101x isin Rn+

n983131

j=1

xj = 1

983100983104

983102

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 15 26

We compare the subgradient method to exponentiated gradientdescent for this problem noting that the Euclidean projection of avector v isin Rn to the set C has coordinates

xj = [vj minus τ ]+

where τ isin R is chosen so that

n983131

j=1

xj =

n983131

j=1

[vj minus τ ]+ = 1

We use stepsize αk = α0radick where the initial stepsize α0 is chosen

to optimize the convergence guarantee for each of the methods

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 16 26

fkbest minus f983183 versus k

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 17 26

21 Stochastic version

Theorem 2

Let the conditions of Theorem 1 hold except that instead of receiving avector

gk isin partf(xk)

at iteration k the vector gk is a stochastic subgradient satisfying

E[gk|xk] isin partf(xk)

Then for any non-increasing stepsize sequence αk (where αk may bechosen dependent on g1 middot middot middot gk)

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078le E

983077R2

αK+

K983131

k=1

αk

2983042gk9830422lowast

983078

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 18 26

22 Adaptive stepsizes

Let xK =1

K

K983131

k=1

xk If Dh(xx983183) le R2 for all x isin C then

E[f(xK)minus f(x983183)] le E

983077R2

KαK+

1

K

K983131

k=1

αk

2983042gk9830422lowast

983078

If αk = α for all k we see that the choice of stepsize minimizing

R2

Kα+

α

2K

K983131

k=1

983042gk9830422lowast is α983183 =radic2R

983075K983131

k=1

983042gk9830422lowast

983076minus 12

which yields

E[f(xK)minus f(x983183)] leradic2R

KE

983093

983095983075

K983131

k=1

983042gk9830422lowast

983076 12

983094

983096

But this α983183 is not known in advance because

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 19 26

Corollary 3

Let the conditions of Theorem 2 hold Let

αk =R983161983160983160983159

k983131

i=1

983042gi9830422lowast

which is the ldquoup to nowrdquo optimal choice Then

E[f(xK)minus f(x983183)] le3R

KE

983093

983095983075

K983131

k=1

983042gk9830422lowast

983076 12

983094

983096

where

xK =1

K

K983131

k=1

xk

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 20 26

3 Variable metric methods

The idea is to adjust the metric with which one constructs updatesto better reflect problem structure

1 Choose subgradient gk isin partf(xk) or stochastic subgradient gk

satisfyingE[gk|xk] isin partf(xk)

2 Update positive semidefinite matrix Hk isin Rntimesn

3 Compute update

xk+1 = argminxisinC

983069〈gkx〉+ 1

2〈xminus xkHk(xminus xk)〉

983070

Special cases Subgradient method Hk =1

αkI

Newton method Hk = nabla2f(xk)

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 21 26

Theorem 4

Let Hk be a sequence of positive definite matrices where Hk is afunction of g1 middot middot middot gk (and potentially some additional randomness)Let gk be (stochastic) subgradients with

E[gk|xk] isin partf(xk)

Then

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078

le 1

2E

983077K983131

k=2

983059983042xk minus x9831839830422Hk

minus 983042xk minus x9831839830422Hkminus1

983060+ 983042x1 minus x9831839830422H1

983078

+1

2E

983077K983131

k=1

983042gk9830422Hminus1

k

983078

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 22 26

31 Adaptive gradient variable metric method

Theorem 5

LetRinfin = sup

xisinC983042xminus x983183983042infin

be the ℓinfin radius of the set C and the conditions of Theorem 4 hold Let

Hk =1

α

983075diag

983075k983131

i=1

gigiT

98307698307612

where α gt 0 is a pre-specified constant Then we have

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078le

983061R2

infin2

+ α2

983062E[tr(HK)]

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 23 26

Example Support vector machine classification problem

Consider

minx

f(x) =1

m

m983131

i=1

[1minus bi〈aix〉]+ st 983042x983042infin le 1

where the vectors ai isin minus1 0 1n with m = 5000 and n = 1000

For each coordinate j isin [n] we set aij isin plusmn1 to have a randomsign with probability 1j and aij = 0 otherwise

Let u isin minus1 1n uniformly at random We set bi = sign(〈aiu〉)with probability 095 and bi = minussign(〈aiu〉) otherwiseWe choose a stochastic gradient by selecting i isin [m] uniformly atrandom then set

gk isin part[1minus bi〈aixk〉]+

and αk = αradick

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 24 26

f(xk)minus f983183 versus k100 various initial stepsizes

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 25 26

f(xk)minus f983183 versus k100 best initial stepsize

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 26 26

Page 9: Lecture 11: Mirror Descent and Variable Metric Methodsmath.xmu.edu.cn/group/nona/damc/Lecture11.pdf · Definition: Let h be a differentiable convex function, differentiable on

Example using ℓp norms (p isin (1 2]) Consider

h(x) =1

2(pminus 1)983042x9830422p

We have

nablah(x) =1

pminus 1983042x9830422minusp

p

983045sign(x1)|x1|pminus1 middot middot middot sign(xn)|xn|pminus1

983046T

Defineφ(x) = (pminus 1)nablah(x)

and

ψ(y) = 983042y9830422minusqq

983045sign(y1)|y1|qminus1 middot middot middot sign(yn)|yn|qminus1

983046T

where q =p

pminus 1 We have ψ(φ(x)) = x and hence ψ = φminus1

and φ = ψminus1

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 9 26

The mirror descent for C = Rn is

xk+1 = argminxisinRn

983069f(xk) + 〈gkxminus xk〉+ 1

αkDh(xx

k)

983070

= argminxisinRn

983069〈gkx〉+ 1

αkDh(xx

k)

983070

= (nablah)minus1(nablah(xk)minus αkgk)

= ψ(φ(xk)minus αk(pminus 1)gk)

The update involving the inverse of the gradient mapping (nablah)minus1holds more generally that is for any differentiable and strictlyconvex h

This is the original form of the mirror descent update and itjustifies the name mirror descent as the gradient is ldquomirroredrdquothrough the distance-generating function h and back again

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 10 26

The case C = x isin Rn+

n983131

j=1

xj = 1

The subproblem in mirror descent is

minxisinC

〈vx〉+ 1

2983042x9830422p

wherev = αk(pminus 1)gk minus φ(xk)

Exercise Show by KKT conditions that the solution is given byx = ψ([τeminus v]+) where τ isin R satisfies

n983131

j=1

ψj([τeminus v]+) = 1

Binary search can be used to find τ

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 11 26

Theorem 1

Let αk gt 0 be any sequence of non-increasing stepsizes Assume that his strongly convex with respect to 983042 middot 983042 Let 983042 middot 983042lowast denote the dual normLet xk be generated by the mirror descent iteration If Dh(xx983183) le R2

for all x isin C then for all K ge 1

K983131

k=1

[f(xk)minus f(x983183)] leR2

αK+

K983131

k=1

αk

2983042gk9830422lowast

If αk = α is constant then for all K ge 1K983131

k=1

[f(xk)minus f(x983183)] le1

αDh(x983183x

1) +α

2

K983131

k=1

983042gk9830422lowast

If xK =1

K

K983131

k=1

xk or xK = argminxk

f(xk) 983042g983042lowast le M for all

g isin partf(x) for x isin C then f(xK)minus f(x983183) le1

KαDh(x983183x

1) +α

2M2

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 12 26

Exercise Prove that the negative entropy

h(x) =

n983131

j=1

xj log xj

is 1-strongly convex with respect to the ℓ1-norm

Example Let C = x isin Rn+

983123nj=1 xj = 1 Let x1 =

1

ne The

exponentiated gradient method with fixed stepsize α guarantees

f(xK)minus f(x983183) lelog n

Kα+

α

2K

K983131

k=1

983042gk9830422infin

where xK =1

K

K983131

k=1

xk

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 13 26

Example Let C sub x isin Rn 983042x9830421 le R1 Let h(x) =1

2(pminus 1)983042x9830422p

with p = 1 +1

log(2n) Then

K983131

k=1

[f(xk)minus f(x983183)] le2R2

1 log(2n)

αK+

e2

2

K983131

k=1

αk983042gk9830422infin

In particular taking αk =R1

e

983157log(2n)

kand xK =

1

K

K983131

k=1

xk gives

f(xK)minus f(x983183) le3eR1

983155log(2n)radicK

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 14 26

A simulated mirror-descent example Let

A =983045a1 middot middot middot am

983046T isin Rmtimesn

have entries drawn iid N (0 1) Let

bi =1

2(ai1 + ai2) + εi

where εi drawn iid N (0 10minus2) and n = 3000 m = 20 Define

f(x) = 983042Axminus b9830421 =m983131

i=1

|〈aix〉 minus bi|

which has subgradients ATsign(Axminus b) Consider

minx

f(x) st x isin C =

983099983103

983101x isin Rn+

n983131

j=1

xj = 1

983100983104

983102

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 15 26

We compare the subgradient method to exponentiated gradientdescent for this problem noting that the Euclidean projection of avector v isin Rn to the set C has coordinates

xj = [vj minus τ ]+

where τ isin R is chosen so that

n983131

j=1

xj =

n983131

j=1

[vj minus τ ]+ = 1

We use stepsize αk = α0radick where the initial stepsize α0 is chosen

to optimize the convergence guarantee for each of the methods

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 16 26

fkbest minus f983183 versus k

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 17 26

21 Stochastic version

Theorem 2

Let the conditions of Theorem 1 hold except that instead of receiving avector

gk isin partf(xk)

at iteration k the vector gk is a stochastic subgradient satisfying

E[gk|xk] isin partf(xk)

Then for any non-increasing stepsize sequence αk (where αk may bechosen dependent on g1 middot middot middot gk)

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078le E

983077R2

αK+

K983131

k=1

αk

2983042gk9830422lowast

983078

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 18 26

22 Adaptive stepsizes

Let xK =1

K

K983131

k=1

xk If Dh(xx983183) le R2 for all x isin C then

E[f(xK)minus f(x983183)] le E

983077R2

KαK+

1

K

K983131

k=1

αk

2983042gk9830422lowast

983078

If αk = α for all k we see that the choice of stepsize minimizing

R2

Kα+

α

2K

K983131

k=1

983042gk9830422lowast is α983183 =radic2R

983075K983131

k=1

983042gk9830422lowast

983076minus 12

which yields

E[f(xK)minus f(x983183)] leradic2R

KE

983093

983095983075

K983131

k=1

983042gk9830422lowast

983076 12

983094

983096

But this α983183 is not known in advance because

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 19 26

Corollary 3

Let the conditions of Theorem 2 hold Let

αk =R983161983160983160983159

k983131

i=1

983042gi9830422lowast

which is the ldquoup to nowrdquo optimal choice Then

E[f(xK)minus f(x983183)] le3R

KE

983093

983095983075

K983131

k=1

983042gk9830422lowast

983076 12

983094

983096

where

xK =1

K

K983131

k=1

xk

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 20 26

3 Variable metric methods

The idea is to adjust the metric with which one constructs updatesto better reflect problem structure

1 Choose subgradient gk isin partf(xk) or stochastic subgradient gk

satisfyingE[gk|xk] isin partf(xk)

2 Update positive semidefinite matrix Hk isin Rntimesn

3 Compute update

xk+1 = argminxisinC

983069〈gkx〉+ 1

2〈xminus xkHk(xminus xk)〉

983070

Special cases Subgradient method Hk =1

αkI

Newton method Hk = nabla2f(xk)

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 21 26

Theorem 4

Let Hk be a sequence of positive definite matrices where Hk is afunction of g1 middot middot middot gk (and potentially some additional randomness)Let gk be (stochastic) subgradients with

E[gk|xk] isin partf(xk)

Then

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078

le 1

2E

983077K983131

k=2

983059983042xk minus x9831839830422Hk

minus 983042xk minus x9831839830422Hkminus1

983060+ 983042x1 minus x9831839830422H1

983078

+1

2E

983077K983131

k=1

983042gk9830422Hminus1

k

983078

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 22 26

31 Adaptive gradient variable metric method

Theorem 5

LetRinfin = sup

xisinC983042xminus x983183983042infin

be the ℓinfin radius of the set C and the conditions of Theorem 4 hold Let

Hk =1

α

983075diag

983075k983131

i=1

gigiT

98307698307612

where α gt 0 is a pre-specified constant Then we have

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078le

983061R2

infin2

+ α2

983062E[tr(HK)]

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 23 26

Example Support vector machine classification problem

Consider

minx

f(x) =1

m

m983131

i=1

[1minus bi〈aix〉]+ st 983042x983042infin le 1

where the vectors ai isin minus1 0 1n with m = 5000 and n = 1000

For each coordinate j isin [n] we set aij isin plusmn1 to have a randomsign with probability 1j and aij = 0 otherwise

Let u isin minus1 1n uniformly at random We set bi = sign(〈aiu〉)with probability 095 and bi = minussign(〈aiu〉) otherwiseWe choose a stochastic gradient by selecting i isin [m] uniformly atrandom then set

gk isin part[1minus bi〈aixk〉]+

and αk = αradick

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 24 26

f(xk)minus f983183 versus k100 various initial stepsizes

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 25 26

f(xk)minus f983183 versus k100 best initial stepsize

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 26 26

Page 10: Lecture 11: Mirror Descent and Variable Metric Methodsmath.xmu.edu.cn/group/nona/damc/Lecture11.pdf · Definition: Let h be a differentiable convex function, differentiable on

The mirror descent for C = Rn is

xk+1 = argminxisinRn

983069f(xk) + 〈gkxminus xk〉+ 1

αkDh(xx

k)

983070

= argminxisinRn

983069〈gkx〉+ 1

αkDh(xx

k)

983070

= (nablah)minus1(nablah(xk)minus αkgk)

= ψ(φ(xk)minus αk(pminus 1)gk)

The update involving the inverse of the gradient mapping (nablah)minus1holds more generally that is for any differentiable and strictlyconvex h

This is the original form of the mirror descent update and itjustifies the name mirror descent as the gradient is ldquomirroredrdquothrough the distance-generating function h and back again

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 10 26

The case C = x isin Rn+

n983131

j=1

xj = 1

The subproblem in mirror descent is

minxisinC

〈vx〉+ 1

2983042x9830422p

wherev = αk(pminus 1)gk minus φ(xk)

Exercise Show by KKT conditions that the solution is given byx = ψ([τeminus v]+) where τ isin R satisfies

n983131

j=1

ψj([τeminus v]+) = 1

Binary search can be used to find τ

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 11 26

Theorem 1

Let αk gt 0 be any sequence of non-increasing stepsizes Assume that his strongly convex with respect to 983042 middot 983042 Let 983042 middot 983042lowast denote the dual normLet xk be generated by the mirror descent iteration If Dh(xx983183) le R2

for all x isin C then for all K ge 1

K983131

k=1

[f(xk)minus f(x983183)] leR2

αK+

K983131

k=1

αk

2983042gk9830422lowast

If αk = α is constant then for all K ge 1K983131

k=1

[f(xk)minus f(x983183)] le1

αDh(x983183x

1) +α

2

K983131

k=1

983042gk9830422lowast

If xK =1

K

K983131

k=1

xk or xK = argminxk

f(xk) 983042g983042lowast le M for all

g isin partf(x) for x isin C then f(xK)minus f(x983183) le1

KαDh(x983183x

1) +α

2M2

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 12 26

Exercise Prove that the negative entropy

h(x) =

n983131

j=1

xj log xj

is 1-strongly convex with respect to the ℓ1-norm

Example Let C = x isin Rn+

983123nj=1 xj = 1 Let x1 =

1

ne The

exponentiated gradient method with fixed stepsize α guarantees

f(xK)minus f(x983183) lelog n

Kα+

α

2K

K983131

k=1

983042gk9830422infin

where xK =1

K

K983131

k=1

xk

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 13 26

Example Let C sub x isin Rn 983042x9830421 le R1 Let h(x) =1

2(pminus 1)983042x9830422p

with p = 1 +1

log(2n) Then

K983131

k=1

[f(xk)minus f(x983183)] le2R2

1 log(2n)

αK+

e2

2

K983131

k=1

αk983042gk9830422infin

In particular taking αk =R1

e

983157log(2n)

kand xK =

1

K

K983131

k=1

xk gives

f(xK)minus f(x983183) le3eR1

983155log(2n)radicK

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 14 26

A simulated mirror-descent example Let

A =983045a1 middot middot middot am

983046T isin Rmtimesn

have entries drawn iid N (0 1) Let

bi =1

2(ai1 + ai2) + εi

where εi drawn iid N (0 10minus2) and n = 3000 m = 20 Define

f(x) = 983042Axminus b9830421 =m983131

i=1

|〈aix〉 minus bi|

which has subgradients ATsign(Axminus b) Consider

minx

f(x) st x isin C =

983099983103

983101x isin Rn+

n983131

j=1

xj = 1

983100983104

983102

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 15 26

We compare the subgradient method to exponentiated gradientdescent for this problem noting that the Euclidean projection of avector v isin Rn to the set C has coordinates

xj = [vj minus τ ]+

where τ isin R is chosen so that

n983131

j=1

xj =

n983131

j=1

[vj minus τ ]+ = 1

We use stepsize αk = α0radick where the initial stepsize α0 is chosen

to optimize the convergence guarantee for each of the methods

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 16 26

fkbest minus f983183 versus k

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 17 26

21 Stochastic version

Theorem 2

Let the conditions of Theorem 1 hold except that instead of receiving avector

gk isin partf(xk)

at iteration k the vector gk is a stochastic subgradient satisfying

E[gk|xk] isin partf(xk)

Then for any non-increasing stepsize sequence αk (where αk may bechosen dependent on g1 middot middot middot gk)

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078le E

983077R2

αK+

K983131

k=1

αk

2983042gk9830422lowast

983078

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 18 26

22 Adaptive stepsizes

Let xK =1

K

K983131

k=1

xk If Dh(xx983183) le R2 for all x isin C then

E[f(xK)minus f(x983183)] le E

983077R2

KαK+

1

K

K983131

k=1

αk

2983042gk9830422lowast

983078

If αk = α for all k we see that the choice of stepsize minimizing

R2

Kα+

α

2K

K983131

k=1

983042gk9830422lowast is α983183 =radic2R

983075K983131

k=1

983042gk9830422lowast

983076minus 12

which yields

E[f(xK)minus f(x983183)] leradic2R

KE

983093

983095983075

K983131

k=1

983042gk9830422lowast

983076 12

983094

983096

But this α983183 is not known in advance because

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 19 26

Corollary 3

Let the conditions of Theorem 2 hold Let

αk =R983161983160983160983159

k983131

i=1

983042gi9830422lowast

which is the ldquoup to nowrdquo optimal choice Then

E[f(xK)minus f(x983183)] le3R

KE

983093

983095983075

K983131

k=1

983042gk9830422lowast

983076 12

983094

983096

where

xK =1

K

K983131

k=1

xk

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 20 26

3 Variable metric methods

The idea is to adjust the metric with which one constructs updatesto better reflect problem structure

1 Choose subgradient gk isin partf(xk) or stochastic subgradient gk

satisfyingE[gk|xk] isin partf(xk)

2 Update positive semidefinite matrix Hk isin Rntimesn

3 Compute update

xk+1 = argminxisinC

983069〈gkx〉+ 1

2〈xminus xkHk(xminus xk)〉

983070

Special cases Subgradient method Hk =1

αkI

Newton method Hk = nabla2f(xk)

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 21 26

Theorem 4

Let Hk be a sequence of positive definite matrices where Hk is afunction of g1 middot middot middot gk (and potentially some additional randomness)Let gk be (stochastic) subgradients with

E[gk|xk] isin partf(xk)

Then

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078

le 1

2E

983077K983131

k=2

983059983042xk minus x9831839830422Hk

minus 983042xk minus x9831839830422Hkminus1

983060+ 983042x1 minus x9831839830422H1

983078

+1

2E

983077K983131

k=1

983042gk9830422Hminus1

k

983078

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 22 26

31 Adaptive gradient variable metric method

Theorem 5

LetRinfin = sup

xisinC983042xminus x983183983042infin

be the ℓinfin radius of the set C and the conditions of Theorem 4 hold Let

Hk =1

α

983075diag

983075k983131

i=1

gigiT

98307698307612

where α gt 0 is a pre-specified constant Then we have

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078le

983061R2

infin2

+ α2

983062E[tr(HK)]

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 23 26

Example Support vector machine classification problem

Consider

minx

f(x) =1

m

m983131

i=1

[1minus bi〈aix〉]+ st 983042x983042infin le 1

where the vectors ai isin minus1 0 1n with m = 5000 and n = 1000

For each coordinate j isin [n] we set aij isin plusmn1 to have a randomsign with probability 1j and aij = 0 otherwise

Let u isin minus1 1n uniformly at random We set bi = sign(〈aiu〉)with probability 095 and bi = minussign(〈aiu〉) otherwiseWe choose a stochastic gradient by selecting i isin [m] uniformly atrandom then set

gk isin part[1minus bi〈aixk〉]+

and αk = αradick

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 24 26

f(xk)minus f983183 versus k100 various initial stepsizes

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 25 26

f(xk)minus f983183 versus k100 best initial stepsize

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 26 26

Page 11: Lecture 11: Mirror Descent and Variable Metric Methodsmath.xmu.edu.cn/group/nona/damc/Lecture11.pdf · Definition: Let h be a differentiable convex function, differentiable on

The case C = x isin Rn+

n983131

j=1

xj = 1

The subproblem in mirror descent is

minxisinC

〈vx〉+ 1

2983042x9830422p

wherev = αk(pminus 1)gk minus φ(xk)

Exercise Show by KKT conditions that the solution is given byx = ψ([τeminus v]+) where τ isin R satisfies

n983131

j=1

ψj([τeminus v]+) = 1

Binary search can be used to find τ

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 11 26

Theorem 1

Let αk gt 0 be any sequence of non-increasing stepsizes Assume that his strongly convex with respect to 983042 middot 983042 Let 983042 middot 983042lowast denote the dual normLet xk be generated by the mirror descent iteration If Dh(xx983183) le R2

for all x isin C then for all K ge 1

K983131

k=1

[f(xk)minus f(x983183)] leR2

αK+

K983131

k=1

αk

2983042gk9830422lowast

If αk = α is constant then for all K ge 1K983131

k=1

[f(xk)minus f(x983183)] le1

αDh(x983183x

1) +α

2

K983131

k=1

983042gk9830422lowast

If xK =1

K

K983131

k=1

xk or xK = argminxk

f(xk) 983042g983042lowast le M for all

g isin partf(x) for x isin C then f(xK)minus f(x983183) le1

KαDh(x983183x

1) +α

2M2

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 12 26

Exercise Prove that the negative entropy

h(x) =

n983131

j=1

xj log xj

is 1-strongly convex with respect to the ℓ1-norm

Example Let C = x isin Rn+

983123nj=1 xj = 1 Let x1 =

1

ne The

exponentiated gradient method with fixed stepsize α guarantees

f(xK)minus f(x983183) lelog n

Kα+

α

2K

K983131

k=1

983042gk9830422infin

where xK =1

K

K983131

k=1

xk

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 13 26

Example Let C sub x isin Rn 983042x9830421 le R1 Let h(x) =1

2(pminus 1)983042x9830422p

with p = 1 +1

log(2n) Then

K983131

k=1

[f(xk)minus f(x983183)] le2R2

1 log(2n)

αK+

e2

2

K983131

k=1

αk983042gk9830422infin

In particular taking αk =R1

e

983157log(2n)

kand xK =

1

K

K983131

k=1

xk gives

f(xK)minus f(x983183) le3eR1

983155log(2n)radicK

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 14 26

A simulated mirror-descent example Let

A =983045a1 middot middot middot am

983046T isin Rmtimesn

have entries drawn iid N (0 1) Let

bi =1

2(ai1 + ai2) + εi

where εi drawn iid N (0 10minus2) and n = 3000 m = 20 Define

f(x) = 983042Axminus b9830421 =m983131

i=1

|〈aix〉 minus bi|

which has subgradients ATsign(Axminus b) Consider

minx

f(x) st x isin C =

983099983103

983101x isin Rn+

n983131

j=1

xj = 1

983100983104

983102

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 15 26

We compare the subgradient method to exponentiated gradientdescent for this problem noting that the Euclidean projection of avector v isin Rn to the set C has coordinates

xj = [vj minus τ ]+

where τ isin R is chosen so that

n983131

j=1

xj =

n983131

j=1

[vj minus τ ]+ = 1

We use stepsize αk = α0radick where the initial stepsize α0 is chosen

to optimize the convergence guarantee for each of the methods

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 16 26

fkbest minus f983183 versus k

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 17 26

21 Stochastic version

Theorem 2

Let the conditions of Theorem 1 hold except that instead of receiving avector

gk isin partf(xk)

at iteration k the vector gk is a stochastic subgradient satisfying

E[gk|xk] isin partf(xk)

Then for any non-increasing stepsize sequence αk (where αk may bechosen dependent on g1 middot middot middot gk)

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078le E

983077R2

αK+

K983131

k=1

αk

2983042gk9830422lowast

983078

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 18 26

22 Adaptive stepsizes

Let xK =1

K

K983131

k=1

xk If Dh(xx983183) le R2 for all x isin C then

E[f(xK)minus f(x983183)] le E

983077R2

KαK+

1

K

K983131

k=1

αk

2983042gk9830422lowast

983078

If αk = α for all k we see that the choice of stepsize minimizing

R2

Kα+

α

2K

K983131

k=1

983042gk9830422lowast is α983183 =radic2R

983075K983131

k=1

983042gk9830422lowast

983076minus 12

which yields

E[f(xK)minus f(x983183)] leradic2R

KE

983093

983095983075

K983131

k=1

983042gk9830422lowast

983076 12

983094

983096

But this α983183 is not known in advance because

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 19 26

Corollary 3

Let the conditions of Theorem 2 hold Let

αk =R983161983160983160983159

k983131

i=1

983042gi9830422lowast

which is the ldquoup to nowrdquo optimal choice Then

E[f(xK)minus f(x983183)] le3R

KE

983093

983095983075

K983131

k=1

983042gk9830422lowast

983076 12

983094

983096

where

xK =1

K

K983131

k=1

xk

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 20 26

3 Variable metric methods

The idea is to adjust the metric with which one constructs updatesto better reflect problem structure

1 Choose subgradient gk isin partf(xk) or stochastic subgradient gk

satisfyingE[gk|xk] isin partf(xk)

2 Update positive semidefinite matrix Hk isin Rntimesn

3 Compute update

xk+1 = argminxisinC

983069〈gkx〉+ 1

2〈xminus xkHk(xminus xk)〉

983070

Special cases Subgradient method Hk =1

αkI

Newton method Hk = nabla2f(xk)

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 21 26

Theorem 4

Let Hk be a sequence of positive definite matrices where Hk is afunction of g1 middot middot middot gk (and potentially some additional randomness)Let gk be (stochastic) subgradients with

E[gk|xk] isin partf(xk)

Then

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078

le 1

2E

983077K983131

k=2

983059983042xk minus x9831839830422Hk

minus 983042xk minus x9831839830422Hkminus1

983060+ 983042x1 minus x9831839830422H1

983078

+1

2E

983077K983131

k=1

983042gk9830422Hminus1

k

983078

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 22 26

31 Adaptive gradient variable metric method

Theorem 5

LetRinfin = sup

xisinC983042xminus x983183983042infin

be the ℓinfin radius of the set C and the conditions of Theorem 4 hold Let

Hk =1

α

983075diag

983075k983131

i=1

gigiT

98307698307612

where α gt 0 is a pre-specified constant Then we have

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078le

983061R2

infin2

+ α2

983062E[tr(HK)]

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 23 26

Example Support vector machine classification problem

Consider

minx

f(x) =1

m

m983131

i=1

[1minus bi〈aix〉]+ st 983042x983042infin le 1

where the vectors ai isin minus1 0 1n with m = 5000 and n = 1000

For each coordinate j isin [n] we set aij isin plusmn1 to have a randomsign with probability 1j and aij = 0 otherwise

Let u isin minus1 1n uniformly at random We set bi = sign(〈aiu〉)with probability 095 and bi = minussign(〈aiu〉) otherwiseWe choose a stochastic gradient by selecting i isin [m] uniformly atrandom then set

gk isin part[1minus bi〈aixk〉]+

and αk = αradick

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 24 26

f(xk)minus f983183 versus k100 various initial stepsizes

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 25 26

f(xk)minus f983183 versus k100 best initial stepsize

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 26 26

Page 12: Lecture 11: Mirror Descent and Variable Metric Methodsmath.xmu.edu.cn/group/nona/damc/Lecture11.pdf · Definition: Let h be a differentiable convex function, differentiable on

Theorem 1

Let αk gt 0 be any sequence of non-increasing stepsizes Assume that his strongly convex with respect to 983042 middot 983042 Let 983042 middot 983042lowast denote the dual normLet xk be generated by the mirror descent iteration If Dh(xx983183) le R2

for all x isin C then for all K ge 1

K983131

k=1

[f(xk)minus f(x983183)] leR2

αK+

K983131

k=1

αk

2983042gk9830422lowast

If αk = α is constant then for all K ge 1K983131

k=1

[f(xk)minus f(x983183)] le1

αDh(x983183x

1) +α

2

K983131

k=1

983042gk9830422lowast

If xK =1

K

K983131

k=1

xk or xK = argminxk

f(xk) 983042g983042lowast le M for all

g isin partf(x) for x isin C then f(xK)minus f(x983183) le1

KαDh(x983183x

1) +α

2M2

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 12 26

Exercise Prove that the negative entropy

h(x) =

n983131

j=1

xj log xj

is 1-strongly convex with respect to the ℓ1-norm

Example Let C = x isin Rn+

983123nj=1 xj = 1 Let x1 =

1

ne The

exponentiated gradient method with fixed stepsize α guarantees

f(xK)minus f(x983183) lelog n

Kα+

α

2K

K983131

k=1

983042gk9830422infin

where xK =1

K

K983131

k=1

xk

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 13 26

Example Let C sub x isin Rn 983042x9830421 le R1 Let h(x) =1

2(pminus 1)983042x9830422p

with p = 1 +1

log(2n) Then

K983131

k=1

[f(xk)minus f(x983183)] le2R2

1 log(2n)

αK+

e2

2

K983131

k=1

αk983042gk9830422infin

In particular taking αk =R1

e

983157log(2n)

kand xK =

1

K

K983131

k=1

xk gives

f(xK)minus f(x983183) le3eR1

983155log(2n)radicK

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 14 26

A simulated mirror-descent example Let

A =983045a1 middot middot middot am

983046T isin Rmtimesn

have entries drawn iid N (0 1) Let

bi =1

2(ai1 + ai2) + εi

where εi drawn iid N (0 10minus2) and n = 3000 m = 20 Define

f(x) = 983042Axminus b9830421 =m983131

i=1

|〈aix〉 minus bi|

which has subgradients ATsign(Axminus b) Consider

minx

f(x) st x isin C =

983099983103

983101x isin Rn+

n983131

j=1

xj = 1

983100983104

983102

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 15 26

We compare the subgradient method to exponentiated gradientdescent for this problem noting that the Euclidean projection of avector v isin Rn to the set C has coordinates

xj = [vj minus τ ]+

where τ isin R is chosen so that

n983131

j=1

xj =

n983131

j=1

[vj minus τ ]+ = 1

We use stepsize αk = α0radick where the initial stepsize α0 is chosen

to optimize the convergence guarantee for each of the methods

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 16 26

fkbest minus f983183 versus k

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 17 26

21 Stochastic version

Theorem 2

Let the conditions of Theorem 1 hold except that instead of receiving avector

gk isin partf(xk)

at iteration k the vector gk is a stochastic subgradient satisfying

E[gk|xk] isin partf(xk)

Then for any non-increasing stepsize sequence αk (where αk may bechosen dependent on g1 middot middot middot gk)

E

983077K983131

k=1

(f(xk)minus f(x983183))

983078le E

983077R2

αK+

K983131

k=1

αk

2983042gk9830422lowast

983078

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 18 26

2.2 Adaptive stepsizes

Let \bar{x}^K = \frac{1}{K} \sum_{k=1}^{K} x^k. If D_h(x, x_⋆) ≤ R² for all x ∈ C, then

\mathbb{E}\bigl[ f(\bar{x}^K) - f(x_\star) \bigr] \le \mathbb{E}\Bigl[ \frac{R^2}{K \alpha_K} + \frac{1}{K} \sum_{k=1}^{K} \frac{\alpha_k}{2} \| g^k \|_*^2 \Bigr] .

If α_k = α for all k, we see that the choice of stepsize minimizing

\frac{R^2}{K\alpha} + \frac{\alpha}{2K} \sum_{k=1}^{K} \| g^k \|_*^2
\qquad \text{is} \qquad
\alpha_\star = \sqrt{2}\, R \Bigl( \sum_{k=1}^{K} \| g^k \|_*^2 \Bigr)^{-1/2} ,

which yields

\mathbb{E}\bigl[ f(\bar{x}^K) - f(x_\star) \bigr] \le \frac{\sqrt{2}\, R}{K} \, \mathbb{E}\Bigl[ \Bigl( \sum_{k=1}^{K} \| g^k \|_*^2 \Bigr)^{1/2} \Bigr] .

But this α_⋆ is not known in advance, because it depends on the subgradient norms ‖g^1‖_*, …, ‖g^K‖_*, which are only revealed as the method runs.

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 19 26
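A quick worked step (added for completeness) behind the optimal constant stepsize: writing S = \sum_{k=1}^{K} \|g^k\|_*^2 and minimizing \phi(\alpha) = \frac{R^2}{K\alpha} + \frac{\alpha S}{2K} over α > 0,

\phi'(\alpha) = -\frac{R^2}{K \alpha^2} + \frac{S}{2K} = 0
\;\Longrightarrow\;
\alpha_\star = \sqrt{\frac{2 R^2}{S}} = \sqrt{2}\, R \, S^{-1/2} ,
\qquad
\phi(\alpha_\star) = \frac{\sqrt{2}\, R \sqrt{S}}{K} .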

Corollary 3

Let the conditions of Theorem 2 hold. Let

\alpha_k = \frac{R}{\sqrt{\sum_{i=1}^{k} \| g^i \|_*^2}} ,

which is the "up to now" optimal choice. Then

\mathbb{E}\bigl[ f(\bar{x}^K) - f(x_\star) \bigr] \le \frac{3R}{K} \, \mathbb{E}\Bigl[ \Bigl( \sum_{k=1}^{K} \| g^k \|_*^2 \Bigr)^{1/2} \Bigr] ,
\qquad \text{where} \quad
\bar{x}^K = \frac{1}{K} \sum_{k=1}^{K} x^k .

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 20 26
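A minimal sketch of the adaptive stepsize of Corollary 3 inside an optimization loop, under the assumption of Euclidean geometry (so the dual norm is ℓ2 and the mirror step is a projected subgradient step); subgradient and project are placeholders such as the oracle and simplex projection sketched above:

import numpy as np

def adaptive_mirror_descent(subgradient, project, x1, R, K):
    """Projected subgradient method with alpha_k = R / sqrt(sum_{i<=k} ||g^i||_2^2)."""
    x = x1.copy()
    xbar = np.zeros_like(x1)
    sumsq = 0.0
    for k in range(1, K + 1):
        g = subgradient(x)
        sumsq += float(g @ g)                  # running sum of squared (dual) norms
        alpha_k = R / np.sqrt(sumsq + 1e-12)   # "up to now" optimal stepsize (guarded)
        x = project(x - alpha_k * g)
        xbar += x
    return xbar / K

# Hypothetical usage with the pieces sketched earlier:
# xbar = adaptive_mirror_descent(subgradient, project_simplex, np.full(n, 1.0 / n), R=1.0, K=2000)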

3 Variable metric methods

The idea is to adjust the metric with which one constructs updates to better reflect problem structure.

1. Choose a subgradient g^k ∈ ∂f(x^k), or a stochastic subgradient g^k satisfying \mathbb{E}[g^k \mid x^k] \in \partial f(x^k).

2. Update a positive semidefinite matrix H_k ∈ \mathbb{R}^{n \times n}.

3. Compute the update (a closed-form sketch of the unconstrained case follows this slide)

x^{k+1} = \mathop{\operatorname{argmin}}_{x \in C} \Bigl\{ \langle g^k, x \rangle + \frac{1}{2} \langle x - x^k, H_k ( x - x^k ) \rangle \Bigr\} .

Special cases: the subgradient method, H_k = \frac{1}{\alpha_k} I; Newton's method, H_k = \nabla^2 f(x^k).

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 21 26
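An unconstrained sketch of step 3 (an illustration under the assumption C = \mathbb{R}^n, in which case the argmin has the closed form below, recovering the subgradient step for H_k = I/α_k and the Newton step for H_k = ∇²f(x^k)):

import numpy as np

def variable_metric_step(x, g, H):
    """Unconstrained variable-metric update: argmin_y <g, y> + 0.5 (y - x)^T H (y - x).

    For positive definite H, setting the gradient g + H (y - x) to zero gives
    y = x - H^{-1} g.
    """
    return x - np.linalg.solve(H, g)

# Special cases (hypothetical):
#   H = (1.0 / alpha_k) * np.eye(len(x))  ->  x - alpha_k * g   (subgradient step)
#   H = hessian_f(x)                      ->  Newton step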

Theorem 4

Let H_k be a sequence of positive definite matrices, where H_k is a function of g^1, …, g^k (and potentially some additional randomness). Let g^k be (stochastic) subgradients with

\mathbb{E}[ g^k \mid x^k ] \in \partial f(x^k) .

Then

\mathbb{E}\Bigl[ \sum_{k=1}^{K} \bigl( f(x^k) - f(x_\star) \bigr) \Bigr]
\le \frac{1}{2} \, \mathbb{E}\Bigl[ \sum_{k=2}^{K} \bigl( \| x^k - x_\star \|_{H_k}^2 - \| x^k - x_\star \|_{H_{k-1}}^2 \bigr) + \| x^1 - x_\star \|_{H_1}^2 \Bigr]
+ \frac{1}{2} \, \mathbb{E}\Bigl[ \sum_{k=1}^{K} \| g^k \|_{H_k^{-1}}^2 \Bigr] .

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 22 26

3.1 Adaptive gradient variable metric method

Theorem 5

Let

R_\infty = \sup_{x \in C} \| x - x_\star \|_\infty

be the ℓ∞ radius of the set C, and let the conditions of Theorem 4 hold. Let

H_k = \frac{1}{\alpha} \Bigl( \operatorname{diag}\Bigl( \sum_{i=1}^{k} g^i (g^i)^T \Bigr) \Bigr)^{1/2} ,

where α > 0 is a pre-specified constant. Then we have

\mathbb{E}\Bigl[ \sum_{k=1}^{K} \bigl( f(x^k) - f(x_\star) \bigr) \Bigr] \le \Bigl( \frac{R_\infty^2}{2} + \alpha^2 \Bigr) \, \mathbb{E}[ \operatorname{tr}(H_K) ] .

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 23 26
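A minimal sketch of the adaptive-gradient metric of Theorem 5, under the assumption of the box constraint ‖x‖_∞ ≤ 1 used in the SVM example below; with a diagonal H_k and a box constraint, the update of step 3 decouples coordinate-wise into a clipped scaled step. stochastic_grad is a placeholder oracle:

import numpy as np

def adagrad_box(stochastic_grad, x1, alpha, K):
    """Diagonal metric H_k = diag(sum_i g^i (g^i)^T)^{1/2} / alpha, with the
    variable-metric update computed in closed form for the box ||x||_inf <= 1."""
    x = x1.copy()
    G = np.zeros_like(x1)                    # running sum of squared gradient coordinates
    for k in range(1, K + 1):
        g = stochastic_grad(x)
        G += g * g
        H_diag = np.sqrt(G) / alpha          # diagonal of H_k
        step = np.divide(g, H_diag, out=np.zeros_like(g), where=H_diag > 0)
        x = np.clip(x - step, -1.0, 1.0)     # coordinate-wise argmin over the box
    return x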

Example: Support vector machine classification problem

Consider

\min_x \; f(x) = \frac{1}{m} \sum_{i=1}^{m} [ 1 - b_i \langle a_i, x \rangle ]_+ \quad \text{s.t.} \quad \| x \|_\infty \le 1 ,

where the vectors a_i ∈ \{-1, 0, 1\}^n, with m = 5000 and n = 1000.

For each coordinate j ∈ [n], we set a_{ij} ∈ \{\pm 1\} to have a random sign with probability 1/j, and a_{ij} = 0 otherwise.

Let u ∈ \{-1, 1\}^n be drawn uniformly at random. We set b_i = \operatorname{sign}(\langle a_i, u \rangle) with probability 0.95 and b_i = -\operatorname{sign}(\langle a_i, u \rangle) otherwise.

We choose a stochastic gradient by selecting i ∈ [m] uniformly at random, then setting

g^k \in \partial [ 1 - b_i \langle a_i, x^k \rangle ]_+ ,

and α_k = α/\sqrt{k}.

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 24 26
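A hedged sketch of this experimental setup in Python/NumPy (the slide specifies the distributions; the concrete sampling code, tie-breaking convention, and names are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
m, n = 5000, 1000

# a_ij is a random sign with probability 1/j (j = 1, ..., n) and 0 otherwise.
j = np.arange(1, n + 1)
nonzero = rng.random((m, n)) < 1.0 / j
signs = rng.choice([-1.0, 1.0], size=(m, n))
A = np.where(nonzero, signs, 0.0)

# Labels: b_i = sign(<a_i, u>), flipped with probability 0.05.
u = rng.choice([-1.0, 1.0], size=n)
b = np.sign(A @ u)
b[b == 0] = 1.0                                  # break exact ties (a convention assumed here)
flip = rng.random(m) < 0.05
b[flip] *= -1.0

def stochastic_subgradient(x):
    """g^k in the subdifferential of [1 - b_i <a_i, x>]_+ for a uniformly random i."""
    i = rng.integers(m)
    margin = 1.0 - b[i] * (A[i] @ x)
    return -b[i] * A[i] if margin > 0 else np.zeros(n)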

[Figure: f(x^k) − f_⋆ versus k/100, various initial stepsizes]

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 25 26

[Figure: f(x^k) − f_⋆ versus k/100, best initial stepsize]

Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 26 26
