

Doubly Greedy Primal-dual Coordinate Descent for Sparse Empirical Risk Minimization — supplementary material (proceedings.mlr.press/v70/lei17b/lei17b-supp.pdf · 2018-10-24)

A. Appendix A: Convergence Analysis

A.1. Proof of Theorem 4.2

Recall the primal, dual, and Lagrangian forms:

$$P(x) \overset{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n \phi_i(\langle A_i, x\rangle) + g(x), \qquad (20)$$

$$L(x,y) \overset{\text{def}}{=} g(x) + \frac{1}{n}\, y^{T} A x - \frac{1}{n}\sum_{i=1}^n \phi_i^*(y_i), \qquad (21)$$

$$D(y) \overset{\text{def}}{=} \min_x L(x,y) \equiv L(\bar{x}(y), y), \qquad (22)$$

where $\bar{x}(y): \mathbb{R}^n \to \mathbb{R}^d$ is the optimal primal variable with respect to some $y$, namely $\bar{x}(y) = \arg\min_x L(x,y)$. For simplicity, we write $\bar{x}^{(t)} \overset{\text{def}}{=} \bar{x}(y^{(t)})$ throughout this paper. Similarly, $\bar{y}(x): \mathbb{R}^d \to \mathbb{R}^n$ denotes the optimal dual variable with respect to some $x$.

Recall that with our choice of regularizer, $g(x) = h(x) + \lambda\|x\|_1$, where $h(x) = \frac{\mu}{2}\|x\|_2^2$ is $\mu$-strongly convex, $\mu$-smooth, and separable. The conjugate $\phi^*$ of the loss function (e.g. the smooth hinge loss used in our experiments) is $\gamma$-strongly convex.

Recall the primal gap defined as $\Delta_p^{(t)} \overset{\text{def}}{=} L(x^{(t+1)}, y^{(t)}) - D(y^{(t)})$, and the dual gap $\Delta_d^{(t)} \overset{\text{def}}{=} D^* - D(y^{(t)})$. In the proof, we connect the objective change of each primal/dual update to the primal/dual gap, and show that the sub-optimality $\Delta^{(t)} = \Delta_p^{(t)} + \Delta_d^{(t)}$ enjoys linear convergence.

Lemma A.1. (Primal Progress):
$$L(x^{(t)}, y^{(t)}) - L(x^{(t+1)}, y^{(t)}) \;\ge\; \frac{1}{\|x^{(t)} - \bar{x}^{(t)}\|_0 - 1}\,\Delta_p^{(t)}.$$

Proof. This lemma is a direct consequence of the greedy update rule for the primal variables. Since $L(\cdot, y^{(t)})$ is separable in the coordinates of $x$ (both $g$ and the linear coupling term are separable), the total gap decomposes over the support of $x^{(t)} - \bar{x}^{(t)}$:

$$L(x^{(t)}, y^{(t)}) - L(\bar{x}^{(t)}, y^{(t)}) = \sum_i \Big[ L(x^{(t)}, y^{(t)}) - L\big(x^{(t)} + (\bar{x}_i^{(t)} - x_i^{(t)})e_i,\, y^{(t)}\big) \Big]$$
$$= \sum_{i \in \mathrm{supp}(x^{(t)} - \bar{x}^{(t)})} \Big[ L(x^{(t)}, y^{(t)}) - L\big(x^{(t)} + (\bar{x}_i^{(t)} - x_i^{(t)})e_i,\, y^{(t)}\big) \Big]$$
$$\le \|\bar{x}^{(t)} - x^{(t)}\|_0 \cdot \max_i \Big[ L(x^{(t)}, y^{(t)}) - L\big(x^{(t)} + (\bar{x}_i^{(t)} - x_i^{(t)})e_i,\, y^{(t)}\big) \Big]$$
$$= \|\bar{x}^{(t)} - x^{(t)}\|_0 \Big( L(x^{(t)}, y^{(t)}) - L(x^{(t+1)}, y^{(t)}) \Big).$$

Adding $L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t)})$ to both sides, and noting $L(\bar{x}^{(t)}, y^{(t)}) = D(y^{(t)})$, finishes the proof.

Recall that $i^{(t)}$ is the coordinate selected for the update of the dual variable $y^{(t)}$.

Lemma A.2. (Primal-Dual Progress).
$$\Delta_d^{(t)} - \Delta_d^{(t-1)} + \Delta_p^{(t)} - \Delta_p^{(t-1)} \le L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t)}) + \eta\Big(\tfrac{1}{n}\langle A_{i^{(t)}},\, x^{(t)} - \bar{x}^{(t)}\rangle\Big)^2 - \eta\Big(\tfrac{1}{n}\langle A_{i^{(t)}}, \bar{x}^{(t)}\rangle - g\Big)^2,$$
where $g \in \frac{1}{n}\partial \phi^*_{i^{(t)}}(y^{(t)}_{i^{(t)}})$.

Our goal is to prove that $\Delta_d^{(t)} - \Delta_d^{(t-1)} + \Delta_p^{(t)} - \Delta_p^{(t-1)} \le -\delta \Delta_p^{(t)} - \delta \Delta_d^{(t)}$ for some $\delta > 0$, which yields linear convergence of the sub-optimality. Since $L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t)}) \le -\frac{1}{\|x^{(t)} - \bar{x}^{(t)}\|_0 - 1}\Delta_p^{(t)}$ by Lemma A.1, this lemma is the intermediate step that connects to the primal part; the remaining terms represent the dual progress and will be analyzed later.

Proof. The change in the gaps comes from both the primal and the dual progress:
$$\underbrace{\Delta_d^{(t)} - \Delta_d^{(t-1)}}_{\text{dual progress}} + \underbrace{\Delta_p^{(t)} - \Delta_p^{(t-1)}}_{\text{primal progress}}.$$

• Dual progress: by Danskin's theorem, $-D(y)$ is $\gamma$-strongly convex, with $\nabla D(y)_i = \frac{1}{n}\langle A_i, \bar{x}(y)\rangle - \frac{1}{n}(\phi_i^*)'(y_i)$. Therefore, for any $g \in \frac{1}{n}\partial\phi^*_{i^{(t)}}(y^{(t)}_{i^{(t)}})$, we have
$$\Delta_d^{(t)} - \Delta_d^{(t-1)} = \big({-D(y^{(t)})}\big) - \big({-D(y^{(t-1)})}\big) \le -\Big(\tfrac{1}{n}\langle A_{i^{(t)}}, \bar{x}^{(t)}\rangle - g\Big)\Big(y^{(t)}_{i^{(t)}} - y^{(t-1)}_{i^{(t)}}\Big) - \tfrac{\gamma}{2}\Big(y^{(t)}_{i^{(t)}} - y^{(t-1)}_{i^{(t)}}\Big)^2. \qquad (23)$$

• Primal progress: similarly we get
$$L(x^{(t)}, y^{(t)}) - L(x^{(t)}, y^{(t-1)}) \le \Big(\tfrac{1}{n}\langle A_{i^{(t)}}, x^{(t)}\rangle - g\Big)\Big(y^{(t)}_{i^{(t)}} - y^{(t-1)}_{i^{(t)}}\Big) + \tfrac{\gamma}{2}\Big(y^{(t-1)}_{i^{(t)}} - y^{(t)}_{i^{(t)}}\Big)^2. \qquad (24)$$

Therefore,
$$\Delta_p^{(t)} - \Delta_p^{(t-1)} = L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t-1)}) - \big(D(y^{(t)}) - D(y^{(t-1)})\big)$$
$$= L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t)}) + L(x^{(t)}, y^{(t)}) - L(x^{(t)}, y^{(t-1)}) - \big(D(y^{(t)}) - D(y^{(t-1)})\big)$$
$$\le L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t)}) + \tfrac{1}{n}\langle A_{i^{(t)}},\, x^{(t)} - \bar{x}^{(t)}\rangle\Big(y^{(t)}_{i^{(t)}} - y^{(t-1)}_{i^{(t)}}\Big).$$
Here the last inequality comes from inequalities (24) and (23).

Meanwhile, by the update rule of the dual variable,
$$y^{(t)}_{i^{(t)}} \leftarrow \arg\max_\beta \; \tfrac{1}{n}\langle A_{i^{(t)}}, x^{(t)}\rangle\,\beta - \tfrac{1}{n}\phi^*_{i^{(t)}}(\beta) - \tfrac{1}{2\eta}\Big(\beta - y^{(t-1)}_{i^{(t)}}\Big)^2,$$
so by the first-order optimality of this proximal step there exists $g \in \frac{1}{n}\partial\phi^*_{i^{(t)}}(y^{(t)}_{i^{(t)}})$ such that $y^{(t)}_{i^{(t)}} - y^{(t-1)}_{i^{(t)}} = \eta\big(\frac{1}{n}\langle A_{i^{(t)}}, x^{(t)}\rangle - g\big)$. Therefore,
$$(23) = -\Big(\tfrac{1}{n}\langle A_{i^{(t)}}, \bar{x}^{(t)}\rangle - g\Big)\Big(y^{(t)}_{i^{(t)}} - y^{(t-1)}_{i^{(t)}}\Big) - \tfrac{\gamma}{2}\Big(y^{(t)}_{i^{(t)}} - y^{(t-1)}_{i^{(t)}}\Big)^2$$
$$= \tfrac{1}{n}\langle A_{i^{(t)}},\, \bar{x}^{(t)} - x^{(t)}\rangle\Big(y^{(t-1)}_{i^{(t)}} - y^{(t)}_{i^{(t)}}\Big) - \Big(\tfrac{1}{\eta} + \tfrac{\gamma}{2}\Big)\Big(y^{(t)}_{i^{(t)}} - y^{(t-1)}_{i^{(t)}}\Big)^2. \qquad (25)$$

Summing these together, we have:


$$\Delta_d^{(t)} - \Delta_d^{(t-1)} + \Delta_p^{(t)} - \Delta_p^{(t-1)} \le L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t)}) + \tfrac{2}{n}\langle A_{i^{(t)}},\, x^{(t)} - \bar{x}^{(t)}\rangle\Big(y^{(t)}_{i^{(t)}} - y^{(t-1)}_{i^{(t)}}\Big) - \Big(\tfrac{1}{\eta} + \tfrac{\gamma}{2}\Big)\Big(y^{(t)}_{i^{(t)}} - y^{(t-1)}_{i^{(t)}}\Big)^2$$
$$= L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t)}) + \tfrac{2\eta}{n}\langle A_{i^{(t)}},\, x^{(t)} - \bar{x}^{(t)}\rangle\Big(\tfrac{1}{n}\langle A_{i^{(t)}}, x^{(t)}\rangle - g\Big) - \eta^2\Big(\tfrac{1}{\eta} + \tfrac{\gamma}{2}\Big)\Big(\tfrac{1}{n}\langle A_{i^{(t)}}, x^{(t)}\rangle - g\Big)^2$$
$$\le L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t)}) + \eta\Big(\tfrac{1}{n}\langle A_{i^{(t)}},\, x^{(t)} - \bar{x}^{(t)}\rangle\Big)^2 - \eta\Big(\tfrac{1}{n}\langle A_{i^{(t)}}, \bar{x}^{(t)}\rangle - g\Big)^2,$$
where the last step drops $-\frac{\eta^2\gamma}{2}\big(\frac{1}{n}\langle A_{i^{(t)}}, x^{(t)}\rangle - g\big)^2 \le 0$ and uses the identity $2\eta uv - \eta v^2 = \eta u^2 - \eta(u-v)^2$ with $u = \frac{1}{n}\langle A_{i^{(t)}}, x^{(t)} - \bar{x}^{(t)}\rangle$ and $v = \frac{1}{n}\langle A_{i^{(t)}}, x^{(t)}\rangle - g$, so that $u - v = g - \frac{1}{n}\langle A_{i^{(t)}}, \bar{x}^{(t)}\rangle$. This proves Lemma A.2.
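The proximal dual step used above has a closed form once a concrete conjugate is fixed. The sketch below takes the hypothetical choice $\phi^*_i(\beta) = \frac{\gamma}{2}\beta^2$ (a $\gamma$-strongly convex conjugate; the paper's smooth hinge conjugate differs) and checks both the closed-form maximizer and the identity $y^{(t)}_{i^{(t)}} - y^{(t-1)}_{i^{(t)}} = \eta(\frac{1}{n}\langle A_{i^{(t)}}, x^{(t)}\rangle - g)$:

```python
import numpy as np

# Hypothetical instance of the dual proximal step, with
# phi*_i(beta) = (gamma/2) beta^2. The step maximizes
#   (1/n)<A_i, x> beta - (1/n) phi*_i(beta) - (beta - y_i)^2 / (2 eta),
# which for this quadratic conjugate has the closed form below.
def dual_prox_step(a, y_i, eta, gamma, n):
    """a = (1/n)<A_i, x>; returns the maximizing beta."""
    return (a + y_i / eta) / (gamma / n + 1.0 / eta)

n, eta, gamma = 10, 0.5, 2.0
a, y_i = 0.7, -0.3
beta = dual_prox_step(a, y_i, eta, gamma, n)

# First-order optimality gives y_new - y_old = eta * (a - g), with
# g = (1/n) phi*'(beta) = (gamma/n) beta, exactly as used in the proof.
g = gamma / n * beta
assert abs((beta - y_i) - eta * (a - g)) < 1e-12

# Cross-check against a brute-force grid maximization.
grid = np.linspace(-5, 5, 200001)
obj = a * grid - (gamma / (2 * n)) * grid**2 - (grid - y_i) ** 2 / (2 * eta)
assert abs(grid[np.argmax(obj)] - beta) < 1e-3
```

The closed form follows from setting the derivative $a - \frac{\gamma}{n}\beta - \frac{\beta - y_i}{\eta}$ to zero.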

Afterwards, we upper bound the dual progress $\big(\frac{1}{n}\langle A_{i^{(t)}},\, x^{(t)} - \bar{x}^{(t)}\rangle\big)^2 - \big(\frac{1}{n}\langle A_{i^{(t)}}, \bar{x}^{(t)}\rangle - g\big)^2$ by the dual gap $\Delta_d^{(t)}$:

Lemma A.3. (Dual Progress).
$$\Big(\tfrac{1}{n}\langle A_{i^{(t)}},\, x^{(t)} - \bar{x}^{(t)}\rangle\Big)^2 - \Big(\tfrac{1}{n}\langle A_{i^{(t)}}, \bar{x}^{(t)}\rangle - g\Big)^2 \le -\tfrac{\gamma}{2n}\Delta_d^{(t)} + \tfrac{5R^2}{2n^2}\big\|x^{(t)} - \bar{x}^{(t)}\big\|^2, \qquad (26)$$
where $g \in \frac{1}{n}\partial\phi^*_{i^{(t)}}(y^{(t)}_{i^{(t)}})$.

Proof. For simplicity, denote $\phi^*(y) = \frac{1}{n}\sum_i \phi_i^*(y_i)$, and let $\nabla\phi^*(y^{(t)})$ collect the entries $\frac{1}{n}(\phi_i^*)'(y_i^{(t)})$. To begin with, since $-D$ is $\gamma$-strongly convex,
$$\Delta_d^{(t)} = D^* - D(y^{(t)}) \le \tfrac{1}{2\gamma}\Big\|\tfrac{1}{n}A\bar{x}^{(t)} - \nabla\phi^*(y^{(t)})\Big\|_2^2 \le \tfrac{n}{2\gamma}\Big\|\tfrac{1}{n}A\bar{x}^{(t)} - \nabla\phi^*(y^{(t)})\Big\|_\infty^2.$$

In our algorithm, the greedy choice of $i^{(t)}$ makes sure that $\big|\tfrac{1}{n}Ax^{(t)} - \nabla\phi^*(y^{(t)})\big|_{i^{(t)}} = \big\|\tfrac{1}{n}Ax^{(t)} - \nabla\phi^*(y^{(t)})\big\|_\infty$. However, here we need the relation between $\big|\tfrac{1}{n}A\bar{x}^{(t)} - \nabla\phi^*(y^{(t)})\big|_{i^{(t)}}$ and $\big\|\tfrac{1}{n}A\bar{x}^{(t)} - \nabla\phi^*(y^{(t)})\big\|_\infty$ (assumed to be attained at coordinate $i^*$). We bridge the gap by $\delta \overset{\text{def}}{=} \tfrac{1}{n}A(\bar{x}^{(t)} - x^{(t)})$. Since

$$-\Big(\tfrac{1}{n}\langle A_{i^{(t)}}, \bar{x}^{(t)}\rangle - \tfrac{1}{n}(\phi^*_{i^{(t)}})'(y^{(t)}_{i^{(t)}})\Big)^2 = -\Big(\tfrac{1}{n}\langle A_{i^{(t)}}, x^{(t)}\rangle - \tfrac{1}{n}(\phi^*_{i^{(t)}})'(y^{(t)}_{i^{(t)}}) + \delta_{i^{(t)}}\Big)^2$$
$$\le -\tfrac{1}{2}\Big(\tfrac{1}{n}\langle A_{i^{(t)}}, x^{(t)}\rangle - \tfrac{1}{n}(\phi^*_{i^{(t)}})'(y^{(t)}_{i^{(t)}})\Big)^2 + \delta_{i^{(t)}}^2$$
$$= -\tfrac{1}{2}\Big\|\tfrac{1}{n}Ax^{(t)} - \nabla\phi^*(y^{(t)})\Big\|_\infty^2 + \delta_{i^{(t)}}^2$$
$$\le -\tfrac{1}{2}\Big(\tfrac{1}{n}\langle A_{i^*}, x^{(t)}\rangle - \tfrac{1}{n}(\phi^*_{i^*})'(y^{(t)}_{i^*})\Big)^2 + \|\delta\|_\infty^2$$
$$= -\tfrac{1}{2}\Big(\tfrac{1}{n}\langle A_{i^*}, \bar{x}^{(t)}\rangle - \tfrac{1}{n}(\phi^*_{i^*})'(y^{(t)}_{i^*}) - \delta_{i^*}\Big)^2 + \|\delta\|_\infty^2$$
$$\le -\tfrac{1}{4}\Big(\tfrac{1}{n}\langle A_{i^*}, \bar{x}^{(t)}\rangle - \tfrac{1}{n}(\phi^*_{i^*})'(y^{(t)}_{i^*})\Big)^2 + \tfrac{3}{2}\|\delta\|_\infty^2$$
$$= -\tfrac{1}{4}\Big\|\tfrac{1}{n}A\bar{x}^{(t)} - \nabla\phi^*(y^{(t)})\Big\|_\infty^2 + \tfrac{3}{2}\|\delta\|_\infty^2 \le -\tfrac{\gamma}{2n}\Delta_d^{(t)} + \tfrac{3}{2}\|\delta\|_\infty^2.$$

The first inequality follows from $-(a+b)^2 = -a^2 - b^2 - 2ab \le -a^2 - b^2 + \frac{1}{2}a^2 + 2b^2 = -\frac{1}{2}a^2 + b^2$, replacing $a$ by $\frac{1}{n}\langle A_{i^{(t)}}, x^{(t)}\rangle - \frac{1}{n}(\phi^*_{i^{(t)}})'(y^{(t)}_{i^{(t)}})$ and setting $b \overset{\text{def}}{=} \delta_{i^{(t)}}$; the third inequality is analogous.

Meanwhile, the first term on the left-hand side of (26) satisfies $\big(\frac{1}{n}\langle A_{i^{(t)}},\, x^{(t)} - \bar{x}^{(t)}\rangle\big)^2 = \delta_{i^{(t)}}^2 \le \|\delta\|_\infty^2$, and since $\|A(\bar{x}^{(t)} - x^{(t)})\|_\infty \le R\|\bar{x}^{(t)} - x^{(t)}\|$, we have $\frac{5}{2}\|\delta\|_\infty^2 \le \frac{5R^2}{2n^2}\|x^{(t)} - \bar{x}^{(t)}\|^2$; together we get Lemma A.3.

Now we have established the connection between the primal and dual progress (the change in the primal/dual gaps) and the primal and dual gaps themselves; the only remaining term is $\|x^{(t)} - \bar{x}^{(t)}\|^2$. But since $\frac{\mu}{2}\|x^{(t)} - \bar{x}^{(t)}\|^2 \le L(x^{(t)}, y^{(t)}) - L(\bar{x}^{(t)}, y^{(t)})$ by $\mu$-strong convexity, this term can be absorbed into the primal gap. Therefore, back to the main inequality, applying (26):

Proof of Theorem 4.2.
$$\Delta_d^{(t)} - \Delta_d^{(t-1)} + \Delta_p^{(t)} - \Delta_p^{(t-1)}$$
$$\overset{\text{Lemma A.2}}{\le} L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t)}) + \Big\langle \tfrac{1}{n}Ax^{(t)} - \nabla\phi^*(y^{(t)}),\, y^{(t)} - y^{(t-1)}\Big\rangle - 2\Big\langle \tfrac{1}{n}A\bar{x}^{(t)} - \nabla\phi^*(y^{(t)}),\, y^{(t)} - y^{(t-1)}\Big\rangle$$
$$\overset{\text{Lemma A.3}}{\le} L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t)}) - \tfrac{\eta\gamma}{2n}\Delta_d^{(t)} + \tfrac{5\eta R^2}{2n^2}\big\|x^{(t)} - \bar{x}^{(t)}\big\|^2$$
$$\le L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t)}) - \tfrac{\eta\gamma}{2n}\Delta_d^{(t)} + \tfrac{5\eta R^2}{\mu n^2}\Big(L(x^{(t)}, y^{(t)}) - L(\bar{x}^{(t)}, y^{(t)})\Big)$$
$$= \Big(1 - \tfrac{5\eta R^2}{\mu n^2}\Big)\Big(L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t)})\Big) - \tfrac{\eta\gamma}{2n}\Delta_d^{(t)} + \tfrac{5\eta R^2}{\mu n^2}\Big(L(x^{(t+1)}, y^{(t)}) - L(\bar{x}^{(t)}, y^{(t)})\Big)$$
$$\overset{\text{Lemma A.1}}{\le} -\Big(1 - \tfrac{5\eta R^2}{\mu n^2}\Big)\tfrac{1}{\|x^{(t)} - \bar{x}^{(t)}\|_0 - 1}\Delta_p^{(t)} - \tfrac{\eta\gamma}{2n}\Delta_d^{(t)} + \tfrac{5\eta R^2}{\mu n^2}\Big(L(x^{(t+1)}, y^{(t)}) - L(\bar{x}^{(t)}, y^{(t)})\Big)$$
$$= -\bigg[\Big(1 - \tfrac{5\eta R^2}{\mu n^2}\Big)\tfrac{1}{\|x^{(t)} - \bar{x}^{(t)}\|_0 - 1} - \tfrac{5\eta R^2}{\mu n^2}\bigg]\Delta_p^{(t)} - \tfrac{\eta\gamma}{2n}\Delta_d^{(t)},$$
where the last step uses $L(x^{(t+1)}, y^{(t)}) - L(\bar{x}^{(t)}, y^{(t)}) = \Delta_p^{(t)}$.

Therefore, we have
$$\frac{\|x^{(t)} - \bar{x}^{(t)}\|_0}{\|x^{(t)} - \bar{x}^{(t)}\|_0 - 1}\Big(1 - \tfrac{5\eta R^2}{\mu n^2}\Big)\Delta_p^{(t)} + \Big(1 + \tfrac{\eta\gamma}{2n}\Big)\Delta_d^{(t)} \le \Delta_d^{(t-1)} + \Delta_p^{(t-1)},$$
i.e. linear convergence. Notice that when
$$\eta^{(t)} \le \frac{2n^2\mu}{(10R^2 + n\gamma\mu)\,\|x^{(t)} - \bar{x}^{(t)}\|_0}, \qquad (27)$$
both coefficients on the left-hand side are at least $1 + \frac{\eta^{(t)}\gamma}{2n}$, so that
$$\Delta^{(t)} \le \frac{1}{1 + \frac{\eta^{(t)}\gamma}{2n}}\,\Delta^{(t-1)}.$$
Specifically, when inequality (27) holds and $\|x^{(t)} - \bar{x}^{(t)}\|_0 \le s$, it requires $O\!\big(s\big(\tfrac{\kappa}{n} + 1\big)\log\tfrac{1}{\epsilon}\big)$ iterations to achieve $\epsilon$ primal and dual sub-optimality, where $\kappa = \tfrac{R^2}{\mu\gamma}$.


B. Appendix B: Additional Experimental Results

Finally, we show results for λ = 0.01, 0.1 and µ = 0.01, 0.1, 1. Here are some comments on the results under different parameters.

The winning margin of DGPD is larger on data sets with a dense feature matrix than on those with a sparse one. One reason is that, for a sparse feature matrix, features of higher frequency are more likely to be active than those of lower frequency; therefore the feature sub-matrix corresponding to the active primal variables is often denser than the sub-matrix corresponding to the inactive ones. This results in a smaller overall speedup.

[12 panels: datasets Mnist-RF, Aloi-RF, RCV1, Mnist-RB, Aloi-RB, Sector; relative primal objective (log scale) plotted against time and against iterations for DGPD, DualRCD, PrimalRCD, and SPDC/SPDC-dense.]

Figure 2. Relative Objective versus Time (the upper 2 rows) and versus # iterations (the lower 2 rows) for λ = 0.1, µ = 1.


We also observe that to achieve the best performance of DGPD, both primal and dual sparsity must hold, and this sparsity is partially controlled by the L1/L2 penalties. In particular, when the L1 penalty has too much weight, the primal iterate becomes too sparse to yield reasonable prediction accuracy, which in turn produces a particularly dense dual iterate due to the non-zero loss on most of the samples. Likewise, when the L2 penalty becomes too large, the classifier tends to mis-classify many examples in order to gain a large margin, which also results in dense dual iterates.

In practice, however, such hyperparameter settings are unlikely to be chosen because of their inferior prediction performance.

[12 panels: datasets Mnist-RF, Aloi-RF, RCV1, Mnist-RB, Aloi-RB, Sector; relative primal objective (log scale) plotted against time and against iterations for DGPD, DualRCD, PrimalRCD, and SPDC/SPDC-dense.]

Figure 3. Relative Objective versus Time (the upper 2 rows) and versus # iterations (the lower 2 rows) for λ = 0.1, µ = 0.1.


[12 panels: datasets Mnist-RF, Aloi-RF, RCV1, Mnist-RB, Aloi-RB, Sector; relative primal objective (log scale) plotted against time and against iterations for DGPD, DualRCD, PrimalRCD, and SPDC/SPDC-dense.]

Figure 4. Relative Objective versus Time (the upper 2 rows) and versus # iterations (the lower 2 rows) for λ = 0.01, µ = 1.


[12 panels: datasets Mnist-RF, Aloi-RF, RCV1, Mnist-RB, Aloi-RB, Sector; relative primal objective (log scale) plotted against time and against iterations for DGPD, DualRCD, PrimalRCD, and SPDC/SPDC-dense.]

Figure 5. Relative Objective versus Time (the upper 2 rows) and versus # iterations (the lower 2 rows) for λ = 0.01, µ = 0.1.


[12 panels: datasets Mnist-RF, Aloi-RF, RCV1, Mnist-RB, Aloi-RB, Sector; relative primal objective (log scale) plotted against time and against iterations for DGPD, DualRCD, PrimalRCD, and SPDC/SPDC-dense.]

Figure 6. Relative Objective versus Time (the upper 2 rows) and versus # iterations (the lower 2 rows) for λ = 0.01, µ = 0.01.