
arXiv:1502.08053v1 [math.OC] 27 Feb 2015

Stochastic Dual Coordinate Ascent with Adaptive Probabilities

Dominik Csiba CDOMINIK@GMAIL.COM

Zheng Qu ZHENG.QU@ED.AC.UK

Peter Richtárik PETER.RICHTARIK@ED.AC.UK

University of Edinburgh

Abstract

This paper introduces AdaSDCA: an adaptive variant of stochastic dual coordinate ascent (SDCA) for solving regularized empirical risk minimization problems. Our modification consists in allowing the method to adaptively change the probability distribution over the dual variables throughout the iterative process. AdaSDCA achieves a provably better complexity bound than SDCA with the best fixed probability distribution, known as importance sampling. However, it is of a theoretical character, as it is expensive to implement. We also propose AdaSDCA+: a practical variant which in our experiments outperforms existing non-adaptive methods.

1. Introduction

Empirical Loss Minimization. In this paper we consider the regularized empirical risk minimization problem:

\[ \min_{w\in\mathbb{R}^d}\ \left[ P(w) \stackrel{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n \phi_i(A_i^\top w) + \lambda g(w) \right]. \tag{1} \]

In the context of supervised learning, $w$ is a linear predictor, $A_1,\dots,A_n\in\mathbb{R}^d$ are samples, $\phi_1,\dots,\phi_n:\mathbb{R}\to\mathbb{R}$ are loss functions, $g:\mathbb{R}^d\to\mathbb{R}$ is a regularizer and $\lambda>0$ a regularization parameter. Hence, we are seeking to identify the predictor which minimizes the average (empirical) loss $P(w)$.

We assume throughout that the loss functions are $1/\gamma$-smooth for some $\gamma>0$. That is, we assume they are differentiable and have Lipschitz derivative with Lipschitz constant $1/\gamma$:
\[ |\phi_i'(a) - \phi_i'(b)| \le \frac{1}{\gamma}\,|a-b| \]

for all $a,b\in\mathbb{R}$. Moreover, we assume that $g$ is 1-strongly convex with respect to the L2 norm:
\[ g(w) \le \alpha g(w_1) + (1-\alpha)g(w_2) - \frac{\alpha(1-\alpha)}{2}\,\|w_1 - w_2\|^2 \]
for all $w_1, w_2 \in \operatorname{dom} g$, $0\le\alpha\le 1$ and $w = \alpha w_1 + (1-\alpha)w_2$.
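As a concrete check (added here for illustration; it uses the quadratic loss and regularizer from Section 6.1): the quadratic loss $\phi_i(x) = \frac{1}{2\gamma}(x-y_i)^2$ has $\phi_i'(x) = (x-y_i)/\gamma$, so $|\phi_i'(a)-\phi_i'(b)| = |a-b|/\gamma$ and the loss is $1/\gamma$-smooth; and $g(w)=\tfrac12\|w\|^2$ satisfies the above inequality with equality, since
\[ \alpha\,\tfrac12\|w_1\|^2 + (1-\alpha)\,\tfrac12\|w_2\|^2 - \tfrac12\|\alpha w_1 + (1-\alpha)w_2\|^2 = \frac{\alpha(1-\alpha)}{2}\,\|w_1-w_2\|^2, \]
so it is 1-strongly convex.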

The ERM problem (1) has received considerable attention in recent years due to its widespread usage in supervised statistical learning (Shalev-Shwartz & Zhang, 2013b). Often, the number of samples $n$ is very large and it is important to design algorithms that are efficient in this regime.

Modern stochastic algorithms for ERM. Several highly efficient methods for solving the ERM problem were proposed and analyzed recently. These include primal methods such as SAG (Schmidt et al., 2013), SVRG (Johnson & Zhang, 2013), S2GD (Konecny & Richtarik, 2014), SAGA (Defazio et al., 2014), mS2GD (Konecny et al., 2014a) and MISO (Mairal, 2014). Importance sampling was considered in ProxSVRG (Xiao & Zhang, 2014) and S2CD (Konecny et al., 2014b).

Stochastic Dual Coordinate Ascent. One of the most successful methods in this category is stochastic dual coordinate ascent (SDCA), which operates on the dual of the ERM problem (1):

\[ \max_{\alpha=(\alpha_1,\dots,\alpha_n)\in\mathbb{R}^n}\ \left[ D(\alpha) \stackrel{\text{def}}{=} -f(\alpha) - \psi(\alpha) \right], \tag{2} \]

where the functions $f$ and $\psi$ are defined by

\[ f(\alpha) \stackrel{\text{def}}{=} \lambda g^*\!\left( \frac{1}{\lambda n}\sum_{i=1}^n A_i\alpha_i \right), \tag{3} \]
\[ \psi(\alpha) \stackrel{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n \phi_i^*(-\alpha_i), \tag{4} \]

and $g^*$ and $\phi_i^*$ are the convex conjugates¹ of $g$ and $\phi_i$, respectively. Note that in the dual problem there are as many variables as there are samples in the primal: $\alpha\in\mathbb{R}^n$.
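As a concrete example (added for illustration, using the quadratic loss of Section 6.1): for $\phi_i(x) = \frac{1}{2\gamma}(x-y_i)^2$ one gets
\[ \phi_i^*(u) = \sup_{x\in\mathbb{R}}\Big\{ ux - \tfrac{1}{2\gamma}(x-y_i)^2 \Big\} = y_i u + \frac{\gamma}{2}u^2, \]
attained at $x = y_i + \gamma u$; likewise $g(w)=\tfrac12\|w\|^2$ gives $g^*(s)=\tfrac12\|s\|^2$ and $\nabla g^*(s)=s$.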

In each iteration, SDCA randomly selects a dual variable $\alpha_i$ and performs its update, usually via a closed-form formula; this strategy is known as randomized coordinate descent.

¹ By the convex (Fenchel) conjugate of a function $h:\mathbb{R}^k\to\mathbb{R}$ we mean the function $h^*:\mathbb{R}^k\to\mathbb{R}$ defined by $h^*(u) = \sup_s\{s^\top u - h(s)\}$.


Methods based on updating randomly selected dual variables enjoy, in our setting, a linear convergence rate (Shalev-Shwartz & Zhang, 2013b; 2012; Takac et al., 2013; Shalev-Shwartz & Zhang, 2013a; Zhao & Zhang, 2014; Qu et al., 2014). These methods have attracted considerable attention in the past few years, and include SCD (Shalev-Shwartz & Tewari, 2011), RCDM (Nesterov, 2012), UCDC (Richtarik & Takac, 2014), ICD (Tappenden et al., 2013), PCDM (Richtarik & Takac, 2012), SPCDM (Fercoq & Richtarik, 2013), SPDC (Zhang & Xiao, 2014), APCG (Lin et al., 2014), RCD (Necoara & Patrascu, 2014), APPROX (Fercoq & Richtarik, 2013), QUARTZ (Qu et al., 2014) and ALPHA (Qu & Richtarik, 2014). Recent advances on mini-batch and distributed variants can be found in (Liu & Wright, 2014), (Zhao et al., 2014b), (Richtarik & Takac, 2013a), (Fercoq et al., 2014), (Trofimov & Genkin, 2014), (Jaggi et al., 2014), (Marecek et al., 2014) and (Mahajan et al., 2014). Other related work includes (Nemirovski et al., 2009; Duchi et al., 2011; Agarwal & Bottou, 2014; Zhao et al., 2014a; Fountoulakis & Tappenden, 2014; Tappenden et al., 2014). We also point to (Wright, 2014) for a review of coordinate descent algorithms.

Selection Probabilities. Naturally, both the theoretical convergence rate and the practical performance of randomized coordinate descent methods depend on the probability distribution governing the choice of individual coordinates. While most existing work assumes the uniform distribution, it was shown by Richtarik & Takac (2014); Necoara et al. (2012); Zhao & Zhang (2014) that coordinate descent works for an arbitrary fixed probability distribution over individual coordinates and even subsets of coordinates (Richtarik & Takac, 2013b; Qu et al., 2014; Qu & Richtarik, 2014). In all of these works the theory allows the computation of a fixed probability distribution, known as importance sampling, which optimizes the complexity bounds. However, such a distribution often depends on unknown quantities, such as the distances of the individual variables from their optimal values (Richtarik & Takac, 2014; Qu & Richtarik, 2014). In some cases, such as for smooth strongly convex functions or in the primal-dual setup we consider here, the probabilities forming an importance sampling can be explicitly computed (Richtarik & Takac, 2013b; Zhao & Zhang, 2014; Qu et al., 2014; Qu & Richtarik, 2014). Typically, the theoretical influence of using importance sampling is the replacement of the maximum of certain data-dependent quantities in the complexity bound by their average.

Adaptivity. Despite the striking developments in the field, there is virtually no literature on methods using an adaptive choice of the probabilities. We are aware of a few pieces of work, but all resort to heuristics unsupported by theory (Glasmachers & Dogan, 2013; Lukasewitz, 2013; Schaul et al., 2013; Banks-Watson, 2012; Loshchilov et al., 2011), which unfortunately also means that the methods are sometimes effective and sometimes not. We observe that in the primal-dual framework we consider, each dual variable can be equipped with a natural measure of progress which we call the "dual residue". We propose that the selection probabilities be constructed based on these quantities.

Outline: In Section 2 we summarize the contributions of our work. In Section 3 we describe our first, theoretical method (AdaSDCA, Algorithm 1) and explain the intuition behind it. In Section 4 we provide the convergence analysis. In Section 5 we introduce AdaSDCA+ (Algorithm 2): a variant of AdaSDCA containing heuristic elements which make it efficiently implementable. We conclude with numerical experiments in Section 6. Technical proofs and additional numerical experiments can be found in the appendix.

2. Contributions

We now briefly highlight the main contributions of this work.

Two algorithms with adaptive probabilities. We propose two new stochastic dual coordinate ascent algorithms: AdaSDCA (Algorithm 1) and AdaSDCA+ (Algorithm 2) for solving (1) and its dual problem (2). The novelty of our algorithms lies in the adaptive choice of the probability distribution over the dual coordinates.

Complexity analysis. We provide a convergence rate analysis for the first method, showing that AdaSDCA enjoys a better rate than the best known rate for SDCA with a fixed sampling (Zhao & Zhang, 2014; Qu et al., 2014). The probabilities are proportional to a certain measure of dual suboptimality associated with each variable.

Practical method. AdaSDCA requires the same computational effort per iteration as the batch gradient algorithm. To resolve this issue, we propose AdaSDCA+ (Algorithm 2): an efficient heuristic variant of AdaSDCA. The computational effort of the heuristic method in a single iteration is low, which makes it very competitive with methods based on importance sampling, such as IProx-SDCA (Zhao & Zhang, 2014). We support this with computational experiments in Section 6.


3. The Algorithm: AdaSDCA

It is well known that the optimal primal-dual pair $(w^*, \alpha^*)\in\mathbb{R}^d\times\mathbb{R}^n$ satisfies the following optimality conditions:
\[ w^* = \nabla g^*\!\left( \tfrac{1}{\lambda n} A\alpha^* \right), \tag{5} \]
\[ \alpha_i^* = -\nabla\phi_i(A_i^\top w^*), \quad \forall i\in[n] \stackrel{\text{def}}{=} \{1,\dots,n\}, \tag{6} \]
where $A$ is the $d$-by-$n$ matrix with columns $A_1,\dots,A_n$.

Definition 1 (Dual residue). The dual residue, $\kappa = (\kappa_1,\dots,\kappa_n)\in\mathbb{R}^n$, associated with $(w,\alpha)$ is given by:
\[ \kappa_i \stackrel{\text{def}}{=} \alpha_i + \nabla\phi_i(A_i^\top w). \tag{7} \]

Note that $\kappa_i^t = 0$ if and only if $\alpha_i$ satisfies (6). This motivates the design of AdaSDCA (Algorithm 1) as follows: whenever $|\kappa_i^t|$ is large, the $i$th dual coordinate $\alpha_i$ is suboptimal and hence should be updated more often.

Definition 2 (Coherence). We say that a probability vector $p^t\in\mathbb{R}^n$ is coherent with the dual residue $\kappa^t$ if for all $i\in[n]$ we have
\[ \kappa_i^t \ne 0 \ \Rightarrow\ p_i^t > 0. \]
Alternatively, $p^t$ is coherent with $\kappa^t$ if for
\[ I_t \stackrel{\text{def}}{=} \{ i\in[n] : \kappa_i^t \ne 0 \} \subseteq [n] \]
we have $\min_{i\in I_t} p_i^t > 0$.

AdaSDCA is a stochastic dual coordinate ascent method with an adaptive probability vector $p^t$, which could potentially change at every iteration $t$. The primal and dual update rules are exactly the same as in standard SDCA (Shalev-Shwartz & Zhang, 2013b), which instead uses a uniform sampling probability at every iteration and does not require the computation of the dual residue $\kappa$.
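For concreteness, the following is a minimal Python sketch of AdaSDCA (see Algorithm 1 below) specialized to the quadratic loss $\phi_i(x) = \frac{1}{2\gamma}(x-y_i)^2$ and $g(w)=\tfrac12\|w\|^2$, so that $\nabla g^*$ is the identity and the inner maximization has the closed form $\Delta = -\lambda n\gamma\,\kappa_{i_t}/(v_{i_t}+\lambda n\gamma)$; for the sampling distribution it uses the adaptive probabilities derived later in (14), although any distribution coherent with $\kappa^t$ would do. This is an illustrative sketch under these assumptions, not the authors' implementation.

import numpy as np

def adasdca_quadratic(A, y, lam, gamma, iters, rng=None):
    # Illustrative sketch of AdaSDCA for phi_i(x) = (x - y_i)^2 / (2*gamma), g(w) = 0.5*||w||^2.
    rng = np.random.default_rng() if rng is None else rng
    d, n = A.shape                                   # columns A_i are the samples
    v = (A * A).sum(axis=0)                          # v_i = A_i^T A_i
    alpha = np.zeros(n)                              # dual variables alpha^0 = 0
    abar = A @ alpha / (lam * n)                     # abar^0 = (1/(lam*n)) A alpha^0
    w = abar.copy()
    for t in range(iters):
        w = abar.copy()                              # primal update: w^t = grad g*(abar^t) = abar^t
        kappa = alpha + (A.T @ w - y) / gamma        # dual residues, eq. (7)
        if not np.any(kappa):
            break                                    # all residues zero: (w, alpha) is optimal
        p = np.abs(kappa) * np.sqrt(v + n * lam * gamma)
        p /= p.sum()                                 # adaptive probabilities, eq. (14)
        i = rng.choice(n, p=p)                       # sample coordinate i_t ~ p^t
        delta = -lam * n * gamma * kappa[i] / (v[i] + lam * n * gamma)  # closed-form argmax
        alpha[i] += delta                            # dual update
        abar += (delta / (lam * n)) * A[:, i]        # average update
    return w, alpha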

Our first result highlights a key technical tool which ultimately leads to the development of good adaptive sampling distributions $p^t$ in AdaSDCA. For simplicity we denote by $\mathbf{E}_t$ the expectation with respect to the random index $i_t\in[n]$ generated at iteration $t$.

Algorithm 1 AdaSDCA
Init: $v_i = A_i^\top A_i$ for $i\in[n]$; $\alpha^0\in\mathbb{R}^n$; $\bar\alpha^0 = \frac{1}{\lambda n}A\alpha^0$
for $t \ge 0$ do
  Primal update: $w^t = \nabla g^*(\bar\alpha^t)$
  Set: $\alpha^{t+1} = \alpha^t$
  Compute residue $\kappa^t$: $\kappa_i^t = \alpha_i^t + \nabla\phi_i(A_i^\top w^t)$, $\forall i\in[n]$
  Compute probability distribution $p^t$ coherent with $\kappa^t$
  Generate random $i_t\in[n]$ according to $p^t$
  Compute: $\Delta\alpha_{i_t}^t = \arg\max_{\Delta\in\mathbb{R}} \big\{ -\phi_{i_t}^*(-(\alpha_{i_t}^t+\Delta)) - A_{i_t}^\top w^t\Delta - \frac{v_{i_t}}{2\lambda n}|\Delta|^2 \big\}$
  Dual update: $\alpha_{i_t}^{t+1} = \alpha_{i_t}^t + \Delta\alpha_{i_t}^t$
  Average update: $\bar\alpha^{t+1} = \bar\alpha^t + \frac{\Delta\alpha_{i_t}^t}{\lambda n} A_{i_t}$
end for
Output: $w^t$, $\alpha^t$

Lemma 3. Consider the AdaSDCA algorithm during iteration $t\ge 0$ and assume that $p^t$ is coherent with $\kappa^t$. Then
\[ \mathbf{E}_t\!\left[ D(\alpha^{t+1}) - D(\alpha^t) \right] - \theta\left( P(w^t) - D(\alpha^t) \right) \ \ge\ -\frac{\theta}{2\lambda n^2}\sum_{i\in I_t}\left( \frac{\theta(v_i + n\lambda\gamma)}{p_i^t} - n\lambda\gamma \right)|\kappa_i^t|^2, \tag{8} \]
for arbitrary
\[ 0 \le \theta \le \min_{i\in I_t} p_i^t. \tag{9} \]

Proof. Lemma 3 is proved similarly to Lemma 2 in (Zhao & Zhang, 2014), but in a slightly more general setting. For completeness, we provide the proof in the appendix.

Lemma 3 plays a key role in the analysis of stochastic dual coordinate methods (Shalev-Shwartz & Zhang, 2013b; Zhao & Zhang, 2014; Shalev-Shwartz & Zhang, 2013a). Indeed, if the right-hand side of (8) is positive, then the primal-dual error $P(w^t) - D(\alpha^t)$ can be bounded by the expected dual ascent $\mathbf{E}_t[D(\alpha^{t+1}) - D(\alpha^t)]$ times $1/\theta$, which yields the contraction of the dual error at the rate of $1-\theta$ (see Theorem 7). In order to make the right-hand side of (8) positive we can take any $\theta$ smaller than $\theta(\kappa^t, p^t)$, where the function $\theta(\cdot,\cdot):\mathbb{R}^n_+\times\mathbb{R}^n_+\to\mathbb{R}$ is defined by:
\[ \theta(\kappa, p) \equiv \frac{n\lambda\gamma \sum_{i:\kappa_i\ne 0} |\kappa_i|^2}{\sum_{i:\kappa_i\ne 0} p_i^{-1}|\kappa_i|^2 (v_i + n\lambda\gamma)}. \tag{10} \]

We also need to make sure that $0\le\theta\le\min_{i\in I_t} p_i^t$ in order to apply Lemma 3. A "good" adaptive probability $p^t$ should then be the solution of the following optimization problem:

\[ \begin{aligned} \max_{p\in\mathbb{R}^n_+}\quad & \theta(\kappa^t, p) \\ \text{s.t.}\quad & \sum_{i=1}^n p_i = 1, \\ & \theta(\kappa^t, p) \le \min_{i:\kappa_i^t\ne 0} p_i. \end{aligned} \tag{11} \]
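For reference, $\theta(\kappa, p)$ from (10) is straightforward to evaluate numerically; the short Python sketch below (added here as an illustration, not part of the paper) also checks the feasibility condition of (11) for a candidate $p$.

import numpy as np

def theta(kappa, p, v, lam, gamma):
    # theta(kappa, p) from eq. (10); sums run over the support I_t = {i : kappa_i != 0}
    n = len(kappa)
    I = kappa != 0
    num = n * lam * gamma * np.sum(kappa[I] ** 2)
    den = np.sum((kappa[I] ** 2) * (v[I] + n * lam * gamma) / p[I])
    return num / den

def feasible_for_11(kappa, p, v, lam, gamma):
    # Feasibility check for problem (11): theta(kappa^t, p) <= min_{i in I_t} p_i
    I = kappa != 0
    return theta(kappa, p, v, lam, gamma) <= p[I].min()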


A feasible solution to (11) is the importance sampling (also known as optimal serial sampling) $p^*$ defined by:
\[ p_i^* \stackrel{\text{def}}{=} \frac{v_i + n\lambda\gamma}{\sum_{j=1}^n (v_j + n\lambda\gamma)}, \quad \forall i\in[n], \tag{12} \]

which was proposed in (Zhao & Zhang, 2014) to obtain the proximal stochastic dual coordinate ascent method with importance sampling (IProx-SDCA). The same optimal probability vector was also deduced, via different means and in a more general setting, in (Qu et al., 2014). Note that in this special case, since $p^t$ is independent of the residue $\kappa^t$, the computation of $\kappa^t$ is unnecessary and hence the complexity of each iteration does not scale up with $n$.
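A minimal sketch of this fixed distribution (an added illustration; $v_i = A_i^\top A_i$ as in Algorithm 1):

import numpy as np

def importance_sampling_probs(A, lam, gamma):
    # Fixed importance sampling p* from eq. (12): p_i proportional to v_i + n*lam*gamma
    d, n = A.shape
    v = (A * A).sum(axis=0)
    p = v + n * lam * gamma
    return p / p.sum()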

It seems difficult to identify other feasible solutions to program (11) apart from $p^*$, not to mention solve it exactly. However, by relaxing the constraint $\theta(\kappa^t, p) \le \min_{i:\kappa_i^t\ne 0} p_i$, we obtain an explicit optimal solution.

Lemma 4. The optimal solution $p^*(\kappa^t)$ of
\[ \begin{aligned} \max_{p\in\mathbb{R}^n_+}\quad & \theta(\kappa^t, p) \\ \text{s.t.}\quad & \sum_{i=1}^n p_i = 1 \end{aligned} \tag{13} \]
is:
\[ (p^*(\kappa^t))_i = \frac{|\kappa_i^t|\sqrt{v_i + n\lambda\gamma}}{\sum_{j=1}^n |\kappa_j^t|\sqrt{v_j + n\lambda\gamma}}, \quad \forall i\in[n]. \tag{14} \]

Proof. The proof is deferred to the appendix.

The suggestion made by (14) is clear: we should update more often those dual coordinates $\alpha_i$ which have a large absolute dual residue $|\kappa_i^t|$ and/or a large Lipschitz constant $v_i$.
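A corresponding short sketch of the closed form (14) (an added illustration; it assumes at least one residue is nonzero):

import numpy as np

def adaptive_probs(kappa, v, lam, gamma):
    # Adaptive probabilities p*(kappa) from eq. (14): p_i proportional to |kappa_i| * sqrt(v_i + n*lam*gamma)
    n = len(kappa)
    p = np.abs(kappa) * np.sqrt(v + n * lam * gamma)
    return p / p.sum()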

If we let $p^t = p^*(\kappa^t)$ and $\theta = \theta(\kappa^t, p^t)$, the constraint (9) may not be satisfied, in which case (8) does not necessarily hold. However, as shown by the next lemma, the constraint (9) is not required for obtaining (8) when all the functions $\{\phi_i\}_i$ are quadratic.

Lemma 5. Suppose that all $\{\phi_i\}_i$ are quadratic. Let $t\ge 0$. If $\min_{i\in I_t} p_i^t > 0$, then (8) holds for any $\theta\in[0,+\infty)$.

The proof is deferred to the appendix.

4. Convergence results

In this section we present our theoretical complexity results for AdaSDCA. The main results are formulated in Theorem 7, covering the general case, and in Theorem 11, covering the special case when $\{\phi_i\}_{i=1}^n$ are all quadratic.

4.1. General loss functions

We derive the convergence result from Lemma 3.

Proposition 6. Let $t\ge 0$. If $\min_{i\in I_t} p_i^t > 0$ and $\theta(\kappa^t, p^t) \le \min_{i\in I_t} p_i^t$, then
\[ \mathbf{E}_t\!\left[ D(\alpha^{t+1}) - D(\alpha^t) \right] \ \ge\ \theta(\kappa^t, p^t)\left( P(w^t) - D(\alpha^t) \right). \]

Proof. This follows directly from Lemma 3 and the fact that the right-hand side of (8) equals 0 when $\theta = \theta(\kappa^t, p^t)$.

Theorem 7. Consider AdaSDCA. If at each iteration $t\ge 0$, $\min_{i\in I_t} p_i^t > 0$ and $\theta(\kappa^t, p^t) \le \min_{i\in I_t} p_i^t$, then
\[ \mathbf{E}[P(w^t) - D(\alpha^t)] \ \le\ \frac{1}{\theta_t}\prod_{k=0}^{t}(1-\theta_k)\left( D(\alpha^*) - D(\alpha^0) \right), \tag{15} \]
for all $t\ge 0$, where
\[ \theta_t \stackrel{\text{def}}{=} \frac{\mathbf{E}[\theta(\kappa^t, p^t)(P(w^t) - D(\alpha^t))]}{\mathbf{E}[P(w^t) - D(\alpha^t)]}. \tag{16} \]

Proof. By Proposition 6, we know that
\[ \mathbf{E}[D(\alpha^{t+1}) - D(\alpha^t)] \ \ge\ \mathbf{E}[\theta(\kappa^t, p^t)(P(w^t) - D(\alpha^t))] \ \stackrel{(16)}{=}\ \theta_t\,\mathbf{E}[P(w^t) - D(\alpha^t)] \ \ge\ \theta_t\,\mathbf{E}[D(\alpha^*) - D(\alpha^t)], \tag{17} \]
whence
\[ \mathbf{E}[D(\alpha^*) - D(\alpha^{t+1})] \ \le\ (1-\theta_t)\,\mathbf{E}[D(\alpha^*) - D(\alpha^t)]. \]
Therefore,
\[ \mathbf{E}[D(\alpha^*) - D(\alpha^t)] \ \le\ \prod_{k=0}^{t}(1-\theta_k)\left( D(\alpha^*) - D(\alpha^0) \right). \]
By plugging the last bound into (17) we get the bound on the primal-dual error:
\[ \mathbf{E}[P(w^t) - D(\alpha^t)] \ \le\ \frac{1}{\theta_t}\,\mathbf{E}[D(\alpha^{t+1}) - D(\alpha^t)] \ \le\ \frac{1}{\theta_t}\,\mathbf{E}[D(\alpha^*) - D(\alpha^t)] \ \le\ \frac{1}{\theta_t}\prod_{k=0}^{t}(1-\theta_k)\left( D(\alpha^*) - D(\alpha^0) \right). \]

As mentioned in Section 3, by letting every sampling probability $p^t$ be the importance sampling (optimal serial sampling) $p^*$ defined in (12), AdaSDCA reduces to IProx-SDCA proposed in (Zhao & Zhang, 2014). The convergence theory established for IProx-SDCA in (Zhao & Zhang, 2014), which can also be derived as a direct corollary of our Theorem 7, is stated as follows.


Theorem 8 ((Zhao & Zhang, 2014)). Consider AdaSDCA with $p^t = p^*$ defined in (12) for all $t\ge 0$. Then
\[ \mathbf{E}[P(w^t) - D(\alpha^t)] \ \le\ \frac{1}{\theta^*}\,(1-\theta^*)^{t}\left( D(\alpha^*) - D(\alpha^0) \right), \]
where
\[ \theta^* = \frac{n\lambda\gamma}{\sum_{i=1}^n (v_i + \lambda\gamma n)}. \]
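As a quick sanity check (an added remark, not part of the original text), Theorem 8 recovers the familiar SDCA-type complexity: to reach $\mathbf{E}[P(w^t)-D(\alpha^t)]\le\epsilon$ it suffices to take
\[ t \ \ge\ \frac{1}{\theta^*}\,\log\!\left( \frac{D(\alpha^*)-D(\alpha^0)}{\theta^*\,\epsilon} \right), \qquad \frac{1}{\theta^*} \;=\; \frac{\sum_{i=1}^n (v_i + \lambda\gamma n)}{n\lambda\gamma} \;=\; n + \frac{\bar v}{\lambda\gamma}, \quad \bar v \stackrel{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n v_i, \]
so for normalized data ($v_i\le 1$) the iteration count scales as $O\big((n + \tfrac{1}{\lambda\gamma})\log\tfrac{1}{\epsilon}\big)$.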

The next corollary suggests that a better convergence rate than IProx-SDCA can be achieved by using a properly chosen adaptive sampling probability.

Corollary 9. Consider AdaSDCA. If at each iteration $t\ge 0$, $p^t$ is the optimal solution of (11), then (15) holds and $\theta_t \ge \theta^*$ for all $t\ge 0$.

However, solving (11) requires a large computational effort, because of the dimension $n$ and the non-convex structure of the program. We show in the next section that when all the loss functions $\{\phi_i\}_i$ are quadratic, we can obtain a better theoretical convergence rate than IProx-SDCA by using the optimal solution of (13).

4.2. Quadratic loss functions

The main difficulty in solving (11) comes from the inequality constraint, which originates from (9). In this section we mainly show that the constraint (9) can be dropped when all $\{\phi_i\}_i$ are quadratic.

Proposition 10. Suppose that all $\{\phi_i\}_i$ are quadratic. Let $t\ge 0$. If $\min_{i\in I_t} p_i^t > 0$, then
\[ \mathbf{E}_t\!\left[ D(\alpha^{t+1}) - D(\alpha^t) \right] \ \ge\ \theta(\kappa^t, p^t)\left( P(w^t) - D(\alpha^t) \right). \]

Proof. This is a direct consequence of Lemma 5 and the fact that the right-hand side of (8) equals 0 when $\theta = \theta(\kappa^t, p^t)$.

Theorem 11. Suppose that all $\{\phi_i\}_i$ are quadratic. Consider AdaSDCA. If at each iteration $t\ge 0$, $\min_{i\in I_t} p_i^t > 0$, then (15) holds for all $t\ge 0$.

Proof. We only need to apply Proposition 10. The rest of the proof is the same as in Theorem 7.

Corollary 12. Suppose that all $\{\phi_i\}_i$ are quadratic. Consider AdaSDCA. If at each iteration $t\ge 0$, $p^t$ is the optimal solution of (13), which has the closed form (14), then (15) holds and $\theta_t \ge \theta^*$ for all $t\ge 0$.

5. Efficient heuristic variant

Corollaries 9 and 12 suggest how to choose an adaptive sampling probability in AdaSDCA which yields a theoretical convergence rate at least as good as that of IProx-SDCA (Zhao & Zhang, 2014). However, there are two main implementation issues with AdaSDCA:

1. The update of the dual residue $\kappa^t$ at each iteration costs $O(\mathrm{nnz}(A))$, where $\mathrm{nnz}(A)$ is the number of nonzero elements of the matrix $A$;

2. We do not know how to compute the optimal solution of (11).

In this section, we propose a heuristic variant of AdaSDCA, which avoids the above two issues while staying close to the 'good' adaptive sampling distribution.

5.1. Description of Algorithm

Algorithm 2 AdaSDCA+
Parameter: a number $m > 1$
Initialization: choose $\alpha^0\in\mathbb{R}^n$, set $\bar\alpha^0 = \frac{1}{\lambda n}A\alpha^0$
for $t \ge 0$ do
  Primal update: $w^t = \nabla g^*(\bar\alpha^t)$
  Set: $\alpha^{t+1} = \alpha^t$
  if $\operatorname{mod}(t, n) == 0$ then
    Option I: Adaptive probability
      Compute: $\kappa_i^t = \alpha_i^t + \nabla\phi_i(A_i^\top w^t)$, $\forall i\in[n]$
      Set: $p_i^t \sim |\kappa_i^t|\sqrt{v_i + n\lambda\gamma}$, $\forall i\in[n]$
    Option II: Optimal importance probability
      Set: $p_i^t \sim (v_i + n\lambda\gamma)$, $\forall i\in[n]$
  end if
  Generate random $i_t\in[n]$ according to $p^t$
  Compute: $\Delta\alpha_{i_t}^t = \arg\max_{\Delta\in\mathbb{R}} \big\{ -\phi_{i_t}^*(-(\alpha_{i_t}^t+\Delta)) - A_{i_t}^\top w^t\Delta - \frac{v_{i_t}}{2\lambda n}|\Delta|^2 \big\}$
  Dual update: $\alpha_{i_t}^{t+1} = \alpha_{i_t}^t + \Delta\alpha_{i_t}^t$
  Average update: $\bar\alpha^{t+1} = \bar\alpha^t + \frac{\Delta\alpha_{i_t}^t}{\lambda n} A_{i_t}$
  Probability update: $p_{i_t}^{t+1} \sim p_{i_t}^t/m$, $\ p_j^{t+1} \sim p_j^t$, $\forall j\ne i_t$
end for
Output: $w^t$, $\alpha^t$

AdaSDCA+ has the same structure as AdaSDCA, with a few important differences.

Epochs. AdaSDCA+ is divided into epochs of length $n$. At the beginning of every epoch, sampling probabilities are computed according to one of two options. During each epoch the probabilities are cheaply updated at the end of every iteration to approximate the adaptive model. The intuition behind this is as follows. After $i$ is sampled and the dual coordinate $\alpha_i$ is updated, the residue $\kappa_i$ naturally decreases. We then also decrease the probability that $i$ is chosen in the next iteration, by setting $p^{t+1}$ to be proportional to $(p_1^t,\dots,p_{i-1}^t,\ p_i^t/m,\ p_{i+1}^t,\dots,p_n^t)$. By doing this we avoid the computation of $\kappa$ at each iteration (issue 1), which costs as much as a full gradient computation, while following closely the changes of the dual residue $\kappa$. We reset the adaptive sampling probability after every epoch of length $n$.

Parameter m. The setting of the parameter $m$ in AdaSDCA+ directly affects the performance of the algorithm. If $m$ is too large, the probability of sampling the same coordinate twice during an epoch will be very small, which results in a random permutation over all coordinates in every epoch. On the other hand, if $m$ is too small, the coordinates having larger probabilities at the beginning of an epoch could be sampled more often than they should be, even after their corresponding dual residues become sufficiently small. We do not have a definitive rule for the choice of $m$ and leave this to future work. Experiments with different choices of $m$ can be found in Section 6.

Option I & Option II. At the beginning of each epoch, one can choose between two options for resetting the sampling probability. Option I corresponds to the optimal solution of (13), given by the closed form (14). Option II is the optimal serial sampling probability (12), the same as the one used in IProx-SDCA (Zhao & Zhang, 2014). However, AdaSDCA+ differs significantly from IProx-SDCA since we also update the sampling probability iteratively, which, as we show through numerical experiments, yields faster convergence than IProx-SDCA.

5.2. Computational cost

Sampling and probability update. During the algorithm we sample $i\in[n]$ from a non-uniform probability distribution $p^t$, which changes at each iteration. This can be done efficiently using the Random Counters algorithm introduced in Section 6.2 of (Nesterov, 2012), which takes $O(n\log(n))$ operations to create the probability tree and $O(\log(n))$ operations to sample from the distribution or change one of the probabilities.
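To make the $O(\log(n))$ bookkeeping concrete, here is a small binary sum-tree sampler in Python (an illustrative sketch in the spirit of the Random Counters structure of Nesterov (2012), not a transcription of it). It supports sampling proportionally to unnormalized weights and updating a single weight, each in $O(\log(n))$, which is exactly what the per-iteration update $p_{i_t}\leftarrow p_{i_t}/m$ of AdaSDCA+ requires.

import random

class SumTree:
    # Binary sum tree over n nonnegative weights: O(log n) sampling and single-weight updates.
    def __init__(self, weights):
        self.n = len(weights)
        self.size = 1
        while self.size < self.n:
            self.size *= 2
        self.tree = [0.0] * (2 * self.size)
        for i, w in enumerate(weights):
            self.tree[self.size + i] = float(w)
        for j in range(self.size - 1, 0, -1):        # internal nodes store subtree sums
            self.tree[j] = self.tree[2 * j] + self.tree[2 * j + 1]

    def get(self, i):
        return self.tree[self.size + i]

    def update(self, i, w):
        # Set the weight of coordinate i to w and refresh the path up to the root.
        j = self.size + i
        self.tree[j] = float(w)
        j //= 2
        while j >= 1:
            self.tree[j] = self.tree[2 * j] + self.tree[2 * j + 1]
            j //= 2

    def sample(self):
        # Draw index i with probability weight_i / total_weight.
        u = random.random() * self.tree[1]
        j = 1
        while j < self.size:
            if u <= self.tree[2 * j]:
                j = 2 * j                            # descend into the left subtree
            else:
                u -= self.tree[2 * j]
                j = 2 * j + 1                        # descend into the right subtree
        return j - self.size

A usage sketch of the AdaSDCA+ probability update: tree = SumTree(p); i = tree.sample(); tree.update(i, tree.get(i) / m).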

Total computational cost. We can compute the computational cost of one epoch. At the beginning of an epoch, we need $O(\mathrm{nnz}(A))$ operations to calculate the dual residue $\kappa$. Then we create a probability tree using $O(n\log(n))$ operations. At each iteration we need $O(\log(n))$ operations to sample a coordinate, $O(\mathrm{nnz}(A)/n)$ operations to calculate the update to $\alpha$, and a further $O(\log(n))$ operations to update the probability tree. As a result, an epoch needs $O(\mathrm{nnz}(A) + n\log(n))$ operations. For comparison purposes we list in Table 1 the one-epoch computational cost of comparable algorithms.

6. Numerical Experiments

In this section we present results of numerical experiments.

Table 1. One-epoch computational cost of different algorithms

ALGORITHM                  COST OF AN EPOCH
SDCA & QUARTZ (uniform)    O(nnz)
IProx-SDCA                 O(nnz + n log(n))
AdaSDCA                    O(n · nnz)
AdaSDCA+                   O(nnz + n log(n))

Table 2. Dimensions and nonzeros of the datasets

DATASET     d         n         nnz/(nd)
w8a         300       49,749    3.9%
dorothea    100,000   800       0.9%
mushrooms   112       8,124     18.8%
cov1        54        581,012   22%
ijcnn1      22        49,990    41%

6.1. Loss functions

We test AdaSDCA, AdaSDCA+, SDCA and IProx-SDCA for two different types of loss functions $\{\phi_i\}_{i=1}^n$: the quadratic loss and the smoothed hinge loss. Let $y\in\mathbb{R}^n$ be the vector of labels. The quadratic loss is given by
\[ \phi_i(x) = \frac{1}{2\gamma}(x - y_i)^2, \]
and the smoothed hinge loss is:
\[ \phi_i(x) = \begin{cases} 0 & y_i x \ge 1, \\ 1 - y_i x - \gamma/2 & y_i x \le 1-\gamma, \\ \dfrac{(1 - y_i x)^2}{2\gamma} & \text{otherwise.} \end{cases} \]

In both cases we use the L2 regularizer, i.e.,
\[ g(w) = \tfrac{1}{2}\|w\|^2. \]

Quadratic loss functions usually appear in regression problems, and the smoothed hinge loss can be found in linear support vector machine (SVM) problems (Shalev-Shwartz & Zhang, 2013a).
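For readers reproducing the experiments, minimal Python definitions of these two losses and their derivatives (the derivative is what enters the residue computation in (7)) might look as follows; this is an added illustration, not the authors' experimental code.

def quadratic_loss(x, y_i, gamma):
    # phi_i(x) = (x - y_i)^2 / (2*gamma)
    return (x - y_i) ** 2 / (2 * gamma)

def quadratic_loss_grad(x, y_i, gamma):
    # phi_i'(x) = (x - y_i) / gamma
    return (x - y_i) / gamma

def smoothed_hinge_loss(x, y_i, gamma):
    z = y_i * x
    if z >= 1:
        return 0.0
    if z <= 1 - gamma:
        return 1 - z - gamma / 2
    return (1 - z) ** 2 / (2 * gamma)

def smoothed_hinge_grad(x, y_i, gamma):
    # derivative of the smoothed hinge loss with respect to x
    z = y_i * x
    if z >= 1:
        return 0.0
    if z <= 1 - gamma:
        return -y_i
    return -y_i * (1 - z) / gamma

With labels $y_i\in\{\pm 1\}$, both losses are $1/\gamma$-smooth, as required by the theory.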

6.2. Numerical results

We used 5 different datasets: w8a, dorothea, mushrooms, cov1 and ijcnn1 (see Table 2).

In all our experiments we used $\gamma = 1$ and $\lambda = 1/n$.


AdaSDCA. The results of the theory developed in Section 4 can be observed in Figures 1 to 4. AdaSDCA needs the fewest iterations to converge, confirming the theoretical result.

AdaSDCA+ vs. others. We can observe from Figures 15 to 24 that both options of AdaSDCA+ outperform SDCA and IProx-SDCA, in terms of the number of iterations, for quadratic loss functions as well as for smoothed hinge loss functions. One can observe similar results in terms of time in Figures 5 to 14.

Option I vs. Option II. Despite the fact that Option I is not theoretically supported for the smoothed hinge loss, it still converges faster than Option II on every dataset and for every loss function. The biggest difference can be observed in Figure 13, where Option I converges to machine precision in just 15 seconds.

Different choices of m. To show the impact of different choices of $m$ on the performance of AdaSDCA+, in Figures 25 to 33 we compare the results of the two options of AdaSDCA+ using $m$ equal to 2, 10 and 50. It is hard to draw a clear conclusion here, because the optimal $m$ clearly depends on the dataset and the problem type.

Figure 1. w8a dataset, d = 300, n = 49749, Quadratic loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 2. dorothea dataset, d = 100000, n = 800, Quadratic loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 3. mushrooms dataset, d = 112, n = 8124, Quadratic loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 4. ijcnn1 dataset, d = 22, n = 49990, Quadratic loss with L2 regularizer, comparing number of iterations with known algorithms


Figure 5. w8a dataset, d = 300, n = 49749, Quadratic loss with L2 regularizer, comparing real time with known algorithms

Figure 6. dorothea dataset, d = 100000, n = 800, Quadratic loss with L2 regularizer, comparing real time with known algorithms

Figure 7. mushrooms dataset, d = 112, n = 8124, Quadratic loss with L2 regularizer, comparing real time with known algorithms

Figure 8. cov1 dataset, d = 54, n = 581012, Quadratic loss with L2 regularizer, comparing real time with known algorithms

Figure 9. ijcnn1 dataset, d = 22, n = 49990, Quadratic loss with L2 regularizer, comparing real time with known algorithms

Figure 10. w8a dataset, d = 300, n = 49749, Smooth Hinge loss with L2 regularizer, comparing real time with known algorithms


Figure 11. dorothea dataset, d = 100000, n = 800, Smooth Hinge loss with L2 regularizer, comparing real time with known algorithms

Figure 12. mushrooms dataset, d = 112, n = 8124, Smooth Hinge loss with L2 regularizer, comparing real time with known algorithms

Figure 13. cov1 dataset, d = 54, n = 581012, Smooth Hinge loss with L2 regularizer, comparing real time with known algorithms

Figure 14. ijcnn1 dataset, d = 22, n = 49990, Smooth Hinge loss with L2 regularizer, comparing real time with known algorithms

Figure 15. w8a dataset, d = 300, n = 49749, Quadratic loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 16. dorothea dataset, d = 100000, n = 800, Quadratic loss with L2 regularizer, comparing number of iterations with known algorithms


Figure 17. mushrooms dataset, d = 112, n = 8124, Quadratic loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 18. cov1 dataset, d = 54, n = 581012, Quadratic loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 19. ijcnn1 dataset, d = 22, n = 49990, Quadratic loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 20. w8a dataset, d = 300, n = 49749, Smooth Hinge loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 21. dorothea dataset, d = 100000, n = 800, Smooth Hinge loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 22. mushrooms dataset, d = 112, n = 8124, Smooth Hinge loss with L2 regularizer, comparing number of iterations with known algorithms


Figure 23. cov1 dataset, d = 54, n = 581012, Smooth Hinge loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 24. ijcnn1 dataset, d = 22, n = 49990, Smooth Hinge loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 25. w8a dataset, d = 300, n = 49749, Quadratic loss with L2 regularizer, comparison of different choices of the constant m

Figure 26. dorothea dataset, d = 100000, n = 800, Quadratic loss with L2 regularizer, comparison of different choices of the constant m

Figure 27. mushrooms dataset, d = 112, n = 8124, Quadratic loss with L2 regularizer, comparison of different choices of the constant m

Figure 28. cov1 dataset, d = 54, n = 581012, Quadratic loss with L2 regularizer, comparison of different choices of the constant m


Figure 29. ijcnn1 dataset, d = 22, n = 49990, Quadratic loss with L2 regularizer, comparison of different choices of the constant m

Figure 30. w8a dataset, d = 300, n = 49749, Smooth Hinge loss with L2 regularizer, comparison of different choices of the constant m

Figure 31. dorothea dataset, d = 100000, n = 800, Smooth Hinge loss with L2 regularizer, comparison of different choices of the constant m

Figure 32. mushrooms dataset, d = 112, n = 8124, Smooth Hinge loss with L2 regularizer, comparison of different choices of the constant m

Figure 33. ijcnn1 dataset, d = 22, n = 49990, Smooth Hinge loss with L2 regularizer, comparison of different choices of the constant m


References

Agarwal, Alekh and Bottou, Leon. A lower bound for the optimization of finite sums. arXiv:1410.0723, 2014.

Banks-Watson, Alexander. New classes of coordinate descent methods. Master's thesis, University of Edinburgh, 2012.

Defazio, Aaron, Bach, Francis, and Lacoste-Julien, Simon. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. arXiv:1407.0202, 2014.

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(1):2121–2159, 2011.

Fercoq, Olivier and Richtarik, Peter. Accelerated, parallel and proximal coordinate descent. SIAM Journal on Optimization (after minor revision), arXiv:1312.5799, 2013.

Fercoq, Olivier and Richtarik, Peter. Smooth minimization of nonsmooth functions by parallel coordinate descent. arXiv:1309.5885, 2013.

Fercoq, Olivier, Qu, Zheng, Richtarik, Peter, and Takac, Martin. Fast distributed coordinate descent for minimizing non-strongly convex losses. IEEE International Workshop on Machine Learning for Signal Processing, 2014.

Fountoulakis, Kimon and Tappenden, Rachael. Robust block coordinate descent. arXiv:1407.7573, 2014.

Glasmachers, Tobias and Dogan, Urun. Accelerated coordinate descent with adaptive coordinate frequencies. In Asian Conference on Machine Learning, pp. 72–86, 2013.

Jaggi, Martin, Smith, Virginia, Takac, Martin, Terhorst, Jonathan, Krishnan, Sanjay, Hofmann, Thomas, and Jordan, Michael I. Communication-efficient distributed dual coordinate ascent. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 3068–3076. Curran Associates, Inc., 2014.

Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, 2013.

Konecny, Jakub and Richtarik, Peter. S2GD: Semi-stochastic gradient descent methods. arXiv:1312.1666, 2014.

Konecny, Jakub, Lu, Jie, Richtarik, Peter, and Takac, Martin. mS2GD: Mini-batch semi-stochastic gradient descent in the proximal setting. arXiv:1410.4744, 2014a.

Konecny, Jakub, Qu, Zheng, and Richtarik, Peter. Semi-stochastic coordinate descent. arXiv:1412.6293, 2014b.

Lin, Qihang, Lu, Zhaosong, and Xiao, Lin. An accelerated proximal coordinate gradient method and its application to regularized empirical risk minimization. Technical Report MSR-TR-2014-94, July 2014.

Liu, Ji and Wright, Stephen J. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. arXiv:1403.3862, 2014.

Loshchilov, I., Schoenauer, M., and Sebag, M. Adaptive coordinate descent. In Krasnogor, N. et al. (ed.), Genetic and Evolutionary Computation Conference (GECCO), pp. 885–892. ACM Press, July 2011.

Lukasewitz, Isabella. Block-coordinate Frank-Wolfe optimization. A study on randomized sampling methods, 2013.

Mahajan, Dhruv, Keerthi, S. Sathiya, and Sundararajan, S. A distributed block coordinate descent method for training l1 regularized linear classifiers. arXiv:1405.4544, 2014.

Mairal, Julien. Incremental majorization-minimization optimization with application to large-scale machine learning. Technical report, 2014.

Marecek, Jakub, Richtarik, Peter, and Takac, Martin. Distributed block coordinate descent for minimizing partially separable functions. arXiv:1406.0328, 2014.

Necoara, Ion and Patrascu, Andrei. A random coordinate descent algorithm for optimization problems with composite objective function and linear coupled constraints. Computational Optimization and Applications, 57:307–337, 2014.

Necoara, Ion, Nesterov, Yurii, and Glineur, Francois. Efficiency of randomized coordinate descent methods on optimization problems with linearly coupled constraints. Technical report, Politehnica University of Bucharest, 2012.

Nemirovski, Arkadi, Juditsky, Anatoli, Lan, Guanghui, and Shapiro, Alexander. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

Nesterov, Yurii. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

Qu, Zheng and Richtarik, Peter. Coordinate descent methods with arbitrary sampling I: Algorithms and complexity. arXiv:1412.8060, 2014.

Qu, Zheng and Richtarik, Peter. Coordinate descent with arbitrary sampling II: Expected separable overapproximation. ArXiv e-prints, 2014.

Qu, Zheng, Richtarik, Peter, and Zhang, Tong. Randomized dual coordinate ascent with arbitrary sampling. arXiv:1411.5873, 2014.

Richtarik, Peter and Takac, Martin. Distributed coordinate descent method for learning with big data. arXiv:1310.2059, 2013a.

Richtarik, Peter and Takac, Martin. On optimal probabilities in stochastic coordinate descent methods. arXiv:1310.3438, 2013b.

Richtarik, Peter and Takac, Martin. Parallel coordinate descent methods for big data optimization problems. Mathematical Programming (after minor revision), arXiv:1212.0873, 2012.

Richtarik, Peter and Takac, Martin. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(2):1–38, 2014.

Schaul, Tom, Zhang, Sixin, and LeCun, Yann. No more pesky learning rates. Journal of Machine Learning Research, 3(28):343–351, 2013.

Schmidt, Mark, Le Roux, Nicolas, and Bach, Francis. Minimizing finite sums with the stochastic average gradient. arXiv:1309.2388, 2013.

Shalev-Shwartz, Shai and Tewari, Ambuj. Stochastic methods for ℓ1-regularized loss minimization. Journal of Machine Learning Research, 12:1865–1892, 2011.

Shalev-Shwartz, Shai and Zhang, Tong. Proximal stochastic dual coordinate ascent. arXiv:1211.2717, 2012.

Shalev-Shwartz, Shai and Zhang, Tong. Accelerated mini-batch stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems 26, pp. 378–385. 2013a.

Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013b.

Takac, Martin, Bijral, Avleen Singh, Richtarik, Peter, and Srebro, Nathan. Mini-batch primal and dual methods for SVMs. CoRR, abs/1303.2314, 2013.

Tappenden, Rachael, Richtarik, Peter, and Gondzio, Jacek. Inexact block coordinate descent method: complexity and preconditioning. arXiv:1304.5530, 2013.

Tappenden, Rachael, Richtarik, Peter, and Buke, Burak. Separable approximations and decomposition methods for the augmented Lagrangian. Optimization Methods and Software, 2014.

Trofimov, Ilya and Genkin, Alexander. Distributed coordinate descent for l1-regularized logistic regression. arXiv:1411.6520, 2014.

Wright, Stephen J. Coordinate descent algorithms. Technical report, 2014. URL http://www.optimization-online.org/DB_FILE/2014/12/4

Xiao, Lin and Zhang, Tong. A proximal stochastic gradient method with progressive variance reduction. arXiv:1403.4699, 2014.

Zhang, Yuchen and Xiao, Lin. Stochastic primal-dual coordinate method for regularized empirical risk minimization. Technical Report MSR-TR-2014-123, September 2014.

Zhao, Peilin and Zhang, Tong. Stochastic optimization with importance sampling. arXiv:1401.2753, 2014.

Zhao, Tuo, Liu, Han, and Zhang, Tong. A general theory of pathwise coordinate optimization. arXiv:1412.7477, 2014a.

Zhao, Tuo, Yu, Mo, Wang, Yiming, Arora, Raman, and Liu, Han. Accelerated mini-batch randomized block coordinate descent method. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 3329–3337. Curran Associates, Inc., 2014b.


Appendix

Proofs

We shall need the following inequality.

Lemma 13. The function $f:\mathbb{R}^n\to\mathbb{R}$ defined in (3) satisfies the following inequality:
\[ f(\alpha+h) \le f(\alpha) + \langle\nabla f(\alpha), h\rangle + \frac{1}{2\lambda n^2}\,h^\top A^\top A h, \tag{18} \]
for all $\alpha, h\in\mathbb{R}^n$.

Proof. Since $g$ is 1-strongly convex, $g^*$ is 1-smooth. Pick $\alpha, h\in\mathbb{R}^n$. Since $f(\alpha) = \lambda g^*(\frac{1}{\lambda n}A\alpha)$, we have
\[ \begin{aligned} f(\alpha+h) &= \lambda g^*\!\left( \tfrac{1}{\lambda n}A\alpha + \tfrac{1}{\lambda n}Ah \right) \\ &\le \lambda\left( g^*\!\left( \tfrac{1}{\lambda n}A\alpha \right) + \left\langle \nabla g^*\!\left( \tfrac{1}{\lambda n}A\alpha \right),\ \tfrac{1}{\lambda n}Ah \right\rangle + \tfrac{1}{2}\left\| \tfrac{1}{\lambda n}Ah \right\|^2 \right) \\ &= f(\alpha) + \langle\nabla f(\alpha), h\rangle + \frac{1}{2\lambda n^2}\,h^\top A^\top A h. \end{aligned} \]

Proof of Lemma 3. It can be easily checked that the following relations hold:
\[ \nabla_i f(\alpha^t) = \frac{1}{n} A_i^\top w^t, \quad \forall t\ge 0,\ i\in[n], \tag{19} \]
\[ g(w^t) + g^*(\bar\alpha^t) = \langle w^t, \bar\alpha^t\rangle, \quad \forall t\ge 0, \tag{20} \]
where $\{w^t, \alpha^t, \bar\alpha^t\}_{t\ge 0}$ is the output sequence of Algorithm 1. Let $t\ge 0$ and $\theta\in[0, \min_i p_i^t]$. For each $i\in[n]$, since $\phi_i$ is $1/\gamma$-smooth, $\phi_i^*$ is $\gamma$-strongly convex and thus for arbitrary $s_i\in[0,1]$,
\[ \phi_i^*(-\alpha_i^t + s_i\kappa_i^t) = \phi_i^*\big( (1-s_i)(-\alpha_i^t) + s_i\nabla\phi_i(A_i^\top w^t) \big) \le (1-s_i)\phi_i^*(-\alpha_i^t) + s_i\phi_i^*(\nabla\phi_i(A_i^\top w^t)) - \frac{\gamma s_i(1-s_i)|\kappa_i^t|^2}{2}. \tag{21} \]
We have:
\[ f(\alpha^{t+1}) - f(\alpha^t) \ \stackrel{(18)}{\le}\ \langle\nabla f(\alpha^t), \alpha^{t+1}-\alpha^t\rangle + \frac{1}{2\lambda n^2}\langle \alpha^{t+1}-\alpha^t,\ A^\top A(\alpha^{t+1}-\alpha^t)\rangle = \nabla_{i_t} f(\alpha^t)\,\Delta\alpha_{i_t}^t + \frac{v_{i_t}}{2\lambda n^2}|\Delta\alpha_{i_t}^t|^2 \ \stackrel{(19)}{=}\ \frac{1}{n}A_{i_t}^\top w^t\,\Delta\alpha_{i_t}^t + \frac{v_{i_t}}{2\lambda n^2}|\Delta\alpha_{i_t}^t|^2. \tag{22} \]
Thus,
\[ \begin{aligned} D(\alpha^{t+1}) - D(\alpha^t) \ &\stackrel{(22)}{\ge}\ -\frac{1}{n}A_{i_t}^\top w^t\,\Delta\alpha_{i_t}^t - \frac{v_{i_t}}{2\lambda n^2}|\Delta\alpha_{i_t}^t|^2 + \frac{1}{n}\sum_{i=1}^n \phi_i^*(-\alpha_i^t) - \frac{1}{n}\sum_{i=1}^n \phi_i^*(-\alpha_i^{t+1}) \\ &= -\frac{1}{n}A_{i_t}^\top w^t\,\Delta\alpha_{i_t}^t - \frac{v_{i_t}}{2\lambda n^2}|\Delta\alpha_{i_t}^t|^2 + \frac{1}{n}\phi_{i_t}^*(-\alpha_{i_t}^t) - \frac{1}{n}\phi_{i_t}^*\big( -(\alpha_{i_t}^t + \Delta\alpha_{i_t}^t) \big) \\ &= \max_{\Delta\in\mathbb{R}}\ \Big\{ -\frac{1}{n}A_{i_t}^\top w^t\,\Delta - \frac{v_{i_t}}{2\lambda n^2}|\Delta|^2 + \frac{1}{n}\phi_{i_t}^*(-\alpha_{i_t}^t) - \frac{1}{n}\phi_{i_t}^*\big( -(\alpha_{i_t}^t + \Delta) \big) \Big\}, \end{aligned} \]
where the last equality follows from the definition of $\Delta\alpha_{i_t}^t$ in Algorithm 1. Then by letting $\Delta = -s_{i_t}\kappa_{i_t}^t$ for an arbitrary $s_{i_t}\in[0,1]$ we get:
\[ \begin{aligned} D(\alpha^{t+1}) - D(\alpha^t) \ &\ge\ \frac{s_{i_t}A_{i_t}^\top w^t\,\kappa_{i_t}^t}{n} - \frac{s_{i_t}^2 v_{i_t}|\kappa_{i_t}^t|^2}{2\lambda n^2} + \frac{1}{n}\phi_{i_t}^*(-\alpha_{i_t}^t) - \frac{1}{n}\phi_{i_t}^*(-\alpha_{i_t}^t + s_{i_t}\kappa_{i_t}^t) \\ &\stackrel{(21)}{\ge}\ \frac{s_{i_t}}{n}\Big( \phi_{i_t}^*(-\alpha_{i_t}^t) - \phi_{i_t}^*(\nabla\phi_{i_t}(A_{i_t}^\top w^t)) + A_{i_t}^\top w^t\,\kappa_{i_t}^t \Big) - \frac{s_{i_t}^2 v_{i_t}|\kappa_{i_t}^t|^2}{2\lambda n^2} + \frac{\gamma s_{i_t}(1-s_{i_t})|\kappa_{i_t}^t|^2}{2n}. \end{aligned} \]
By taking expectation with respect to $i_t$ we get:
\[ \mathbf{E}_t\!\left[ D(\alpha^{t+1}) - D(\alpha^t) \right] \ \ge\ \sum_{i=1}^n \frac{p_i^t s_i}{n}\Big[ \phi_i^*(-\alpha_i^t) - \phi_i^*(\nabla\phi_i(A_i^\top w^t)) + A_i^\top w^t\kappa_i^t \Big] - \sum_{i=1}^n \frac{p_i^t s_i^2 |\kappa_i^t|^2 (v_i + \lambda\gamma n)}{2\lambda n^2} + \sum_{i=1}^n \frac{p_i^t \gamma s_i |\kappa_i^t|^2}{2n}. \tag{23} \]
Set
\[ s_i = \begin{cases} 0, & i\notin I_t, \\ \theta/p_i^t, & i\in I_t. \end{cases} \tag{24} \]
Then $s_i\in[0,1]$ for each $i\in[n]$ and by plugging it into (23) we get:
\[ \mathbf{E}_t\!\left[ D(\alpha^{t+1}) - D(\alpha^t) \right] \ \ge\ \frac{\theta}{n}\sum_{i\in I_t}\Big[ \phi_i^*(-\alpha_i^t) - \phi_i^*(\nabla\phi_i(A_i^\top w^t)) + A_i^\top w^t\kappa_i^t \Big] - \frac{\theta}{2\lambda n^2}\sum_{i\in I_t}\Big( \frac{\theta(v_i + n\lambda\gamma)}{p_i^t} - n\lambda\gamma \Big)|\kappa_i^t|^2. \]


Finally, note that:
\[ \begin{aligned} P(w^t) - D(\alpha^t) &= \frac{1}{n}\sum_{i=1}^n\Big[ \phi_i(A_i^\top w^t) + \phi_i^*(-\alpha_i^t) \Big] + \lambda\big( g(w^t) + g^*(\bar\alpha^t) \big) \\ &\stackrel{(20)}{=} \frac{1}{n}\sum_{i=1}^n\Big[ \phi_i^*(-\alpha_i^t) + \phi_i(A_i^\top w^t) \Big] + \frac{1}{n}\langle w^t, A\alpha^t\rangle \\ &= \frac{1}{n}\sum_{i=1}^n\Big[ \phi_i^*(-\alpha_i^t) + A_i^\top w^t\,\nabla\phi_i(A_i^\top w^t) - \phi_i^*(\nabla\phi_i(A_i^\top w^t)) + A_i^\top w^t\,\alpha_i^t \Big] \\ &= \frac{1}{n}\sum_{i=1}^n\Big[ \phi_i^*(-\alpha_i^t) - \phi_i^*(\nabla\phi_i(A_i^\top w^t)) + A_i^\top w^t\,\kappa_i^t \Big] \\ &= \frac{1}{n}\sum_{i\in I_t}\Big[ \phi_i^*(-\alpha_i^t) - \phi_i^*(\nabla\phi_i(A_i^\top w^t)) + A_i^\top w^t\,\kappa_i^t \Big]. \end{aligned} \]
Combining this identity with the preceding bound on $\mathbf{E}_t[D(\alpha^{t+1}) - D(\alpha^t)]$ yields (8).

Proof of Lemma 4. Note that (13) is a standard constrained maximization problem, where everything independent of $p$ can be treated as a constant. We define the Lagrangian
\[ L(p, \eta) = \theta(\kappa, p) - \eta\Big( \sum_{i=1}^n p_i - 1 \Big) \]
and obtain the following optimality conditions:
\[ \frac{|\kappa_i^t|^2 (v_i + n\lambda\gamma)}{p_i^2} = \frac{|\kappa_j^t|^2 (v_j + n\lambda\gamma)}{p_j^2}, \quad \forall i,j\in[n], \qquad \sum_{i=1}^n p_i = 1, \qquad p_i\ge 0,\ \forall i\in[n], \]
the solution of which is (14), i.e., $p_i \propto |\kappa_i^t|\sqrt{v_i + n\lambda\gamma}$, normalized to sum to one.

Proof of Lemma 5. Note that in the proof of Lemma 3, the condition $\theta\in[0, \min_{i\in I_t} p_i^t]$ is only needed to ensure that $s_i$ defined by (24) is in $[0,1]$, so that (21) holds. If $\phi_i$ is a quadratic function, then (21) holds for arbitrary $s_i\in\mathbb{R}$. Therefore in this case we only need $\theta$ to be positive and the same reasoning holds.
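For completeness (an added verification, not part of the original proof), consider the quadratic loss of Section 6.1, $\phi_i(x) = \frac{1}{2\gamma}(x-y_i)^2$, for which $\phi_i^*(u) = y_i u + \frac{\gamma}{2}u^2$. Using the elementary identity
\[ (1-s)a^2 + s b^2 - s(1-s)(a-b)^2 = \big( (1-s)a + s b \big)^2, \quad \forall s\in\mathbb{R}, \]
a direct expansion shows that
\[ \phi_i^*\big( (1-s)x + s y \big) = (1-s)\phi_i^*(x) + s\,\phi_i^*(y) - \frac{\gamma}{2}\,s(1-s)(x-y)^2 \]
holds with equality for every $s\in\mathbb{R}$, so (21) indeed requires no restriction on $s_i$ in this case.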