
The Cyclic Block Conditional Gradient Method for Convex Optimization Problems

Amir Beck∗ Edouard Pauwels∗ Shoham Sabach∗

September 25, 2015

Abstract

In this paper we study the convex problem of optimizing the sum of a smooth function and a compactly supported non-smooth term with a specific separable form. We analyze the block version of the generalized conditional gradient method when the blocks are chosen in a cyclic order. A global sublinear rate of convergence is established for two different stepsize strategies commonly used in this class of methods. Numerical comparisons of the proposed method to both the classical conditional gradient algorithm and its random block version demonstrate the effectiveness of the cyclic block update rule.

Keywords: Conditional gradient, cyclic block decomposition, iteration complexity, linear oracle, nonsmooth convex minimization, support vector machine.

1 Introduction

With the growth of the size of problems commonly encountered in many applied fields, there is a strong demand for numerical methods featuring low computational cost iterative schemes. By low computational cost, we mean algorithms which require at most matrix-by-vector multiplications (inversion of matrices is, for example, too expensive). In this context, it is necessary to propose and analyze numerical schemes that

• are based on computationally efficient steps;

• exploit problem structure and data information;

• enjoy global convergence properties and iteration complexity estimates.

We consider structured convex problems consisting of minimizing the sum of two terms: a smooth term, which is a composition of a smooth function and a linear mapping, and a nonsmooth separable term. We focus on programs for which the geometry of the non-smooth part exhibits such a degree of complexity that proximal-based methods [5, 6, 24] do not constitute a viable alternative. Indeed, the efficiency of these methods considerably deteriorates in situations that do not fit a "favorable geometric setting"; see [11, 16] for a more detailed description of this concept. Algorithms based on linear oracles, such as the conditional gradient method (also known as the Frank-Wolfe algorithm) [14, 21, 12, 13, 17] and its extensions to structured composite problems [8, 9, 1] or block separable problems [19], are based on the principle of iteratively solving linearized subproblems. In settings where computing the proximal operator is too expensive, this approach has proven to be competitive. Successful examples of applications which benefit from this approach include trace-norm constrained

∗Faculty of Industrial Engineering and Management, Technion, Haifa, Israel ({becka,epauwels,ssabach}@ie.technion.ac.il).


or penalized problems [18, 11] and structured multiclass classification with an extremely large number of classes [19]. On the theoretical side, for most of the algorithms mentioned in the references above, a sublinear rate of O(1/k), both in function values and duality gap, is available.

Continuously increasing problem dimensions have motivated the principle of taking advantage of the available block structure of the problem at hand. This led to the development of variable decomposition methods which break down the original large-scale problem into several much smaller subproblems that can be solved efficiently.

In recent works, [23] and [27] analyzed the average case iteration complexity of block versions of gradient, projected gradient and forward-backward methods where, at each iteration, the block to be updated is selected at random. We refer to such a block selection rule as the random update rule. In the context of linear oracles, [19] applied this approach to the conditional gradient method, focusing on implementing it for the structured Support Vector Machine (SVM) training problem (see more details in Section 6.2). Analysis of algorithms involving random update rules typically provides average case complexity results. A different line of works considers updating the blocks in a cyclic order. We refer to such a deterministic update rule as the cyclic update rule. In this context, [22] provides an asymptotic analysis of exact coordinate minimization for composite strongly convex problems. In [6], a global rate of convergence result was established for the cyclic block coordinate gradient projection method for convex problems over feasible sets with a separable structure. In [28], an explicit rate is given for the block proximal gradient method for ℓ_1 regularized convex problems. For this line of works, the estimates given for the cyclic update rule hold deterministically for the sequence of function values. As we already mentioned, this is not the case for the random update rule, for which only average case estimates are available.

Another relevant feature of block decomposition methods is the fact that they usually allow one to take a substantially larger step at each iteration (with respect to each block) compared to their classical non-block counterparts (using, for example, fixed stepsize, backtracking or line search). This fact potentially gives a numerical advantage to block decomposition algorithms compared to classical variants, see [6].

In this work, we propose a cyclic block version of the generalized conditional gradient method [8, 9, 1] which, for ease of reference, is called the Cyclic Block Conditional Gradient (CBCG) method. The word "generalized" means that we consider a more general nonsmooth part, which is not necessarily an indicator function. The word "cyclic" means here that each block is updated once at each iteration. The order in which the blocks are updated may vary arbitrarily between iterations and our analysis therefore includes the random permutation approach. We provide deterministic and global convergence rate estimates for the predefined stepsize strategy [13, 17], and for an adaptive stepsize strategy [21] including a backtracking version of it. These rates hold independently of the order of updating the blocks at each iteration. We also establish rate estimates for an optimality measure in the spirit of [17, 1], which constitutes an additional novelty compared to the results presented in [6]. All the rates proved below have the form O(1/k), where k is the number of complete iterations (over all blocks). Interestingly, this analysis leads to new convergence results for methods that do not relate directly to linear oracles at first sight. For example, our results lead to explicit deterministic rate estimates in terms of the duality gap for the cyclic and random permutation variants of the Stochastic Dual Coordinate Ascent (SDCA) of [29].

We numerically compare the proposed CBCG method (using both the cyclic and random permutation updating rules) to its random update rule counterpart, that is, the Random Block Conditional Gradient (RBCG) method, which was proposed and analyzed in [19]. Extensive simulations on a large number of synthetic examples suggest that CBCG is competitive with both RBCG and the classical conditional gradient algorithm (CG). Finally, we also compare CBCG and RBCG on the problem of training the structured SVM [30, 31] for the optical character recognition (OCR) task originally proposed in [30]. In this setting, we observe that the random permutation updating rule has an advantage over the other updating rules.


The next section is dedicated to the presentation of the model and main assumptions (see Section 2.1) and to the description of the CBCG algorithm (see Section 2.2). Section 3 presents a few auxiliary results for an optimality measure which is commonly encountered when using methods based on linear oracles. In Section 4 we present our theoretical findings about the rate of convergence of the CBCG method. We split this section into two subsections which deal with the two different stepsize strategies that we analyze in this paper (Section 4.1 for the predefined stepsize and Section 4.2 for the adaptive stepsize). We conclude Section 4 with a discussion on a backtracking version of the CBCG method (see Section 4.3). Section 5 presents an extension of the analysis given in Section 4.2 for the specific case where the stepsize is chosen using exact line search for problems in which the smooth part of the objective function is quadratic. This leads to a discussion about the implications for block coordinate descent methods. Numerical experiments on synthetic data and on the structured SVM training problem are presented in Section 6.

Conventions. Throughout the paper the underlying vector space is the n-dimensional Euclidean space ℝ^n with the ℓ_2-norm, which is denoted by ‖·‖. This notation is also used for the matrix norm, which is assumed to be the spectral norm. We will consider the partition of an arbitrary vector x ∈ ℝ^n into N blocks, where each block consists of a subset of the n coordinates. The size of each block (the number of coordinates) is given by the integer n_i for i = 1, 2, . . . , N, such that

∑_{i=1}^{N} n_i = n.

The i-th block of a vector x ∈ ℝ^n is denoted by x_i. We assume that x ∈ ℝ^n can be written as the stacked column vector

x = (x_1^T, x_2^T, . . . , x_N^T)^T.

For any i = 1, 2, . . . , N we define the matrix U_i ∈ ℝ^{n × n_i} as the sub-matrix of the n × n identity matrix consisting of the columns corresponding to the i-th block. Thus, in particular,

(U_1, U_2, . . . , U_N) = I_n.

It is clear that, using these notations, we have for any x ∈ ℝ^n that x_i = U_i^T x and x = ∑_{i=1}^{N} U_i x_i. Finally, for any subset S of ℝ^n, δ_S denotes the indicator function of S, which takes the value 0 on S and +∞ otherwise.
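To make this block bookkeeping concrete, here is a minimal numpy sketch of the notation above; the block sizes and helper names (block_sizes, offsets, U) are illustrative choices and not part of the paper.

    import numpy as np

    # Illustrative block sizes n_1, ..., n_N (their sum is n).
    block_sizes = [3, 2, 4]
    n = sum(block_sizes)
    offsets = np.cumsum([0] + block_sizes)      # start index of each block

    # U_i: columns of the n x n identity matrix corresponding to block i.
    U = [np.eye(n)[:, offsets[i]:offsets[i + 1]] for i in range(len(block_sizes))]

    x = np.random.randn(n)
    x_blocks = [U_i.T @ x for U_i in U]                       # x_i = U_i^T x
    x_back = sum(U_i @ x_i for U_i, x_i in zip(U, x_blocks))  # x = sum_i U_i x_i
    assert np.allclose(x, x_back)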

2 The Optimization Model and Algorithm

2.1 Problem Formulation and Assumptions

We consider the optimization model

min_{x ∈ ℝ^n}  { H(x) ≡ F(Ax) + ∑_{i=1}^{N} g_i(x_i) },    (2.1)

where A ∈ ℝ^{m×n}. We make the following standing assumption on model (2.1).

Assumption 1.

(i) g_i : ℝ^{n_i} → (−∞, ∞], i = 1, 2, . . . , N, is a proper, closed and convex function which satisfies:

– X_i ≡ dom g_i ⊆ ℝ^{n_i} is a compact set with diameter D_i, that is,

sup_{x_i, y_i ∈ X_i} ‖x_i − y_i‖ = D_i.

– g_i is globally Lipschitz on X_i with constant l_i, that is,

|g_i(x_i) − g_i(y_i)| ≤ l_i ‖x_i − y_i‖,  ∀ x_i, y_i ∈ X_i.    (2.2)

(ii) F : ℝ^m → ℝ is convex and continuously differentiable¹ over A(X_1 × X_2 × · · · × X_N) ⊆ ℝ^m and has Lipschitz continuous gradient with constant L_F, that is,

‖∇F(x) − ∇F(y)‖ ≤ L_F ‖x − y‖,  ∀ x, y ∈ A(X_1 × X_2 × · · · × X_N).

Since g_i is assumed to be convex, it immediately implies that the domain X_i is a convex subset of ℝ^{n_i} for i = 1, 2, . . . , N. We set g(x) ≡ ∑_{i=1}^{N} g_i(x_i) and f(x) ≡ F(Ax). The domain of g is denoted by dom g ≡ X and its diameter is denoted by D. Using these simplified notations, problem (2.1) actually consists of minimizing the sum f + g. It holds that X ≡ X_1 × X_2 × · · · × X_N and therefore

D^2 = ∑_{i=1}^{N} D_i^2.    (2.3)

Remark 2.1. By setting g(·) = δ_X(·), we recover the constrained optimization model that motivated the development of the traditional conditional gradient method (see [17] and references therein). This is also the case when we add a linear term to the indicator.

Under Assumption 1, problem (2.1) is guaranteed to attain its optimal value, and therefore the optimal set, which is denoted by X^*, is nonempty; the corresponding optimal value is denoted by H^* ∈ ℝ. For each block i ∈ {1, 2, . . . , N}, we employ the following notation:

• A_i ≡ AU_i, so that A = (A_1, A_2, . . . , A_N);

• ∇_i f(x) ≡ U_i^T ∇f(x) denotes the partial gradient of f.

Using the previous notations, we have that ∇_i f(x) = A_i^T ∇F(Ax). We will use a refined notion of Lipschitz continuity that fits our block separable composite setting. This is expressed by the following standing assumption.

Assumption 2. For each i = 1, 2, . . . , N, there exists a constant β_i > 0 such that for any x ∈ X and any h_i ∈ ℝ^{n_i} satisfying x + U_i h_i ∈ X, it holds that

‖∇F(Ax + A_i h_i) − ∇F(Ax)‖ ≤ β_i ‖A_i h_i‖.

Assumption 2 can be seen as a consequence of Assumption 1. Indeed, it is always possible to set β_i = L_F, i = 1, 2, . . . , N. However, adopting this more refined convention provides additional algorithmic flexibility, which allows one to take advantage of conditioning disparities between different blocks by using stepsizes which are functions of the Lipschitz constants of the blocks, rather than of the global Lipschitz constant. See for example Section 5.1, where it is shown how this approach allows the usage of exact line search for quadratic problems. We also note that when A = I, then β_i can be chosen to be the i-th block Lipschitz constant of the gradient of F (see, e.g., [6, 2]), which is always a quantity smaller than L_F. We define the following quantity

β_min ≡ min {β_1, β_2, . . . , β_N} > 0.    (2.4)

The following important result will play a central role in the forthcoming analysis. The proof is almost identical to the well known proof of the descent lemma (see [7]), and is thus given in Appendix A.

Lemma 2.2 (Composite block descent lemma). Let i ∈ {1, 2, . . . , N}. Then for any x ∈ X and h_i ∈ ℝ^{n_i} such that x + U_i h_i ∈ X, we have

f(x + U_i h_i) ≤ f(x) + ⟨∇_i f(x), h_i⟩ + (β_i/2) ‖A_i h_i‖^2.    (2.5)

¹ A function is continuously differentiable over a given set D if it is continuously differentiable over an open set containing D.


2.2 The Cyclic Block Conditional Gradient (CBCG) Method

The generalized conditional gradient method [8, 9, 1] can be applied to problem (2.1) when the corresponding linear oracle is available. Therefore we assume that for any x ∈ X, the solution of the following problem can be easily computed:

min_{v ∈ X} ⟨∇f(x), v⟩ + g(v).

We exploit here the separability of the function g (see Section 2.1) and propose a block decomposition extension of the generalized conditional gradient method, which we call the Cyclic Block Conditional Gradient (CBCG) method. Before stating the algorithm, we will need the following additional notation. Let {x^k}_{k∈ℕ} be a given sequence; then for any i = 1, 2, . . . , N, we define

x^{k,i} = (x_1^{k+1}, . . . , x_i^{k+1}, x_{i+1}^{k}, . . . , x_N^{k}).    (2.6)

That is, the first i blocks in x^{k,i} are those of x^{k+1} and the remaining N − i blocks are those of x^k. It is clear that using this notation we have x^{k,0} = x^k and x^{k+1} = x^{k,N}. The algorithm is given now.

CBCG: Cyclic Block Conditional Gradient
Initialization. x^0 ∈ X and α_i^k ∈ [0, 1] for all k ∈ ℕ and i = 1, 2, . . . , N.
General Step. For k = 1, 2, . . .,

(i) For any i = 1, 2, . . . , N, compute

p_i^k ∈ argmin_{p_i ∈ X_i} { ⟨∇_i f(x^{k,i−1}), p_i⟩ + g_i(p_i) },    (2.7)

and then

x^{k,i} = x^{k,i−1} + α_i^k U_i (p_i^k − x_i^{k,i−1}).    (2.8)

(ii) Update x^{k+1} = x^{k,N}.

We will first analyze, in Section 4.1, the convergence rate of the CBCG method using a predefined stepsize [13]. Here, the predefined stepsize that we use is given, for any i = 1, 2, . . . , N, by

α_i^k = α^k ≡ 2/(k + 2).

In Section 4.2, we will consider an adaptive stepsize rule [21], which is determined by the minimization of the quadratic upper bound of H related to (2.5). The expression of this stepsize will be made precise below (see (4.12)). The backtracking variant of the CBCG with adaptive stepsize rule is presented and analyzed in Section 4.3.
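For illustration only, the following sketch runs CBCG with the predefined stepsize in the special case where each g_i is the indicator of the box [−1, 1]^{n_i}, so that the block linear oracle (2.7) has the closed form p_i = −sign(∇_i f); the function and variable names are ours, not the paper's.

    import numpy as np

    def cbcg_predefined(grad_f, blocks, x0, n_iters):
        """Sketch of CBCG when each g_i is the indicator of the box [-1, 1]^{n_i}.

        grad_f : callable returning the full gradient of f at a point.
        blocks : list of index arrays forming a partition of the coordinates.
        """
        x = x0.copy()
        for k in range(n_iters):
            alpha = 2.0 / (k + 2)               # predefined stepsize (4.3)
            for idx in blocks:                  # one cycle visits every block once
                g = grad_f(x)[idx]              # partial gradient at x^{k,i-1}
                p = -np.sign(g)                 # block linear oracle over [-1, 1]^{n_i}, cf. (2.7)
                x[idx] += alpha * (p - x[idx])  # block update (2.8)
        return x

    # Example: f(x) = 0.5 * ||x - y||^2 over [-1, 1]^6, split into two blocks.
    y = np.array([2.0, -3.0, 0.5, 1.5, -0.2, 0.9])
    blocks = [np.arange(0, 3), np.arange(3, 6)]
    x_hat = cbcg_predefined(lambda x: x - y, blocks, np.zeros(6), n_iters=200)

The adaptive and backtracking variants analyzed below only change how alpha is chosen inside the inner loop.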

3 The Optimality Measure

In this section, we describe a few properties of an optimality measure, including its block counterparts. This measure is typical when discussing methods which are based on linear oracles and usually plays a crucial role in the convergence analysis, see [4] for an overview and [19, 1] for a link with Fenchel duality. For any x ∈ X, we define the following quantity

p(x) ∈ argmin_{p ∈ X} { ⟨∇f(x), p⟩ + g(p) },    (3.1)

as well as the optimality measure

S(x) ≡ max_{p ∈ X} { ⟨∇f(x), x − p⟩ + g(x) − g(p) } = ⟨∇f(x), x − p(x)⟩ + g(x) − g(p(x)),    (3.2)

where the last equality follows from (3.1). The function S is an optimality measure in the sense that it is non-negative on X, and it is zero only on X^*. Furthermore, for any x ∈ X, the quantity S(x) is an upper bound on H(x) − H^*, as stated in the following lemma, whose short proof is given for the sake of completeness.

Lemma 3.1. S(x) ≥ H(x) − H^*.

Proof. For any x^* ∈ X^* we have

S(x) = ⟨∇f(x), x − p(x)⟩ + g(x) − g(p(x))
     = ⟨∇f(x), x⟩ + g(x) − [⟨∇f(x), p(x)⟩ + g(p(x))]
     ≥ ⟨∇f(x), x⟩ + g(x) − [⟨∇f(x), x^*⟩ + g(x^*)]
     = ⟨∇f(x), x − x^*⟩ + g(x) − g(x^*),

where the inequality follows from (3.1). Using the convexity of f, we obtain that

S(x) ≥ ⟨∇f(x), x − x^*⟩ + g(x) − g(x^*) ≥ f(x) − f(x^*) + g(x) − g(x^*) = H(x) − H^*.

This proves the desired result.

We refine the notations introduced in (3.1) and (3.2) in order to fit them to our block structured setting. For any x ∈ X and any i = 1, 2, . . . , N, we set

p_i(x) ∈ argmin_{p_i ∈ X_i} { ⟨∇_i f(x), p_i⟩ + g_i(p_i) },    (3.3)

and define the block optimality measure, which was already introduced in [19] when g_i ≡ 0,

S_i(x) ≡ max_{p_i ∈ X_i} { ⟨∇_i f(x), x_i − p_i⟩ + g_i(x_i) − g_i(p_i) }.    (3.4)

It is clear that in this case it is also true that

S_i(x) = ⟨∇_i f(x), x_i − p_i(x)⟩ + g_i(x_i) − g_i(p_i(x)).    (3.5)

Using the separability of both g and X, we have for any x ∈ X that

S(x) = ∑_{i=1}^{N} S_i(x).    (3.6)

There might be multiple optimal solutions for problem (3.1) and also for problem (3.3). Our only assumption is that the choices of p_1(x), p_2(x), . . . , p_N(x) and p(x) are made under the restriction that

p(x) = (p_1(x), p_2(x), . . . , p_N(x)).
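Under the additional assumption that each g_i is the indicator of a box (the setting used in the experiments of Section 6.1), S_i and S can be evaluated directly from the oracle output, as in the following illustrative sketch; on the feasible set the g_i terms vanish.

    import numpy as np

    def block_gap(grad_i, x_i):
        """S_i(x) of (3.5) when g_i is the indicator of [-1, 1]^{n_i}:
        the oracle point is p_i(x) = -sign(grad_i) and S_i(x) = <grad_i, x_i - p_i(x)>."""
        p_i = -np.sign(grad_i)
        return float(grad_i @ (x_i - p_i))

    def total_gap(grad, x, blocks):
        """S(x) = sum_i S_i(x), cf. (3.6)."""
        return sum(block_gap(grad[idx], x[idx]) for idx in blocks)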

The following Lipschitz-type property of the block optimality measure S_i, i = 1, 2, . . . , N, will be crucial in the forthcoming analysis.

Lemma 3.2. Let x, y ∈ X be two vectors which satisfy x_i = y_i for some i = 1, 2, . . . , N. Then the following inequality holds:

|S_i(x) − S_i(y)| ≤ L_F D_i ‖A_i‖ · ‖A(x − y)‖.

Proof. From (3.5) we have

S_i(x) = ⟨∇_i f(x), x_i − p_i(x)⟩ + g_i(x_i) − g_i(p_i(x))    (3.7)
       = ⟨∇_i f(y), x_i − p_i(x)⟩ + g_i(x_i) − g_i(p_i(x)) + ⟨∇_i f(x) − ∇_i f(y), x_i − p_i(x)⟩.

Now, using the fact that f(x) ≡ F(Ax) and Assumption 1(ii), we obtain

⟨∇_i f(x) − ∇_i f(y), x_i − p_i(x)⟩ = ⟨A_i^T(∇F(Ax) − ∇F(Ay)), x_i − p_i(x)⟩
                                    = ⟨∇F(Ax) − ∇F(Ay), A_i(x_i − p_i(x))⟩
                                    ≤ ‖∇F(Ax) − ∇F(Ay)‖ · ‖A_i(x_i − p_i(x))‖
                                    ≤ L_F ‖A(x − y)‖ · ‖A_i(x_i − p_i(x))‖
                                    ≤ L_F D_i ‖A_i‖ · ‖A(x − y)‖,    (3.8)

where the first inequality follows from the Cauchy-Schwarz inequality and the last inequality follows from the fact that both x_i and p_i(x) belong to X_i (see Assumption 1(i)). Finally, by combining (3.7) with (3.8), and using the fact that x_i = y_i, we obtain that

S_i(x) ≤ ⟨∇_i f(y), x_i − p_i(x)⟩ + g_i(x_i) − g_i(p_i(x)) + L_F D_i ‖A_i‖ · ‖A(x − y)‖
       = ⟨∇_i f(y), y_i − p_i(x)⟩ + g_i(y_i) − g_i(p_i(x)) + L_F D_i ‖A_i‖ · ‖A(x − y)‖
       ≤ S_i(y) + L_F D_i ‖A_i‖ · ‖A(x − y)‖,    (3.9)

where the last inequality follows from the definition of S_i (see (3.4)). Changing the roles of x and y, we also obtain that

S_i(y) ≤ S_i(x) + L_F D_i ‖A_i‖ · ‖A(x − y)‖,

which along with (3.9) yields the desired result.

4 Convergence Analysis of the CBCG Method

This section is devoted to the convergence analysis of the CBCG algorithm. We will first prove, in Section 4.1, a sublinear convergence rate for the variant with the predefined stepsize rule. A similar rate of convergence will then be established, in Section 4.2, for the variant with the adaptive stepsize rule. Finally, we describe a backtracking procedure in Section 4.3 which allows one to use the CBCG method when the constants β_i, i = 1, 2, . . . , N, given in Assumption 2 are unknown in advance. We begin with an extension of Lemma 2.2 that holds for any choice of stepsize.

Lemma 4.1. Let {x^k}_{k∈ℕ} be the sequence generated by the CBCG method. Then, for any k ≥ 0 and i ∈ {1, 2, . . . , N}, we have

H(x^{k,i}) ≤ H(x^{k,i−1}) − α_i^k S_i(x^{k,i−1}) + ((α_i^k)^2 β_i/2) ‖A_i(p_i^k − x_i^k)‖^2.

Proof. First, by the definition of the main step of the CBCG method (see (2.8)), we have

H(x^{k,i}) = f(x^{k,i}) + g(x^{k,i}) = f(x^{k,i−1} + α_i^k U_i(p_i^k − x_i^{k,i−1})) + g(x^{k,i−1} + α_i^k U_i(p_i^k − x_i^{k,i−1})).

We can now use Lemma 2.2 to obtain

H(x^{k,i}) ≤ f(x^{k,i−1}) + α_i^k ⟨∇_i f(x^{k,i−1}), p_i^k − x_i^{k,i−1}⟩ + ((α_i^k)^2 β_i/2) ‖A_i(p_i^k − x_i^{k,i−1})‖^2 + g(x^{k,i−1} + α_i^k U_i(p_i^k − x_i^{k,i−1})).    (4.1)

The last term can be bounded from above as follows:

g(x^{k,i−1} + α_i^k U_i(p_i^k − x_i^{k,i−1})) = ∑_{j=1, j≠i}^{N} g_j(x_j^{k,i−1}) + g_i((1 − α_i^k) x_i^{k,i−1} + α_i^k p_i^k)
                                             ≤ ∑_{j=1, j≠i}^{N} g_j(x_j^{k,i−1}) + (1 − α_i^k) g_i(x_i^{k,i−1}) + α_i^k g_i(p_i^k)
                                             = g(x^{k,i−1}) + α_i^k (g_i(p_i^k) − g_i(x_i^{k,i−1})),    (4.2)

where the inequality follows from the convexity of each g_i, i = 1, 2, . . . , N. Now, combining (4.1) with (4.2) and using the definition of S_i (see (3.5)), we obtain

H(x^{k,i}) ≤ f(x^{k,i−1}) + g(x^{k,i−1}) − α_i^k (⟨∇_i f(x^{k,i−1}), x_i^{k,i−1} − p_i^k⟩ + g_i(x_i^{k,i−1}) − g_i(p_i^k)) + ((α_i^k)^2 β_i/2) ‖A_i(p_i^k − x_i^{k,i−1})‖^2
           = H(x^{k,i−1}) − α_i^k S_i(x^{k,i−1}) + ((α_i^k)^2 β_i/2) ‖A_i(p_i^k − x_i^{k,i−1})‖^2
           = H(x^{k,i−1}) − α_i^k S_i(x^{k,i−1}) + ((α_i^k)^2 β_i/2) ‖A_i(p_i^k − x_i^k)‖^2,

where the last equality follows from the fact that x_i^k = x_i^{k,i−1} (see (2.6)). This proves the desired result.

We also need an explicit expression for the distance between two consecutive iterates generated by the CBCG method. The following lemma also holds for any choice of stepsize.

Lemma 4.2. Let {x^k}_{k∈ℕ} be the sequence generated by the CBCG method. Then, for any k ∈ ℕ, we have

‖x^{k+1} − x^k‖^2 = ∑_{j=1}^{N} (α_j^k)^2 ‖p_j^k − x_j^{k,j−1}‖^2.

Furthermore, for any i = 0, 1, . . . , N, we have

‖x^{k,i} − x^k‖^2 ≤ ‖x^{k+1} − x^k‖^2.

Proof. For any fixed k ∈ ℕ, we have from the definition of the iterates of the CBCG algorithm (see (2.8)) and (2.6) that

‖x^{k+1} − x^k‖^2 = ∑_{j=1}^{N} ‖x_j^{k,N} − x_j^k‖^2 = ∑_{j=1}^{N} (α_j^k)^2 ‖p_j^k − x_j^{k,j−1}‖^2,

which proves the first statement. Furthermore, since for any i = 0, 1, . . . , N we have x_j^{k,i} = x_j^k for all j > i and x_j^{k,i} = x_j^{k,N} for all j ≤ i, it follows that

‖x^{k,i} − x^k‖^2 = ∑_{j=1}^{N} ‖x_j^{k,i} − x_j^k‖^2 = ∑_{j=1}^{i} ‖x_j^{k,i} − x_j^k‖^2 = ∑_{j=1}^{i} ‖x_j^{k,N} − x_j^k‖^2 ≤ ∑_{j=1}^{N} ‖x_j^{k,N} − x_j^k‖^2 = ‖x^{k+1} − x^k‖^2,

which proves the second statement and the proof is complete.

4.1 Analysis for the Predefined Stepsize Strategy

In this section, we analyze the CBCG algorithm with the predefined stepsize given by

α_i^k = α^k ≡ 2/(k + 2).    (4.3)

Note that this stepsize rule is completely blind to disparities between blocks and therefore does not allow one to counterbalance them in a way that would increase the convergence speed. The main step towards the proof of the convergence rate in this case is recorded in the following lemma.

Lemma 4.3. Let {x^k}_{k∈ℕ} be the sequence generated by the CBCG method with the stepsize given in (4.3). Then, for any k ≥ 0, we have

H(x^{k+1}) ≤ H(x^k) − α^k S(x^k) + (C_1/2)(α^k)^2,

where

C_1 = ∑_{i=1}^{N} β_i ‖A_i‖^2 D_i^2 + 2 L_F D ‖A‖ ∑_{i=1}^{N} D_i ‖A_i‖.    (4.4)

Proof. For any k ≥ 0 and each i ∈ {1, 2, . . . , N}, we have from Lemma 4.1 that

H(x^{k,i}) ≤ H(x^{k,i−1}) − α^k S_i(x^{k,i−1}) + ((α^k)^2 β_i/2) ‖A_i(p_i^k − x_i^k)‖^2
           ≤ H(x^{k,i−1}) − α^k S_i(x^{k,i−1}) + (α^k)^2 β_i ‖A_i‖^2 D_i^2 / 2.    (4.5)

Summing (4.5) for i = 1, 2, . . . , N yields

H(x^{k+1}) = H(x^{k,N})
           ≤ H(x^{k,0}) − α^k ∑_{i=1}^{N} S_i(x^{k,i−1}) + ((α^k)^2/2) ∑_{i=1}^{N} β_i ‖A_i‖^2 D_i^2
           = H(x^k) − α^k ∑_{i=1}^{N} S_i(x^k) + ((α^k)^2/2) ∑_{i=1}^{N} β_i ‖A_i‖^2 D_i^2 + α^k ∑_{i=1}^{N} (S_i(x^k) − S_i(x^{k,i−1}))
           = H(x^k) − α^k S(x^k) + ((α^k)^2/2) ∑_{i=1}^{N} β_i ‖A_i‖^2 D_i^2 + α^k ∑_{i=1}^{N} (S_i(x^k) − S_i(x^{k,i−1})),    (4.6)

where the last equality follows from (3.6). Using Lemma 3.2, and the fact that x_i^{k,i−1} = x_i^k (see (2.6)), gives, for any i = 1, 2, . . . , N, that

S_i(x^k) − S_i(x^{k,i−1}) ≤ L_F D_i ‖A_i‖ · ‖A(x^k − x^{k,i−1})‖.    (4.7)

Combining (4.6) and (4.7) yields

H(x^{k+1}) ≤ H(x^k) − α^k S(x^k) + ((α^k)^2/2) ∑_{i=1}^{N} β_i ‖A_i‖^2 D_i^2 + α^k L_F ∑_{i=1}^{N} D_i ‖A_i‖ · ‖A(x^k − x^{k,i−1})‖.    (4.8)

From Lemma 4.2, for any i = 1, 2, . . . , N, we have

‖A(x^k − x^{k,i−1})‖^2 ≤ ‖A‖^2 · ‖x^k − x^{k,i−1}‖^2 ≤ ‖A‖^2 · ‖x^k − x^{k+1}‖^2 = (α^k)^2 ‖A‖^2 ∑_{j=1}^{N} ‖p_j^k − x_j^k‖^2 ≤ (α^k)^2 ‖A‖^2 ∑_{j=1}^{N} D_j^2 = (α^k)^2 ‖A‖^2 D^2.    (4.9)

Combining (4.8) and (4.9) yields

H(x^{k+1}) ≤ H(x^k) − α^k S(x^k) + ((α^k)^2/2) [ ∑_{i=1}^{N} β_i ‖A_i‖^2 D_i^2 + 2 L_F D ‖A‖ ∑_{i=1}^{N} D_i ‖A_i‖ ],

which proves the desired result.

We now recall the following technical lemma on sequences of non-negative numbers (cf. [1, Lemma 1]).

Lemma 4.4. Let {a_k}_{k∈ℕ} and {b_k}_{k∈ℕ} be two sequences satisfying, for any k ≥ 0,

a_{k+1} ≤ a_k − t_k b_k + (C/2) t_k^2,    (4.10)

where 0 ≤ a_k ≤ b_k, t_k = 2/(k + 2) and C > 0. Then:

(a) a_k ≤ 2C/(k + 1);

(b) for any n ≥ 1, we have

min_{k=⌊n/2⌋,...,n} b_k ≤ 8C/n.

We are now ready to establish the sublinear rate of convergence of the CBCG algorithm with predefined stepsize.

Theorem 4.5 (Sublinear rate for CBCG with predefined stepsize). Let {x^k}_{k∈ℕ} be the sequence generated by the CBCG method with the predefined stepsize strategy given in (4.3). Then, for any k ≥ 0, we have

H(x^k) − H^* ≤ 2C_1/(k + 1),    (4.11)

and, for any n ≥ 1, we have

min_{k=⌊n/2⌋,...,n} S(x^k) ≤ 8C_1/n,

where

C_1 = ∑_{i=1}^{N} β_i ‖A_i‖^2 D_i^2 + 2 L_F D ‖A‖ ∑_{i=1}^{N} D_i ‖A_i‖.

Proof. From Lemma 4.3 we have

H(x^{k+1}) − H^* ≤ H(x^k) − H^* − α^k S(x^k) + (C_1/2)(α^k)^2.

In addition, by Lemma 3.1, we have that S(x^k) ≥ H(x^k) − H^*. We can therefore invoke Lemma 4.4 with a_k = H(x^k) − H^* and b_k = S(x^k), and the desired result is established.


4.2 Analysis for the Adaptive Stepsize Strategy

We now turn to the analysis of CBCG with adaptive stepsize. The main insight is to choose a stepsize that minimizes the quadratic upper bound given in Lemma 4.1. In this setting, the adaptive stepsize is defined as follows:

α_i^k = argmin_{α ∈ [0,1]} { −α S_i(x^{k,i−1}) + (α^2 β_i/2) ‖A_i(p_i^k − x_i^k)‖^2 } = min { S_i(x^{k,i−1}) / (β_i ‖A_i(p_i^k − x_i^k)‖^2), 1 }.    (4.12)
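As a minimal sketch, the stepsize (4.12) is computed from three scalars per block; the function and argument names below are ours.

    def adaptive_stepsize(S_i, beta_i, Ai_dir_sq):
        """Stepsize (4.12): minimize -a*S_i + (a^2 * beta_i / 2) * ||A_i(p_i^k - x_i^k)||^2 over a in [0, 1].

        S_i       : block optimality measure S_i(x^{k,i-1}).
        beta_i    : block constant from Assumption 2.
        Ai_dir_sq : squared norm ||A_i(p_i^k - x_i^k)||^2.
        """
        if Ai_dir_sq <= 0.0:   # degenerate direction; any stepsize leaves the block unchanged
            return 1.0
        return min(S_i / (beta_i * Ai_dir_sq), 1.0)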

Note that by Lemma 4.1, this choice of stepsize leads to a decrease of the objective value at each step. The analysis employed for the predefined stepsize does not seem to be easily adjusted to the adaptive stepsize rule. Thus, a different analysis is developed in this section. We begin with the following technical result.

Lemma 4.6. Let {x^k}_{k∈ℕ} be the sequence generated by the CBCG method with the adaptive stepsize strategy given in (4.12). Then, for any k ≥ 0 and i ∈ {1, 2, . . . , N}, we have

H(x^{k,i−1}) − H(x^{k,i}) ≥ (α_i^k/2) S_i(x^{k,i−1}) ≥ ((α_i^k)^2 β_i/2) ‖A_i(p_i^k − x_i^k)‖^2.

Proof. We split the proof into two cases. First, if

S_i(x^{k,i−1}) / (β_i ‖A_i(p_i^k − x_i^k)‖^2) ≥ 1,    (4.13)

then α_i^k = 1, and by Lemma 4.1 we have

H(x^{k,i−1}) − H(x^{k,i}) ≥ α_i^k S_i(x^{k,i−1}) − ((α_i^k)^2 β_i/2) ‖A_i(p_i^k − x_i^k)‖^2
                          = S_i(x^{k,i−1}) − (β_i/2) ‖A_i(x_i^k − p_i^k)‖^2    (4.14)
                          ≥ (α_i^k/2) S_i(x^{k,i−1}) ≥ ((α_i^k)^2 β_i/2) ‖A_i(x_i^k − p_i^k)‖^2,    (4.15)

where the last two inequalities follow from (4.13) and the fact that α_i^k = 1. In the second case, when S_i(x^{k,i−1}) < β_i ‖A_i(p_i^k − x_i^k)‖^2, we have that

α_i^k = S_i(x^{k,i−1}) / (β_i ‖A_i(p_i^k − x_i^k)‖^2).

Thus,

H(x^{k,i−1}) − H(x^{k,i}) ≥ α_i^k S_i(x^{k,i−1}) − ((α_i^k)^2 β_i/2) ‖A_i(p_i^k − x_i^k)‖^2
                          = S_i(x^{k,i−1})^2 / (2 β_i ‖A_i(p_i^k − x_i^k)‖^2)
                          = (α_i^k/2) S_i(x^{k,i−1}) = ((α_i^k)^2 β_i/2) ‖A_i(p_i^k − x_i^k)‖^2.    (4.16)

The result now follows by combining (4.15) and (4.16).

We can now prove the following important result.


Lemma 4.7. Let {x^k}_{k∈ℕ} be the sequence generated by the CBCG method with the adaptive stepsize strategy given in (4.12). Then, for any k ≥ 0 and i ∈ {1, 2, . . . , N}, we have

H(x^{k,i−1}) − H(x^{k,i}) ≥ S_i(x^{k,i−1})^2 / (2 max{β_i ‖A_i‖^2 D_i^2, K_i}),

where, for each i = 1, 2, . . . , N,

M_i = max_{x ∈ X} ‖∇_i f(x)‖,    (4.17)
K_i = (M_i + l_i) D_i.    (4.18)

Proof. We again split the proof into two cases. First, if α_i^k = 1, then by Lemma 4.6 we have

H(x^{k,i−1}) − H(x^{k,i}) ≥ (1/2) S_i(x^{k,i−1}) ≥ S_i(x^{k,i−1})^2 / (2 K_i),    (4.19)

where the last inequality follows from the fact that for any x ∈ X:

S_i(x) = ⟨∇_i f(x), x_i − p_i(x)⟩ + g_i(x_i) − g_i(p_i(x))
       ≤ ‖∇_i f(x)‖ · ‖x_i − p_i(x)‖ + l_i ‖x_i − p_i(x)‖
       ≤ (M_i + l_i) D_i.    (4.20)

In the second case, when α_i^k < 1, we have that α_i^k = S_i(x^{k,i−1}) / (β_i ‖A_i(p_i^k − x_i^k)‖^2), and by Lemma 4.6 we can write

H(x^{k,i−1}) − H(x^{k,i}) ≥ S_i(x^{k,i−1})^2 / (2 β_i ‖A_i(p_i^k − x_i^k)‖^2) ≥ S_i(x^{k,i−1})^2 / (2 β_i ‖A_i‖^2 D_i^2).    (4.21)

The result now follows by combining (4.19) and (4.21).

We will now prove an additional technical lemma that establishes a "sufficient decrease" property of the CBCG method with adaptive stepsize.

Lemma 4.8. Let {x^k}_{k∈ℕ} be the sequence generated by the CBCG method with the adaptive stepsize strategy given in (4.12). Then, for any k ≥ 0 and i ∈ {1, 2, . . . , N}, we have

H(x^k) − H(x^{k+1}) ≥ (β_min / (2N)) ‖A(x^k − x^{k,i−1})‖^2,

where β_min is given by (2.4).

Proof. By Lemma 4.6 we have

H(x^{k,i−1}) − H(x^{k,i}) ≥ ((α_i^k)^2 β_i/2) ‖A_i(p_i^k − x_i^k)‖^2.    (4.22)

Now, for any i ∈ {1, 2, . . . , N}, we can write

‖A(x^k − x^{k,i−1})‖^2 = ‖ ∑_{j=1}^{N} A_j(x_j^k − x_j^{k,i−1}) ‖^2
                       ≤ N ∑_{j=1}^{N} ‖A_j(x_j^k − x_j^{k,i−1})‖^2
                       = N ∑_{j=1}^{i−1} ‖A_j(x_j^k − x_j^{k,i−1})‖^2
                       = N ∑_{j=1}^{i−1} (α_j^k)^2 ‖A_j(x_j^k − p_j^k)‖^2
                       ≤ (2N/β_min) ∑_{j=1}^{i−1} (H(x^{k,j−1}) − H(x^{k,j}))
                       = (2N/β_min) (H(x^k) − H(x^{k,i−1}))
                       ≤ (2N/β_min) (H(x^k) − H(x^{k+1})),

where the first inequality follows from the fact that for any N vectors u_1, u_2, . . . , u_N the inequality ‖∑_{j=1}^{N} u_j‖^2 ≤ N ∑_{j=1}^{N} ‖u_j‖^2 holds; the second inequality follows from (4.22); and the last inequality follows from Lemma 4.6, which shows that the sequence of function values is non-increasing.

Remark 4.9. The bound in Lemma 4.8 can be improved in some situations where additional information on the structure of A is available. For example, if the column spaces of the matrices A_i are pairwise orthogonal, that is, A_i^T A_j = 0 for any 1 ≤ i < j ≤ N, then the factor N can be avoided.

The next lemma constitutes the crucial step towards the establishment of a sublinear convergence rate of the CBCG method with adaptive stepsize.

Lemma 4.10. Let {x^k}_{k∈ℕ} be the sequence generated by the CBCG method with the adaptive stepsize strategy given in (4.12). Then, for any k ≥ 0, we have

H(x^k) − H(x^{k+1}) ≥ (1/(N C_2)) S(x^k)^2,

where

C_2 = 4 [ max_{i=1,2,...,N} max{β_i ‖A_i‖^2 D_i^2, K_i} + N L_F^2 D^2 max_{i=1,2,...,N} ‖A_i‖^2 / β_min ],    (4.23)

β_min is defined in (2.4) and K_i is defined in (4.18), for i = 1, 2, . . . , N.

Proof. For any i = 1, 2, . . . , N we have that

S_i(x^k)^2 = (S_i(x^{k,i−1}) + S_i(x^k) − S_i(x^{k,i−1}))^2
           ≤ 2 S_i(x^{k,i−1})^2 + 2 (S_i(x^k) − S_i(x^{k,i−1}))^2
           ≤ 2 S_i(x^{k,i−1})^2 + 2 L_F^2 D_i^2 ‖A_i‖^2 · ‖A(x^k − x^{k,i−1})‖^2
           ≤ 2 S_i(x^{k,i−1})^2 + (4N L_F^2 D_i^2 ‖A_i‖^2 / β_min) (H(x^k) − H(x^{k+1})),    (4.24)

where the second inequality follows from Lemma 3.2 and the last inequality follows from Lemma 4.8. Invoking Lemma 4.7, we obtain from (4.24) that

S_i(x^k)^2 ≤ 2 S_i(x^{k,i−1})^2 + (4N L_F^2 D_i^2 ‖A_i‖^2 / β_min) (H(x^k) − H(x^{k+1}))
           ≤ 4 max{β_i ‖A_i‖^2 D_i^2, K_i} (H(x^{k,i−1}) − H(x^{k,i})) + (4N L_F^2 D_i^2 ‖A_i‖^2 / β_min) (H(x^k) − H(x^{k+1})).    (4.25)

Summing (4.25) for i = 1, 2, . . . , N yields

∑_{i=1}^{N} S_i(x^k)^2 ≤ 4 ∑_{i=1}^{N} max{β_i ‖A_i‖^2 D_i^2, K_i} (H(x^{k,i−1}) − H(x^{k,i})) + (4N L_F^2/β_min) ∑_{i=1}^{N} D_i^2 ‖A_i‖^2 (H(x^k) − H(x^{k+1}))
                       ≤ 4 max_{i=1,2,...,N} max{β_i ‖A_i‖^2 D_i^2, K_i} ∑_{i=1}^{N} (H(x^{k,i−1}) − H(x^{k,i})) + (4N L_F^2 D^2 max_{i=1,2,...,N} ‖A_i‖^2 / β_min) (H(x^k) − H(x^{k+1}))
                       = C_2 (H(x^k) − H(x^{k+1})).

Finally, using (3.6) we have

S(x^k)^2 = [ ∑_{i=1}^{N} S_i(x^k) ]^2 ≤ N ∑_{i=1}^{N} S_i(x^k)^2 ≤ N C_2 (H(x^k) − H(x^{k+1})),

which proves the desired result.

By combining Lemma 3.1 with Lemma 4.10, we obtain the following corollary.

Corollary 4.11. Let {x^k}_{k∈ℕ} be the sequence generated by the CBCG method with the adaptive stepsize strategy given in (4.12). Then, for any k ≥ 0, we have

H(x^k) − H(x^{k+1}) ≥ (1/(N C_2)) (H(x^k) − H^*)^2,

where C_2 is given in (4.23).

In order to obtain a rate of convergence result in this case, we will need the following technical lemma (cf. [6, Lemma 3.5]).

Lemma 4.12. Let {a_k}_{k∈ℕ} be a sequence of non-negative real numbers satisfying, for any k ∈ ℕ,

a_k − a_{k+1} ≥ γ a_k^2,    (4.26)

and a_0 ≤ 1/(γ m) for some positive γ and m. Then, for any k ∈ ℕ, we have that

a_k ≤ (1/γ) · 1/(m + k).

Combining Lemma 4.12 with Corollary 4.11 and Lemma 3.1 establishes the sublinear rate of convergence of the CBCG method with adaptive stepsize.

Theorem 4.13 (Sublinear rate for CBCG with adaptive stepsize). Let {x^k}_{k∈ℕ} be the sequence generated by the CBCG method with the adaptive stepsize strategy given in (4.12). Then, for any k ≥ 0, we have

H(x^k) − H^* ≤ N C_2/(k + 4),    (4.27)

and

min_{n=0,1,...,k} S(x^n) ≤ 2 N C_2/(k + 4),

where

C_2 = 4 [ max_{i=1,2,...,N} max{β_i ‖A_i‖^2 D_i^2, K_i} + N L_F^2 D^2 max_{i=1,2,...,N} ‖A_i‖^2 / β_min ],

β_min is defined in (2.4) and K_i is defined in (4.18), for i = 1, 2, . . . , N.

Proof. Denote a_k = H(x^k) − H^*. Thanks to Corollary 4.11, we get that (4.26) holds true with γ = 1/(N C_2). In addition, from Lemma 3.1 and (4.20), we have

a_0 = H(x^0) − H^* ≤ S(x^0) = ∑_{i=1}^{N} S_i(x^0) ≤ ∑_{i=1}^{N} K_i ≤ N max_{i=1,2,...,N} K_i ≤ N C_2/4.

Hence, a_0 ≤ 1/(4γ). Picking m = 4, we conclude from Lemma 4.12 that (4.27) holds. To find the bound on the optimality measure S(·), from Lemma 4.10 we have, for any n ≥ 0,

H(x^n) − H(x^{n+1}) ≥ γ S(x^n)^2.

For any k_0 ≥ 0, summing the latter inequality for n = k_0, k_0 + 1, . . . , 2k_0, we obtain that

γ ∑_{n=k_0}^{2k_0} S(x^n)^2 ≤ H(x^{k_0}) − H(x^{2k_0+1}) ≤ H(x^{k_0}) − H^* ≤ (1/γ) · 1/(k_0 + 4),

where the last inequality follows from (4.27). Hence,

min_{n=k_0,...,2k_0} S(x^n)^2 ≤ (1/γ^2) · 1/((k_0 + 1)(k_0 + 4)) ≤ (1/γ^2) · 1/((k_0 + 2)^2),    (4.28)

where we have used the fact that (k_0 + 1)(k_0 + 4) ≥ (k_0 + 2)^2. Similarly, for any k_0 ≥ 0,

γ ∑_{n=k_0}^{2k_0+1} S(x^n)^2 ≤ H(x^{k_0}) − H(x^{2k_0+2}) ≤ H(x^{k_0}) − H^* ≤ (1/γ) · 1/(k_0 + 4).

Hence

min_{n=k_0,...,2k_0+1} S(x^n)^2 ≤ (1/γ^2) · 1/((k_0 + 2)(k_0 + 4)) ≤ (1/γ^2) · 1/((k_0 + 5/2)^2),    (4.29)

where we have used the fact that (k_0 + 2)(k_0 + 4) ≥ (k_0 + 5/2)^2. Notice that k_0 + 5/2 = (2k_0 + 1)/2 + 2. Therefore, by combining (4.28) when k is even and (4.29) when k is odd, we conclude that

min_{n=0,1,...,k} S(x^n) ≤ (1/γ) · 1/(k/2 + 2) ≤ 2 N C_2/(k + 4),

which concludes the proof.


4.3 Backtracking Version of CBCG

Computing the adaptive stepsize requires knowledge of the constants β_i, i = 1, 2, . . . , N. In practice, these constants may not be known in advance, or their known upper approximations may be too loose. A common approach to overcome this problem is to use a backtracking scheme in order to estimate the unknown constants that are required to ensure convergence of the algorithm. This strategy is also effective in the context of CG-type algorithms.

The crucial property for the convergence analysis with adaptive stepsize is the block structured descent condition given in Lemma 4.6. When the constants in Lemma 4.6 are not known, a workaround is to enforce this descent condition algorithmically, which leads to the following scheme.

CBCG-B: Cyclic Block Conditional Gradient with Backtracking
Initialization. x^0 ∈ X, β_init > 0, κ > 1, ξ_i^{−1} = 1, i = 1, 2, . . . , N.
General Step. For k = 1, 2, . . .,

(i) For any i = 1, 2, . . . , N, compute

p_i^k ∈ argmin_{p_i ∈ X_i} { ⟨∇_i f(x^{k,i−1}), p_i⟩ + g_i(p_i) }.    (4.30)

Find ξ ∈ ℕ, the smallest integer ξ ≥ ξ_i^{k−1} such that

H(x^{k,i−1}) − H(x^{k,i−1} + α U_i(p_i^k − x_i^{k,i−1})) ≥ (α/2) S_i(x^{k,i−1}),    (4.31)

where α = min { S_i(x^{k,i−1}) / (κ^ξ β_init ‖A_i(p_i^k − x_i^{k,i−1})‖^2), 1 }, and set

ξ_i^k = ξ,  β_i^k = κ^ξ β_init,  α_i^k = min { S_i(x^{k,i−1}) / (β_i^k ‖A_i(p_i^k − x_i^{k,i−1})‖^2), 1 },    (4.32)

x^{k,i} = x^{k,i−1} + α_i^k U_i(p_i^k − x_i^{k,i−1}).    (4.33)

(ii) Update x^{k+1} = x^{k,N}.
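The inner backtracking search in step (i) can be sketched as follows for a single block; all names are illustrative, and the max_trials guard is an extra safeguard that is not part of the scheme above.

    def backtrack_block(H, x_prev, step_to, S_i, Ai_dir_sq, xi_prev,
                        beta_init=1.0, kappa=2.0, max_trials=60):
        """Backtracking for block i: increase xi until the descent test (4.31) holds.

        H         : callable evaluating the objective.
        x_prev    : current point x^{k,i-1}.
        step_to   : callable alpha -> x^{k,i-1} + alpha * U_i (p_i^k - x_i^{k,i-1}).
        S_i       : block optimality measure at x_prev.
        Ai_dir_sq : ||A_i (p_i^k - x_i^{k,i-1})||^2.
        xi_prev   : previous exponent xi_i^{k-1} (kept non-decreasing).
        Returns (xi_i^k, beta_i^k, alpha_i^k, x^{k,i}).
        """
        H_prev = H(x_prev)
        xi = xi_prev
        while True:
            beta = (kappa ** xi) * beta_init
            alpha = min(S_i / (beta * Ai_dir_sq), 1.0) if Ai_dir_sq > 0 else 1.0
            x_new = step_to(alpha)
            if H_prev - H(x_new) >= 0.5 * alpha * S_i or xi - xi_prev >= max_trials:
                return xi, beta, alpha, x_new
            xi += 1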

Under Assumption 2, we have from Lemma 4.6 that (4.31) holds provided that β_i^k ≥ β_i. We therefore have, for any k ≥ 0 and any i = 1, 2, . . . , N, that

β_init ≤ β_i^k ≤ β̄_i ≡ max{κ β_i, β_init}.    (4.34)

The main insight is given in the following lemma, whose proof follows the same arguments as that of Lemma 4.6, using (4.31).

Lemma 4.14. Let {x^k}_{k∈ℕ} be the sequence generated by the CBCG-B method. Then, for any k ≥ 0 and i ∈ {1, 2, . . . , N}, we have

H(x^{k,i−1}) − H(x^{k,i}) ≥ (α_i^k/2) S_i(x^{k,i−1}) ≥ ((α_i^k)^2 β_i^k/2) ‖A_i(p_i^k − x_i^k)‖^2,

where

α_i^k = min { S_i(x^{k,i−1}) / (β_i^k ‖A_i(p_i^k − x_i^k)‖^2), 1 }.

Using the bounds in (4.34), we obtain the following rate of convergence result for the CBCG algorithm with a backtracking scheme.

Theorem 4.15 (Sublinear rate for CBCG-B). Let {x^k}_{k∈ℕ} be the sequence generated by the CBCG-B method. Then, for any k ≥ 0, we have

H(x^k) − H^* ≤ N C_3/(k + 4),

and

min_{n=0,1,...,k} S(x^n) ≤ 2 N C_3/(k + 4),

where

C_3 = 4 [ max_{i=1,2,...,N} max{β̄_i ‖A_i‖^2 D_i^2, K_i} + N L_F^2 D^2 max_{i=1,2,...,N} ‖A_i‖^2 / β_init ],

and β̄_i is given in (4.34) while K_i is defined in (4.18), for i = 1, 2, . . . , N.

Proof. The arguments of this proof are similar to those in the proof of Theorem 4.13, where we replace Lemma 4.6 with Lemma 4.14.

5 Further Discussions

In this section we begin by showing that the complexity analysis presented in Section 4.2 holds for the CBCG method with exact line search when the smooth part of the objective function is quadratic. Then, we discuss two issues that are relevant for the implications of our analysis of the CBCG method.

5.1 Exact Line Search for Quadratic Problems

We recall another well known stepsize rule, the exact line search strategy (see, for example, [12]). This stepsize is defined as follows:

α_i^k ∈ argmin_{α ∈ [0,1]} H(x^{k,i−1} + α U_i(p_i^k − x_i^{k,i−1})).    (5.1)

The minimization step (5.1) can incur unnecessary additional computations, unless it can be carried out efficiently. This is the case for quadratic functions, for which it reduces to the minimization of a one-dimensional quadratic over a segment. It is therefore tempting to use CBCG with this stepsize rule, but our convergence analysis does not cover it explicitly. However, in the quadratic case, exact line search can be recast as an adaptive stepsize strategy. Indeed, suppose for example that the problem has the following form:

min { (1/2) ‖ ∑_{i=1}^{N} A_i x_i ‖^2 + ∑_{i=1}^{N} ⟨b_i, x_i⟩ + ∑_{i=1}^{N} g̃_i(x_i) },

where A_i ∈ ℝ^{n×n_i}, b_i ∈ ℝ^{n_i} and the functions g̃_i satisfy Assumption 1(i). This problem fits model (2.1) with F(·) = (1/2)‖·‖^2, A = (A_1, A_2, . . . , A_N) and g_i(·) = ⟨b_i, ·⟩ + g̃_i(·). Therefore, we can choose β_i = 1, i = 1, 2, . . . , N, and the exact line search strategy is equivalent to our adaptive stepsize strategy given in (4.12), since the upper bound of Lemma 4.1 holds with equality. Therefore, in the quadratic case, convergence for the exact line search strategy follows from Theorem 4.13.

5.2 Random permutations

All the arguments presented in Section 4 remain valid if the order of the blocks is not fixed for all iterations. The only important element is that all blocks are visited once at each iteration, but the order could change, in an arbitrary way, at each iteration. In particular, the arguments are valid when the order of blocks is picked as a random permutation at the beginning of each iteration. Therefore, all the convergence rate estimates presented in Section 4 still hold true when using random permutations.

The practical motivation for considering random permutations in the order of the blocks is that it may be beneficial in practical settings [29]. Since the derived theoretical efficiency estimates apply to any rule for choosing the order of the blocks at the beginning of each iteration (deterministic or random), they actually do not explain the potential performance differences between different variants, such as purely cyclic versus random permutations.

5.3 Implications for Coordinate Descent in Machine Learning

In the case of blocks of size one (i.e., N = n), the CBCG method bears a lot of resemblance to coordinate descent type methods. For example, in dimension one, a conditional gradient step with exact line search is the same as exact minimization. Therefore, CBCG with blocks of size one and exact line search is equivalent to coordinate descent with exact minimization. As we have seen in the previous section, our convergence analysis holds for this setting in the case of quadratic objectives. This setting has some interesting applications.

The ℓ_2 regularized empirical risk minimization problem is considered in [29], where the authors analyze the stochastic coordinate descent method with exact minimization in the dual, a method referred to as "Stochastic Dual Coordinate Ascent" (SDCA). The dual problem can be written as follows (using our notations):

min_x { (1/n) ∑_{i=1}^{n} g_i(x_i) + (λ/2) ‖ (1/(λn)) ∑_{i=1}^{n} A_i x_i ‖^2 },    (5.2)

where λ > 0 is a parameter, and for any i = 1, 2, . . . , n, x_i ∈ ℝ, A_i ∈ ℝ^n and g_i is a closed, convex and proper function with a bounded domain. This is exactly the setting of Section 5.1, and therefore Theorem 4.13 gives explicit deterministic rate estimates of the duality gap for the cyclic and random permutation coordinate descent variants applied to problem (5.2) when g_1, g_2, . . . , g_n satisfy Assumption 1. It can be checked that in the setting of problem (5.2) (with λ = 1 for simplicity), as n grows, the dominant term in the rate estimate (4.27) of Theorem 4.13 is of the form 4D^2 max_i ‖A_i‖^2/(k + 4). The quantity D^2 remains proportional to n (see identity (2.3)) and the rate estimate (4.27) still suffers from a multiplicative dependence on n, since each iteration requires an effective pass through the n coordinates. Therefore, these results remain mostly of theoretical interest in the context of coordinate descent in machine learning, where the focus is on large n, and there is room for improvement in order to theoretically grasp the practical performance of these types of methods.

6 Numerical Experiments

The experiments presented in this section correspond to situations where, for each i, the nonsmooth function g_i is taken to be the indicator of the set X_i ⊆ ℝ^{n_i}, with the eventual addition of a linear term. We therefore recover the smooth constrained optimization model of the traditional conditional gradient method (see also Remark 2.1). Furthermore, for the cyclic selection rule, in all the experiments, we use the "random permutation" approach, which consists in randomly changing the order of the blocks at the beginning of each iteration. The convergence analyses described so far do not depend on the order of the blocks and are still valid when it changes at each iteration. Therefore, all the results of Section 4 hold deterministically for this "random permutation" rule, which we do not differentiate from the cyclic update rule, and hence the corresponding algorithms bear the same name in this section. We begin with a modeling remark in Section 5.1. Numerical results are given in Section 6.1 for synthetic examples and in Section 6.2 for the structured SVM training problem.

6.1 Synthetic Problems

We generate artificial convex quadratic problems with box constraints in ℝ^100. We consider problems of the following form

min_{‖x‖_∞ ≤ 1} (1/2)(x − y)^T Q (x − y),    (6.1)

where ‖·‖_∞ denotes the ℓ_∞ norm, and y ∈ ℝ^100 and Q ∈ ℝ^{100×100} are given. The problem has a natural coordinate-wise structure which allows one to apply the CBCG algorithm where each block consists of a single coordinate. Indeed, by setting f(·) = (1/2)(· − y)^T Q(· − y) and g_i = δ_{[−1,1]}, i = 1, 2, . . . , 100, problem (6.1) fits our model (2.1). We generate random instances of problem (6.1) as follows (a sketch of this procedure appears after the list):

• Set d = 100 and n = 200.

• Generate X ∈ ℝ^{n×d} with standard normal entries.

• Set Q = (1/n) X^T D^2 X, where D = diag(1/n^2, 1/(n − 1)^2, . . . , 1).

• Generate y with standard normal entries.
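A numpy sketch of this generation procedure (the random seed is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(0)               # arbitrary seed
    d, n = 100, 200
    X = rng.standard_normal((n, d))
    D = np.diag(1.0 / np.arange(n, 0, -1) ** 2)  # diag(1/n^2, 1/(n-1)^2, ..., 1)
    Q = (X.T @ D @ D @ X) / n                    # Q = (1/n) X^T D^2 X  (d x d, PSD)
    y = rng.standard_normal(d)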

We compare the standard Conditional Gradient (CG) algorithm [17], the Random Block Conditional Gradient (RBCG) algorithm [19] (with the block chosen uniformly at random), the Cyclic Block Conditional Gradient where the order of the blocks is fixed for all iterations (CBCG-C), and the Cyclic Block Conditional Gradient where the order of the blocks is chosen by a random permutation at each iteration (CBCG-P), which is the method of interest in this paper. For each algorithm, we compare three different stepsize rules:

• Predefined: the stepsize given in (4.3). Note that for RBCG, there is a slight modification in the definition of the predefined stepsize: there is no notion of a "cycle" in RBCG and the stepsize has the form 2n/(k + 2n), where k is the number of blocks updated so far, see [19].

• Adaptive with backtracking: set A = I and compute the stepsize using the backtracking scheme.

• Exact line search: the stepsize is chosen to minimize the objective as in (5.1). Note that, as explained in Section 5.1, this corresponds to using the adaptive stepsize (4.12) with A = DX/√n, since we consider a quadratic objective function.

For problems of the form (6.1), the second strategy is questionable since a closed form expression is available for the exact line search. However, the comparison with this strategy illustrates differences in algorithmic behaviors related to the choice of A in the model (2.1), as well as the performance of the backtracking scheme.

Remark 6.1. For CG and RBCG, a sublinear convergence rate holds for the three stepsize strategies. For both CBCG-C and CBCG-P, in the case of the predefined and adaptive stepsizes, convergence is ensured by Theorems 4.5 and 4.15, respectively. Furthermore, since we consider quadratic objective functions, the exact line search stepsize strategy can be seen as a particular case of our adaptive strategy and convergence follows from Theorem 4.13 (see also Section 5.1).


[Figure 1 near here: three panels ("Predefined step", "Backtracking", "Exact line-search") showing the scaled objective value (log scale) versus k = 1, . . . , 10 for CBCG-P, CBCG-C, RBCG and CG.]

Figure 1: Comparison of Conditional Gradient (CG), its random block version (RBCG), its cyclic block version with fixed block order (CBCG-C) and its cyclic block version with random permutation block order (CBCG-P). We compare three different stepsize strategies based on 1000 randomly generated instances of problem (6.1). The central line is the median over the 1000 runs and the ribbons show 98%, 90%, 80%, 60% and 40% quantiles. For all methods, k represents the number of effective passes through d coordinates.

We generate 1000 random instances of problem (6.1), with f_w denoting the objective function of the w-th randomly generated problem, w = 1, 2, . . . , 1000. For each problem, we run the four different algorithms with the three different stepsize rules (the initialization is at the origin). The results for the first 10 iterations are presented in Figure 1. For each algorithm, increasing k by 1 means that N = 100 blocks have been queried: randomly for RBCG, sequentially for CBCG and all at once for CG. Since each objective function is generated randomly, it does not make sense to directly compare performance across different problems on the same graph. To overcome this, for each w = 1, 2, . . . , 1000, we "center" and "scale" the function values. That is, for each objective f_w, w = 1, 2, . . . , 1000, we estimate the optimal value f_w^* of (6.1) by running CBCG-P with exact line search for 200 more iterations. For such a number of iterations, we observed in preliminary experiments that the algorithm reached machine precision on random instances of problem (6.1). The quantity plotted in Figure 1 is given by the following affine transformation,

(f_w(x^k) − f_w^*) / (f_w(0) − f_w^*),

so that in Figure 1 the first value is always 1, and the represented quantities are positive and asymptotically tend to 0. The main comments regarding this experiment are the following:

• For each stepsize rule, CBCG-C has an advantage.

• There is not much difference between the two variants CBCG-C and CBCG-P in this experiment.

• The predefined stepsize rule leads to much slower convergence.

• The adaptive with backtracking and exact line search rules yield improved convergence speed for both CBCG and RBCG, which is not the case with CG.

• Exact line search rule has a slight advantage over the backtracking rule.


The last point deserves further comments. Indeed, in model (2.1), there are different possible choices of the matrix A and the function F that lead to equivalent problems. Despite being equivalent problems, different model choices may lead to variations in algorithmic performance when numerically solving a problem. This is what we observe here. Indeed, as pointed out in Section 5.1, the exact line search strategy corresponds to choosing an F that is perfectly conditioned. This means that variations of F around its minimum are isotropic. On the other hand, choosing A = I leads to a choice of F that is less well-behaved. These numerical experiments suggest that choosing A so that it corresponds to a better conditioned F leads to better numerical performance. Finally, the gap between the two strategies remains small, highlighting the efficiency of the backtracking scheme.

6.2 Structured SVM

The main motivation for the introduction of block version of CG method with random update rulein [19] is that it leads to a new efficient algorithm for training the structured SVM [30, 31]. We referthe reader to [19] and the references therein for background on this problem and its relations to theconditional gradient method. In brief, the structured SVM solves a multi-class classification task. Itis dedicated to problems for which the output classes are embedded in a combinatorial structure suchas trees, graphs or sequences. In this setting the number of classes can be enormous which results inoptimization problems with an untractable number of linear constraints. For some of these problemsefficient decoding algorithms can be used as oracles to compute sub-gradients of the structured SVMproblem. They can also be used as oracles to solve the linearized sub-problem required to run theconditional gradient and block conditional gradient methods on the dual of the structured SVM.In this section, we briefly recall the mathematical formulation of the dual structured SVM problemand provide a numerical comparison between random and cyclic update rules for block conditionalgradient in this context. The purpose is not to be exhaustive here and we solely focus on the aspectsof the problem related to optimization. Therefore, we compare the numerical performances of thetwo selection rules on this real world example based on an optimization criterion. In this section,N denotes the number of training examples, M denotes the number of output classes and d is aninteger such that Rd is a space of tractable size (such that elements of Rd can be stored in memory).The dual variable of the structured SVM is a matrix α ∈ RN×M with non-negative entries (thiscould actually be refined with example dependent output classes, but we stick to this notation as afirst approximation). The dual problem of the structured SVM can be written as follows

minα≥0

λ

2‖Aα‖2 − Tr(bTα) : α1M = 1N

, (6.2)

where A : RN×M → Rd is a linear map, b ∈ RN×M is a matrix, Tr denotes the trace operatorand 1s denotes the vector in Rs which all entries are 1. Problem (6.2) has an interesting productstructure. Indeed, its feasible set can be viewed as a product of N simplices of dimension M . In thecontext of structured SVM it is not necessary to store α explicitly, it is sufficient to store and updateAα ∈ Rd and Tr(bTα). With this information, we can use the specific decoding oracles to solvesimplex constrained linear subproblems and run the conditional gradient algorithm. This constitutesthe main advantage of the method here, it allows to explore a potentially very large space by usingonly implicit conditional gradient steps (recall that M is the size of a set of combinatorial nature,see [19] for a complete derivation and more details). We use the code provided by the authors of[19] which gives the possibility to train the structured SVM using RBCG, CBCG-C (fixed blockorder) or CBCG-P (random permutation block order) on the Optical Character Recognition (OCR)originally proposed in [30]. We consider random update rule and cyclic update rule with varyingblock order (or random permutation). We only consider the exact line search strategy (5.1). Notethat our convergence analysis can be applied in this setting (see also Section 5.1). It can be checked(see [19] for details about the structured SVM model) that the constant C2 given by Theorem 4.13



Figure 2: Comparison of RBCG, CBCG-C (purely cyclic) and CBCG-P (random permutation) for structured SVM training on the OCR dataset of [30]. The quantity represented is the optimality measure defined by (3.1), and the heading of each panel gives the value of λ (from (6.2)). The central line is the median over 20 runs and the ribbons show the 80% and 50% quantiles. For all methods, k represents the number of effective passes through all the blocks (which correspond to data points here).

Numerical results in terms of the optimality measure S for various values of λ are given in Figure 2. The global behavior differs from what we observed in the synthetic examples of Section 6.1: the random block selection rule (RBCG) has an advantage over the purely cyclic block selection rule (CBCG-C), while the random permutation approach (CBCG-P) gives the best performance here. Since our analysis is the same for both CBCG-C and CBCG-P, it cannot explain this empirical difference.

7 Discussion and Future Work

7.1 Comparison with existing results

We have described the CBCG algorithm and provided an explicit sublinear convergence rate estimate of the form O(1/k), which is canonical for linear oracle based algorithms (up to a multiplicative factor) [26]. This asymptotic rate cannot be improved in general [10]. A few words on the constants are in order. For the traditional CG method, the multiplicative constant is proportional to L_F D² [17]². In order to compare these rates with those obtained for block decomposition methods, we consider in this discussion that one iteration of a block decomposition method consists in updating N arbitrarily chosen blocks (which is consistent with the notation in this paper). The multiplicative constant proposed in [19] for the average case complexity of RBCG is proportional to (L_F D² + H(x⁰) − H*), independently of the number of blocks³. The independence of the constant with respect to the number of blocks is an important property that we do not recover for the cyclic block updating rule.

² The constant presented in [17] relates to an affine invariant notion of curvature, but it can be upper bounded by the term we consider.

³ As in the case of CG, the result is not presented with this exact constant, but it is an upper bound that we use for discussion purposes.


This behavior is consistent with previous observations comparing random and cyclic update rules [6], where a multiplicative dependence on the number of blocks also appeared in the convergence rate analysis of different types of methods.

For the sake of clarity, we discuss the results of Section 4 when A = I, all D_i are equal and β_i = L_F, i = 1, 2, ..., N. For the predefined stepsize, the quantity C_1/(L_F D²) (with C_1 given by (4.4)) grows like Ω(√N). Furthermore, for the adaptive stepsize, the quantity N C_2/(L_F D²) (with C_2 given by (4.23)) grows like Ω(N²). This is much worse than the Ω(√N) dependence of the predefined strategy, although practical simulations tend to show that the adaptive rule is much faster. There is therefore room for improvement in the analysis, and this raises the natural question of whether it is possible to obtain multiplicative constants that do not depend on the number of blocks (which is in general the case for random block selection rules). As we have seen in Sections 5 and 6, for the SDCA and structured SVM applications, the effect is a multiplicative dependence of the convergence rate on the size of the training set. Therefore, although theoretically interesting, the rates we derive are of limited use in explaining the performance on large training sets (large number of blocks in the dual), which is precisely the motivation for using block decomposition methods in machine learning applications.
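For quick reference, and only restating the orders just discussed (same simplifying assumptions A = I, equal D_i and β_i = L_F; no new results), the multiplicative constants in front of 1/k compare schematically as
\[
\underbrace{L_F D^2}_{\text{CG [17]}},
\qquad
\underbrace{L_F D^2 + H(x^0) - H^*}_{\text{RBCG, in expectation [19]}},
\qquad
\underbrace{C_1 = \Omega\!\left(\sqrt{N}\, L_F D^2\right),\quad N C_2 = \Omega\!\left(N^2 L_F D^2\right)}_{\text{CBCG, Theorems 4.5 and 4.13}},
\]
up to universal constants.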

However, contrary to the average case complexity estimate of RBCG [19], the exact expression of the rate is not a direct generalization of that of the CG algorithm. Recall that for the traditional CG method, the multiplicative constant is proportional to L_F D² [17]. The constants given in Theorems 4.5 and 4.13 feature additional multiplicative and additive terms. In particular, the quantity C_1/(L_F D²), for C_1 given by (4.4), and the quantity N C_2/(L_F D²), for C_2 given by (4.23), show a multiplicative dependence on the number of blocks N. This behavior is consistent with previous observations comparing random and cyclic update rules [6]. It is expected that the analysis of cyclic rules leads to worse constants, since it is a worst-case complexity analysis. An important theoretical question is whether this limitation can be lifted for cyclic rules. In other words, is it possible to prove an explicit convergence rate of the form M/k for the cyclic rule such that M/(L_F D²) does not depend on the number of blocks N?

For practical applications, however, this remark should be tempered since we are comparing upper bounds. These bounds may only reflect limitations of the analysis, not of the methods, and their comparison may not shed much light on the comparative behavior of different rules on practical problems. Indeed, our numerical experiments on synthetic and real-world examples consistently suggest an advantage of cyclic or random permutation variants over the fully random update rule. This has already been observed in other contexts [6, 29]. A question related to the discussion of the previous paragraph is to provide a theoretical justification for this observation, or possibly to devise iteration-dependent block update rules that exploit it (see for example [25] for an illustration in the context of gradient descent).

7.2 Future Directions

Finally, a natural question is how to extend specific features of the conditional gradient method to our block decomposition setting. Potential directions include the following:

• Linear convergence. CG is known to converge linearly when the optimum is in the relative interior of the feasible set [15, 20], or when the feasible set is strongly convex and the gradient of the smooth part of the objective is non-zero on the feasible set [21].

• Dual interpretation of the block decomposition. Generalized CG is known to implicitly generate subgradient sequences [1] related to the mirror descent algorithm [3] applied to a dual problem. Similarly, a dual interpretation of RBCG can be given in terms of the stochastic subgradient method [19].

• Generalization of the results to exact line search stepsize strategies. The analysis of such stepsize strategies is not problematic for CG or RBCG [17, 19]. However, we could not generalize it to CBCG, except in the quadratic case (see Section 5.1), and thus developing an analysis for the exact line search strategy is, in our opinion, an important task. Most of our analysis collapses without further assumptions, because it is no longer possible to obtain sufficient decrease conditions of the type of Lemma 4.8. We therefore expect that a different path needs to be considered.

• Generalization to inexact oracles. In many practical applications, the search direction given by the oracle is computed by an algorithm. It is therefore relevant to consider multiplicative errors in order to use well-defined stopping criteria for the inner algorithm. This was already considered in the random block variant [19].

A Proof of Lemma 2.2

We adapt the standard proof of the descent lemma, see e.g., [7, Proposition A.24]. Under the assumptions of the lemma, using the fundamental theorem of calculus for line integrals on the segment [x, x + U_i h], we have
\[
\begin{aligned}
f(x + U_i h) &= f(x) + \int_0^1 \langle \nabla f(x + t U_i h), U_i h\rangle\, dt\\
&= f(x) + \langle \nabla f(x), U_i h\rangle + \int_0^1 \langle \nabla f(x + t U_i h) - \nabla f(x), U_i h\rangle\, dt.
\end{aligned}
\tag{A.1}
\]
We can bound the integrand for any t ∈ [0, 1] as follows:
\[
\begin{aligned}
\langle \nabla f(x + t U_i h) - \nabla f(x), U_i h\rangle
&= \langle A^T \nabla F(A(x + t U_i h)) - A^T \nabla F(A x), U_i h\rangle \\
&= \langle \nabla F(A(x + t U_i h)) - \nabla F(A x), A U_i h\rangle \\
&= \langle \nabla F(A x + t A_i h) - \nabla F(A x), A_i h\rangle \\
&\leq \| \nabla F(A x + t A_i h) - \nabla F(A x)\| \cdot \| A_i h\| \\
&\leq t\, \beta_i \| A_i h\|^2,
\end{aligned}
\tag{A.2}
\]
where we have used the Cauchy-Schwarz inequality and Assumption 2 to obtain the last two inequalities. Combining (A.1) and (A.2), we have
\[
f(x + U_i h) \leq f(x) + \langle \nabla f(x), U_i h\rangle + \beta_i \| A_i h\|^2 \int_0^1 t\, dt
= f(x) + \langle \nabla_i f(x), h\rangle + \frac{\beta_i \| A_i h\|^2}{2},
\]

which proves the desired result.
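As a quick numerical sanity check of this block descent inequality, here is a short sketch of ours for the simplest instance, where F(y) = (1/2)‖y − c‖², so that ∇F is 1-Lipschitz and Assumption 2 holds with β_i = 1 for every block (for this quadratic choice the inequality in fact holds with equality):

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x) = F(Ax) with F(y) = 0.5 * ||y - c||^2; grad F is 1-Lipschitz,
# so Assumption 2 holds with beta_i = 1 for every block.
block_sizes = [2, 4, 3]                     # three blocks, total dimension 9
n_total, d = sum(block_sizes), 5
A = rng.standard_normal((d, n_total))
c = rng.standard_normal(d)

f = lambda x: 0.5 * np.linalg.norm(A @ x - c) ** 2
grad_f = lambda x: A.T @ (A @ x - c)

offsets = np.cumsum([0] + block_sizes)      # block i = coordinates offsets[i]:offsets[i+1]
x = rng.standard_normal(n_total)

for i in range(len(block_sizes)):
    sl = slice(offsets[i], offsets[i + 1])
    h = rng.standard_normal(block_sizes[i])
    step = np.zeros(n_total)
    step[sl] = h                            # this is U_i h
    lhs = f(x + step)
    # Right-hand side of Lemma 2.2 with beta_i = 1 and A_i = A[:, sl].
    rhs = f(x) + grad_f(x)[sl] @ h + 0.5 * np.linalg.norm(A[:, sl] @ h) ** 2
    assert lhs <= rhs + 1e-9, (lhs, rhs)

print("Block descent inequality of Lemma 2.2 verified on a random quadratic.")
```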

References

[1] F. Bach. Duality between subgradient and conditional gradient methods. SIAM J. Optim., 25(1):115–129, 2015.

[2] A. Beck. On the convergence of alternating minimization for convex programming with applications to iteratively reweighted least squares and decomposition schemes. SIAM J. Optim., 25(1):185–209, 2015.

[3] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett., 31(3):167–175, 2003.

[4] A. Beck and M. Teboulle. A conditional gradient method with linear rate of convergence for solving convex linear systems. Math. Methods Oper. Res., 59(2):235–247, 2004.

[5] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.

[6] A. Beck and L. Tetruashvili. On the convergence of block coordinate descent type methods. SIAM J. Optim., 23(4):2037–2060, 2013.

[7] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, second edition, 1999.

[8] K. Bredies and D. A. Lorenz. Iterated hard shrinkage for minimization problems with sparsity constraints. SIAM J. Sci. Comput., 30(2):657–683, 2008.

[9] K. Bredies, D. A. Lorenz, and P. Maass. A generalized conditional gradient method and its connection to an iterative shrinkage method. Comput. Optim. Appl., 42(2):173–193, 2009.

[10] M. D. Canon and C. D. Cullum. A tight upper bound on the rate of convergence of the Frank-Wolfe algorithm. SIAM J. Control, 6(4):509–516, 1968.

[11] B. Cox, A. Juditsky, and A. Nemirovski. Dual subgradient algorithms for large-scale nonsmooth learning problems. Math. Program., 148(1-2, Ser. B):143–180, 2014.

[12] V. F. Dem’yanov and A. M. Rubinov. The minimization of a smooth convex functional on a convex set. SIAM J. Control, 5(2):280–294, 1967.

[13] J. C. Dunn and S. Harshbarger. Conditional gradient algorithms with open loop step size rules. J. Math. Anal. Appl., 62(2):432–444, 1978.

[14] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Res. Logist. Quart., 3(1-2):95–110, 1956.

[15] J. Guélat and P. Marcotte. Some comments on Wolfe's 'away step'. Math. Program., 35(1):110–119, 1986.

[16] Z. Harchaoui, A. Juditsky, and A. Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. Math. Program., Ser. A. Published online.

[17] M. Jaggi. Revisiting Frank-Wolfe: projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 427–435, 2013.

[18] M. Jaggi and M. Sulovsky. A simple algorithm for nuclear norm regularized problems. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 471–478, 2010.

[19] S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 53–61, 2013.

[20] S. Lacoste-Julien and M. Jaggi. An affine invariant linear convergence analysis for Frank-Wolfe algorithms. arXiv preprint arXiv:1312.7864, 2013.

[21] E. S. Levitin and B. T. Poljak. Minimization methods in the presence of constraints. USSR Comput. Math. and Math. Phys., 6(5):787–823, 1966.

[22] Z.-Q. Luo and P. Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. J. Optim. Theory Appl., 72(1):7–35, 1992.

[23] Y. E. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim., 22(2):341–362, 2012.

[24] Y. E. Nesterov. Gradient methods for minimizing composite functions. Math. Program., 140(1, Ser. B):125–161, 2013.

[25] J. Nutini, M. Schmidt, I. H. Laradji, M. Friedlander, and H. Koepke. Coordinate descent converges faster with the Gauss-Southwell rule than random selection. arXiv preprint arXiv:1506.00552, 2015. ICML 2015, to appear.

[26] B. T. Polyak. Introduction to Optimization. Translations Series in Mathematics and Engineering. Optimization Software Inc. Publications Division, New York, 1987. Translated from the Russian, with a foreword by Dimitri P. Bertsekas.

[27] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program., 144(1-2, Ser. A):1–38, 2014.

[28] A. Saha and A. Tewari. On the nonasymptotic convergence of cyclic coordinate descent methods. SIAM J. Optim., 23(1):576–601, 2013.

[29] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res., 14(1):567–599, 2013.

[30] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems (NIPS 2003), pages 25–32, 2004.

[31] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res., 6:1453–1484, 2005.
