
Tight Dimension Independent Lower Bound on Optimal Expected Convergence Rate for Diminishing Step Sizes in SGD

Phuong Ha Nguyen Lam M. Nguyen Marten van Dijk

October 12, 2018

Abstract

We study convergence of Stochastic Gradient Descent (SGD) for strongly convex and smooth objective functions $F$.

We prove a lower bound on the expected convergence rate which holds for any sequence of diminishing stepsizes that are designed based on only global knowledge, such as the fact that $F$ is smooth and strongly convex and the component functions are smooth and convex, together with more additional information. Our lower bound meets the expected convergence rate of a recently proposed sequence of stepsizes at ICML 2018, which is based on such knowledge, within a factor 32. This shows that the stepsizes as proposed in the ICML paper are close to optimal. Furthermore, we conclude that in order to be able to construct stepsizes that beat our lower bound, more detailed information about $F$ must be known.

Our work significantly improves over the state-of-the-art lower bound which we show is another factor $643 \cdot d$ worse, where $d$ is the dimension. We are the first to prove a lower bound that comes within a small constant – independent from any other problem specific parameters – from an optimal solution.

1 Introduction

We are interested in solving the following stochastic optimization problem

$$\min_{w \in \mathbb{R}^d} \big\{ F(w) = \mathbb{E}[f(w;\xi)] \big\}, \qquad (1)$$

where $\xi$ is a random variable obeying some distribution $g(\xi)$. In the case of empirical risk minimization with a training set $\{(x_i, y_i)\}_{i=1}^n$, $\xi_i$ is a random variable that is defined by a single random sample $(x, y)$ pulled uniformly from the training set.

Phuong Ha Nguyen, Department of Electrical and Computer Engineering, University of Connecticut, Storrs, CT, USA. Email:[email protected]

Lam M. Nguyen, IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA. Email: [email protected]

Marten van Dijk, Department of Electrical and Computer Engineering, University of Connecticut, Storrs, CT, USA. Email:marten.van [email protected]


Then, by defining $f_i(w) := f(w; \xi_i)$, empirical risk minimization reduces to

$$\min_{w \in \mathbb{R}^d} \Big\{ F(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w) \Big\}. \qquad (2)$$

Problems of this type arise frequently in supervised learning applications [7]. The classic first-order methods to solve problem (2) are gradient descent (GD) [17] and stochastic gradient descent (SGD)$^1$ [19] algorithms. GD is a standard deterministic gradient method, which updates iterates along the negative full gradient with learning rate $\eta_t$ as follows

$$w_{t+1} = w_t - \eta_t \nabla F(w_t) = w_t - \frac{\eta_t}{n}\sum_{i=1}^{n}\nabla f_i(w_t), \quad t \ge 0.$$

We can choose $\eta_t = \eta = O(1/L)$ and achieve a linear convergence rate for the strongly convex case [14]. The upper bound of the convergence rate of GD and SGD has been studied in [2, 4, 14, 20]. However, GD requires evaluation of $n$ derivatives at each step, which is very expensive and therefore avoided in large-scale optimization. To reduce the computational cost for solving (2), a class of variance reduction methods [9, 6, 8, 15] has been proposed. The difference between GD and variance reduction methods is that GD needs to compute the full gradient at each step, while the variance reduction methods compute the full gradient only after a certain number of steps. In this way, variance reduction methods have less computational cost compared to GD. To avoid evaluating the full gradient at all, SGD generates an unbiased random variable $\xi_t$ such that

$$\mathbb{E}_{\xi_t}[\nabla f(w_t; \xi_t)] = \nabla F(w_t),$$

and then evaluates the gradient $\nabla f(w_t; \xi_t)$ for $\xi_t$ drawn from distribution $g(\xi)$. After this, $w_t$ is updated as follows:

$$w_{t+1} = w_t - \eta_t \nabla f(w_t; \xi_t). \qquad (3)$$

Algorithm 1 provides a detailed description. Obviously, the computational cost of SGD is $n$ times cheaper than that of GD. However, as has been shown in the literature, we need to choose $\eta_t = O(1/t)$ and the convergence rate of SGD slows down to $O(1/t)$ [3], which is a sublinear convergence rate.

Algorithm 1 Stochastic Gradient Descent (SGD) Method

Initialize: $w_0$
Iterate:
for $t = 0, 1, \ldots$ do
    Choose a step size (i.e., learning rate) $\eta_t > 0$.
    Generate a random variable $\xi_t$.
    Compute a stochastic gradient $\nabla f(w_t; \xi_t)$.
    Update the new iterate $w_{t+1} = w_t - \eta_t \nabla f(w_t; \xi_t)$.
end for
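As an illustration, the following is a minimal Python sketch of Algorithm 1. The functions `grad_f` and `sample_xi` are placeholders to be supplied by the user, and the diminishing schedule $\eta_t = \frac{2}{\mu t + 4L}$ from [16] (discussed later in this paper) is used here purely as an example of a step-size choice; none of the concrete names below are prescribed by the paper.

```python
import numpy as np

def sgd(grad_f, sample_xi, w0, mu, L, n_iters):
    """Minimal sketch of Algorithm 1 (SGD with diminishing step sizes).

    grad_f(w, xi) returns the stochastic gradient of f(.; xi) at w, and
    sample_xi() draws a realization of the random variable xi.
    """
    w = np.array(w0, dtype=float)
    for t in range(n_iters):
        eta = 2.0 / (mu * t + 4.0 * L)     # choose step size eta_t > 0 (schedule from [16])
        xi = sample_xi()                   # generate a random variable xi_t
        w = w - eta * grad_f(w, xi)        # w_{t+1} = w_t - eta_t * grad f(w_t; xi_t)
    return w
```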

In this paper we focus on the general problem (1) where $F$ is strongly convex. Since $F$ is strongly convex, a unique optimal solution of (1) exists and throughout the paper we denote this optimal solution by $w_*$.

$^1$We notice that even though stochastic gradient is referred to as SG in the literature, the term stochastic gradient descent (SGD) has been widely used in many important works on large-scale learning.


The starting point for the analysis is the recurrence

$$\mathbb{E}[\|w_{t+1} - w_*\|^2] \le (1 - \mu\eta_t)\,\mathbb{E}[\|w_t - w_*\|^2] + \eta_t^2 N, \qquad (4)$$

where $N = 2\mathbb{E}[\|\nabla f(w_*;\xi)\|^2]$ and $\eta_t$ is upper bounded by $\frac{1}{2L}$; the recurrence has been shown to hold if we assume (1) $N$ is finite, (2) $F(\cdot)$ is $\mu$-strongly convex, (3) $f(w;\xi)$ is $L$-smooth, and (4) $f(w;\xi)$ is convex [16, 10]; we detail these assumptions below:

Assumption 1 ($\mu$-strongly convex). The objective function $F: \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly convex, i.e., there exists a constant $\mu > 0$ such that $\forall w, w' \in \mathbb{R}^d$,

$$F(w) - F(w') \ge \langle \nabla F(w'), w - w' \rangle + \frac{\mu}{2}\|w - w'\|^2. \qquad (5)$$

As shown in [14, 3], Assumption 1 implies

$$2\mu[F(w) - F(w_*)] \le \|\nabla F(w)\|^2, \quad \forall w \in \mathbb{R}^d. \qquad (6)$$

Assumption 2 ($L$-smooth). $f(w;\xi)$ is $L$-smooth for every realization of $\xi$, i.e., there exists a constant $L > 0$ such that, $\forall w, w' \in \mathbb{R}^d$,

$$\|\nabla f(w;\xi) - \nabla f(w';\xi)\| \le L\|w - w'\|. \qquad (7)$$

Assumption 2 implies that F is also L-smooth.

Assumption 3. $f(w;\xi)$ is convex for every realization of $\xi$, i.e., $\forall w, w' \in \mathbb{R}^d$,

$$f(w;\xi) - f(w';\xi) \ge \langle \nabla f(w';\xi), w - w' \rangle.$$

We notice that the earlier established recurrence in [12] under the same set of assumptions,

$$\mathbb{E}[\|w_{t+1} - w_*\|^2] \le (1 - 2\mu\eta_t + 2L^2\eta_t^2)\,\mathbb{E}[\|w_t - w_*\|^2] + \eta_t^2 N,$$

is similar, but worse than (4) as it only holds for $\eta_t < \frac{\mu}{L^2}$, whereas (4) holds for $\eta_t \le \frac{1}{2L}$. Only for stepsizes $\eta_t < \frac{\mu}{2L^2}$ does the above recurrence provide a better bound than (4), i.e., $1 - 2\mu\eta_t + 2L^2\eta_t^2 \le 1 - \mu\eta_t$. In practical settings such as logistic regression, $\mu = O(1/n)$, $L = O(1)$, and $t = O(n)$ (i.e., $t$ is at most a relatively small constant number of epochs, where a single epoch represents $n$ iterations, resembling the complexity of a single GD computation). As we will show, for this parameter setting the optimally chosen step sizes are $\gg \frac{\mu}{L^2}$. This is the reason we focus in this paper on analysing recurrence (4): for $\eta_t \le \frac{1}{2L}$,

$$Y_{t+1} \le (1 - \mu\eta_t)Y_t + \eta_t^2 N, \quad \text{where } Y_t = \mathbb{E}[\|w_t - w_*\|^2].$$
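To get a feel for this recurrence, one can simply iterate its right-hand side for a concrete step-size sequence. The short Python sketch below does this for the schedule $\eta_t = \frac{2}{\mu t + 4L}$ from [16] (discussed below); the values of $\mu$, $L$, $N$, and $Y_0$ are arbitrary illustrative choices, not taken from the paper.

```python
import numpy as np

mu, L, N, Y0 = 0.1, 1.0, 1.0, 10.0   # illustrative parameters only

T = 10_000
Y = np.empty(T + 1)
Y[0] = Y0
for t in range(T):
    eta = 2.0 / (mu * t + 4.0 * L)                     # eta_t <= 1/(2L) for all t
    Y[t + 1] = (1.0 - mu * eta) * Y[t] + eta**2 * N    # right-hand side of the recurrence

# The iterates decay roughly like O(1/t); compare with 16N/(mu^2 t) for large t.
print(Y[T], 16 * N / (mu**2 * T))
```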

Problem Statement: It is well-known that based on the above assumptions (without the so-called bounded gradient assumption) and knowledge of only $\mu$ and $L$, a sequence of stepsizes $\eta_t$ can be constructed such that $Y_t$ is smaller than $O(1/t)$ [16]; more explicitly, $Y_t \le \frac{16N}{\mu}\frac{1}{\mu(t - \frac{20L}{\mu}) + 4L}$. Knowing a tight lower bound on $Y_t$ is important for the following reasons:


(1) It helps us understand to what extent a given sequence of stepsizes $\eta_t$ leads to an optimal expected convergence rate. (2) The lower bound tells us that a sequence of step sizes as a function of only $\mu$ and $L$ cannot beat an expected convergence rate of $O(1/t)$. More information is needed in a construction of $\eta_t$ if we want to achieve a better expected convergence rate $Y_t = O(1/t^p)$ where $p > 1$.

Related Work and Contribution: The authors of [13] proposed the first formal study of lower bounding the expected convergence rate for SGD. The authors of [1] and [18] independently studied this lower bound using information theory and were able to improve it.

As in this paper, the derivation in [1] is for SGD where the sequence of step sizes is a-priori fixed based on global information regarding assumed stochastic parameters concerning the objective function $F$. Their proof uses the following three assumptions (in this paper we assume a different set of assumptions as listed above):

1. The assumption of a strongly convex objective function, i.e., Assumption 1 (see Definition 3 in [1]).

2. There exists a bounded convex set $S \subset \mathbb{R}^d$ such that

$$\mathbb{E}[\|\nabla f(w;\xi)\|^2] \le \sigma^2$$

for all $w \in S \subset \mathbb{R}^d$ (see Definition 1 in [1]). Notice that this is not the same as the bounded gradient assumption where $S = \mathbb{R}^d$ is unbounded.$^2$

3. The objective function $F$ is a convex Lipschitz function, i.e., there exists a positive number $K$ such that

$$|F(w) - F(w')| \le K\|w - w'\|, \quad \forall w, w' \in S \subset \mathbb{R}^d.$$

We notice that this assumption actually implies the assumption on bounded gradients as stated above.

To prove the lower bound for strongly convex and Lipschitz objective functions, the authors constructed a class of objective functions and showed that the lower bound for this class is, in terms of the notation used in this paper,

$$\frac{\log(2/\sqrt{e})}{108\,d}\,\frac{N}{\mu^2 t}. \qquad (8)$$

We revisit their derivation in supplementary material A where we show how their lower bound transforms into (8). Notice that their lower bound depends on the dimension $d$.

In this paper we prove for strongly convex and smooth objective functions the lower bound

$$\approx \frac{1}{2}\,\frac{N}{\mu^2 t}.$$

Our lower bound is independent from $d$ and, in fact, it meets the expected convergence rate for a specifically constructed sequence of step sizes (based on only the parameters $\mu$ for strong convexity and $L$ for smoothness) within a factor 32. This proves that this sequence of step sizes leads to an optimal expected convergence rate within the small factor of 32 and proves that our lower bound is tight within a factor of 32. Notice that we significantly improve over the state of the art since (8) is a factor $643 \cdot d$ smaller than our lower bound, and more importantly, our lower bound is independent of $d$.

$^2$The bounded gradient assumption, where $S$ is unbounded, is in conflict with assuming strong convexity, as explained in [16].


The specifically constructed sequence of step sizes mentioned above is from [16] and is given by $\eta_t = \frac{2}{\mu t + 4L}$, and it yields expected convergence rate $Y_t \approx \frac{16N}{\mu^2 t}$. This explains the factor 32 difference.
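For concreteness, the factor 32 is simply the ratio of the two rates above for large $t$:

$$\frac{16N/(\mu^2 t)}{\tfrac{1}{2}\,N/(\mu^2 t)} = 32.$$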

In [19], the authors proved that in order to make SGD converge, stepsizes $\eta_t$ should satisfy the conditions

$$\sum_{t=0}^{\infty}\eta_t = \infty \quad \text{and} \quad \sum_{t=0}^{\infty}\eta_t^2 < \infty.$$

In [12], the authors studied the expected convergence rates for another class of step sizes of $O(1/t^p)$ where $0 < p \le 1$. However, the authors of both [19] and [12] do not discuss the optimal stepsizes among all proposed stepsizes, which is what is done in this paper.

Outline: The paper is organized as follows. Section 2 describes a class of strongly convex and smooth objective functions which is used to derive the lower bound. We verify our theory by experiments in Section 3. Supplementary material A comprehensively studies the work in [1]. Section 4 concludes the paper.

2 Lower Bound and Optimal Stepsize for SGD

In this paper, we consider the following extended problem of SGD: when constructing a sequence of stepsizes $\eta_t$, we do not only have access to $\mu$ and $L$; in addition, we also have access to $N$, access to the full gradient $\nabla F(w_t)$ in the $t$-th iteration, and access to an oracle that knows $Y_t$ in the $t$-th iteration. Notice that this allows adaptively constructed $\eta_t$ to some extent, and our lower bound will hold for this more general case.

Note that the construction of $\eta_t$ as analyzed in this paper does not depend on knowledge of the stochastic gradient $\nabla f(w_t; \xi_t)$. So, we do not consider step sizes that are adaptively computed based on $\nabla f(w_t; \xi_t)$.

We study the best lower bound on the expected convergence rate for any possible sequence of stepsizes $\eta_t$ that satisfies the requirements given above in the extended SGD setting.

In order to prove a lower bound we propose a specific class of strongly convex and smooth objective functions $F$ and we show in the extended SGD setting how to compute the optimal step size $\eta_t$ as a function of $\mu, L, N$, $\nabla F(w_t)$, and an oracle with access to $Y_t$. We will show that the optimal stepsize $\eta_t$ is based on $\mu, L, N, Y_0, \ldots, Y_t$. For completeness, as in Algorithm 1, the next iterate $w_{t+1}$ is defined as $w_{t+1} = w_t - \eta_t \nabla f(w_t; \xi_t)$.

We consider the following class of objective functions $F$: We consider a multivariate normal distribution of a $d$-dimensional random vector $\xi$, i.e., $\xi \sim \mathcal{N}(m, \Sigma)$, where $m = \mathbb{E}[\xi]$ and $\Sigma = \mathbb{E}[(\xi - m)(\xi - m)^{\mathsf T}]$ is the (symmetric positive semi-definite) covariance matrix. The density function of $\xi$ is chosen as

$$g(\xi) = \frac{\exp\!\big(-\tfrac{1}{2}(\xi - m)^{\mathsf T}\Sigma^{-1}(\xi - m)\big)}{\sqrt{(2\pi)^d |\Sigma|}}.$$

We select component functions

$$f(w;\xi) = s(\xi)\,\frac{\|w - \xi\|^2}{2},$$

where function s(ξ) is constructed a-priori according to the following random process:


• With probability 1−µ/L, we draw s(ξ) from the uniform distribution over interval [0, µ/(1−µ/L)].

• With probability µ/L, we draw s(ξ) from the uniform distribution over interval [0, L].
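As a small sanity check of this construction, the Python sketch below (with arbitrary illustrative values of $\mu$ and $L$, $\mu < L$) samples $s(\xi)$ according to the recipe above and confirms empirically that $\mathbb{E}_\xi[s(\xi)] = \mu$ and that $s(\xi) \le L$, which are properties used in the proof of Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, L = 0.1, 1.0                      # illustrative values with mu < L

def sample_s(size):
    # With probability mu/L draw from the uniform distribution over [0, L],
    # otherwise (probability 1 - mu/L) draw from the uniform distribution
    # over [0, mu/(1 - mu/L)].
    use_large = rng.random(size) < mu / L
    small_draw = rng.uniform(0.0, mu / (1.0 - mu / L), size)
    large_draw = rng.uniform(0.0, L, size)
    return np.where(use_large, large_draw, small_draw)

s = sample_s(1_000_000)
print(s.mean(), mu)                   # empirical mean of s(xi) is close to mu
print(s.max() <= L)                   # s(xi) never exceeds L (L-smoothness)
```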

The following theorem analyses the sequence of optimal step sizes for our class of objective functions and gives a lower bound on the corresponding expected convergence rates. The theorem states that we cannot find a better sequence of step sizes. In other words, without additional information about the objective function (beyond $\mu, L, N, Y_0, \ldots, Y_t$ for computing $\eta_t$), we can at best prove a general upper bound which is at least the lower bound as stated in the theorem. As explained in the introduction, an upper bound which is only a factor 32 larger than the theorem's lower bound exists.

As a disclaimer we notice that for some objective functions the expected convergence rate can be much better than what is stated in the theorem: this is due to the specific nature of the objective function itself. However, without knowledge about this nature, one can only prove a general upper bound on the expected convergence rate $Y_t$, and any such upper bound must be at least the lower bound as proven in the next theorem. Therefore, as a conclusion of the theorem we infer that only if more/other information is used for adaptively computing $\eta_t$, then it may be possible to derive stronger upper bounds that beat the lower bound of the theorem.

Theorem 1. We assume that component functions $f(w;\xi)$ are constructed according to the recipe described above. Then, the corresponding objective function is $\mu$-strongly convex and the component functions are $L$-smooth and convex with $\mu < L$.

If we run Algorithm 1 and assume that an oracle accessing $Y_t = \mathbb{E}[\|w_t - w_*\|^2]$ is given at the $t$-th iteration (our extended SGD problem setting), then an exact expression for the optimal sequence of stepsizes $\eta_t$ (see (13)) based on $\mu, L, N, Y_0, \ldots, Y_t$ can be given. For this sequence of stepsizes,

$$Y_t \ge \frac{N}{2\mu}\,\frac{1}{W}, \qquad (9)$$

and for $t \ge T' = \frac{20L}{\mu}$,

$$Y_t \le \frac{16N}{\mu}\,\frac{1}{\mu\big(t - \frac{20L}{\mu}\big) + 4L}, \qquad (10)$$

where

$$W = \mu t + \frac{\mu\big(L + \mu/(1 - \mu/L)\big)}{12} + \mu\left[\ln\!\Big(t + \frac{L\big(1 + \mu/(L(1 - \mu/L))\big)}{12}\Big) - \ln\!\Big(\frac{L\big(1 + \mu/(L(1 - \mu/L))\big)}{12}\Big)\right].$$

Proof. Clearly, $f(w;\xi)$ is $s(\xi)$-smooth where the maximum value of $s(\xi)$ is equal to $L$. That is, all functions $f(w;\xi)$ are $L$-smooth (and we cannot claim a smaller smoothness parameter). We notice that

$$\mathbb{E}_\xi[s(\xi)] = (1 - \mu/L)\,\frac{\mu/(1 - \mu/L)}{2} + (\mu/L)\,\frac{L}{2} = \mu$$

and

$$\mathbb{E}_\xi[s(\xi)^2] = (1 - \mu/L)\,\frac{(\mu/(1 - \mu/L))^2}{12} + (\mu/L)\,\frac{L^2}{12} = \frac{\mu\big(L + \frac{\mu}{1 - \mu/L}\big)}{12}.$$


With respect to f(w; ξ) and distribution g(ξ) we define

$$F(w) = \mathbb{E}_\xi[f(w;\xi)] = \mathbb{E}_\xi\Big[s(\xi)\,\frac{\|w - \xi\|^2}{2}\Big].$$

Since $s(\xi)$ only assigns a random variable to $\xi$ which is drawn from a distribution whose description is not a function of $\xi$, the random variables $s(\xi)$ and $\xi$ are statistically independent. Therefore,

$$F(w) = \mathbb{E}_\xi\Big[s(\xi)\,\frac{\|w - \xi\|^2}{2}\Big] = \mathbb{E}_\xi[s(\xi)]\,\mathbb{E}_\xi\Big[\frac{\|w - \xi\|^2}{2}\Big] = \mu\,\mathbb{E}_\xi\Big[\frac{\|w - \xi\|^2}{2}\Big].$$

Notice:

1. $\|w - \xi\|^2 = \|(w - m) + (m - \xi)\|^2 = \|w - m\|^2 + 2\langle w - m, m - \xi\rangle + \|m - \xi\|^2$.

2. Since $m = \mathbb{E}[\xi]$, we have $\mathbb{E}[m - \xi] = 0$.

3. $\mathbb{E}[\|m - \xi\|^2] = \sum_{i=1}^{d}\mathbb{E}[(m_i - \xi_i)^2] = \sum_{i=1}^{d}\Sigma_{i,i} = \mathrm{Tr}(\Sigma)$.

Therefore,

$$F(w) = \mu\,\mathbb{E}_\xi\Big[\frac{\|w - \xi\|^2}{2}\Big] = \mu\,\frac{\|w - m\|^2}{2} + \mu\,\frac{\mathrm{Tr}(\Sigma)}{2},$$

and this shows $F$ is $\mu$-strongly convex and has minimum $w_* = m$.

Since

$$\nabla_w[\|w - \xi\|^2] = \nabla_w[\langle w, w\rangle - 2\langle w, \xi\rangle + \langle \xi, \xi\rangle] = 2w - 2\xi = 2(w - \xi),$$

we have $\nabla_w f(w;\xi) = s(\xi)(w - \xi)$.

In our notation, $N = 2\mathbb{E}_\xi[\|\nabla f(w_*;\xi)\|^2] = 2\mathbb{E}_\xi[s(\xi)^2\|w_* - \xi\|^2]$.

By using similar arguments as used above we can split the expectation and obtain

$$N = 2\,\mathbb{E}_\xi[s(\xi)^2\|w_* - \xi\|^2] = 2\,\mathbb{E}_\xi[s(\xi)^2]\,\mathbb{E}_\xi[\|w_* - \xi\|^2].$$

We already calculated $\mathbb{E}_\xi[\|w_* - \xi\|^2] = \|w_* - m\|^2 + \mathrm{Tr}(\Sigma) = \mathrm{Tr}(\Sigma)$

and we know

$$\mathbb{E}_\xi[s(\xi)^2] = \frac{\mu\big(L + \frac{\mu}{1 - \mu/L}\big)}{12}.$$


This yields

$$N = 2\,\mathbb{E}_\xi[s(\xi)^2]\,\mathbb{E}_\xi[\|w_* - \xi\|^2] = \frac{\mu\big(L + \frac{\mu}{1 - \mu/L}\big)}{6}\,\mathrm{Tr}(\Sigma).$$

In the SGD algorithm we compute

$$w_{t+1} = w_t - \eta_t \nabla f(w_t; \xi_t) = w_t - \eta_t s(\xi_t)(w_t - \xi_t) = (1 - \eta_t s(\xi_t))\,w_t + \eta_t s(\xi_t)\,\xi_t.$$

We choose $w_0$ according to the following computation: we draw $\xi$ from its distribution and apply full gradient descent in order to find the $w_0$ which minimizes $f(w;\xi)$ over $w$. Since

$$f(w;\xi) = s(\xi)\,\frac{\|w - \xi\|^2}{2},$$

the minimum is achieved by w0 = ξ. Therefore,

$$Y_0 = \mathbb{E}[\|w_0 - w_*\|^2] = \mathbb{E}[\|\xi - w_*\|^2] = \mathrm{Tr}(\Sigma).$$

Let $\mathcal{F}_t = \sigma(w_0, \xi_0, \ldots, \xi_{t-1})$ be the $\sigma$-algebra generated by $w_0, \xi_0, \ldots, \xi_{t-1}$. We derive

$$\mathbb{E}[\|w_{t+1} - w_*\|^2 \mid \mathcal{F}_t] = \mathbb{E}[\|(1 - \eta_t s(\xi_t))(w_t - w_*) + \eta_t s(\xi_t)(\xi_t - w_*)\|^2 \mid \mathcal{F}_t],$$

which is equal to

$$\mathbb{E}\big[(1 - \eta_t s(\xi_t))^2\|w_t - w_*\|^2 + 2\eta_t s(\xi_t)(1 - \eta_t s(\xi_t))\langle w_t - w_*, \xi_t - w_*\rangle + \eta_t^2 s(\xi_t)^2\|\xi_t - w_*\|^2 \,\big|\, \mathcal{F}_t\big]. \qquad (11)$$

Given $\mathcal{F}_t$, $w_t$ is not a random variable. Furthermore, we can use linearity of expectation and, as above, split expectations:

$$\mathbb{E}[(1 - \eta_t s(\xi_t))^2]\,\|w_t - w_*\|^2 + \mathbb{E}[2\eta_t s(\xi_t)(1 - \eta_t s(\xi_t))]\,\langle w_t - w_*, \mathbb{E}[\xi_t - w_*]\rangle + \mathbb{E}[\eta_t^2 s(\xi_t)^2]\,\mathbb{E}[\|\xi_t - w_*\|^2].$$

Again notice that $\mathbb{E}[\xi_t - w_*] = 0$ and $\mathbb{E}[\|\xi_t - w_*\|^2] = \mathrm{Tr}(\Sigma)$. So, $\mathbb{E}[\|w_{t+1} - w_*\|^2 \mid \mathcal{F}_t]$ is equal to

$$\mathbb{E}[(1 - \eta_t s(\xi_t))^2]\,\|w_t - w_*\|^2 + \eta_t^2\,\frac{N}{2} = \Big(1 - 2\eta_t\mu + \eta_t^2\,\frac{\mu\big(L + \frac{\mu}{1 - \mu/L}\big)}{12}\Big)\|w_t - w_*\|^2 + \eta_t^2\,\frac{N}{2} = \Big(1 - \mu\eta_t\big(2 - \eta_t(L + \tfrac{\mu}{1 - \mu/L})/12\big)\Big)\|w_t - w_*\|^2 + \eta_t^2\,\frac{N}{2}.$$

In terms of Yt = E[‖wt − w∗‖2], by taking the full expectation (also over Ft) we get

$$Y_{t+1} = \Big(1 - \mu\eta_t\big(2 - \eta_t(L + \tfrac{\mu}{1 - \mu/L})/12\big)\Big)Y_t + \eta_t^2\,\frac{N}{2}.$$

This is very close to recurrence (4).


The optimal step size in this case is found by taking the derivative with respect to $\eta_t$. The derivative is equal to

$$-2\mu Y_t + 2\eta_t\,\mu\big(L + \tfrac{\mu}{1 - \mu/L}\big)Y_t/12 + N\eta_t, \qquad (12)$$

which shows the minimum is achieved for

$$\eta_t = \frac{2\mu Y_t}{N + \mu\big(L + \frac{\mu}{1 - \mu/L}\big)Y_t/6}, \qquad (13)$$

giving

$$Y_{t+1} = Y_t - \frac{2\mu^2 Y_t^2}{N + \mu\big(L + \frac{\mu}{1 - \mu/L}\big)Y_t/6} = Y_t - \frac{2\mu^2 Y_t^2}{N(1 + Y_t/\mathrm{Tr}(\Sigma))}. \qquad (14)$$

We note that Yt+1 ≤ Yt for any t ≥ 0. We proceed by proving a lower bound on Yt. Clearly,

$$Y_{t+1} \ge Y_t - \frac{2\mu^2 Y_t^2}{N}. \qquad (15)$$

Let us define $\gamma = 2\mu^2/N$. We can rewrite (15) as follows:

$$\gamma Y_{t+1} \ge \gamma Y_t(1 - \gamma Y_t) \quad \text{or} \quad (\gamma Y_{t+1})^{-1} \le 1 + (\gamma Y_t)^{-1} + \frac{1}{(\gamma Y_t)^{-1} - 1}.$$

In order to make the inequality above correct, we require $1 - \gamma Y_t > 0$ for any $t \ge 0$. Since $Y_{t+1} \le Y_t$, we only need $Y_0 < \frac{1}{\gamma}$. This means

$$\mathrm{Tr}(\Sigma) < \frac{N}{2\mu^2} \quad \text{or} \quad \mathrm{Tr}(\Sigma) < \frac{\mu\big(L + \frac{\mu}{1 - \mu/L}\big)}{6}\,\frac{\mathrm{Tr}(\Sigma)}{2\mu^2}.$$

This is equivalent to $\mu < 3\big(L + \frac{\mu}{1 - \mu/L}\big)$, which is obviously true since $\mu < L$.

This implies

$$(\gamma Y_{t+1})^{-1} \le (t+1) + (\gamma Y_0)^{-1} + \sum_{i=0}^{t}\frac{1}{(\gamma Y_i)^{-1} - 1}.$$

Since

$$\sum_{i=0}^{t}\frac{1}{(\gamma Y_i)^{-1} - 1} \le \ln\big(t + 1 + (\gamma Y_0)^{-1}\big) - \ln\big((\gamma Y_0)^{-1}\big),$$

we have the following inequality:

$$(\gamma Y_{t+1})^{-1} \le (t+1) + (\gamma Y_0)^{-1} + \ln\big(t + 1 + (\gamma Y_0)^{-1}\big) - \ln\big((\gamma Y_0)^{-1}\big).$$


Reordering, substituting $\gamma = 2\mu^2/N$, and replacing $t+1$ by $t$ yields, for $t \ge 0$, the lower bound

$$Y_t \ge \frac{N}{2\mu}\,\frac{1}{\mu t + N/(2Y_0) + \mu\big[\ln\big(t + \frac{N}{2\mu Y_0}\big) - \ln\big(\frac{N}{2\mu Y_0}\big)\big]} = \frac{N}{2\mu}\,\frac{1}{W},$$

where

$$W = \mu t + \frac{\mu\big(L + \mu/(1 - \mu/L)\big)}{12} + \mu\left[\ln\!\Big(t + \frac{L\big(1 + \mu/(L(1 - \mu/L))\big)}{12}\Big) - \ln\!\Big(\frac{L\big(1 + \mu/(L(1 - \mu/L))\big)}{12}\Big)\right].$$

The upper bound of $Y_t$ comes from the following fact. If we run Algorithm 1 with the stepsize $\eta'_t = \frac{2}{\mu t + 4L}$ for $t \ge 0$ from [16], then we have from [16] an expected convergence rate

$$Y'_t \le \frac{16N}{\mu}\,\frac{1}{\mu(t - T') + 4L},$$

where

$$T' = \frac{4L}{\mu}\,\max\Big\{\frac{L\mu Y_0}{N},\, 1\Big\} - \frac{4L}{\mu}.$$

Substituting

$$N = \frac{\mu\big(L + \frac{\mu}{1 - \mu/L}\big)}{6}\,\mathrm{Tr}(\Sigma) \quad \text{and} \quad Y_0 = \mathrm{Tr}(\Sigma)$$

yields $T' = \frac{20L}{\mu}$. Since $\eta_t$ is the optimal stepsize and $\eta'_t$ is not, $Y_t \le Y'_t$, i.e., we have

$$Y_t \le \frac{16N}{\mu}\,\frac{1}{\mu\big(t - \frac{20L}{\mu}\big) + 4L}.$$

Corollary 1. Given the class of objective functions analyzed in Theorem 1, we run Algorithm 1 and assume an oracle with access to $Y_t = \mathbb{E}[\|w_t - w_*\|^2]$ as well as the full gradient $\nabla F(w_t)$ at the $t$-th iteration. An exact expression for the optimal sequence of stepsizes $\eta_t$ based on $\mu, L, N$ and this extended oracle can be given. For this sequence of stepsizes, the same lower and upper bounds on the expected convergence rate $Y_t$ as in Theorem 1 hold.

Proof. The proof of this corollary is directly derived from the reason why we are allowed to transform (11) into (12), i.e., $\eta_t$ and $\xi_t$ must be independent to get (12) from (11). If the construction of $\eta_t$ does not depend on $\xi_t$ (or $\nabla f(w_t; \xi_t)$), then only $Y_t$ is required to construct the optimal stepsize $\eta_t$. This implies that the information of $\nabla F(w_t)$ is not useful, and we can borrow the proof of Theorem 1 to arrive at the result of this corollary.

Let us consider the set $\mathcal{F}$ of all possible objective functions $F$ which are $\mu$-strongly convex and $L$-smooth. For an objective function $F \in \mathcal{F}$, let $\gamma_t^F(\mu, L, U)$ be defined as the smallest expected convergence rate $Y_t$ that can be achieved by a stepsize construction $\{\eta_t\}$, where $\{\eta_t\}$ is computed as a function $h$ of $\mu, L$ and oracle $U$ with access to $N$, $Y_t$ and $\nabla F(w_t)$ at the $t$-th iteration. That is,

$$\gamma_t^F(\mu, L, U) = \inf_{\{\eta_t\} \in h(\mu, L, U)} Y_t(F, \{\eta_t\}),$$


where Yt is explicitly shown as a function of the objective function and sequence of step sizes.

Among the objective functions $F \in \mathcal{F}$, we consider the objective function $F$ which has the worst expected convergence rate at the $t$-th iteration. Let us denote the expected convergence rate $Y_t$ that corresponds to the worst objective function as $\gamma_t(\mu, L, U)$. Precisely,

$$\gamma_t(\mu, L, U) = \sup_{F \in \mathcal{F}} \gamma_t^F(\mu, L, U) = \sup_{F \in \mathcal{F}} \inf_{\{\eta_t\} \in h(\mu, L, U)} Y_t(F, \{\eta_t\}).$$

The lower bound and upper bound of $\gamma_t(\mu, L, U)$ are stated in Corollary 2.

Corollary 2. Given $\mu, L$ and oracle $U$ with access to $N$, $Y_t$ and $\nabla F(w_t)$ at the $t$-th iteration, the convergence rate of the worst strongly convex and smooth objective function $F$ with optimal stepsize $\eta_t$ based on $\mu, L$ and $U$ is $\gamma_t(\mu, L, U)$. The expected convergence rate $\gamma_t(\mu, L, U)$ satisfies the same lower bound on the expected convergence rate as in Theorem 1 where $Y_t$ is substituted by $\gamma_t(\mu, L, U)$. As an upper bound we have

$$\gamma_t(\mu, L, U) \le \frac{16N}{\mu}\,\frac{1}{\mu(t - T') + 4L},$$

where

$$T' = \frac{4L}{\mu}\,\max\Big\{\frac{L\mu Y_0}{N},\, 1\Big\} - \frac{4L}{\mu}$$

for

$$h(\mu, L, U) = \Big\{\eta_t = \frac{2}{\mu t + 4L}\Big\}.$$

Notice that the scheme $h$ for constructing step sizes is independent of oracle $U$; in other words, its knowledge is not needed.

Proof. Due to the definition of $\gamma_t(\mu, L, U)$, it is always larger than $\gamma_t^F(\mu, L, U)$ for all $F \in \mathcal{F}$. From Corollary 1 we infer that $\gamma_t^F(\mu, L, U)$ is larger than the lower bound $\frac{N}{2\mu}\frac{1}{W}$ as specified in Theorem 1. Since this holds for all $F \in \mathcal{F}$, it also holds for the supremum over $F \in \mathcal{F}$.

The upper bound follows from the result in [16], i.e., for any given $F$ and $\eta_t = \frac{2}{\mu t + 4L}$, we have

$$Y_t \le \frac{16N}{\mu}\,\frac{1}{\mu(t - T') + 4L},$$

where

$$T' = \frac{4L}{\mu}\,\max\Big\{\frac{L\mu Y_0}{N},\, 1\Big\} - \frac{4L}{\mu}.$$

The importance of Corollary 2 is that, for the worst objective function, we can now compute the gap between the lower bound and the upper bound, i.e., they are separated by a factor of 32. This implies that no scheme for constructing a sequence of stepsizes $\eta_t$ that is based on $\mu, L$, and oracle $U$ can achieve a better expected convergence rate than $O(1/t)$. The only way to achieve a better expected convergence rate is to use a scheme that has access to information beyond what is given by $\mu, L$, and oracle $U$.


For example, the construction of $\eta_t$ must depend on the information of the full gradient $\nabla F(w_t)$ as well as the stochastic gradient $\nabla f(w_t; \xi_t)$, or we have to develop a new update rule to replace the update rule of SGD (i.e., $w_{t+1} = w_t - \eta_t \nabla f(w_t; \xi_t)$).

Corollary 2 shows that the lower bound and the upper bound of $Y_t$ are $O(1/t)$ (see (9)). Furthermore, it offers a general strategy $\eta_t = \frac{2}{\mu t + 4L}$ for computing step sizes which only depends on $\mu$ and $L$ in order to realize the upper bound (which comes within a factor 32 of the lower bound). This means that we can finally conclude that there does not exist a significantly better construction for step sizes than $\eta_t = \frac{2}{\mu t + 4L}$ for classical SGD (not our extended SGD problem).

3 Numerical Experiments

We verify our theory by considering simulations with different values of sample size $n$ (1000, 10000, and 100000) and vector size $d$ (10, 100, and 1000). First, we generate $n$ vectors of size $d$ with mean $m$ and positive definite covariance matrix $\Sigma$. For simplicity, we generate $m \in \mathbb{R}^d$ and a diagonal matrix $\Sigma \in \mathbb{R}^{d \times d}$ by drawing each element of $m$ and each diagonal element of $\Sigma$ uniformly at random from $[0, 1]$. We performed 10 runs and report the average results.

Figure 1: Yt and its upper and lower bounds

We denote by the labels "Upper Y_t" (red line) and "Lower Y_t" (violet line) in Figure 1 the upper and lower bounds of $Y_t$ in (10) and (9), respectively; by "Y_t opt" (orange line) the $Y_t$ defined in Theorem 1 with the given information from the oracle; and by "Y_t" (green) the squared norm of the difference between $w_t$ and $w_*$, where $w_t$ is generated by Algorithm 1 with the learning rate in (13). We note that "Lower Y_t" and "Y_t opt" are very close to each other in Figure 1, and the difference between them is shown in Figure 2. Note that $Y_t$ in Figure 1 is computed as the average of 10 runs of $\|w_t - w_*\|^2$ (not exactly $\mathbb{E}[\|w_t - w_*\|^2]$).
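A minimal Python sketch of this experiment is given below. It draws the data as described above, computes $N$ and $Y_0$ from the formulas in the proof of Theorem 1, iterates the optimal-stepsize recursion (14) ("Y_t opt"), evaluates the lower bound (9) and upper bound (10), and runs a single SGD trajectory with $\eta_t = \frac{2}{\mu t + 4L}$ for comparison. The constants $\mu = 0.1$, $L = 1$ and the single run (instead of the average over 10 runs) are illustrative choices on our part, since the exact constants behind the figures are not listed.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 10                # one of the settings above
mu, L = 0.1, 1.0               # illustrative strong-convexity/smoothness constants

m = rng.random(d)                          # mean, entries uniform in [0, 1]
sigma_diag = rng.random(d)                 # diagonal of Sigma, entries uniform in [0, 1]
xi = m + np.sqrt(sigma_diag) * rng.standard_normal((n, d))   # samples ~ N(m, Sigma)

trace_sigma = sigma_diag.sum()
N = mu * (L + mu / (1.0 - mu / L)) / 6.0 * trace_sigma       # N from the proof of Theorem 1
Y0 = trace_sigma                                             # Y_0 = Tr(Sigma)
w_star = m                                                   # minimizer of F

def sample_s():
    if rng.random() < mu / L:
        return rng.uniform(0.0, L)
    return rng.uniform(0.0, mu / (1.0 - mu / L))

T = 50 * n
Y_opt = np.empty(T + 1); Y_opt[0] = Y0
w = xi[rng.integers(n)].copy()             # w_0 = xi (full GD on one component function)
Y_sgd = np.empty(T + 1); Y_sgd[0] = np.sum((w - w_star) ** 2)

for t in range(T):
    # "Y_t opt": recursion (14), i.e. the optimal stepsize (13) applied to Y_t
    Y_opt[t + 1] = Y_opt[t] - 2 * mu**2 * Y_opt[t]**2 / (N * (1 + Y_opt[t] / trace_sigma))
    # one SGD trajectory with eta_t = 2/(mu t + 4L)
    eta = 2.0 / (mu * t + 4.0 * L)
    w = w - eta * sample_s() * (w - xi[rng.integers(n)])
    Y_sgd[t + 1] = np.sum((w - w_star) ** 2)

ts = np.arange(T + 1)
A = L * (1 + mu / (L * (1 - mu / L))) / 12.0
W = mu * ts + mu * A + mu * (np.log(ts + A) - np.log(A))     # W as in Theorem 1
lower = N / (2 * mu) / W                                     # lower bound (9)
upper = np.full(T + 1, np.inf)                               # upper bound (10), valid for t >= T'
Tprime = 20 * L / mu
mask = ts >= Tprime
upper[mask] = 16 * N / mu / (mu * (ts[mask] - Tprime) + 4 * L)
print(lower[-1], Y_opt[-1], Y_sgd[-1], upper[-1])
```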


Figure 2: The difference between “Lower Y t” and “Y t opt” (n = 10000, d = 100)

Discussion: We have a vertical line at epoch 20 because we expect the upper bound in (10) to take effect when $t \ge T' = \frac{20L}{\mu}$. The "Upper Y_t" (red line), "Lower Y_t" (violet line) and "Y_t opt" (orange line) do not oscillate because they can be computed exactly using formulas (10), (9) and (14), respectively, i.e., these lines have no variation. The green line "Y_t" for stepsize $\eta_t = \frac{2}{\mu t + 4L}$ in Figure 1 oscillates because in our analysis we do not consider the variance of $Y_t$. As shown in (4), we have

$$\mathbb{E}[\|w_{t+1} - w_*\|^2] \le (1 - \mu\eta_t)\,\mathbb{E}[\|w_t - w_*\|^2] + \eta_t^2 N.$$

It is clear that a decrease in $\eta_t$ leads to a decrease of the variance of $Y_t$ (i.e., $\eta_t^2 N$). This fact is reflected in all subfigures in Figure 1. We expect that increasing $d$ and $n$ (the number of dimensions in the data and the number of data points) would increase the variance. Hence, we see that a larger $t$ is required to make the variance approach 0, as shown in Figure 1. We can see that when $t$ is sufficiently large, the optimality of $\eta_t = \frac{2}{\mu t + 4L}$ is clearly shown in Figure 1 when $n = 1000$ and $d = 10$, i.e., the green line is in between the red line (upper bound) and the violet line (lower bound). Moreover, these two bounds are pretty close to each other when $t$ is sufficiently large.

4 Conclusion

In this paper, we study the convergence of SGD. We show that for any given stepsize $\eta_t$ constructed based on $\mu$, $L$, $N$, and an oracle with access to $Y_t$ and $\nabla F(w_t)$ at the $t$-th iteration, the best possible lower bound of the convergence rate is $O(1/t)$. Note that this extends classical SGD where only $\mu$ and $L$ are given for the construction of $\eta_t$. This result implies that the best possible lower bound of the convergence rate for any possible stepsize $\eta_t$ based on $\mu$ and $L$ is $O(1/t)$. This result confirms the optimality of the proposed stepsize $\eta_t = \frac{2}{\mu t + 4L}$ for $t \ge 0$ in [16]. Compared to the result in [1], our proposed class of objective functions is simple and does not require many assumptions for the sake of the proof. Also, our lower bound is orders of magnitude tighter, as it is the first lower bound that is independent of the dimension $d$. In addition, [1] does not study the lower bound of the extended problem of SGD.

References

[1] Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, and Martin J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. 2010.

[2] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.

[3] Leon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv:1606.04838, 2016.


[4] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[5] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, NY, USA, 1991.

[6] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, pages 1646–1654, 2014.

[7] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics, 2nd edition, 2009.

[8] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323, 2013.

[9] Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pages 2663–2671, 2012.

[10] Remi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. Improved asynchronous parallel optimization analysis for stochastic incremental methods. arXiv preprint arXiv:1801.03749, 2018.

[11] Lucien Le Cam et al. Convergence of estimates under dimensionality restrictions. The Annals of Statistics, 1(1):38–53, 1973.

[12] Eric Moulines and Francis R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.

[13] Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983.

[14] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization. Kluwer Academic Publishers, Boston, Dordrecht, London, 2004.

[15] Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Takac. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In ICML, 2017.

[16] Lam M. Nguyen, Phuong Ha Nguyen, Marten van Dijk, Peter Richtarik, Katya Scheinberg, and Martin Takac. SGD and Hogwild! Convergence without the bounded gradients assumption. In ICML, 2018.

[17] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, 2nd edition, 2006.

[18] Maxim Raginsky and Alexander Rakhlin. Information-based complexity, feedback and dynamics in convex programming. IEEE Trans. Information Theory, 57(10):7036–7056, 2011.

[19] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

[20] Marten van Dijk, Lam Nguyen, Phuong Ha Nguyen, and Dzung Phan. Characterization of convex objective functions and optimal expected convergence rates for SGD. arXiv preprint, 2018.

[21] Bin Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435. Springer, 1997.


A Related Work

In [1], the authors showed that the lower bound of $Y_t$ is $O(1/t)$ under a bounded gradient assumption for the objective function $F$ over a convex set $S$. To show the lower bound, the authors use the following three assumptions for the objective function $F$:

1. The assumption of a strongly convex objective function, i.e., Assumption 1 (see Definition 3 in [1]).

2. There exists a bounded convex set $S \subset \mathbb{R}^d$ such that

$$\mathbb{E}[\|\nabla f(w;\xi)\|^2] \le \sigma^2$$

for all $w \in S \subset \mathbb{R}^d$ (see Definition 1 in [1]). Notice that this is not the same as the bounded gradient assumption where $S = \mathbb{R}^d$ is unbounded.

3. The objective function $F$ is a convex Lipschitz function, i.e., there exists a positive number $K$ such that

$$|F(w) - F(w')| \le K\|w - w'\|, \quad \forall w, w' \in S \subset \mathbb{R}^d.$$

We notice that this assumption actually implies the assumption on bounded gradients as stated above.

On the existence of the assumption of a bounded convex set $S \subset \mathbb{R}^d$ where SGD converges: Let us restate the example in [16], i.e., $F(w) = \frac{1}{2}(f_1(w) + f_2(w))$ where $f_1(w) = \frac{1}{2}w^2$ and $f_2(w) = w$. It is obvious that $F$ is strongly convex but $f_1$ and $f_2$ are not. Let $w_0 = 0 \in S$. For any number $t \ge 0$, with probability $\frac{1}{2^t}$ the steps of the SGD algorithm for all $i < t$ are $w_{i+1} = w_i - \eta_i$. This implies that $w_t = -\sum_{i=0}^{t-1}\eta_i$. Since $\sum_{i=0}^{\infty}\eta_i = \infty$, $w_t$ will escape the set $S$ when $t$ is sufficiently large. Hence, we have the following conclusion: $S$ must be unbounded, e.g., $S = \mathbb{R}^d$, if there is at least one component function $f$ of $F$ which is not strongly convex. This result implies that $S$ is a proper subset of $\mathbb{R}^d$ if and only if all component functions $f$ of $F$ are strongly convex.

If S is Rd, we have the following results:

On the non-coexistence of the assumption of a bounded gradient over $\mathbb{R}^d$ and the assumption of strong convexity: As pointed out in [16], the bounded gradient assumption does not co-exist with the strongly convex assumption. As shown in [16], for any $w \in \mathbb{R}^d$, we have

$$2\mu[F(w) - F(w_*)] \overset{(6)}{\le} \|\nabla F(w)\|^2 = \|\mathbb{E}[\nabla f(w;\xi)]\|^2 \le \mathbb{E}[\|\nabla f(w;\xi)\|^2] \le \sigma^2.$$

Therefore,

$$F(w) \le \frac{\sigma^2}{2\mu} + F(w_*), \quad \forall w \in \mathbb{R}^d.$$

Note that, from Assumption 1 and $\nabla F(w_*) = 0$, we have

$$F(w) \ge \frac{\mu}{2}\|w - w_*\|^2 + F(w_*), \quad \forall w \in \mathbb{R}^d.$$

Clearly, the last two inequalities contradict each other for sufficiently large $\|w - w_*\|^2$. Precisely, only when $\sigma$ is equal to $\infty$ can the assumption of bounded gradients and the assumption of strong convexity of $F$ co-exist.


However, $\sigma$ cannot be $\infty$, and this result implies that there does not exist any objective function $F$ that satisfies the assumption of bounded gradients over $\mathbb{R}^d$ and the assumption of having a strongly convex objective function at the same time.

On the non-coexistence of the assumption of being convex Lipschitz over $\mathbb{R}^d$ and the assumption of being strongly convex: Moreover, we can also show that the assumption of a convex Lipschitz function does not co-exist with the assumption of being strongly convex. As shown in Section 2.3 in [1], the Lipschitz function assumption implies that $\|\nabla F(w)\| \le K$, $\forall w \in \mathbb{R}^d$. Hence, by using the same argument as in the analysis of the non-coexistence of the bounded gradient assumption and the strongly convex assumption, we can conclude that these two assumptions cannot co-exist. In other words, there does not exist an objective function $F$ which satisfies the assumption of being a convex Lipschitz function and the assumption of being strongly convex at the same time.

A.1 Discussion on the usage of Assumptions in [1]

As stated in Section 3 and Section 4.1.1 in [1], the authors construct a class of strongly convex Lipschitz objective functions $F$ which has $K = \sigma$. The authors showed that the problem of convex optimization for the constructed class of objective functions $F$ is at least as hard as estimating the biases of $d$ independent coins (i.e., the problem of estimating parameters of Bernoulli variables). As one additional important assumption to prove the lower bound of the SGD algorithm, the authors assume the existence of stepsizes $\eta_t$ which make the SGD algorithm converge for a given objective function $F$ under the three aforementioned assumptions (see Lemma 2 in [1]). Note that the proof of the lower bound of $Y_t$ of SGD is described in Theorem 2 in [1] and Theorem 2 uses their Lemma 2. If their Lemma 2 is not valid, then the proof of the lower bound of $Y_t$ in Theorem 2 is also not valid.

Actually, the authors in [1] do not require all the component functions of the objective function $F$ to be strongly convex. Given the proof strategy of the convergence of SGD, one may require that the convex set $S$ where $F$ has all these nice properties must be $\mathbb{R}^d$, as explained above. This, however, will lead to the non-coexistence of the bounded gradient assumption and the strongly convex assumption and the non-coexistence of the Lipschitz function assumption and the strongly convex assumption as discussed above. In this case, their Lemma 2 is not valid because of the non-existence of such an objective function $F$, in which case the proof of the lower bound of $Y_t$ in Theorem 2 is not correct.

However, we explain why the setup as proposed in [1] is still acceptable and leads to a proper lower bound: the paper assumes that we only restrict the analysis of SGD to a bounded convex set $S$ which is not $\mathbb{R}^d$, and only in this bounded set $S$ do we assume that the objective function acts like a Lipschitz function (implying bounded gradients in $S$).

There are two possible cases at the $t$-th iteration of the SGD algorithm: the algorithm diverges or converges. Let us define $p = \Pr(w_t \notin S)$. Hence, $\Pr(w_t \in S) = 1 - p$. Let

$$Y_t^{\mathrm{conv}} = \mathbb{E}[\|w_t - w_*\|^2 \mid w_t \in S]$$

and

$$Y_t^{\mathrm{div}} = \mathbb{E}[\|w_t - w_*\|^2 \mid w_t \notin S].$$


Since $Y_t = \mathbb{E}[\|w_t - w_*\|^2]$, $Y_t$ is equal to

$$Y_t = p \cdot Y_t^{\mathrm{div}} + (1 - p)\cdot Y_t^{\mathrm{conv}} \ge p \cdot Y_t^{\mathrm{conv}} + (1 - p)\cdot Y_t^{\mathrm{conv}} \ge Y_t^{\mathrm{conv}} \ge \text{lower bound in [1]}.$$

The above derivation hinges on the first inequality, where we assume $Y_t^{\mathrm{div}} \ge Y_t^{\mathrm{conv}}$. Typically, it is always true that $Y_t^{\mathrm{div}} \ge Y_t^{\mathrm{conv}}$ because $w_t$ gets far from $w_*$ in the divergence case and it gets close to $w_*$ in the convergence case. Of course, a proper proof of this property is still needed in order to rigorously complete the argument leading to the lower bound in [1].

The above result is interesting because now we only need to prove the convergence of SGD in a certain convex set $S$ with a certain probability $p$. This is completely different from the proof of convergence of SGD in the general case as in [12] and [16], where we need to prove it with probability 1.

We describe the setup of the class of strongly convex functions proposed in [1] and then we show that our result is much tighter when compared to their result.

A.2 Setup

As shown in Section 4.1.1 of [1], the following two sets are required.

1. A subset $\mathcal{V} \subset \{-1,+1\}^d$ with $\mathcal{V} = \{\alpha_1, \ldots, \alpha_M\}$ and $\Delta_H(\alpha_j, \alpha_k) \ge \frac{d}{4}$ for all $j \ne k$, where $\Delta_H$ denotes the Hamming metric, i.e., $\Delta_H(\alpha, \beta) := \sum_{i=1}^{d}\mathbb{I}[\alpha_i \ne \beta_i]$. As discussed by the authors, $|\mathcal{V}| = M \ge (2/\sqrt{e})^{d/2}$.

2. A subset $\mathcal{F}_{\mathrm{base}} = \{f_i^+, f_i^-,\ i = 1, \ldots, d\}$ where $f_i^+, f_i^-$ will be designed depending on the problem at hand.

Given $\mathcal{V}$, $\mathcal{F}_{\mathrm{base}}$ and a constant $\delta \in (0, \frac{1}{4}]$, we define the function class $\mathcal{F}(\delta) := \{F_\alpha,\ \alpha \in \mathcal{V}\}$ where

$$F_\alpha(w) := \frac{c}{d}\sum_{i=1}^{d}\big\{(1/2 + \alpha_i\delta)f_i^+(w) + (1/2 - \alpha_i\delta)f_i^-(w)\big\}. \qquad (16)$$

The set $\mathcal{F}_{\mathrm{base}}$ and the constant $c$ are chosen in such a way that $\mathcal{F}(\delta) \subset \mathcal{F}$, where $\mathcal{F}$ is the class of strongly convex objective functions defined over the set $S$ that satisfies all the assumptions mentioned before. In case $\mathcal{F}$ is the class of strongly convex functions, the key idea for computing the lower bound of SGD proposed in [1], by applying Fano's inequality [21] and Le Cam's bound [5, 11], is as follows: if an SGD algorithm $\mathcal{M}_t$ works well for optimizing a given function $F_{\alpha^*}$, $\alpha^* \in \mathcal{V}$, with a given oracle $U$, then there exists a hypothesis test finding $\alpha$ such that:

$$\frac{1}{3} \ge \Pr_U[\hat{\alpha}(\mathcal{M}_t) \ne \alpha] \ge 1 - 2\,\frac{16\,d\,t\,\delta^2 + \log(2)}{d\,\log(2/\sqrt{e})}. \qquad (17)$$


From (17), we have

$$\frac{16\,d\,t\,\delta^2 + \log(2)}{d\,\log(2/\sqrt{e})} \approx \frac{16\,d\,t\,\delta^2}{d\,\log(2/\sqrt{e})} \ge 2/3.$$

Hence,

$$t \ge \frac{\log(2/\sqrt{e})}{48}\,\frac{1}{\delta^2}. \qquad (18)$$

As shown in Section 4.3 of [1], to proceed with the proof, we set $Y_t = \frac{c\delta^2 r^2}{18(1-\theta)}$. Plugging this into (18) yields

$$Y_t \ge \frac{1}{t}\,\frac{\log(2/\sqrt{e})}{864}\,\frac{c\,r^2}{1-\theta}. \qquad (19)$$

In addition, for the proof of the lower bound we also need to set $c = \frac{L\,d\,r}{d^{1/p}}$ and $\mu^2 = \frac{L\,r}{d^{1/p}}(1-\theta)$, where $S = \mathbb{B}_\infty(r)$. By substituting $c$ and $\mu^2$ into (19), we obtain:

$$Y_t \ge \frac{1}{t}\,\frac{\log(2/\sqrt{e})}{864\,d}\,\frac{c^2 r^2}{\mu^2}. \qquad (20)$$

To complete the description of the setup in [1], we briefly describe the proposed oracle $U$ which outputs some information to the SGD algorithm at each iteration for constructing the stepsize $\eta_t$. There are two types of oracle $U$, defined as follows.

1. Oracle $U_A$: 1-dimensional unbiased gradients.

   (a) Pick an index $i \in \{1, \ldots, d\}$ uniformly at random.

   (b) Draw $b_i \in \{0, 1\}$ according to a Bernoulli distribution with parameter $1/2 + \alpha_i\delta$.

   (c) For the given input $x \in S$, return the value $f_i$ and subgradient $\nabla f_i$ of the function

   $$f_{i,A} := c\,[b_i f_i^+ + (1 - b_i) f_i^-].$$

2. Oracle $U_B$: $d$-dimensional unbiased gradients.

   • For $i = 1, \ldots, d$, draw $b_i \in \{0, 1\}$ according to a Bernoulli distribution with parameter $1/2 + \alpha_i\delta$.

   • For the given input $x \in S$, return the value $f_i$ and subgradient $\nabla f_i$ of the function

   $$f_{i,B} := \frac{c}{d}\sum_{i=1}^{d}[b_i f_i^+ + (1 - b_i) f_i^-].$$

A.3 Analysis and Comparison

In this section, we want to compare our lower bound in Corollary 2 (i.e., $\frac{N}{2\mu^2 t}$) with the one in (20) when $t$ is sufficiently large. In order to do this, we need to compute $N = 2\mathbb{E}[\|\nabla f(w_*;\xi)\|^2]$ for the strongly convex function class proposed in [1].


For the strongly convex case, the authors defined the base functions as follows. Given a parameter $\theta \in [0, 1)$, we have

$$f_i^+(w) = r\theta|w_i + r| + \frac{1-\theta}{4}(w_i + r)^2, \qquad f_i^-(w) = r\theta|w_i - r| + \frac{1-\theta}{4}(w_i - r)^2,$$

where $w = (w_1, \ldots, w_d)$. Let $e_i$ be $(1/2 + \alpha_i\delta)$. Substituting $e_i$ into (16) yields $F_\alpha(w) = \frac{1}{d}\big[\sum_{i=1}^{d} f_{\alpha,i}(w)\big]$ where $f_{\alpha,i}(w) = c[e_i f_i^+(w) + (1 - e_i) f_i^-(w)]$. Due to the construction of $F_\alpha$, the definition of $f_{\alpha,i}(w)$ and the construction of oracle $U_A$ or oracle $U_B$, the minimizer $w_*$ of $F_\alpha$ can be found by first finding each $w_i^*$ for each $f_{\alpha,i}(w)$. Precisely, we have the following cases:

1. $w_i < -r$: we have

   • $f_{\alpha,i}(w) = -r\theta(w_i + r)e_i + \frac{1-\theta}{4}(w_i + r)^2 e_i - r\theta(w_i - r)(1 - e_i) + \frac{1-\theta}{4}(w_i - r)^2(1 - e_i)$.

   • $\nabla f_{\alpha,i}(w) = (1-\theta)e_i r - \frac{1+\theta}{2}r + \frac{1-\theta}{2}w_i$.

   • $\nabla f_{\alpha,i}(w) = 0$ at $w_i^{-r} = r\big[1 - 2e_i + \frac{2\theta}{1-\theta}\big]$.

2. $-r \le w_i \le r$: we have

   • $f_{\alpha,i}(w) = r\theta(w_i + r)e_i + \frac{1-\theta}{4}(w_i + r)^2 e_i - r\theta(w_i - r)(1 - e_i) + \frac{1-\theta}{4}(w_i - r)^2(1 - e_i)$.

   • $\nabla f_{\alpha,i}(w) = (1+\theta)e_i r - \frac{1+\theta}{2}r + \frac{1-\theta}{2}w_i$.

   • $\nabla f_{\alpha,i}(w) = 0$ at $w_i^{[-r,r]} = r\,\frac{1+\theta}{1-\theta}(1 - 2e_i)$.

3. $r \le w_i \le \infty$: we have

   • $f_{\alpha,i}(w) = r\theta(w_i + r)e_i + \frac{1-\theta}{4}(w_i + r)^2 e_i + r\theta(w_i - r)(1 - e_i) + \frac{1-\theta}{4}(w_i - r)^2(1 - e_i)$.

   • $\nabla f_{\alpha,i}(w) = (1-\theta)e_i r + \frac{3\theta-1}{2}r + \frac{1-\theta}{2}w_i$.

   • $\nabla f_{\alpha,i}(w) = 0$ at $w_i^{r} = r\big[1 - 2e_i - \frac{2\theta}{1-\theta}\big]$.

Now we have five important points, $w_i^{-r}$, $w_i^{[-r,r]}$, $w_i^{r}$, $-r$ and $r$, and at these points $F_\alpha$ can attain its minimum. We consider the following cases:

1. $\alpha_i = -1$ and then $e_i = \frac{1}{2} + \alpha_i\delta = \frac{1}{2} - \delta$ where $\delta \in [0, 1/4)$; we have

   • $w_i^{-r} = r\big[\frac{2\theta}{1-\theta} + 2\delta\big] > -r$.

   • $w_i^{[-r,r]} = r\,\frac{1+\theta}{1-\theta}(2\delta)$. In this case $w_i^{[-r,r]}$ may belong to $[-r, r]$ or it may be greater than $r$.

   • $w_i^{r} = r\big(2\delta - \frac{2\theta}{1-\theta}\big) < r$.

   This result implies that $F_\alpha$ is minimized at $w_i^* = r$ with $\nabla f_{\alpha,i}(w_*) = cr[(1-\theta)e_i + \theta] = cr[(1-\theta)(1/2 - \delta) + \theta]$, or it can be minimized at $w_i^{[-r,r]}$ if $w_i^{[-r,r]} \in [-r, r]$, in which case $\nabla f_{\alpha,i}(w_*) = 0$.

2. $\alpha_i = +1$ and then $e_i = \frac{1}{2} + \alpha_i\delta = \frac{1}{2} + \delta$ where $\delta \in [0, 1/4)$; we have

   • $w_i^{-r} = r\big[\frac{2\theta}{1-\theta} - 2\delta\big]$. Since $\frac{2\theta}{1-\theta} - 2\delta > -1$ when $\delta \in [0, 1/4)$ and $\theta \in [0, 1)$, we have $w_i^{-r} > -r$.


   • $w_i^{[-r,r]} = r\,\frac{1+\theta}{1-\theta}(-2\delta) < 0$. In this case $w_i^{[-r,r]}$ may belong to $[-r, r]$ or it may be smaller than $-r$.

   • $w_i^{r} = r\big(-2\delta - \frac{2\theta}{1-\theta}\big) < r$.

   This result implies that $F_\alpha$ is minimized at $w_i^* = -r$ with $\nabla f_{\alpha,i}(w_*) = cr[(1-\theta)e_i - 1] = cr[(1-\theta)(1/2 + \delta) - 1]$, or it can be minimized at $w_i^{[-r,r]}$ if $w_i^{[-r,r]} \in [-r, r]$, in which case $\nabla f_{\alpha,i}(w_*) = 0$.

By definition, we have

$$N = 2\,\mathbb{E}[\|\nabla f_i(w_*)\|^2] = \frac{2}{d}\sum_{i=1}^{d}\big[e_i\|c\nabla f_i^+(w_*)\|^2 + (1 - e_i)\|c\nabla f_i^-(w_*)\|^2\big] \ge 2\min_i\big[e_i\|c\nabla f_i^+(w_*)\|^2 + (1 - e_i)\|c\nabla f_i^-(w_*)\|^2\big].$$

From the analysis above, we have four possible values of $w_i^*$, i.e., $-r$, $r$, $r\frac{1+\theta}{1-\theta}(-2\delta)$ and $r\frac{1+\theta}{1-\theta}(2\delta)$. If we plug in a $w_*$ which has $w_i^* = -r$ or $w_i^* = r$, then we have $[e_i\|c\nabla f_i^+(w_*)\|^2 + (1 - e_i)\|c\nabla f_i^-(w_*)\|^2] = (1/2 - \delta)c^2 r^2$. For a $w_*$ which has $w_i^* = r\frac{1+\theta}{1-\theta}(-2\delta)$ or $r\frac{1+\theta}{1-\theta}(2\delta)$, we have $[e_i\|c\nabla f_i^+(w_*)\|^2 + (1 - e_i)\|c\nabla f_i^-(w_*)\|^2] = (1/4 - \delta^2)(1+\theta)^2 c^2 r^2$. In all cases, it is obvious that

$$N \ge 2\min_i\big[e_i\|c\nabla f_i^+(w_*)\|^2 + (1 - e_i)\|c\nabla f_i^-(w_*)\|^2\big] \ge \frac{1}{8}c^2 r^2, \quad \forall\,\delta \in [0, 1/4),\ \theta \in [0, 1).$$
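As a quick numerical sanity check of this bound, using only the two candidate values just derived, one can verify that both $2(1/2 - \delta)$ and $2(1/4 - \delta^2)(1+\theta)^2$ stay above $1/8$ over the stated ranges of $\delta$ and $\theta$; the grid resolution below is arbitrary.

```python
import numpy as np

deltas = np.linspace(0.0, 0.25, 251, endpoint=False)   # delta in [0, 1/4)
thetas = np.linspace(0.0, 1.0, 401, endpoint=False)    # theta in [0, 1)
D, Th = np.meshgrid(deltas, thetas)

# Two candidate values (divided by c^2 r^2) of
# 2 [ e_i ||c grad f_i^+(w_*)||^2 + (1 - e_i) ||c grad f_i^-(w_*)||^2 ]
boundary_case = 2 * (0.5 - D)                    # w_i^* = -r or w_i^* = r
interior_case = 2 * (0.25 - D**2) * (1 + Th)**2  # interior stationary point in [-r, r]

print(min(boundary_case.min(), interior_case.min()) >= 1.0 / 8.0)   # prints True
```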

Since we want to compare our lower bound with the lower bound in [1], we need to compare our lower bound with the largest possible value of the lower bound in [1]. So, we give an advantage to the lower bound in [1] by assuming $N = \frac{1}{8}c^2 r^2$. Hence, substituting $N = \frac{1}{8}c^2 r^2$ into (20) yields

$$Y_t \ge \frac{\log(2/\sqrt{e})}{108\,d}\,\frac{N}{\mu^2 t}. \qquad (21)$$

Clearly, the lower bound in (21) is much smaller than ours in Corollary 2, i.e., $\frac{N}{2\mu^2 t}$, when $t$ is sufficiently large. Moreover, this lower bound depends on $1/d$ and it becomes smaller when $d$ increases. This implies that our lower bound in this paper is much better than the one in (21), i.e., it is much tighter.
