
Stochastic Optimization for Large-scale Machine Learning

Lijun Zhang

LAMDA group, Nanjing University, China
http://cs.nju.edu.cn/zlj

The 13th Chinese Workshop on Machine Learning and Applications

Outline

1 Introduction: Definition, Advantages
2 Time Reduction: Background, Mixed Gradient Descent
3 Space Reduction: Background, Stochastic Proximal Gradient Descent
4 Conclusion

What is Stochastic Optimization? (I)

Definition 1: A Special Objective [Nemirovski et al., 2009]

    min_{w∈W} f(w) = E_ξ[ℓ(w, ξ)] = ∫_Ξ ℓ(w, ξ) dP(ξ)

where ξ is a random variable, and it is possible to generate an i.i.d. sample ξ1, ξ2, . . .

Risk Minimization

    min_{w∈W} f(w) = E_{(x,y)}[ℓ(w, (x, y))]

ℓ(·, ·) is a loss function, e.g., the hinge loss ℓ(u, v) = max(0, 1 − uv)
Training data (x1, y1), . . . , (xn, yn) are i.i.d.

What is Stochastic Optimization? (II)

Definition 2: A Special Access Model [Hazan and Kale, 2011]

    min_{w∈W} f(w)

There exists a stochastic oracle that produces an unbiased gradient o(·):

    E[o(w)] ∈ ∂f(w)  or  E[o(w)] = ∇f(w)

Empirical Risk Minimization

    min_{w∈W} f(w) = (1/n) ∑_{i=1}^{n} ℓ(y_i, x_i⊤w)

Sampling (x_t, y_t) from {(x_i, y_i)}_{i=1}^{n} uniformly at random gives

    E[∂ℓ(y_t, x_t⊤w)] ⊆ ∂[(1/n) ∑_{i=1}^{n} ℓ(y_i, x_i⊤w)]
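
As a concrete illustration of this access model, the following NumPy sketch (not from the talk; the logistic loss, random data, and sample counts are illustrative assumptions) checks numerically that the gradient at one uniformly sampled example is an unbiased estimate of the full empirical-risk gradient, i.e., that uniform sampling implements the stochastic oracle above.

import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
w = rng.standard_normal(d)

def logistic_grad(w, x, y):
    # Gradient of log(1 + exp(-y * x^T w)) with respect to w.
    return -y * x / (1.0 + np.exp(y * (x @ w)))

# Full gradient of f(w) = (1/n) sum_i loss_i(w).
full_grad = np.mean([logistic_grad(w, X[i], y[i]) for i in range(n)], axis=0)

# Stochastic oracle o(w): gradient at one uniformly sampled example.
def oracle(w):
    i = rng.integers(n)
    return logistic_grad(w, X[i], y[i])

# Averaging many oracle calls approaches the full gradient, since E[o(w)] = ∇f(w).
est = np.mean([oracle(w) for _ in range(20000)], axis=0)
print(np.linalg.norm(est - full_grad))  # small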

Why Stochastic Optimization? (I)

[Infographic on the growth of data: https://infographiclist.files.wordpress.com/2011/09/world-of-data.jpeg]

Why Stochastic Optimization? (II)

Empirical Risk Minimization

    min_{w∈W⊆R^d} f(w) = (1/n) ∑_{i=1}^{n} ℓ(y_i, x_i⊤w)

n is the number of training examples; d is the dimensionality.

Deterministic Optimization: Gradient Descent (GD)

1: for t = 1, 2, . . . , T do
2:     w′_{t+1} = w_t − η_t (1/n) ∑_{i=1}^{n} ∇ℓ(y_i, x_i⊤w_t)
3:     w_{t+1} = Π_W(w′_{t+1})
4: end for

The Challenge

Time complexity per iteration: O(nd) + O(poly(d))

Why Stochastic Optimization? (III)

Empirical Risk Minimization

    min_{w∈W⊆R^d} f(w) = (1/n) ∑_{i=1}^{n} ℓ(y_i, x_i⊤w)

Stochastic Optimization: Stochastic Gradient Descent (SGD)

1: for t = 1, 2, . . . , T do
2:     Sample a training example (x_t, y_t) randomly
3:     w′_{t+1} = w_t − η_t ∇ℓ(y_t, x_t⊤w_t)
4:     w_{t+1} = Π_W(w′_{t+1})
5: end for

The Advantage: Time Reduction

Time complexity per iteration: O(d) + O(poly(d))
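
To make the O(nd) versus O(d) per-iteration costs concrete, here is a minimal sketch assuming a least-squares loss, an unconstrained domain (the projection step is omitted), and made-up problem sizes and step sizes; it is an illustration, not the talk's code.

import numpy as np

rng = np.random.default_rng(1)
n, d = 100000, 50
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def gd_step(w, eta):
    # One GD step touches all n examples: O(nd) time.
    grad = X.T @ (X @ w - y) / n
    return w - eta * grad

def sgd_step(w, eta):
    # One SGD step touches a single example: O(d) time.
    i = rng.integers(n)
    grad = (X[i] @ w - y[i]) * X[i]
    return w - eta * grad

w_gd = gd_step(np.zeros(d), eta=0.1)      # one full-gradient step, for comparison
w = np.zeros(d)
for t in range(1, 101):
    w = sgd_step(w, eta=1.0 / np.sqrt(t))  # decreasing step size, as on the slide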

Why Stochastic Optimization? (IV)

Optimization over Large Matrices

    min_{W∈W⊆R^{m×n}} F(W)

Examples: matrix completion, distance metric learning, multi-task learning, multi-class learning

Deterministic Optimization: Gradient Descent (GD)

1: for t = 1, 2, . . . , T do
2:     W′_{t+1} = W_t − η_t ∇F(W_t)
3:     W_{t+1} = Π_W(W′_{t+1})
4: end for

The Challenge

Space complexity per iteration: O(mn)

Why Stochastic Optimization? (V)

Optimization over Large Matrices

    min_{W∈W⊆R^{m×n}} F(W)

Stochastic Optimization: Stochastic Gradient Descent (SGD)

1: for t = 1, 2, . . . , T do
2:     Generate a low-rank stochastic gradient G_t of F(·) at W_t
3:     W′_{t+1} = W_t − η_t G_t
4:     W_{t+1} = Π_W(W′_{t+1})
5: end for

The Advantage: Space Reduction

Space complexity per iteration can be as low as O((m + n)r), where r bounds the rank of the stochastic gradients and iterates.
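
A small illustration of why low-rank stochastic gradients enable space reduction, under assumptions that are not from the talk (a rank-one gradient built from a single sampled matrix entry, iterates kept in factored form U V⊤): nothing of size m × n is ever materialized.

import numpy as np

rng = np.random.default_rng(2)
m, n, r = 5000, 4000, 10

# Keep the iterate in factored form W_t = U @ V.T: O((m + n) r) memory.
U = rng.standard_normal((m, r)) / np.sqrt(r)
V = rng.standard_normal((n, r)) / np.sqrt(r)

# A rank-1 stochastic gradient G_t = u v^T, represented only by its factors.
i, j = rng.integers(m), rng.integers(n)
residual = U[i] @ V[j]            # stand-in for dF/dW at entry (i, j); illustrative
u = np.zeros(m); u[i] = m * n * residual
v = np.zeros(n); v[j] = 1.0

# The update W_{t+1} = W_t - eta * G_t never forms an m-by-n matrix:
eta = 0.01
U_new = np.hstack([U, -eta * u[:, None]])  # rank grows by one here and would be
V_new = np.hstack([V, v[:, None]])         # re-compressed later, e.g. by truncated SVD
print(U_new.shape, V_new.shape)            # (m, r+1), (n, r+1): O((m + n) r) storage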

Stochastic Gradient Descent (SGD)

Empirical Risk Minimization

    min_{w∈W⊆R^d} f(w) = (1/n) ∑_{i=1}^{n} ℓ(y_i, x_i⊤w)

The Algorithm

1: for t = 1, 2, . . . , T do
2:     Sample a training example (x_t, y_t) randomly
3:     w′_{t+1} = w_t − η_t ∇ℓ(y_t, x_t⊤w_t)
4:     w_{t+1} = Π_W(w′_{t+1})
5: end for

Advantage vs. Disadvantage

Time complexity per iteration is low: O(d) + O(poly(d))
The iteration complexity is much higher than that of GD

The Problem

Iteration Complexity

The number of iterations T needed to ensure f(w_T) − min_{w∈W} f(w) ≤ ε.

Iteration Complexity of GD and SGD (with ε = 10^-6 as a running example)

        Convex & Smooth            Strongly Convex & Smooth
GD      O(1/√ε)      ≈ 10^3        O(√κ log(1/ε))    ≈ 6√κ
SGD     O(1/ε²)      ≈ 10^12       O(1/(µε))         ≈ 10^6/µ

Total Time Complexity of GD and SGD (with ε = 10^-6)

        Convex & Smooth            Strongly Convex & Smooth
GD      O(n/√ε)      ≈ 10^3 n      O(n√κ log(1/ε))   ≈ 6n√κ
SGD     O(1/ε²)      ≈ 10^12       O(1/(µε))         ≈ 10^6/µ

SGD may be slower than GD if ε is very small: for instance, with n = 10^6 training examples and ε = 10^-6, GD needs about 10^9 gradient evaluations in the convex and smooth case, while SGD needs about 10^12.

Motivations

Reason for the Slow Convergence Rate

The step size of SGD is a decreasing sequence:
η_t = 1/√t for convex functions
η_t = 1/t for strongly convex functions

Reason for the Decreasing Step Size

    w′_{t+1} = w_t − η_t ∇ℓ(y_t, x_t⊤w_t)

Stochastic gradients introduce a constant-level error: their variance does not vanish as w_t approaches the optimum.

The Key Idea

Control the variance of the stochastic gradients, so that a fixed step size η can be chosen.

Mixed Gradient Descent (I)

Mixed Gradient at w_t

    m(w_t) = ∇ℓ(y_t, x_t⊤w_t) − ∇ℓ(y_t, x_t⊤w_0) + ∇f(w_0)

where (x_t, y_t) is a random sample, w_0 is an initial solution, and

    ∇f(w_0) = (1/n) ∑_{i=1}^{n} ∇ℓ(y_i, x_i⊤w_0)

Properties of the Mixed Gradient

It is still an unbiased estimate of the true gradient:

    E[m(w_t)] = (1/n) ∑_{i=1}^{n} ∇ℓ(y_i, x_i⊤w_t) = ∇f(w_t)

Its variance is controlled by the distance to w_0 (for an L-smooth loss):

    ‖∇ℓ(y_t, x_t⊤w_t) − ∇ℓ(y_t, x_t⊤w_0)‖_2 ≤ L‖w_t − w_0‖_2

Mixed Gradient Descent (II)

The Algorithm [Zhang et al., 2013]

1: Compute the true gradient at w_0: ∇f(w_0) = (1/n) ∑_{i=1}^{n} ∇ℓ(y_i, x_i⊤w_0)
2: for t = 1, 2, . . . , T do
3:     Sample a training example (x_t, y_t) randomly
4:     Compute the mixed gradient m(w_t) = ∇ℓ(y_t, x_t⊤w_t) − ∇ℓ(y_t, x_t⊤w_0) + ∇f(w_0)
5:     w′_{t+1} = w_t − η m(w_t)
6:     w_{t+1} = Π_W(w′_{t+1})
7: end for
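
A minimal runnable sketch of the loop above, assuming a least-squares loss, an unconstrained domain (projection omitted), and illustrative data sizes and step size; it is not the authors' implementation.

import numpy as np

rng = np.random.default_rng(3)
n, d = 2000, 20
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.01 * rng.standard_normal(n)

def grad_i(w, i):
    # Gradient of the single-example loss (1/2)(x_i^T w - y_i)^2.
    return (X[i] @ w - y[i]) * X[i]

def full_grad(w):
    return X.T @ (X @ w - y) / n

w0 = np.zeros(d)
g0 = full_grad(w0)            # one pass over the data: the true gradient at w_0
w, eta = w0.copy(), 0.005     # fixed step size, as the key idea suggests
for t in range(2000):
    i = rng.integers(n)
    mixed = grad_i(w, i) - grad_i(w0, i) + g0   # unbiased; low variance near w_0
    w = w - eta * mixed

print(np.linalg.norm(full_grad(w)))  # gradient norm after the stochastic stage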

Theoretical Guarantees

Theorem 1 ([Zhang et al., 2013])

Suppose the objective function is smooth and strongly convex. To find an ε-optimal solution, mixed gradient descent needs

        True Gradients       Stochastic Gradients
MGD     O(log(1/ε))          O(κ² log(1/ε))

In contrast, SGD needs O(1/(µε)) stochastic gradients.

Extensions

For unbounded domains, O(κ² log(1/ε)) can be improved to O(κ log(1/ε)) [Johnson and Zhang, 2013].
For smooth and convex functions, O(log(1/ε)) true gradients and O(1/ε) stochastic gradients are needed [Mahdavi et al., 2013].

Experimental Results (I)

Reuters Corpus Volume I (RCV1) data set: the optimization error (training loss minus the optimum) versus the number of gradients, for EMGD and SGD.

[Figure: optimization-error curves for EMGD and SGD on RCV1]

Experimental Results (II)

Reuters Corpus Volume I (RCV1) data set: the variance of the mixed gradient versus the number of gradients, for EMGD and SGD.

[Figure: gradient-variance curves for EMGD and SGD on RCV1]

Stochastic Gradient Descent (SGD)

Nuclear Norm Regularization (Composite Optimization)

    min_{W∈R^{m×n}} F(W) = f(W) + λ‖W‖_*

The optimal solution W_* is low-rank.

The Algorithm [Avron et al., 2012]

1: for t = 1, 2, . . . , T do
2:     Generate a low-rank stochastic gradient G_t of F(·) at W_t
3:     W′_{t+1} = W_t − η_t G_t
4:     W_{t+1} = Π_r(W′_{t+1}) (truncate small singular values)
5: end for

Efficient Implementation

W_{t+1} = Π_r(W_t − η_t G_t) can be computed by incremental SVD with O((m + n)r) space and O((m + n)r²) time.

Advantage vs. Disadvantage

Space complexity per iteration is low: O((m + n)r)
There is no theoretical guarantee, due to the truncation.
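
For concreteness, the truncation Π_r in line 4 can be written as a truncated SVD; the sketch below (an illustration, not the paper's code) uses a dense SVD for clarity, whereas the incremental-SVD implementation mentioned above avoids forming the matrix explicitly.

import numpy as np

def truncate_rank(A, r):
    # Pi_r(A): best rank-r approximation of A, keeping the top r singular values.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

# Example: project a nearly low-rank matrix onto the set of rank-5 matrices.
rng = np.random.default_rng(6)
A = rng.standard_normal((100, 8)) @ rng.standard_normal((8, 80)) \
    + 0.01 * rng.standard_normal((100, 80))
print(np.linalg.matrix_rank(truncate_rank(A, 5)))  # 5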

Stochastic Proximal Gradient Descent (SPGD)

Nuclear Norm Regularization

    min_{W∈R^{m×n}} F(W) = f(W) + λ‖W‖_*

The Algorithm [Zhang et al., 2015b]

1: for t = 1, 2, . . . , T do
2:     Generate a low-rank stochastic gradient G_t of f(·) at W_t
3:     W_{t+1} = argmin_{W∈R^{m×n}} (1/2)‖W − (W_t − η_t G_t)‖_F² + η_t λ‖W‖_*
4: end for
5: return W_{T+1}

Efficient Implementation

The proximal step in line 3 can also be computed by incremental SVD with O((m + n)r) space and O((m + n)r²) time.

Advantages

Space complexity per iteration is still low: O((m + n)r)
It is supported by solid theoretical guarantees.
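
The proximal step in line 3 has a well-known closed form: soft-thresholding of the singular values. The sketch below uses a dense SVD for clarity and illustrative shapes and parameters; the incremental-SVD implementation mentioned above would keep everything in factored form instead.

import numpy as np

def prox_nuclear(A, tau):
    # argmin_W 0.5 * ||W - A||_F^2 + tau * ||W||_* : singular value soft-thresholding.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    keep = s_shrunk > 0
    return (U[:, keep] * s_shrunk[keep]) @ Vt[keep]

# One SPGD step with made-up sizes and a synthetic low-rank stochastic gradient.
rng = np.random.default_rng(4)
m, n = 200, 150
W_t = rng.standard_normal((m, n))
G_t = rng.standard_normal((m, 5)) @ rng.standard_normal((5, n))
eta_t, lam = 0.1, 1.0
W_next = prox_nuclear(W_t - eta_t * G_t, eta_t * lam)
print(np.linalg.matrix_rank(W_next))  # typically much smaller than min(m, n)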

Theoretical Guarantees

Theorem 2 (General convex functions [Zhang et al., 2015b])

Assume E[‖G_t‖_F²] is upper bounded. Setting η_t = 1/√T, we have

    E[F(W_T) − F(W_*)] ≤ O(log T / √T)

where W_* is the optimal solution.

Theorem 3 (Strongly convex functions [Zhang et al., 2015b])

Suppose f(·) is strongly convex and E[‖G_t‖_F²] is upper bounded. Setting η_t = 1/(µt), we have

    E[F(W_T) − F(W_*)] ≤ O(log T / T)

where W_* is the optimal solution.

Application to Kernel PCA

Traditional Algorithm for Kernel PCA

1. Construct the kernel matrix K ∈ R^{n×n} with K_ij = κ(x_i, x_j)
2. Calculate the top eigenvectors and eigenvalues of K

The Challenge

Space complexity is O(n²).

The Question

Is it possible to find the top eigensystem of K without constructing K explicitly? Yes, by SPGD.

Application to Kernel PCA (I)

Nuclear Norm Regularized Least Squares

    min_{W∈R^{n×n}} (1/2)‖W − K‖_F² + λ‖W‖_*

The top eigensystem of K can be recovered from the optimal solution W_*.

Proximal Gradient Descent

1: for t = 1, 2, . . . , T do
2:     Calculate the gradient G_t = W_t − K
3:     W_{t+1} = argmin_{W∈R^{n×n}} (1/2)‖W − (W_t − η_t G_t)‖_F² + η_t λ‖W‖_*
4: end for

Application to Kernel PCA (II)

Nuclear Norm Regularized Least Squares

    min_{W∈R^{n×n}} (1/2)‖W − K‖_F² + λ‖W‖_*

Stochastic Proximal Gradient Descent [Zhang et al., 2015a]

1: for t = 1, 2, . . . , T do
2:     Calculate a low-rank stochastic gradient G_t = W_t − ξ
3:     W_{t+1} = argmin_{W∈R^{n×n}} (1/2)‖W − (W_t − η_t G_t)‖_F² + η_t λ‖W‖_*
4: end for

ξ is low-rank with E[ξ] = K; such a ξ can be obtained from random Fourier features [Rahimi and Recht, 2008] or by constructing columns/rows of K randomly.
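
One way to obtain such a low-rank ξ, sketched below under illustrative assumptions (an RBF kernel, made-up bandwidth sigma and feature count D), is via random Fourier features [Rahimi and Recht, 2008]: with feature matrix Z, the matrix ξ = Z Z⊤ has rank at most D and satisfies E[ξ] = K entrywise, and only Z ever needs to be stored.

import numpy as np

rng = np.random.default_rng(5)
n, d, D, sigma = 1000, 10, 50, 1.0
X = rng.standard_normal((n, d))

# Random Fourier features for k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
Omega = rng.standard_normal((d, D)) / sigma
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(X @ Omega + b)   # n x D feature matrix, O(nD) memory

# xi = Z @ Z.T is never materialized; entries and matrix-vector products come from Z.
i, j = 3, 7
approx = Z[i] @ Z[j]
exact = np.exp(-np.linalg.norm(X[i] - X[j]) ** 2 / (2 * sigma ** 2))
print(approx, exact)  # approx fluctuates around the exact kernel value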

Experimental Results (I)

The Magic data set: the rank of the iterate W_t versus the iteration t, for λ ∈ {10, 100} and k ∈ {5, 50}.

[Figure: rank(W_t) versus t on the Magic data set]

Experimental Results (II)

The Magic data set: the optimization error ‖W_t − W_*‖_F²/n² versus the iteration t, for λ ∈ {10, 100} and k ∈ {5, 50}, together with a 0.05/t reference curve.

[Figure: optimization error versus t on the Magic data set]

Conclusion and Future Work

Time Reduction by Stochastic Optimization

A novel algorithm, Mixed Gradient Descent (MGD), is proposed to reduce the iteration complexity.
Future work: extend MGD to other scenarios, such as the distributed setting and non-convex functions.

Space Reduction by Stochastic Optimization

A simple algorithm based on Stochastic Proximal Gradient Descent (SPGD) is proposed to reduce the space complexity.
Future work: apply it to more applications, such as matrix completion and multi-task learning.

Special Issue on "Multi-instance Learning in Pattern Recognition and Vision"

URL: http://www.journals.elsevier.com/pattern-recognition/call-for-papers/special-issue-on-multiple-instance-learning-in-pattern-recog

Dates:
First submission date: Mar. 1, 2016
Final paper notification: Dec. 1, 2016

Guest editors:
Jianxin Wu, Nanjing University
Xiang Bai, Huazhong University of Science and Technology
Marco Loog, Delft University of Technology
Fabio Roli, University of Cagliari
Zhi-Hua Zhou, Nanjing University

Thanks!

References

Avron, H., Kale, S., Kasiviswanathan, S., and Sindhwani, V. (2012). Efficient and practical stochastic subgradient descent for nuclear norm regularization. In Proceedings of the 29th International Conference on Machine Learning, pages 1231–1238.

Hazan, E. and Kale, S. (2011). Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory, pages 421–436.

Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323.

Mahdavi, M., Zhang, L., and Jin, R. (2013). Mixed optimization for smooth functions. In Advances in Neural Information Processing Systems 26, pages 674–682.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609.

Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, pages 1177–1184.

Zhang, L., Mahdavi, M., and Jin, R. (2013). Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems 26, pages 980–988.

Zhang, L., Yang, T., Jin, R., and Zhou, Z.-H. (2015a). Stochastic optimization for kernel PCA. Submitted.

Zhang, L., Yang, T., Jin, R., and Zhou, Z.-H. (2015b). Stochastic proximal gradient descent for nuclear norm regularization. arXiv:1511.01664.
