
Stochastic Optimization for Large-scale Machine Learning

Lijun Zhang

LAMDA group, Nanjing University, China
http://cs.nju.edu.cn/zlj

The 13th Chinese Workshop on Machine Learning and Applications

Outline

1 Introduction: Definition, Advantages
2 Time Reduction: Background, Mixed Gradient Descent
3 Space Reduction: Background, Stochastic Proximal Gradient Descent
4 Conclusion

What is Stochastic Optimization? (I)

Definition 1: A Special Objective [Nemirovski et al., 2009]

    min_{w∈W} f(w) = E_ξ[ℓ(w, ξ)] = ∫_Ξ ℓ(w, ξ) dP(ξ)

where ξ is a random variable, and it is possible to generate an i.i.d. sample ξ1, ξ2, . . .

Risk Minimization

    min_{w∈W} f(w) = E_{(x,y)}[ℓ(w, (x, y))]

ℓ(·, ·) is a loss function, e.g., the hinge loss ℓ(u, v) = max(0, 1 − uv)
Training data (x1, y1), . . . , (xn, yn) are i.i.d.

What is Stochastic Optimization? (II)

Definition 2: A Special Access Model [Hazan and Kale, 2011]

    min_{w∈W} f(w)

There exists a stochastic oracle that produces an unbiased gradient o(·):

    E[o(w)] ∈ ∂f(w)  or  E[o(w)] = ∇f(w)

Empirical Risk Minimization

    min_{w∈W} f(w) = (1/n) ∑_{i=1}^{n} ℓ(y_i, x_i⊤w)

Sampling (x_t, y_t) from {(x_i, y_i)}_{i=1}^{n} uniformly at random gives

    E[∂ℓ(y_t, x_t⊤w)] ⊆ ∂[(1/n) ∑_{i=1}^{n} ℓ(y_i, x_i⊤w)]
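
As a concrete illustration of this access model, the following NumPy sketch (not from the talk; the logistic loss, random data, and sample counts are illustrative assumptions) checks numerically that the gradient at one uniformly sampled example is an unbiased estimate of the full empirical-risk gradient, i.e., that uniform sampling implements the stochastic oracle above.

import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
w = rng.standard_normal(d)

def logistic_grad(w, x, y):
    # Gradient of log(1 + exp(-y * x^T w)) with respect to w.
    return -y * x / (1.0 + np.exp(y * (x @ w)))

# Full gradient of f(w) = (1/n) sum_i loss_i(w).
full_grad = np.mean([logistic_grad(w, X[i], y[i]) for i in range(n)], axis=0)

# Stochastic oracle o(w): gradient at one uniformly sampled example.
def oracle(w):
    i = rng.integers(n)
    return logistic_grad(w, X[i], y[i])

# Averaging many oracle calls approaches the full gradient, since E[o(w)] = ∇f(w).
est = np.mean([oracle(w) for _ in range(20000)], axis=0)
print(np.linalg.norm(est - full_grad))  # small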

Why Stochastic Optimization? (I)

[Infographic on the growth of data: https://infographiclist.files.wordpress.com/2011/09/world-of-data.jpeg]

Why Stochastic Optimization? (II)

Empirical Risk Minimization

    min_{w∈W⊆R^d} f(w) = (1/n) ∑_{i=1}^{n} ℓ(y_i, x_i⊤w)

n is the number of training examples; d is the dimensionality.

Deterministic Optimization: Gradient Descent (GD)

1: for t = 1, 2, . . . , T do
2:     w′_{t+1} = w_t − η_t (1/n) ∑_{i=1}^{n} ∇ℓ(y_i, x_i⊤w_t)
3:     w_{t+1} = Π_W(w′_{t+1})
4: end for

The Challenge

Time complexity per iteration: O(nd) + O(poly(d))

Why Stochastic Optimization? (III)

Empirical Risk Minimization

    min_{w∈W⊆R^d} f(w) = (1/n) ∑_{i=1}^{n} ℓ(y_i, x_i⊤w)

Stochastic Optimization: Stochastic Gradient Descent (SGD)

1: for t = 1, 2, . . . , T do
2:     Sample a training example (x_t, y_t) randomly
3:     w′_{t+1} = w_t − η_t ∇ℓ(y_t, x_t⊤w_t)
4:     w_{t+1} = Π_W(w′_{t+1})
5: end for

The Advantage: Time Reduction

Time complexity per iteration: O(d) + O(poly(d))
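
To make the O(nd) versus O(d) per-iteration costs concrete, here is a minimal sketch assuming a least-squares loss, an unconstrained domain (the projection step is omitted), and made-up problem sizes and step sizes; it is an illustration, not the talk's code.

import numpy as np

rng = np.random.default_rng(1)
n, d = 100000, 50
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def gd_step(w, eta):
    # One GD step touches all n examples: O(nd) time.
    grad = X.T @ (X @ w - y) / n
    return w - eta * grad

def sgd_step(w, eta):
    # One SGD step touches a single example: O(d) time.
    i = rng.integers(n)
    grad = (X[i] @ w - y[i]) * X[i]
    return w - eta * grad

w_gd = gd_step(np.zeros(d), eta=0.1)      # one full-gradient step, for comparison
w = np.zeros(d)
for t in range(1, 101):
    w = sgd_step(w, eta=1.0 / np.sqrt(t))  # decreasing step size, as on the slide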

Why Stochastic Optimization? (IV)

Optimization over Large Matrices

    min_{W∈W⊆R^{m×n}} F(W)

Examples: matrix completion, distance metric learning, multi-task learning, multi-class learning

Deterministic Optimization: Gradient Descent (GD)

1: for t = 1, 2, . . . , T do
2:     W′_{t+1} = W_t − η_t ∇F(W_t)
3:     W_{t+1} = Π_W(W′_{t+1})
4: end for

The Challenge

Space complexity per iteration: O(mn)

Why Stochastic Optimization? (V)

Optimization over Large Matrices

    min_{W∈W⊆R^{m×n}} F(W)

Stochastic Optimization: Stochastic Gradient Descent (SGD)

1: for t = 1, 2, . . . , T do
2:     Generate a low-rank stochastic gradient G_t of F(·) at W_t
3:     W′_{t+1} = W_t − η_t G_t
4:     W_{t+1} = Π_W(W′_{t+1})
5: end for

The Advantage: Space Reduction

Space complexity per iteration can be as low as O((m + n)r), where r bounds the rank of the stochastic gradients and iterates.
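
A small illustration of why low-rank stochastic gradients enable space reduction, under assumptions that are not from the talk (a rank-one gradient built from a single sampled matrix entry, iterates kept in factored form U V⊤): nothing of size m × n is ever materialized.

import numpy as np

rng = np.random.default_rng(2)
m, n, r = 5000, 4000, 10

# Keep the iterate in factored form W_t = U @ V.T: O((m + n) r) memory.
U = rng.standard_normal((m, r)) / np.sqrt(r)
V = rng.standard_normal((n, r)) / np.sqrt(r)

# A rank-1 stochastic gradient G_t = u v^T, represented only by its factors.
i, j = rng.integers(m), rng.integers(n)
residual = U[i] @ V[j]            # stand-in for dF/dW at entry (i, j); illustrative
u = np.zeros(m); u[i] = m * n * residual
v = np.zeros(n); v[j] = 1.0

# The update W_{t+1} = W_t - eta * G_t never forms an m-by-n matrix:
eta = 0.01
U_new = np.hstack([U, -eta * u[:, None]])  # rank grows by one here and would be
V_new = np.hstack([V, v[:, None]])         # re-compressed later, e.g. by truncated SVD
print(U_new.shape, V_new.shape)            # (m, r+1), (n, r+1): O((m + n) r) storage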

Stochastic Gradient Descent (SGD)

Empirical Risk Minimization

    min_{w∈W⊆R^d} f(w) = (1/n) ∑_{i=1}^{n} ℓ(y_i, x_i⊤w)

The Algorithm

1: for t = 1, 2, . . . , T do
2:     Sample a training example (x_t, y_t) randomly
3:     w′_{t+1} = w_t − η_t ∇ℓ(y_t, x_t⊤w_t)
4:     w_{t+1} = Π_W(w′_{t+1})
5: end for

Advantage vs. Disadvantage

Time complexity per iteration is low: O(d) + O(poly(d))
The iteration complexity is much higher than that of GD

The Problem

Iteration Complexity

The number of iterations T needed to ensure f(w_T) − min_{w∈W} f(w) ≤ ε.

Iteration Complexity of GD and SGD (with ε = 10^-6 as a running example)

        Convex & Smooth            Strongly Convex & Smooth
GD      O(1/√ε)      ≈ 10^3        O(√κ log(1/ε))    ≈ 6√κ
SGD     O(1/ε²)      ≈ 10^12       O(1/(µε))         ≈ 10^6/µ

Total Time Complexity of GD and SGD (with ε = 10^-6)

        Convex & Smooth            Strongly Convex & Smooth
GD      O(n/√ε)      ≈ 10^3 n      O(n√κ log(1/ε))   ≈ 6n√κ
SGD     O(1/ε²)      ≈ 10^12       O(1/(µε))         ≈ 10^6/µ

SGD may be slower than GD if ε is very small: for instance, with n = 10^6 training examples and ε = 10^-6, GD needs about 10^9 gradient evaluations in the convex and smooth case, while SGD needs about 10^12.

Motivations

Reason for the Slow Convergence Rate

The step size of SGD is a decreasing sequence:
η_t = 1/√t for convex functions
η_t = 1/t for strongly convex functions

Reason for the Decreasing Step Size

    w′_{t+1} = w_t − η_t ∇ℓ(y_t, x_t⊤w_t)

Stochastic gradients introduce a constant-level error: their variance does not vanish as w_t approaches the optimum.

The Key Idea

Control the variance of the stochastic gradients, so that a fixed step size η can be chosen.

Mixed Gradient Descent (I)

Mixed Gradient at w_t

    m(w_t) = ∇ℓ(y_t, x_t⊤w_t) − ∇ℓ(y_t, x_t⊤w_0) + ∇f(w_0)

where (x_t, y_t) is a random sample, w_0 is an initial solution, and

    ∇f(w_0) = (1/n) ∑_{i=1}^{n} ∇ℓ(y_i, x_i⊤w_0)

Properties of the Mixed Gradient

It is still an unbiased estimate of the true gradient:

    E[m(w_t)] = (1/n) ∑_{i=1}^{n} ∇ℓ(y_i, x_i⊤w_t) = ∇f(w_t)

Its variance is controlled by the distance to w_0 (for an L-smooth loss):

    ‖∇ℓ(y_t, x_t⊤w_t) − ∇ℓ(y_t, x_t⊤w_0)‖_2 ≤ L‖w_t − w_0‖_2

Mixed Gradient Descent (II)

The Algorithm [Zhang et al., 2013]

1: Compute the true gradient at w_0: ∇f(w_0) = (1/n) ∑_{i=1}^{n} ∇ℓ(y_i, x_i⊤w_0)
2: for t = 1, 2, . . . , T do
3:     Sample a training example (x_t, y_t) randomly
4:     Compute the mixed gradient m(w_t) = ∇ℓ(y_t, x_t⊤w_t) − ∇ℓ(y_t, x_t⊤w_0) + ∇f(w_0)
5:     w′_{t+1} = w_t − η m(w_t)
6:     w_{t+1} = Π_W(w′_{t+1})
7: end for
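
A minimal runnable sketch of the loop above, assuming a least-squares loss, an unconstrained domain (projection omitted), and illustrative data sizes and step size; it is not the authors' implementation.

import numpy as np

rng = np.random.default_rng(3)
n, d = 2000, 20
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.01 * rng.standard_normal(n)

def grad_i(w, i):
    # Gradient of the single-example loss (1/2)(x_i^T w - y_i)^2.
    return (X[i] @ w - y[i]) * X[i]

def full_grad(w):
    return X.T @ (X @ w - y) / n

w0 = np.zeros(d)
g0 = full_grad(w0)            # one pass over the data: the true gradient at w_0
w, eta = w0.copy(), 0.005     # fixed step size, as the key idea suggests
for t in range(2000):
    i = rng.integers(n)
    mixed = grad_i(w, i) - grad_i(w0, i) + g0   # unbiased; low variance near w_0
    w = w - eta * mixed

print(np.linalg.norm(full_grad(w)))  # gradient norm after the stochastic stage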

Theoretical Guarantees

Theorem 1 ([Zhang et al., 2013])

Suppose the objective function is smooth and strongly convex. To find an ε-optimal solution, mixed gradient descent needs

        True Gradients       Stochastic Gradients
MGD     O(log(1/ε))          O(κ² log(1/ε))

In contrast, SGD needs O(1/(µε)) stochastic gradients.

Extensions

For unbounded domains, O(κ² log(1/ε)) can be improved to O(κ log(1/ε)) [Johnson and Zhang, 2013].
For smooth and convex functions, O(log(1/ε)) true gradients and O(1/ε) stochastic gradients are needed [Mahdavi et al., 2013].

Experimental Results (I)

Reuters Corpus Volume I (RCV1) data set: the optimization error (training loss minus the optimum) versus the number of gradients, for EMGD and SGD.

[Figure: optimization-error curves for EMGD and SGD on RCV1]

Experimental Results (II)

Reuters Corpus Volume I (RCV1) data set: the variance of the mixed gradient versus the number of gradients, for EMGD and SGD.

[Figure: gradient-variance curves for EMGD and SGD on RCV1]

Stochastic Gradient Descent (SGD)

Nuclear Norm Regularization (Composite Optimization)

    min_{W∈R^{m×n}} F(W) = f(W) + λ‖W‖_*

The optimal solution W_* is low-rank.

The Algorithm [Avron et al., 2012]

1: for t = 1, 2, . . . , T do
2:     Generate a low-rank stochastic gradient G_t of F(·) at W_t
3:     W′_{t+1} = W_t − η_t G_t
4:     W_{t+1} = Π_r(W′_{t+1}) (truncate small singular values)
5: end for

Efficient Implementation

W_{t+1} = Π_r(W_t − η_t G_t) can be computed by incremental SVD with O((m + n)r) space and O((m + n)r²) time.

Advantage vs. Disadvantage

Space complexity per iteration is low: O((m + n)r)
There is no theoretical guarantee, due to the truncation.
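
For concreteness, the truncation Π_r in line 4 can be written as a truncated SVD; the sketch below (an illustration, not the paper's code) uses a dense SVD for clarity, whereas the incremental-SVD implementation mentioned above avoids forming the matrix explicitly.

import numpy as np

def truncate_rank(A, r):
    # Pi_r(A): best rank-r approximation of A, keeping the top r singular values.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

# Example: project a nearly low-rank matrix onto the set of rank-5 matrices.
rng = np.random.default_rng(6)
A = rng.standard_normal((100, 8)) @ rng.standard_normal((8, 80)) \
    + 0.01 * rng.standard_normal((100, 80))
print(np.linalg.matrix_rank(truncate_rank(A, 5)))  # 5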

Stochastic Proximal Gradient Descent (SPGD)

Nuclear Norm Regularization

    min_{W∈R^{m×n}} F(W) = f(W) + λ‖W‖_*

The Algorithm [Zhang et al., 2015b]

1: for t = 1, 2, . . . , T do
2:     Generate a low-rank stochastic gradient G_t of f(·) at W_t
3:     W_{t+1} = argmin_{W∈R^{m×n}} (1/2)‖W − (W_t − η_t G_t)‖_F² + η_t λ‖W‖_*
4: end for
5: return W_{T+1}

Efficient Implementation

The proximal step in line 3 can also be computed by incremental SVD with O((m + n)r) space and O((m + n)r²) time.

Advantages

Space complexity per iteration is still low: O((m + n)r)
It is supported by solid theoretical guarantees.
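
The proximal step in line 3 has a well-known closed form: soft-thresholding of the singular values. The sketch below uses a dense SVD for clarity and illustrative shapes and parameters; the incremental-SVD implementation mentioned above would keep everything in factored form instead.

import numpy as np

def prox_nuclear(A, tau):
    # argmin_W 0.5 * ||W - A||_F^2 + tau * ||W||_* : singular value soft-thresholding.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    keep = s_shrunk > 0
    return (U[:, keep] * s_shrunk[keep]) @ Vt[keep]

# One SPGD step with made-up sizes and a synthetic low-rank stochastic gradient.
rng = np.random.default_rng(4)
m, n = 200, 150
W_t = rng.standard_normal((m, n))
G_t = rng.standard_normal((m, 5)) @ rng.standard_normal((5, n))
eta_t, lam = 0.1, 1.0
W_next = prox_nuclear(W_t - eta_t * G_t, eta_t * lam)
print(np.linalg.matrix_rank(W_next))  # typically much smaller than min(m, n)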

Theoretical Guarantees

Theorem 2 (General convex functions [Zhang et al., 2015b])

Assume E[‖G_t‖_F²] is upper bounded. Setting η_t = 1/√T, we have

    E[F(W_T) − F(W_*)] ≤ O(log T / √T)

where W_* is the optimal solution.

Theorem 3 (Strongly convex functions [Zhang et al., 2015b])

Suppose f(·) is strongly convex and E[‖G_t‖_F²] is upper bounded. Setting η_t = 1/(µt), we have

    E[F(W_T) − F(W_*)] ≤ O(log T / T)

where W_* is the optimal solution.

Application to Kernel PCA

Traditional Algorithm for Kernel PCA

1. Construct the kernel matrix K ∈ R^{n×n} with K_ij = κ(x_i, x_j)
2. Calculate the top eigenvectors and eigenvalues of K

The Challenge

Space complexity is O(n²).

The Question

Is it possible to find the top eigensystem of K without constructing K explicitly? Yes, by SPGD.

Application to Kernel PCA (I)

Nuclear Norm Regularized Least Squares

    min_{W∈R^{n×n}} (1/2)‖W − K‖_F² + λ‖W‖_*

The top eigensystem of K can be recovered from the optimal solution W_*.

Proximal Gradient Descent

1: for t = 1, 2, . . . , T do
2:     Calculate the gradient G_t = W_t − K
3:     W_{t+1} = argmin_{W∈R^{n×n}} (1/2)‖W − (W_t − η_t G_t)‖_F² + η_t λ‖W‖_*
4: end for

Application to Kernel PCA (II)

Nuclear Norm Regularized Least Squares

    min_{W∈R^{n×n}} (1/2)‖W − K‖_F² + λ‖W‖_*

Stochastic Proximal Gradient Descent [Zhang et al., 2015a]

1: for t = 1, 2, . . . , T do
2:     Calculate a low-rank stochastic gradient G_t = W_t − ξ
3:     W_{t+1} = argmin_{W∈R^{n×n}} (1/2)‖W − (W_t − η_t G_t)‖_F² + η_t λ‖W‖_*
4: end for

ξ is low-rank with E[ξ] = K; such a ξ can be obtained from random Fourier features [Rahimi and Recht, 2008] or by constructing columns/rows of K randomly.
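
One way to obtain such a low-rank ξ, sketched below under illustrative assumptions (an RBF kernel, made-up bandwidth sigma and feature count D), is via random Fourier features [Rahimi and Recht, 2008]: with feature matrix Z, the matrix ξ = Z Z⊤ has rank at most D and satisfies E[ξ] = K entrywise, and only Z ever needs to be stored.

import numpy as np

rng = np.random.default_rng(5)
n, d, D, sigma = 1000, 10, 50, 1.0
X = rng.standard_normal((n, d))

# Random Fourier features for k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
Omega = rng.standard_normal((d, D)) / sigma
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(X @ Omega + b)   # n x D feature matrix, O(nD) memory

# xi = Z @ Z.T is never materialized; entries and matrix-vector products come from Z.
i, j = 3, 7
approx = Z[i] @ Z[j]
exact = np.exp(-np.linalg.norm(X[i] - X[j]) ** 2 / (2 * sigma ** 2))
print(approx, exact)  # approx fluctuates around the exact kernel value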

Experimental Results (I)

The Magic data set: the rank of the iterate W_t versus the iteration t, for λ ∈ {10, 100} and k ∈ {5, 50}.

[Figure: rank(W_t) versus t on the Magic data set]

Experimental Results (II)

The Magic data set: the optimization error ‖W_t − W_*‖_F²/n² versus the iteration t, for λ ∈ {10, 100} and k ∈ {5, 50}, together with a 0.05/t reference curve.

[Figure: optimization error versus t on the Magic data set]

Conclusion and Future Work

Time Reduction by Stochastic Optimization

A novel algorithm, Mixed Gradient Descent (MGD), is proposed to reduce the iteration complexity.
Future work: extend MGD to other scenarios, such as the distributed setting and non-convex functions.

Space Reduction by Stochastic Optimization

A simple algorithm based on Stochastic Proximal Gradient Descent (SPGD) is proposed to reduce the space complexity.
Future work: apply it to more applications, such as matrix completion and multi-task learning.

Special Issue on "Multi-instance Learning in Pattern Recognition and Vision"

URL: http://www.journals.elsevier.com/pattern-recognition/call-for-papers/special-issue-on-multiple-instance-learning-in-pattern-recog

Dates:
First submission date: Mar. 1, 2016
Final paper notification: Dec. 1, 2016

Guest editors:
Jianxin Wu, Nanjing University
Xiang Bai, Huazhong University of Science and Technology
Marco Loog, Delft University of Technology
Fabio Roli, University of Cagliari
Zhi-Hua Zhou, Nanjing University

Thanks!

References

Avron, H., Kale, S., Kasiviswanathan, S., and Sindhwani, V. (2012). Efficient and practical stochastic subgradient descent for nuclear norm regularization. In Proceedings of the 29th International Conference on Machine Learning, pages 1231–1238.

Hazan, E. and Kale, S. (2011). Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory, pages 421–436.

Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323.

Mahdavi, M., Zhang, L., and Jin, R. (2013). Mixed optimization for smooth functions. In Advances in Neural Information Processing Systems 26, pages 674–682.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609.

Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, pages 1177–1184.

Zhang, L., Mahdavi, M., and Jin, R. (2013). Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems 26, pages 980–988.

Zhang, L., Yang, T., Jin, R., and Zhou, Z.-H. (2015a). Stochastic optimization for kernel PCA. Submitted.

Zhang, L., Yang, T., Jin, R., and Zhou, Z.-H. (2015b). Stochastic proximal gradient descent for nuclear norm regularization. arXiv:1511.01664.
