
Introduction to Machine Learning and Stochastic Optimization

Robert M. Gower

Spring School on Optimization and Data Science, Novi Sad, March 2017

Solving the Finite Sum Training Problem

A datum function: f_i(w) measures the loss of the model w on the i-th data point.

The finite sum training problem: the training objective is a sum of such terms, and the optimization problem is to minimize their average over w.
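A minimal LaTeX sketch of the finite-sum training problem, with regularized logistic regression (as in the example that follows) as an assumed form of the datum function; the regularization weight λ is an assumption:

\min_{w \in \mathbb{R}^d} \; f(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(w),
\qquad \text{e.g.} \quad
f_i(w) = \log\!\left(1 + e^{-y_i \langle x_i, w \rangle}\right) + \frac{\lambda}{2}\,\|w\|_2^2 .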

Gradient Descent Example

A logistic regression problem using the fourclass labelled data from LIBSVM, with (n, d) = (862, 2). The plots show the gradient descent iterates approaching the optimal point.

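To accompany the example, a minimal runnable sketch of gradient descent on an L2-regularized logistic regression objective; the synthetic two-dimensional data, the regularization weight, and the stepsize are assumptions standing in for the fourclass data and the constants used on the slides.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the fourclass data: n = 862 points in d = 2 dimensions.
n, d = 862, 2
X = rng.normal(size=(n, d))
y = np.sign(X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=n))

lam = 1.0 / n                                  # assumed regularization weight

def full_gradient(w):
    # Gradient of (1/n) sum_i log(1 + exp(-y_i <x_i, w>)) + (lam/2) ||w||^2.
    margins = y * (X @ w)
    coeffs = -y / (1.0 + np.exp(margins))      # derivative of the logistic loss
    return X.T @ coeffs / n + lam * w

w = np.zeros(d)
alpha = 1.0                                    # assumed constant stepsize
for t in range(200):
    w = w - alpha * full_gradient(w)

print("final gradient norm:", np.linalg.norm(full_gradient(w)))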

Stochastic Gradient Descent

Is it possible to design a method that uses only the gradient of a single datum function at each iteration?

Unbiased estimate: let j be an index sampled uniformly at random from {1, ..., n}. Then the expectation over j of the sampled gradient equals the full gradient, so the gradient of a single datum function is an unbiased estimate of the gradient of the training objective.

Stochastic Gradient Descent: plots of the SGD iterates and the optimal point.
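A minimal runnable sketch of the SGD update that uses the gradient of a single datum function per iteration; the synthetic data, the regularization weight, and the stepsize are the same stand-in assumptions as in the gradient descent sketch above.

import numpy as np

rng = np.random.default_rng(0)

# Same synthetic stand-in data as in the gradient descent sketch.
n, d = 862, 2
X = rng.normal(size=(n, d))
y = np.sign(X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=n))
lam = 1.0 / n                                   # assumed regularization weight

def datum_gradient(w, i):
    # Gradient of f_i(w) = log(1 + exp(-y_i <x_i, w>)) + (lam/2) ||w||^2.
    margin = y[i] * (X[i] @ w)
    return -y[i] / (1.0 + np.exp(margin)) * X[i] + lam * w

w = np.zeros(d)
alpha = 0.1                                     # assumed constant stepsize
for t in range(10 * n):                         # roughly 10 passes over the data
    j = rng.integers(n)                         # sample one index uniformly
    w = w - alpha * datum_gradient(w, j)        # step along an unbiased gradient estimate

print(w)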

Assumptions for Convergence

Strong Convexity: f is strongly convex, and the strong convexity parameter is often the same as the regularization parameter.

Expected Bounded Stochastic Gradients: the expected squared norm of the sampled gradients is bounded.

EXE: Using these two assumptions, show the one-step bound used in the convergence proof below.

Example of Strong Convexity: the hinge loss plus L2 regularization. The L2 term supplies the quadratic lower bound, so the regularized objective is strongly convex with parameter equal to the regularization weight.
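A brief LaTeX sketch of the definition and of the example; the symbols μ for the strong convexity parameter and λ for the regularization weight are the usual choices, assumed here:

% Strong convexity: a quadratic lower bound at every point v
f(w) \ge f(v) + \langle \nabla f(v),\, w - v \rangle + \frac{\mu}{2}\,\|w - v\|_2^2
\quad \text{for all } w, v .

% Hinge loss + L2: the regularizer alone provides the quadratic lower bound,
% so the objective is \lambda-strongly convex (with subgradients in place of gradients).
f(w) = \frac{1}{n}\sum_{i=1}^{n} \max\bigl(0,\, 1 - y_i \langle x_i, w \rangle\bigr)
       + \frac{\lambda}{2}\,\|w\|_2^2 .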

Complexity / Convergence

Theorem (constant stepsize).

Proof sketch: expand the squared distance to the optimum after one SGD step; take the expectation with respect to j, using that the stochastic gradient is an unbiased estimator; bound the inner-product term with strong convexity and the squared-norm term with the expected bounded stochastic gradient assumption; then take the total expectation and unroll the recurrence.
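A hedged reconstruction of the standard constant-stepsize bound that these assumptions and proof steps yield; the exact constants on the original slides may differ. With stepsize α, strong convexity parameter μ, and a bound B² on the expected squared norm of the stochastic gradients:

% One SGD step, then expectation over j (unbiasedness), strong convexity, bounded gradients:
\mathbb{E}_j\|w^{t+1} - w^*\|^2
  \le (1 - 2\alpha\mu)\,\|w^{t} - w^*\|^2 + \alpha^2 B^2 .

% Taking total expectation and unrolling the recurrence (for 0 < \alpha \le 1/(2\mu)):
\mathbb{E}\|w^{t} - w^*\|^2
  \le (1 - 2\alpha\mu)^{t}\,\|w^{0} - w^*\|^2 + \frac{\alpha B^2}{2\mu} .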

Stochastic Gradient Descent with constant stepsizes α = 0.01, 0.1, 0.2, 0.5 (plots of the iterates).

Complexity / Convergence

Theorem (shrinking stepsize). With a suitably shrinking stepsize, the expected error of SGD converges to zero, but only at a sublinear rate.
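A hedged sketch of a typical statement of this kind; the particular stepsize schedule and the constants are assumptions:

% With a decreasing schedule such as \alpha_t = \Theta(1/(\mu t)), SGD attains a sublinear rate,
\mathbb{E}\|w^{t} - w^*\|^2 = \mathcal{O}\!\left(\frac{B^2}{\mu^2\, t}\right),
% in contrast to the constant-stepsize bound, which contracts linearly but only down to a
% noise ball of size proportional to \alpha B^2 / (2\mu).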

Comparison SGD vs GD

Plots over time comparing SGD, GD, and the Stochastic Average Gradient (SAG).

M. Schmidt, N. Le Roux, F. Bach (2016). Minimizing Finite Sums with the Stochastic Average Gradient. Mathematical Programming.

Maybe just an unbiased estimate is not enough.

Variance reduced methods through Sketching

Build an Estimate of the Gradient

Instead of using the sampled gradient directly as the step direction, use it to update an estimate of the full gradient.

We would like a gradient estimate such that it is unbiased and it converges in L2 to the true gradient.
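In symbols, and hedged as a paraphrase of the two desiderata (the symbol g^t for the estimate is an assumption):

% Unbiased:
\mathbb{E}\left[g^{t} \mid w^{t}\right] = \nabla f(w^{t}),
% Converges in L2 (the variance of the estimate vanishes along the iterates):
\mathbb{E}\left\|g^{t} - \nabla f(w^{t})\right\|^2 \;\longrightarrow\; 0 .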

Example: The Stochastic Average Gradient

M. Schmidt, N. Le Roux, F. Bach (2016). Minimizing Finite Sums with the Stochastic Average Gradient. Mathematical Programming.

The Stochastic Average Gradient keeps the most recently computed gradient of each datum function and, at every iteration, recomputes one of them and steps along the average of the stored gradients.

How to prove this converges? Is this the only option?
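A minimal runnable sketch of SAG applied to the same L2-regularized logistic regression stand-in used in the earlier sketches; the synthetic data, the regularization weight, and the stepsize are assumptions, not values from the slides.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data, as in the earlier sketches.
n, d = 862, 2
X = rng.normal(size=(n, d))
y = np.sign(X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=n))
lam = 1.0 / n

def datum_gradient(w, i):
    margin = y[i] * (X[i] @ w)
    return -y[i] / (1.0 + np.exp(margin)) * X[i] + lam * w

L_max = np.max(np.sum(X**2, axis=1)) / 4.0 + lam
alpha = 1.0 / (16.0 * L_max)      # conservative stepsize of order 1/(16 L_max); L_max bounds each f_i's smoothness

w = np.zeros(d)
J = np.zeros((d, n))              # column i stores the last computed gradient of f_i
g = np.zeros(d)                   # running average of the stored gradients, (1/n) * J @ 1

for t in range(30 * n):
    j = rng.integers(n)
    new_col = datum_gradient(w, j)
    g += (new_col - J[:, j]) / n  # update the average in O(d)
    J[:, j] = new_col             # overwrite the stored gradient for datum j
    w = w - alpha * g             # SAG step along the (biased) averaged estimate

print(w)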

Introducing the Jacobian

Collect the datum gradients into one matrix with the gradients of the datum functions as its columns; the full gradient of the training objective is the average of these columns.

The Stochastic Average Gradient, seen through the Jacobian: is this the only option? How to prove this converges?
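In symbols, a sketch consistent with this viewpoint (the exact notation on the slides may differ):

% Stack the datum gradients column-wise (the transposed Jacobian of w -> (f_1(w), ..., f_n(w))):
\nabla F(w) := \bigl[\nabla f_1(w), \,\ldots,\, \nabla f_n(w)\bigr] \in \mathbb{R}^{d \times n},
\qquad
\nabla f(w) = \frac{1}{n}\, \nabla F(w)\, \mathbf{1} .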

Stochastic Sparse Sketches

Access the Jacobian through a stochastic linear measurement: multiply it on the right by a sparse stochastic sketching matrix.

Eg: the SGD sketch (a single randomly chosen coordinate vector) and the mini-batch SGD sketch (a random subset of coordinate vectors).

Many examples: sparse Rademacher matrices, sampling with replacement, non-uniform sampling, etc.
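Hedged examples of such sketching matrices, written in assumed notation (S for the sketch, e_j for a coordinate vector, B for a sampled batch):

% SGD sketch: a single coordinate vector chosen uniformly at random,
S = e_j \in \mathbb{R}^{n \times 1}, \qquad j \sim \mathrm{Uniform}\{1,\ldots,n\},
% so the measurement \nabla F(w) S = \nabla f_j(w) reads off one column.

% Mini-batch SGD sketch: the coordinate vectors indexed by a random batch B \subseteq \{1,\ldots,n\},
S = I_{:B} \in \mathbb{R}^{n \times |B|},
% so \nabla F(w) S collects the columns \nabla f_i(w) for i \in B.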

A Jacobian Based Method

Jacobian Sketching Algorithm: at each iteration, sample a stochastic sketch, use the resulting measurement of the true Jacobian to update a maintained Jacobian estimate (an improved guess), and take a step along the gradient estimate that this Jacobian estimate provides.

Updating the Jacobian Estimate: Sketch and Project

Sketch and project the Jacobian: choose the new estimate to be the closest matrix to the current estimate among all matrices that agree with the true Jacobian on the sampled sketch.

R. M. Gower and P. Richtárik (2015). Randomized Iterative Methods for Linear Systems. SIAM Journal on Matrix Analysis and Applications, 36(4).
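A hedged LaTeX sketch of the update this describes, in the assumed notation from above (J^t for the current estimate, ∇F(w^t) for the true Jacobian, S_t for the sampled sketch):

J^{t+1} \;=\; \arg\min_{J \in \mathbb{R}^{d \times n}} \ \tfrac{1}{2}\,\|J - J^{t}\|_F^2
\quad \text{subject to} \quad J S_t \;=\; \nabla F(w^{t})\, S_t .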

Exercise

Show that the solution of the sketch-and-project problem is the projection formula derived below.

Proof: write down the Lagrangian of the constrained least-squares problem; differentiating in J and setting the result to zero gives equation (1), and the sketched constraint gives equation (2); substituting (1) into (2) gives the solution.
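A hedged worked version of this derivation under the formulation above; the multiplier notation Y and the invertibility of S_t^T S_t are assumptions:

% Lagrangian of the constrained least-squares problem, with multiplier Y:
L(J, Y) = \tfrac{1}{2}\|J - J^{t}\|_F^2 + \bigl\langle Y,\; (J - \nabla F(w^{t}))\, S_t \bigr\rangle .

% (1) Differentiating in J and setting to zero:   J = J^{t} - Y S_t^{\top} .
% (2) The sketched constraint:                    J S_t = \nabla F(w^{t})\, S_t .

% Substituting (1) into (2), solving for Y (assuming S_t^\top S_t is invertible), and substituting back:
J^{t+1} = J^{t} + \bigl(\nabla F(w^{t}) - J^{t}\bigr)\, S_t \bigl(S_t^{\top} S_t\bigr)^{-1} S_t^{\top} .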


Unbiased Condition

Lemma.

Proof:
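As an illustrative assumption rather than the lemma itself, one concrete unbiased gradient estimate built from a Jacobian estimate is the SAGA-style correction for the SGD sketch:

% For the SGD sketch S = e_j with j uniform on {1,...,n}:
g^{t} = \nabla f_j(w^{t}) \;-\; J^{t} e_j \;+\; \tfrac{1}{n}\, J^{t} \mathbf{1} ,

% which is unbiased because E_j[\nabla f_j(w^t)] = \nabla f(w^t) and E_j[J^t e_j] = (1/n) J^t \mathbf{1}:
\mathbb{E}_j\bigl[g^{t}\bigr] = \nabla f(w^{t}) .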

Exercise

Proof:

A Jacobian Based Method

Archetype Jacobian Sketching Algorithm: it looks expensive and complicated, so let us investigate a concrete special case.

Homework:

Example: minibatch-SAGA

Gradient estimate

Jacobian update

Homework:
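A minimal runnable sketch of minibatch-SAGA, combining the SAGA-style unbiased gradient estimate with a minibatch Jacobian update; the data, batch size, stepsize, and regularization weight are assumptions carried over from the earlier sketches.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data, as in the earlier sketches.
n, d = 862, 2
X = rng.normal(size=(n, d))
y = np.sign(X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=n))
lam = 1.0 / n

def batch_gradients(w, idx):
    # Columns of the true Jacobian for the sampled minibatch idx (one datum gradient per column).
    margins = y[idx] * (X[idx] @ w)
    coeffs = -y[idx] / (1.0 + np.exp(margins))
    return (X[idx] * coeffs[:, None]).T + lam * w[:, None]

batch_size = 16
L_max = np.max(np.sum(X**2, axis=1)) / 4.0 + lam
alpha = 1.0 / (3.0 * L_max)             # assumed stepsize of order 1/L_max

w = np.zeros(d)
J = np.zeros((d, n))                    # Jacobian estimate: column i stores the last gradient of f_i
Jbar = np.zeros(d)                      # (1/n) * J @ 1, kept up to date for O(d) access

for t in range(20 * (n // batch_size)):
    idx = rng.choice(n, size=batch_size, replace=False)    # minibatch sketch
    G = batch_gradients(w, idx)                             # fresh columns of the true Jacobian
    g = G.mean(axis=1) - J[:, idx].mean(axis=1) + Jbar      # SAGA-style unbiased gradient estimate
    Jbar += (G - J[:, idx]).sum(axis=1) / n                 # maintain the column average
    J[:, idx] = G                                           # sketch-and-project update for this minibatch
    w = w - alpha * g

print(w)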

Proving Convergence of Variance Reduced Methods
