Hanie Sedghi, Research Scientist at Allen Institute for Artificial Intelligence, at MLconf Seattle...


Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks

Hanie Sedghi

Allen Institute for AI

Joint work with Majid Janzamin (Twitter) and Anima Anandkumar (UC Irvine)

Introduction

Training Neural Networks

Tremendous practical impact with deep learning.
Highly non-convex optimization.
Algorithm: backpropagation.
Backpropagation can get stuck in bad local optima.

Convex vs. Non-convex Optimization

Most work is on convex analysis, yet most problems are non-convex!

Image taken from https://www.facebook.com/nonconvex

One global optimum vs. multiple local optima.
In high dimensions, there are possibly exponentially many local optima.

How to deal with non-convexity?

Toy Example

Labeled input samples; goal: binary classification.

[Figure: a one-hidden-layer network with σ(·) units over inputs x1, x2 and output y; decision boundaries at a local optimum and at the global optimum for data labeled y = 1 and y = −1.]

Local and global optima for backpropagation.

Example: bad local optima

Train a feedforward neural network with ReLU activations on MNIST.
Local optima lead to wrong classifications!

Bad initialization cannot be recovered by more iterations.
Bad initialization cannot be resolved with depth.
Bad initialization can hurt networks with ReLU or sigmoid activations.

Swirszcz, Czarnecki, and Pascanu, "Local minima in training of deep networks," 2016.
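As a purely illustrative companion to this point, the toy script below trains a tiny one-hidden-layer sigmoid network on XOR with full-batch gradient descent from several random initializations. It is my own sketch, not the experiment from the cited paper; the XOR data, architecture, learning rate, and seeds are arbitrary choices, and the outcome depends on the initialization draw.

```python
# Toy demonstration of the slide's point: the same small network and the same full-batch
# gradient descent, run from different random initializations, can end at very different
# losses. Illustrative sketch only (not the MNIST experiment from Swirszcz et al.);
# the XOR data, architecture, learning rate, and seeds are arbitrary choices.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR: not linearly separable, so the hidden layer is genuinely needed.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

def train(seed, hidden=2, lr=0.5, steps=20_000):
    rng = np.random.default_rng(seed)
    W1, b1 = rng.normal(size=(2, hidden)), np.zeros(hidden)
    w2, b2 = rng.normal(size=hidden), 0.0
    for _ in range(steps):
        h = sigmoid(X @ W1 + b1)            # hidden activations
        p = sigmoid(h @ w2 + b2)            # predicted probability of class 1
        g_out = (p - y) / len(y)            # d(mean cross-entropy)/d(output pre-activation)
        g_h = np.outer(g_out, w2) * h * (1 - h)
        w2 -= lr * (h.T @ g_out); b2 -= lr * g_out.sum()
        W1 -= lr * (X.T @ g_h);   b1 -= lr * g_h.sum(axis=0)
    p = sigmoid(sigmoid(X @ W1 + b1) @ w2 + b2)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))   # final cross-entropy loss

for seed in range(10):
    print(f"seed {seed}: final loss {train(seed):.4f}")
# Depending on the initialization, some runs may stall at a visibly higher loss than others:
# the identical backpropagation procedure, but a different basin of the loss surface.
```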

Guaranteed Training of Neural Networks

Outline

Introduction
Guaranteed Training of Neural Networks
  Algorithm
  Error Analysis
  General Framework and Extension to RNNs

Three Main Components

Guaranteed Learning through Tensor Methods

Replace the objective function: find the best tensor decomposition,

argmin_θ ‖T(x) − T(θ)‖,

where T(x) is the empirical tensor and T(θ) is a low-rank tensor parameterized by θ.

Preserves the global minimum.

Finding a globally optimal tensor decomposition: simple algorithms succeed under mild and natural conditions (Anandkumar et al. '14; Anandkumar, Ge, and Janzamin '14).

Background: Tensor Decomposition

Rank-1 tensor: T = w · a ⊗ b ⊗ c  ⇔  T(i, j, l) = w · a(i) · b(j) · c(l).

CANDECOMP/PARAFAC (CP) decomposition:

T = Σ_{j ∈ [k]} wj aj ⊗ bj ⊗ cj ∈ R^{d×d×d},   aj, bj, cj ∈ S^{d−1}.

[Figure: Tensor T = w1 · a1 ⊗ b1 ⊗ c1 + w2 · a2 ⊗ b2 ⊗ c2 + ...]

Algorithms: Alternating Least Squares (ALS), tensor power iteration, ...
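To make the CP decomposition step concrete, here is a minimal NumPy sketch of ALS on a synthetic rank-2 tensor. It is illustrative only, not the implementation behind the talk; the helper names (`unfold`, `khatri_rao`, `cp_als`), the rank, and the dimensions are my own choices.

```python
# A minimal CP decomposition via alternating least squares (ALS), using only NumPy.
# Illustrative sketch only: rank k, dimensions, and the synthetic tensor are arbitrary.
import numpy as np

def unfold(T, mode):
    """Matricize a 3-way tensor along the given mode (0, 1, or 2)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(A, B):
    """Column-wise Kronecker product of two matrices with the same number of columns."""
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1])

def cp_als(T, k, n_iters=200, seed=0):
    """Fit T ≈ sum_j w_j a_j ⊗ b_j ⊗ c_j by cycling least-squares updates over the factors."""
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((dim, k)) for dim in T.shape)
    for _ in range(n_iters):
        A = unfold(T, 0) @ np.linalg.pinv(khatri_rao(B, C).T)
        B = unfold(T, 1) @ np.linalg.pinv(khatri_rao(A, C).T)
        C = unfold(T, 2) @ np.linalg.pinv(khatri_rao(A, B).T)
    w = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0) * np.linalg.norm(C, axis=0)
    return w, A / np.linalg.norm(A, axis=0), B / np.linalg.norm(B, axis=0), C / np.linalg.norm(C, axis=0)

# Build a synthetic rank-2 tensor T = sum_j a_j ⊗ b_j ⊗ c_j, then try to recover its components.
d, k = 10, 2
rng = np.random.default_rng(1)
A0, B0, C0 = rng.standard_normal((d, k)), rng.standard_normal((d, k)), rng.standard_normal((d, k))
T = np.einsum('ir,jr,lr->ijl', A0, B0, C0)

w, A, B, C = cp_als(T, k)
T_hat = np.einsum('r,ir,jr,lr->ijl', w, A, B, C)
print("relative reconstruction error:", np.linalg.norm(T - T_hat) / np.linalg.norm(T))
# Usually close to 0 here; the recovered columns match a_j, b_j, c_j up to permutation and scaling.
```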

Three Main Components: Method of Moments

Method-of-Moments for Neural Networks

Supervised setting: observing {(xi, yi)}.
Non-linear transformations via the activation function σ(·).
Random x and y; moment possibilities: E[y ⊗ y], E[y ⊗ x], ...

[Figure: one-hidden-layer network with first-layer weights A1, activations σ(·), inputs x1, x2, x3, and output y.]

If σ(·) is linear, A1 is observed linearly in the output y.
In general, however, E[y ⊗ x] = E[σ(A1ᵀ x) ⊗ x]: no linear transformation of A1.

One solution: linearization using a derivative operator,

σ(A1ᵀ x)  →(derivative)→  σ′(·) A1ᵀ,

E[y ⊗ φ(x)] = E[∇x y],   φ(·) = ?

Moments of a Neural Network

E[y|x] := f(x) = a2ᵀ σ(A1ᵀ x)

Linearization using the derivative operator: let φm(x) denote the m-th order derivative operator.

The cross-moments E[y · φ1(x)], E[y · φ2(x)], E[y · φ3(x)] then decompose into sums of rank-1 terms in the columns of A1 (a vector, a matrix, and a third-order tensor, respectively).

[Figure: one-hidden-layer network with first-layer weights A1 and output weights a2; each cross-moment is drawn as a sum of rank-1 components.]

Why are tensors required?
Matrix decomposition recovers the subspace, not the actual weights.
Tensor decomposition uniquely recovers the weights under non-degeneracy conditions.
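A small numerical illustration of the "why tensors" point, under assumed toy dimensions: for non-orthogonal components a_j, the eigenvectors of the matrix moment span the right subspace but do not coincide with the a_j themselves, which is exactly what the third-order tensor decomposition fixes under non-degeneracy.

```python
# Why the matrix moment is not enough: the eigenvectors of M2 = sum_j λ_j a_j a_jᵀ span the
# right subspace but do not coincide with the individual a_j when the a_j are not orthogonal.
# Toy dimensions, weights λ_j, and components are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 2
A = rng.standard_normal((d, k))
A /= np.linalg.norm(A, axis=0)              # two non-orthogonal unit-norm components a_1, a_2
lam = np.array([1.0, 0.7])

M2 = (A * lam) @ A.T                        # sum_j λ_j a_j a_jᵀ
eigvals, eigvecs = np.linalg.eigh(M2)
top = eigvecs[:, np.argsort(eigvals)[-k:]]  # top-k eigenvectors: an orthonormal basis of the span

# The eigenvectors span the same subspace as the a_j ...
print("subspace error:", np.linalg.norm(top @ top.T @ A - A))      # ~0
# ... but they are generally not aligned with the individual a_j themselves
# (cosines below 1 unless the a_j happen to be orthogonal).
print("|cos(eigenvector, a_j)|:\n", np.abs(top.T @ A))
```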


Derivative Operator: Score Function Transformations

For continuous x with pdf p(·), the first-order score function is

S1(x) := −∇x log p(x), with x ∈ R^d and S1(x) ∈ R^d.

The m-th order score function is

Sm(x) := (−1)^m ∇x^(m) p(x) / p(x),

so S2(x) ∈ R^{d×d} and S3(x) ∈ R^{d×d×d}.

Theorem (Score function property, JSA'14)
Providing derivative information: let E[y|x] := f(x). Then

E[y ⊗ Sm(x)] = E[∇x^(m) f(x)].

"Score Function Features for Discriminative Learning: Matrix and Tensor Framework" by M. Janzamin, H. Sedghi, and A. Anandkumar, Dec. 2014.
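To make the theorem concrete, here is a quick Monte Carlo check of the m = 1 case for a standard Gaussian input, where S1(x) = x and the identity reduces to Stein's lemma. It is a sanity-check sketch I wrote, not code from the paper; the network sizes, tanh activation, and noiseless y are arbitrary assumptions.

```python
# Monte Carlo check of E[y ⊗ S1(x)] = E[∇_x f(x)] for x ~ N(0, I), where S1(x) = x.
# Toy sizes and a tanh activation; y = f(x) is taken noiseless for simplicity.
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 5, 3, 1_000_000

A1 = rng.standard_normal((d, k))          # first-layer weights
a2 = rng.standard_normal(k)               # output weights
dsigma = lambda z: 1.0 - np.tanh(z) ** 2  # derivative of the tanh activation

x = rng.standard_normal((n, d))           # x ~ N(0, I_d), so S1(x) = x
y = np.tanh(x @ A1) @ a2                  # y = f(x) = a2ᵀ σ(A1ᵀ x)

# Left-hand side: cross-moment of the label with the first-order score function.
lhs = (y[:, None] * x).mean(axis=0)

# Right-hand side: E[∇_x f(x)], using ∇_x f(x) = A1 (a2 ⊙ σ′(A1ᵀ x)) per sample.
rhs = ((dsigma(x @ A1) * a2) @ A1.T).mean(axis=0)

print(np.round(lhs, 3))
print(np.round(rhs, 3))   # the two vectors should agree up to Monte Carlo error (~1e-3 here)
```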

Three Main Components: Method of Moments + Probabilistic Models & Score Functions

NN-LIFT: Neural Network LearnIng using Feature Tensors

Input: x ∈ R^d, with score function S3(x) ∈ R^{d×d×d}.

Estimate the cross-moment from the labeled data {(xi, yi)}:

(1/n) Σ_{i=1}^{n} yi ⊗ S3(xi).

CP tensor decomposition: the rank-1 components are the estimates of the columns of A1.

Fourier technique ⇒ b1 (bias of the first layer).
Linear regression ⇒ a2, b2 (parameters of the last layer).
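The sketch below checks the key structural fact behind this pipeline for a standard Gaussian input, where S3 has a closed form: the empirical cross-moment is close to a rank-k tensor whose components are the columns of A1. It is an illustrative check I wrote, not the paper's code; the dimensions, weights, and tanh activation are toy assumptions.

```python
# NN-LIFT cross-moment step for x ~ N(0, I), where the third-order score function is
#   S3(x)_{ijl} = x_i x_j x_l − x_i δ_{jl} − x_j δ_{il} − x_l δ_{ij}.
# The empirical moment (1/n) Σ_i y_i · S3(x_i) should be close to Σ_j λ_j a_j ⊗ a_j ⊗ a_j,
# which is what the CP decomposition step exploits. Toy setup, not the paper's experiments.
import numpy as np

rng = np.random.default_rng(0)
d, k, n, chunk = 4, 2, 1_000_000, 200_000
A1 = rng.standard_normal((d, k))
A1 /= np.linalg.norm(A1, axis=0)            # unit-norm columns (the "true" first-layer weights)
a2 = np.array([1.0, -1.5])                  # output-layer weights

M3 = np.zeros((d, d, d))                    # accumulates (1/n) Σ y_i · x_i ⊗ x_i ⊗ x_i
m1 = np.zeros(d)                            # accumulates (1/n) Σ y_i · x_i
lam_acc = np.zeros(k)                       # accumulates E[tanh'''(a_jᵀ x)] for the model tensor
for _ in range(n // chunk):
    x = rng.standard_normal((chunk, d))     # x ~ N(0, I_d)
    z = np.tanh(x @ A1)
    y = z @ a2                              # y = a2ᵀ tanh(A1ᵀ x), noiseless for simplicity
    M3 += np.einsum('s,si,sj,sl->ijl', y, x, x, x, optimize=True) / n
    m1 += (y[:, None] * x).sum(axis=0) / n
    lam_acc += (-2.0 * (1 - z**2) * (1 - 3 * z**2)).sum(axis=0) / n  # tanh'''(t) = −2(1−tanh²t)(1−3tanh²t)

I = np.eye(d)
M3 -= np.einsum('i,jl->ijl', m1, I) + np.einsum('j,il->ijl', m1, I) + np.einsum('l,ij->ijl', m1, I)

# Model-implied tensor Σ_j λ_j a_j ⊗ a_j ⊗ a_j with λ_j = a2[j] · E[tanh'''(a_jᵀ x)].
T_model = np.einsum('r,ir,jr,lr->ijl', a2 * lam_acc, A1, A1, A1)
print("relative difference:", np.linalg.norm(M3 - T_model) / np.linalg.norm(T_model))
# Expected to be small (a few percent here, shrinking as n grows); CP-decomposing M3
# would therefore recover the columns of A1 up to scale, sign, and permutation.
```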

Outline: Error Analysis

Estimation Error Bound

Theorem (JSA'15)
Two-layer NN, realizable setting.
Full column rank assumption on the weight matrix A1.
For number of samples n = poly(d, k), we have w.h.p.

|f̂(x) − f(x)|² ≤ O(1/n).

"Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods" by M. Janzamin, H. Sedghi, and A. Anandkumar, June 2015.

Risk Bound

The generalization error of the neural network decomposes into:

Approximation error in fitting the target function with a neural network.
Estimation error in estimating the weights of a fixed neural network.

Known: continuous functions on a compact domain can be arbitrarily well approximated by neural networks with one hidden layer.

Approximation Error

Approximation error is related to the Fourier spectrum of f(x) (Barron '93), where E[y|x] = f(x):

F(ω) := ∫_{R^d} f(x) e^{−j⟨ω,x⟩} dx,

Cf := ∫_{R^d} ‖ω‖₂ · |F(ω)| dω.

Approximation error ≤ Cf / √k.

[Figure: f(x), its Fourier transform F(ω), and the weighted spectrum ‖ω‖·|F(ω)|.]
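As a quick worked example of the quantity Cf (under Barron's convention of writing f as an integral against its Fourier distribution, which fixes the normalization), a single sinusoidal target has Cf equal to the norm of its frequency, which quantifies the intuition that high-frequency functions need more hidden units:

```latex
% Worked example under Barron's convention f(x) = \int e^{j\langle\omega,x\rangle}\,\tilde F(d\omega):
% a single sinusoid places Fourier mass 1/2 at each of \pm w.
\[
  f(x) = \cos(\langle w, x\rangle)
       = \tfrac12 e^{\,j\langle w, x\rangle} + \tfrac12 e^{-j\langle w, x\rangle}
  \quad\Longrightarrow\quad
  C_f = \tfrac12\,\|w\|_2 + \tfrac12\,\|{-w}\|_2 = \|w\|_2 .
\]
% With the bound (approximation error) \le C_f/\sqrt{k}, reaching a fixed accuracy \varepsilon
% needs on the order of k \approx \|w\|_2^2/\varepsilon^2 hidden units.
```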

Our Main Result

Theorem (JSA'15)
Approximating an arbitrary function f(x) with bounded Cf; n samples, d input dimensions, k neurons. Assume Cf is small. Then

Ex[ |f̂(x) − f(x)|² ] ≤ O(Cf²/k) + O(1/n).

Polynomial sample complexity n in terms of the dimensions d, k.
Computational complexity is the same as SGD with enough parallel processors.

"Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods" by M. Janzamin, H. Sedghi, and A. Anandkumar, June 2015.

Experiments: NN-LIFT vs. Backpropagation

MNIST dataset.

Use a Denoising Autoencoder (DAE) to estimate the score function: the DAE learns the first-order score function, and we learn higher-order score functions recursively (JSA '14).

NN-LIFT outperforms backpropagation with SGD even for low hidden dimensions: a 10% difference with 128 neurons.
Using Adam is not enough for backpropagation to win in high dimensions.
If we down-sample the labeled data, NN-LIFT outperforms Adam by 6–12%.
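The claim that a denoising autoencoder learns the first-order score can be seen from Tweedie's formula: the optimal (MMSE) denoiser satisfies E[x | x̃] = x̃ + σ²∇ log p_σ(x̃), so (denoised − noisy)/σ² estimates the score of the noise-smoothed density. The sketch below checks this for 1-D Gaussian data, where the optimal denoiser is linear and can be fit by least squares; it is a toy illustration of the relationship, not the MNIST DAE used in the experiments.

```python
# Why a denoising autoencoder learns the first-order score: by Tweedie's formula, the MMSE
# denoiser r*(x̃) = E[x | x̃] of data corrupted as x̃ = x + σ·ε, ε ~ N(0,1), satisfies
#   (r*(x̃) − x̃) / σ² = ∇ log p_σ(x̃).
# Toy 1-D check with Gaussian data; the noise level and sample size are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 0.3, 100_000
x = rng.standard_normal(n)                      # clean data x ~ N(0, 1)
x_noisy = x + sigma * rng.standard_normal(n)    # corrupted input given to the "denoiser"

# Fit the denoiser from samples (linear least squares, playing the role of the DAE here).
a = (x @ x_noisy) / (x_noisy @ x_noisy)         # learned slope, should approach 1 / (1 + σ²)
denoised = a * x_noisy

score_from_denoiser = (denoised - x_noisy) / sigma**2   # Tweedie: ≈ ∇ log p_σ(x̃)
score_true = -x_noisy / (1.0 + sigma**2)                # exact score of p_σ = N(0, 1 + σ²)

print("learned slope:", a, "vs 1/(1+σ²) =", 1.0 / (1.0 + sigma**2))
print("RMS score error:", np.sqrt(np.mean((score_from_denoiser - score_true) ** 2)))
# The error is small and shrinks as n grows, i.e. the trained denoiser encodes the score.
```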

Outline: General Framework and Extension to RNNs

FEAST: Feature ExtrAction using Score function Tensors

Mixture of Generalized Linear Models (GLMs):
E[y | x, h] = g(⟨Uh, x⟩ + ⟨b, h⟩)

Mixture of linear regressions:
E[y | x] = Σi πi (⟨ui, x⟩ + bi)

"Provable Tensor Methods for Learning Mixtures of Generalized Linear Models" by H. Sedghi, M. Janzamin, and A. Anandkumar, AISTATS 2016.

Guaranteed Training of Recurrent Neural Networks

[Figure: (a) feedforward NN, (b) input-output RNN (IO-RNN) with inputs xt and outputs yt at every step, (c) bidirectional RNN (BRNN) with a forward hidden state ht and a backward hidden state zt.]

IO-RNN:  E[yt | ht] = A2ᵀ ht,   ht = f(A1 xt + U ht−1)

BRNN:  E[yt | ht, zt] = A2ᵀ [ht; zt],   ht = f(A1 xt + U ht−1),   zt = g(B1 xt + V zt+1)

Challenges:
The input sequence is not i.i.d.
Learning the weight matrices between hidden layers.
Need to ensure bounded state evolution.

Approach:
Markovian evolution of the input sequence.
Extension of score functions to Markov chains.
Polynomial activation functions.

"Training Input-Output Recurrent Neural Networks through Spectral Methods" by H. Sedghi and A. Anandkumar, preprint 2016.
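For reference, here is a minimal simulation of the IO-RNN generative model written in the equations above, with a polynomial activation as in the stated assumptions; the dimensions, weight scales, and the specific polynomial are arbitrary toy choices, not the paper's setup.

```python
# Minimal simulation of the IO-RNN model from the slide:
#   h_t = f(A1 x_t + U h_{t−1}),   E[y_t | h_t] = A2ᵀ h_t,
# with a polynomial activation f and weights scaled so the hidden-state evolution stays bounded.
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, d_y, T = 3, 4, 2, 50

A1 = 0.2 * rng.standard_normal((d_h, d_x))
U = rng.standard_normal((d_h, d_h))
U *= 0.5 / np.linalg.norm(U, 2)           # spectral norm 0.5 keeps the recursion well behaved
A2 = 0.3 * rng.standard_normal((d_h, d_y))

f = lambda z: z - 0.1 * z ** 3            # simple odd polynomial activation

xs = rng.standard_normal((T, d_x))        # input sequence (i.i.d. here; Markovian in general)
h = np.zeros(d_h)
ys = np.zeros((T, d_y))
for t in range(T):
    h = f(A1 @ xs[t] + U @ h)             # hidden-state recursion
    ys[t] = A2.T @ h                      # conditional mean of the output at time t

print(np.round(ys[:3], 3))                # first few output means of the simulated sequence
```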

References

M. Janzamin, H. Sedghi, and A. Anandkumar, "Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods," 2015.

M. Janzamin*, H. Sedghi*, and A. Anandkumar, "Score Function Features for Discriminative Learning: Matrix and Tensor Framework," 2014.

H. Sedghi and A. Anandkumar, "Provable Methods for Training Neural Networks with Sparse Connectivity," NIPS Deep Learning Workshop 2014; ICLR Workshop 2015.

H. Sedghi, M. Janzamin, and A. Anandkumar, "Provable Tensor Methods for Learning Mixtures of Generalized Linear Models," AISTATS 2016.

H. Sedghi and A. Anandkumar, "Training Input-Output Recurrent Neural Networks through Spectral Methods," 2016.

Conclusion and Future Work

Summary
First guaranteed risk bound for training neural networks.
Efficient sample and computational complexity.
Higher-order score functions as new features, useful in general for recovering new discriminative features.
Extension to input sequences for training RNNs and BRNNs.

Future Work
Extension to training convolutional neural networks.
Empirical performance.
Regularization analysis.

Thank You!
