

An Optimal Control Framework for Efficient Training of Deep Neural Networks

CoSIP Intense Course on Deep Learning, Berlin
December 1, 2017

Lars Ruthotto
Department of Mathematics and Computer Science, Emory University
[email protected]


Fundamental (Open) Questions

Expressibility
- how to find a neural network that can approximate the function of interest?
- successes: approximation theorems, optimal sparsity, ...
- communities: harmonic analysis, approximation theory, ...

Learning
- how to (efficiently) train a neural network?
- successes: stochastic gradient and a zoo of variants (including ADAM, second-order methods, ...)
- community: mainly optimization and optimal control

Generalization
- does the neural network generalize?
- successes: VC dimensions, bias/variance dilemma, regularization, ...
- community: mainly statistics

Computing
- how to design a network that is expressive and generalizes well, and which method will train it efficiently?
- successes: hardware, ...
- community: scientific computing


Team and Acknowledgements

Joint work: Emory ↔ Xtract Tech. ↔ University of British Columbia

Lili Meng, Bo Chang, Elliot Holtham, Eldad Haber, Seong Hwan Jun

Funding:
- This work is supported in part by NSF award DMS 1522599
- Thanks to NVIDIA Corp. for the donation of a TITAN X GPU


Agenda: Optimal Control Framework for Deep Learning

- Deep Learning meets Optimal Control
- Stability and Generalization
  - when is deep learning well-posed?
  - stabilizing the forward propagation
- Convolutional Neural Networks as PDEs
  - continuity in feature space
  - allows us to interpret and categorize CNNs
- Multiscale Parabolic CNNs
  - image classification across scales
  - shallow-to-deep training
- Reversible Hyperbolic CNNs
  - memory-efficient + stable → arbitrarily deep

References:
- E. Haber, LR. Stable Architectures for Deep Neural Networks. Inverse Problems, 2017.
- E. Holtham, E. Haber, LR. Learning Across Scales. AAAI, 2018.
- B. Chang, L. Meng, E. Holtham, E. Haber, LR, D. Begert. Reversible Architectures for Arbitrarily Deep ResNNs. AAAI, 2018.


Deep Learning meets Optimal Control


Deep Learning Revolution (?)

Y_{j+1} = σ(K_j Y_j + b_j)

- neural networks with a particular (deep) architecture
- invented in the 1950s
- able to "learn" complicated patterns from data
- applications: image classification, face recognition, segmentation, driverless cars, ...
- recent success fueled by: massive data sets, computing power
- a few recent quotes:
  - "Apple Is Bringing the AI Revolution to Your iPhone", WIRED '16
  - "Data Scientist: Sexiest Job of the 21st Century", Harvard Business Review '17


Supervised Learning using Deep Neural Networks

(figure: training data (Y_0, C) → propagated features (Y_N, C) → classification result)

Supervised Deep Learning Problem

Given training data, Y_0, and labels, C, find transformation parameters (K, b) and classification weights (W, μ) such that the DNN predicts the data-label relationship (and generalizes to new data), by solving

minimize_{K, b, W, μ}  loss[g(W Y_N + μ), C] + regularizer[K, b, W, μ]

subject to  Y_{j+1} = activation(K_j Y_j + b_j),  j = 0, ..., N-1.


Deep Residual Neural Networks
Award-winning forward propagation:

Y_{j+1} = Y_j + h K_{j,2} σ(K_{j,1} Y_j + b_j),  j = 0, 1, ..., N-1.

ResNet is a forward Euler discretization of

∂_t y(t, K, b, y_0) = K_2(t) σ(K_1(t) y(t, K, b, y_0) + b(t)),  y(0, K, b, y_0) = y_0.

deep learning ↔ trajectory problem, image registration, mass transport, ...

In short, write ResNets as

∂_t y(t, θ, y_0) = f(y, θ(t)),  y(0, θ, y_0) = y_0.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. IEEE Conf. on CVPR, 770–778, 2016.

(figure: input features Y_0 and propagated features Y_N)
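To make the ODE analogy concrete, here is a minimal NumPy sketch of the forward-Euler/ResNet propagation above (hypothetical layer width, random weights, tanh activation; an illustration, not the trained network from the experiments):

```python
import numpy as np

def resnet_forward_euler(Y0, K1s, K2s, bs, h=0.1):
    """Propagate features with Y_{j+1} = Y_j + h K_{j,2} sigma(K_{j,1} Y_j + b_j)."""
    Y = Y0.copy()
    for K1, K2, b in zip(K1s, K2s, bs):
        Y = Y + h * K2 @ np.tanh(K1 @ Y + b)
    return Y

# toy example: width 4, depth 10, 5 samples (all weights random, illustration only)
rng = np.random.default_rng(0)
width, depth, n_samples = 4, 10, 5
K1s = [rng.standard_normal((width, width)) for _ in range(depth)]
K2s = [rng.standard_normal((width, width)) for _ in range(depth)]
bs  = [rng.standard_normal((width, 1)) for _ in range(depth)]
Y0  = rng.standard_normal((width, n_samples))
print(resnet_forward_euler(Y0, K1s, K2s, bs).shape)  # (4, 5)
```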


Optimal Control Framework for Deep Learning

(figure: training data (Y_0, C) → propagated features (Y(T, θ, Y_0), C) → classification result)

Supervised Deep Learning Problem

Given training data, Y_0, and labels, C, find network parameters θ and classification weights W, μ such that the DNN predicts the data-label relationship (and generalizes to new data), i.e., solve

minimize_{θ, W, μ}  loss[g(W Y(T, θ, Y_0) + μ), C] + regularizer[θ, W, μ]


Optimal Control Approaches to Deep Learning

(figure: features propagated from input to output over time, steered by controls 1-5)

Deep Learning ↔ trajectory problem
- use for analysis and new algorithms
- invent your own architecture

E. Haber, LR. Stable Architectures for Deep Neural Networks. Inverse Problems, accepted 2017.

Weinan E. A Proposal on Machine Learning via Dynamical Systems. Communications in Mathematics and Statistics, 5(1), 1–11, 2017.


Blessing of Dimensionality (or Width)
Setup: ResNN, 9 fully connected single layers, σ = tanh

(figure: input features + labels, propagated features, classification result with ≈ 100% validation accuracy)

increase the dimension (width) → no need to change the topology!


Stability and Well-Posedness


Stability of DNN: Goals

Our goals:
- predict a priori whether an architecture can generalize
- design architectures that generalize
- impose regularization to find solutions that generalize

Main ingredients of well-posed inverse problems:
1. well-posed forward problem
2. bounded inverse


Stability of Continuous Forward Propagation

Interpret ResNet as a discretization of the initial value problem

∂_t y(t, K, b, y_0) = σ(K(t) y(t, K, b, y_0) + b(t)),  y(0, K, b, y_0) = y_0.

The IVP is stable if for any v ∈ R^n

‖y(T, K, b, y_0) − y(T, K, b, y_0 + εv)‖_2 = O(ε).

(figure: trajectories of an unstable IVP vs. a stable IVP)

idea: ensure stability by design / constraints on K, b


Stability of Forward Propagation

(figure: 2D phase portraits for K_+ with λ(K_+) = 2, K_- with λ(K_-) = -2, and K_0 with λ(K_0) = ±i)

Fact: The ODE y'(t) = f(y) is stable if the real parts of the eigenvalues of the Jacobian J are non-positive. For non-autonomous ODEs we also need J to change slowly in time (a rigorous argument uses the framework of kinematic eigenvalues). For the ResNet,

J(t) = diag(σ'(K(t) Y(t) + b(t))) K(t).

If the activation is monotonically increasing, real(eig(K(t))) ≤ 0 is sufficient.
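A quick numerical illustration of this criterion (a sketch with hypothetical 2×2 weights, not code from the talk): form J = diag(σ'(K y + b)) K for a tanh activation and inspect the real parts of its eigenvalues.

```python
import numpy as np

def layer_jacobian(K, y, b):
    """Jacobian of f(y) = tanh(K y + b) with respect to y: diag(sech^2(K y + b)) K."""
    s = 1.0 / np.cosh(K @ y + b) ** 2   # tanh'(z) = sech^2(z) >= 0
    return np.diag(s.ravel()) @ K

K_stable   = np.array([[-2.0, 0.0], [0.0, -0.5]])   # Re(eig(K)) <= 0
K_unstable = np.array([[ 2.0, 0.0], [0.0,  0.5]])   # Re(eig(K)) > 0
y, b = np.array([[0.3], [-0.1]]), np.zeros((2, 1))

for name, K in [("stable", K_stable), ("unstable", K_unstable)]:
    J = layer_jacobian(K, y, b)
    print(name, "max Re(eig(J)) =", np.max(np.linalg.eigvals(J).real))
```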


Enforcing Stability: Antisymmetric Transformation

Two examples of unconditionally stable networks.

ResNet with antisymmetric transformation matrix:

d/dt y(t) = σ((K(t) − K(t)^T) y(t) + b(t)).

ResNet with auxiliary variable and antisymmetric block matrix:

d/dt [y(t); z(t)] = σ( [0, A(t); −A(t)^T, 0] [y; z] + b(t) ).

How about stability of the discrete system?


Stability of Discrete Forward Problem

(figure: discrete trajectories for forward Euler vs. Adams-Bashforth 3)

Note: ResNet (forward Euler) is not stable for antisymmetric transformations. Better: the three-step Adams-Bashforth method

Y_{j+1} = Y_j + (h/12) ( 23 σ(K_j Y_j + b_j) − 16 σ(K_{j-1} Y_{j-1} + b_{j-1}) + 5 σ(K_{j-2} Y_{j-2} + b_{j-2}) ).
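A minimal sketch of the three-step Adams-Bashforth propagation (hypothetical sizes and weights; the first two steps are bootstrapped with forward Euler, which is one of several reasonable choices):

```python
import numpy as np

def ab3_forward(Y0, Ks, bs, h):
    """Propagate Y' = sigma(K(t) Y + b(t)) with the 3-step Adams-Bashforth method."""
    f = lambda K, Y, b: np.tanh(K @ Y + b)
    Ys = [Y0]
    # bootstrap the multistep method with two forward Euler steps (assumption)
    for j in range(2):
        Ys.append(Ys[-1] + h * f(Ks[j], Ys[-1], bs[j]))
    for j in range(2, len(Ks)):
        fj   = f(Ks[j],     Ys[j],     bs[j])
        fjm1 = f(Ks[j - 1], Ys[j - 1], bs[j - 1])
        fjm2 = f(Ks[j - 2], Ys[j - 2], bs[j - 2])
        Ys.append(Ys[j] + (h / 12.0) * (23 * fj - 16 * fjm1 + 5 * fjm2))
    return Ys[-1]
```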


Symplectic Integration
Hamiltonian-inspired neural networks (Verlet integration)

Z_{j+1/2} = Z_{j-1/2} − h σ(K_j Y_j + b_j)
Y_{j+1} = Y_j + h σ(K_j^T Z_{j+1/2} + b_j)

Necessary Conditions for Generalization
- continuous forward propagation stable when real(eig(K)) ≤ 0
- learning problem well-posed when real(eig(K)) ≈ 0
- stable scheme for discrete forward propagation and h "small enough"

Simple explanation of (and cure for) exploding and vanishing gradients!

Y. Bengio, P. Simard, P. Frasconi. Learning Long-Term Dependencies with Gradient Descent Is Difficult. IEEE Trans. on Neural Networks, 5(2):157–166, 1994.


Example: Impact of Network Depth
Classification problem generated from the peaks function in MATLAB®

Data setup
- 2,000 points in 2D, 5 classes
- Residual Neural Network
- tanh activation, softmax classifier
- multilevel: 32 layers → 64 layers

Compare three configurations:
1. "deep": T = 5 (3rd order multistep)
2. "medium": T = 2 (1st order Verlet)
3. "shallow": T = 0.2 (3rd order multistep)

(figure: features and labels)

Q: how does learning performance compare?


Example: Impact of Network Depth - Convergence

(figure: objective function and validation accuracy over iterations for 32 and 64 layers, shown for the deep AB3 (T = 5), medium Verlet (T = 2), and shallow AB3 (T = 0.2) configurations)


Example: Impact of Network Depth - Dynamics

(figure: propagated features over time for deep AB3 (T = 5), medium Verlet (T = 2), and shallow AB3 (T = 0.2))


PDE Interpretation of Convolutional Neural Networks


Convolutions and PDEs
Let y be a 1D grid function associated with y(x) (grid: n cells of width h_x = 1/n). Then

K(θ) y = [θ_1, θ_2, θ_3] * y = ( (β_1/4) [1, 2, 1] + (β_2/(2 h_x)) [−1, 0, 1] + (β_3/h_x^2) [−1, 2, −1] ) * y,

where the coefficients β_1, β_2, β_3 satisfy

[ 1/4, −1/(2 h_x), −1/h_x^2 ;
  1/2,      0,       2/h_x^2 ;
  1/4,  1/(2 h_x),  −1/h_x^2 ]  [β_1; β_2; β_3] = [θ_1; θ_2; θ_3].

In the limit h_x → 0 this gives

K(θ(t)) = β_1(t) + β_2(t) ∂_x + β_3(t) ∂_x^2.

The convolution operator K is a linear combination of differential operators.
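A small sketch of this change of basis in 1D (hypothetical kernel values and grid spacing): solve the 3×3 system above to express a stencil [θ_1, θ_2, θ_3] in the averaging / first-derivative / second-derivative basis, then recombine to check.

```python
import numpy as np

hx = 1.0 / 32                        # grid spacing (assumed)
theta = np.array([0.1, -0.4, 0.3])   # some 3-point convolution kernel (illustrative)

# columns correspond to [1 2 1]/4, [-1 0 1]/(2 hx), [-1 2 -1]/hx^2
A = np.array([
    [0.25, -1.0 / (2 * hx), -1.0 / hx**2],
    [0.50,  0.0,             2.0 / hx**2],
    [0.25,  1.0 / (2 * hx), -1.0 / hx**2],
])
beta = np.linalg.solve(A, theta)
print("beta (average, first-, second-derivative weights):", beta)

# sanity check: the three elementary stencils recombine to the original kernel
recon = (beta[0] / 4) * np.array([1, 2, 1]) \
      + (beta[1] / (2 * hx)) * np.array([-1, 0, 1]) \
      + (beta[2] / hx**2) * np.array([-1, 2, -1])
print("reconstruction error:", np.max(np.abs(recon - theta)))
```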


Parabolic CNN
In the original residual net choose K_1 = K and K_2 = −K^T. This gives the parabolic PDE

Y_t = −K(t)^T σ(K(t) Y + b(t)),  Y(0) = Y_0.

The Jacobian

J(t) = −K(t)^T diag(σ'(K(t) Y + b(t))) K(t)

is symmetric and negative semi-definite (σ' ≥ 0), so the dynamics are stable if K, b do not change too quickly. Use a forward Euler discretization with h small enough:

Y_{j+1} = Y_j − h K_j^T σ(K_j Y_j + b_j),  j = 0, 1, ..., N-1.

Similar to anisotropic diffusion (popular in image processing)

Y. Chen, T. Pock. Trainable Nonlinear Reaction Diffusion. IEEE PAMI, 39(6), 1256–1272, 2017.
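A minimal sketch of the parabolic forward propagation (dense matrices standing in for convolutions, random weights; illustration only, not the CNN from the papers):

```python
import numpy as np

def parabolic_forward(Y0, Ks, bs, h):
    """Y_{j+1} = Y_j - h K_j^T sigma(K_j Y_j + b_j): a discrete, diffusion-like network."""
    Y = Y0.copy()
    for K, b in zip(Ks, bs):
        Y = Y - h * K.T @ np.tanh(K @ Y + b)
    return Y

rng = np.random.default_rng(1)
width, depth = 8, 50
Ks = [rng.standard_normal((width, width)) for _ in range(depth)]
bs = [np.zeros((width, 1)) for _ in range(depth)]
Y0 = rng.standard_normal((width, 3))
# with a small enough h the propagation stays bounded (diffusion-like decay)
print(np.linalg.norm(parabolic_forward(Y0, Ks, bs, h=0.02)))
```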


Hamiltonian CNN

Introducing an auxiliary variable Z, consider the dynamics

∂_t Y(t) = K_1^T(t) σ(K_1(t) Z(t) + b_1(t)),
∂_t Z(t) = −K_2^T(t) σ(K_2(t) Y(t) + b_2(t)).

In matrix form this is

∂_t [Y; Z] = [K_1^T, 0; 0, −K_2^T] σ( [0, K_1; K_2, 0] [Y; Z] + [b_1; b_2] ).

(One can show that the eigenvalues of the Jacobian are all imaginary, hence stability when K_1, K_2, b_1, b_2 change slowly in time.) Discretize using the Verlet method:

Y_{j+1} = Y_j + h K_{j,1}^T σ(K_{j,1} Z_j + b_{j,1}),
Z_{j+1} = Z_j − h K_{j,2}^T σ(K_{j,2} Y_{j+1} + b_{j,2}).
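A sketch of this Verlet forward propagation with dense matrices (random weights reused across layers to mimic slowly varying parameters; purely illustrative):

```python
import numpy as np

def hamiltonian_forward(Y0, Z0, K1s, K2s, b1s, b2s, h):
    """Verlet discretization: the Y update uses Z_j, the Z update uses the new Y_{j+1}."""
    Y, Z = Y0.copy(), Z0.copy()
    for K1, K2, b1, b2 in zip(K1s, K2s, b1s, b2s):
        Y = Y + h * K1.T @ np.tanh(K1 @ Z + b1)
        Z = Z - h * K2.T @ np.tanh(K2 @ Y + b2)
    return Y, Z

rng = np.random.default_rng(2)
w, depth, h = 4, 100, 0.05
K1s = [0.5 * rng.standard_normal((w, w))] * depth   # identical layers ("slowly varying" weights)
K2s = K1s
b1s = b2s = [np.zeros((w, 1))] * depth
YN, ZN = hamiltonian_forward(rng.standard_normal((w, 2)), rng.standard_normal((w, 2)),
                             K1s, K2s, b1s, b2s, h)
print(np.linalg.norm(YN), np.linalg.norm(ZN))  # remains bounded: oscillatory, not explosive
```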


Second-Order Network

Consider the second-order forward dynamics

∂_tt Y = −K(t)^T σ(K(t) Y + b(t))

and their leapfrog discretization

Y_{j+1} = 2 Y_j − Y_{j-1} − h^2 K_j^T σ(K_j Y_j + b_j).

Similar to: Full Waveform Inversion,Ultrasound, . . .
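A sketch of the leapfrog update (hypothetical dense weights; the missing second initial state is taken as Y_1 = Y_0, i.e., zero initial velocity, which is an assumption):

```python
import numpy as np

def leapfrog_forward(Y0, Ks, bs, h):
    """Y_{j+1} = 2 Y_j - Y_{j-1} - h^2 K_j^T sigma(K_j Y_j + b_j)."""
    Yprev, Y = Y0.copy(), Y0.copy()   # zero initial "velocity" (assumption)
    for K, b in zip(Ks, bs):
        Ynew = 2 * Y - Yprev - h**2 * K.T @ np.tanh(K @ Y + b)
        Yprev, Y = Y, Ynew
    return Y
```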


Loss Function in CNNs
Let C be some target values, Y features, and W classification weights.

Regression:

loss(W, Y) = (1/2) ‖W Y − C‖^2

If Y and W are discretizations of Y(x) and W(x), the loss is a discretization of

ℓ(W, Y) = ( ∫_Ω W(x) Y(x) dx − C )^2

⇒ add a factor h_x^2 and model W as an image, e.g., enforce smoothness via

R(W) = ∫_Ω ‖∇W(x)‖^2 dx.

Classification: similar, since the hypothesis functions can be written as

C_pred(W, Y) = g(h_x^2 W Y).


Some Challenges in CNN

Computations
- networks have a width and a depth
- toy example: width 16, depth 20 (time steps); forward propagation: 5,120 2D convolutions per image

Memory Consumption
- adjoint equations (backpropagation) need intermediate states (hidden features)
- toy example (continued): training data is 5k images with 32 × 32 pixels; storage (just features): 130 GB (double) or 65 GB (single)

Architecture Design & Interpretation
- CNNs should be easy to train and generalize well
- CNNs should be difficult to fool (adversarial examples)
- can we understand the reasoning of a CNN?

Pei et al., DeepXplore, 2017


Parabolic Networks


Parabolic Residual Neural Networks

Recall the decay property of the heat equation. Example:

∂_t y(t, x) = ∂_xx y(t, x),  + initial and boundary conditions.

Some consequences for learning
- forward propagation is asymptotically stable (if the kernels are constant in time)
- the network is robust against perturbations of the inputs (adversarial examples)
- the learning problem is ill-posed (cf. the inverse heat equation)
- numerical methods for parabolic problems include multiscale, multilevel, ROM, ...


Multi-Resolution Learning

(figure: image pyramid - level 1, 12 × 12; level 2, 24 × 24; level 3, 48 × 48; level 4, 96 × 96)

Multi-Resolution Learning (algorithm)

Restrict the images n times
θ_0, W_0, μ_0 ← random initialization
for j = 1 : n do
    optimize with data on level j, starting from θ_{j-1}, W_{j-1}, μ_{j-1}
    obtain θ*, W*, μ*
    θ_j ← prolongate θ*
    W_j ← interpolate W*
end

How to prolongate the kernels?
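Before turning to how the kernels are prolongated (next slide), here is a minimal sketch of the loop above. The helpers `optimize`, `prolongate_kernels`, `interpolate_weights`, and `restrict` are hypothetical placeholders for whatever trainer and transfer operators one uses:

```python
def multiresolution_learning(images, labels, n_levels, optimize,
                             prolongate_kernels, interpolate_weights, restrict):
    """Coarse-to-fine training loop (sketch, hypothetical helper API)."""
    # build an image pyramid: index 0 is the coarsest level, the last entry is full resolution
    pyramid = [images]
    for _ in range(n_levels - 1):
        pyramid.insert(0, restrict(pyramid[0]))

    theta, W, mu = None, None, None   # random initialization happens inside optimize on level 1
    for level, data in enumerate(pyramid, start=1):
        theta, W, mu = optimize(data, labels, init=(theta, W, mu))
        if level < n_levels:
            theta = prolongate_kernels(theta)   # e.g. via the Galerkin projection below
            W = interpolate_weights(W)
    return theta, W, mu
```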


Galerkin Projection of Convolution Kernels

K_H = R K_h P,

where
- K_h: fine-mesh operator (given)
- R: restriction (e.g., averaging)
- P: prolongation (e.g., interpolation)

Remarks:
- Galerkin: R = γ P^T
- coarse → fine: unique if the kernel size is constant
- only a small linear solve required

(figure: a fine-mesh convolution and the corresponding coarse-mesh convolution, related via prolongation and restriction)
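A 1D toy sketch of the Galerkin projection K_H = R K_h P (periodic boundary conditions, full-weighting restriction, linear-interpolation prolongation; these specific choices are assumptions, not prescribed by the slide):

```python
import numpy as np

def circulant_conv_matrix(kernel, n):
    """n x n circulant matrix applying a centered 3-tap kernel with periodic BCs."""
    K = np.zeros((n, n))
    c = len(kernel) // 2
    for i in range(n):
        for j, w in enumerate(kernel):
            K[i, (i + j - c) % n] = w
    return K

n_fine, n_coarse = 16, 8
# full-weighting restriction R; prolongation P is linear interpolation
R = np.zeros((n_coarse, n_fine))
for i in range(n_coarse):
    R[i, (2 * i - 1) % n_fine] = 0.25
    R[i, 2 * i] = 0.5
    R[i, (2 * i + 1) % n_fine] = 0.25
P = 2.0 * R.T                         # Galerkin pairing R = gamma * P^T with gamma = 1/2

k_fine = np.array([-1.0, 2.0, -1.0])  # a fine-mesh 3-tap stencil (illustrative)
K_H = R @ circulant_conv_matrix(k_fine, n_fine) @ P   # K_H = R K_h P

# K_H is again a short circulant stencil; read it off one interior row
print("coarse 3-tap kernel:", K_H[2, 1:4])
```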


Example: Multiresolution Learning

Data
- 60,000 gray-scale images, 18 × 18
- Residual Neural Network, width 6
- 2D convolution layers, fully connected
- tanh activation, softmax classifier

Multilevel experiments
1. train on fine → classify coarse: 84.1% (no restriction) vs. 94.9% (with restriction)
2. train on coarse → classify fine: 61.0% (no prolongation) vs. 91.0% (with prolongation)


Example: Shallow-to-Deep MNIST
6-level strategy: random initialization vs. multilevel learning

(figure: objective function and validation accuracy over 20 iterations per level for 2, 4, 8, 16, 32, and 64 layers, comparing random initialization with multilevel prolongation)


Example: Multilevel Learning ImageNet-10

ImageNet-10
- 13k natural images, 224 × 224
- 10 sub-categories
- Residual Neural Network, width 64, depth 34
- 2D convolution layers, fully connected

(figure: classification error in % vs. epochs on the coarse level (112 × 112) and the fine level (224 × 224), for coarse-to-fine and fine-only training and test)

                     fine-scale only     coarse-to-fine
runtime [sec]        59,122 ± 7,540      43,882 ± 3,476
validation acc.      76.47 ± 0.93%       82.67 ± 0.93%


Hyperbolic Networks


Hyperbolic Residual Neural Networks

Recall the reversibility of hyperbolic equations. Example:

∂_tt y(t, x) = ∂_xx y(t, x),  + initial and boundary conditions.

A similar property was recently exploited for residual networks:

forward:  y_{k+1} = y_k + F(x_k),   x_{k+1} = x_k + G(y_{k+1})
reverse:  x_k = x_{k+1} − G(y_{k+1}),   y_k = y_{k+1} − F(x_k)

useful for adjoint computations (backpropagation)

Reversible ResNets: ↓↓↓ memory ↑ computation

B.D. Nguyen, G.A. McMechan. Five ways to avoid storing source wavefield snapshots in 2D elastic prestack reverse time migration. Geophysics, 2014.

A.N. Gomez, M. Ren, R. Urtasun, R.B. Grosse. The Reversible Residual Network: Backpropagation Without Storing Activations. arXiv, 2017.


Limitations of Reversibility
Q: Is every algebraically reversible network also reversible in practice?

For α, β ∈ R consider a RevNet-type update with F(Z) = αZ and G(Y) = βY:

Y_{j+1} = Y_j + α Z_j,  and  Z_{j+1} = Z_j − β Y_{j+1}.

Combining two time steps in Y,

Y_{j+1} − Y_j = α Z_j,  and  Y_j − Y_{j-1} = α Z_{j-1}.

Subtracting these and using Z_j − Z_{j-1} = −β Y_j gives

Y_{j+1} − 2 Y_j + Y_{j-1} = α (Z_j − Z_{j-1}) = −αβ Y_j
⇔ Y_{j+1} − (2 − αβ) Y_j + Y_{j-1} = 0.

There is a solution Y_j = ξ^j, i.e., with a = (2 − αβ)/2,

ξ^2 − 2aξ + 1 = 0  ⇒  ξ = a ± √(a^2 − 1).

|ξ| = 1 (stable) if (1 − αβ/2)^2 ≤ 1, i.e., 0 ≤ αβ ≤ 4. Otherwise one of the roots grows exponentially!
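A quick numeric check of this analysis (scalar toy recursion, not a full RevNet): iterate the update for a parameter choice inside the stable range and one outside it.

```python
import numpy as np

def revnet_scalar(alpha, beta, n_steps=200):
    """Iterate Y_{j+1} = Y_j + alpha Z_j,  Z_{j+1} = Z_j - beta Y_{j+1} from (Y, Z) = (1, 1)."""
    Y, Z = 1.0, 1.0
    for _ in range(n_steps):
        Y = Y + alpha * Z
        Z = Z - beta * Y
    return abs(Y)

print(revnet_scalar(1.0,  1.0))   # alpha*beta = 1  lies in [0, 4]: stays O(1)
print(revnet_scalar(1.0, -1.0))   # alpha*beta = -1 lies outside [0, 4]: grows exponentially
```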


Reversible Hamiltonian Neural Networks
Forward propagation in the double-layer Hamiltonian network:

Y_{j+1} = Y_j + h K_{j,1}^T σ(K_{j,1} Z_j + b_{j,1}),
Z_{j+1} = Z_j − h K_{j,2}^T σ(K_{j,2} Y_{j+1} + b_{j,2}).

Recall: the antisymmetric structure gives stability (when the parameters change slowly).

Given the final states Y_N and Z_N, the dynamics can be computed backwards:

Z_j = Z_{j+1} + h K_{j,2}^T σ(K_{j,2} Y_{j+1} + b_{j,2}),
Y_j = Y_{j+1} − h K_{j,1}^T σ(K_{j,1} Z_j + b_{j,1}).

Possible to recompute the hidden features during backpropagation: ≈50% more computation, but large memory savings + stability.

A. Mahendran, A. Vedaldi. Understanding deep image representations by inverting them. CVPR, 2015.

B. Chang, L. Meng, E. Holtham, E. Haber, LR, D. Begert. Reversible Architectures for Arbitrarily Deep ResNNs. In review, arXiv, 2017.
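A sketch of the memory-saving recomputation (dense random weights, hypothetical sizes): run the Verlet forward pass keeping only the final states, then recover all hidden states by running the two substeps backwards.

```python
import numpy as np

def verlet_step(Y, Z, K1, K2, b1, b2, h):
    Y = Y + h * K1.T @ np.tanh(K1 @ Z + b1)
    Z = Z - h * K2.T @ np.tanh(K2 @ Y + b2)
    return Y, Z

def verlet_step_back(Y, Z, K1, K2, b1, b2, h):
    # invert the two substeps in reverse order; exact up to floating-point round-off
    Zprev = Z + h * K2.T @ np.tanh(K2 @ Y + b2)
    Yprev = Y - h * K1.T @ np.tanh(K1 @ Zprev + b1)
    return Yprev, Zprev

rng = np.random.default_rng(3)
w, depth, h = 4, 30, 0.1
params = [(rng.standard_normal((w, w)), rng.standard_normal((w, w)),
           rng.standard_normal((w, 1)), rng.standard_normal((w, 1))) for _ in range(depth)]

Y0, Z0 = rng.standard_normal((w, 2)), rng.standard_normal((w, 2))
Y, Z = Y0.copy(), Z0.copy()
for K1, K2, b1, b2 in params:              # forward pass: keep only the current (Y, Z)
    Y, Z = verlet_step(Y, Z, K1, K2, b1, b2, h)
for K1, K2, b1, b2 in reversed(params):    # backward pass: recompute the hidden states
    Y, Z = verlet_step_back(Y, Z, K1, K2, b1, b2, h)
print(np.max(np.abs(Y - Y0)), np.max(np.abs(Z - Z0)))  # errors at round-off level
```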


Hamiltonian Reversible STL-10

Data:
- color images, 96 × 96
- 10 classes
- 5k training images
- 8k test images

Test accuracy for our networks:
- Hamiltonian: 85.5%
- Midpoint: 84.6%
- Second Order: 83.7%

Some benchmark results:

        Google   Intel   Chinese UHK   Nvidia   Facebook   Xtract
        2013     2014    2015          2015     2016       2017
        70.2%    72.8%   73.15%        74.10%   74.33%     85.5%


Experiment: Stability and Generalization
Goal: compare the efficiency of the reversible Hamiltonian CNN to a ResNN

Table 1: Main results for different architectures on CIFAR-10 and CIFAR-100 (Params in millions).

Name                            | Units       | Channels      | Params (C10 / C100) | Error rate (C10 / C100)
ResNet-32 (He et al. 2016)      | 5-5-5       | 16-32-64      | 0.46 / 0.47         | 7.14% / 29.95%
RevNet-38 (Gomez et al. 2017)   | 3-3-3       | 32-64-112     | 0.46 / 0.48         | 7.24% / 28.96%
HrNet-74 (Ours)                 | 6-6-6       | 32-64-112     | 0.43 / 0.44         | 7.24% / 30.22%
MidPoint-26w (Ours)             | 4-4-4       | 32-64-112     | 0.50 / 0.51         | 8.84% / 32.75%
SecondOrder-62 (Ours)           | 4-4-4       | 32-64-112     | 0.50 / 0.51         | 9.65% / 37.48%
HrNet-50 (Ours)                 | 3-3-3-3     | 16-32-64-80   | 0.46 / 0.46         | 8.45% / 32.18%
MidPoint-50 (Ours)              | 3-3-3-3     | 16-32-64-80   | 0.44 / 0.45         | 8.89% / 33.53%
SecondOrder-50 (Ours)           | 3-3-3-3     | 16-32-64-112  | 0.50 / 0.51         | 9.17% / 33.31%
ResNet-110 (He et al. 2016)     | 18-18-18    | 16-32-64      | 1.73 / 1.73         | 5.74% / 26.44%
RevNet-110 (Gomez et al. 2017)  | 9-9-9       | 32-64-128     | 1.73 / 1.74         | 5.76% / 25.40%
HrNet-110 (Ours)                | 9-9-9       | 32-64-128     | 0.81 / 0.82         | 6.80% / 29.57%
HrNet-218 (Ours)                | 18-18-18    | 32-64-128     | 1.68 / 1.69         | (in progress) / 26.11%
MidPoint-62w (Ours)             | 10-10-10    | 32-64-128     | 1.78 / 1.79         | 7.24% / 29.02%
SecondOrder-62w (Ours)          | 10-10-10    | 32-64-128     | 1.78 / 1.79         | 8.15% / 32.65%
ResNet-1202                     | 200-200-200 | 32-64-128     | 19.4 / -            | 7.93% / -
HrNet-1202 (Ours)               | 100-100-100 | 32-64-128     | 9.70 / -            | 7.25% / -

Table 2: Main results on STL-10.

Methods                                | Accuracy
(Yang et al. 2015)                     | 73.15%
(Dundar, Jin, and Culurciello 2015)    | 74.1%
(Zhao et al. 2016)                     | 74.3%
HrNet (Ours)                           | 85.5%
MidPoint (Ours, 16-64-128-256)         | 84.6%
SecondOrder (Ours)                     | 83.7%

Table 3: Main results on SVHN.

Methods                                | Error rate
(Liang and Hu 2015)                    | 1.77%
(Lee, Gallagher, and Tu 2016)          | 1.69%
HrNet-82 (Ours)                        | 2.91%
HrNet-300 (Ours)                       | (in progress)
MidPoint-82 (Ours)                     | 3.41%
MidPoint-300 (Ours)                    | (in progress)
SecondOrder-82 (Ours)                  | 3.18%

Reduced training data (Figures 3 and 4): with simple architectures (4 units per block, 16-64-128-256 filters) and decreasing fractions of the training data, HrNet consistently achieves higher accuracy than ResNet — a boost of more than 13% on CIFAR-10 when only 3-4% of the training data are used, and an average boost of 3.4% on STL-10 (5.7% with 40% of the training data). We attribute this robustness to the intrinsic stability of the Hamiltonian architecture.

1202-layer networks on CIFAR-10: the 1202-layer ResNet of He et al. (2016) overfits and has a higher error rate (7.93%) than their 110-layer ResNet, whereas the 1202-layer HrNet has only about half as many parameters and achieves better accuracy (7.25% error), using only weight decay as regularization.


(figures: validation accuracy vs. fraction of training data for CIFAR-10 (total: 50k images) and STL-10 (total: 5k images))

stability leads to improved generalization


Conclusion


Numerical Methods for Deep Learning
An (almost perfectly) true statement:

backpropagation + GPU + {TensorFlow, Caffe, Torch, ...} ⇒ success

So, why study numerical methods for deep learning?

Transfer Learning
- DL is similar to path planning, optimal control, differential equations, ...

Do More With Less
- better modeling and algorithms: process more data, use fewer resources
- how about 3D images and videos?

Power of Abstraction
- use the continuous interpretation to design/relate architectures

It's fun!
- new interdisciplinary courses in computer science + mathematics


Σ: Optimal Control Framework for Deep Learning

Optimal control formulation

- new insights, theory, algorithms

Stability and well-posedness
- required for generalization
- continuous propagation: Hamiltonian systems
- discrete propagation: Verlet, Adams-Bashforth, ...
- example: impact on convergence (no batch normalization!)

Parabolic CNNs
- multiscale and shallow-to-deep training

Hyperbolic CNNs
- preserve high-frequency features
- can be made reversible

E. Haber, LR. Stable Architectures for Deep Neural Networks. Inverse Problems, 2017.

E. Holtham, E. Haber, LR. Learning Across Scales. AAAI, 2018.

B. Chang, L. Meng, E. Holtham, E. Haber, LR, D. Begert. Reversible Architectures for Arbitrarily Deep ResNNs. AAAI, 2018.
