TRANSCRIPT
An Optimal Control Framework for Efficient Training of Deep Neural Networks
CoSIP Intense Course on Deep Learning, Berlin
December 1, 2017
Lars Ruthotto, Department of Mathematics and Computer Science, Emory University
Fundamental (Open) Questions

Expressibility
- how to find a neural network that can approximate a function of interest?
- successes: approximation theorems, optimal sparsity, ...
- communities: harmonic analysis, approximation theory, ...

Learning
- how to (efficiently) train a neural network?
- successes: stochastic gradient and a zoo of variants (including ADAM, second-order, ...)
- community: mainly optimization and optimal control

Generalization
- does the neural network generalize?
- successes: VC dimensions, bias/variance dilemma, regularization, ...
- community: mainly statistics

Computing
- how to design a network that is expressive and generalizes well, and which method will train it efficiently?
- successes: hardware, ...
- community: scientific computing
Team and Acknowledgements
Joint work: Emory ↔ Xtract Tech. ↔ University of British Columbia

Lili Meng, Bo Chang, Elliot Holtham, Eldad Haber, Seong Hwan Jun

Funding:
- This work is supported in part by NSF award DMS 1522599
- Thanks to NVIDIA Corp. for the donation of a TITAN X GPU
Agenda: Optimal Control Framework for Deep Learning
- Deep Learning meets Optimal Control
- Stability and Generalization
  - when is deep learning well-posed?
  - stabilizing the forward propagation
- Convolution Neural Networks as PDEs
  - continuity in feature space
  - allows us to interpret and categorize CNNs
- Multiscale Parabolic CNNs
  - image classification across scales
  - shallow-to-deep training
- Reversible Hyperbolic CNNs
  - memory-efficient + stable → arbitrarily deep

E. Haber, LR. Stable Architectures for Deep Neural Networks. Inverse Problems, 2017.
E. Holtham, E. Haber, LR. Learning Across Scales. AAAI, 2018.
B. Chang, L. Meng, E. Holtham, E. Haber, LR, D. Begert. Reversible Architectures for Arbitrarily Deep ResNNs. AAAI, 2018.
Deep Learning meets Optimal Control
Deep Learning Revolution (?)
→  Y_{j+1} = σ(K_j Y_j + b_j)

- Neural networks with a particular (deep) architecture
- invented in the 1950's
- able to "learn" complicated patterns from data
- applications: image classification, face recognition, segmentation, driverless cars, ...
- recent success fueled by: massive data sets, computing power
- A few recent quotes:
  - Apple Is Bringing the AI Revolution to Your iPhone, WIRED '16
  - Data Scientist: Sexiest Job of the 21st Century, Harvard Business Rev '17
Supervised Learning using Deep Neural Networks
[Figure: training data (Y_0, C) → propagated features (Y_N, C) → classification result]
Supervised Deep Learning Problem
Given training data, Y_0, and labels, C, find transformation parameters (K, b) and classification weights (W, μ) such that the DNN predicts the data-label relationship (and generalizes to new data), by solving

minimize_{K, b, W, μ}  loss[g(W Y_N + μ), C] + regularizer[K, b, W, μ]
subject to  Y_{j+1} = activation(K_j Y_j + b_j),  for all j = 0, ..., N−1
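To make the constrained formulation concrete, here is a minimal NumPy sketch of the forward propagation and a softmax cross-entropy loss. The layer sizes, parameter names, and the choice of tanh/softmax are illustrative assumptions, not a prescribed implementation.

```python
import numpy as np

def forward_propagation(Y0, Ks, bs, activation=np.tanh):
    """Propagate features Y0 through N layers: Y_{j+1} = activation(K_j Y_j + b_j)."""
    Y = Y0
    for K, b in zip(Ks, bs):
        Y = activation(K @ Y + b)
    return Y

def softmax_cross_entropy(W, mu, YN, C):
    """Hypothesis g = softmax applied to W Y_N + mu; C holds one-hot labels (classes x examples)."""
    S = W @ YN + mu
    S -= S.max(axis=0, keepdims=True)                     # numerical stability
    P = np.exp(S) / np.exp(S).sum(axis=0, keepdims=True)
    return -np.mean(np.sum(C * np.log(P + 1e-12), axis=0))

# toy problem: 2 input features, 3 layers of width 4, 5 examples, 2 classes
rng = np.random.default_rng(0)
Y0 = rng.standard_normal((2, 5))
C  = np.eye(2)[:, rng.integers(0, 2, 5)]
Ks = [rng.standard_normal((4, 2))] + [rng.standard_normal((4, 4)) for _ in range(2)]
bs = [rng.standard_normal((4, 1)) for _ in range(3)]
W, mu = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))

YN = forward_propagation(Y0, Ks, bs)
print("loss:", softmax_cross_entropy(W, mu, YN, C))
```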
Deep Residual Neural Networks
Award-winning forward propagation:

Y_{j+1} = Y_j + h K_{j,2} σ(K_{j,1} Y_j + b_j),  for all j = 0, 1, ..., N−1.

ResNet is a forward Euler discretization of

∂_t y(t, K, b, y_0) = K_2(t) σ(K_1(t) y(t, K, b, y_0) + b(t)),   y(0, K, b, y_0) = y_0.

deep learning ↔ trajectory problem, image registration, mass transport, ...

In short, write ResNets as

∂_t y(t, θ, y_0) = f(y, θ(t)),   y(0, θ, y_0) = y_0
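A small sketch of the correspondence, assuming tanh activation and illustrative shapes: a residual block is exactly one forward Euler step of dY/dt = f(Y, θ(t)) with step size h.

```python
import numpy as np

def f(Y, K1, K2, b):
    """Layer dynamics f(Y, theta) = K2 sigma(K1 Y + b) with sigma = tanh."""
    return K2 @ np.tanh(K1 @ Y + b)

def resnet_block(Y, K1, K2, b, h=1.0):
    """Residual block Y_{j+1} = Y_j + h K2 sigma(K1 Y_j + b) = one forward Euler step."""
    return Y + h * f(Y, K1, K2, b)

def forward_euler(Y0, thetas, h):
    """Integrate dY/dt = f(Y, theta(t)) with N = len(thetas) Euler steps of size h."""
    Y = Y0
    for K1, K2, b in thetas:
        Y = resnet_block(Y, K1, K2, b, h)
    return Y

rng = np.random.default_rng(1)
n, N, T = 3, 8, 1.0                      # width, depth, final time
Y0 = rng.standard_normal((n, 10))
thetas = [(rng.standard_normal((n, n)), rng.standard_normal((n, n)),
           rng.standard_normal((n, 1))) for _ in range(N)]
print(forward_euler(Y0, thetas, h=T / N).shape)
```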
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. IEEE Conf. on CVPR, 770-778, 2016.
[Figure: input features Y_0 → propagated features Y_N]
Optimal Control Framework for Deep Learning
[Figure: training data (Y_0, C) → propagated features (Y(T, θ, Y_0), C) → classification result]
Supervised Deep Learning Problem
Given training data, Y_0, and labels, C, find network parameters θ and classification weights W, μ such that the DNN predicts the data-label relationship (and generalizes to new data), i.e., solve

minimize_{θ, W, μ}  loss[g(W Y(T, θ, Y_0) + μ), C] + regularizer[θ, W, μ]
Optimal Control Approaches to Deep Learning
[Figure: input features propagated to output features over time, transformed by controls 1-5]
Deep Learning ↔ trajectory problem.
- use for analysis and new algorithms
- invent your own architecture

E. Haber, LR. Stable Architectures for Deep Neural Networks. Inverse Problems, accepted 2017.
Weinan E. A Proposal on Machine Learning via Dynamical Systems. Communications in Mathematics and Statistics, 5(1), 1-11, 2017.
Blessing of Dimensionality (or Width)
Setup: ResNN, 9 fully connected single layers, σ = tanh
[Figure: input features + labels → propagated features → classification result (≈ 100% validation accuracy)]

increase the dimension (width) ⇒ no need to change the topology!
Stability and Well-Posedness
Stability of DNN: Goals
Our goals:
- predict a priori if an architecture can generalize
- design architectures that generalize
- impose regularization to find solutions that generalize

Main ingredients of well-posed inverse problems:
1. well-posed forward problem
2. bounded inverse
Stability of Continuous Forward Propagation
Interpret ResNet as a discretization of the initial value problem

∂_t y(t, K, b, y) = σ(K(t) y(t, K, b, y) + b(t)),   y(0, K, b, y) = y.

The IVP is stable if for any v ∈ R^n

‖y(T, K, b, y) − y(T, K, b, y + εv)‖_2 = O(ε).
[Figure: trajectories of an unstable IVP vs. a stable IVP]
idea: ensure stability by design / constraints on K, b
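One way to probe this definition numerically is to propagate y_0 and a perturbed copy y_0 + εv through the same (frozen-in-time) weights and compare the output distance to ε; a sketch under assumed sizes:

```python
import numpy as np

def propagate(y0, Ks, bs, h):
    """Forward Euler for dy/dt = sigma(K(t) y + b(t)), sigma = tanh."""
    y = y0.copy()
    for K, b in zip(Ks, bs):
        y = y + h * np.tanh(K @ y + b)
    return y

rng = np.random.default_rng(2)
n, N, h, eps = 4, 100, 0.1, 1e-4
y0, v = rng.standard_normal(n), rng.standard_normal(n)
A = rng.standard_normal((n, n))

for name, K in [("generic K", A), ("antisymmetric K - K^T", A - A.T)]:
    Ks = [K] * N                       # weights frozen in time, for simplicity
    bs = [np.zeros(n)] * N
    d = np.linalg.norm(propagate(y0, Ks, bs, h) - propagate(y0 + eps * v, Ks, bs, h))
    print(f"{name:22s}  ||y(T,y0) - y(T,y0+eps v)|| / eps = {d / eps:.2f}")
```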
Stability of Forward Propagation
[Figure: phase portraits for K_+ with λ(K_+) = 2, K_− with λ(K_−) = −2, and K_0 with λ(K_0) = ±i]
Fact: The ODE y'(t) = f(y) is stable if the real parts of the eigenvalues of the Jacobian J are non-positive. For non-autonomous ODEs we also need that J changes slowly in time (rigorous argument using the framework of kinematic eigenvalues).

For the ResNet,

J(t) = diag(σ'(K(t) Y(t) + b(t))) K(t).

If the activation is monotonically increasing, real(eig(K(t))) ≤ 0 is sufficient.
Enforcing Stability: Antisymmetric Transformation
Two examples of unconditionally stable networks.
ResNet with antisymmetric transformation matrix:

d/dt y(t) = σ((K(t) − K(t)^T) y + b(t)).

ResNet with auxiliary variable and antisymmetric block matrix:

d/dt [y; z](t) = σ( [0, A(t); −A(t)^T, 0] [y; z] + b(t) ).
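A quick numerical check of the linear-algebra fact behind the first construction: the antisymmetric part K − K^T of an arbitrary matrix has purely imaginary eigenvalues (random test matrix, for illustration only).

```python
import numpy as np

rng = np.random.default_rng(3)
K = rng.standard_normal((6, 6))
A = K - K.T                       # antisymmetric transformation matrix

lam = np.linalg.eigvals(A)
print("max |Re(lambda)| of K - K^T:", np.abs(lam.real).max())   # ~ 1e-15 (roundoff)
print("imaginary parts come in +/- pairs:", np.round(lam.imag, 3))
```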
How about stability of the discrete system?
Stability of Discrete Forward Problem
[Figure: forward Euler vs. Adams-Bashforth 3]
Note: ResNet (forward Euler) is not stable for antisymmetric transformations. Better:

Y_{j+1} = Y_j + (h/12) ( 23 σ(K_j Y_j + b_j) − 16 σ(K_{j−1} Y_{j−1} + b_{j−1}) + 5 σ(K_{j−2} Y_{j−2} + b_{j−2}) ).
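A sketch of forward propagation with this third-order Adams-Bashforth scheme, assuming tanh activation; the first two steps are bootstrapped with forward Euler, which is one of several reasonable choices.

```python
import numpy as np

def ab3_propagate(Y0, Ks, bs, h):
    """Y_{j+1} = Y_j + h/12 (23 f_j - 16 f_{j-1} + 5 f_{j-2}),  f_j = sigma(K_j Y_j + b_j)."""
    f = lambda Y, K, b: np.tanh(K @ Y + b)
    Ys, fs = [Y0], []
    for j, (K, b) in enumerate(zip(Ks, bs)):
        fs.append(f(Ys[-1], K, b))
        if j < 2:   # bootstrap the multistep method with forward Euler
            Ynew = Ys[-1] + h * fs[-1]
        else:
            Ynew = Ys[-1] + h / 12.0 * (23 * fs[-1] - 16 * fs[-2] + 5 * fs[-3])
        Ys.append(Ynew)
    return Ys[-1]

rng = np.random.default_rng(4)
n, N = 4, 16
A = rng.standard_normal((n, n))
Ks = [A - A.T] * N                      # antisymmetric layers, frozen in time
bs = [np.zeros((n, 1))] * N
print(ab3_propagate(rng.standard_normal((n, 8)), Ks, bs, h=0.1).shape)
```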
Symplectic Integration
Hamiltonian-inspired neural networks (Verlet integration):

Z_{j+1/2} = Z_{j−1/2} − h σ(K_j Y_j + b_j)
Y_{j+1} = Y_j + h σ(K_j^T Z_{j+1/2} + b_j)

Necessary Conditions for Generalization
- continuous forward propagation stable when real(eig(K)) ≤ 0
- learning problem well-posed when real(eig(K)) ≈ 0
- stable scheme for discrete forward propagation and h "small enough"
Simple explanation of (and cure for) exploding and vanishing gradients!
Y. Bengio, P. Simard, P. Frasconi. Learning Long-Term Dependencies with Gradient Descent Is Difficult. IEEE Trans. on Neural Networks, 5(2):157-166, 1994.
Example: Impact of Network Depth
classification problem generated from peaks in MATLAB®

data setup
- 2,000 points in 2D, 5 classes
- Residual Neural Network
- tanh activation, softmax classifier
- multilevel: 32 layers → 64 layers

compare three configurations
1. "deep": T = 5 (3rd order multistep)
2. "medium": T = 2 (1st order Verlet)
3. "shallow": T = 0.2 (3rd order multistep)

[Figure: features and labels]
Q: how does learning performance compare?
Example: Impact of Network Depth - Convergence

[Figure: objective function and validation accuracy vs. iteration for the deep ab3 (T = 5), medium Verlet (T = 2), and shallow ab3 (T = 0.2) configurations, each trained with 32 and then 64 layers]
Example: Impact of Network Depth - Dynamics
[Figure: propagated features for deep ab3 (T = 5), medium Verlet (T = 2), and shallow ab3 (T = 0.2)]
PDE-Interpretation of Convolution Neural Networks
Convolutions and PDEs
Let y be a 1D grid function, y ↔ y(x) (grid: n cells of width h_x = 1/n). Then

K(θ) y = [θ_1 θ_2 θ_3] ∗ y = ( (β_1/4) [1 2 1] + (β_2/(2h_x)) [−1 0 1] + (β_3/h_x²) [−1 2 −1] ) ∗ y,

where the coefficients β_1, β_2, β_3 satisfy

[ 1/4   −1/(2h_x)   −1/h_x² ] [ β_1 ]   [ θ_1 ]
[ 1/2       0         2/h_x² ] [ β_2 ] = [ θ_2 ]
[ 1/4    1/(2h_x)    −1/h_x² ] [ β_3 ]   [ θ_3 ]

In the limit h_x → 0 this gives

K(θ(t)) = β_1(t) + β_2(t) ∂_x + β_3(t) ∂_x².

The convolution operator K is a linear combination of differential operators.
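A small sketch that recovers β_1, β_2, β_3 from a given 3-point kernel θ by solving the 3×3 system above; the grid size and kernel values are made up for illustration.

```python
import numpy as np

def stencil_to_derivatives(theta, hx):
    """Solve M beta = theta, where the columns of M come from the stencils
    [1 2 1]/4, [-1 0 1]/(2 hx), and [-1 2 -1]/hx^2, evaluated entrywise."""
    M = np.array([[0.25, -1.0 / (2 * hx), -1.0 / hx**2],
                  [0.50,  0.0,             2.0 / hx**2],
                  [0.25,  1.0 / (2 * hx), -1.0 / hx**2]])
    return np.linalg.solve(M, theta)

hx = 1.0 / 32
theta = np.array([0.1, -0.4, 0.3])          # an arbitrary 3-point kernel
beta = stencil_to_derivatives(theta, hx)
print("beta (reaction, convection, diffusion-like):", beta)

# sanity check: rebuild the stencil from the betas
rebuilt = (beta[0] / 4 * np.array([1, 2, 1])
           + beta[1] / (2 * hx) * np.array([-1, 0, 1])
           + beta[2] / hx**2 * np.array([-1, 2, -1]))
print("max error:", np.abs(rebuilt - theta).max())
```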
Parabolic CNN
In the original residual net choose K_2 = −K_1^T and write K = K_1. This gives the parabolic PDE

Y_t = −K(t)^T σ(K(t) Y + b(t)),   Y(0) = Y_0

with Jacobian

J(t) = −K(t)^T diag(σ'(K(t) Y + b(t))) K(t),

which is symmetric and negative definite (σ' ≥ 0) ⇒ stable if K, b do not change too quickly. Use a forward Euler discretization with h small enough:

Y_{j+1} = Y_j − h K_j^T σ(K_j Y_j + b_j),   j = 0, 1, ..., N−1
Similar to anisotropic diffusion (popular in image processing)
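A minimal sketch of one parabolic layer, with a fully connected K standing in for the convolution, plus a check that the Jacobian −K^T diag(σ') K is symmetric with non-positive eigenvalues (assumed shapes and tanh activation).

```python
import numpy as np

def parabolic_layer(Y, K, b, h):
    """Y_{j+1} = Y_j - h K^T sigma(K Y_j + b),  sigma = tanh."""
    return Y - h * K.T @ np.tanh(K @ Y + b)

rng = np.random.default_rng(5)
n = 5
K, b = rng.standard_normal((n, n)), rng.standard_normal((n, 1))
Y = rng.standard_normal((n, 1))

# Jacobian of the continuous dynamics at Y: -K^T diag(sigma'(K Y + b)) K
sprime = 1.0 - np.tanh(K @ Y + b) ** 2          # tanh' = 1 - tanh^2 >= 0
J = -K.T @ np.diag(sprime.ravel()) @ K
print("largest eigenvalue of J (should be <= 0):", np.linalg.eigvalsh(J).max())
print(parabolic_layer(Y, K, b, h=0.1).shape)
```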
Y. Chen, T. Pock. Trainable Nonlinear Reaction Diffusion. IEEE PAMI, 39(6), 1256-1272, 2017.
Hamiltonian CNN
Introducing an auxiliary variable Z, consider the dynamics

∂_t Y(t) = K_1^T(t) σ(K_1(t) Z(t) + b_1(t)),
∂_t Z(t) = −K_2^T(t) σ(K_2(t) Y(t) + b_2(t)).

In matrix form this is

[∂_t Y; ∂_t Z] = [K_1^T, 0; 0, −K_2^T] σ( [0, K_1; K_2, 0] [Y; Z] + [b_1; b_2] ).

(One can show that the eigenvalues of the Jacobian are all imaginary ⇒ stability when K_1, K_2, b_1, b_2 change slowly in time.) Discretize using the Verlet method:

Y_{j+1} = Y_j + h K_{j,1}^T σ(K_{j,1} Z_j + b_{j,1}),
Z_{j+1} = Z_j − h K_{j,2}^T σ(K_{j,2} Y_{j+1} + b_{j,2}).
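A sketch of the Verlet forward propagation for this system, with fully connected K_1, K_2 standing in for the convolutions; shapes and parameter names are assumptions.

```python
import numpy as np

def hamiltonian_forward(Y, Z, params, h):
    """Verlet steps: Y <- Y + h K1^T sigma(K1 Z + b1); Z <- Z - h K2^T sigma(K2 Y + b2)."""
    for K1, b1, K2, b2 in params:
        Y = Y + h * K1.T @ np.tanh(K1 @ Z + b1)
        Z = Z - h * K2.T @ np.tanh(K2 @ Y + b2)   # uses the freshly updated Y
    return Y, Z

rng = np.random.default_rng(6)
n, N, h = 4, 32, 0.1
Y, Z = rng.standard_normal((n, 10)), np.zeros((n, 10))
params = [(rng.standard_normal((n, n)), np.zeros((n, 1)),
           rng.standard_normal((n, n)), np.zeros((n, 1))) for _ in range(N)]
Y_out, Z_out = hamiltonian_forward(Y, Z, params, h)
print(Y_out.shape, Z_out.shape)
```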
Second-Order Network
Consider the second-order forward dynamics

∂_tt Y = −K(t)^T σ(K(t) Y + b(t))

and their leapfrog discretization

Y_{j+1} = 2 Y_j − Y_{j−1} − h² K_j^T σ(K_j Y_j + b_j).

Similar to: Full Waveform Inversion, Ultrasound, ...
Loss Function in CNN
Let C be some values, Y features, and W classification weights.

Regression:

loss(W, Y) = (1/2) ‖W Y − C‖²

If Y and W are discretizations of the functions Y(x) and W(x), the loss is a discretization of

ℓ(W, Y) = ( ∫_Ω W(x) Y(x) dx − C )²

⇒ add h_x² and model W as an image, e.g., enforce smoothness

R(W) = ∫_Ω ‖∇W(x)‖² dx.

Classification: similar, since hypothesis functions can be written as

C_pred(W, Y) = g(h_x² W Y)
Some Challenges in CNN
Computations
- note that networks have a width and a depth
- Toy example: width 16, depth 20 (time steps). Forward propagation: 5,120 2D convolutions per image

Memory Consumption
- adjoint equations (backpropagation) need intermediate states (hidden features)
- Toy example (continued): training data is 5k images with 32 × 32 pixels. Storage (just features): 130 GB (double) or 65 GB (single).

Architecture Design & Interpretation
- CNN should be easy to train and generalize well
- CNN should be difficult to fool (adversarial)
- Can we understand the reasoning of a CNN?
Pei et al., DeepXplore, 2017
Parabolic Networks
Parabolic Residual Neural Networks
Recall the decay property of the heat equation. Example:

∂_t y(t, x) = −∂_xx y(t, x),  + initial and boundary conditions.

Some consequences for learning
- forward propagation is asymptotically stable (if kernels are constant in time)
- network is robust against perturbation of inputs (adversarial)
- learning problem ill-posed (↔ inverse heat equation)
- numerical methods for parabolic PDEs include multiscale, multilevel, ROM, ...
Multi-Resolution Learning
[Figure: image pyramid - level 1, 12×12; level 2, 24×24; level 3, 48×48; level 4, 96×96]
Multi-Resolution Learning
Restrict the images n times
θ_0, W_0, μ_0 ← random initialization
for j = 1 : n do
    optimize with data on level j, starting from θ_{j−1}, W_{j−1}, μ_{j−1}
    obtain θ*, W*, μ*
    θ_j ← prolongate θ*
    W_j ← interpolate W*
end for
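In code, the loop might look like the following sketch; optimize, restrict_images, prolongate_kernels, and interpolate_weights are placeholders for the routines discussed on this and the next slide, not a fixed API.

```python
def multiresolution_learning(images, labels, n_levels,
                             optimize, restrict_images,
                             prolongate_kernels, interpolate_weights,
                             init_params):
    """Coarse-to-fine training: solve on the coarsest level first, then
    prolongate the parameters to warm-start the next finer level."""
    # build the image pyramid: index 0 holds the coarsest level
    pyramid = [images]
    for _ in range(n_levels - 1):
        pyramid.insert(0, restrict_images(pyramid[0]))

    theta, W, mu = init_params()            # random initialization on the coarsest level
    for level, Y in enumerate(pyramid):
        theta, W, mu = optimize(Y, labels, theta, W, mu)   # train on this level
        if level < n_levels - 1:            # warm start for the next (finer) level
            theta = prolongate_kernels(theta)
            W = interpolate_weights(W)
    return theta, W, mu
```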
How to prolongate the kernels?
Galerkin Projection of Convolution Kernels
K_H = R K_h P,

where
- K_h: fine-mesh operator (given)
- R: restriction (e.g., averaging)
- P: prolongation (e.g., interpolation)

Remarks:
- Galerkin: R = γ P^T
- Coarse → Fine: unique if the kernel size is constant.
- only a small linear solve required (see the sketch below)
[Figure: prolongation and restriction relating the fine convolution to the coarse convolution]
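A minimal 1D sketch of the Galerkin coarsening K_H = R K_h P for a 3-point kernel on a small periodic grid, using linear-interpolation prolongation and R = (1/2) P^T; this illustrates the construction rather than reproducing the exact operators used in the paper.

```python
import numpy as np

def circulant(kernel, n):
    """n x n periodic convolution matrix for a centered 3-point kernel."""
    C = np.zeros((n, n))
    for i in range(n):
        C[i, (i - 1) % n], C[i, i], C[i, (i + 1) % n] = kernel
    return C

n_fine, n_coarse = 8, 4
kernel_h = np.array([-1.0, 2.0, -1.0])          # fine-mesh stencil (given)

# linear interpolation P: coarse points copied, midpoints averaged (periodic grid)
P = np.zeros((n_fine, n_coarse))
for j in range(n_coarse):
    P[2 * j, j] = 1.0
    P[(2 * j + 1) % n_fine, j] = 0.5
    P[(2 * j - 1) % n_fine, j] = 0.5
R = 0.5 * P.T                                    # Galerkin choice: R = gamma * P^T

K_h = circulant(kernel_h, n_fine)
K_H = R @ K_h @ P                                # coarse operator
print(np.round(K_H, 3))                          # rows reveal the induced 3-point coarse stencil
```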
Example: Multiresolution Learning
data
- 60,000 gray-scale images, 18 × 18
- Residual Neural Network, width 6
- 2D convolution layers, fully connected
- tanh activation, softmax classifier

multilevel experiments
1. train on fine → classify coarse: 84.1% (no restriction) vs. 94.9% (with restriction)
2. train on coarse → classify fine: 61.0% (no prolongation) vs. 91.0% (with prolongation)
Example: Shallow-to-Deep MNIST
6-level strategy: random initialization vs. multilevel learning.

[Figure: objective function and validation accuracy over iterations for 2, 4, 8, 16, 32, and 64 layers, comparing random initialization with multilevel prolongation]
Example: Multilevel Learning ImageNet-10
ImageNet-10
- 13k natural images, 224 × 224
- 10 sub-categories
- Residual Neural Network, width 64, depth 34
- 2D convolution layers, fully connected

[Figure: classification error in % vs. epochs on the coarse level (112×112) and the fine level (224×224), for coarse-to-fine and fine-only training (test and training curves)]

                 | fine-scale only  | coarse-to-fine
runtime [sec]    | 59,122 ± 7,540   | 43,882 ± 3,476
validation acc.  | 76.47 ± 0.93%    | 82.67 ± 0.93%
Hyperbolic Networks
Hyperbolic Residual Neural Networks
Recall the reversibility of hyperbolic equations. Example:
∂_tt y(t, x) = ∂_xx y(t, x),  + initial and boundary conditions.

A similar property was recently discovered for residual networks:

y_{k+1} = y_k + F(x_k)              x_k = x_{k+1} − G(y_{k+1})
x_{k+1} = x_k + G(y_{k+1})    ⟶     y_k = y_{k+1} − F(x_k)
useful for adjoint computations (backpropagation)
Reversible ResNets: ↓↓↓ memory ↑ computation
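A sketch of one reversible block and its exact inverse in the spirit of the equations above; F and G are arbitrary stand-in tanh layers with assumed shapes.

```python
import numpy as np

def reversible_forward(y, x, F, G):
    """y_{k+1} = y_k + F(x_k);  x_{k+1} = x_k + G(y_{k+1})."""
    y_next = y + F(x)
    x_next = x + G(y_next)
    return y_next, x_next

def reversible_backward(y_next, x_next, F, G):
    """Exact inverse: recover (y_k, x_k) from (y_{k+1}, x_{k+1}) without stored states."""
    x = x_next - G(y_next)
    y = y_next - F(x)
    return y, x

rng = np.random.default_rng(7)
n = 6
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
F = lambda x: np.tanh(A @ x)
G = lambda y: np.tanh(B @ y)

y0, x0 = rng.standard_normal(n), rng.standard_normal(n)
y1, x1 = reversible_forward(y0, x0, F, G)
y0_rec, x0_rec = reversible_backward(y1, x1, F, G)
print("reconstruction error:", np.abs(np.concatenate([y0 - y0_rec, x0 - x0_rec])).max())
```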
B.D. Nguyen, G.A. McMechan. Five ways to avoid storing source wavefield snapshots in 2D elastic prestack reverse time migration. Geophysics, 2014.
A.N. Gomez, M. Ren, R. Urtasun, R.B. Grosse. The Reversible Residual Network: Backpropagation Without Storing Activations. arXiv, 2017.
Limitations of Reversibility
Q: Is any algebraically reversible network reversible in practice?
For α, β ∈ R consider the original RevNet with F(Z) = αZ and G(Y) = βY:
Y_{j+1} = Y_j + α Z_j,   and   Z_{j+1} = Z_j + β Y_{j+1}.

Combining two time steps in Y,

Y_{j+1} − Y_j = α Z_j,   and   Y_j − Y_{j−1} = α Z_{j−1}.

Subtracting those two gives

Y_{j+1} − 2 Y_j + Y_{j−1} = α (Z_j − Z_{j−1}) = αβ Y_j
⇔  Y_{j+1} − (2 + αβ) Y_j + Y_{j−1} = 0.

There is a solution Y_j = ξ^j, i.e., with a = (2 + αβ)/2,

ξ² − 2aξ + 1 = 0  ⇒  ξ = a ± √(a² − 1).

|ξ|² = 1 (stable) if (1 + αβ/2)² ≤ 1. Otherwise ξ is growing!
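The recurrence makes this easy to see numerically; a short sketch iterating Y_{j+1} = (2 + αβ)Y_j − Y_{j−1} for a choice of αβ that satisfies the condition and one that violates it (scalar toy values, for illustration only).

```python
def iterate(alpha_beta, n_steps=60, Y0=1.0, Y1=1.0):
    """Iterate Y_{j+1} = (2 + alpha*beta) Y_j - Y_{j-1} and return |Y_N|."""
    Y_prev, Y = Y0, Y1
    for _ in range(n_steps):
        Y_prev, Y = Y, (2.0 + alpha_beta) * Y - Y_prev
    return abs(Y)

for ab in [-0.1, 0.1]:   # (1 + ab/2)^2 <= 1 holds for ab = -0.1, fails for ab = +0.1
    print(f"alpha*beta = {ab:+.1f}:  |Y_60| = {iterate(ab):.3e}")
```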
Reversible Hamiltonian Neural Networks
Forward propagation in the double-layer Hamiltonian:

Y_{j+1} = Y_j + h K_{j,1}^T σ(K_{j,1} Z_j + b_{j,1}),
Z_{j+1} = Z_j − h K_{j,2}^T σ(K_{j,2} Y_{j+1} + b_{j,2}).

Recall: the antisymmetric structure gives stability (when parameters change slowly).

Clearly, given Y_N, Y_{N−1} and Z_N, Z_{N−1}, the dynamics can be computed backwards:

Z_j = Z_{j+1} + h K_{j,2}^T σ(K_{j,2} Y_{j+1} + b_{j,2}),
Y_j = Y_{j+1} − h K_{j,1}^T σ(K_{j,1} Z_j + b_{j,1}).

Possible to recompute weights: ↑ 50% computation costs, but large memory savings + stability.
A. Mahendran, A. Vedaldi. Understanding deep image representations by inverting them. CVPR, 2015.
B. Chang, L. Meng, E. Holtham, E. Haber, LR, D. Begert. Reversible Architectures for Arbitrarily Deep ResNNs. In review, arXiv, 2017.
Hamiltonian Reversible STL-10
Data:
- color images, 96 × 96
- 10 classes
- 5k training images
- 8k test images

Test accuracy for our networks:
- Hamiltonian: 85.5%
- Midpoint: 84.6%
- Second Order: 83.7%

Some benchmark results:
Google '13: 70.2% | Intel '14: 72.8% | Chinese UHK '15: 73.15% | Nvidia '15: 74.10% | Facebook '16: 74.33% | Xtract '17: 85.5%
Experiment: Stability and Generalization
Goal: Compare efficiency of reversible Hamiltonian CNN to ResNN
Name                           | Units       | Channels      | Params (M): C-10 / C-100 | Error CIFAR-10 | Error CIFAR-100
ResNet-32 (He et al. 2016)     | 5-5-5       | 16-32-64      | 0.46 / 0.47              | 7.14%          | 29.95%
RevNet-38 (Gomez et al. 2017)  | 3-3-3       | 32-64-112     | 0.46 / 0.48              | 7.24%          | 28.96%
HrNet-74 (Ours)                | 6-6-6       | 32-64-112     | 0.43 / 0.44              | 7.24%          | 30.22%
MidPoint-26w (Ours)            | 4-4-4       | 32-64-112     | 0.50 / 0.51              | 8.84%          | 32.75%
SecondOrder-62 (Ours)          | 4-4-4       | 32-64-112     | 0.50 / 0.51              | 9.65%          | 37.48%
HrNet-50 (Ours)                | 3-3-3-3     | 16-32-64-80   | 0.46 / 0.46              | 8.45%          | 32.18%
MidPoint-50 (Ours)             | 3-3-3-3     | 16-32-64-80   | 0.44 / 0.45              | 8.89%          | 33.53%
SecondOrder-50 (Ours)          | 3-3-3-3     | 16-32-64-112  | 0.50 / 0.51              | 9.17%          | 33.31%
ResNet-110 (He et al. 2016)    | 18-18-18    | 16-32-64      | 1.73 / 1.73              | 5.74%          | 26.44%
RevNet-110 (Gomez et al. 2017) | 9-9-9       | 32-64-128     | 1.73 / 1.74              | 5.76%          | 25.40%
HrNet-110 (Ours)               | 9-9-9       | 32-64-128     | 0.81 / 0.82              | 6.80%          | 29.57%
HrNet-218 (Ours)               | 18-18-18    | 32-64-128     | 1.68 / 1.69              | Bo running%    | 26.11%
MidPoint-62w (Ours)            | 10-10-10    | 32-64-128     | 1.78 / 1.79              | 7.24%          | 29.02%
SecondOrder-62w (Ours)         | 10-10-10    | 32-64-128     | 1.78 / 1.79              | 8.15%          | 32.65%
ResNet-1202                    | 200-200-200 | 32-64-128     | 19.4 / -                 | 7.93%          | -
HrNet-1202 (Ours)              | 100-100-100 | 32-64-128     | 9.70 / -                 | 7.25%          | -

Table 1: Main results for different architectures on CIFAR-10 and CIFAR-100.
Methods                             | Accuracy
(Yang et al. 2015)                  | 73.15%
(Dundar, Jin, and Culurciello 2015) | 74.1%
(Zhao et al. 2016)                  | 74.3%
HrNet (Ours)                        | 85.5%
MidPoint (Ours) (16-64-128-256)     | 84.6%
SecondOrder (Ours)                  | 83.7%

Table 2: Main results on STL-10.
Methods                       | Error Rate
(Liang and Hu 2015)           | 1.77%
(Lee, Gallagher, and Tu 2016) | 1.69%
HrNet-82 (Ours)               | 2.91%
HrNet-300 (Ours)              | Lili to run%
MidPoint-82 (Ours)            | 3.41%
MidPoint-300 (Ours)           | Bo to run%
SecondOrder-82 (Ours)         | 3.18%

Table 3: Main results on SVHN.
[Figure 3: Comparison of ResNet and HrNet for CIFAR-10]
... pushing the state-of-the-art results. Therefore we intentionally use simple architectures here, as shown in Table 4. For comparison, we use ResNet (He et al. 2016) as our baseline. CIFAR-10 has much more training data than STL-10 (50,000 vs. 5,000), so we decrease the training data from 20% to 0.05% for CIFAR-10 and from 80% to 5% for STL-10. The test data stays the same.

CIFAR-10: Fig. 3 shows the result on CIFAR-10 when decreasing the training data from 20% to 5%. Our HrNet performs consistently better in terms of accuracy than ResNet, with more than a 13% accuracy boost when only 3% and 4% of the original training data are used.

STL-10: From the results shown in Fig. 4, we can see that HrNet consistently achieves better accuracy than ResNet, with an average 3.4% accuracy boost. In particular, when using just 40% of the training data, HrNet has a 5.7% accuracy boost over ResNet.

We attribute the robustness to small amounts of training data to the intrinsic stability of our Hamiltonian neural network architecture.

Layers  | 4  | 4  | 4   | 4
Filters | 16 | 64 | 128 | 256

Table 4: HrNet and ResNet architecture for small amounts of training data on CIFAR-10 and STL-10.

[Figure 4: Comparison of ResNet and HrNet for STL-10]

Exploring a 1202-layer HrNet
An aggressively deep, 1202-layer ResNet is explored on CIFAR-10 in (He et al. 2016); it has a higher error rate (7.93%) than their 110-layer ResNet (6.43%), which they attribute to overfitting. To demonstrate the robustness of our architecture, we also explore a 1202-layer architecture on CIFAR-10. Compared with the original ResNet, our architecture has only half the parameters but better accuracy. For a fair comparison with ResNet, only weight decay is used, without complicated optimization, maxout (Goodfellow et al. 2013), or dropout (Srivastava et al. 2014).

Conclusion
In this paper, we present three robust and reversible architectures that connect stable ODEs with deep residual neural networks and yield well-posed learning problems. We exploit the intrinsic reversibility property to obtain a memory-efficient implementation, which does not need to store the activations and features at hidden layers. Together with the stability of the forward propagation, this allows training deeper and wider architectures with limited computational resources. We evaluate our methods on four publicly available datasets against several strong baselines. Experimental results show the efficacy of our method, with superior or on-par state-of-the-art performance. Moreover, with smaller amounts of data, our architectures achieve better accuracy than the widely used state-of-the-art ResNet architecture.
References
Arnold, V. I. 2012. Geometrical methods in the theory of ordinary differential equations. Springer Science & Business Media.
Arora, S.; Liang, Y.; and Ma, T. 2016. Why are deep nets reversible: A simple theory, with implications for training. ICLR Workshop.
Cessac, B. 2010. A view of neural networks as dynamical systems. International Journal of Bifurcation and Chaos.
Cho, K.; Van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. ACL.
Coddington, E. A., and Levinson, N. 1955. Theory of ordinary differential equations. Tata McGraw-Hill Education.
Dinh, L.; Krueger, D.; and Bengio, Y. 2015. NICE: Non-linear independent components estimation. ICLR Workshop.
Dinh, L.; Sohl-Dickstein, J.; and Bengio, S. 2016. Density estimation using real NVP. NIPS.
Dosovitskiy, A., and Brox, T. 2016. Inverting visual representations with convolutional networks. In CVPR.
Dundar, A.; Jin, J.; and Culurciello, E. 2015. Convolutional clustering for unsupervised learning. arXiv preprint arXiv:1511.06241.
Esteva, A.; Kuprel, B.; Novoa, R. A.; Ko, J.; Swetter, S. M.; Blau, H. M.; and Thrun, S. 2017. Dermatologist-level classification of skin cancer with deep neural networks. Nature.
Gebru, T.; Krause, J.; Wang, Y.; Chen, D.; Deng, J.; Aiden, E. L.; and Fei-Fei, L. 2017. Using deep learning and Google street view to estimate the demographic makeup of the US. arXiv preprint arXiv:1702.06683.
Gilbert, A. C.; Zhang, Y.; Lee, K.; Zhang, Y.; and Lee, H. 2017. Towards understanding the invertibility of convolutional neural networks. arXiv preprint arXiv:1705.08664.
Gomez, A. N.; Ren, M.; Urtasun, R.; and Grosse, R. B. 2017. The reversible residual network: Backpropagation without storing activations. NIPS.
Goodfellow, I. J.; Warde-Farley, D.; Mirza, M.; Courville, A.; and Bengio, Y. 2013. Maxout networks. ICML.
Haber, E., and Ruthotto, L. 2017. Stable architectures for deep neural networks. arXiv preprint arXiv:1705.03341.
Haber, E.; Ruthotto, L.; and Holtham, E. 2017. Learning across scales - a multiscale method for convolution neural networks. arXiv preprint arXiv:1703.02009.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
He, K.; Gkioxari, G.; Dollar, P.; and Girshick, R. 2017. Mask R-CNN. In ICCV.
Horn, R. A., and Johnson, C. R. 2012. Matrix analysis. Cambridge University Press.
Hornik, K. 1991. Approximation capabilities of multilayer feedforward networks. Neural Networks.
Huang, G.; Sun, Y.; Liu, Z.; Sedra, D.; and Weinberger, K. Q. 2016. Deep networks with stochastic depth. In ECCV.
Huang, G.; Liu, Z.; Weinberger, K. Q.; and van der Maaten, L. 2017. Densely connected convolutional networks. CVPR.
Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images.
LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature.
[Figure: comparison of ResNet and HrNet with reduced training data - CIFAR-10 (total: 50k images) and STL-10 (total: 5k images)]
stability leads to improved generalization
Conclusion
Numerical Methods for Deep Learning
An (almost perfectly) true statement:

backpropagation + GPU + {TensorFlow, Caffe, Torch, ...} ⇒ success

So, why study numerical methods for deep learning?

Transfer Learning
- DL is similar to path planning, optimal control, differential equations, ...

Do More With Less
- Better modeling and algorithms: process more data, use fewer resources
- How about 3D images and videos?

Power Of Abstraction
- Use the continuous interpretation to design/relate architectures

It's fun!
- new interdisciplinary courses in computer science + mathematics
Σ: Optimal Control Framework for Deep Learning

Optimal control formulation
- new insights, theory, algorithms

Stability and well-posedness
- required for generalization
- continuous propagation: Hamiltonian systems
- discrete propagation (Verlet, Adams-Bashforth, ...)
- example: impact on convergence (no batch normalization!)

Parabolic CNNs
- multiscale and shallow-to-deep training

Hyperbolic CNNs
- preserve high-frequency features
- can be made reversible

E. Haber, LR. Stable Architectures for Deep Neural Networks. Inverse Problems, 2017.
E. Holtham, E. Haber, LR. Learning Across Scales. AAAI, 2018.
B. Chang, L. Meng, E. Holtham, E. Haber, LR, D. Begert. Reversible Architectures for Arbitrarily Deep ResNNs. AAAI, 2018.