TRANSCRIPT
Neural Networks with Cheap Differential Operators
Ricky T. Q. Chen and David Duvenaud
University of Toronto, Vector Institute
Overview
Given a function f : R^d → R^d, we seek to obtain a vector containing its dimension-wise k-th order derivatives,

$$ D^k_{\text{dim}} f(x) := \left[\, \frac{\partial^k f_1(x)}{\partial x_1^k} \;\cdots\; \frac{\partial^k f_d(x)}{\partial x_d^k} \,\right]^\top \in \mathbb{R}^d \qquad (1) $$

using only k evaluations of automatic differentiation, regardless of the dimension d.
[Figure: the full Jacobian of f(x) vs. D_dim f(x), the diagonal of the Jacobian.]
This has applications in:
I. Solving differential equations.
II. Continuous Normalizing Flows.
III. Learning stochastic differential equations.
Reverse-mode Automatic Differentiation
Deep learning software relies on automatic differentiation (AD) to compute gradients. More generally, AD computes vector-Jacobian products.
$$ v^\top \frac{\partial f(x)}{\partial x} = \sum_{i=1}^{d} v_i\, \frac{\partial f_i(x)}{\partial x} \qquad (2) $$
However, computing the Jacobian trace (i.e. the sum of D_dim f) is as expensive as computing the full Jacobian: one evaluation of AD can only compute one row of the Jacobian.
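As a concrete illustration, here is a minimal PyTorch sketch (the toy function f and all variable names are ours, not from the poster) of Eq. (2): one reverse-mode AD call yields one vector-Jacobian product, so recovering the Jacobian diagonal this way costs d calls.

import torch

d = 4
x = torch.randn(d, requires_grad=True)

def f(x):
    # toy R^d -> R^d function with a dense Jacobian
    return torch.tanh(x.flip(0)) + x ** 2

y = f(x)

# One reverse-mode AD call computes v^T (df/dx) for a chosen v (Eq. 2).
# With v = e_i (one-hot), this is the i-th row of the Jacobian.
e0 = torch.zeros(d); e0[0] = 1.0
row0 = torch.autograd.grad(y, x, grad_outputs=e0, retain_graph=True)[0]

# The Jacobian diagonal (D_dim f) therefore needs d separate calls in general.
diag = torch.stack([
    torch.autograd.grad(y, x, grad_outputs=torch.eye(d)[i],
                        retain_graph=True)[0][i]
    for i in range(d)
])
print(row0, diag)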
Forward: Network Structure
Our approach: dimension-wise derivatives can be computed efficiently for restricted architectures, with simple modifications to the AD procedure in the backward pass.
Conditioner h_i = c_i(x_{-i}). The i-th hidden dimension depends on all inputs except the i-th input dimension. Can be computed in parallel using masked neural networks (e.g. MADE, PixelCNN).
Transformer f_i(x) = τ_i(x_i, h_i). τ_i : R^d → R outputs the i-th dimension given the concatenated vector. Can be computed in parallel if composed of matrix multiplications and element-wise operations.
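A rough PyTorch sketch of this structure (an illustrative module of our own, not the authors' implementation), with a single masked linear layer as the conditioner and an element-wise MLP shared across dimensions as the transformer:

import torch
import torch.nn as nn

class HollowBlock(nn.Module):
    """f_i(x) = tau(x_i, h_i) with h_i = c_i(x_{-i})."""
    def __init__(self, d, hidden=8):
        super().__init__()
        # Conditioner: one masked linear map per output dimension.
        # W has shape (d, hidden, d); zeroing W[i, :, i] makes h_i
        # independent of x_i.
        self.W = nn.Parameter(0.1 * torch.randn(d, hidden, d))
        self.register_buffer("mask", (1.0 - torch.eye(d))[:, None, :])
        # Transformer: element-wise MLP on [x_i, h_i], shared across i
        # (per-dimension weights would work equally well).
        self.tau = nn.Sequential(nn.Linear(hidden + 1, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def conditioner(self, x):
        # x: (batch, d) -> h: (batch, d, hidden); h[:, i] depends only on x_{-i}
        return torch.einsum("ihd,bd->bih", self.W * self.mask, x)

    def forward(self, x):
        h = self.conditioner(x)
        inp = torch.cat([x.unsqueeze(-1), h], dim=-1)  # (batch, d, hidden + 1)
        return self.tau(inp).squeeze(-1)               # (batch, d)

net = HollowBlock(d=5)
x = torch.randn(3, 5)
print(net(x).shape)  # torch.Size([3, 5])

Because the mask zeroes W[i, :, i], the i-th output depends on x_i only through the transformer, which is the property the modified backward pass below exploits.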
Backward: Modified Computation Graph
[Figure: forward computation graph (left) and modified backward computation graph (right), with inputs x_1, ..., x_d, hidden units h_1, ..., h_d, and outputs f_1, ..., f_d.]
Let \hat{h} = stop_gradient(h) and \hat{f} = \tau(x, \hat{h}), so

$$ \frac{\partial \hat{f}_i(x)}{\partial x_j} = \frac{\partial \tau_i(x_i, \hat{h}_i)}{\partial x_j} = \begin{cases} \dfrac{\partial f_i(x)}{\partial x_i} & \text{if } i = j \\[4pt] 0 & \text{if } i \neq j \end{cases} \qquad (3) $$
Dimension-wise derivatives.
$$ \mathbf{1}^\top \frac{\partial \hat{f}(x)}{\partial x} = \left[\, \frac{\partial f_1(x)}{\partial x_1} \;\cdots\; \frac{\partial f_d(x)}{\partial x_d} \,\right]^\top = D_{\text{dim}} f(x) \qquad (4) $$
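In code, Eqs. (3)-(4) amount to: detach h, then one VJP with the all-ones vector returns every entry of D_dim f at once. A minimal PyTorch sketch with toy conditioner/transformer functions of our own; only the detach-then-VJP pattern is the point.

import torch

torch.manual_seed(0)
d = 5
x = torch.randn(d, requires_grad=True)
W = 0.1 * torch.randn(d, d) * (1.0 - torch.eye(d))    # masked: h_i ignores x_i

conditioner = lambda x: torch.tanh(W @ x)             # h_i = c_i(x_{-i})
transformer = lambda x, h: torch.sin(x) * h + x ** 3  # f_i = tau_i(x_i, h_i)

h = conditioner(x)
f_hat = transformer(x, h.detach())   # hat(h) = stop_gradient(h)

# The Jacobian of f_hat w.r.t. x is diagonal (Eq. 3), so one VJP with the
# all-ones vector yields exactly the dimension-wise derivatives (Eq. 4).
ddim = torch.autograd.grad(f_hat, x, torch.ones_like(f_hat))[0]

# Check against the brute-force Jacobian diagonal of the original f.
full_jac = torch.autograd.functional.jacobian(
    lambda x: transformer(x, conditioner(x)), x)
print(torch.allclose(ddim, full_jac.diagonal()))  # True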
Higher orders.
$$ \mathbf{1}^\top \frac{\partial D^{k-1}_{\text{dim}} \hat{f}(x)}{\partial x} = D^{k}_{\text{dim}} \hat{f}(x) = D^{k}_{\text{dim}} f(x) \qquad (5) $$
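Eq. (5) simply repeats that VJP k times, with create_graph=True so each result stays differentiable. A small sketch; the helper name dimwise is ours, and the toy \hat{f} is purely element-wise, so its Jacobian is already diagonal, exactly what the detached-h construction guarantees.

import torch

def dimwise(f_hat, x, k=1):
    # k-th order dimension-wise derivatives of a function whose Jacobian
    # is diagonal, via k vector-Jacobian products with the ones vector.
    x = x.requires_grad_(True)
    out = f_hat(x)
    for _ in range(k):
        out = torch.autograd.grad(out, x, torch.ones_like(out),
                                  create_graph=True)[0]
    return out

f_hat = lambda x: torch.sin(x) * torch.exp(0.5 * x)   # element-wise toy f_hat
x = torch.linspace(-1.0, 1.0, 5)

d2 = dimwise(f_hat, x, k=2)   # element-wise second derivatives
# analytic check: d^2/dx^2 [sin(x) e^{x/2}] = (cos x - 0.75 sin x) e^{x/2}
ref = (torch.cos(x) - 0.75 * torch.sin(x)) * torch.exp(0.5 * x)
print(torch.allclose(d2, ref, atol=1e-5))  # True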
Backpropagating through dim-wise derivatives.
$$ \frac{\partial D^{k}_{\text{dim}} \hat{f}}{\partial w} + \frac{\partial D^{k}_{\text{dim}} \hat{f}}{\partial \hat{h}}\, \frac{\partial h}{\partial w} = \frac{\partial D^{k}_{\text{dim}} f}{\partial w} \qquad (6) $$
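One way to realize Eq. (6) with standard autograd (our sketch under the same toy-architecture assumptions as above, not the authors' implementation): take the parameter gradient of a loss on D_dim \hat{f} with \hat{h} held fixed, then add the term that flows through the undetached conditioner output h.

import torch

torch.manual_seed(0)
d = 4
x = torch.randn(d, requires_grad=True)
mask = 1.0 - torch.eye(d)
W_c = (0.1 * torch.randn(d, d)).requires_grad_(True)  # conditioner params
W_t = (0.1 * torch.randn(d)).requires_grad_(True)     # transformer params
params = [W_c, W_t]

h = torch.tanh((W_c * mask) @ x)               # h_i = c_i(x_{-i}; W_c)
h_hat = h.detach().requires_grad_(True)        # stop_gradient(h)
f_hat = W_t * torch.sin(x) * h_hat + x ** 2    # f_i = tau_i(x_i, h_i; W_t)

ddim = torch.autograd.grad(f_hat, x, torch.ones_like(f_hat),
                           create_graph=True)[0]
loss = (ddim ** 2).sum()                       # any loss on D_dim f

# Direct term of Eq. (6): dL/dw with h_hat treated as a constant.
direct = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
# Correction term: dL/dh_hat chained through dh/dw.
g_h = torch.autograd.grad(loss, h_hat, retain_graph=True)[0]
via_h = torch.autograd.grad(h, params, grad_outputs=g_h, allow_unused=True)
total = [(0 if a is None else a) + (0 if b is None else b)
         for a, b in zip(direct, via_h)]

# Reference: differentiate the loss built from the true Jacobian diagonal.
jac = torch.autograd.functional.jacobian(
    lambda x: W_t * torch.sin(x) * torch.tanh((W_c * mask) @ x) + x ** 2,
    x, create_graph=True)
ref = torch.autograd.grad((jac.diagonal() ** 2).sum(), params)
print(all(torch.allclose(t, r, atol=1e-5) for t, r in zip(total, ref)))  # True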
App I: Linear Multistep ODE Solvers
Implicit ODE solvers need to solve an optimization sub-problem in the inner loop.
General idea: replace Newton-Raphson
$$ y^{(k+1)} = y^{(k)} - \left[ \frac{\partial F(y^{(k)})}{\partial y^{(k)}} \right]^{-1} F(y^{(k)}) \qquad (7) $$
with Jacobi-Newton
$$ y^{(k+1)} = y^{(k)} - \left[ D_{\text{dim}} F(y^{(k)}) \right]^{-1} \odot F(y^{(k)}) \qquad (8) $$
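A sketch of the inner solve for an implicit Euler step (implicit Euler and the toy dynamics are our illustrative choices; the poster targets linear multistep methods). In the paper's setting D_dim F would come from one cheap AD call on the restricted architecture; here we simply read the Jacobian diagonal off a toy residual.

import torch

def f_ode(y):
    # toy dynamics with weak off-diagonal coupling
    return -torch.tanh(y) + 0.1 * y.roll(1)

y0, dt = torch.randn(6), 0.1
F = lambda y: y - y0 - dt * f_ode(y)   # implicit Euler residual

y = y0.clone()
for _ in range(20):
    ddim_F = torch.autograd.functional.jacobian(F, y).diagonal()
    y = y - F(y) / ddim_F              # Jacobi-Newton update, Eq. (8)

print(F(y).abs().max())                # residual is ~0 after a few iterations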
[Plot: number of function evaluations vs. training iteration for the RK4(5), ABM, and ABM-Jacobi solvers.]
App II: Continuous Normalizing Flows
If dx/dt = f(t, x), then

$$ \frac{\partial \log p(x(t))}{\partial t} = -\operatorname{tr}\!\left( \frac{\partial f}{\partial x} \right). $$

(See "Neural ODEs".)
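With the restricted architecture, tr(∂f/∂x) is exactly the sum of D_dim f, so the log-density dynamics cost one extra VJP and need no stochastic estimator. A small PyTorch sketch (toy dynamics and names are ours):

import torch

d = 3
mask = 1.0 - torch.eye(d)
W = 0.1 * torch.randn(d, d)

def dynamics_and_dlogp(t, x):
    x = x.detach().requires_grad_(True)
    h = torch.tanh((W * mask) @ x)                 # h_i independent of x_i
    f = torch.sin(x) * h + t * x                   # dx/dt = f(t, x)
    f_hat = torch.sin(x) * h.detach() + t * x      # detached copy for the trace
    ddim = torch.autograd.grad(f_hat, x, torch.ones_like(f_hat),
                               create_graph=True)[0]
    # d log p(x(t)) / dt = -tr(df/dx) = -sum_i D_dim f_i, computed exactly
    return f, -ddim.sum()

f, dlogp = dynamics_and_dlogp(torch.tensor(0.5), torch.randn(d))
print(f, dlogp)

Training through the flow still requires the Eq. (6) correction for the parameter gradients that flow through h.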
Model       | MNIST ELBO ↑ | MNIST NLL ↓ | Omniglot ELBO ↑ | Omniglot NLL ↓
VAE         | -86.55       | 82.14       | -104.28         | 97.25
Planar      | -86.06       | 81.91       | -102.65         | 96.04
IAF         | -84.20       | 80.79       | -102.41         | 96.08
Sylvester   | -83.32       | 80.22       | -99.00          | 93.77
FFJORD      | -82.82       | −           | -98.33          | −
DiffOp-CNF  | -82.37       | 80.22       | -97.42          | 93.90

Table: Evidence lower bound and negative log-likelihood for static MNIST and Omniglot.
Exact trace vs. stochastic trace estimate: the exact trace converges faster and results in easier-to-solve dynamical systems.
App III: Stochastic Differential Eqs
$$ dx(t) = f(x(t), t)\, dt + g(x(t), t)\, dW \qquad (9) $$
[Figure: data density and learned density.]
Idea: match the left- and right-hand sides of the Fokker-Planck equation, which describes the change in density.
$$ \frac{\partial p(t, x)}{\partial t} = -\sum_{i=1}^{d} \frac{\partial}{\partial x_i} \big[ f_i(t, x)\, p(t, x) \big] + \frac{1}{2} \sum_{i=1}^{d} \frac{\partial^2}{\partial x_i^2} \big[ g_{ii}^2(t, x)\, p(t, x) \big] \qquad (10) $$
References
• Germain et al. "MADE: Masked Autoencoder for Distribution Estimation." (2015)
• Huang et al. "Neural Autoregressive Flows." (2018)
• Chen et al. "Neural Ordinary Differential Equations." (2018)