
  • Differentiable Functional Programming

    Atılım Güneş Baydin
    University of Oxford
    http://www.robots.ox.ac.uk/~gunes/

    F#unctional Londoners Meetup, April 28, 2016

  • About me

    Current (from 11 April 2016): Postdoctoral researcher,
    Machine Learning Research Group, University of Oxford
    http://www.robots.ox.ac.uk/~parg/

    Previously: Brain and Computation Lab,
    National University of Ireland Maynooth
    http://www.bcl.hamilton.ie/

    Working primarily with F#, on algorithmic differentiation,
    functional programming, machine learning

  • Today’s talk

    Derivatives in computer programs
    Differentiable functional programming
    DiffSharp + Hype libraries
    Two demos

  • Derivatives in computer programs: how do we compute them?

  • Manual differentiation

    f(x) = sin(exp x)
    let f x = sin (exp x)

    Calculus 101: differentiation rules
    d(fg)/dx = (df/dx) g + f (dg/dx)
    d(af + bg)/dx = a (df/dx) + b (dg/dx)
    . . .

    f'(x) = cos(exp x) × exp x
    let f' x = (cos (exp x)) * (exp x)


  • Manual differentiation
    It can get complicated

    f(x) = 64x(1 − x)(1 − 2x)²(1 − 8x + 8x²)²
    (4th iteration of the logistic map l_{n+1} = 4 l_n (1 − l_n), l_1 = x)

    let f x =
        64. * x * (1. - x) * ((1. - 2.*x) ** 2.) * ((1. - 8.*x + 8.*x*x) ** 2.)

    f'(x) = 128x(1 − x)(−8 + 16x)(1 − 2x)²(1 − 8x + 8x²)
            + 64(1 − x)(1 − 2x)²(1 − 8x + 8x²)²
            − 64x(1 − 2x)²(1 − 8x + 8x²)²
            − 256x(1 − x)(1 − 2x)(1 − 8x + 8x²)²

    let f' x =
        128.*x * (1.-x) * (-8. + 16.*x) * ((1.-2.*x) ** 2.) * (1.-8.*x+8.*x*x)
        + 64. * (1.-x) * ((1.-2.*x) ** 2.) * ((1.-8.*x+8.*x*x) ** 2.)
        - 64.*x * ((1.-2.*x) ** 2.) * ((1.-8.*x+8.*x*x) ** 2.)
        - 256.*x * (1.-x) * (1.-2.*x) * ((1.-8.*x+8.*x*x) ** 2.)


  • Symbolic differentiation

    Computer algebra packages help: Mathematica, Maple, Maxima

    But it has some serious drawbacks


  • Symbolic differentiation
    We get “expression swell”

    Logistic map l_{n+1} = 4 l_n (1 − l_n), l_1 = x

    n   l_n                                    d/dx l_n
    1   x                                      1
    2   4x(1 − x)                              4(1 − x) − 4x
    3   16x(1 − x)(1 − 2x)²                    16(1 − x)(1 − 2x)² − 16x(1 − 2x)² − 64x(1 − x)(1 − 2x)
    4   64x(1 − x)(1 − 2x)²(1 − 8x + 8x²)²     128x(1 − x)(−8 + 16x)(1 − 2x)²(1 − 8x + 8x²) + 64(1 − x)(1 − 2x)²(1 − 8x + 8x²)²
                                               − 64x(1 − 2x)²(1 − 8x + 8x²)² − 256x(1 − x)(1 − 2x)(1 − 8x + 8x²)²

    [Plot: number of terms in l_n and in d/dx l_n versus n, growing rapidly with n]

  • Symbolic differentiation
    We are limited to closed-form formulae

    You can find the derivative of math expressions:
    f(x) = 64x(1 − x)(1 − 2x)²(1 − 8x + 8x²)²

    But not of algorithms, branching, control flow:

    let f x n =
        if n = 1 then
            x
        else
            let mutable v = x
            for i = 1 to n - 1 do
                v <- 4. * v * (1. - v)   // logistic map update
            v


  • Numerical differentiation

    A very common hack: use the limit definition of the derivative

    df/dx = lim_{h→0} (f(x + h) − f(x)) / h

    to approximate the numerical value of the derivative

    let diff f x =
        let h = 0.00001
        (f (x + h) - f x) / h

    Again, some serious drawbacks


  • Numerical differentiation

    We must select a proper value of h, and we face approximation errors

    [Plot: error versus h, for h from 10⁻¹⁷ to 10⁻¹;
     round-off error dominates for small h, truncation error for large h]

    Computed using
    E(h, x*) = | (f(x* + h) − f(x*)) / h − d/dx f(x)|_{x*} |
    f(x) = 64x(1 − x)(1 − 2x)²(1 − 8x + 8x²)²
    x* = 0.2

  • Numerical differentiation
    Better approximations exist

    Higher-order finite differences, e.g.,
    ∂f(x)/∂x_i = (f(x + h e_i) − f(x − h e_i)) / (2h) + O(h²)

    Richardson extrapolation
    Differential quadrature

    but they increase rapidly in complexity and never completely
    eliminate the error
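    The central-difference formula above translates directly into F#; a minimal sketch (the function name diffCentral and the step size h are arbitrary choices, not from the slides):

    // Central-difference approximation of df/dx at a point.
    // Second-order accurate: the truncation error is O(h^2) rather than O(h).
    let diffCentral f x =
        let h = 1e-5
        (f (x + h) - f (x - h)) / (2. * h)

    // Example: derivative of sin (exp x) at x = 1.0
    let approx = diffCentral (fun x -> sin (exp x)) 1.0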

  • Numerical differentiation
    Poor performance

    For f : Rⁿ → R, approximate the gradient ∇f = (∂f/∂x_1, . . . , ∂f/∂x_n) using

    ∂f(x)/∂x_i ≈ (f(x + h e_i) − f(x)) / h ,   0 < h ≪ 1

    We must repeat the function evaluation n times to get ∇f
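    A direct transcription makes the cost visible; in this sketch (gradNumerical is an illustrative name, not a library function) each call evaluates f once per input dimension plus once at the base point:

    // Forward-difference gradient of f : float[] -> float.
    // Costs n + 1 evaluations of f for an n-dimensional input.
    let gradNumerical (f: float[] -> float) (x: float[]) =
        let h = 1e-5
        let fx = f x
        Array.init x.Length (fun i ->
            let xh = Array.copy x
            xh.[i] <- xh.[i] + h
            (f xh - fx) / h)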

  • Algorithmic differentiation (AD)

  • Algorithmic differentiation
    Also known as automatic differentiation (Griewank & Walther, 2008)

    Gives numeric code that computes
    the function AND its derivatives at a given point

    f(a, b):
        c = a * b
        d = sin c
        return d

    f'(a, a', b, b'):
        (c, c') = (a*b, a'*b + a*b')
        (d, d') = (sin c, c' * cos c)
        return (d, d')

    Derivatives propagated at the elementary operation level,
    as a side effect, at the same time as the function itself is computed
    → Prevents the “expression swell” of symbolic derivatives
    Full expressive capability of the host language
    → Including conditionals, looping, branching
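    The f' transformation above can be mechanized with dual numbers: every value carries its tangent, and each elementary operation updates both. A minimal F# sketch, illustrative only and not DiffSharp's implementation:

    // A value paired with its derivative (tangent).
    type Dual = { Primal: float; Tangent: float }

    let dual p t = { Primal = p; Tangent = t }

    // Each elementary operation propagates the tangent by its local derivative.
    let mul (a: Dual) (b: Dual) =
        dual (a.Primal * b.Primal) (a.Tangent * b.Primal + a.Primal * b.Tangent)
    let sin' (a: Dual) =
        dual (sin a.Primal) (a.Tangent * cos a.Primal)

    // f(a, b) = sin (a * b), as in the slide
    let f a b = sin' (mul a b)

    // Seed a with tangent 1 and b with tangent 0 to get the partial derivative
    // with respect to a at (2, 3): r.Tangent = cos 6.0 * 3.0
    let r = f (dual 2.0 1.0) (dual 3.0 0.0)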

  • Function evaluation traces

    All numeric evaluations are sequences of elementary operations:
    a “trace,” also called a “Wengert list” (Wengert, 1964)

    f(a, b):
        c = a * b
        if c > 0
            d = log c
        else
            d = sin c
        return d

    f(2, 3)

    (primal)
    a = 2
    b = 3
    c = a * b = 6
    d = log c = 1.791
    return d

    (tangent)
    a = 2,  a' = 1
    b = 3,  b' = 0
    c = a * b = 6
    c' = a' * b + a * b' = 3
    d = log c = 1.791
    d' = c' * (1 / c) = 0.5
    return d, d'

    i.e., a Jacobian-vector product Jf (1, 0)|_(2,3) = ∂/∂a f(a, b)|_(2,3) = 0.5
    This is called the forward (tangent) mode of AD


  • Function evaluation traces

    f(a, b):
        c = a * b
        if c > 0
            d = log c
        else
            d = sin c
        return d

    f(2, 3)

    (primal)
    a = 2
    b = 3
    c = a * b = 6
    d = log c = 1.791
    return d

    (adjoint)
    a = 2
    b = 3
    c = a * b = 6
    d = log c = 1.791
    d' = 1
    c' = d' * (1 / c) = 0.166
    b' = c' * a = 0.333
    a' = c' * b = 0.5
    return d, a', b'

    i.e., a transposed Jacobian-vector product
    Jfᵀ (1)|_(2,3) = ∇f|_(2,3) = (0.5, 0.333)
    This is called the reverse (adjoint) mode of AD

    Backpropagation is just a special case of the reverse mode:
    code your neural network objective computation, apply reverse AD
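    The adjoint trace can likewise be mechanized with a tape: each operation records how to push its adjoint back to its inputs, and the tape is replayed in reverse. A minimal F# sketch, again illustrative rather than DiffSharp's implementation:

    // A tape node: a value, its adjoint, and a closure that propagates the
    // adjoint back to the node's inputs.
    type Node =
        { Value: float
          mutable Adjoint: float
          mutable PushToInputs: unit -> unit }

    let tape = ResizeArray<Node>()

    let node v =
        let n = { Value = v; Adjoint = 0.0; PushToInputs = (fun () -> ()) }
        tape.Add n
        n

    let mul (a: Node) (b: Node) =
        let n = node (a.Value * b.Value)
        n.PushToInputs <- fun () ->
            a.Adjoint <- a.Adjoint + n.Adjoint * b.Value
            b.Adjoint <- b.Adjoint + n.Adjoint * a.Value
        n

    let log' (a: Node) =
        let n = node (log a.Value)
        n.PushToInputs <- fun () -> a.Adjoint <- a.Adjoint + n.Adjoint / a.Value
        n

    let sin' (a: Node) =
        let n = node (sin a.Value)
        n.PushToInputs <- fun () -> a.Adjoint <- a.Adjoint + n.Adjoint * cos a.Value
        n

    // f(a, b) from the slide; control flow follows the primal values
    let f a b =
        let c = mul a b
        if c.Value > 0.0 then log' c else sin' c

    let a, b = node 2.0, node 3.0
    let d = f a b                 // primal: log 6 = 1.791...
    d.Adjoint <- 1.0              // seed the output adjoint
    for i = tape.Count - 1 downto 0 do
        tape.[i].PushToInputs ()
    // a.Adjoint = 0.5, b.Adjoint = 0.333..., matching the slide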


  • How is this useful?

  • Forward vs reverse

    In the extreme cases:
    for F : R → Rᵐ, forward AD can compute all (∂F_1/∂x, . . . , ∂F_m/∂x)
    for f : Rⁿ → R, reverse AD can compute ∇f = (∂f/∂x_1, . . . , ∂f/∂x_n)
    in just one evaluation

    In general, for f : Rⁿ → Rᵐ, the Jacobian J ∈ Rᵐˣⁿ takes
    O(n × time(f)) with forward AD
    O(m × time(f)) with reverse AD

    Reverse mode performs better when n ≫ m


  • How is this useful?

    Traditional application domains of AD in industry and academia
    (Corliss et al., 2002)

    Computational fluid dynamics
    Atmospheric chemistry
    Engineering design optimization
    Computational finance

  • Functional AD, or “differentiable functional programming”

  • AD and functional programming

    AD has been around since the 1960s
    (Wengert, 1964; Speelpenning, 1980; Griewank, 1989)

    The foundations for AD in a functional framework
    (Siskind and Pearlmutter, 2008; Pearlmutter and Siskind, 2008)

    With research implementations:

    R6RS-AD
    https://github.com/qobi/R6RS-AD

    Stalingrad
    http://www.bcl.hamilton.ie/~qobi/stalingrad/

    Alexey Radul’s DVL
    https://github.com/axch/dysvunctional-language

    Recently, my DiffSharp library
    http://diffsharp.github.io/DiffSharp/

  • Differentiable functional programming

    Deep learning: neural network models are assembled from
    building blocks and trained with backpropagation

    Traditional:
    Feedforward
    Convolutional
    Recurrent


  • Differentiable functional programming

    Newer additions: make algorithmic elements continuous and differentiable
    → enables use in deep learning

    [Figure: NTM on copy task (Graves et al., 2014)]

    Neural Turing Machine (Graves et al., 2014)
    → can infer algorithms: copy, sort, recall
    Stack-augmented RNN (Joulin & Mikolov, 2015)
    End-to-end memory network (Sukhbaatar et al., 2015)
    Stack, queue, deque (Grefenstette et al., 2015)
    Discrete interfaces (Zaremba & Sutskever, 2015)

  • Differentiable functional programming

    Stacking of many layers, trained through backpropagation

    [Figure: layer-by-layer architecture diagrams]
    AlexNet, 8 layers (ILSVRC 2012)
    VGG, 19 layers (ILSVRC 2014)
    ResNet, 152 layers, deep residual learning (ILSVRC 2015)

    (He, Zhang, Ren, Sun. “Deep Residual Learning for Image Recognition.” 2015. arXiv:1512.03385)

  • Differentiable functional programming

    One way of viewing deep learning systems is
    “differentiable functional programming”

    Two main characteristics:

    Differentiability
    → optimization

    Chained function composition
    → successive transformations
    → successive levels of distributed representations (Bengio, 2013)
    → the chain rule of calculus propagates derivatives

  • The bigger picture

    In a functional interpretation:

    Weight-tying or multiple applications of the same neuron
    (e.g., ConvNets and RNNs) resemble function abstraction

    Structural patterns of composition resemble
    higher-order functions (e.g., map, fold, unfold, zip), as sketched below
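    A small sketch of the analogy, assuming DiffSharp's 0.7-era DiffSharp.AD.Float64 API (D, diff): weight-tying is reusing the same parameter w at every step of a fold, and AD differentiates straight through the higher-order function:

    open DiffSharp.AD.Float64

    // A tiny "recurrent" computation: the same scalar weight w is applied
    // at every step of a fold over the input sequence.
    let tied (xs: float list) (w: D) =
        xs |> List.fold (fun acc x -> tanh (w * (acc + D x))) (D 0.)

    // Derivative of the final output with respect to the shared weight
    let dOutput_dw = diff (tied [0.1; 0.2; 0.3]) (D 0.5)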

  • The bigger picture

    Even when you have complex compositions,
    differentiability ensures that they can be trained end-to-end
    with backpropagation

    (Vinyals, Toshev, Bengio, Erhan. “Show and tell: a neural image caption generator.” 2014. arXiv:1411.4555)

  • The bigger picture

    Christopher Olah’s blog post (September 3, 2015)
    http://colah.github.io/posts/2015-09-NN-Types-FP/
    “The field does not (yet) have a unifying insight or narrative”

    David Dalrymple’s essay (January 2016)
    http://edge.org/response-detail/26794
    “The most natural playground ... would be a new language that can
    run back-propagation directly on functional programs.”

    AD in a functional framework is a manifestation of this vision.


  • DiffSharp

  • The ambition

    Deeply embedded AD (forward and/or reverse)
    as part of the language infrastructure

    Rich API of differentiation operations
    as higher-order functions

    High-performance matrix operations for deep learning
    (GPU support, model and data parallelism), gradients,
    Hessians, Jacobians, directional derivatives, matrix-free
    Hessian- and Jacobian-vector products

    I have been working on these issues with Barak Pearlmutter
    and created DiffSharp:
    http://diffsharp.github.io/DiffSharp/


  • DiffSharp

    “Generalized AD as a first-class function in an augmented
    λ-calculus” (Pearlmutter and Siskind, 2008)

    Forward, reverse, and any nested combination thereof,
    instantiated according to usage scenario

    Nested lambda expressions with free-variable references:

    min (λx . (f x) + min (λy . g x y))

    let m = min (fun x -> (f x) + min (fun y -> g x y))

    Must handle “perturbation confusion” (Manzyuk et al., 2012):

    d/dx ( x · (d/dy (x + y))|_{y=1} )|_{x=1}  =?  1

    let d = diff (fun x -> x * (diff (fun y -> x + y) 1.)) 1.
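    To see why the answer is 1: the inner derivative d/dy (x + y) equals 1 for every x, so the outer expression reduces to d/dx (x · 1)|_{x=1} = 1. The classic failure mode is an implementation that reuses one perturbation tag for both nested derivatives: the outer perturbation on x then leaks into the inner derivative, which reports 2, and the whole expression evaluates to 2 instead of 1. Distinct tags on nested perturbations, as in generalized AD, give the correct result.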


  • DiffSharp
    Higher-order differentiation API

    Op.            Value              Type signature                                  AD      Num.  Sym.

    f : R → R
    diff           f′                 (R → R) → R → R                                 X, F    A     X
    diff'          (f, f′)            (R → R) → R → (R × R)                           X, F    A     X
    diff2          f′′                (R → R) → R → R                                 X, F    A     X
    diff2'         (f, f′′)           (R → R) → R → (R × R)                           X, F    A     X
    diff2''        (f, f′, f′′)       (R → R) → R → (R × R × R)                       X, F    A     X
    diffn          f⁽ⁿ⁾               N → (R → R) → R → R                             X, F          X
    diffn'         (f, f⁽ⁿ⁾)          N → (R → R) → R → (R × R)                       X, F          X

    f : Rⁿ → R
    grad           ∇f                 (Rⁿ → R) → Rⁿ → Rⁿ                              X, R    A     X
    grad'          (f, ∇f)            (Rⁿ → R) → Rⁿ → (R × Rⁿ)                        X, R    A     X
    gradv          ∇f · v             (Rⁿ → R) → Rⁿ → Rⁿ → R                          X, F    A
    gradv'         (f, ∇f · v)        (Rⁿ → R) → Rⁿ → Rⁿ → (R × R)                    X, F    A
    hessian        Hf                 (Rⁿ → R) → Rⁿ → Rⁿˣⁿ                            X, R-F  A     X
    hessian'       (f, Hf)            (Rⁿ → R) → Rⁿ → (R × Rⁿˣⁿ)                      X, R-F  A     X
    hessianv       Hf v               (Rⁿ → R) → Rⁿ → Rⁿ → Rⁿ                         X, F-R  A
    hessianv'      (f, Hf v)          (Rⁿ → R) → Rⁿ → Rⁿ → (R × Rⁿ)                   X, F-R  A
    gradhessian    (∇f, Hf)           (Rⁿ → R) → Rⁿ → (Rⁿ × Rⁿˣⁿ)                     X, R-F  A     X
    gradhessian'   (f, ∇f, Hf)        (Rⁿ → R) → Rⁿ → (R × Rⁿ × Rⁿˣⁿ)                 X, R-F  A     X
    gradhessianv   (∇f · v, Hf v)     (Rⁿ → R) → Rⁿ → Rⁿ → (R × Rⁿ)                   X, F-R  A
    gradhessianv'  (f, ∇f · v, Hf v)  (Rⁿ → R) → Rⁿ → Rⁿ → (R × R × Rⁿ)               X, F-R  A
    laplacian      tr(Hf)             (Rⁿ → R) → Rⁿ → R                               X, R-F  A     X
    laplacian'     (f, tr(Hf))        (Rⁿ → R) → Rⁿ → (R × R)                         X, R-F  A     X

    f : Rⁿ → Rᵐ
    jacobian       Jf                 (Rⁿ → Rᵐ) → Rⁿ → Rᵐˣⁿ                           X, F/R  A     X
    jacobian'      (f, Jf)            (Rⁿ → Rᵐ) → Rⁿ → (Rᵐ × Rᵐˣⁿ)                    X, F/R  A     X
    jacobianv      Jf v               (Rⁿ → Rᵐ) → Rⁿ → Rⁿ → Rᵐ                        X, F    A
    jacobianv'     (f, Jf v)          (Rⁿ → Rᵐ) → Rⁿ → Rⁿ → (Rᵐ × Rᵐ)                 X, F    A
    jacobianT      Jfᵀ                (Rⁿ → Rᵐ) → Rⁿ → Rⁿˣᵐ                           X, F/R  A     X
    jacobianT'     (f, Jfᵀ)           (Rⁿ → Rᵐ) → Rⁿ → (Rᵐ × Rⁿˣᵐ)                    X, F/R  A     X
    jacobianTv     Jfᵀ v              (Rⁿ → Rᵐ) → Rⁿ → Rᵐ → Rⁿ                        X, R
    jacobianTv'    (f, Jfᵀ v)         (Rⁿ → Rᵐ) → Rⁿ → Rᵐ → (Rᵐ × Rⁿ)                 X, R
    jacobianTv''   (f, Jfᵀ(·))        (Rⁿ → Rᵐ) → Rⁿ → (Rᵐ × (Rᵐ → Rⁿ))               X, R
    curl           ∇ × f              (R³ → R³) → R³ → R³                             X, F    A     X
    curl'          (f, ∇ × f)         (R³ → R³) → R³ → (R³ × R³)                      X, F    A     X
    div            ∇ · f              (Rⁿ → Rⁿ) → Rⁿ → R                              X, F    A     X
    div'           (f, ∇ · f)         (Rⁿ → Rⁿ) → Rⁿ → (Rⁿ × R)                       X, F    A     X
    curldiv        (∇ × f, ∇ · f)     (R³ → R³) → R³ → (R³ × R)                       X, F    A     X
    curldiv'       (f, ∇ × f, ∇ · f)  (R³ → R³) → R³ → (R³ × R³ × R)                  X, F    A     X
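    A short usage sketch of two operators from the table, assuming the 0.7-era DiffSharp.AD.Float64 names (D, DV, toDV, grad, hessianv):

    open DiffSharp.AD.Float64

    // f : R^2 -> R
    let f (x: DV) = sin (x.[0] * x.[1]) + x.[0] * x.[0]

    let x0 = toDV [2.0; 3.0]
    let g  = grad f x0                         // reverse mode: the full gradient
    let hv = hessianv f x0 (toDV [1.0; 0.0])   // forward-on-reverse: H·v without forming H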

  • DiffSharp

    Matrix operations
    http://diffsharp.github.io/DiffSharp/api-overview.html

    High-performance OpenBLAS backend by default, work on a
    CUDA-based GPU backend underway

    Support for 64- and 32-bit floats (faster on many systems)

    Benchmarking tool
    http://diffsharp.github.io/DiffSharp/benchmarks.html

    A growing collection of tutorials: gradient-based optimization
    algorithms, clustering, Hamiltonian Monte Carlo, neural networks,
    inverse kinematics

  • Hype

  • Hype
    http://hypelib.github.io/Hype/

    An experimental library for “compositional machine learning
    and hyperparameter optimization”, built on DiffSharp

    A robust optimization core:
    highly configurable functional modules
    SGD, conjugate gradient, Nesterov, AdaGrad, RMSProp,
    Newton’s method
    Use nested AD for gradient-based hyperparameter
    optimization (Maclaurin et al., 2015), as sketched below

    Researching the differentiable functional programming paradigm
    for machine learning
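    A toy version of that nested use, illustrative rather than Hype's API (it assumes DiffSharp's D type and diff operator): the outer derivative is taken with respect to the learning rate used inside an inner training loop that itself calls diff:

    open DiffSharp.AD.Float64

    let innerLoss (w: D) = (w - D 3.) * (w - D 3.)

    // Run k steps of gradient descent on innerLoss from w0 with rate eta,
    // and return the final loss as a function of eta.
    let trainedLoss k (w0: D) (eta: D) =
        let mutable w = w0
        for _ in 1 .. k do
            w <- w - eta * diff innerLoss w
        innerLoss w

    // Sensitivity of the trained loss to the learning rate: nested AD,
    // since diff appears both inside trainedLoss and around it.
    let dLoss_dEta = diff (trainedLoss 10 (D 0.)) (D 0.05)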

  • Hype

    Extracts from Hype neural network code:
    use higher-order functions, don’t think about gradients or backpropagation
    https://github.com/hypelib/Hype/blob/master/src/Hype/Neural.fs

  • Hype

    Extracts from Hype optimization code:
    https://github.com/hypelib/Hype/blob/master/src/Hype/Optimize.fs

    Optimization and training as higher-order functions
    → can be composed, nested (see the sketch below)
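    A minimal sketch of that idea, not Hype's actual Optimize module (it assumes DiffSharp's D, DV, toDV, grad and their arithmetic overloads): the objective comes in as a function, the gradient is instantiated inside with reverse-mode AD, and no backpropagation code is written by hand:

    open DiffSharp.AD.Float64

    // Training as a higher-order function over the objective f.
    let gradientDescent (eta: D) (steps: int) (f: DV -> D) (x0: DV) =
        let mutable x = x0
        for _ in 1 .. steps do
            x <- x - eta * grad f x
        x

    // Example: minimize a small quadratic
    let xmin =
        gradientDescent (D 0.1) 100 (fun x -> x.[0] * x.[0] + x.[1] * x.[1]) (toDV [1.0; -2.0])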

  • Hype

    The user doesn’t need to think about derivatives;
    they are instantiated within the optimization code

  • Hype

    But they can use derivatives within their models, if needed
    → input sensitivities
    → complex objective functions
    → adaptive PID controllers
    → integrating differential equations

    Thanks to nested generalized AD:
    you can optimize components that are internally using differentiation;
    resulting higher-order derivatives propagate via
    forward/reverse AD as needed


  • Hype

    We also provide a Torch-like API for neural networks

    A cool thing: thanks to AD, we can freely code
    any F# function as a layer; it just works


  • Hype
    http://hypelib.github.io/Hype/feedforwardnets.html

    We also have some nice additions for F# Interactive

  • Roadmap

    Transformation-based, context-aware AD
    F# quotations (Syme, 2006) give us a direct path for deeply
    embedding AD

    Currently experimenting with GPU backends
    (CUDA, ArrayFire, Magma)

    Generalizing to tensors
    (for elegant implementations of, e.g., ConvNets)

  • Demos

  • Thank You!

    References

    • Baydin AG, Pearlmutter BA, Radul AA, Siskind JM (Submitted) Automatic differentiation in machine learning: a survey [arXiv:1502.05767]
    • Baydin AG, Pearlmutter BA, Siskind JM (Submitted) DiffSharp: automatic differentiation library [arXiv:1511.07727]
    • Bengio Y (2013) Deep learning of representations: looking forward. Statistical Language and Speech Processing. LNCS 7978:1–37 [arXiv:1404.7456]
    • Graves A, Wayne G, Danihelka I (2014) Neural Turing machines [arXiv:1410.5401]
    • Grefenstette E, Hermann KM, Suleyman M, Blunsom P (2015) Learning to transduce with unbounded memory [arXiv:1506.02516]
    • Griewank A, Walther A (2008) Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Society for Industrial and Applied Mathematics, Philadelphia [DOI 10.1137/1.9780898717761]
    • He K, Zhang X, Ren S, Sun J (2015) Deep residual learning for image recognition [arXiv:1512.03385]
    • Joulin A, Mikolov T (2015) Inferring algorithmic patterns with stack-augmented recurrent nets [arXiv:1503.01007]
    • Maclaurin D, Duvenaud D, Adams RP (2015) Gradient-based hyperparameter optimization through reversible learning [arXiv:1502.03492]
    • Manzyuk O, Pearlmutter BA, Radul AA, Rush DR, Siskind JM (2012) Confusion of tagged perturbations in forward automatic differentiation of higher-order functions [arXiv:1211.4892]
    • Pearlmutter BA, Siskind JM (2008) Reverse-mode AD in a functional framework: lambda the ultimate backpropagator. ACM TOPLAS 30(2):7 [DOI 10.1145/1330017.1330018]
    • Siskind JM, Pearlmutter BA (2008) Nesting forward-mode AD in a functional framework. Higher-Order and Symbolic Computation 21(4):361–76 [DOI 10.1007/s10990-008-9037-1]
    • Sukhbaatar S, Szlam A, Weston J, Fergus R (2015) Weakly supervised memory networks [arXiv:1503.08895]
    • Syme D (2006) Leveraging .NET meta-programming components from F#: integrated queries and interoperable heterogeneous execution. 2006 Workshop on ML. ACM.
    • Vinyals O, Toshev A, Bengio S, Erhan D (2014) Show and tell: a neural image caption generator [arXiv:1411.4555]
    • Wengert R (1964) A simple automatic derivative evaluation program. Communications of the ACM 7:463–4
    • Zaremba W, Sutskever I (2015) Reinforcement learning neural Turing machines [arXiv:1505.00521]