
  • Differentiable Functional Programming

    Atılım Güneş Baydin
    University of Oxford
    http://www.robots.ox.ac.uk/~gunes/

    F#unctional Londoners Meetup, April 28, 2016

  • About me

    Current (from 11 April 2016): Postdoctoral researcher,
    Machine Learning Research Group, University of Oxford
    http://www.robots.ox.ac.uk/~parg/

    Previously: Brain and Computation Lab,
    National University of Ireland Maynooth
    http://www.bcl.hamilton.ie/

    Working primarily with F#, on algorithmic differentiation,
    functional programming, machine learning

  • Today’s talk

    Derivatives in computer programs
    Differentiable functional programming
    DiffSharp + Hype libraries
    Two demos

  • Derivatives in computer programs: how do we compute them?

  • Manual differentiation

    f(x) = sin(exp x)
    let f x = sin (exp x)

    Calculus 101: differentiation rules
    d(fg)/dx = (df/dx) g + f (dg/dx)
    d(af + bg)/dx = a (df/dx) + b (dg/dx)
    . . .

    f'(x) = cos(exp x) × exp x
    let f' x = (cos (exp x)) * (exp x)


  • Manual differentiation
    It can get complicated

    f(x) = 64x(1 − x)(1 − 2x)²(1 − 8x + 8x²)²
    (4th iteration of the logistic map l_{n+1} = 4 l_n (1 − l_n), l_1 = x)

    let f x =
        64. * x * (1. - x) * ((1. - 2.*x) ** 2.) * ((1. - 8.*x + 8.*x*x) ** 2.)

    f'(x) = 128x(1 − x)(−8 + 16x)(1 − 2x)²(1 − 8x + 8x²)
            + 64(1 − x)(1 − 2x)²(1 − 8x + 8x²)²
            − 64x(1 − 2x)²(1 − 8x + 8x²)²
            − 256x(1 − x)(1 − 2x)(1 − 8x + 8x²)²

    let f' x =
        128.*x * (1.-x) * (-8. + 16.*x) * ((1.-2.*x) ** 2.) * (1.-8.*x+8.*x*x)
        + 64. * (1.-x) * ((1.-2.*x) ** 2.) * ((1.-8.*x+8.*x*x) ** 2.)
        - 64.*x * ((1.-2.*x) ** 2.) * ((1.-8.*x+8.*x*x) ** 2.)
        - 256.*x * (1.-x) * (1.-2.*x) * ((1.-8.*x+8.*x*x) ** 2.)


  • Symbolic differentiation

    Computer algebra packages help: Mathematica, Maple, Maxima

    But it has some serious drawbacks


  • Symbolic differentiation
    We get “expression swell”

    Logistic map l_{n+1} = 4 l_n (1 − l_n), l_1 = x

    n   l_n                                    d/dx l_n
    1   x                                      1
    2   4x(1 − x)                              4(1 − x) − 4x
    3   16x(1 − x)(1 − 2x)²                    16(1 − x)(1 − 2x)² − 16x(1 − 2x)² − 64x(1 − x)(1 − 2x)
    4   64x(1 − x)(1 − 2x)²(1 − 8x + 8x²)²     128x(1 − x)(−8 + 16x)(1 − 2x)²(1 − 8x + 8x²) + 64(1 − x)(1 − 2x)²(1 − 8x + 8x²)²
                                               − 64x(1 − 2x)²(1 − 8x + 8x²)² − 256x(1 − x)(1 − 2x)(1 − 8x + 8x²)²

    [Plot: number of terms in l_n and in d/dx l_n versus n, growing rapidly with n]

  • Symbolic differentiation
    We are limited to closed-form formulae

    You can find the derivative of math expressions:
    f(x) = 64x(1 − x)(1 − 2x)²(1 − 8x + 8x²)²

    But not of algorithms, branching, control flow:

    let f x n =
        if n = 1 then
            x
        else
            let mutable v = x
            for i = 1 to n - 1 do
                v <- 4. * v * (1. - v)   // logistic map update
            v


  • Numerical differentiation

    A very common hack: use the limit definition of the derivative

    df/dx = lim_{h→0} (f(x + h) − f(x)) / h

    to approximate the numerical value of the derivative

    let diff f x =
        let h = 0.00001
        (f (x + h) - f x) / h

    Again, some serious drawbacks


  • Numerical differentiation

    We must select a proper value of h, and we face approximation errors

    [Plot: error versus h, for h from 10⁻¹⁷ to 10⁻¹;
     round-off error dominates for small h, truncation error for large h]

    Computed using
    E(h, x*) = | (f(x* + h) − f(x*)) / h − d/dx f(x)|_{x*} |
    f(x) = 64x(1 − x)(1 − 2x)²(1 − 8x + 8x²)²
    x* = 0.2

  • Numerical differentiation
    Better approximations exist

    Higher-order finite differences, e.g.,
    ∂f(x)/∂x_i = (f(x + h e_i) − f(x − h e_i)) / (2h) + O(h²)

    Richardson extrapolation
    Differential quadrature

    but they increase rapidly in complexity and never completely
    eliminate the error
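    The central-difference formula above translates directly into F#; a minimal sketch (the function name diffCentral and the step size h are arbitrary choices, not from the slides):

    // Central-difference approximation of df/dx at a point.
    // Second-order accurate: the truncation error is O(h^2) rather than O(h).
    let diffCentral f x =
        let h = 1e-5
        (f (x + h) - f (x - h)) / (2. * h)

    // Example: derivative of sin (exp x) at x = 1.0
    let approx = diffCentral (fun x -> sin (exp x)) 1.0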

  • Numerical differentiation
    Poor performance

    For f : Rⁿ → R, approximate the gradient ∇f = (∂f/∂x_1, . . . , ∂f/∂x_n) using

    ∂f(x)/∂x_i ≈ (f(x + h e_i) − f(x)) / h ,   0 < h ≪ 1

    We must repeat the function evaluation n times to get ∇f
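    A direct transcription makes the cost visible; in this sketch (gradNumerical is an illustrative name, not a library function) each call evaluates f once per input dimension plus once at the base point:

    // Forward-difference gradient of f : float[] -> float.
    // Costs n + 1 evaluations of f for an n-dimensional input.
    let gradNumerical (f: float[] -> float) (x: float[]) =
        let h = 1e-5
        let fx = f x
        Array.init x.Length (fun i ->
            let xh = Array.copy x
            xh.[i] <- xh.[i] + h
            (f xh - fx) / h)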

  • Algorithmic differentiation (AD)

  • Algorithmic differentiation
    Also known as automatic differentiation (Griewank & Walther, 2008)

    Gives numeric code that computes
    the function AND its derivatives at a given point

    f(a, b):
        c = a * b
        d = sin c
        return d

    f'(a, a', b, b'):
        (c, c') = (a*b, a'*b + a*b')
        (d, d') = (sin c, c' * cos c)
        return (d, d')

    Derivatives propagated at the elementary operation level,
    as a side effect, at the same time as the function itself is computed
    → Prevents the “expression swell” of symbolic derivatives
    Full expressive capability of the host language
    → Including conditionals, looping, branching
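    The f' transformation above can be mechanized with dual numbers: every value carries its tangent, and each elementary operation updates both. A minimal F# sketch, illustrative only and not DiffSharp's implementation:

    // A value paired with its derivative (tangent).
    type Dual = { Primal: float; Tangent: float }

    let dual p t = { Primal = p; Tangent = t }

    // Each elementary operation propagates the tangent by its local derivative.
    let mul (a: Dual) (b: Dual) =
        dual (a.Primal * b.Primal) (a.Tangent * b.Primal + a.Primal * b.Tangent)
    let sin' (a: Dual) =
        dual (sin a.Primal) (a.Tangent * cos a.Primal)

    // f(a, b) = sin (a * b), as in the slide
    let f a b = sin' (mul a b)

    // Seed a with tangent 1 and b with tangent 0 to get the partial derivative
    // with respect to a at (2, 3): r.Tangent = cos 6.0 * 3.0
    let r = f (dual 2.0 1.0) (dual 3.0 0.0)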

  • Function evaluation traces

    All numeric evaluations are sequences of elementary operations:
    a “trace,” also called a “Wengert list” (Wengert, 1964)

    f(a, b):
        c = a * b
        if c > 0
            d = log c
        else
            d = sin c
        return d

    f(2, 3)

    (primal)
    a = 2
    b = 3
    c = a * b = 6
    d = log c = 1.791
    return d

    (tangent)
    a = 2,  a' = 1
    b = 3,  b' = 0
    c = a * b = 6
    c' = a' * b + a * b' = 3
    d = log c = 1.791
    d' = c' * (1 / c) = 0.5
    return d, d'

    i.e., a Jacobian-vector product Jf (1, 0)|_(2,3) = ∂/∂a f(a, b)|_(2,3) = 0.5
    This is called the forward (tangent) mode of AD


  • Function evaluation traces

    f(a, b):
        c = a * b
        if c > 0
            d = log c
        else
            d = sin c
        return d

    f(2, 3)

    (primal)
    a = 2
    b = 3
    c = a * b = 6
    d = log c = 1.791
    return d

    (adjoint)
    a = 2
    b = 3
    c = a * b = 6
    d = log c = 1.791
    d' = 1
    c' = d' * (1 / c) = 0.166
    b' = c' * a = 0.333
    a' = c' * b = 0.5
    return d, a', b'

    i.e., a transposed Jacobian-vector product
    Jfᵀ (1)|_(2,3) = ∇f|_(2,3) = (0.5, 0.333)
    This is called the reverse (adjoint) mode of AD

    Backpropagation is just a special case of the reverse mode:
    code your neural network objective computation, apply reverse AD
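    The adjoint trace can likewise be mechanized with a tape: each operation records how to push its adjoint back to its inputs, and the tape is replayed in reverse. A minimal F# sketch, again illustrative rather than DiffSharp's implementation:

    // A tape node: a value, its adjoint, and a closure that propagates the
    // adjoint back to the node's inputs.
    type Node =
        { Value: float
          mutable Adjoint: float
          mutable PushToInputs: unit -> unit }

    let tape = ResizeArray<Node>()

    let node v =
        let n = { Value = v; Adjoint = 0.0; PushToInputs = (fun () -> ()) }
        tape.Add n
        n

    let mul (a: Node) (b: Node) =
        let n = node (a.Value * b.Value)
        n.PushToInputs <- fun () ->
            a.Adjoint <- a.Adjoint + n.Adjoint * b.Value
            b.Adjoint <- b.Adjoint + n.Adjoint * a.Value
        n

    let log' (a: Node) =
        let n = node (log a.Value)
        n.PushToInputs <- fun () -> a.Adjoint <- a.Adjoint + n.Adjoint / a.Value
        n

    let sin' (a: Node) =
        let n = node (sin a.Value)
        n.PushToInputs <- fun () -> a.Adjoint <- a.Adjoint + n.Adjoint * cos a.Value
        n

    // f(a, b) from the slide; control flow follows the primal values
    let f a b =
        let c = mul a b
        if c.Value > 0.0 then log' c else sin' c

    let a, b = node 2.0, node 3.0
    let d = f a b                 // primal: log 6 = 1.791...
    d.Adjoint <- 1.0              // seed the output adjoint
    for i = tape.Count - 1 downto 0 do
        tape.[i].PushToInputs ()
    // a.Adjoint = 0.5, b.Adjoint = 0.333..., matching the slide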


  • How is this useful?

  • Forward vs reverse

    In the extreme cases:
    for F : R → Rᵐ, forward AD can compute all (∂F_1/∂x, . . . , ∂F_m/∂x)
    for f : Rⁿ → R, reverse AD can compute ∇f = (∂f/∂x_1, . . . , ∂f/∂x_n)
    in just one evaluation

    In general, for f : Rⁿ → Rᵐ, the Jacobian J ∈ Rᵐˣⁿ takes
    O(n × time(f)) with forward AD
    O(m × time(f)) with reverse AD

    Reverse mode performs better when n ≫ m


  • How is this useful?

    Traditional application domains of AD in industry and academia
    (Corliss et al., 2002)

    Computational fluid dynamics
    Atmospheric chemistry
    Engineering design optimization
    Computational finance

  • Functional AD, or “differentiable functional programming”

  • AD and functional programming

    AD has been around since the 1960s
    (Wengert, 1964; Speelpenning, 1980; Griewank, 1989)

    The foundations for AD in a functional framework
    (Siskind and Pearlmutter, 2008; Pearlmutter and Siskind, 2008)

    With research implementations:

    R6RS-AD
    https://github.com/qobi/R6RS-AD

    Stalingrad
    http://www.bcl.hamilton.ie/~qobi/stalingrad/

    Alexey Radul’s DVL
    https://github.com/axch/dysvunctional-language

    Recently, my DiffSharp library
    http://diffsharp.github.io/DiffSharp/

  • Differentiable functional programming

    Deep learning: neural network models are assembled from
    building blocks and trained with backpropagation

    Traditional:
    Feedforward
    Convolutional
    Recurrent


  • Differentiable functional programming

    Newer additions: make algorithmic elements continuous and differentiable
    → enables use in deep learning

    [Figure: NTM on copy task (Graves et al., 2014)]

    Neural Turing Machine (Graves et al., 2014)
    → can infer algorithms: copy, sort, recall
    Stack-augmented RNN (Joulin & Mikolov, 2015)
    End-to-end memory network (Sukhbaatar et al., 2015)
    Stack, queue, deque (Grefenstette et al., 2015)
    Discrete interfaces (Zaremba & Sutskever, 2015)

  • Differentiable functional programming

    Stacking of many layers, trained through backpropagation

    [Figure: layer-by-layer architecture diagrams]
    AlexNet, 8 layers (ILSVRC 2012)
    VGG, 19 layers (ILSVRC 2014)
    ResNet, 152 layers, deep residual learning (ILSVRC 2015)

    (He, Zhang, Ren, Sun. “Deep Residual Learning for Image Recognition.” 2015. arXiv:1512.03385)

  • Differentiable functional programming

    One way of viewing deep learning systems is
    “differentiable functional programming”

    Two main characteristics:

    Differentiability
    → optimization

    Chained function composition
    → successive transformations
    → successive levels of distributed representations (Bengio, 2013)
    → the chain rule of calculus propagates derivatives

  • The bigger picture

    In a functional interpretation:

    Weight-tying or multiple applications of the same neuron
    (e.g., ConvNets and RNNs) resemble function abstraction

    Structural patterns of composition resemble
    higher-order functions (e.g., map, fold, unfold, zip), as sketched below
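    A small sketch of the analogy, assuming DiffSharp's 0.7-era DiffSharp.AD.Float64 API (D, diff): weight-tying is reusing the same parameter w at every step of a fold, and AD differentiates straight through the higher-order function:

    open DiffSharp.AD.Float64

    // A tiny "recurrent" computation: the same scalar weight w is applied
    // at every step of a fold over the input sequence.
    let tied (xs: float list) (w: D) =
        xs |> List.fold (fun acc x -> tanh (w * (acc + D x))) (D 0.)

    // Derivative of the final output with respect to the shared weight
    let dOutput_dw = diff (tied [0.1; 0.2; 0.3]) (D 0.5)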

  • The bigger picture

    Even when you have complex compositions,
    differentiability ensures that they can be trained end-to-end
    with backpropagation

    (Vinyals, Toshev, Bengio, Erhan. “Show and tell: a neural image caption generator.” 2014. arXiv:1411.4555)

  • The bigger picture

    Christopher Olah’s blog post (September 3, 2015)
    http://colah.github.io/posts/2015-09-NN-Types-FP/
    “The field does not (yet) have a unifying insight or narrative”

    David Dalrymple’s essay (January 2016)
    http://edge.org/response-detail/26794
    “The most natural playground ... would be a new language that can
    run back-propagation directly on functional programs.”

    AD in a functional framework is a manifestation of this vision.


  • DiffSharp

  • The ambition

    Deeply embedded AD (forward and/or reverse)
    as part of the language infrastructure

    Rich API of differentiation operations
    as higher-order functions

    High-performance matrix operations for deep learning
    (GPU support, model and data parallelism), gradients,
    Hessians, Jacobians, directional derivatives, matrix-free
    Hessian- and Jacobian-vector products

    I have been working on these issues with Barak Pearlmutter
    and created DiffSharp:
    http://diffsharp.github.io/DiffSharp/


  • DiffSharp

    “Generalized AD as a first-class function in an augmented
    λ-calculus” (Pearlmutter and Siskind, 2008)

    Forward, reverse, and any nested combination thereof,
    instantiated according to usage scenario

    Nested lambda expressions with free-variable references:

    min (λx . (f x) + min (λy . g x y))

    let m = min (fun x -> (f x) + min (fun y -> g x y))

    Must handle “perturbation confusion” (Manzyuk et al., 2012):

    d/dx ( x · (d/dy (x + y))|_{y=1} )|_{x=1}  =?  1

    let d = diff (fun x -> x * (diff (fun y -> x + y) 1.)) 1.
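    To see why the answer is 1: the inner derivative d/dy (x + y) equals 1 for every x, so the outer expression reduces to d/dx (x · 1)|_{x=1} = 1. The classic failure mode is an implementation that reuses one perturbation tag for both nested derivatives: the outer perturbation on x then leaks into the inner derivative, which reports 2, and the whole expression evaluates to 2 instead of 1. Distinct tags on nested perturbations, as in generalized AD, give the correct result.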


  • DiffSharp
    Higher-order differentiation API

    Op.            Value              Type signature                                  AD      Num.  Sym.

    f : R → R
    diff           f′                 (R → R) → R → R                                 X, F    A     X
    diff'          (f, f′)            (R → R) → R → (R × R)                           X, F    A     X
    diff2          f′′                (R → R) → R → R                                 X, F    A     X
    diff2'         (f, f′′)           (R → R) → R → (R × R)                           X, F    A     X
    diff2''        (f, f′, f′′)       (R → R) → R → (R × R × R)                       X, F    A     X
    diffn          f⁽ⁿ⁾               N → (R → R) → R → R                             X, F          X
    diffn'         (f, f⁽ⁿ⁾)          N → (R → R) → R → (R × R)                       X, F          X

    f : Rⁿ → R
    grad           ∇f                 (Rⁿ → R) → Rⁿ → Rⁿ                              X, R    A     X
    grad'          (f, ∇f)            (Rⁿ → R) → Rⁿ → (R × Rⁿ)                        X, R    A     X
    gradv          ∇f · v             (Rⁿ → R) → Rⁿ → Rⁿ → R                          X, F    A
    gradv'         (f, ∇f · v)        (Rⁿ → R) → Rⁿ → Rⁿ → (R × R)                    X, F    A
    hessian        Hf                 (Rⁿ → R) → Rⁿ → Rⁿˣⁿ                            X, R-F  A     X
    hessian'       (f, Hf)            (Rⁿ → R) → Rⁿ → (R × Rⁿˣⁿ)                      X, R-F  A     X
    hessianv       Hf v               (Rⁿ → R) → Rⁿ → Rⁿ → Rⁿ                         X, F-R  A
    hessianv'      (f, Hf v)          (Rⁿ → R) → Rⁿ → Rⁿ → (R × Rⁿ)                   X, F-R  A
    gradhessian    (∇f, Hf)           (Rⁿ → R) → Rⁿ → (Rⁿ × Rⁿˣⁿ)                     X, R-F  A     X
    gradhessian'   (f, ∇f, Hf)        (Rⁿ → R) → Rⁿ → (R × Rⁿ × Rⁿˣⁿ)                 X, R-F  A     X
    gradhessianv   (∇f · v, Hf v)     (Rⁿ → R) → Rⁿ → Rⁿ → (R × Rⁿ)                   X, F-R  A
    gradhessianv'  (f, ∇f · v, Hf v)  (Rⁿ → R) → Rⁿ → Rⁿ → (R × R × Rⁿ)               X, F-R  A
    laplacian      tr(Hf)             (Rⁿ → R) → Rⁿ → R                               X, R-F  A     X
    laplacian'     (f, tr(Hf))        (Rⁿ → R) → Rⁿ → (R × R)                         X, R-F  A     X

    f : Rⁿ → Rᵐ
    jacobian       Jf                 (Rⁿ → Rᵐ) → Rⁿ → Rᵐˣⁿ                           X, F/R  A     X
    jacobian'      (f, Jf)            (Rⁿ → Rᵐ) → Rⁿ → (Rᵐ × Rᵐˣⁿ)                    X, F/R  A     X
    jacobianv      Jf v               (Rⁿ → Rᵐ) → Rⁿ → Rⁿ → Rᵐ                        X, F    A
    jacobianv'     (f, Jf v)          (Rⁿ → Rᵐ) → Rⁿ → Rⁿ → (Rᵐ × Rᵐ)                 X, F    A
    jacobianT      Jfᵀ                (Rⁿ → Rᵐ) → Rⁿ → Rⁿˣᵐ                           X, F/R  A     X
    jacobianT'     (f, Jfᵀ)           (Rⁿ → Rᵐ) → Rⁿ → (Rᵐ × Rⁿˣᵐ)                    X, F/R  A     X
    jacobianTv     Jfᵀ v              (Rⁿ → Rᵐ) → Rⁿ → Rᵐ → Rⁿ                        X, R
    jacobianTv'    (f, Jfᵀ v)         (Rⁿ → Rᵐ) → Rⁿ → Rᵐ → (Rᵐ × Rⁿ)                 X, R
    jacobianTv''   (f, Jfᵀ(·))        (Rⁿ → Rᵐ) → Rⁿ → (Rᵐ × (Rᵐ → Rⁿ))               X, R
    curl           ∇ × f              (R³ → R³) → R³ → R³                             X, F    A     X
    curl'          (f, ∇ × f)         (R³ → R³) → R³ → (R³ × R³)                      X, F    A     X
    div            ∇ · f              (Rⁿ → Rⁿ) → Rⁿ → R                              X, F    A     X
    div'           (f, ∇ · f)         (Rⁿ → Rⁿ) → Rⁿ → (Rⁿ × R)                       X, F    A     X
    curldiv        (∇ × f, ∇ · f)     (R³ → R³) → R³ → (R³ × R)                       X, F    A     X
    curldiv'       (f, ∇ × f, ∇ · f)  (R³ → R³) → R³ → (R³ × R³ × R)                  X, F    A     X
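    A short usage sketch of two operators from the table, assuming the 0.7-era DiffSharp.AD.Float64 names (D, DV, toDV, grad, hessianv):

    open DiffSharp.AD.Float64

    // f : R^2 -> R
    let f (x: DV) = sin (x.[0] * x.[1]) + x.[0] * x.[0]

    let x0 = toDV [2.0; 3.0]
    let g  = grad f x0                         // reverse mode: the full gradient
    let hv = hessianv f x0 (toDV [1.0; 0.0])   // forward-on-reverse: H·v without forming H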

  • DiffSharp

    Matrix operations
    http://diffsharp.github.io/DiffSharp/api-overview.html

    High-performance OpenBLAS backend by default, work on a
    CUDA-based GPU backend underway

    Support for 64- and 32-bit floats (faster on many systems)

    Benchmarking tool
    http://diffsharp.github.io/DiffSharp/benchmarks.html

    A growing collection of tutorials: gradient-based optimization
    algorithms, clustering, Hamiltonian Monte Carlo, neural networks,
    inverse kinematics

  • Hype

  • Hype
    http://hypelib.github.io/Hype/

    An experimental library for “compositional machine learning
    and hyperparameter optimization”, built on DiffSharp

    A robust optimization core:
    highly configurable functional modules
    SGD, conjugate gradient, Nesterov, AdaGrad, RMSProp,
    Newton’s method
    Use nested AD for gradient-based hyperparameter
    optimization (Maclaurin et al., 2015), as sketched below

    Researching the differentiable functional programming paradigm
    for machine learning
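    A toy version of that nested use, illustrative rather than Hype's API (it assumes DiffSharp's D type and diff operator): the outer derivative is taken with respect to the learning rate used inside an inner training loop that itself calls diff:

    open DiffSharp.AD.Float64

    let innerLoss (w: D) = (w - D 3.) * (w - D 3.)

    // Run k steps of gradient descent on innerLoss from w0 with rate eta,
    // and return the final loss as a function of eta.
    let trainedLoss k (w0: D) (eta: D) =
        let mutable w = w0
        for _ in 1 .. k do
            w <- w - eta * diff innerLoss w
        innerLoss w

    // Sensitivity of the trained loss to the learning rate: nested AD,
    // since diff appears both inside trainedLoss and around it.
    let dLoss_dEta = diff (trainedLoss 10 (D 0.)) (D 0.05)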

  • Hype

    Extracts from Hype neural network code:
    use higher-order functions, don’t think about gradients or backpropagation
    https://github.com/hypelib/Hype/blob/master/src/Hype/Neural.fs

  • Hype

    Extracts from Hype optimization code:
    https://github.com/hypelib/Hype/blob/master/src/Hype/Optimize.fs

    Optimization and training as higher-order functions
    → can be composed, nested (see the sketch below)
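    A minimal sketch of that idea, not Hype's actual Optimize module (it assumes DiffSharp's D, DV, toDV, grad and their arithmetic overloads): the objective comes in as a function, the gradient is instantiated inside with reverse-mode AD, and no backpropagation code is written by hand:

    open DiffSharp.AD.Float64

    // Training as a higher-order function over the objective f.
    let gradientDescent (eta: D) (steps: int) (f: DV -> D) (x0: DV) =
        let mutable x = x0
        for _ in 1 .. steps do
            x <- x - eta * grad f x
        x

    // Example: minimize a small quadratic
    let xmin =
        gradientDescent (D 0.1) 100 (fun x -> x.[0] * x.[0] + x.[1] * x.[1]) (toDV [1.0; -2.0])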

  • Hype

    The user doesn’t need to think about derivatives;
    they are instantiated within the optimization code

  • Hype

    But they can use derivatives within their models, if needed
    → input sensitivities
    → complex objective functions
    → adaptive PID controllers
    → integrating differential equations

    Thanks to nested generalized AD:
    you can optimize components that are internally using differentiation;
    resulting higher-order derivatives propagate via
    forward/reverse AD as needed


  • Hype

    We also provide a Torch-like API for neural networks

    A cool thing: thanks to AD, we can freely code
    any F# function as a layer; it just works


  • Hype
    http://hypelib.github.io/Hype/feedforwardnets.html

    We also have some nice additions for F# Interactive

  • Roadmap

    Transformation-based, context-aware AD
    F# quotations (Syme, 2006) give us a direct path for deeply
    embedding AD

    Currently experimenting with GPU backends
    (CUDA, ArrayFire, Magma)

    Generalizing to tensors
    (for elegant implementations of, e.g., ConvNets)

  • Demos

  • Thank You!

    References

    • Baydin AG, Pearlmutter BA, Radul AA, Siskind JM (Submitted) Automatic differentiation in machine learning: a survey [arXiv:1502.05767]
    • Baydin AG, Pearlmutter BA, Siskind JM (Submitted) DiffSharp: automatic differentiation library [arXiv:1511.07727]
    • Bengio Y (2013) Deep learning of representations: looking forward. Statistical Language and Speech Processing. LNCS 7978:1–37 [arXiv:1404.7456]
    • Graves A, Wayne G, Danihelka I (2014) Neural Turing machines [arXiv:1410.5401]
    • Grefenstette E, Hermann KM, Suleyman M, Blunsom P (2015) Learning to transduce with unbounded memory [arXiv:1506.02516]
    • Griewank A, Walther A (2008) Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Society for Industrial and Applied Mathematics, Philadelphia [DOI 10.1137/1.9780898717761]
    • He K, Zhang X, Ren S, Sun J (2015) Deep residual learning for image recognition [arXiv:1512.03385]
    • Joulin A, Mikolov T (2015) Inferring algorithmic patterns with stack-augmented recurrent nets [arXiv:1503.01007]
    • Maclaurin D, Duvenaud D, Adams RP (2015) Gradient-based hyperparameter optimization through reversible learning [arXiv:1502.03492]
    • Manzyuk O, Pearlmutter BA, Radul AA, Rush DR, Siskind JM (2012) Confusion of tagged perturbations in forward automatic differentiation of higher-order functions [arXiv:1211.4892]
    • Pearlmutter BA, Siskind JM (2008) Reverse-mode AD in a functional framework: lambda the ultimate backpropagator. ACM TOPLAS 30(2):7 [DOI 10.1145/1330017.1330018]
    • Siskind JM, Pearlmutter BA (2008) Nesting forward-mode AD in a functional framework. Higher-Order and Symbolic Computation 21(4):361–76 [DOI 10.1007/s10990-008-9037-1]
    • Sukhbaatar S, Szlam A, Weston J, Fergus R (2015) Weakly supervised memory networks [arXiv:1503.08895]
    • Syme D (2006) Leveraging .NET meta-programming components from F#: integrated queries and interoperable heterogeneous execution. 2006 Workshop on ML. ACM.
    • Vinyals O, Toshev A, Bengio S, Erhan D (2014) Show and tell: a neural image caption generator [arXiv:1411.4555]
    • Wengert R (1964) A simple automatic derivative evaluation program. Communications of the ACM 7:463–4
    • Zaremba W, Sutskever I (2015) Reinforcement learning neural Turing machines [arXiv:1505.00521]