
Automatic Differentiation (1)

Slides Prepared By:

Atılım Güneş Baydin (gunes@robots.ox.ac.uk)

2

Outline

This lecture:
- Derivatives in machine learning
- Review of essential concepts (what is a derivative, Jacobian, etc.)
- How do we compute derivatives
- Automatic differentiation

Next lecture:
- Current landscape of tools
- Implementation techniques
- Advanced concepts (higher-order API, checkpointing, etc.)

Derivatives and machine learning

3

4

Derivatives in machine learning

"Backprop" and gradient descent are at the core of all recent advances

Computer vision
(Figures: NVIDIA DRIVE PX 2 segmentation; top-5 error rate on ImageNet, NVIDIA devblog; Faster R-CNN, Ren et al. 2015)

Speech recognition/synthesis
(Figure: word error rates, Huang et al. 2014)

Machine translation
(Figure: Google Neural Machine Translation System, GNMT)

5

Derivatives in machine learning

"Backprop" and gradient descent are at the core of all recent advances

Probabilistic programming (and modeling)

Tools: Edward (2016), Pyro (2017), ProbTorch (2017), TensorFlow Probability (2018)

- Variational inference
- "Neural" density estimation
- Transformed distributions via bijectors
- Normalizing flows (Rezende & Mohamed, 2015)
- Masked autoregressive flows (Papamakarios et al., 2017)

6

Derivatives in machine learning

At the core of all: differentiable functions (programs) whose parameters are tuned by gradient-based optimization

(Ruder, 2017) http://ruder.io/optimizing-gradient-descent/

7

Automatic differentiation

Execute differentiable functions (programs) via automatic differentiation

A word on naming:
- Differentiable programming, a generalization of deep learning (Olah, LeCun)
  "Neural networks are just a class of differentiable functions"
- Automatic differentiation
- Algorithmic differentiation
- AD
- Autodiff
- Algodiff
- Autograd

Also remember:
- Backprop
- Backpropagation (backward propagation of errors)

Essential concepts (refresher)

8

9

Derivative

Function of a real variable

Sensitivity of function value w.r.t. a change in its argument (the instantaneous rate of change)
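For reference, the standard limit definition this refers to (in LaTeX):

f'(x) = \frac{dy}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}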

Newton, c. 1665

Leibniz, c. 1675

Notation: Leibniz (dy/dx), Lagrange (f'(x)), Newton (ẏ); y is the dependent variable, x the independent variable

10

Derivative

Function of a real variable

Newton, c. 1665; Leibniz, c. 1675

(Table of basic differentiation rules; around 15 such rules)

Note: the derivative is a linear operator, a.k.a. a higher-order function in programming languages
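To illustrate the "higher-order function" remark with runnable code, here is a minimal numerical stand-in for the derivative operator (a sketch; the names are illustrative, and the real point is about symbolic/automatic differentiation):

import math

def derivative(f, h=1e-6):
    # A derivative operator: takes a function, returns a function
    return lambda x: (f(x + h) - f(x - h)) / (2 * h)

square = lambda x: x * x
dsquare = derivative(square)   # dsquare is itself a function
print(dsquare(3.0))            # approximately 6.0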

11

Partial derivative

Function of several real variables

A derivative w.r.t. one independent variable, with the others held constant

The symbol ∂ is read "del" (or "partial")

12

Partial derivative

Function of several real variables

The gradient of f, for f: R^n → R, is the vector of all partial derivatives

∇ is read "nabla" or "del"

Nabla is the higher-order function mapping f to its gradient ∇f

The gradient points in the direction with the largest rate of change
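Written out (the standard definition, in LaTeX):

\nabla f(\mathbf{x}) = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right), \qquad \nabla : f \mapsto \nabla f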

13

Total derivative

Function of several real variables

The derivative w.r.t. all variables (independent & dependent)

Consider all partial derivatives simultaneously and accumulate all direct and indirect contributions (Important: will be useful later)

14

Matrix calculus and machine learning

Extension to multivariable functions

(Table: derivative objects by input/output type: scalar input, scalar output; scalar input, vector output; vector input, scalar output (a scalar field); vector input, vector output (a vector field))

In machine learning, we construct (deep) compositions of
- vector-to-vector functions, e.g., a neural network
- vector-to-scalar functions, e.g., a loss function, KL divergence, or log joint probability

15

Matrix calculus and machine learning

Generalization to tensors (multi-dimensional arrays) for efficient batching, handling of sequences, channels in convolutions, etc.

And many, many more rules

16

Finally, two constructs relevant to machine learning: Jacobian and Hessian

Matrix calculus and machine learning
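For reference, the standard definitions (in LaTeX), with the Jacobian for f: R^n → R^m and the Hessian for f: R^n → R:

\mathbf{J}_f = \left[ \frac{\partial f_i}{\partial x_j} \right]_{m \times n}, \qquad \mathbf{H}_f = \left[ \frac{\partial^2 f}{\partial x_i \, \partial x_j} \right]_{n \times n}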

How to compute derivatives

17

18

We can compute the derivatives not just of mathematical functions, but of general programs (with control flow)

Derivatives as code

19

Derivatives as code

20

Manual

Analytic derivatives are needed for theoretical insight:
- analytic solutions, proofs
- mathematical analysis, e.g., stability of fixed points

Unnecessary when we just need derivative evaluations for optimization

You can see papers like this:

21

Symbolic differentiation

Symbolic computation with Mathematica, Maple, Maxima, and deep learning frameworks such as Theano

Problem: expression swell
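A small illustration of expression swell, using an iterated logistic map (a sketch assuming SymPy is available; the iteration count is arbitrary):

import sympy as sp

x = sp.symbols('x')
expr = x
for n in range(1, 5):
    expr = 4 * expr * (1 - expr)          # iterate the logistic map
    d = sp.diff(expr, x)                  # symbolic derivative
    print(n, sp.count_ops(expr), sp.count_ops(d))
# the derivative's operation count grows much faster than the expression's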

22

Graph optimization (e.g., in Theano) mitigates expression swell

Symbolic differentiation

Symbolic graph builders such as Theano and TensorFlow have limited, unintuitive support for control flow, loops, and recursion

Problem: only applicable to closed-form mathematical functions. You can find the derivative of a closed-form expression, but not of an arbitrary program with control flow

24

Numerical differentiation

Finite difference approximation of the gradient: for f: R^n → R,

∂f(x)/∂x_i ≈ (f(x + h·e_i) − f(x)) / h,   with a small step size h > 0

Problem: we must select the step size h, and we face approximation errors (truncation and round-off)

Problem: f needs to be evaluated n times, once with each standard basis vector e_i

25

Numerical differentiation

Better approximations exist:
- Higher-order finite differences, e.g., the center difference:
  ∂f(x)/∂x_i ≈ (f(x + h·e_i) − f(x − h·e_i)) / (2h)
- Richardson extrapolation
- Differential quadrature

These increase rapidly in complexity and never completely eliminate the error

Still extremely useful as a quick check of our gradient implementations. Good to learn:
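A minimal center-difference gradient check along these lines (a sketch assuming NumPy; the test function and step size are illustrative):

import numpy as np

def grad_check(f, grad_f, x, h=1e-5):
    # Compare a supplied gradient against center differences, one basis vector at a time
    g_num = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g_num[i] = (f(x + e) - f(x - e)) / (2 * h)
    return np.max(np.abs(g_num - grad_f(x)))

x = np.random.randn(5)
print(grad_check(lambda x: np.sum(x ** 2), lambda x: 2 * x, x))  # close to 0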

27

If we don’t need analytic derivative expressions, we can evaluate a gradient exactly with only one forward and one reverse execution

In machine learning, this is known as backpropagation or “backprop”

- Automatic differentiation is more than backprop

- Or, backprop is a specialized reverse mode automatic differentiation

- We will come back to this shortly

Nature 323, 533–536 (9 October 1986)

Automatic differentiation

Backprop or automatic differentiation?

28

29

Timeline: 1960s to 1980s

Precursors: Kelley, 1960; Bryson, 1961; Pontryagin et al., 1961; Dreyfus, 1962

Wengert, 1964: forward mode
Linnainmaa, 1970, 1976: backpropagation
Dreyfus, 1973: control parameters
Werbos, 1974: reverse mode
Speelpenning, 1980: automatic reverse mode
Werbos, 1982: first NN-specific backprop
Parker, 1985
LeCun, 1985
Rumelhart, Hinton, Williams, 1986: revived backprop
Griewank, 1989: revived reverse mode


Recommended reading:

Griewank, A., 2012. Who Invented the Reverse Mode of Differentiation? Documenta Mathematica, Extra Volume ISMP, pp.389-400.

Schmidhuber, J., 2015. Who Invented Backpropagation? http://people.idsia.ch/~juergen/who-invented-backpropagation.html

Automatic differentiation

31

32

Automatic differentiation

All numerical algorithms, when executed, evaluate to compositions of a finite set of elementary operations with known derivatives
- Called a trace or a Wengert list (Wengert, 1964); a minimal sketch of such a list follows below
- Alternatively represented as a computational graph showing dependencies
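A minimal sketch of such a trace in Python for the f(a, b) example used on the following slides (the representation is illustrative, not a standard API):

import math

# Wengert list (evaluation trace) for f(a, b) = log(a * b):
# each entry records (output variable, operation, input variables)
trace = [
    ("c", "mul", ("a", "b")),
    ("d", "log", ("c",)),
]

def evaluate(trace, inputs):
    ops = {"mul": lambda x, y: x * y, "log": math.log}
    env = dict(inputs)
    for out, op, args in trace:
        env[out] = ops[op](*(env[a] for a in args))
    return env

print(evaluate(trace, {"a": 2.0, "b": 3.0})["d"])   # 1.791...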

Automatic differentiation

f(a, b):
    c = a * b
    d = log(c)
    return d

Computational graph: a, b → (*) → c → (log) → d

Primal evaluation: 1.791 = f(2, 3)
(primal values at the nodes: a = 2, b = 3, c = 6, d = 1.791)

Derivative evaluation: [0.5, 0.333] = f'(2, 3)
(derivative values, i.e., tangents/adjoints, here the "gradient", at the nodes: 0.5 at a, 0.333 at b, 0.166 at c, 1 at d)

38

Automatic differentiation

Two main flavors:
- Forward mode: primals and derivatives (tangents) are propagated forward, from inputs to outputs
- Reverse mode (a.k.a. backprop): primals are propagated forward, derivatives (adjoints) are propagated backward, from outputs to inputs

Nested combinations (higher-order derivatives, Hessian–vector products, etc.):
- Forward-on-reverse
- Reverse-on-forward
- ...

39

What happens to control flow?

f(a, b):
    c = a * b
    if c > 0:
        d = log(c)
    else:
        d = sin(c)
    return d

It disappears: branches are taken, loops are unrolled, functions are inlined, etc.until we are left with the linear trace of execution

40

What happens to control flow?

f(a = 2, b = 3):
    c = a * b = 6
    if c > 0:
        d = log(c) = 1.791
    else:
        d = sin(c)
    return d

Recorded trace/graph: a = 2, b = 3 → c = 6 → d = log(c) = 1.791

What happens to control flow?

f(a = 2, b = -1):
    c = a * b = -2
    if c > 0:
        d = log(c)
    else:
        d = sin(c) = -0.909
    return d

Recorded trace/graph: a = 2, b = -1 → c = -2 → d = sin(c) = -0.909

It disappears: branches are taken, loops are unrolled, functions are inlined, etc., until we are left with the linear trace of execution

The recorded trace is a directed acyclic graph (DAG) with a topological ordering
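A small sketch of this in Python: the same function records a different (linear) trace depending on which branch its inputs select (the recording mechanism is illustrative):

import math

def f(a, b, trace):
    c = a * b
    trace.append(("c", "mul", "a", "b"))
    if c > 0:
        d = math.log(c)
        trace.append(("d", "log", "c"))
    else:
        d = math.sin(c)
        trace.append(("d", "sin", "c"))
    return d

t1, t2 = [], []
f(2, 3, t1)    # takes the log branch
f(2, -1, t2)   # takes the sin branch
print(t1)      # [('c', 'mul', 'a', 'b'), ('d', 'log', 'c')]
print(t2)      # [('c', 'mul', 'a', 'b'), ('d', 'sin', 'c')]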

43

Forward mode

Primals: propagated from independent to dependent variables
Derivatives (tangents): propagated from independent to dependent variables

f(x1, x2):
    v1 = x1 * x2
    v2 = log(x2)
    y1 = sin(v1)
    y2 = v1 + v2
    return (y1, y2)

Evaluating f(2, 3) with the tangent seed (dx1, dx2) = (1, 0), primals and tangents are propagated together, one elementary operation at a time:

x1 = 2        dx1 = 1
x2 = 3        dx2 = 0
v1 = 6        dv1 = dx1·x2 + x1·dx2 = 3
v2 = 1.098    dv2 = dx2 / x2 = 0
y1 = -0.279   dy1 = cos(v1)·dv1 = 2.880
y2 = 7.098    dy2 = dv1 + dv2 = 3

In general, forward mode evaluates a Jacobian–vector product

So we evaluated the first column of the Jacobian: (∂y1/∂x1, ∂y2/∂x1) = (2.880, 3)

The seed vector can be any vector, not only a unit vector; for f: R^n → R the result is a directional derivative

60

Reverse mode

Primals: propagated from independent to dependent variables
Derivatives (adjoints): propagated from dependent back to independent variables

f(x1, x2):
    v1 = x1 * x2
    v2 = log(x2)
    y1 = sin(v1)
    y2 = v1 + v2
    return (y1, y2)

Forward pass, evaluating f(2, 3):
x1 = 2, x2 = 3, v1 = 6, v2 = 1.098, y1 = -0.279, y2 = 7.098

Reverse pass, with the adjoint seed (adjoint(y1), adjoint(y2)) = (1, 0):
adjoint(y1) = 1
adjoint(y2) = 0
adjoint(v2) = adjoint(y2) = 0
adjoint(v1) = adjoint(y1)·cos(v1) + adjoint(y2) = 0.960
adjoint(x1) = adjoint(v1)·x2 = 2.880
adjoint(x2) = adjoint(v1)·x1 + adjoint(v2)/x2 = 1.920

In general, reverse mode evaluates a transposed Jacobian–vector product

So we evaluated the first row of the Jacobian: (∂y1/∂x1, ∂y1/∂x2) = (2.880, 1.920)

For f: R^n → R this is the gradient
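A minimal reverse-mode implementation of the same example (a sketch; names are illustrative, and only the operations used by f are defined):

import math

class Var:
    # records the value and, for each parent, the local partial derivative
    def __init__(self, value, parents=()):
        self.value, self.parents, self.adjoint = value, parents, 0.0

def mul(a, b): return Var(a.value * b.value, [(a, b.value), (b, a.value)])
def add(a, b): return Var(a.value + b.value, [(a, 1.0), (b, 1.0)])
def log(a):    return Var(math.log(a.value), [(a, 1.0 / a.value)])
def sin(a):    return Var(math.sin(a.value), [(a, math.cos(a.value))])

def backward(y):
    # accumulate adjoints in reverse topological order
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(y)
    y.adjoint = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.adjoint += node.adjoint * local

x1, x2 = Var(2.0), Var(3.0)
v1, v2 = mul(x1, x2), log(x2)
y1, y2 = sin(v1), add(v1, v2)
backward(y1)                      # seed (1, 0): one pass gives the row for y1
print(x1.adjoint, x2.adjoint)     # 2.880..., 1.920...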

80

Forward vs reverse summary

In the extreme case of f: R → R^m, use forward mode: one forward pass evaluates the derivatives of all m outputs w.r.t. the single input

In the extreme case of f: R^n → R, use reverse mode: one reverse pass evaluates the gradient w.r.t. all n inputs

In general, the full Jacobian of f: R^n → R^m can be evaluated
- in n passes with forward mode
- in m passes with reverse mode

Reverse mode performs better when m ≪ n, e.g., a scalar loss over many parameters

Backprop through normal PDF

82

83

Backprop through normal PDF

(Computational graph of the normal PDF f(x; µ, σ), built from elementary operations: subtraction, squaring, multiplication, division, the constants 2, π, and 0.5, exp, sqrt, and reciprocal)
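As a worked example of what such a backward pass computes, here is the normal PDF together with its derivatives w.r.t. x, µ, and σ written out by hand (a sketch; in practice these come out of the AD system automatically):

import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def normal_pdf_grads(x, mu, sigma):
    # chain rule applied by hand through the elementary operations above
    z = (x - mu) / sigma
    f = normal_pdf(x, mu, sigma)
    return (f * (-z / sigma),            # df/dx
            f * (z / sigma),             # df/dmu
            f * (z * z - 1.0) / sigma)   # df/dsigma

print(normal_pdf_grads(0.5, 0.0, 1.0))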

Summary

84

85

Summary

This lecture:
- Derivatives in machine learning
- Review of essential concepts (what is a derivative, etc.)
- How do we compute derivatives
- Automatic differentiation

Next lecture:
- Current landscape of tools
- Implementation techniques
- Advanced concepts (higher-order API, checkpointing, etc.)

86

References

Baydin, A.G., Pearlmutter, B.A., Radul, A.A. and Siskind, J.M., 2017. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research (JMLR), 18(153), pp. 1–153.

Baydin, Atılım Güneş, Barak A. Pearlmutter, and Jeffrey Mark Siskind. 2016. “Tricks from Deep Learning.” In 7th International Conference on Algorithmic Differentiation, Christ Church Oxford, UK, September 12–15, 2016.

Baydin, Atılım Güneş, Barak A. Pearlmutter, and Jeffrey Mark Siskind. 2016. “DiffSharp: An AD Library for .NET Languages.” In 7th International Conference on Algorithmic Differentiation, Christ Church Oxford, UK, September 12–15, 2016.

Baydin, Atılım Güneş, Robert Cornish, David Martínez Rubio, Mark Schmidt, and Frank Wood. 2018. “Online Learning Rate Adaptation with Hypergradient Descent.” In Sixth International Conference on Learning Representations (ICLR), Vancouver, Canada, April 30 – May 3, 2018.

Griewank, A. and Walther, A., 2008. Evaluating derivatives: principles and techniques of algorithmic differentiation (Vol. 105). SIAM.

Nocedal, J. and Wright, S.J., 1999. Numerical Optimization. Springer.

Extra slides

87

88

Forward mode

Primals: propagated from independent to dependent variables
Derivatives (tangents): propagated from independent to dependent variables

f(a, b):
    c = a * b
    d = log(c)
    return d

Evaluating f(2, 3) with the tangent seed (da, db) = (1, 0):

a = 2        da = 1
b = 3        db = 0
c = 6        dc = da·b + a·db = 3
d = 1.791    dd = dc / c = 0.5

In general, forward mode evaluates a Jacobian–vector product

We evaluated the partial derivative ∂d/∂a = 0.5 with the seed (da, db) = (1, 0)

100

Reverse mode

Primals: propagated from independent to dependent variables
Derivatives (adjoints): propagated from dependent back to independent variables

f(a, b):
    c = a * b
    d = log(c)
    return d

Forward pass, evaluating f(2, 3): a = 2, b = 3, c = 6, d = 1.791

Reverse pass, with the adjoint seed adjoint(d) = 1:

adjoint(d) = 1
adjoint(c) = adjoint(d) / c = 0.166
adjoint(a) = adjoint(c)·b = 0.5
adjoint(b) = adjoint(c)·a = 0.333

In general, reverse mode evaluates a transposed Jacobian–vector product

We evaluated the gradient (∂d/∂a, ∂d/∂b) = (0.5, 0.333) with the seed adjoint(d) = 1
