cs 4803 / 7643: deep learning
TRANSCRIPT
![Page 1: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/1.jpg)
CS 4803 / 7643: Deep Learning
Dhruv Batra
Georgia Tech
Topics: – Automatic Differentiation
– (Finish) Forward mode vs Reverse mode AD– Patterns in backprop
![Page 2: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/2.jpg)
Administrativia• HW1 Reminder
– Due: 09/09, 11:59pm
• HW2 out on 9/10• Schedule: https://www.cc.gatech.edu/classes/AY2021/cs7643_fall/
• Project discussion next class
(C) Dhruv Batra 2
![Page 3: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/3.jpg)
Recap from last time
(C) Dhruv Batra 3
![Page 4: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/4.jpg)
How do we compute gradients?• Analytic or “Manual” Differentiation
• Symbolic Differentiation
• Numerical Differentiation
• Automatic Differentiation– Forward mode AD– Reverse mode AD
• aka “backprop”
(C) Dhruv Batra 4
![Page 5: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/5.jpg)
Vector/Matrix Derivatives Notation
(C) Dhruv Batra 5
![Page 6: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/6.jpg)
Vector/Matrix Derivatives Notation
(C) Dhruv Batra 6
![Page 7: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/7.jpg)
Vector Derivative Example
(C) Dhruv Batra 7
![Page 8: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/8.jpg)
Extension to Tensors
(C) Dhruv Batra 8
![Page 9: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/9.jpg)
Chain Rule: Composite Functions
(C) Dhruv Batra 9
![Page 10: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/10.jpg)
Chain Rule: Scalar Case
(C) Dhruv Batra 10
![Page 11: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/11.jpg)
Chain Rule: Vector Case
(C) Dhruv Batra 11
![Page 12: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/12.jpg)
Chain Rule: Jacobian view
(C) Dhruv Batra 12
![Page 13: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/13.jpg)
Chain Rule: Graphical view
(C) Dhruv Batra 13
![Page 14: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/14.jpg)
Chain Rule: Cascaded
(C) Dhruv Batra 14
![Page 15: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/15.jpg)
Deep Learning = Differentiable Programming
• Computation = Graph– Input = Data + Parameters– Output = Loss– Scheduling = Topological ordering
• Auto-Diff– A family of algorithms for
implementing chain-rule on computation graphs
(C) Dhruv Batra 15
![Page 16: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/16.jpg)
Directed Acyclic Graphs (DAGs)• Exactly what the name suggests
– Directed edges– No (directed) cycles– Underlying undirected cycles okay
(C) Dhruv Batra 16
![Page 17: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/17.jpg)
Directed Acyclic Graphs (DAGs)• Concept
– Topological Ordering
(C) Dhruv Batra 17
![Page 18: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/18.jpg)
Computational Graphs• Notation
(C) Dhruv Batra 18
![Page 19: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/19.jpg)
Deep Learning = Differentiable Programming
• Computation = Graph– Input = Data + Parameters– Output = Loss– Scheduling = Topological ordering
• Auto-Diff– A family of algorithms for
implementing chain-rule on computation graphs
(C) Dhruv Batra 19
![Page 20: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/20.jpg)
20
g
Forward mode AD
![Page 21: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/21.jpg)
21
g
Reverse mode AD
![Page 22: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/22.jpg)
Plan for Today• Automatic Differentiation
– (Finish) Forward mode vs Reverse mode AD– Backprop– Patterns in backprop
(C) Dhruv Batra 22
![Page 23: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/23.jpg)
Example: Forward mode AD
(C) Dhruv Batra 23
+
sin( )
x1 x2
*
![Page 24: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/24.jpg)
Example: Forward mode AD
(C) Dhruv Batra 24
+
sin( )
x1 x2
*
![Page 25: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/25.jpg)
(C) Dhruv Batra 25
+
sin( )
x1 x2
*
Example: Forward mode AD
![Page 26: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/26.jpg)
(C) Dhruv Batra 26
+
sin( )
x1 x2
*
Example: Forward mode AD
![Page 27: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/27.jpg)
(C) Dhruv Batra 27
+
sin( )
x1 x2
*
Example: Forward mode AD
Q: What happens if there’s another input variable x3?
![Page 28: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/28.jpg)
(C) Dhruv Batra 28
+
sin( )
x1 x2
*
Example: Forward mode AD
Q: What happens if there’s another input variable x3?A: more sophisticated graph; d+1 “passes” for d+1 variables
![Page 29: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/29.jpg)
(C) Dhruv Batra 29
+
sin( )
x1 x2
*
Example: Forward mode AD
Q: What happens if there’s another output variable f2?
![Page 30: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/30.jpg)
(C) Dhruv Batra 30
+
sin( )
x1 x2
*
Example: Forward mode AD
Q: What happens if there’s another output variable f2?A: more sophisticated graph;
d “passes” for variables
![Page 31: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/31.jpg)
Example: Reverse mode AD
(C) Dhruv Batra 31
+
sin( )
x1 x2
*
![Page 32: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/32.jpg)
Example: Reverse mode AD
(C) Dhruv Batra 32
+
sin( )
x1 x2
*
![Page 33: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/33.jpg)
+
Gradients add at branches
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
![Page 34: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/34.jpg)
Duality in Fprop and Bprop
(C) Dhruv Batra 34
+
+
FPROP BPROPSUM
COPY
![Page 35: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/35.jpg)
(C) Dhruv Batra 35
Example: Reverse mode AD
+
sin( )
x1 x2
*
![Page 36: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/36.jpg)
(C) Dhruv Batra 36
Example: Reverse mode AD
+
sin( )
x1 x2
*
Q: What happens if there’s another input variable x3?
![Page 37: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/37.jpg)
(C) Dhruv Batra 37
Example: Reverse mode AD
+
sin( )
x1 x2
*
Q: What happens if there’s another input variable x3?A: more sophisticated graph;
single “backward prop”
![Page 38: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/38.jpg)
(C) Dhruv Batra 38
Example: Reverse mode AD
+
sin( )
x1 x2
*
Q: What happens if there’s another output variable f2?
![Page 39: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/39.jpg)
(C) Dhruv Batra 39
Example: Reverse mode AD
+
sin( )
x1 x2
*
Q: What happens if there’s another output variable f2?A: more sophisticated graph; c “backward props” for c vars
![Page 40: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/40.jpg)
Forward Pass vs Forward mode AD vs Reverse Mode AD
(C) Dhruv Batra 40
+
sin( )
x1 x2
*
+
sin( )
x2
*
x1
+
sin( )
x1 x2
*
![Page 41: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/41.jpg)
Forward mode vs Reverse Mode• What are the differences?
(C) Dhruv Batra 41
+
sin( )
x2
*
+
sin( )
x1 x2
*
x1
![Page 42: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/42.jpg)
Forward mode vs Reverse Mode• What are the differences?
• Which one is faster to compute? – Forward or backward?
(C) Dhruv Batra 42
![Page 43: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/43.jpg)
Forward mode vs Reverse Mode• x Graph L• Intuition of Jacobian
(C) Dhruv Batra 43
![Page 44: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/44.jpg)
Forward mode vs Reverse Mode• What are the differences?
• Which one is faster to compute? – Forward or backward?
• Which one is more memory efficient (less storage)? – Forward or backward?
(C) Dhruv Batra 44
![Page 45: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/45.jpg)
Plan for Today• Automatic Differentiation
– (Finish) Forward mode vs Reverse mode AD– Backprop– Patterns in backprop
(C) Dhruv Batra 45
![Page 46: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/46.jpg)
(C) Dhruv Batra 46Figure Credit: Andrea Vedaldi
Neural Network Computation Graph
![Page 47: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/47.jpg)
(C) Dhruv Batra 47Figure Credit: Andrea Vedaldi
Backprop
![Page 48: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/48.jpg)
Any DAG of differentiable modules is allowed!
Slide Credit: Marc'Aurelio Ranzato(C) Dhruv Batra 48
Computational Graph
![Page 49: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/49.jpg)
Key Computation: Forward-Prop
(C) Dhruv Batra 49Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
![Page 50: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/50.jpg)
Key Computation: Back-Prop
(C) Dhruv Batra 50Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
![Page 51: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/51.jpg)
Neural Network Training• Step 1: Compute Loss on mini-batch [F-Pass]
(C) Dhruv Batra 51Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
![Page 52: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/52.jpg)
Neural Network Training• Step 1: Compute Loss on mini-batch [F-Pass]
(C) Dhruv Batra 52Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
![Page 53: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/53.jpg)
Neural Network Training• Step 1: Compute Loss on mini-batch [F-Pass]
(C) Dhruv Batra 53Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
![Page 54: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/54.jpg)
Neural Network Training• Step 1: Compute Loss on mini-batch [F-Pass]• Step 2: Compute gradients wrt parameters [B-Pass]
(C) Dhruv Batra 54Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
![Page 55: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/55.jpg)
Neural Network Training• Step 1: Compute Loss on mini-batch [F-Pass]• Step 2: Compute gradients wrt parameters [B-Pass]
(C) Dhruv Batra 55Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
![Page 56: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/56.jpg)
Neural Network Training• Step 1: Compute Loss on mini-batch [F-Pass]• Step 2: Compute gradients wrt parameters [B-Pass]
(C) Dhruv Batra 56Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
![Page 57: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/57.jpg)
Neural Network Training• Step 1: Compute Loss on mini-batch [F-Pass]• Step 2: Compute gradients wrt parameters [B-Pass]• Step 3: Use gradient to update parameters
(C) Dhruv Batra 57Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
![Page 58: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/58.jpg)
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Backpropagation: a simple example
![Page 59: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/59.jpg)
Backpropagation: a simple example
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
![Page 60: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/60.jpg)
Patterns in backward flow
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
![Page 61: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/61.jpg)
Patterns in backward flow
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Q: What is an add gate?
![Page 62: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/62.jpg)
Patterns in backward flow
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
add gate: gradient distributor
![Page 63: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/63.jpg)
add gate: gradient distributor
Q: What is a max gate?
Patterns in backward flow
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
![Page 64: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/64.jpg)
add gate: gradient distributor
max gate: gradient router
Patterns in backward flow
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
![Page 65: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/65.jpg)
add gate: gradient distributor
max gate: gradient router
Q: What is a mul gate?
Patterns in backward flow
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
![Page 66: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/66.jpg)
add gate: gradient distributor
max gate: gradient router
mul gate: gradient switcher
Patterns in backward flow
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
![Page 67: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/67.jpg)
+
Gradients add at branches
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
![Page 68: CS 4803 / 7643: Deep Learning](https://reader030.vdocuments.site/reader030/viewer/2022012915/61c5f463798ac8050b228fb6/html5/thumbnails/68.jpg)
Duality in Fprop and Bprop
(C) Dhruv Batra 68
+
+
FPROP BPROPSUM
COPY