Calculus Refresher (jrwright.info/aicourse/slides/19-CalculusRefresher.pdf)
Calculus Refresher
CMPUT 366: Intelligent Systems
GBC §4.1, 4.3
Lecture Outline
1. Recap
2. Gradient-based optimization
3. Numerical issues
Recap: Bayesian Learning
• In Bayesian Learning, we learn a distribution over models instead of a single model
• Model averaging to compute predictive distribution
• Prior can encode bias over models (like regularization)
• Conjugate models: can compute everything analytically
Recap: Monte Carlo
• Often we cannot directly estimate expectations from our model
• Example: non-conjugate Bayesian models
• Monte Carlo estimates: Use a random sample from the distribution to estimate expectations by sample averages
1. Use an easier-to-sample proposal distribution instead
2. Sample parts of the model sequentially
Loss Minimization
In supervised learning, we choose a hypothesis to minimize a loss function
Example: Predict the temperature
• Dataset: temperatures y^(i) from a random sample of days
• Hypothesis class: Always predict the same value μ
• Loss function: L(μ) = (1/n) ∑_{i=1}^n (y^(i) − μ)²
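A minimal numpy sketch of this loss, using made-up temperature readings (the data values are purely illustrative). It checks that the sample mean beats a few other constant predictions:

```python
import numpy as np

# Hypothetical temperature data; y[i] plays the role of y^(i) in the loss.
y = np.array([21.0, 19.5, 23.0, 20.5, 22.0])

def loss(mu):
    """L(mu) = (1/n) * sum_i (y_i - mu)^2"""
    return np.mean((y - mu) ** 2)

# The sample mean minimizes this squared loss.
mu_star = y.mean()
assert all(loss(mu_star) <= loss(m) for m in [18.0, 20.0, 21.0, 25.0])
```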
Optimization
Optimization: finding a value x* = arg min_x f(x) of x that minimizes f(x)
• Temperature example: Find μ that makes L(μ) small
Gradient descent: Iteratively move from current estimate in the direction that makes f(x) smaller
• For discrete domains, this is just hill climbing: iteratively choose the neighbour that has minimum f(x)
• For continuous domains, the neighbourhood is less well-defined
Derivatives
• The derivative f′(x) = (d/dx) f(x) of a function f(x) is the slope of f at point x
• When f′(x) > 0, f increases with small enough increases in x
• When f′(x) < 0, f decreases with small enough increases in x
[Figure: L(μ) and L′(μ) plotted against μ]
Multiple Inputs
Example: Predict the temperature based on pressure and humidity
• Dataset: (x_1^(1), x_2^(1), y^(1)), …, (x_1^(m), x_2^(m), y^(m)) = {(x^(i), y^(i)) ∣ 1 ≤ i ≤ m}
• Hypothesis class: Linear regression: h(x; w) = w_0 + w_1 x_1 + w_2 x_2
• Loss function: L(w) = (1/n) ∑_{i=1}^n (y^(i) − h(x^(i); w))²
Partial Derivatives
Partial derivatives: How much does f(x) change when we only change one of its inputs x_i?
• Can think of this as the derivative of a conditional function g(x_i) = f(x_1, …, x_i, …, x_n):
∂f(x)/∂x_i = (d/dx_i) g(x_i)
Gradient
• The gradient ∇f(x) of a function f(x) is just a vector that contains all of its partial derivatives:
∇f(x) = [∂f(x)/∂x_1, …, ∂f(x)/∂x_n]ᵀ
Gradient Descent
• The gradient of a function tells how to change every element of a vector to increase the function
• If the partial derivative with respect to x_i is positive, increase x_i
• Gradient descent: Iteratively choose new values of x in the opposite direction of the gradient:
x_new = x_old − η ∇f(x_old), where η is the learning rate
• This only works for sufficiently small changes (why?)
• Question: How much should we change x_old?
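The update rule above can be sketched in a few lines. The objective f(x) = x², the learning rate, and the step count here are illustrative choices, not anything prescribed by the slides:

```python
# Minimal gradient descent sketch on f(x) = x^2, whose gradient is 2x.
# eta (the learning rate) and the step count are illustrative choices.
def gradient_descent(grad, x0, eta=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x = x - eta * grad(x)  # step opposite the gradient
    return x

x_min = gradient_descent(lambda x: 2 * x, x0=5.0)
# x shrinks by a factor of 0.8 per step, so it converges toward the minimizer 0.
```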
Where Do Gradients Come From?
Question: How do we compute the gradients we need for gradient descent?
1. Analytic expressions / direct implementation:
L(μ) = (1/n) ∑_{i=1}^n (y^(i) − μ)²
     = (1/n) ∑_{i=1}^n [y^(i)² − 2y^(i)μ + μ²]
∇L(μ) = (1/n) ∑_{i=1}^n [−2y^(i) + 2μ]
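The analytic gradient above translates directly into code. The data values here are hypothetical; the check is that the gradient vanishes at the sample mean, which minimizes L:

```python
import numpy as np

y = np.array([3.0, 1.0, 4.0, 1.0, 5.0])  # hypothetical observations

def grad_L(mu):
    # Analytic gradient: (1/n) * sum_i (-2*y_i + 2*mu)
    return np.mean(-2 * y + 2 * mu)

# Below the mean the gradient is negative (loss decreases as mu grows),
# and at the sample mean it vanishes.
assert grad_L(0.0) < 0
assert abs(grad_L(y.mean())) < 1e-12
```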
Where Do Gradients Come From?
2. Method of differences:
∇L(x)_i ≈ [L(x + ε e_i) − L(x)] / ε (for "sufficiently" tiny ε)
Question: Why would we ever do this?
Question: What are the drawbacks?
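A sketch of this forward-difference estimate in numpy. Note it needs one extra loss evaluation per coordinate; the test objective and ε value are illustrative:

```python
import numpy as np

def finite_diff_grad(L, x, eps=1e-6):
    """Approximate each partial derivative by (L(x + eps*e_i) - L(x)) / eps."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps                       # perturb only coordinate i
        g[i] = (L(x + e) - L(x)) / eps
    return g

# Check against the known gradient of L(x) = sum(x^2), which is 2x.
x = np.array([1.0, -2.0, 0.5])
approx = finite_diff_grad(lambda v: np.sum(v ** 2), x)
assert np.allclose(approx, 2 * x, atol=1e-4)
```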
Where Do Gradients Come From?
3. The Chain Rule (of Calculus)
dz/dx = (dz/dy)(dy/dx)
i.e., h(x) = f(g(x)) ⟹ h′(x) = f′(g(x)) g′(x)
• If we know formulas for the derivatives of components of a function, then we can build up the derivative of their composition mechanically
• Most prominent example: Back-propagation in neural networks
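A small worked instance of the chain rule, checked against a finite difference. The composition sin(x²) is an arbitrary illustrative choice:

```python
import math

# Chain rule on h(x) = f(g(x)) with f(y) = sin(y), g(x) = x^2:
# h'(x) = f'(g(x)) * g'(x) = cos(x^2) * 2x.
def h_prime(x):
    return math.cos(x ** 2) * 2 * x

# Compare against a small finite difference.
x, eps = 0.7, 1e-7
numeric = (math.sin((x + eps) ** 2) - math.sin(x ** 2)) / eps
assert abs(h_prime(x) - numeric) < 1e-5
```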
Approximating Real Numbers
• Computers store real numbers as a finite number of bits
• Problem: There are infinitely many real numbers in any interval
• Real numbers are encoded as floating point numbers: significand × 2^exponent
• e.g., 1.001...011011 × 2^1001..0011
• Single precision: 24-bit significand, 8-bit exponent
• Double precision: 53-bit significand, 11-bit exponent
• Deep learning typically uses single precision!
Underflow
• Numbers smaller than 1.00...01 × 2^−1111...1111 are rounded down to zero
• Sometimes that's okay! (Almost every number gets rounded)
• Often it's not (when?)
• Denominators: causes divide-by-zero
• log(0): returns -inf
• log(negative): returns nan
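These failure modes are easy to reproduce in numpy; a small demonstration (the specific constant 1e-46 is chosen only to sit below the smallest positive float32, roughly 1.4e-45):

```python
import numpy as np

# float32 underflow: a value below the smallest positive float32 rounds to zero,
# and subsequent operations on that zero misbehave.
tiny = np.float32(1e-46)
assert tiny == 0.0  # underflowed to exactly zero

with np.errstate(divide="ignore", invalid="ignore"):
    assert np.log(np.float32(0.0)) == -np.inf   # log of an underflowed value
    assert np.isnan(np.log(np.float32(-1.0)))   # log of a negative value
```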
Overflow
• Numbers bigger than 1.111...1111 × 2^1111...1111 are rounded up to infinity
• Numbers smaller than −1.111...1111 × 2^1111...1111 are rounded down to negative infinity
• exp is used very frequently
• Underflows for very negative numbers
• Overflows for "large" numbers
• 89 counts as "large"
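Checking the "89 counts as large" claim directly in float32 (the input −110 for the underflow case is an illustrative choice well past the underflow threshold):

```python
import numpy as np

# float32 exp overflows just below 89, so exp(89) is already infinite.
with np.errstate(over="ignore", under="ignore"):
    big = np.exp(np.float32(89.0))
    ok = np.exp(np.float32(88.0))
    small = np.exp(np.float32(-110.0))   # very negative input underflows

assert big == np.inf
assert np.isfinite(ok)
assert small == 0.0
```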
Addition/Subtraction
• Adding a small number to a large number can have no effect (why?)
Example:
>>> A = np.array([0., 1e-8]).astype('float32')
>>> A.argmax()
1
>>> (A + 1).argmax()
0
>>> A + 1
array([1., 1.], dtype=float32)
(Note: 1e-8 is not the smallest possible float32)
Softmax
• Softmax is a very common function:
softmax(x)_i = exp(x_i) / ∑_{j=1}^n exp(x_j)
• Used to convert a vector of activations (i.e., numbers) into a probability distribution
• Question: Why not normalize them directly without exp?
• But exp overflows very quickly
• Solution: compute softmax(z), where z = x − max_j x_j
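The max-shift trick above works because subtracting a constant from every entry cancels in the ratio; after the shift the largest exponent is exp(0) = 1, so nothing overflows. A sketch in numpy:

```python
import numpy as np

def softmax_stable(x):
    """Numerically stable softmax: shift by max(x) before exponentiating."""
    z = x - np.max(x)        # largest entry of z is 0, so exp never overflows
    e = np.exp(z)
    return e / np.sum(e)

x = np.array([1000.0, 1001.0, 1002.0])   # naive exp(x) would overflow
p = softmax_stable(x)
assert np.all(np.isfinite(p))
assert abs(p.sum() - 1.0) < 1e-12
```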
Log
• Dataset likelihoods shrink exponentially quickly in the number of datapoints
• Example:
• Likelihood of a sequence of 5 fair coin tosses = 2^−5 = 1/32
• Likelihood of a sequence of 100 fair coin tosses = 2^−100
• Solution: Use log-probabilities instead of probabilities:
log(p_1 p_2 p_3 … p_n) = log p_1 + … + log p_n
• log-prob of 1000 fair coin tosses is 1000 log 0.5 ≈ −693
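A quick check of both halves of this claim (2000 tosses is chosen because 2^−2000 is small enough to underflow even double precision, while 2^−1000 is not):

```python
import math

# The likelihood of n fair coin tosses is 2^-n.  For large enough n the
# product underflows to zero even in double precision, but the
# log-likelihood is a perfectly ordinary number.
assert 0.5 ** 2000 == 0.0                 # 2^-2000 underflows float64
log_prob_1000 = 1000 * math.log(0.5)      # log-prob of 1000 tosses
assert abs(log_prob_1000 - (-693.147)) < 0.01
```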
General Solution
• Question: What is the most general solution to numerical problems?
• Standard libraries:
• Theano and TensorFlow both detect common unstable expressions
• scipy and numpy have stable implementations of many common patterns (e.g., softmax, logsumexp, sigmoid)
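For instance, scipy.special ships stable versions of the patterns named above (this assumes scipy is installed; softmax requires scipy ≥ 1.2):

```python
import numpy as np
from scipy.special import logsumexp, softmax, expit

x = np.array([1000.0, 1001.0, 1002.0])

# scipy's softmax does the max-shift internally, so large inputs are fine.
p = softmax(x)
assert np.all(np.isfinite(p)) and abs(p.sum() - 1.0) < 1e-12

# logsumexp(x) = log(sum(exp(x))) computed without overflowing.
assert np.isfinite(logsumexp(x))

# expit is the logistic sigmoid; it degrades gracefully at the extremes
# instead of raising overflow errors.
assert expit(-1000.0) == 0.0
```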
Summary
• Gradients are just vectors of partial derivatives
• Gradients point "uphill"
• Learning rate controls how fast we walk uphill
• Deep learning is fraught with numerical issues:
• Underflow, overflow, magnitude mismatches
• Use standard implementations whenever possible