Lecture 04 Information Theory - GitHub Pages
TRANSCRIPT
Nakul Gopalan
Georgia Tech
Optimization
Machine Learning CS 4641
These slides are based on slides from Mykel Kochenderfer, Julia Roberts, Glaucio Paulino, and Rodrigo Borela Valente.
Outline
• Overview
• Unconstrained and constrained optimization
• Lagrange multipliers and KKT conditions
• Gradient descent

Complementary reading: Bishop PRML – Appendix E
Why optimization?
▪ Machine learning and pattern recognition algorithms often focus on the minimization or maximization of a quantity:
▪ Likelihood of a distribution given a dataset
▪ Distortion measure in clustering analysis
▪ Misclassification error while predicting labels
▪ Squared distance error for a real-valued prediction task
Basic optimization problem
▪ Objective or cost function 𝑓(𝐱): the quantity we are trying to optimize (maximize or minimize)
▪ The variables 𝑥1, 𝑥2, …, 𝑥𝑛, which can be represented in vector form as 𝐱 (note: 𝑥𝑛 here does NOT correspond to a point in our dataset)
▪ Constraints that limit how small or big the variables can be. These can be equality constraints, noted as ℎ𝑘(𝐱), and inequality constraints, noted as 𝑔𝑗(𝐱)
▪ An optimization problem is usually expressed as:

max𝐱 𝑓(𝐱)
s.t. 𝐠(𝐱) ≥ 0
     𝐡(𝐱) = 0
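A minimal sketch of how such a problem is handed to an off-the-shelf solver, assuming SciPy and an illustrative objective and constraints of my own choosing (the slides show only the general form; note that SciPy minimizes, so a maximization problem would negate 𝑓):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical instance of the general form above:
# minimize f(x) subject to g(x) >= 0 and h(x) = 0
f = lambda x: (x[0] - 1)**2 + (x[1] - 2)**2             # objective
g = {'type': 'ineq', 'fun': lambda x: x[0]}             # g(x) = x_0 >= 0
h = {'type': 'eq',   'fun': lambda x: x[0] + x[1] - 2}  # h(x) = x_0 + x_1 - 2 = 0

result = minimize(f, x0=np.zeros(2), constraints=[g, h])
print(result.x)  # constrained optimum, approximately [0.5, 1.5]
```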
Unconstrained and constrained optimization
[Figure: a curve with several stationary points, annotated with its local minima and local maxima]
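To make the picture concrete, a small sketch with a function of my own choosing: 𝑓(x) = x⁴ − 3x² has local minima at x = ±√(3/2) and a local maximum at x = 0, so an unconstrained solver's answer depends on where it starts, and a constraint can pin the optimum to the feasible region's boundary:

```python
from scipy.optimize import minimize

f = lambda x: x[0]**4 - 3 * x[0]**2   # two local minima, one local maximum

# Unconstrained: the result depends on the starting point
print(minimize(f, x0=[ 2.0]).x)       # ~ +1.2247 = +sqrt(3/2)
print(minimize(f, x0=[-2.0]).x)       # ~ -1.2247 = -sqrt(3/2)

# Constrained to x >= 2: the optimum sits on the constraint boundary
print(minimize(f, x0=[3.0], bounds=[(2.0, None)]).x)  # ~ 2.0
```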
Lagrangian multipliers: equality constraint
[Figure: the constraint surface ℎ(𝐱) = 0 with the gradients ∇ℎ(𝐱) and ∇𝑓(𝐱) drawn at a point on the surface]
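The geometric condition the figure encodes (a standard result, stated here in Bishop's convention): at a constrained stationary point, ∇𝑓 can have no component along the constraint surface, so the two gradients must be (anti-)parallel:

```latex
\nabla f(\mathbf{x}) + \lambda \nabla h(\mathbf{x}) = 0
\quad \text{for some } \lambda \in \mathbb{R},
\qquad h(\mathbf{x}) = 0
```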
Lagrangian multipliers
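The body of this slide did not survive extraction; the standard construction it refers to (Bishop PRML, Appendix E) is the Lagrangian 𝓛(𝐱, λ) = 𝑓(𝐱) + λℎ(𝐱), whose stationary points satisfy exactly the parallel-gradient condition above. A worked sketch with SymPy, with an illustrative 𝑓 and ℎ of my own choosing:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x + y                 # illustrative objective
h = x**2 + y**2 - 1       # illustrative equality constraint h(x, y) = 0
L = f + lam * h           # the Lagrangian

# Setting all partial derivatives of L to zero recovers
# grad f + lam * grad h = 0 together with h = 0
candidates = sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)
print(candidates)  # two stationary points: x = y = +sqrt(2)/2 and x = y = -sqrt(2)/2
```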
Search
Gradient descent
▪ Common in machine learning problems where not all of the data is available immediately, or where a closed-form solution is computationally intractable
▪ Iterative minimization technique for differentiable functions on a domain
𝐱𝑛+1 = 𝐱𝑛 − 𝛾∇𝐹(𝐱𝑛)
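A minimal implementation sketch of this update rule (the example 𝐹, step size γ, and iteration count are illustrative choices, not from the slides):

```python
import numpy as np

def gradient_descent(grad_F, x0, gamma=0.1, n_steps=100):
    """Iterate x_{n+1} = x_n - gamma * grad_F(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - gamma * grad_F(x)
    return x

# Example: F(x) = x^2 + x + 1, so grad_F(x) = 2x + 1; the minimum is at x = -0.5
print(gradient_descent(lambda x: 2 * x + 1, x0=[0.0]))  # ~ [-0.5]
```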
Closed-form or symbolic differentiation
f(x) = x² + x + 1
df(x)/dx = 2x + 1
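The same exact derivative can be produced programmatically, e.g. with SymPy (a sketch; the slide itself shows only the hand-derived result):

```python
import sympy as sp

x = sp.symbols('x')
f = x**2 + x + 1
print(sp.diff(f, x))  # 2*x + 1, the exact closed-form derivative
```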
Method of finite differences
f(x) = x² + x + 1
Δf(x)/Δx ≈ (f(x + Δ) − f(x − Δ)) / (2Δ)
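A numerical sketch of the central difference above (the step size Δ is an illustrative choice; too large loses accuracy, too small loses floating-point precision):

```python
def central_difference(f, x, delta=1e-5):
    """Approximate f'(x) by (f(x + delta) - f(x - delta)) / (2 * delta)."""
    return (f(x + delta) - f(x - delta)) / (2 * delta)

f = lambda x: x**2 + x + 1
print(central_difference(f, 3.0))  # ~ 7.0, matching the exact 2x + 1 at x = 3
```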
Autodiff
Automatic differentiation (AD): a method to get exact derivatives efficiently, by storing information as you go forward that you can reuse as you go backwards.
• Takes code that computes a function and returns code that computes the derivative of that function.
• "The goal isn't to obtain closed-form solutions, but to be able to write a program that efficiently computes the derivatives."
• Autograd, Torch Autograd
From slides of Paul Vicol, and Ryan Adams
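A minimal reverse-mode example with Torch Autograd, which the slide names (my sketch, not code from the slides):

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
f = x**2 + x + 1     # the same function as on the previous slides
f.backward()         # reverse pass: reuses values stored on the forward pass
print(x.grad)        # tensor(7.), the exact derivative 2x + 1 at x = 3
```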
Autodiff
An autodiff system will convert the program into a sequence of primitive operations, each of which has a specified routine for computing derivatives.
From slides of Roger Grosse
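A hand-rolled sketch of that idea for f(x) = x² + x + 1: the function is broken into primitive operations, each with a known derivative rule, and the chain rule stitches them together on the way back (illustrative code, not any particular library's internals):

```python
def f_and_grad(x):
    # Forward pass: a sequence of primitive operations
    t1 = x * x       # primitive: multiply
    t2 = t1 + x      # primitive: add
    y = t2 + 1.0     # primitive: add constant
    # Backward pass: each primitive's derivative rule, applied in reverse
    dy_dt2 = 1.0                # through y = t2 + 1
    dy_dt1 = dy_dt2 * 1.0       # through t2 = t1 + x (t1 path)
    dy_dx = dy_dt2 * 1.0        # through t2 = t1 + x (direct x path)
    dy_dx += dy_dt1 * (2 * x)   # through t1 = x * x, via the chain rule
    return y, dy_dx

print(f_and_grad(3.0))  # (13.0, 7.0)
```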
Gradient descent: Himmelblau’s function
[Figure: plot of Himmelblau's function. Image credit: Wikimedia]
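Himmelblau's function, 𝑓(x, y) = (x² + y − 11)² + (x + y² − 7)², has four global minima, which makes it a standard illustration of how gradient descent's answer depends on the starting point. A sketch reusing the update rule from earlier (the step size and iteration count are illustrative choices):

```python
import numpy as np

def himmelblau_grad(p):
    x, y = p
    a = x**2 + y - 11
    b = x + y**2 - 7
    return np.array([4 * a * x + 2 * b, 2 * a + 4 * b * y])

# Two different starting points converge to two different minima
for x0 in ([0.0, 0.0], [-3.0, 3.0]):
    x = np.array(x0)
    for _ in range(2000):
        x = x - 0.01 * himmelblau_grad(x)
    print(np.round(x, 3))  # e.g. near [3. 2.] and then near [-2.805  3.131]
```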