
Page 1: Lecture 04 Information Theory - GitHub Pages

Nakul Gopalan

Georgia Tech

Optimization

Machine Learning CS 4641

These slides are based on slides from Mykel Kochenderfer, Julia Roberts, Glaucio Paulino, and Rodrigo Borela Valente.

Page 2: Lecture 04 Information Theory - GitHub Pages


• Overview

• Unconstrained and constrained optimization

• Lagrange multipliers and KKT conditions

• Gradient descent

Complementary reading: Bishop PRML, Appendix E

Outline

Page 3: Lecture 04 Information Theory - GitHub Pages


• Overview

• Unconstrained and constrained optimization

• Lagrange multipliers and KKT conditions

• Gradient descent

Outline

Page 4: Lecture 04 Information Theory - GitHub Pages

Why optimization?

▪ Machine learning and pattern recognition algorithms often focus on the minimization or maximization of a quantity

▪ Likelihood of a distribution given a dataset

▪ Distortion measure in clustering analysis

▪ Misclassification error while predicting labels

▪ Squared distance error for a real-valued prediction task


Page 5: Lecture 04 Information Theory - GitHub Pages

Basic optimization problem

▪ Objective or cost function 𝑓(𝐱): the quantity we are trying to optimize (maximize or minimize)

▪ The variables 𝑥1, 𝑥2, …, 𝑥𝑛, which can be represented in vector form as 𝐱 (Note: 𝑥𝑛 here does NOT correspond to a point in our dataset)

▪ Constraints that limit how small or big the variables can be. These can be equality constraints, noted as ℎ𝑘(𝐱), and inequality constraints, noted as 𝑔𝑗(𝐱)

▪ An optimization problem is usually expressed as:

max𝐱 𝑓(𝐱)
s.t. 𝐠(𝐱) ≥ 0
     𝐡(𝐱) = 0
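As a hedged illustration of this general form (the objective and constraints below are made up for the example, and SciPy is assumed to be available), a small instance can be solved numerically:

# Illustrative instance of: max_x f(x)  s.t.  g(x) >= 0,  h(x) = 0.
# SciPy minimizes, so we negate f to maximize it.
import numpy as np
from scipy.optimize import minimize

f = lambda x: -(x[0] * x[1])        # maximize x1*x2  ->  minimize -(x1*x2)
g = lambda x: x[0]                  # inequality constraint: x1 >= 0
h = lambda x: x[0] + x[1] - 1.0     # equality constraint: x1 + x2 - 1 = 0

result = minimize(
    f,
    x0=np.array([0.5, 0.5]),
    constraints=[{"type": "ineq", "fun": g},
                 {"type": "eq", "fun": h}],
)
print(result.x)  # approximately [0.5, 0.5], the constrained maximizer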


Page 6: Lecture 04 Information Theory - GitHub Pages


• Overview

• Unconstrained and constrained optimization

• Lagrange multipliers and KKT conditions

• Gradient descent

Outline

Page 7: Lecture 04 Information Theory - GitHub Pages

Unconstrained and constrained optimization

[Figure: function landscape with local minima and local maxima labeled]


Page 8: Lecture 04 Information Theory - GitHub Pages


• Overview

• Unconstrained and constrained optimization

• Lagrange multipliers and KKT conditions

• Gradient descent

Outline

Page 9: Lecture 04 Information Theory - GitHub Pages

Lagrangian multipliers: equality constraint

[Figure: constraint ℎ(𝐱) = 0 with the gradients ∇𝑓(𝐱) and ∇ℎ(𝐱) shown at a point on the constraint surface]


Page 10: Lecture 04 Information Theory - GitHub Pages

Lagrangian multipliers: equality constraint

[Figure: gradients ∇𝑓(𝐱) and ∇ℎ(𝐱) on the constraint surface ℎ(𝐱) = 0]


Page 11: Lecture 04 Information Theory - GitHub Pages

Lagrangian multipliers
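The slide's derivation isn't reproduced here; as a sketch of the idea, at a constrained optimum ∇𝑓(𝐱) = 𝜆∇ℎ(𝐱), so the stationary points of the Lagrangian 𝐿(𝐱, 𝜆) = 𝑓(𝐱) − 𝜆ℎ(𝐱) can be solved for directly. The example below is my own (SymPy assumed), not from the slides:

# Maximize f(x, y) = x + y subject to h(x, y) = x**2 + y**2 - 1 = 0
# by solving grad f = lam * grad h together with h = 0 (illustrative example).
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x + y
h = x**2 + y**2 - 1

L = f - lam * h                                  # Lagrangian
stationarity = [sp.diff(L, v) for v in (x, y, lam)]
solutions = sp.solve(stationarity, (x, y, lam), dict=True)
print(solutions)  # includes x = y = sqrt(2)/2, the constrained maximizer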


Page 12: Lecture 04 Information Theory - GitHub Pages


• Overview

• Unconstrained and constrained optimization

• Lagrange multipliers and KKT conditions

• Gradient descent

Outline

Page 13: Lecture 04 Information Theory - GitHub Pages

Search

Page 14: Lecture 04 Information Theory - GitHub Pages

Gradient descent

▪ Common in machine learning problems when not all of the data is available immediately or a closed-form solution is computationally intractable

▪ Iterative minimization technique for differentiable functions on a domain

𝐱𝑛+1 = 𝐱𝑛 − 𝛾∇𝐹(𝐱𝑛)
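A minimal sketch of this update rule (the quadratic objective, starting point, and step size 𝛾 = 0.1 below are illustrative choices, not from the slides):

# Gradient descent: x_{n+1} = x_n - gamma * grad F(x_n)
import numpy as np

def gradient_descent(grad_F, x0, gamma=0.1, n_steps=100):
    """Iterate the update rule from a starting point x0."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - gamma * grad_F(x)
    return x

# Example: F(x) = ||x - b||^2 has gradient 2 * (x - b) and is minimized at x = b.
b = np.array([3.0, -2.0])
x_min = gradient_descent(lambda x: 2.0 * (x - b), x0=[0.0, 0.0])
print(x_min)  # close to [3.0, -2.0]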


Page 15: Lecture 04 Information Theory - GitHub Pages

Closed or Symbolic differential

f(x) = x² + x + 1

df(x)/dx = 2x + 1
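As a sketch, the same closed-form derivative can be obtained programmatically with a symbolic package (SymPy assumed here):

# Symbolic differentiation of f(x) = x**2 + x + 1 (illustrative, using SymPy).
import sympy as sp

x = sp.symbols('x')
f = x**2 + x + 1
print(sp.diff(f, x))  # 2*x + 1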

Page 16: Lecture 04 Information Theory - GitHub Pages

Method of Finite differences

f(x) = x² + x + 1

Δf(x)/Δx = (f(x + Δx) − f(x)) / Δx
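A minimal sketch of the forward-difference approximation for this 𝑓 (the step size is an illustrative choice):

# Forward finite-difference approximation of df/dx for f(x) = x**2 + x + 1.
# Smaller dx is more accurate until floating-point round-off dominates.
def f(x):
    return x**2 + x + 1

def finite_difference(f, x, dx=1e-6):
    return (f(x + dx) - f(x)) / dx

print(finite_difference(f, 2.0))  # ~5.000001, vs the exact derivative 2*2 + 1 = 5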

Page 17: Lecture 04 Information Theory - GitHub Pages

Autodiff

Automatic differentiation (AD): a method to get exact derivatives efficiently, by storing information as you go forward that you can reuse as you go backwards.

• Takes code that computes a function and returns code that computes the derivative of that function.

• “The goal isn’t to obtain closed-form solutions, but to be able to write a program that efficiently computes the derivatives.”

• Autograd, Torch Autograd

From slides of Paul Vicol and Ryan Adams
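As a sketch of what such a tool does (PyTorch assumed; the polynomial is the running example above, not from the slides):

# Reverse-mode autodiff with torch.autograd: exact derivative of f(x) = x**2 + x + 1.
import torch

x = torch.tensor(2.0, requires_grad=True)
f = x**2 + x + 1    # forward pass records the operations
f.backward()        # backward pass reuses that record to compute df/dx
print(x.grad)       # tensor(5.), matching the symbolic result 2*x + 1 at x = 2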

Page 18: Lecture 04 Information Theory - GitHub Pages

Autodiff

An autodiff system will convert the program into a sequence of primitive operations, which have specified routines for computing derivatives.

From slides of Roger Grosse
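A toy sketch of that idea, using forward accumulation for brevity (entirely illustrative, not any library's internals): each primitive op carries a local derivative rule, and the chain rule combines them.

# Toy decomposition of f(x) = x**2 + x + 1 into primitive ops,
# each with a known local derivative, chained together by hand.
def f_and_grad(x):
    # primitive 1: v1 = x**2, local derivative dv1/dx = 2*x
    v1, dv1_dx = x**2, 2 * x
    # primitive 2: v2 = v1 + x, so dv2/dx = dv1/dx + 1
    v2, dv2_dx = v1 + x, dv1_dx + 1
    # primitive 3: v3 = v2 + 1, so dv3/dx = dv2/dx
    v3, dv3_dx = v2 + 1, dv2_dx
    return v3, dv3_dx

print(f_and_grad(2.0))  # (7.0, 5.0): f(2) = 7 and f'(2) = 5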

Page 19: Lecture 04 Information Theory - GitHub Pages

Gradient descent: Himmelblau’s function


Image credit: Wikimedia
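The plot isn't reproduced here; as a hedged sketch, Himmelblau's function 𝑓(𝑥, 𝑦) = (𝑥² + 𝑦 − 11)² + (𝑥 + 𝑦² − 7)² has four local minima, and the gradient descent update above can reach one of them (step size and starting point below are my own choices):

# Gradient descent on Himmelblau's function
#   f(x, y) = (x**2 + y - 11)**2 + (x + y**2 - 7)**2
import numpy as np

def grad_himmelblau(p):
    x, y = p
    df_dx = 4 * x * (x**2 + y - 11) + 2 * (x + y**2 - 7)
    df_dy = 2 * (x**2 + y - 11) + 4 * y * (x + y**2 - 7)
    return np.array([df_dx, df_dy])

p = np.array([0.0, 0.0])            # starting point
for _ in range(2000):
    p = p - 0.01 * grad_himmelblau(p)
print(p)  # converges to one of the four local minima, e.g. near (3, 2)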