lecture 1 recap · powerpoint presentation author: laura lt created date: 4/22/2018 5:59:36 pm
TRANSCRIPT
Lecture 1 Recap
Prof. Leal-Taixé and Prof. Niessner 1
Reinforcement learning
Agents Environmentinteraction
Machine learning
Unsupervised learning Supervised learning
Prof. Leal-Taixé and Prof. Niessner 2
Nearest Neighbor
?Prof. Leal-Taixé and Prof. Niessner 3
Nearest Neighbor
distance
NN classifier = dog
Prof. Leal-Taixé and Prof. Niessner 4
Nearest Neighbor
Courtesy of Stanford course cs231n
Decision boundaries
Prof. Leal-Taixé and Prof. Niessner 5
Linear Regression
Prof. Leal-Taixé and Prof. Niessner 6
Linear regression
• Supervised learning
• Find a linear model that explains a target given the inputs
Prof. Leal-Taixé and Prof. Niessner 7
Training
Data points
Linear regression
Input (e.g. image, measurement)
Labels (e.g. cat/dog)
Learner
Model parameters
Prof. Leal-Taixé and Prof. Niessner 8
Training
Testing
Data points
Learner
Model parameters
Predictor
Estimation
Linear regression
Prof. Leal-Taixé and Prof. Niessner 9
Input data, features
weights, model parameters
d inputs
Linear prediction• A linear model is expressed in the form
Prof. Leal-Taixé and Prof. Niessner 10
1 bias
Linear prediction• A linear model is expressed in the form
Prof. Leal-Taixé and Prof. Niessner 11
Temperature of a building
Outside temperature
Number of people
Sun exposure
Level of humidity
Linear prediction
Prof. Leal-Taixé and Prof. Niessner 12
Input features
Model parameters
Prediction
Linear prediction
Prof. Leal-Taixé and Prof. Niessner 13
Temperature of the building
Out
sid
e te
mp
erat
ure
Hum
idity
Num
ber
peo
ple
Sun
exp
osur
e (%
)
MODEL
Linear prediction
Prof. Leal-Taixé and Prof. Niessner 14
Temperature of the building
Out
sid
e te
mp
erat
ure
Hum
idity
Num
ber
peo
ple
Sun
exp
osur
e (%
)
MODEL
How do we obtain the
model?
Linear prediction
Prof. Leal-Taixé and Prof. Niessner 15
How to obtain the model?
Data points
Model parameters
Labels (ground truth)
Estimation
Loss function
Optimization
Prof. Leal-Taixé and Prof. Niessner 16
How to obtain the model?
• Loss function: measures how good my estimation is (how good my model is) and tells the optimization method how to make it better.
• Optimization: changes the model in order to improve the loss function (i.e. to improve my estimation).
Prof. Leal-Taixé and Prof. Niessner 17
Prediction: Temperature of
the building
Linear regression: loss function
Prof. Leal-Taixé and Prof. Niessner 18
Prediction: Temperature of
the building
Linear regression: loss function
Prof. Leal-Taixé and Prof. Niessner 19
MinimizingObjective function
EnergyCost function
Linear regression: loss function
Prof. Leal-Taixé and Prof. Niessner 20
Optimization: linear least squares• Linear least squares: an approach to fit a
mathematical model to the data
• Convex problem, there exists a closed-form solution that is unique.
Prof. Leal-Taixé and Prof. Niessner 21
Optimization: linear least squares
The estimation comes from the
linear model
training samples
Prof. Leal-Taixé and Prof. Niessner 22
Optimization: linear least squares
Matrix notation
labelstraining samples, each input vector
has size Prof. Leal-Taixé and Prof. Niessner 23
Optimization: linear least squares
Matrix notation
Thursday April 19th: exercise session, more on matrix notation
Prof. Leal-Taixé and Prof. Niessner 24
Optimization: linear least squares
Convex
OptimumProf. Leal-Taixé and Prof. Niessner 25
Optimization
More info on Thursday!
No theta there!
Prof. Leal-Taixé and Prof. Niessner 26
Output: Temperature of
the buildingInputs: Outside temperature,
number of people…
Optimization
Prof. Leal-Taixé and Prof. Niessner 27
Is this the best estimate?• Least squares estimate
Prof. Leal-Taixé and Prof. Niessner 28
Maximum Likelihood
Prof. Leal-Taixé and Prof. Niessner 29
Maximum Likelihood Estimate
True underlying distribution
Controlled by a parameter
Parametric family of distributions
Prof. Leal-Taixé and Prof. Niessner 30
Maximum Likelihood Estimate• A method of estimating the parameters of a statistical
model given observations,
Observations from
Prof. Leal-Taixé and Prof. Niessner 31
Maximum Likelihood Estimate• A method of estimating the parameters of a statistical
model given observations, by finding the parameter values that maximize the likelihood of making the observations given the parameters.
Training samples
Prof. Leal-Taixé and Prof. Niessner 32
Maximum Likelihood Estimate
Logarithmic property
Prof. Leal-Taixé and Prof. Niessner 33
Back to linear regression
Prof. Leal-Taixé and Prof. Niessner 34
Back to linear regression
Assuming
Gaussian or Normal distribution
mean
Prof. Leal-Taixé and Prof. Niessner 35
Back to linear regression
Assuming
Prof. Leal-Taixé and Prof. Niessner 36
Back to linear regression
Assuming
mean
Prof. Leal-Taixé and Prof. Niessner 37
Back to linear regression
Original optimization problem
Prof. Leal-Taixé and Prof. Niessner 38
Back to linear regression
Matrix notation
Canceling log and e
Prof. Leal-Taixé and Prof. Niessner 39
Back to linear regression
How can we find the estimate of theta?
Prof. Leal-Taixé and Prof. Niessner 40
Linear regression• Maximum Likelihood Estimate (MLE) corresponds to
the Least Squares Estimate (given the assumptions)
• Introduced the concepts of loss function and optimization to obtain the best model for regression
Prof. Leal-Taixé and Prof. Niessner 41
Image classification
Prof. Leal-Taixé and Prof. Niessner 42
Regression vs. Classification
• Regression: predict a continuous output value (e.g. temperature of a room)
• Classification: predicts a discrete value – Binary classification: output is either 0 or 1– Multi-class classification: set of N classes
Prof. Leal-Taixé and Prof. Niessner 43
Logistic regression
CAT classifier
Prof. Leal-Taixé and Prof. Niessner 44
Sigmoid for binary predictions• 0
Prof. Leal-Taixé and Prof. Niessner 45
Can be interpreted as a probability
1
Spoiler alert: 1 layer Neural Network• 0
Prof. Leal-Taixé and Prof. Niessner 46
Can be interpreted as a probability
1
Logistic regression• Probability of a binary output
Prof. Leal-Taixé and Prof. Niessner 47
Logistic regression• Probability of a binary output
Prof. Leal-Taixé and Prof. Niessner 48
Model for coins
Bernoulli trial
Logistic regression• Probability of a binary output
Model for coins
Prof. Leal-Taixé and Prof. Niessner 49
Logistic regression: loss function• Probability of a binary output
• Maximum Likelihood Estimate
Prof. Leal-Taixé and Prof. Niessner 50
Logistic regression: loss function
Prof. Leal-Taixé and Prof. Niessner 51
Logistic regression: loss function
Prof. Leal-Taixé and Prof. Niessner 52
Maximize!
Logistic regression: loss function
Prof. Leal-Taixé and Prof. Niessner 53
We want. large, since logarithm is a monotonically increasing function, we also want
large (1 is the largest value my estimate can take!)
Logistic regression: loss function
Prof. Leal-Taixé and Prof. Niessner 54
We want. large, so we want to be small (0 is the smallest value my estimate can take!)
Logistic regression: loss function
• Related to the multi-class loss you will see in this course (also called softmax loss)
Prof. Leal-Taixé and Prof. Niessner 55
Referred to as cross-entropy loss
Logistic regression: optimization• Loss function
• Cost function
Prof. Leal-Taixé and Prof. Niessner 56
Minimization
Logistic regression: optimization• No closed-form solution
• Make use of an iterative method gradient descent
Prof. Leal-Taixé and Prof. Niessner 57
Gradient descent
Prof. Leal-Taixé and Prof. Niessner 58
Following the slope
Optimum
Prof. Leal-Taixé and Prof. Niessner 59
Initialization
Following the slope
Prof. Leal-Taixé and Prof. Niessner 60
Initialization
Optimum
Following the slope
Prof. Leal-Taixé and Prof. Niessner 61
Initialization
Follow the slope of the DERIVATIVE
Optimum
Gradient steps• From derivative to gradient
• Gradient steps in direction of negative gradient
Prof. Leal-Taixé and Prof. Niessner 62
Direction of greatest
increase of the function
Learning rate
Gradient steps• From derivative to gradient
• Gradient steps in direction of negative gradient
Prof. Leal-Taixé and Prof. Niessner 63
Direction of greatest
increase of the function
SMALL Learning rate
Gradient steps• From derivative to gradient
• Gradient steps in direction of negative gradient
Prof. Leal-Taixé and Prof. Niessner 64
Direction of greatest
increase of the function
LARGE Learning rate
Convergence
Prof. Leal-Taixé and Prof. Niessner 65
Initialization
What is the gradient when we reach this point?
Not guaranteed to reach the
optimumOptimum
Numerical gradient
• Approximate• Slow evaluation
Prof. Leal-Taixé and Prof. Niessner 66
Analytical gradient• Exact and fast
Prof. Leal-Taixé and Prof. Niessner 67
Analytical gradient
Recall Linear Regression
A step in that direction
Gradient descent for least squares
Prof. Leal-Taixé and Prof. Niessner 68
Convex, always converges to the same solution
Gradient descent for logistic regression
Prof. Leal-Taixé and Prof. Niessner 69
Explained in the following lectures when the following topics are introduced:• Computational graphs• Backpropagation of the gradients
Regularization
Prof. Leal-Taixé and Prof. Niessner 71
Regularization
Loss
L2 regularization
L1 regularization
Max norm regularization
Dropout
Prof. Leal-Taixé and Prof. Niessner 72
Regularization: example• Input: 3 features
• Two linear classifiers that give the same result:
Ignores 2 features
Takes information from all features
Prof. Leal-Taixé and Prof. Niessner 73
Regularization: example
Loss
L2 regularization
Minimization
Prof. Leal-Taixé and Prof. Niessner 74
Regularization: example
Loss
Minimization
L1 regularization
Prof. Leal-Taixé and Prof. Niessner 75
Regularization: example• Input: 3 features
• Two linear classifiers that give the same result:
Ignores 2 features
Takes information from all features
Prof. Leal-Taixé and Prof. Niessner 76
Regularization: example• Input: 3 features
• Two linear classifiers that give the same result:
L1 regularization enforces sparsity
Takes information from all features
Prof. Leal-Taixé and Prof. Niessner 77
Regularization: example• Input: 3 features
• Two linear classifiers that give the same result:
L1 regularization enforces sparsity
L2 regularization enforces that the weights have similar values
Prof. Leal-Taixé and Prof. Niessner 78
Regularization: effect• Dog classifier takes different inputs
Furry
Has two eyes
Has a tail
Has paws
Has two ears
L1 regularization will focus all the attention to a few key features
Prof. Leal-Taixé and Prof. Niessner 79
Regularization: effect• Dog classifier takes different inputs
Furry
Has two eyes
Has a tail
Has paws
Has two ears
L2 regularization will take all information into account to make decisions
Prof. Leal-Taixé and Prof. Niessner 80
Regularization
Credits: University of Washington
What is the goal of regularization?
What happens to the training error?
Prof. Leal-Taixé and Prof. Niessner 81
Regularization• Any strategy that aims to
Prof. Leal-Taixé and Prof. Niessner 82
Lower validation error
Increasing training error
Beyond linear
This lecture Next lectures
Prof. Leal-Taixé and Prof. Niessner 83
Next lectures• Next Monday: Introduction to Neural Networks
• This Thursday: Math refresh #2, particularly useful for the jump to NN!
Prof. Leal-Taixé and Prof. Niessner 84
Useful references• General Machine Learning book:
– Pattern Recognition and Machine Learning. C. Bishop.
Prof. Leal-Taixé and Prof. Niessner 85