Machine Learning: Intro & Linear Classifier
Olga Veksler
Outline
• Introduction to Machine Learning
• motivation of machine learning
• types of machine learning
• basic concepts
• overfitting, underfitting, generalization
• Optimization with gradient descent
• Linear Classifier
• Two classes
• Loss functions
• Gradient descent schemes
• Multiple classes
INTRO TO ML
Why use Machine Learning?
• Difficult to come up with an explicit program for some tasks
• Digit Recognition, a classic example
• Easy to collect images of digits with their correct labels
• Machine Learning algorithm takes the collected data and produces a program for recognizing digits
  • done right, the program will correctly recognize new images it has never seen
What is Machine Learning?
• General definition (Tom Mitchell):
• Based on experience E, improve performance on task T as measured by performance measure P
• Digit Recognition Example
• T = recognize character in the image
• P = percentage of correctly classified images
• E = dataset of human-labeled images of characters
Types of Machine Learning
• Supervised Learning
• given training examples with corresponding outputs (also called label, target, class, answer, etc.)
• learn to produce the correct output for a new example
• Unsupervised Learning
• given training examples only
• discover good data representation
• e.g. “natural” clusters
• k-means is the most widely known example
• Reinforcement Learning
• learn to select action that maximizes payoff
Supervised Machine Learning: subtypes
• Classification
  • output belongs to a finite set
  • example: age ∊ {baby, child, adult, elder}
  • output is also called class or label
• Regression
  • output is continuous
  • example: age ∊ [0, 130]
• Difference mostly in design of loss function
Supervised Machine Learning
• Have examples with corresponding outputs
• Example: fish classification, salmon or sea bass?
• Four example feature vectors with their labels:
x1 = (7.5, 3.3), x2 = (7.8, 3.6), x3 = (7.1, 3.2), x4 = (0.7, 4.6)
y1 = 0, y2 = 1, y3 = 0, y4 = 1
• Each example is represented by a vector
  • data may be given in vector form from the start
  • if not, for each example i, extract useful features and put them in a vector xi
• fish classification example
  • extract two features: fish length and average fish brightness
  • can extract as many other features as needed
  • for images, can also use raw pixel intensity or color as features
• an example is often called a feature vector
• yi is the output for example xi
Supervised Machine Learning
• Training phase
• estimate function y = f(x) from labeled data
  • f(x) is called classifier, learning machine, prediction function, etc.
• Testing phase (deployment)
• predict output f(x) for a new (unseen) sample x
• We are given labeled data:
1. training examples x1, x2, …, xn
2. the target output for each example: y1, y2, …, yn
Training/Testing Phases Illustrated
[Figure: in the training phase, feature vectors from training examples, together with the training labels, produce a learned model f; in the testing phase, the feature vector of a test image is fed to the learned model f to output a label prediction]
More on Training Phase
• Estimate prediction function y = f(x) from labeled data
• Choose hypothesis space f(x) belongs to
• hypothesis space f(x,w) is parameterized by vector of weights w
• each setting of w corresponds to a different hypothesis
• find f(x,w) in the hypothesis space s.t. f(xi,w) = yi “as much as possible” for training examples
• define “as much as possible” via some loss function L(f(x,w),y)
[Figure: hypothesis space containing candidate functions f(x,w1), f(x,w2), f(x,w3), f(x,w4), f(x,w5)]
Training Phase Example in 1D
• 2-class classification problem, yi ∊ {-1, 1}
• Training set: class -1: {-2, -1, 1};  class 1: {2, 3, 5}
• Hypothesis space f(x,w) = sign(w0 + w1x), with weight vector w = [w0, w1]ᵀ
• with w0 = -1, w1 = 2, f(x) = sign(-1 + 2x)
[Figure: the line -1 + 2x crosses 0 at x = 0.5; points left of 0.5 are assigned class -1, points right of it class 1]
Training Phase Example in 1D
• 2-class classification problem, yi ∊ {-1, 1}
• Training set: class -1: {-2, -1, 1};  class 1: {2, 3, 5}
• Hypothesis space f(x,w) = sign(w0 + w1x)
• with w0 = -1.5, w1 = 1, f(x) = sign(-1.5 + x)
[Figure: the line -1.5 + x crosses 0 at x = 1.5, which separates class -1 from class 1 perfectly on this training set]
• The process of finding good w is called weight tuning, or training
Training Phase Example in 2D
• For a 2-class problem and 2-dimensional samples:
f(x,w) = sign(w0 + w1x1 + w2x2)
[Figure: in the (x1, x2) plane, the decision boundary is a line splitting the plane into two decision regions]
• Can be generalized to examples of arbitrary dimension
• A classifier that makes a decision based on a linear combination of features is called a linear classifier
Training Phase: Linear Classifier
[Figure: two linear boundaries in the (x1, x2) plane; a bad setting of w gives 38% classification error, a better setting of w gives 4% classification error]
Training Stage: More Complex Classifier
• for example, f(x,w) a polynomial of high degree
• 0% classification error on the training data
[Figure: a “wiggly” boundary in the (x1, x2) plane separating all training samples perfectly]
Test Classifier on New Data
• The goal is to classify well on new data
• Test the “wiggly” classifier on new data: 25% error
[Figure: the same wiggly boundary misclassifies many new samples in the (x1, x2) plane]
Overfitting
• Have only a limited amount of data for training
• A complex model often has too many parameters to fit reliably with limited data
• A complex model may adapt too closely to the random noise of the training data, rather than capture the “big picture”
Overfitting: Extreme Example
• 2-class problem: face and non-face images
• Memorize (i.e. store) all the “face” images
• For a new image, see if it is one of the stored faces
  • if yes, output “face” as the classification result
  • if no, output “non-face”
• also called “rote learning”
• problem: new “face” images are different from the stored “face” examples
  • zero error on stored data, 50% error on test (new) data
  • decision boundary is very irregular
• Rote learning is memorization without generalization
slide is modified from Y. LeCun
Generalization
• The ability to produce correct outputs on previously unseen examples is called generalization
• Big question of learning theory: how to get good generalization with a limited number of examples
• Intuitive idea: favor simpler classifiers
  • William of Occam (1284-1347): “entities are not to be multiplied without necessity”
• A simpler decision boundary may not fit the training data ideally but tends to generalize better to new data
[Figure: a simple boundary fit on the training data still separates new data well]
Training and Testing
• How to diagnose overfitting?
• Divide all labeled samples x1, x2, …, xn into a training set and a test set
• Use the training set (training samples) to tune classifier weights w
• Use the test set (test samples) to see how well the classifier with tuned weights w works on unseen examples
• Thus there are 2 main phases in classifier design:
  1. training
  2. testing
Training Phase
• Find weights w s.t. f(xi,w) = yi “as much as possible” for training samples xi
• define “as much as possible”
  • penalty whenever f(xi,w) ≠ yi
  • via loss function L(f(xi,w), yi)
• how to search for a good setting of w?
  • usually through optimization
  • can be very time consuming
• classification error on training data is called training error
Testing Phase
• The goal is good performance on unseen examples
• Evaluate performance of the trained classifier f(x,w) on the test samples (unseen labeled samples)
• Testing on unseen labeled examples is an approximation of how well classifier will perform in practice
• If testing results are poor, may have to go back to the training phase and redesign f(x,w)
• Classification error on test data is called test error
• Side note
• when we “deploy” the final classifier f(x,w) in practice, this is also called testing
Underfitting
• Can also underfit the data, i.e. choose too simple a decision boundary
  • the chosen hypothesis space is not expressive enough
• No linear decision boundary can separate the samples well
• Training error is too high
  • test error is, of course, also high
Underfitting → Overfitting
• underfitting: high training error, high test error
• “just right”: low training error, low test error
• overfitting: low training error, high test error
Sketch of Supervised Machine Learning
• Choose a hypothesis space f(x,w)
  • w are tunable weights
  • x is the input sample
  • tune w so that f(x,w) gives the correct label for training samples x
• Which hypothesis space f(x,w) to choose?
  • has to be expressive enough to model our problem well, i.e. to avoid underfitting
  • yet not too complicated, to avoid overfitting
Classification System Design Overview
• Collect and label data by hand (e.g. salmon and sea bass images)
• Split data into training and test sets
• Preprocess data (e.g. segment fish from background)
• Extract possibly discriminating features
  • length, lightness, width, number of fins, etc.
• Classifier design
  • choose model for classifier
  • train classifier on training data
• Test classifier on test data
• we mostly look at the classifier design and testing steps in this course
OPTIMIZATION
Optimization
• How to minimize a function of a single variable?
J(x) = (x - 5)²
• From calculus: take the derivative and set it to 0:
(d/dx) J(x) = 0
• Solve the resulting equation
  • may be easy or hard to solve
• The example above is easy:
(d/dx) J(x) = 2(x - 5) = 0  ⟹  x = 5
Optimization
• How to minimize a function of many variables?
J(x) = J(x1, …, xd)
• From calculus: take partial derivatives and set them to 0:
∇J(x) = [∂J/∂x1, …, ∂J/∂xd]ᵀ = 0,  where ∇J(x) is the gradient
• Solve the resulting system of d equations
• It may not be possible to solve this system analytically
Optimization: Gradient Direction
• The gradient ∇J(x) points in the direction of steepest increase of the function J(x)
• -∇J(x) points in the direction of steepest decrease
[Figure: surface plot of J(x1, x2); picture from Andrew Ng]
Gradient Direction in 2D
• J(x1, x2) = (x1 - 5)² + (x2 - 10)², with global minimum at (5, 10)
• ∂J/∂x1 = 2(x1 - 5),  ∂J/∂x2 = 2(x2 - 10)
• Let a = [10, 5]ᵀ
• ∂J/∂x1(a) = 10,  ∂J/∂x2(a) = -10
• ∇J(a) = [10, -10]ᵀ,  so  -∇J(a) = [-10, 10]ᵀ
[Figure: in the (x1, x2) plane, -∇J(a) at a = (10, 5) points toward the global minimum at (5, 10)]
Gradient Descent: Step Size
• J(x1, x2) = (x1 - 5)² + (x2 - 10)²
• Which step size to take?
• Controlled by the parameter α, called the learning rate
• From the previous slide: a = [10, 5]ᵀ,  -∇J(a) = [-10, 10]ᵀ
• Let α = 0.2:
a - α∇J(a) = [10, 5]ᵀ + 0.2·[-10, 10]ᵀ = [8, 7]ᵀ
• J(10, 5) = 50;  J(8, 7) = 18
Gradient Descent Algorithm
[Figure: successive points x(1), x(2), …, x(k) step downhill along -∇J(x) until ∇J(x(k)) = 0]
k = 1
x(1) = any initial guess
choose α, ε
while ||∇J(x(k))|| > ε
    x(k+1) = x(k) - α∇J(x(k))
    k = k + 1
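To make the loop concrete, here is a minimal NumPy sketch (not from the slides) that runs this algorithm on the earlier example J(x1, x2) = (x1 - 5)² + (x2 - 10)²:

```python
import numpy as np

def grad_J(x):
    # gradient of J(x1, x2) = (x1 - 5)^2 + (x2 - 10)^2
    return np.array([2 * (x[0] - 5), 2 * (x[1] - 10)])

x = np.array([10.0, 5.0])  # initial guess, a = [10, 5] from the slides
alpha, eps = 0.2, 1e-6     # learning rate and stopping tolerance
while np.linalg.norm(grad_J(x)) > eps:
    x = x - alpha * grad_J(x)
print(x)  # converges to the global minimum (5, 10)
```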
Gradient Descent: Local Minimum
• Not guaranteed to find the global minimum
  • gets stuck in a local minimum, where -∇J(x(k)) = 0
[Figure: starting from x(1), the descent steps settle into a local minimum and miss the global minimum]
• Still, gradient descent is very popular because it is simple and applicable to any differentiable function
How to Set Learning Rate α?
• If α is too large, may overshoot the local minimum and possibly never even converge
• If α is too small, too many iterations are needed to converge
• It helps to plot J(x) as a function of the iteration number, to make sure we are properly minimizing it
[Figure: with a large α the iterates x(1), x(2), … jump back and forth across the minimum of J(x); with a small α they crawl toward it]
Variable Learning Rate
• If desired, can change the learning rate at each iteration

Fixed learning rate:
k = 1
x(1) = any initial guess
choose α, ε
while ||∇J(x(k))|| > ε
    x(k+1) = x(k) - α∇J(x(k))
    k = k + 1

Variable learning rate:
k = 1
x(1) = any initial guess
choose ε
while ||∇J(x(k))|| > ε
    choose α(k)
    x(k+1) = x(k) - α(k)∇J(x(k))
    k = k + 1
Variable Learning Rate
• Usually do not keep track of all intermediate solutions:

k = 1
x = any initial guess
choose α, ε
while ||∇J(x)|| > ε
    x = x - α∇J(x)
    k = k + 1
Learning Rate
• Monitor the learning rate by looking at how fast the objective function decreases
[Figure: J(x) vs. number of iterations; a very high learning rate makes the loss diverge, a high rate plateaus at a poor value, a low rate decreases very slowly, a good rate decreases quickly to a low value]
Learning Rate: Loss Surface Illustration
[Figure: gradient descent paths on a loss surface for learning rates α = 0.001, 0.01, 0.1; a smaller α needs many more updates (on the order of 3k vs. 0.3k), while a larger α takes bigger, less stable steps]
Linear Classifiers
Supervised Machine Learning: Recap
• Choose the type of f(x,w)
  • w are tunable weights, x is the input example
  • f(x,w) should output the correct class of sample x
  • use labeled samples to tune weights w so that f(x,w) gives the correct class y for x in the training data
  • loss function L(f(x,w), y) measures how far we are from this
• How to choose the type of f(x,w)?
  • many choices
  • simple choice: linear classifier
  • other choices
    • Neural Network
    • Convolutional Neural Network
Linear Classifier
• Use
  • y = 1 for the first class
  • y = -1 for the second class
• One choice for a linear classifier:
f(x,w) = sign(g(x,w))
  • 1 if g(x,w) is positive
  • -1 if g(x,w) is negative
[Figure: f(x) = sign(g(x)) jumps from -1 to 1 where g(x) crosses zero]
• A classifier that makes a decision based on a linear combination of features:
g(x,w) = w0 + x1w1 + … + xdwd
• g(x,w) is called the discriminant function
Linear Classifier: Decision Boundary
• f(x,w) = sign(g(x,w)) = sign(w0 + x1w1 + … + xdwd)
• Decision boundary is linear
• Find w0, w1, …, wd that give the best separation of the two classes with a linear boundary
[Figure: two candidate lines through the data; one is a bad boundary, the other a better boundary]
More on Linear Discriminant Function (LDF)
• LDF: g(x,w,w0) = w0 + x1w1 + … + xdwd
• w = [w1, w2, …, wd]ᵀ;  w0 is called the bias or threshold
[Figure: in the (x1, x2) plane, the decision boundary is g(x) = 0; g(x) > 0 is the decision region for class 1, and g(x) < 0 the decision region for class 2]
More on Linear Discriminant Function (LDF)
• Decision boundary: g(x,w) = w0+x1w1 + … + xdwd = 0
• This is a hyperplane, by definition
• a point in 1D
• a line in 2D
• a plane in 3D
• a hyperplane in higher dimensions
Vector Notation
• Linear discriminant function: g(x, w, w0) = wᵀx + w0
• Example in 2D:
g(x, w, w0) = 3x1 + 2x2 + 4,  with w = [3, 2]ᵀ,  w0 = 4
• Shorter notation if we add an extra feature of value 1 to x:
x = [x1, x2]ᵀ,  a = [4, 3, 2]ᵀ,  z = [1, x1, x2]ᵀ
g(z, a) = aᵀz = 4 + 3x1 + 2x2 = wᵀx + w0 = g(x, w, w0)
• Use aᵀz instead of wᵀx + w0
Fitting Parameters w
• Rewrite g(x, w, w0) = [w0  w1 … wd]·[1  x1 … xd]ᵀ = aᵀz = g(z, a)
  • new feature vector z = [1, x1, …, xd]ᵀ
  • new weight vector a = [w0, w1, …, wd]ᵀ
• z is called the augmented feature vector
• the new problem is equivalent to the old one: g(z, a) = aᵀz
• f(z, a) = sign(g(z, a))
Solution Region
• If there is an a that classifies all examples correctly, it is called a separating or solution vector
  • then there are infinitely many solution vectors a
  • the original samples x1, …, xn are then linearly separable
[Figure: the solution region in weight space contains many valid vectors a]
Loss Function
• How to find a solution vector a?
  • or, if no separating a exists, a good approximation a?
• Design a non-negative loss function L(a)
  • L(a) is small if a is good
  • L(a) is large if a is bad
• Minimize L(a) with gradient descent
• Two steps in the design of L(a):
  1. per-example loss L(f(zi,a), yi)
    • penalizes deviations of f(zi,a) from yi
  2. total loss adds up the per-example losses over all training examples:
L(a) = Σi L(f(zi,a), yi)
Loss Function, First Attempt
• Per-example loss function “counts” whether an error happens:
L(f(zi,a), yi) = 0 if f(zi,a) = yi,  1 otherwise
• Example: z1 = [1, 2]ᵀ, y1 = 1;  z2 = [1, 4]ᵀ, y2 = -1;  a = [2, -3]ᵀ
f(z1,a) = sign(aᵀz1) = sign(2·1 - 3·2) = -1  ⟹  L(f(z1,a), y1) = 1
f(z2,a) = sign(aᵀz2) = sign(2·1 - 3·4) = -1  ⟹  L(f(z2,a), y2) = 0
Loss Function, First Attempt
• Per-example loss function “counts” whether an error happens:
L(f(zi,a), yi) = 0 if f(zi,a) = yi,  1 otherwise
• Total loss function counts the total number of errors:
L(a) = Σi L(f(zi,a), yi)
• For the previous example, a = [2, -3]ᵀ gives L(f(z1,a), y1) = 1 and L(f(z2,a), y2) = 0
• Total loss: L(a) = 1 + 0 = 1
Loss Function: First Attempt
• The “error count” loss function is not suitable for gradient descent
  • piecewise constant: the gradient is zero or does not exist
[Figure: L(a) as a staircase-shaped function of a]
Perceptron Loss Function
• Different loss function: perceptron loss
Lp(f(zi,a), yi) = 0 if f(zi,a) = yi,  -yi(aᵀzi) otherwise
• Lp(a) is non-negative:
  • positive misclassified example zi:  aᵀzi < 0 and yi = 1, so yi(aᵀzi) < 0
  • negative misclassified example zi:  aᵀzi > 0 and yi = -1, so yi(aᵀzi) < 0
  • so if zi is misclassified then yi(aᵀzi) < 0, i.e. -yi(aᵀzi) > 0
• Lp(a) is proportional to the distance of a misclassified example to the boundary
Perceptron Loss Function
Lp(f(zi,a), yi) = 0 if f(zi,a) = yi,  -yi(aᵀzi) otherwise
• Example: z1 = [1, 2]ᵀ, y1 = 1;  z2 = [1, 4]ᵀ, y2 = -1;  a = [2, -3]ᵀ
f(z1,a) = sign(aᵀz1) = sign(2·1 - 3·2) = sign(-4) = -1  ⟹  Lp(f(z1,a), y1) = -y1(aᵀz1) = 4
f(z2,a) = sign(aᵀz2) = sign(2·1 - 3·4) = -1  ⟹  Lp(f(z2,a), y2) = 0
• Total loss: Lp(a) = 4 + 0 = 4
Perceptron Loss Function
• Per-example loss:
Lp(f(zi,a), yi) = 0 if f(zi,a) = yi,  -yi(aᵀzi) otherwise
• Total loss:
Lp(a) = Σi Lp(f(zi,a), yi)
• Lp(a) is piecewise linear and suitable for gradient descent
[Figure: Lp(a) as a piecewise-linear function of a]
Optimizing with Gradient Descent
• Gradient descent to minimize Lp(a):
a = a - α∇Lp(a)
• Need the gradient vector ∇Lp(a)
  • it has the same dimension as a
  • e.g. for a = [2, 3, 1]ᵀ,  ∇Lp(a) = [∂Lp/∂a1, ∂Lp/∂a2, ∂Lp/∂a3]ᵀ
• Per-example loss:
Lp(f(zi,a), yi) = 0 if f(zi,a) = yi,  -yi(aᵀzi) otherwise
• Total loss:
Lp(a) = Σi Lp(f(zi,a), yi)
Optimizing with Gradient Descent
• Gradient descent to minimize Lp(a):
a = a - α∇Lp(a)
• The total loss Lp(a) = Σi Lp(f(zi,a), yi) is a sum over examples, so its gradient is the sum of the per-example gradients:
∇Lp(a) = Σi ∇Lp(f(zi,a), yi)
• Compute and add up the per-example gradients:
∇Lp(f(zi,a), yi) = [∂Lp(f(zi,a), yi)/∂a1, ∂Lp(f(zi,a), yi)/∂a2, ∂Lp(f(zi,a), yi)/∂a3]ᵀ
Per-Example Loss Gradient
• The per-example loss has two cases:
Lp(f(zi,a), yi) = 0 if f(zi,a) = yi,  -yi(aᵀzi) otherwise
• So its gradient has two cases:
∇Lp(f(zi,a), yi) = 0 if f(zi,a) = yi,  ? otherwise
• First case: if f(zi,a) = yi, the loss is a constant 0, so ∇Lp(f(zi,a), yi) = 0
Per-Example Loss Gradient
• Second case: f(zi,a) ≠ yi. Then Lp(f(zi,a), yi) = -yi(aᵀzi) = -yi(a1zi1 + a2zi2 + a3zi3), so
∇Lp(f(zi,a), yi) = [∂(-yi aᵀzi)/∂a1, ∂(-yi aᵀzi)/∂a2, ∂(-yi aᵀzi)/∂a3]ᵀ = [-yi zi1, -yi zi2, -yi zi3]ᵀ = -yi zi
• Gradient of the per-example loss:
∇Lp(f(zi,a), yi) = 0 if f(zi,a) = yi,  -yi zi otherwise
Optimizing with Gradient Descent
• Simpler formula:
∇Lp(a) = Σ(over misclassified examples zi) -yi zi
• Gradient descent update rule for Lp(a):
a = a + α Σ(over misclassified examples zi) yi zi
• called batch because it is based on all examples
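A minimal NumPy sketch of this batch perceptron update (names are illustrative; Z holds one augmented example per row, Y holds the ±1 labels):

```python
import numpy as np

def perceptron_batch(Z, Y, alpha=0.2, epochs=100):
    a = np.ones(Z.shape[1])                # initial weight vector
    for _ in range(epochs):
        mis = Y * (Z @ a) < 0              # misclassified: y_i (a^T z_i) < 0
        if not mis.any():
            break                          # all examples classified correctly
        a = a + alpha * (Y[mis] @ Z[mis])  # a = a + alpha * sum of y_i z_i
    return a
```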
Perceptron Loss Batch Example
• Examples: x1 = (2, 3), x2 = (4, 3), x3 = (3, 5) in class 1;  x4 = (1, 3), x5 = (5, 6) in class 2
• Labels: y1 = 1, y2 = 1, y3 = 1, y4 = -1, y5 = -1
• Add the extra feature 1:
z1 = [1, 2, 3]ᵀ, z2 = [1, 4, 3]ᵀ, z3 = [1, 3, 5]ᵀ, z4 = [1, 1, 3]ᵀ, z5 = [1, 5, 6]ᵀ
• Pile all examples as rows into matrix Z, and all labels into column vector Y:
Z = [1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6],  Y = [1, 1, 1, -1, -1]ᵀ
Perceptron Loss Batch Example
• Examples in Z, labels in Y:
Z = [1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6],  Y = [1, 1, 1, -1, -1]ᵀ
• Initial weights: a = [1, 1, 1]ᵀ
• This is the line x1 + x2 + 1 = 0
Perceptron Loss Batch Example
• Examples in Z, labels in Y as above;  a = [1, 1, 1]ᵀ
• Perceptron batch update:
a = a + α Σ(over misclassified examples zi) yi zi
• A sample is misclassified if y(aᵀz) < 0
• Let us use learning rate α = 0.2:
a = a + 0.2 Σ(over misclassified examples zi) yi zi
Perceptron Loss Batch Example
• With Z, Y as above and a = [1, 1, 1]ᵀ, can compute y(aᵀz) for all samples at once:
Y .* (Z·a) = [6, 8, 9, -5, -12]ᵀ
• A sample is misclassified if y(aᵀz) < 0
• Per-example loss: Lp(f(zi,a), yi) = 0 if f(zi,a) = yi,  -yi(aᵀzi) otherwise
• Total loss: L(a) = 5 + 12 = 17
Perceptron Loss Batch Example
• Samples 4 and 5 are misclassified
• Perceptron batch rule update with α = 0.2:
a = a + 0.2(y4 z4 + y5 z5)
  = [1, 1, 1]ᵀ + 0.2·(-[1, 1, 3]ᵀ - [1, 5, 6]ᵀ)
  = [1, 1, 1]ᵀ + [-0.4, -1.2, -1.8]ᵀ
  = [0.6, -0.2, -0.8]ᵀ
• This is the line -0.2x1 - 0.8x2 + 0.6 = 0
Perceptron Loss Batch Example
• With Z, Y as above and a = [0.6, -0.2, -0.8]ᵀ, find all misclassified samples:
Y .* (Z·a) = [-2.2, -2.6, -4.0, 2, 5.2]ᵀ
• A sample is misclassified if y(aᵀz) < 0: now samples 1, 2, 3 are misclassified
• Total loss: L(a) = 2.2 + 2.6 + 4 = 8.8
  • previous loss was 17 with 2 misclassified examples
Perceptron Loss Batch Example
• Perceptron batch rule update (samples 1, 2, 3 misclassified):
a = a + 0.2(y1 z1 + y2 z2 + y3 z3)
  = [0.6, -0.2, -0.8]ᵀ + 0.2·([1, 2, 3]ᵀ + [1, 4, 3]ᵀ + [1, 3, 5]ᵀ)
  = [0.6, -0.2, -0.8]ᵀ + [0.6, 1.8, 2.2]ᵀ
  = [1.2, 1.6, 1.4]ᵀ
• This is the line 1.6x1 + 1.4x2 + 1.2 = 0
Perceptron Loss Batch Example
• With Z, Y as above and a = [1.2, 1.6, 1.4]ᵀ, find all misclassified samples:
Y .* (Z·a) = [8.6, 11.8, 13.0, -7, -17.6]ᵀ
• Samples 4 and 5 are misclassified
• Total loss: L(a) = 7 + 17.6 = 24.6
  • previous loss was 8.8 with 3 misclassified examples
  • the loss went up, which means the learning rate of 0.2 is too high
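These three iterations are easy to reproduce numerically; a small NumPy check of the numbers above (illustrative, not part of the slides):

```python
import numpy as np

Z = np.array([[1, 2, 3], [1, 4, 3], [1, 3, 5], [1, 1, 3], [1, 5, 6]], dtype=float)
Y = np.array([1, 1, 1, -1, -1], dtype=float)
a = np.ones(3)                       # initial weights [1, 1, 1]

for _ in range(3):
    margins = Y * (Z @ a)            # y_i (a^T z_i) for every sample
    mis = margins < 0                # misclassified samples
    print(a, -margins[mis].sum())    # prints losses 17.0, 8.8, 24.6
    a = a + 0.2 * (Y[mis] @ Z[mis])  # batch perceptron update, alpha = 0.2
```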
Single-Sample Gradient Descent
• Batch gradient descent can be slow to converge if there are lots of examples
• Single-sample optimization
  • update weights a as soon as possible, i.e. after seeing 1 example
• One iteration (epoch):
  • go over all examples, in random order
  • update after seeing each example zi, using the per-example gradient:
a = a - α∇Lp(f(zi, a), yi)
Batch Size: Loss Surface Illustration
[Figure: one iteration of single-sample gradient descent zig-zags from start to finish toward the global minimum, since each step sees only one example; one iteration of batch gradient descent moves directly from start to finish toward the global minimum]
Mini-Batch Gradient Descent
• Mini-batch optimization
  • update weights a after seeing m examples
• One iteration (epoch):
  • go over all examples, in random order
  • update after seeing each m examples, using the averaged per-example gradients:
a = a - (α/m) Σ ∇Lp(f(zi, a), yi)
• Most often used in practice
• Practical issue: both batch and mini-batch algorithms converge faster if features are roughly on the same scale
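A sketch of one mini-batch epoch for the perceptron loss, under the same conventions as above (the function name, m, and the RNG seed are illustrative):

```python
import numpy as np

def minibatch_epoch(Z, Y, a, alpha=0.1, m=2, rng=np.random.default_rng(0)):
    order = rng.permutation(len(Z))    # visit examples in random order
    for start in range(0, len(Z), m):
        idx = order[start:start + m]   # current mini-batch (size <= m)
        Zb, Yb = Z[idx], Y[idx]
        mis = Yb * (Zb @ a) < 0        # misclassified within the batch
        if mis.any():                  # average of per-example gradients
            a = a + (alpha / len(idx)) * (Yb[mis] @ Zb[mis])
    return a
```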
Logistic Regression, Motivation
• Recall line fitting
[Figure: two line fits to 1D data with labels ±1]
  • the solution with squared error loss misclassifies one sample
  • the solution with perceptron loss classifies all samples correctly
• Instead of trying to get aᵀz close to y:
  • use a differentiable function Ϭ(aᵀz) that “squishes” the range
  • try to get Ϭ(aᵀz) close to y
Linear Classifier: Logistic Regression
• Use the logistic sigmoid function Ϭ(t) for “squishing” aᵀz:
Ϭ(t) = 1 / (1 + exp(-t))
[Figure: Ϭ(t) rises smoothly from 0 to 1, passing through 0.5 at t = 0]
• Denote the classes with 1 and 0 now
  • yi = 1 for the positive class, yi = 0 for the negative class
• Despite the “regression” in the name, logistic regression is used for classification, not regression
Logistic Regression vs. Regression
[Figure: loss as a function of aᵀz for a sample of class 1; the quadratic loss keeps penalizing as aᵀz grows, while the logistic regression loss on Ϭ(aᵀz) keeps decreasing]
Logistic Regression: Loss Function
• Could use (yi - Ϭ(aᵀzi))² as the per-example loss function
• Instead use a different loss:
  • if z has label 1, want Ϭ(aᵀz) close to 1, so define the loss as -log[Ϭ(aᵀz)]
  • if z has label 0, want Ϭ(aᵀz) close to 0, so define the loss as -log[1 - Ϭ(aᵀz)]
[Figure: -log t decreases to 0 as t approaches 1]
• Gradient descent batch update rule:
a = a + α Σi (yi - Ϭ(aᵀzi)) zi
• Probabilistic interpretation
  • P(class 1) = Ϭ(aᵀz)
  • P(class 0) = 1 - P(class 1)
  • the loss function is the negative log-likelihood, -log P(y)
  • standard objective in statistics
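A minimal NumPy sketch of this batch update (illustrative; Y holds the 0/1 labels, Z holds augmented examples as rows):

```python
import numpy as np

def sigmoid(t):
    # logistic sigmoid from the slides: 1 / (1 + exp(-t))
    return 1.0 / (1.0 + np.exp(-t))

def train_logistic(Z, Y, alpha=0.1, epochs=1000):
    a = np.zeros(Z.shape[1])
    for _ in range(epochs):
        # batch rule: a = a + alpha * sum_i (y_i - sigmoid(a^T z_i)) z_i
        a = a + alpha * ((Y - sigmoid(Z @ a)) @ Z)
    return a
```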
Logistic Regression vs. Perceptron
• An example classified correctly but close to the decision boundary has
  • no loss under the perceptron loss
  • non-negligible loss under logistic regression
• Logistic regression encourages the decision boundary to move away from the training samples
  • better generalization
[Figure: two boundaries that both have zero perceptron loss; the one farther from the training samples has the smaller logistic regression loss and works better for new data]
Logistic Regression vs. Regression vs. Perceptron
• Assuming the labels are +1 and -1, plot each loss as a function of y·aᵀz:
  • y·aᵀz < 0: misclassified
  • y·aᵀz small and positive: classified correctly but close to the decision boundary
  • y·aᵀz large: classified correctly and not too close to the decision boundary
[Figure: the quadratic loss penalizes even large positive y·aᵀz; the perceptron loss is zero for all positive y·aᵀz; the logistic regression loss decreases smoothly toward zero]
Multiple Classes: General Case
• Define m linear discriminant functions:
gi(x) = wiᵀx + wi0  for i = 1, 2, …, m
• Assign x to class i if
gi(x) > gj(x)  for all j ≠ i
• Let Ri be the decision region for class i
  • all points in Ri are assigned to class i
[Figure: three decision regions; e.g. R1 is the region where g1(x) > g2(x) and g1(x) > g3(x)]
Multiple Classes
• Can be shown that decision regions are convex
• In particular, they must be spatially contiguous
Multiclass Linear Classifier: Matrix Notation
• Assume examples x are augmented with the extra feature 1, so there is no need to write the bias explicitly
  • but from now on we will not change notation to z's
• Define m discriminant functions:
gi(x) = wiᵀx  for i = 1, 2, …, m
• Assign x to the i that gives the maximum gi(x)
• Picture illustration: pile all outputs into one vector, e.g.
[g1(x), g2(x), g3(x), g4(x)]ᵀ = [5, 3, -9, 10]ᵀ  →  decide class 4
Multiclass Linear Classifier: Matrix Notation
• Could use a one-dimensional output yi ∊ {1, 2, 3, …, m}
• Convenient to use multi-dimensional outputs instead (one-hot encoding)
  • if sample j is of class i, want the output vector yj to be 0 everywhere except position i, where it should be 1
  • class 1: yj = [1, 0, 0, 0]ᵀ;  class 2: yj = [0, 1, 0, 0]ᵀ;  class 3: yj = [0, 0, 1, 0]ᵀ;  class 4: yj = [0, 0, 0, 1]ᵀ
• Example: x is of class 2
  • got [g1(x), g2(x), g3(x), g4(x)]ᵀ = [5, 3, -9, 10]ᵀ
  • want [0, 1, 0, 0]ᵀ
Multiclass Linear Classifier: Matrix Notation
• Assign x to the i that gives the maximum gi(x) = wiᵀx
• In matrix notation, pile the weight vectors w1ᵀ, …, w4ᵀ as the rows of a matrix W; the scores are then Wx
• Example:
W = [2 4 -7; 9 -3 2; 4 5 2; 2 -7 1],  x = [1, 7, 4]ᵀ
Wx = [2, -4, 47, -43]ᵀ
• Assign x to the class that corresponds to the largest row of Wx: here class 3
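In code, this matrix-notation prediction is one line; a quick check with the numbers reconstructed above (illustrative):

```python
import numpy as np

W = np.array([[2, 4, -7], [9, -3, 2], [4, 5, 2], [2, -7, 1]])  # one row per class
x = np.array([1, 7, 4])       # augmented example
scores = W @ x                # [g1(x), g2(x), g3(x), g4(x)] = [2, -4, 47, -43]
print(np.argmax(scores) + 1)  # largest row wins: class 3
```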
Quadratic Loss Function
• Assign sample xi to the class that corresponds to the largest row of Wxi
• Loss function?
• Example: Wxi = [2, -4, 47, -43]ᵀ,  yi = [0, 0, 0, 1]ᵀ
• Can use the quadratic loss per sample xi: ½||Wxi - yi||²
  • for the example above, the loss is (2² + 4² + 47² + 44²)/2
• total loss on all training samples: L(W) = ½ Σi ||Wxi - yi||²
• gradient of the loss:
∇L(W) = Σi (Wxi - yi) xiᵀ
  • ∇L(W) has the same shape as W
• batch gradient descent update:
W = W - α Σi (Wxi - yi) xiᵀ
Quadratic Loss Function
• Consider the gradient descent update for a single sample x with α = 1:
W = W - (Wx - y)xᵀ
• Suppose x = [1, 3, 2]ᵀ is of class 2, so y = [0, 1, 0, 0]ᵀ, and
W = [2 4 -7; 9 -3 2; 4 5 2; 2 -7 1]
• Then Wx = [0, 4, 23, -17]ᵀ, and
Wx - y = [0, 3, 23, -17]ᵀ   (the first score is ok; the others are too large or too small)
• Update rule:
W = W - (Wx - y)xᵀ = W - [0 0 0; 3 9 6; 23 69 46; -17 -51 -34] = [2 4 -7; 6 -12 -4; -19 -64 -44; 19 44 35]
Quadratic Loss Function
• With the new W, Wx = [0, -38, -299, 221]ᵀ, while we want [0, 1, 0, 0]ᵀ: the scores that were too large or too small have now overshot in the other direction
• Already saw that the quadratic loss does not work that well for classification
Softmax Function
• Define the softmax(a) function: for a = [a1, a2, a3, a4]ᵀ,
softmax(a) = [exp(a1)/Σj exp(aj),  exp(a2)/Σj exp(aj),  exp(a3)/Σj exp(aj),  exp(a4)/Σj exp(aj)]ᵀ
• Example:
softmax([3, 2, -2]ᵀ) = [exp(3), exp(2), exp(-2)]ᵀ / (exp(3) + exp(2) + exp(-2)) ≈ [0.7275, 0.2676, 0.005]ᵀ
• Softmax renormalizes a vector so that it can be interpreted as a vector of probabilities
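A minimal NumPy version (the max subtraction is a standard numerical-stability trick, an addition not on the slides):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())  # subtracting max(a) leaves the result unchanged
    return e / e.sum()       # but avoids overflow for large scores

print(softmax(np.array([3.0, 2.0, -2.0])))  # ≈ [0.7275, 0.2676, 0.0049]
```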
Softmax Loss Function
• Generalization of logistic regression to the multiclass case
• Instead of the raw scores
[w1ᵀx, w2ᵀx, w3ᵀx, w4ᵀx]ᵀ = [2, -1, 5, -3]ᵀ
• use the softmax scores
softmax([w1ᵀx, w2ᵀx, w3ᵀx, w4ᵀx]ᵀ) = [Pr(class 1), Pr(class 2), Pr(class 3), Pr(class 4)]ᵀ ≈ [0.0473, 0.0024, 0.9500, 0.0003]ᵀ
• The classifier output is interpreted as a probability for each class
Gradient Descent: Softmax Loss Function
• Optimize under the -log Pr(yi) loss function
• Example: xi = [1, 3, 2]ᵀ of class 4, so yi = [0, 0, 0, 1]ᵀ, and
W = [2 4 -7; 9 -3 2; 4 5 2; 2 -7 1],  Wxi = [0, 4, 23, -17]ᵀ
softmax(Wxi) = [Pr(class 1), Pr(class 2), Pr(class 3), Pr(class 4)]ᵀ ≈ [0.0000000001, 0.0000000056, 0.9999999943, 0.000000000000000004]ᵀ
• The loss on this example is -log Pr(class 4) ≈ 40
Gradient Descent: Softmax Loss Function
• Update rule for the weight matrix W:
W = W + α Σi (yi - softmax(Wxi)) xiᵀ
• Example: single-sample gradient descent with α = 0.1, xi = [1, 3, 2]ᵀ, yi = [0, 0, 0, 1]ᵀ,
W = [2 4 -7; 9 -3 2; 4 5 2; 2 -7 1],  Wxi = [0, 4, 23, -17]ᵀ
• Update for W:
W = W + 0.1·([0, 0, 0, 1]ᵀ - softmax([0, 4, 23, -17]ᵀ))·[1, 3, 2]
  ≈ [2 4 -7; 9 -3 2; 3.9 4.7 1.8; 2.1 -6.7 1.2]
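A small NumPy sketch of this single-sample softmax update, using the numbers from the example above (illustrative):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())  # stable softmax
    return e / e.sum()

W = np.array([[2, 4, -7], [9, -3, 2], [4, 5, 2], [2, -7, 1]], dtype=float)
x = np.array([1.0, 3.0, 2.0])
y = np.array([0.0, 0.0, 0.0, 1.0])  # one-hot label, class 4
W = W + 0.1 * np.outer(y - softmax(W @ x), x)  # update with alpha = 0.1
print(np.round(W, 1))  # rows 3 and 4 move toward the wanted one-hot output
```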
More General Discriminant Functions
• Linear discriminant functions
  • simple decision boundary
  • should try simpler models first, to avoid overfitting
  • optimal for certain types of data
    • Gaussian distributions with equal covariance
  • may not be optimal for other data distributions
• Discriminant functions can be more general than linear
  • for example, polynomial discriminant functions
  • decision boundaries more complex than linear
• Later we will look more at non-linear discriminant functions
Summary
• Linear classifier works well when examples are linearly separable, or almost separable
• The perceptron loss function was the first, historic loss
• Logistic regression/softmax work better in practice
• Optimization with gradient descent
  • stochastic mini-batch works best in practice
Linear Classifier: Quadratic Loss
• Quadratic per-example loss:
L(f(zi,a), yi) = ½(aᵀzi - yi)²
• This is standard line fitting
  • note that even correctly classified examples can have a large loss
• Can find the optimal weights a analytically with least squares
  • expensive for large problems
• Gradient descent is more efficient for larger problems:
∇L(a) = -Σi (yi - aᵀzi) zi
• Batch update rule:
a = a + α Σi (yi - aᵀzi) zi
[Figure: a line aᵀz fit to 1D data with labels ±1]
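A quick NumPy sketch of the least-squares route (illustrative; reusing the 1D training set from the earlier slides, augmented with the extra feature 1):

```python
import numpy as np

# class -1: {-2, -1, 1},  class 1: {2, 3, 5}, each augmented as z = [1, x]
Z = np.array([[1, -2], [1, -1], [1, 1], [1, 2], [1, 3], [1, 5]], dtype=float)
Y = np.array([-1, -1, -1, 1, 1, 1], dtype=float)

a, *_ = np.linalg.lstsq(Z, Y, rcond=None)  # minimizes ||Z a - Y||^2 analytically
print(a)                                   # fitted weights [a0, a1]
print(np.sign(Z @ a))                      # predicted labels for the training set
```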