Basics of Machine Learning
COMPSCI 527 — Computer Vision
COMPSCI 527 — Computer Vision Basics of Machine Learning 1 / 23
Outline
1 Classification and Regression
2 Loss and Risk
3 Training, Validation, Testing
4 The State of the Art of Image Classification
Classification and Regression
A First Classification Problem
• MNIST handwritten digit recognition (60,000 images, labeled, curated)
• 28 × 28 pixel black-and-white images of individual digits
• What is the label y ∈ Y = {0, 1, . . . , 9} for a given input image x?
Supervised Machine Learning
• Supervised machine learning is the problem of learning a function y = h(x) : X ⊆ ℝ^d → Y ⊆ ℝ^e from sample input/output pairs (x, y)
• "Supervised" means that the samples are provided
• Depending on the problem, h may map an image, an image window, or a set of images x to
  • A yes or no answer to the question "Is this a [person, car, cat]?": Y = {yes, no} for object detection
  • A category out of a small set: Y = {0, 1, . . . , 9} for digit recognition
  • A category out of a large set: Y = {person, car, . . . , tree} for object recognition
  • A number or small vector of numbers: Y = ℝ^5 for camera motion
  • A whole field (array) of numbers: Y = ℝ^(2×1000×1000) for image motion estimation
Classification and Regression
• Two types of supervised machine learning problems:
  • Classification: Y is categorical, i.e., finite and unordered (for digit recognition, the order in Y = {0, 1, . . . , 9} does not matter); y is then called a label
    • Examples: object detection, object recognition, foreground/background segregation
  • Regression: Y = ℝ^e; y is then called a value or response
    • Examples: camera motion estimation, depth from stereo, image motion estimation, object tracking
Is This Just Data Fitting?
• Supervised machine learning is the problem of learning a function y = h(x) : X ⊆ ℝ^d → Y ⊆ ℝ^e from sample input/output pairs (x, y)
[Figure: polynomial fits of degree k = 1, 3, 9 to the same data points]
Loss and Risk
Loss
• To know if the learned function h does well on sample (x, y), we need to measure how far the value it predicts for x is from the true value y
• The loss is a measure of the discrepancy between y and ŷ = h(x)
• The loss function maps pairs (ŷ, y) to real values: ℓ : Y × Y → ℝ
• Simplest loss for classification: the zero-one loss or misclassification loss
  ℓ(ŷ, y) = I(ŷ ≠ y) = 1 if ŷ ≠ y, 0 if ŷ = y
• Simplest loss for regression: the quadratic loss ℓ(ŷ, y) = (ŷ − y)²
• Different problems call for different measures of loss
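The two losses above can be written directly in code (a minimal sketch; the function names are mine, not from the slides):

```python
def zero_one_loss(y_hat, y):
    """Misclassification loss: 1 if the prediction is wrong, 0 if it is right."""
    return 1 if y_hat != y else 0

def quadratic_loss(y_hat, y):
    """Squared error between the predicted and the true value."""
    return (y_hat - y) ** 2

# A wrong digit prediction costs 1 regardless of which digit was predicted;
# a regression miss costs its squared distance from the truth
assert zero_one_loss(3, 8) == 1 and zero_one_loss(3, 3) == 0
assert quadratic_loss(2.5, 3.0) == 0.25
```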
Empirical Risk
• Given a training set T = {(x_1, y_1), . . . , (x_N, y_N)} with x_n ∈ X and y_n ∈ Y, a loss function ℓ, and a set H of allowed functions (e.g., polynomials), the empirical risk or training error is the average loss on T:
  L_T(h) = (1/N) Σ_{n=1}^N ℓ(y_n, h(x_n))
• This is what we minimize in a data fitting problem: estimate ĥ ∈ arg min_{h ∈ H} L_T(h)
• H is called the hypothesis space
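The empirical risk is just an average of per-sample losses, which takes one line of code (a sketch; the function and variable names are illustrative):

```python
def empirical_risk(loss, h, samples):
    """Average loss of predictor h over a list of (x, y) pairs."""
    return sum(loss(h(x), y) for x, y in samples) / len(samples)

# Toy training set and the constant predictor h(x) = 0, under zero-one loss:
# h is right on two of the four samples, so the empirical risk is 0.5
T = [(0, 0), (1, 1), (2, 0), (3, 1)]
h = lambda x: 0
risk = empirical_risk(lambda y_hat, y: 1 if y_hat != y else 0, h, T)
```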
Machine Learning and the Statistical Risk
• Given a training set T = {(x_1, y_1), . . . , (x_N, y_N)} with x_n ∈ X and y_n ∈ Y, and loss function ℓ
• Empirical risk L_T(h) = (1/N) Σ_{n=1}^N ℓ(y_n, h(x_n))
• Data fitting: estimate ĥ ∈ arg min_{h ∈ H} L_T(h)
• In machine learning, we go much farther: we also want h to do well on previously unseen inputs
• Key assumption: all data (training and otherwise) comes from the same joint probability distribution p(x, y)
• p is unknown and cannot be estimated. However, we can still aim at minimizing the statistical risk or expected risk or generalization error L_p(h) = E_p[ℓ(y, h(x))]
• We can estimate L_p(h) (but not p) by using a sufficiently large test set S = {(x_1, y_1), . . . , (x_M, y_M)}, disjoint from T
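To see why a large test set estimates the statistical risk, we can simulate a known distribution p and average the loss over many samples (a toy sketch; the distribution, flip probability, and predictor are made up for illustration):

```python
import random

random.seed(0)

def sample(p_flip=0.1):
    """Draw (x, y) from a known joint distribution p: y equals x, flipped w.p. p_flip."""
    x = random.randint(0, 1)
    y = 1 - x if random.random() < p_flip else x
    return x, y

h = lambda x: x                      # predictor to evaluate; its true risk L_p(h) is 0.1
M = 100_000
S = [sample() for _ in range(M)]     # large "test set" drawn from p
L_S = sum(1 if h(x) != y else 0 for x, y in S) / M   # empirical risk on S
```

With M = 100,000 the empirical risk L_S lands within a fraction of a percent of the true risk 0.1, even though p itself was never estimated.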
The Digit Recognition Problem Revisited
• MNIST handwritten digit recognition (60,000 images, labeled, curated)
• X = set of all 28 × 28 pixel black-and-white images
• Y = {0, 1, . . . , 9}
• ℓ is the zero-one loss
• H depends on the classification method used: fit a multivariate polynomial of a certain degree to T and then round the results? All neural networks of a certain depth?
• Learning: fit h ∈ H to T, but do well on a separate set
Training, Validation, Testing
Parameters and Hyper-Parameters
• The example for the MNIST problem had
  • parameters (polynomial coefficients, weights in the neural network)
  • hyper-parameters (maximum degree of polynomial, number of layers in the neural network)
• What is the difference between parameters and hyper-parameters?
• If we were allowed to optimize the hyper-parameters on the training set, we would get trivial solutions with zero empirical loss
• Examples: fit a degree N − 1 polynomial to N training samples; make a huge neural network
• Hyper-parameters need to be chosen separately from training
Underfitting and Overfitting
• Being too stingy with hyper-parameters (H too small) makes the learner do poorly both on T and on previously unseen data: underfitting
• Being too generous with hyper-parameters (H too large) makes the learner do well on T, but not on previously unseen data: overfitting
Example: Polynomial Fitting
[Figure: polynomial fits of degree k = 1, 2, 3, 6, 9 to the same data points]
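The figure's experiment is easy to reproduce with NumPy (a sketch under assumed data; the specific points and noise level are made up): the training risk keeps shrinking as the degree k grows, even as the high-degree fits wiggle wildly between samples.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 10)
y = 0.5 * x + np.sin(x) + rng.normal(0, 0.3, x.size)  # noisy samples of a smooth function

risks = {}
for k in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg=k)          # degree-k least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    risks[k] = np.mean((y_hat - y) ** 2)      # quadratic empirical risk on the training set
```

With N = 10 samples, the degree-9 fit interpolates the data and drives the training risk to essentially zero; that is exactly the trivial-solution overfitting the previous slide warns about.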
Validation
• How to find the Goldilocks hyper-parameters?
• For finite hyper-parameter sets: try them all, train each on T, use a separate data set V to see which hyper-parameter does best
• The separation of T and V (validation set) is crucial
• We don't allow V to "peek at the solution"
• For infinite hyper-parameter sets: find the first local minimum by binary search
Example: Polynomial Fitting
[Figure: polynomial fits of degree k = 1, 2, 3, 6, 9 (left), and training risk vs. validation risk as a function of k (right)]
Supervised Learning
• Given
  • A training set T = {(x_1, y_1), . . . , (x_N, y_N)} with x_n ∈ X and y_n ∈ Y
  • A hypothesis space, i.e., a set of functions H = {h : X → Y}
  • A loss function ℓ : Y × Y → ℝ
• Estimate ĥ ∈ arg min_{h ∈ H} L_p(h), where L_p(h) = E_p[ℓ(y, h(x))]
• How can we estimate L_p(h) (so that we can estimate hyper-parameters and the optimal estimator ĥ) if we cannot estimate p?
• Use a validation set V separate from T
Validation
• Loop on hyper-parameters π ∈ Π; train on T, validate on V:

procedure VALIDATION(H, Π, T, V, ℓ)
    L̂ = ∞                                ▷ stores the best risk so far on V
    for π ∈ Π do
        h ∈ arg min_{h′ ∈ H_π} L_T(h′)    ▷ use loss ℓ to compute the best predictor on T
        L = L_V(h)                        ▷ use loss ℓ to evaluate the predictor's risk on V
        if L < L̂ then
            (π̂, ĥ, L̂) = (π, h, L)         ▷ keep track of the best hyper-parameters, predictor, and risk
        end if
    end for
    return (π̂, ĥ, L̂)                     ▷ return best hyper-parameters, predictor, and risk estimate
end procedure
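The procedure transcribes almost line for line into Python (a sketch; `train` and `risk` are placeholders the caller supplies, not part of any library):

```python
import math

def validation(train, risk, hyper_params, T, V):
    """Return the hyper-parameter, predictor, and risk that do best on V.

    train(pi, T) -> predictor h trained on T with hyper-parameter pi
    risk(h, S)   -> empirical risk of predictor h on sample set S
    """
    pi_hat, h_hat, L_hat = None, None, math.inf   # best triple so far
    for pi in hyper_params:
        h = train(pi, T)        # best predictor on T for this hyper-parameter
        L = risk(h, V)          # evaluate the predictor's risk on V, never on T
        if L < L_hat:
            pi_hat, h_hat, L_hat = pi, h, L
    return pi_hat, h_hat, L_hat
```

As a toy usage, let the hyper-parameter itself be a constant predictor and score it by quadratic loss on V; the constant closest to the validation targets wins.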
Testing
• A machine learning system has been trained, using both T and V, to yield ĥ
• We cannot report L_V(ĥ) as the measure of performance
• The set V is tainted, since we used it during training
• Performance measures are accepted only on pristine sets, not used in any way for training
• We need to test the system on a third set S, the test set
• Estimate the true risk L_p(ĥ) = E_p[ℓ(y, ĥ(x))] by computing the empirical risk L_S(ĥ) = (1/|S|) Σ_{n=1}^{|S|} ℓ(y_n, ĥ(x_n)) on S
Summary of Sets Involved
• A training set T to train the predictor given a specific set of hyper-parameters (if any)
• A validation set V to choose good hyper-parameters
• A test set S to evaluate the generalization performance of the predictor ĥ learned by training on T and validating on V
• Resampling techniques ("cross-validation") exist for making the same set play the role of both T and V
• S must still be entirely separate
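The resampling idea can be sketched as k-fold cross-validation: split one set into k folds and let each fold serve once as V while the rest serve as T (a minimal sketch; function names are mine, and `train`/`risk` are caller-supplied placeholders as before):

```python
def k_fold_risk(train, risk, pi, data, k=5):
    """Average validation risk of hyper-parameter pi over k folds of data."""
    folds = [data[i::k] for i in range(k)]   # k roughly equal, disjoint folds
    total = 0.0
    for i in range(k):
        V = folds[i]                          # fold i plays the validation set
        T = [s for j, fold in enumerate(folds) if j != i for s in fold]
        h = train(pi, T)                      # train on the remaining k-1 folds
        total += risk(h, V)
    return total / k
```

Each sample is validated on exactly once, so one set does double duty as T and V; the test set S is never touched in this loop.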
The State of the Art of Image Classification
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
• Based on ImageNet: 1.4 million images, 1000 categories (Fei-Fei Li, Stanford)
• Three different competitions:
  • Classification: one label per image; 1.2M images available for training, 50k for validation, 100k withheld for testing; zero-one loss for performance evaluation
  • Localization: classification, plus a bounding box on one instance; correct if ≥ 50% overlap with the true box
  • Detection: same as localization, but find every instance in the image; measure the fraction of mistakes (false positives, false negatives)
[Image from Russakovsky et al., ImageNet Large Scale Visual Recognition Challenge, Int’l. J. Comp. Vision 115:211-252, 2015]
Difficulties of ILSVRC
• Images are "natural": arbitrary backgrounds, different sizes, viewpoints, lighting; partially visible objects
• 1,000 categories, subtle distinctions. Example: Siberian husky and Eskimo dog
• Variations of appearance within one category can be significant (how many lamps can you think of?)
• What is the label of one image? For instance, a picture of a group of people examining a fishing rod was labeled as "reel"
Performance for Image Classification
• 2010: 28.2 percent error
• 2017: 2.3 percent error (ensemble of several deep networks)
• Improvement results from both architectural insights (residuals, squeeze-and-excitation networks, ...) and persistent engineering
• There is a book on "tricks of the trade" in deep learning!
• We will see some after studying the basics