Post on 14-Mar-2021
CENG 793
On Machine Learning and Optimization
Sinan Kalkan
2
Now
• Introduction to ML
  – Problem definition
  – Classes of approaches
  – K-NN
  – Support Vector Machines
  – Softmax classification / logistic regression
  – Parzen Windows
• Optimization
  – Gradient descent approaches
  – A flavor of other approaches
3
Introduction to Machine Learning
Uses many figures and material from the following website:
Uses many figures and material from: http://www.byclb.com/TR/Tutorials/neural_networks/ch1_1.htm
Overview
[Figure: the standard ML pipeline.
Training: data with labels (e.g., "human", "animal") → extract features (size, texture, color, histogram of oriented gradients, SIFT, etc.) → "learn" → learned models or classifiers.
Testing: test input → extract features → predict → "human" or "animal" or ...]
5
Content
• Problem definition
• General approaches
• Popular methods
  – kNN
  – Linear classification
  – Support Vector Machines
  – Parzen windows
6
Problem Definition
• Given
  – Data: a set of instances x₁, x₂, …, x_N sampled from a space X ⊆ ℝᵈ.
  – Labels: the corresponding label yᵢ for each xᵢ, with yᵢ ∈ Y ⊆ ℝ.
• Goal:
  – Learn a mapping from the space of data to the space of labels, i.e.,
    • M: X → Y,
    • y = f(x).
7
Issues in Machine Learning
• Hypothesis space
• Loss / cost / objective function
• Optimization
• Bias vs. variance
• Testing / evaluation
• Overfitting, underfitting
8
General Approaches (Generative vs. Discriminative)
[Figure: three panels illustrating the approaches.
- Find a separating line (in general: a hyperplane).
- Fit a function to the data (regression).
- Learn a model for each class.]
9
General Approaches (cont'd)
Generative:
• Regression
• Markov Random Fields
• Bayesian Networks/Learning
• Clustering via Gaussian Mixture Models, Parzen Windows, etc.
• …
Discriminative:
• Support Vector Machines
• Artificial Neural Networks
• Conditional Random Fields
• K-NN
10
General Approaches (cont'd)
[Figure:
Supervised learning: instances paired with labels (e.g., images labeled "apple" or "pear") → extract features → learn a model (e.g., SVM).
Unsupervised learning: instances without labels → extract features → learn a model (e.g., k-means clustering).]
11
General Approaches (cont'd)

               | Generative                         | Discriminative
Supervised     | Regression, Markov Random Fields,  | Support Vector Machines, Neural Networks,
               | Bayesian Networks                  | Conditional Random Fields, Decision Tree Learning
Unsupervised   | Gaussian Mixture Models,           | K-means, Self-Organizing Maps
               | Parzen Windows                     |
12
Now
[Of the Generative/Discriminative × Supervised/Unsupervised grid above, this lecture covers Support Vector Machines, Softmax / Logistic Regression, and Parzen Windows.]
14
Feature Space
17
General pitfalls/problems
• Overfitting
• Underfitting
• Occam's razor
18
21
K-NEAREST NEIGHBOR
22
A very simple algorithm
• Advantages:
  – No training
  – Simple
• Disadvantages:
  – Slow testing time (more efficient versions exist)
  – Needs a lot of memory
Wikipedia
http://cs231n.github.io/classification/
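The algorithm itself fits in a few lines. A minimal brute-force sketch (the function name and data layout are ours, not from the slides); note that it exhibits exactly the two disadvantages listed above: every test query scans all training points, and all of them must stay in memory:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training point
    nearest = np.argsort(dists)[:k]              # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]             # majority label among the k
```

There is no training step at all: "training" is just storing `X_train` and `y_train`.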
PARZEN WINDOWS
A non-parametric method
24
Probability and density
• A continuous probability density function p(x) satisfies the following:
  – Probability of being between two values:
      P(a < x < b) = ∫ₐᵇ p(x) dx
  – p(x) ≥ 0 for all real x.
  – And ∫₋∞^∞ p(x) dx = 1.
• In 2-D:
  – Probability for x to be inside region R:
      P = ∫_R p(x) dx
  – p(x) ≥ 0 for all real x.
  – And ∫ p(x) dx = 1.
26
Density Estimation
• Basic idea:
    P = ∫_R p(x) dx
• If R is small enough, so that p(x) is almost constant in R:
    P = ∫_R p(x) dx ≈ p(x) ∫_R dx = p(x) V
  – V: volume of region R.
• If k out of n samples fall into R, then:
    P = k/n
• From which we can write:
    k/n = p(x) V  ⟹  p(x) = (k/n) / V
Parzen Windows
• Assume that:
  – R is a hypercube centered at x.
  – h is the length of an edge of the hypercube.
  – Then, V = h² in 2D, V = h³ in 3D, etc.
• Let us define the following window function:
    w((xᵢ − x)/h) = 1 if |xᵢ − x|/h < 1/2 (along every dimension), 0 otherwise
• Then, we can write the number of samples falling into R as follows:
    k = Σᵢ₌₁ⁿ w((xᵢ − x)/h)
Parzen Windows (cont'd)
• Remember: p(x) = (k/n) / V.
• Using this definition of k, we can rewrite p(x) (in 2D):
    p(x) = (1/n) Σᵢ₌₁ⁿ (1/h²) w((xᵢ − x)/h)
• Interpretation: the density is the sum of the contributions of window functions fitted at each sample!
29
Parzen Windows (cont'd)
• The type of the window function can add different flavors.
• If we use a Gaussian for the window function:
    p(x) = (1/n) Σᵢ₌₁ⁿ (1/(σ√(2π))) exp(−‖xᵢ − x‖² / (2σ²))
• Interpretation: fit a Gaussian to each sample.
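The 1-D Gaussian-window estimate transcribes almost directly into code. A minimal sketch (the function name and sample data are ours):

```python
import numpy as np

def parzen_gaussian(x, samples, sigma):
    """1-D Parzen density estimate with Gaussian windows:
    p(x) = (1/n) * sum_i (1/(sigma*sqrt(2*pi))) * exp(-(x_i - x)^2 / (2*sigma^2))."""
    norm = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    # np.mean handles the (1/n) * sum over the samples
    return np.mean(norm * np.exp(-(samples - x) ** 2 / (2.0 * sigma ** 2)))
```

Because each window integrates to 1, the estimate itself integrates to 1 over the real line, as a density should.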
30
31
LINEAR CLASSIFICATION
32
Linear classification
• Linear classification relies on the following score function:
    f(xᵢ; W, b) = W xᵢ + b
  or
    f(xᵢ; W) = W xᵢ
  where the bias is implicitly represented in W and xᵢ (one row of W per class).
http://cs231n.github.io/linear-classify/
Linear classification
• We can rewrite the score function as f(xᵢ; W) = W xᵢ, where the bias is implicitly represented in W and xᵢ: append a constant 1 to xᵢ and the bias vector as an extra column of W.
http://cs231n.github.io/linear-classify/
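The score function with the bias folded into W can be sketched as follows (all the numbers here are made up for illustration):

```python
import numpy as np

# Hypothetical setup: 3 classes, 2 features, bias folded into W.
W = np.array([[ 1.0, -1.0,  0.5],   # class 0 weights, last entry is its bias
              [ 0.0,  2.0, -1.0],   # class 1
              [-1.0,  0.0,  0.0]])  # class 2
x = np.array([2.0, 1.0])
x_aug = np.append(x, 1.0)           # bias trick: append a constant 1
s = W.dot(x_aug)                    # one score per class: s == [1.5, 1.0, -2.0]
predicted_class = int(np.argmax(s)) # the class with the highest score wins (0 here)
```

A single matrix-vector product evaluates all class scores at once, which is why the bias trick is convenient in practice.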
Linear classification: One interpretation
• Since an image can be thought of as a vector, we can consider images as points in a high-dimensional space.
• One interpretation of f(xᵢ; W, b) = W xᵢ + b:
  – Each row of W describes a line (in general, a hyperplane) for a class, and b shifts it away from the origin.
http://cs231n.github.io/linear-classify/
Linear classification: Another interpretation
• Each row in W can be interpreted as a template of that class.
  – f(xᵢ; W, b) = W xᵢ + b calculates the inner products to find which template best fits xᵢ.
  – Effectively, we are doing Nearest Neighbor with the "prototype" images of each class.
http://cs231n.github.io/linear-classify/
Loss function
• A function which measures how good our weights are.
  – Other names: cost function, objective function.
• Let s_j = f(xᵢ; W)_j.
• An example loss function:
    Lᵢ = Σ_{j≠yᵢ} max(0, s_j − s_{yᵢ} + Δ)
  or equivalently:
    Lᵢ = Σ_{j≠yᵢ} max(0, w_jᵀ xᵢ − w_{yᵢ}ᵀ xᵢ + Δ)
• This pushes the correct class's score to exceed every other class's score by at least Δ (the margin).
http://cs231n.github.io/linear-classify/
Example
• Consider our scores for xᵢ to be s = [13, −7, 11], let the first class be the correct one, and assume Δ = 10.
• Then, Lᵢ = max(0, −7 − 13 + 10) + max(0, 11 − 13 + 10) = 0 + 8 = 8.
http://cs231n.github.io/linear-classify/
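The computation on this slide can be checked in a few lines (the helper name is ours):

```python
import numpy as np

def svm_loss_i(s, y, delta):
    """Multiclass hinge loss for one example:
    L_i = sum over j != y of max(0, s_j - s_y + delta)."""
    margins = np.maximum(0.0, s - s[y] + delta)
    margins[y] = 0.0   # the sum skips the correct class (j != y_i)
    return margins.sum()
```

With the slide's numbers, `svm_loss_i(np.array([13., -7., 11.]), 0, 10.0)` reproduces the loss of 8.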
Regularization
• In practice, there are many possible solutions leading to the same loss value.
  – Based on the requirements of the problem, we might want to penalize certain solutions.
• E.g.,
    R(W) = Σ_k Σ_l W²_{k,l}
  – which penalizes large weights.
• Why do we want to do that?
http://cs231n.github.io/linear-classify/
Combined Loss Function
• The loss function becomes:
    L = (1/N) Σᵢ Lᵢ + λ R(W)
• If you expand it:
    L = (1/N) Σᵢ Σ_{j≠yᵢ} max(0, f(xᵢ; W)_j − f(xᵢ; W)_{yᵢ} + Δ) + λ Σ_k Σ_l W²_{k,l}
• Δ and λ are hyperparameters (estimated using the validation set).
http://cs231n.github.io/linear-classify/
Hinge Loss, or Max-Margin Loss
    L = (1/N) Σᵢ Σ_{j≠yᵢ} max(0, f(xᵢ; W)_j − f(xᵢ; W)_{yᵢ} + Δ) + λ Σ_k Σ_l W²_{k,l}
• This combined loss is known as the hinge loss, or max-margin loss.
http://cs231n.github.io/linear-classify/
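A vectorized sketch of the full max-margin loss over a batch (the names and shapes are our assumptions: X is N×d, W is C×d with one row per class, y holds the correct labels):

```python
import numpy as np

def hinge_loss(W, X, y, delta, lam):
    """Data term (mean multiclass hinge loss) + L2 regularizer lam * sum W^2."""
    S = X.dot(W.T)                                # (N, C) scores, one row per example
    correct = S[np.arange(len(y)), y][:, None]    # score of the correct class
    margins = np.maximum(0.0, S - correct + delta)
    margins[np.arange(len(y)), y] = 0.0           # exclude j == y_i from the sum
    data_loss = margins.sum() / len(y)            # (1/N) * sum_i L_i
    reg_loss = lam * np.sum(W ** 2)               # lambda * sum_k sum_l W_kl^2
    return data_loss + reg_loss
```

Setting `lam = 0` recovers the unregularized loss from the earlier slides.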
Interactive Demo
http://vision.stanford.edu/teaching/cs231n/linear-classify-demo/
SUPPORT VECTOR MACHINES
An alternative formulation of linear classification
Borrowed mostly from the slides of:
- Machine Learning Group, University of Texas at Austin
- Mingyue Tan, The University of British Columbia
43
Linear Separators
• Binary classification can be viewed as the task of separating classes in feature space:
    wᵀx + b = 0 (the separating hyperplane)
    wᵀx + b > 0 on one side, wᵀx + b < 0 on the other
    f(x) = sign(wᵀx + b)
44
Linear Separators
β’ Which of the linear separators is optimal?
45
Classification Margin
• Distance from an example xᵢ to the separator is:
    r = (wᵀxᵢ + b) / ‖w‖
• Examples closest to the hyperplane are support vectors.
• Margin ρ of the separator is the distance between support vectors.
46
Maximum Margin Classification
β’ Maximizing the margin is good according to intuition.
β’ Implies that only support vectors matter; other training examples are ignorable.
47
Linear SVM Mathematically
What we know:
• w · x⁺ + b = +1
• w · x⁻ + b = −1
• w · (x⁺ − x⁻) = 2
Margin width:
    ρ = (x⁺ − x⁻) · (w/‖w‖) = 2/‖w‖
49
Linear SVMs Mathematically (cont.)
• Then we can formulate the optimization problem:
    Find w and b such that ρ = 2/‖w‖ is maximized,
    and for all (xᵢ, yᵢ), i = 1..n: yᵢ(wᵀxᵢ + b) ≥ 1
  which can be reformulated as:
    Find w and b such that Φ(w) = ‖w‖² = wᵀw is minimized,
    and for all (xᵢ, yᵢ), i = 1..n: yᵢ(wᵀxᵢ + b) ≥ 1
50
Lagrange Multipliers
• Given the following optimization problem:
    minimize f(x, y) subject to g(x, y) = c
• We can formulate it as:
    L(x, y, λ) = f(x, y) + λ(g(x, y) − c)
• and set the derivative to zero:
    ∇_{x,y,λ} L(x, y, λ) = 0
51
Lagrange Multipliers
• Main intuition:
  – The gradients of f and g are parallel at the maximum:
    ∇f = λ∇g
52
Fig: http://mathworld.wolfram.com/LagrangeMultiplier.html
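A tiny worked instance of the condition ∇f = λ∇g, on a toy problem we chose ourselves (not from the slides): maximize f(x, y) = xy subject to g(x, y) = x + y = 2. The condition reads (y, x) = λ(1, 1), so x = y = λ, and the constraint then forces x = y = λ = 1:

```python
import numpy as np

# Candidate point from the Lagrange condition worked out above.
x, y, lam = 1.0, 1.0, 1.0
grad_f = (y, x)          # gradient of f(x, y) = x*y at the candidate point
grad_g = (1.0, 1.0)      # gradient of g(x, y) = x + y

assert grad_f == (lam * grad_g[0], lam * grad_g[1])  # gradients are parallel
assert x + y == 2.0                                  # constraint is satisfied

# Brute-force confirmation: no other point on the constraint does better.
ts = np.linspace(0.0, 2.0, 401)          # points (t, 2 - t) on the constraint
assert np.all(ts * (2.0 - ts) <= x * y + 1e-12)
```

The brute-force sweep is only for illustration; the Lagrange condition found the maximizer analytically.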
Lagrange Multipliers
• See the following for a proof:
  – http://ocw.mit.edu/courses/mathematics/18-02sc-multivariable-calculus-fall-2010/2.-partial-derivatives/part-c-lagrange-multipliers-and-constrained-differentials/session-40-proof-of-lagrange-multipliers/MIT18_02SC_notes_22.pdf
• A clear example:
  – http://tutorial.math.lamar.edu/Classes/CalcIII/LagrangeMultipliers.aspx
• A more intuitive explanation:
  – http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html
53
54
Non-linear SVMs
• Datasets that are linearly separable with some noise work out great.
• But what are we going to do if the dataset is just too hard?
• How about... mapping the data to a higher-dimensional space?
[Figure: 1-D data on the x axis that is not linearly separable becomes separable after the mapping x → (x, x²).]
55
SOFTMAX OR LOGISTIC CLASSIFIERS
56
Cross-entropy
• Uses cross-entropy (t: target probabilities, o: estimated probabilities):
    H(t, o) = E_t[−log o] = −Σ_j t_j log o_j
• Wikipedia:
  – "the cross entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an 'unnatural' probability distribution q, rather than the 'true' distribution p"
57
Softmax classifier: cross-entropy loss
• Cross-entropy: H(t, o) = E_t[−log o] = −Σ_j t_j log o_j
• In our case,
  – t denotes the correct probabilities of the categories; i.e., t_j = 1 for the correct label and t_j = 0 for the other categories.
  – o denotes the estimated probabilities of the categories.
• But our scores are not probabilities!
  – One solution, the softmax function:
      σ(z_j) = e^{z_j} / Σ_k e^{z_k}
  – It maps values from arbitrary ranges to probabilities.
• Using the normalized values, we can now define the cross-entropy loss for the classification problem:
    Lᵢ = −log( e^{f_{yᵢ}} / Σ_j e^{f_j} ) = −f_{yᵢ} + log Σ_j e^{f_j}
http://cs231n.github.io/
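The final formula maps directly to code. A minimal sketch (the function name is ours):

```python
import numpy as np

def softmax_ce_loss(f, y):
    """Cross-entropy loss for class scores f and correct class index y,
    in the slide's final form: L_i = -f_y + log(sum_j e^{f_j})."""
    return -f[y] + np.log(np.sum(np.exp(f)))
```

By construction this equals `-log` of the softmax probability of the correct class, so the two forms on the slide agree numerically.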
Logistic loss
• A special case of cross-entropy for binary classification:
    H(t, o) = −Σ_j t_j log o_j = −t_i log o_i − (1 − t_i) log(1 − o_i)
• The softmax function reduces to the logistic function:
    σ(f) = 1 / (1 + e^{−f})
• And the loss becomes:
    Lᵢ = −log( 1 / (1 + e^{−f}) )
http://cs231n.github.io/
Why take logarithm of probabilities?
• Maps probability space to logarithmic space.
• Multiplication becomes addition.
  – Multiplication is a very frequent operation with probabilities.
• Speed:
  – Addition is more efficient.
• Accuracy:
  – Considering the loss in representing real numbers, addition is friendlier.
• Since log-probability is negative, to work with positive numbers, we usually negate the log-probability.
62
Softmax classifier: One interpretation
• Information theory
  – Cross-entropy between a true distribution p and an estimated one q:
      H(p, q) = −Σ_x p(x) log q(x)
  – In our case, p = [0, …, 1, 0, …, 0], containing only one 1, at the correct label.
  – Since H(p, q) = H(p) + D_KL(p‖q), and H(p) = 0 for such a one-hot p, we are minimizing the Kullback-Leibler divergence.
http://cs231n.github.io/
Softmax classifier: Another interpretation
• Probabilistic view:
    P(yᵢ | xᵢ; W) = e^{f_{yᵢ}} / Σ_j e^{f_j}
• In our case, we are minimizing the negative log likelihood.
• Therefore, this corresponds to Maximum Likelihood Estimation (MLE).
http://cs231n.github.io/
Numerical Stability
• Exponentials may become very large. A trick: multiply the numerator and the denominator by a constant C:
    e^{f_{yᵢ}} / Σ_j e^{f_j} = C e^{f_{yᵢ}} / (C Σ_j e^{f_j}) = e^{f_{yᵢ} + log C} / Σ_j e^{f_j + log C}
• Set log C = −max_j f_j.
http://cs231n.github.io/
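The shift trick in code (a sketch; a naive softmax would overflow on these scores):

```python
import numpy as np

def stable_softmax(f):
    """Softmax with the shift trick: log C = -max_j f_j, i.e., subtract
    the maximum score before exponentiating."""
    shifted = f - np.max(f)   # the largest exponent becomes 0: no overflow
    e = np.exp(shifted)
    return e / e.sum()
```

`np.exp(1001.0)` alone overflows to infinity, but `stable_softmax(np.array([1000., 1001.]))` returns finite probabilities, and on small inputs the result is identical to the unshifted formula.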
SVM loss vs. cross-entropy loss
• SVM is happy when the classification satisfies the margin.
  – Ex: score values [10, 9, 9] and [10, −10, −10] both yield zero loss when the margin is 1.
  – SVM is local.
• Cross-entropy always wants better.
  – Cross-entropy is global: it keeps pushing the correct class's probability towards 1.
66
0-1 Loss
• Minimize the number of cases where the prediction is wrong:
    L = Σᵢ 1( f(xᵢ; W, b) ≠ yᵢ )
  or equivalently (for binary labels yᵢ ∈ {−1, +1}):
    L = Σᵢ 1( yᵢ f(xᵢ; W, b) < 0 )
68
Absolute Value Loss, Squared Error Loss
    L_q = Σᵢ |f(xᵢ) − yᵢ|^q
• q = 1: absolute value loss
• q = 2: squared error loss
Bishop
MORE ON LOSS FUNCTIONS
70
Visualizing Loss Functions
• If you look at one of the example loss functions:
    Lᵢ = Σ_{j≠yᵢ} max(0, w_jᵀxᵢ − w_{yᵢ}ᵀxᵢ + 1)
• Since W has too many dimensions, this is difficult to plot.
• We can visualize this along one weight direction though, which can give us some intuition about the shape of the function.
  – E.g., start from an arbitrary W₀, choose a direction W₁, and plot L(W₀ + αW₁) for different values of α.
http://cs231n.github.io/
Visualizing Loss Functions
• Example:
• If we consider just w₀, each margin term is a linear function of w₀, and the loss is a sum of such clamped linear functions.
http://cs231n.github.io/
Visualizing Loss Functions
• You can see that this is a convex function.
  – Nice and easy for optimization.
• When you combine many of them in a neural network, it becomes non-convex.
[Figure: loss along one direction; loss along two directions; loss along two directions (averaged over many samples).]
http://cs231n.github.io/
Another approach for visualizing loss functions
• 0-1 loss:
    L = 1(f(x) ≠ y), or equivalently (for y ∈ {−1, +1}): L = 1(y f(x) < 0)
• Square loss:
    L = (f(x) − y)², which in the binary case becomes L = (1 − y f(x))²
• Hinge loss:
    L = max(1 − y f(x), 0)
• Logistic loss:
    L = (1/ln 2) ln(1 + e^{−y f(x)})
Rosasco et al., 2003
SUMMARY OF LOSS FUNCTIONS
79
80
81
82
83
Sum up
• 0-1 loss is not differentiable, so it is not helpful at training time.
  – It is used in testing.
• Other losses try to cover the "weakness" of 0-1 loss.
• Hinge loss imposes a weaker constraint compared to cross-entropy.
• For classification: use hinge loss or cross-entropy loss.
• For regression: use squared-error loss or absolute-difference loss.
84
SO, WHAT DO WE DO WITH A LOSS FUNCTION?
85
Optimization strategies
β’ We want to find π that minimizes the loss function.
β Remember that π has lots of dimensions.
β’ NaΓ―ve idea: random search
β For a number of iterations:β’ Select a π randomly
β’ If it leads to better loss than the previous ones, select it.
β This yields 15.5% accuracy on CIFAR after 1000 iterations (chance: 10%)
86http://cs231n.github.io/
Optimization strategies
• Second idea: random local search.
  – Start at an arbitrary position (weights W).
  – Try an arbitrary perturbation W + δW and see if it leads to a better loss.
    • If yes, move along.
  – This leads to 21.4% accuracy after 1000 iterations.
  – This is actually a variation of simulated annealing.
• A better idea: follow the gradient.
http://cs231n.github.io/
A quick reminder on gradients / partial derivatives
– In one dimension:
    df(x)/dx = lim_{h→0} [f(x + h) − f(x)] / h
– In practice:
    • df(x)/dx ≈ [f(x + h) − f(x)] / h
    • df(x)/dx ≈ [f(x + h) − f(x − h)] / (2h)   (centered difference: works better)
– In many dimensions:
  1. Compute the gradient numerically with finite differences
     – Slow
     – Easy
     – Approximate
  2. Compute the gradient analytically
     – Fast
     – Exact
     – Error-prone to implement
• In practice: implement (2) and check against (1) before testing.
88
http://cs231n.github.io/
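The centered-difference recipe above, as a reusable gradient checker (a sketch; the names are ours). As the slide advises, use it to verify an analytic gradient:

```python
import numpy as np

def num_grad(f, x, h=1e-5):
    """Centered-difference numerical gradient of f at x:
    slow, easy, approximate (option 1 above)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x[i]
        x[i] = old + h
        fp = f(x)                    # f(x + h) along dimension i
        x[i] = old - h
        fm = f(x)                    # f(x - h) along dimension i
        x[i] = old                   # restore the original value
        grad[i] = (fp - fm) / (2.0 * h)
    return grad
```

For example, for f(x) = Σ x², the analytic gradient 2x should match the numerical one up to the finite-difference error.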
A quick reminder on gradients / partial derivatives
• If you have a many-variable function, e.g., f(x, y) = x + y, you can take its derivative wrt either x or y:
    ∂f(x, y)/∂x = lim_{h→0} [f(x + h, y) − f(x, y)] / h = 1
  – Similarly, ∂f(x, y)/∂y = 1.
  – In fact, we should denote them as ∂f/∂x and ∂f/∂y, since they are "partial derivatives" (gradients on x or y).
• A partial derivative tells you the rate of change along a single dimension at a point.
  – E.g., if ∂f/∂x = 1, a change of x₀ in x leads to the same amount of change in the value of the function.
• The gradient is the vector of partial derivatives:
    ∇f = (∂f/∂x, ∂f/∂y)
http://cs231n.github.io/
A quick reminder on gradients / partial derivatives
• A simple example:
  – f(x, y) = max(x, y)
  – ∇f = ( 1(x ≥ y), 1(y ≥ x) )
• Chaining:
  – What if we have a composition of functions?
  – E.g., f(x, y, z) = q(x, y) · z and q(x, y) = x + y:
      ∂f/∂x = (∂f/∂q)(∂q/∂x) = z
      ∂f/∂y = (∂f/∂q)(∂q/∂y) = z
      ∂f/∂z = q(x, y) = x + y
• Back-propagation:
  – Local processing to improve the system globally.
  – Each gate locally determines whether to increase or decrease the inputs.
http://cs231n.github.io/
Partial derivatives and backprop
• Consider the following function, drawn as a circuit with many gates (multiply, add, exp, reciprocal):
    f(w, x) = 1 / (1 + e^{−(w₀x₀ + w₁x₁ + w₂)})
http://cs231n.github.io/
In fact, that was the sigmoid function
• We will combine all these gates into a single gate and call it a neuron.
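A gate-by-gate forward and backward pass through this circuit, using the example values w = [2, −3, −3], x = [−1, −2] from the cs231n notes:

```python
import numpy as np

w = np.array([2.0, -3.0, -3.0])   # w0, w1, and the bias w2
x = np.array([-1.0, -2.0])

# forward pass, gate by gate
dot = w[0] * x[0] + w[1] * x[1] + w[2]   # multiply and add gates
f = 1.0 / (1.0 + np.exp(-dot))           # the sigmoid gate

# backward pass: the sigmoid's local gradient has the tidy form f * (1 - f)
ddot = f * (1.0 - f)                             # gradient flowing back into 'dot'
dw = np.array([x[0] * ddot, x[1] * ddot, ddot])  # chain rule through each gate
dx = np.array([w[0] * ddot, w[1] * ddot])
```

Each line of the backward pass only uses quantities local to its gate, which is the point of back-propagation: local processing, global improvement.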
92
Intuitive effects
• Commonly used operations:
  – Add gate
  – Max gate
  – Multiply gate
• Due to the effect of the multiply gate, when one of the inputs is large, the small input gets the large gradient (and vice versa).
  – This has a negative effect in NNs.
  – When one of the weights is unusually large, it affects the other weights.
  – Therefore, it is better to normalize the input / weights.
http://cs231n.github.io/
Optimization strategies: gradient
• Move along the negative gradient (since we wish to go down):
    W ← W − η ∂L(W)/∂W
  – η: step size
• The gradient tells us the direction.
  – Choosing how much to move along that direction is difficult to determine.
  – This is also called the learning rate.
  – If it is small: too slow to converge.
  – If it is big: you may overshoot and skip the minimum.
http://cs231n.github.io/
Optimization strategies: gradient
• Take our loss function:
    Lᵢ = Σ_{j≠yᵢ} max(0, w_jᵀxᵢ − w_{yᵢ}ᵀxᵢ + 1)
• Its gradient wrt w_j (for j ≠ yᵢ) is:
    ∂Lᵢ/∂w_j = 1( w_jᵀxᵢ − w_{yᵢ}ᵀxᵢ + 1 > 0 ) xᵢ
• For the "winning" weight w_{yᵢ}, this is:
    ∂Lᵢ/∂w_{yᵢ} = −( Σ_{j≠yᵢ} 1( w_jᵀxᵢ − w_{yᵢ}ᵀxᵢ + 1 > 0 ) ) xᵢ
http://cs231n.github.io/
Gradient Descent
• Update the weights:
    W_new ← W − η ∂L(W)/∂W
• This computes the gradient after seeing all examples, and only then updates the weights.
  – Examples can be on the order of millions or billions.
• Alternatives:
  – Mini-batch gradient descent: update the weights after, e.g., 256 examples.
  – Stochastic (or online) gradient descent: update the weights after each example.
  – People usually use mini-batches and still call it stochastic.
  – Performing an update after each of 100 examples is more expensive than performing a single update for 100 examples at once, due to matrix/vector operations.
http://cs231n.github.io/
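A mini-batch SGD loop can be sketched as follows (all names and the toy least-squares problem in the usage note are our own):

```python
import numpy as np

def sgd(grad_fn, W0, X, y, lr, batch_size, n_steps, seed=0):
    """Mini-batch SGD sketch: at each step, sample a small batch and move
    along the negative gradient computed on that batch only."""
    rng = np.random.default_rng(seed)
    W = W0.copy()
    for _ in range(n_steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        W -= lr * grad_fn(W, X[idx], y[idx])   # W <- W - eta * dL/dW (batch)
    return W
```

For example, with a least-squares gradient `2 * Xb.T @ (Xb @ W - yb) / len(yb)` on data generated as y = 3x, the loop recovers the weight 3.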
A GENERAL LOOK AT OPTIMIZATION
97
[Figure: taxonomy of optimization problems. Mathematical Optimization includes Convex Optimization (which in turn includes Least-squares and Linear Programming) and Nonlinear Optimization.]
Rong Jin
100
Mathematical Optimization
101
Rong Jin
Convex Optimization
102
Rong Jin
Interpretation
For a convex function, the function's value between any two points lies below the line connecting those points.
Mark Schmidt
Another interpretation
Mark Schmidt
Example convex functions
Mark Schmidt
Operations that conserve convexity
107
Rong Jin
    ‖x + y‖_p ≤ ‖x‖_p + ‖y‖_p
(the triangle inequality for norms, from which the convexity of norms follows)
108
Rong Jin
A KTEC Center of Excellence 109
Why convex optimization?
• Can't solve most OPs
  – E.g., NP-hard; even high polynomial time is too slow
• Convex OPs:
  – (Generally) no analytic solution
  – Efficient algorithms to find the (global) solution
  – Interior point methods (basically iterated Newton) can be used:
    • ~[10-100] · max{p³, p²m, F}; F = cost of evaluating the objective and constraint functions
  – At worst, solve with general IP methods (CVX); specialized solvers are faster
Convex Function
• Easy to see why convexity allows for an efficient solution:
• just "slide" down the objective function as far as possible, and you will reach a minimum.
Convex vs. Non-convex Ex.
• Convex: the minimum is easy to find.
• Affine: a border case of convexity.
Convex vs. Non-convex Ex.
• Non-convex: easy to get stuck in a local minimum.
• Can't rely on only local search techniques.
Non-convex
• Some non-convex problems are highly multi-modal, or NP-hard.
• We could be forced to search all solutions, or hope that a stochastic search is successful.
• Cannot guarantee the best solution; inefficient.
• Harder to make performance guarantees with approximate solutions.
Least-squares: a mature technology!
• Analytical solution
• Good algorithms and software
• High accuracy and high reliability
• Time complexity: ~C n² k
Rong Jin
Linear Programming (LP): also a mature technology!
• No analytical solution
• Algorithms and software
• Reliable and efficient
• Time complexity: ~C n² m
Rong Jin
Convex Optimization (in general): almost a mature technology!
• No analytical solution
• Algorithms and software
• Reliable and efficient
• Time complexity (roughly): max{n³, n²m, F}
Rong Jin
Nonlinear Optimization: far from a technology! (something to avoid)
• Sadly, no effective methods to solve in general
• Only approaches with some compromise:
  – Local optimization: "more art than technology"
  – Global optimization: greatly compromised efficiency
• Help from convex optimization:
  1) Initialization  2) Heuristics  3) Bounds
Rong Jin
Why Study Convex Optimization?
"If not, there is little chance you can solve it."
-- Section 1.3.2, p. 8, Convex Optimization (Boyd & Vandenberghe)
Rong Jin
Recognizing Convex Optimization Problems
121
Rong Jin
Least-squares
126
Rong Jin
Analytical Solution of Least-squares
• Set the derivative of f₀(x) = ‖Ax − b‖² to zero:
    ∂f₀(x)/∂x = 0
    2 AᵀA x − 2 Aᵀb = 0
    AᵀA x = Aᵀb
• Solve this system of linear equations.
127
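The normal equations in code, on a small made-up line-fitting problem:

```python
import numpy as np

# Fit a line c0 + c1*t through three made-up points via A^T A x = A^T b.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])                 # columns: intercept term, t
b = np.array([1.0, 2.0, 2.0])
x = np.linalg.solve(A.T @ A, A.T @ b)      # solve the normal equations
# x == [2/3, 1/2]: the least-squares line is 2/3 + t/2
```

Using `np.linalg.solve` (rather than forming an explicit inverse) is the standard way to solve such a linear system; a dedicated least-squares routine like `np.linalg.lstsq` gives the same answer.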
Linear Programming (LP)
128
Rong Jin
To sum up
• We introduced some important concepts in machine learning and optimization.
• We introduced popular machine learning methods.
• We talked about loss functions and how we can optimize them using gradient descent.
135