TRANSCRIPT
Machine Learning CS 6830
Razvan C. Bunescu
School of Electrical Engineering and Computer Science
Lecture 02
Supervised Learning
• Task = learn a function y : X → T that maps input instances x ∈ X to output targets t ∈ T:
  – Classification: the output t ∈ T is one of a finite set of discrete categories.
  – Regression: the output t ∈ T is continuous, or has a continuous component.
• Supervision = a set of training examples: (x1, t1), (x2, t2), …, (xn, tn).
Regression: Curve Fitting
• Training: examples (x1,t1), (x2,t2), … (xn,tn)
Regression: Curve Fitting
• Testing: for arbitrary (unseen) instance x ∈ X , compute target output y(x) = t ∈ T .
Polynomial Curve Fitting
y(x) = y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j

(the w_j are the parameters; the powers x^j are the features)
Polynomial Curve Fitting
• Learning = finding the “right” parameters wT = [w0, w1, …, wM]
  – Find w that minimizes an error function E(w), which measures the misfit between y(xn, w) and tn.
  – Expect that if y(x, w) performs well on the training examples xn, then y(x, w) will perform well on arbitrary test examples x ∈ X (the Inductive Learning Hypothesis).
• Sum-of-Squares error function:
E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2

Why squared?
Sum-of-Squares Error Function
• How do we find w* that minimizes E(w)?
E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2

\mathbf{w}^* = \arg\min_{\mathbf{w}} E(\mathbf{w})
Polynomial Curve Fitting
• The least-squares solution is found by solving a set of M + 1 linear equations:

\sum_{j=0}^{M} A_{ij} w_j = T_i, \quad \text{where} \quad A_{ij} = \sum_{n=1}^{N} x_n^{\,i+j} \quad \text{and} \quad T_i = \sum_{n=1}^{N} x_n^{\,i}\, t_n

• Generalization = how well the parameterized y(x, w*) performs on arbitrary (unseen) test instances x ∈ X.
  – Generalization performance depends on the value of M.
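For concreteness, a minimal NumPy sketch of this step; the function name, the sin(2πx) toy data, and the choice M = 3 are illustrative, not from the slides:

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Fit an M-th order polynomial by solving the M+1 linear equations
    sum_j A_ij w_j = T_i, with A_ij = sum_n x_n^(i+j) and T_i = sum_n x_n^i t_n."""
    idx = np.arange(M + 1)
    A = np.array([[np.sum(x ** (i + j)) for j in idx] for i in idx])
    T = np.array([np.sum((x ** i) * t) for i in idx])
    return np.linalg.solve(A, T)          # w = [w_0, ..., w_M]

# Toy usage: noisy samples of sin(2*pi*x), as in the curve-fitting example.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 10)
w = fit_polynomial(x, t, M=3)
y_pred = np.polyval(w[::-1], 0.5)         # predict y(0.5, w); polyval wants w_M first
```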
0th, 1st, 3rd, and 9th Order Polynomial (figures: fits of increasing order M)
Polynomial Curve Fitting
• Model Selection: choosing the order M of the polynomial.
  – Best generalization is obtained with M = 3.
  – M = 9 obtains poor generalization, even though it fits the training examples perfectly:
    • But M = 9 polynomials subsume M = 3 polynomials!
• Overfitting ≡ good performance on training examples, poor performance on test examples.
Overfitting
• Measure the fit to training/testing examples using the Root-Mean-Square (RMS) error:

E_{RMS} = \sqrt{2 E(\mathbf{w}^*) / N}

• Use 100 random test examples, generated in the same way as the training examples.
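As a small illustration of this formula (the helper name and the assumption of a polynomial model with coefficients w = [w0, …, wM] are ours, not from the slides):

```python
import numpy as np

def rms_error(w, x, t):
    """E_RMS = sqrt(2 * E(w) / N) for a polynomial with coefficients w = [w_0, ..., w_M]."""
    y = np.polyval(w[::-1], x)          # y(x_n, w) for every example
    E = 0.5 * np.sum((y - t) ** 2)      # sum-of-squares error E(w)
    return np.sqrt(2 * E / len(x))
```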
Over-fitting and Parameter Values
Overfitting vs. Data Set Size
• More training data ⇒ less overfitting.
• What if we do not have more training data?
  – Use regularization.
  – Use a probabilistic model in a Bayesian setting.

(figures: M = 9 fits for different training set sizes)
Regularization
• Penalize large parameter values:
E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2

(the second term is the regularizer)

\mathbf{w}^* = \arg\min_{\mathbf{w}} E(\mathbf{w})
9th Order Polynomial with Regularization (figures)
Training & Test Error vs. ln λ
How do we find the optimal value of λ?
Model Selection
• Put aside an independent validation set.
• Select parameters giving the best performance on the validation set.

(figure: data split into training and validation sets)

ln λ ∈ {−40, −35, −30, −25, −20, −15}

ln λ     −40    −35    −30    −25    −20    −15
E_RMS    1.05   0.30   0.25   0.27   0.52   0.55
Model Evaluation
• K-fold cross-validation:
  – randomly partition the dataset into K equally sized subsets P1, P2, …, PK
  – for each fold i in {1, 2, …, K}:
    • test on Pi, train on P1 ∪ … ∪ Pi−1 ∪ Pi+1 ∪ … ∪ PK
  – compute the average error/accuracy across the K folds (a sketch follows below).

4-fold cross-validation (figure)
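A minimal NumPy sketch of the cross-validation loop described above; the fit/error callables and all names are illustrative, not part of the lecture:

```python
import numpy as np

def k_fold_cv(x, t, K, fit, error):
    """Generic K-fold cross-validation: average error over the K folds.
    `fit(x_train, t_train)` returns a model; `error(model, x_test, t_test)` scores it."""
    idx = np.random.permutation(len(x))            # random partition into K folds
    folds = np.array_split(idx, K)
    errors = []
    for i in range(K):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(K) if j != i])
        model = fit(x[train], t[train])
        errors.append(error(model, x[test], t[test]))
    return np.mean(errors)
```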
Sum-of-Squares Error Function (Revisited)
• Training objective: minimize the sum-of-squares error.
• Why least squares?

E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 = \frac{1}{2} \sum_{n=1}^{N} \{ \mathbf{w}^T \boldsymbol{\phi}(x_n) - t_n \}^2
Least Squares <=> Maximum Likelihood (1)
• Assume observations from a deterministic function with added Gaussian noise:

t = y(x, \mathbf{w}) + \epsilon, \quad \text{where } p(\epsilon \mid \beta) = \mathcal{N}(\epsilon \mid 0, \beta^{-1})

which is the same as saying:

p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(x, \mathbf{w}), \beta^{-1})

• Given observed inputs X = {x1, ..., xN} and targets t = [t1, ..., tN]^T, we obtain the likelihood function:

p(\mathbf{t} \mid X, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^T \boldsymbol{\phi}(x_n), \beta^{-1})
Least Squares <=> Maximum Likelihood (2)
• Taking the logarithm, we get the log-likelihood function:

\ln p(\mathbf{t} \mid X, \mathbf{w}, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(\mathbf{w})

where

E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - \mathbf{w}^T \boldsymbol{\phi}(x_n) \}^2

• ED(w) is the sum-of-squares error!
Least Squares <=> Maximum Likelihood (3)
• Minimizing square error <=> maximizing likelihood:

\mathbf{w}^* = \arg\min_{\mathbf{w}} E_D(\mathbf{w}) = \arg\max_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \mathbf{w}_{ML}

• How do we find w (and β)?
Least Squares <=> Maximum Likelihood (4)
• Computing the gradient of the log-likelihood and setting it to zero yields:

0 = \sum_{n=1}^{N} t_n \boldsymbol{\phi}(x_n)^T - \mathbf{w}^T \left( \sum_{n=1}^{N} \boldsymbol{\phi}(x_n) \boldsymbol{\phi}(x_n)^T \right)

• Solving for w, we get:

\mathbf{w}_{ML} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t} = \boldsymbol{\Phi}^{\dagger} \mathbf{t}

where \boldsymbol{\Phi} is the design matrix with elements \Phi_{nj} = \phi_j(x_n), and \boldsymbol{\Phi}^{\dagger} \equiv (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T is the Moore-Penrose pseudo-inverse.
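A minimal NumPy sketch of this solution, assuming polynomial basis functions (the function names are illustrative, not from the slides):

```python
import numpy as np

def design_matrix(x, M):
    """Phi[n, j] = phi_j(x_n) with polynomial basis functions phi_j(x) = x**j."""
    return np.vander(x, M + 1, increasing=True)

def fit_ml(x, t, M):
    """Maximum-likelihood weights: w_ML = pinv(Phi) @ t (Moore-Penrose pseudo-inverse)."""
    Phi = design_matrix(x, M)
    return np.linalg.pinv(Phi) @ t
```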
Least Squares <=> Maximum Likelihood (5)
• Minimizing square error <=> maximizing likelihood:

\mathbf{w}^* = \arg\min_{\mathbf{w}} E_D(\mathbf{w}) = \arg\max_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \mathbf{w}_{ML}

• Maximizing with respect to w gives:

\mathbf{w}_{ML} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t}

• Maximizing with respect to β gives:

\frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{ t_n - \mathbf{w}_{ML}^T \boldsymbol{\phi}(x_n) \}^2
Regularized Least Square
• Consider the error function:

E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})

(data term + regularization term; λ is called the regularization coefficient)

• With the sum-of-squares error function and a quadratic regularizer, we get:

E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - \mathbf{w}^T \boldsymbol{\phi}(x_n) \}^2 + \frac{\lambda}{2} \mathbf{w}^T \mathbf{w}

which is minimized by:

\mathbf{w} = (\lambda \mathbf{I} + \boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t}
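A minimal NumPy sketch of this closed-form regularized solution, again assuming polynomial basis functions (names and defaults are illustrative):

```python
import numpy as np

def fit_regularized(x, t, M, lam):
    """Regularized least squares: w = (lam*I + Phi^T Phi)^-1 Phi^T t."""
    Phi = np.vander(x, M + 1, increasing=True)    # polynomial design matrix
    A = lam * np.eye(M + 1) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)
```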
Regularized Least Square <=> Maximum A Posteriori (MAP)
• Define a conjugate prior over w:

p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I})

• Combining this with the likelihood function and using results for marginal and conditional Gaussian distributions gives the posterior:

p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N), \quad \text{where } \mathbf{m}_N = \beta \mathbf{S}_N \boldsymbol{\Phi}^T \mathbf{t} \text{ and } \mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^T \boldsymbol{\Phi}
Regularized Least Square <=> Maximum A Posteriori (MAP)
• Taking the logarithm of the posterior distribution:

\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N} \{ t_n - \mathbf{w}^T \boldsymbol{\phi}(x_n) \}^2 - \frac{\alpha}{2} \mathbf{w}^T \mathbf{w} + \text{const}

• The MAP estimate of w is:

\mathbf{w}_{MAP} = \arg\max_{\mathbf{w}} \ln p(\mathbf{w} \mid \mathbf{t})
 = \arg\max_{\mathbf{w}} \left[ -\frac{1}{2} \sum_{n=1}^{N} \{ t_n - \mathbf{w}^T \boldsymbol{\phi}(x_n) \}^2 - \frac{\alpha}{2\beta} \mathbf{w}^T \mathbf{w} \right]
 = \arg\min_{\mathbf{w}} \left[ \frac{1}{2} \sum_{n=1}^{N} \{ t_n - \mathbf{w}^T \boldsymbol{\phi}(x_n) \}^2 + \frac{\lambda}{2} \mathbf{w}^T \mathbf{w} \right] \quad (\lambda = \alpha / \beta)
 = \arg\min_{\mathbf{w}} \left[ E_D(\mathbf{w}) + \lambda E_W(\mathbf{w}) \right]
Regularization & Occam’s Razor
William of Occam (1288 – 1348)
• English Franciscan friar, theologian and philosopher.
• “Entia non sunt multiplicanda praeter necessitatem”
  – Entities must not be multiplied beyond necessity.
  – i.e., do not make things needlessly complicated.
  – i.e., prefer the simplest hypothesis that fits the data.
Gradient Descent (Batch)
• Want to minimize a function f : R^n → R.
  – f is differentiable and convex.
  – compute the gradient of f, i.e. the direction of steepest increase:

\nabla f(\mathbf{x}) = \left[ \frac{\partial f}{\partial x_1}(\mathbf{x}), \frac{\partial f}{\partial x_2}(\mathbf{x}), \ldots, \frac{\partial f}{\partial x_n}(\mathbf{x}) \right]

  – choose a sequence of points x1, x2, … and a learning rate η such that:

\mathbf{x}^{(\tau+1)} = \mathbf{x}^{(\tau)} - \eta \nabla f(\mathbf{x}^{(\tau)})

• Sum-of-squares error:

E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ \mathbf{w}^T \boldsymbol{\phi}(x_n) - t_n \}^2
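A minimal NumPy sketch of batch gradient descent on the sum-of-squares error; the learning rate, iteration count, and names are illustrative, and no convergence check is included:

```python
import numpy as np

def batch_gradient_descent(Phi, t, eta=0.01, iters=1000):
    """Minimize E(w) = 0.5 * sum_n (w^T phi(x_n) - t_n)^2 by batch gradient descent.
    Phi is the N x (M+1) design matrix; eta is the learning rate."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        grad = Phi.T @ (Phi @ w - t)    # gradient of the sum-of-squares error
        w = w - eta * grad              # step against the gradient
    return w
```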
Gradient Descent (figures)
Stochastic Gradient Descent (Online)
• Decompose the error function into a sum of per-example errors:

E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ \mathbf{w}^T \boldsymbol{\phi}(x_n) - t_n \}^2 = \sum_{n=1}^{N} E_n(\mathbf{w})

• Update the parameters w after each example, sequentially:

\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_n(\mathbf{w}^{(\tau)}) = \mathbf{w}^{(\tau)} + \eta \left( t_n - \mathbf{w}^{(\tau)T} \boldsymbol{\phi}(x_n) \right) \boldsymbol{\phi}(x_n)

⇒ the least-mean-square (LMS) algorithm.
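A minimal NumPy sketch of the LMS updates (learning rate, epoch count, and names are illustrative):

```python
import numpy as np

def lms(Phi, t, eta=0.05, epochs=50):
    """Stochastic gradient descent (LMS): update w after each example."""
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        for n in np.random.permutation(len(t)):     # visit examples in random order
            w = w + eta * (t[n] - w @ Phi[n]) * Phi[n]
    return w
```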
Regularization: Ridge vs. Lasso
• Ridge regression:

E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} w_j^2

• Lasso:

E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|

  – If λ is sufficiently large, some of the coefficients wj are driven to 0 ⇒ sparse model.
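A small illustrative sketch using scikit-learn's Ridge and Lasso estimators (the toy data, the alpha value, and the degree-9 polynomial features are our own choices, not from the slides); with the same regularization strength, the lasso typically drives some coefficients exactly to zero while ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 30)
Phi = np.vander(x, 10, increasing=True)[:, 1:]    # features x, x^2, ..., x^9 (bias fit separately)

ridge = Ridge(alpha=0.1).fit(Phi, t)              # L2 penalty
lasso = Lasso(alpha=0.1, max_iter=10000).fit(Phi, t)   # L1 penalty
print("ridge non-zero coefficients:", np.sum(ridge.coef_ != 0))
print("lasso non-zero coefficients:", np.sum(lasso.coef_ != 0))
```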
Polynomial Curve Fitting (Revisited)
y(x) = y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j

(the w_j are the parameters; the powers x^j are the features)
Generalization: Basis Functions as Features
• Generally,

y(x, \mathbf{w}) = \sum_{j} w_j \phi_j(x) = \mathbf{w}^T \boldsymbol{\phi}(x)

where the ϕj(x) are known as basis functions.
• Typically ϕ0(x) = 1, so that w0 acts as a bias.
• In the simplest case, use linear basis functions: ϕd(x) = xd.
Linear Basis Function Models (1)
• Polynomial basis functions:

\phi_j(x) = x^j

• Global behavior:
  – a small change in x affects all basis functions.
Linear Basis Function Models (2)
• Gaussian basis functions:

\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)

• Local behavior:
  – a small change in x only affects nearby basis functions.
  – µj and s control location and scale (width).
Linear Basis Function Models (3)
• Sigmoidal basis functions:

\phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right), \quad \text{where } \sigma(a) = \frac{1}{1 + \exp(-a)}

• Local behavior:
  – a small change in x only affects nearby basis functions.
  – µj and s control location and scale (slope).
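A minimal NumPy sketch of design matrices built from these basis functions (the function names, the added bias column, and the parameter choices are illustrative):

```python
import numpy as np

def polynomial_basis(x, M):
    """phi_j(x) = x**j for j = 0..M (phi_0 = 1 acts as the bias)."""
    return np.vander(x, M + 1, increasing=True)

def gaussian_basis(x, mus, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), plus a constant bias column."""
    phi = np.exp(-(x[:, None] - mus[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])

def sigmoidal_basis(x, mus, s):
    """phi_j(x) = sigma((x - mu_j) / s), with sigma(a) = 1 / (1 + exp(-a))."""
    phi = 1.0 / (1.0 + np.exp(-(x[:, None] - mus[None, :]) / s))
    return np.hstack([np.ones((len(x), 1)), phi])
```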