
Statistical Machine Learning
© 2020 Ong & Walder & Webers
Data61 | CSIRO, The Australian National University

Outlines

Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2

Statistical Machine Learning

Christian Walder

Machine Learning Research Group, CSIRO Data61
and
College of Engineering and Computer Science, The Australian National University
Canberra, Semester One, 2020.

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")


Part III

Linear Regression 1


Linear Regression

N = 10
x ≡ (x_1, . . . , x_N)^T
t ≡ (t_1, . . . , t_N)^T
x_i ∈ R, t_i ∈ R, i = 1, . . . , N

[Figure: scatter plot of the N = 10 training points (x, t)]

Predictor y(x, w)?
Performance measure?
Optimal solution w∗?
Recall: projection, inverse


Probabilities, Losses

Gaussian Distribution
Bayes Rule
Expected Loss


Linear Curve Fitting - Least Squares

N = 10
x ≡ (x_1, . . . , x_N)^T
t ≡ (t_1, . . . , t_N)^T
x_i ∈ R, t_i ∈ R, i = 1, . . . , N

y(x, w) = w_1 x + w_0
X ≡ [x 1]
w∗ = (X^T X)⁻¹ X^T t

[Figure: least-squares line fitted to the N = 10 training points (x, t)]

We assume

t = y(x, w) + ε

where y(x, w) is deterministic and ε is Gaussian noise.
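
A minimal sketch of this closed-form fit in numpy, with made-up data standing in for the N = 10 points on the slide (the true slope, bias and noise level below are illustrative assumptions):

    import numpy as np

    # Hypothetical data standing in for the N = 10 points on the slide.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 10)
    t = 0.7 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

    # Design matrix X = [x 1]: one column for the slope w_1, one for the bias w_0.
    X = np.column_stack([x, np.ones_like(x)])

    # Normal equations w* = (X^T X)^{-1} X^T t, solved without forming the inverse.
    w_star = np.linalg.solve(X.T @ X, X.T @ t)
    print("w_1 (slope), w_0 (bias):", w_star)

Solving the linear system is preferred over computing (X^T X)⁻¹ explicitly, for numerical stability.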


Curve fitting - revisited

a priori belief about the parameter w is captured in the prior probability p(w)
observed data D = {t_1, . . . , t_N}
calculate the belief in w after the data D have been observed:

p(w | D) = p(D | w) p(w) / p(D)

p(D | w), viewed as a function of w, is the likelihood function
the likelihood expresses how probable the data are for different values of w; it is not a probability density with respect to w (but it is with respect to D; prove it)


Maximum Likelihood

Consider the linear regression problem, where we have random variables x_n and t_n.
We assume a conditional model t_n | x_n.
We propose a distribution, parameterized by θ:

t_n | x_n ∼ density(θ)

For a given θ, the density defines the probability of observing t_n | x_n.
We are interested in finding the θ that maximises the probability (called the likelihood) of the data.


Likelihood Function - Frequentist versus Bayesian

Likelihood function p(D | w)

Frequentist approach:
w is considered a fixed parameter
its value is defined by some 'estimator'
error bars on the estimated w are obtained from the distribution of possible data sets D

Bayesian approach:
there is only one single data set D
uncertainty in the parameters comes from a probability distribution over w


Frequentist Estimator - Maximum Likelihood

choose the w for which the likelihood p(D | w) (the probability of the observed data) is maximal
the most common heuristic for learning a single fixed w
equivalently: the error function is the negative log of the likelihood function, to be minimised
log is a monotonic function
maximising the likelihood ⇐⇒ minimising the error
Example: a fair-looking coin is tossed three times, always landing on heads. The maximum likelihood estimate of the probability of landing heads will give 1.
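
As a short worked check of the coin example (writing μ for the head probability, a symbol not used on the slide, and assuming independent Bernoulli tosses):

    p(\mathcal{D} \mid \mu) = \mu \cdot \mu \cdot \mu = \mu^{3},
    \qquad
    \frac{d}{d\mu} \ln p(\mathcal{D} \mid \mu) = \frac{3}{\mu} > 0 \ \text{on } (0, 1],
    \qquad\text{so}\qquad
    \mu_{\mathrm{ML}} = 1 .

The likelihood is strictly increasing in μ, so it is maximised at the boundary μ = 1, even though our prior belief is that the coin is fair.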


Bayesian Approach

including prior knowledge is easy (via the prior over w)
the choice of prior is subjective; it allows better results by incorporating domain knowledge
sometimes the choice of prior is motivated by a convenient mathematical form
the prior becomes irrelevant as N → ∞, but helps for small N
need to sum/integrate over the whole parameter space
    advances in sampling (Markov Chain Monte Carlo methods)
    advances in approximation schemes (Variational Bayes, Expectation Propagation)
there is no true w:


Regression

Given a training data set of N observations {x_n} and target values t_n.
Goal: learn to predict the value of one or more target values t given a new value of the input x.
Example: polynomial curve fitting (see Introduction).

[Figure: training points (x, t), with a new input x whose target t is marked by "?"]


Supervised Learning: (non-Bayesian) Point Estimate

Training phase: a model with adjustable parameter w is fitted to the training data x and training targets t; fix the most appropriate w*.
Test phase: the model with fixed parameter w* is applied to the test data x to predict the test target t.

[Figure: block diagram of the training phase and the test phase]


Why Linear Regression?

Analytic solution when minimising the sum of squared errors
Well understood statistical behaviour
Efficient algorithms exist for convex losses and regularizers
But what if the relationship is non-linear?

[Figure: data with a clearly non-linear relationship between x and t]


Linear Basis Function Models

Linear combination of fixed nonlinear basis functions

y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x)

parameter vector w = (w_0, . . . , w_{M−1})^T
basis functions φ(x) = (φ_0(x), . . . , φ_{M−1}(x))^T
convention: φ_0(x) = 1
w_0 is the bias parameter
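
A minimal sketch of this model in numpy; the polynomial choice φ_j(x) = x^j and the example weights are illustrative assumptions (any fixed basis would do):

    import numpy as np

    def phi(x, M):
        """Feature map phi(x) = (phi_0(x), ..., phi_{M-1}(x))^T; here phi_j(x) = x**j."""
        return np.array([x**j for j in range(M)])

    def y(x, w):
        """Linear basis function model y(x, w) = w^T phi(x)."""
        return w @ phi(x, M=len(w))

    w = np.array([0.5, -1.0, 0.3, 0.1])  # example weights; w[0] is the bias w_0
    print(y(2.0, w))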


Polynomial Basis Functions

Scalar input variable x

φ_j(x) = x^j

Limitation: polynomials are global functions of the input variable x, so the learned function will extrapolate poorly.

[Figure: polynomial basis functions x^j on the interval [−1, 1]]


’Gaussian’ Basis Functions

Scalar input variable x

φ_j(x) = exp{ −(x − μ_j)² / (2s²) }

Not a probability distribution.
No normalisation required; it is taken care of by the model parameters w.
Well behaved away from the data (though pulled to zero).

[Figure: Gaussian basis functions with different centres μ_j on the interval [−1, 1]]
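
A small numpy sketch of this basis; the centres μ_j and the width s below are arbitrary illustrative choices:

    import numpy as np

    def gaussian_basis(x, centres, s):
        """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), one column per centre mu_j."""
        x = np.asarray(x, dtype=float).reshape(-1, 1)
        return np.exp(-(x - centres.reshape(1, -1))**2 / (2.0 * s**2))

    centres = np.linspace(-1.0, 1.0, 5)                  # example centres mu_j
    Phi = gaussian_basis(np.linspace(-1, 1, 7), centres, s=0.3)
    print(Phi.shape)                                     # (7, 5): 7 inputs x 5 basis functions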


Sigmoidal Basis Functions

Scalar input variable x

φ_j(x) = σ( (x − μ_j) / s )

where σ(a) is the logistic sigmoid function defined by

σ(a) = 1 / (1 + exp(−a))

σ(a) is related to the hyperbolic tangent tanh(a) by tanh(a) = 2σ(2a) − 1.

[Figure: sigmoidal basis functions with different centres μ_j on the interval [−1, 1]]
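
A small numpy sketch of the sigmoidal basis, with a numerical check of the identity tanh(a) = 2σ(2a) − 1 (the centres and width are illustrative):

    import numpy as np

    def sigmoid(a):
        """Logistic sigmoid sigma(a) = 1 / (1 + exp(-a))."""
        return 1.0 / (1.0 + np.exp(-a))

    def sigmoid_basis(x, centres, s):
        """phi_j(x) = sigma((x - mu_j) / s), one column per centre mu_j."""
        x = np.asarray(x, dtype=float).reshape(-1, 1)
        return sigmoid((x - centres.reshape(1, -1)) / s)

    a = np.linspace(-3, 3, 7)
    print(np.allclose(np.tanh(a), 2 * sigmoid(2 * a) - 1))   # True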


Other Basis Functions

Fourier basis: each basis function represents a specific frequency and has infinite spatial extent.
Wavelets: localised in both space and frequency (also mutually orthogonal, to simplify application).
Splines: piecewise polynomials restricted to regions of the input space, with additional constraints where the pieces meet, e.g. smoothness constraints → conditions on the derivatives.

[Figure: four panels showing linear, quadratic, cubic and quartic spline fits]

Approximate the points {(0, 0), (1, 1), (2, −1), (3, 0), (4, −2), (5, 1)} by different splines.


Maximum Likelihood and Least Squares

No special assumption about the basis functions φ_j(x). In the simplest case, one can think of φ_j(x) = x_j, or φ(x) = x.
Assume the target t is given by

t = y(x, w) + ε

where y(x, w) is deterministic and ε is a zero-mean Gaussian random variable with precision (inverse variance) β.
Thus

p(t | x, w, β) = N(t | y(x, w), β⁻¹)

[Figure: the conditional density p(t | x_0), centred on y(x_0), for the regression function y(x)]


Maximum Likelihood and Least Squares

The likelihood of one target t given the data x was

p(t | x, w, β) = N(t | y(x, w), β⁻¹)

Now consider a set of inputs X with corresponding target values t.
Assume the data are independent and identically distributed (i.i.d.), meaning they are drawn independently and from the same distribution. The likelihood of the targets t is then

p(t | X, w, β) = Π_{n=1}^{N} N(t_n | y(x_n, w), β⁻¹) = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β⁻¹)

From now on we drop the conditioning variable X from the notation, since with supervised learning we do not seek to model the distribution of the input data.


Maximum Likelihood and Least Squares

Consider the logarithm of the likelihood p(t | w, β) (the logarithm is a monotone function!)

ln p(t | w, β) = Σ_{n=1}^{N} ln N(t_n | w^T φ(x_n), β⁻¹)
               = Σ_{n=1}^{N} ln( √(β/(2π)) exp{ −(β/2) (t_n − w^T φ(x_n))² } )
               = (N/2) ln β − (N/2) ln(2π) − β E_D(w)

where the sum-of-squares error function is

E_D(w) = (1/2) Σ_{n=1}^{N} { t_n − w^T φ(x_n) }².

argmax_w ln p(t | w, β) → argmin_w E_D(w)


Maximum Likelihood and Least Squares

Goal: find a more compact representation.
Rewrite the error function

E_D(w) = (1/2) Σ_{n=1}^{N} { t_n − w^T φ(x_n) }² = (1/2) (t − Φw)^T (t − Φw)

where t = (t_1, . . . , t_N)^T, and Φ is the N × M design matrix with entries Φ_{nj} = φ_j(x_n):

Φ = [ φ_0(x_1)   φ_1(x_1)   . . .   φ_{M−1}(x_1)
      φ_0(x_2)   φ_1(x_2)   . . .   φ_{M−1}(x_2)
        ...         ...     . . .       ...
      φ_0(x_N)   φ_1(x_N)   . . .   φ_{M−1}(x_N) ]


Maximum Likelihood and Least Squares

The log likelihood is now

ln p(t | w, β) = (N/2) ln β − (N/2) ln(2π) − β E_D(w)
               = (N/2) ln β − (N/2) ln(2π) − (β/2) (t − Φw)^T (t − Φw)

Find the critical points of ln p(t | w, β). The gradient with respect to w is

∇_w ln p(t | w, β) = β Φ^T (t − Φw).

Setting the gradient to zero gives

0 = Φ^T t − Φ^T Φ w,

which results in

w_ML = (Φ^T Φ)⁻¹ Φ^T t = Φ† t

where Φ† is the Moore-Penrose pseudo-inverse of the matrix Φ.
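
A minimal numpy sketch of this solution on made-up data (the sinusoidal targets and the polynomial basis are illustrative assumptions); np.linalg.lstsq is used because it applies the pseudo-inverse Φ† in a numerically stable way:

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(0, 1, 20)                                   # hypothetical inputs
    t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.shape)  # hypothetical noisy targets

    M = 4
    Phi = np.column_stack([x**j for j in range(M)])   # Phi[n, j] = phi_j(x_n) = x_n**j

    # w_ML = (Phi^T Phi)^{-1} Phi^T t = Phi^dagger t
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    print(w_ml)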


Maximum Likelihood and Least Squares

The log likelihood with the optimal w_ML is now

ln p(t | w_ML, β) = (N/2) ln β − (N/2) ln(2π) − (β/2) (t − Φ w_ML)^T (t − Φ w_ML)

Finding the critical points of ln p(t | w_ML, β) with respect to β,

∂ ln p(t | w_ML, β) / ∂β = 0,

results in

1/β_ML = (1/N) (t − Φ w_ML)^T (t − Φ w_ML)

Note: we can first find the maximum likelihood solution for w, as it does not depend on β. Then we can use w_ML to find the maximum likelihood solution for β.
Could we have chosen to optimise with respect to β first, and then with respect to w?
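
A tiny numpy sketch of this estimate; the residual vector below is a made-up placeholder for t − Φ w_ML from a fitted model:

    import numpy as np

    residual = np.array([0.05, -0.12, 0.08, -0.03, 0.10])   # hypothetical t - Phi @ w_ml

    # 1 / beta_ML = (1/N) ||t - Phi w_ML||^2, i.e. the mean squared residual.
    beta_ml = 1.0 / np.mean(residual**2)
    print("beta_ML:", beta_ml, " noise variance estimate:", 1.0 / beta_ml)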


Sequential Learning - Stochastic Gradient Descent

For large data sets, calculating the maximum likelihood parameters w_ML and β_ML may be costly.
For online applications, the data are never all in memory.
Use a sequential algorithm (online algorithm).
If the error function is a sum over data points, E = Σ_n E_n, then
1. initialise w^(0) to some starting value
2. update the parameter vector at iteration τ + 1 by

w^(τ+1) = w^(τ) − η ∇E_n,

where E_n is the error function after presenting the nth data point, and η is the learning rate.


Sequential Learning - Stochastic Gradient Descent

For the sum-of-squares error function, stochastic gradient descent results in

w^(τ+1) = w^(τ) + η ( t_n − w^(τ)T φ(x_n) ) φ(x_n)

The value of the learning rate must be chosen carefully: too large a learning rate may prevent the algorithm from converging, while too small a learning rate makes the algorithm follow the data unnecessarily slowly.
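
A minimal numpy sketch of this update rule; the data set, polynomial basis, learning rate and number of passes are all illustrative assumptions, not values from the slides:

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(0, 1, 50)
    t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.shape)

    M = 4
    Phi = np.column_stack([x**j for j in range(M)])   # phi_j(x_n) = x_n**j

    w = np.zeros(M)        # w^(0)
    eta = 0.1              # learning rate
    for epoch in range(100):                  # several passes over the data
        for n in rng.permutation(len(x)):     # present one data point at a time
            # w <- w + eta * (t_n - w^T phi(x_n)) phi(x_n)
            w = w + eta * (t[n] - w @ Phi[n]) * Phi[n]
    print(w)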


Regularized Least Squares

Add regularisation in order to prevent overfitting:

E_D(w) + λ E_W(w)

with regularisation coefficient λ.
Simple quadratic regulariser:

E_W(w) = (1/2) w^T w

The solution minimising the regularised error is

w = (λI + Φ^T Φ)⁻¹ Φ^T t
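
A small numpy sketch of the regularised solution w = (λI + Φ^T Φ)⁻¹ Φ^T t, on made-up data with an illustrative choice of λ:

    import numpy as np

    rng = np.random.default_rng(3)
    x = np.linspace(0, 1, 20)
    t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.shape)

    M = 8
    Phi = np.column_stack([x**j for j in range(M)])

    lam = 1e-3   # regularisation coefficient lambda (illustrative)
    # w = (lambda I + Phi^T Phi)^{-1} Phi^T t
    w_reg = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
    print(w_reg)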


Regularized Least Squares

More general regulariser

E_W(w) = (1/2) Σ_{j=1}^{M} |w_j|^q

q = 1 (lasso) leads to a sparse model if λ is large enough.

[Figure: contours of the regulariser Σ_j |w_j|^q for q = 0.5, q = 1, q = 2 and q = 4]


Lagrangian Dual View of the Regulariser

By the Lagrange multiplier method, minimization of the regularized error function

(1/2) Σ_{n=1}^{N} (t_n − w^T φ(x_n))² + (λ/2) Σ_{j=1}^{M} |w_j|^q

is equivalent to minimizing the unregularized sum-of-squares error

(1/2) Σ_{n=1}^{N} (t_n − w^T φ(x_n))²   subject to   Σ_{j=1}^{M} |w_j|^q ≤ η.

This yields the figures on the next slide.


Comparison of Quadratic and Lasso Regulariser

Quadratic regulariser

(1/2) Σ_{j=1}^{M} w_j²

Lasso regulariser

(1/2) Σ_{j=1}^{M} |w_j|

[Figure: for each regulariser, the constraint region in the (w_1, w_2) plane together with the contours of the unregularised error and the optimum w∗]


Multiple Outputs

More than one target variable per data point.
y becomes a vector instead of a scalar. Each dimension can be treated with a different set of basis functions (and that may be necessary if the data in the different target dimensions represent very different types of information).
Here we restrict ourselves to the SAME basis functions:

y(x, w) = W^T φ(x)

where y is a K-dimensional column vector, W is an M × K matrix of model parameters, and φ(x) = (φ_0(x), . . . , φ_{M−1}(x))^T, with φ_0(x) = 1, as before.
Define the target matrix T containing the target vector t_n^T in the nth row.


Multiple Outputs

Suppose the conditional distribution of the target vector is an isotropic Gaussian of the form

p(t | x, W, β) = N(t | W^T φ(x), β⁻¹ I).

The log likelihood is then

ln p(T | X, W, β) = Σ_{n=1}^{N} ln N(t_n | W^T φ(x_n), β⁻¹ I)
                  = (NK/2) ln(β/(2π)) − (β/2) Σ_{n=1}^{N} ‖t_n − W^T φ(x_n)‖²


Multiple Outputs

Maximisation with respect to W results in

W_ML = (Φ^T Φ)⁻¹ Φ^T T.

For each target variable t_k, we get

w_k = (Φ^T Φ)⁻¹ Φ^T t_k = Φ† t_k.

The solution decouples between the different target variables.
This also holds for a general Gaussian noise distribution with an arbitrary covariance matrix.
Why? W defines the mean of the Gaussian noise distribution, and the maximum likelihood solution for the mean of a multivariate Gaussian is independent of the covariance.
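
A small numerical check of the decoupling, on made-up data (the shapes N, M and K are arbitrary illustrative choices):

    import numpy as np

    rng = np.random.default_rng(4)
    N, M, K = 30, 5, 3
    Phi = rng.normal(size=(N, M))       # hypothetical design matrix
    T = rng.normal(size=(N, K))         # hypothetical targets, t_n^T in row n

    W_ml = np.linalg.pinv(Phi) @ T      # W_ML = Phi^dagger T, all outputs at once

    # Solving each output column separately, w_k = Phi^dagger t_k, gives the same result.
    W_cols = np.column_stack([np.linalg.pinv(Phi) @ T[:, k] for k in range(K)])
    print(np.allclose(W_ml, W_cols))    # True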


Loss Function for Regression

Over-fitting results from a large number of basis functions and a relatively small training set.
Regularisation can prevent overfitting, but how do we find the correct value for the regularisation constant λ?
The frequentist viewpoint of model complexity is the bias-variance trade-off.


Loss Function for Regression

Choose an estimator y(x) to estimate the target value t for each input x.
Choose a loss function L(t, y(x)) which measures the difference between the target t and the estimate y(x).
The expected loss is then

E[L] = ∫∫ L(t, y(x)) p(x, t) dx dt

Common choice: squared loss

L(t, y(x)) = { y(x) − t }².

Expected loss for the squared loss function:

E[L] = ∫∫ { y(x) − t }² p(x, t) dx dt.


Loss Function for Regression

Expected loss for the squared loss function

E[L] = ∫∫ { y(x) − t }² p(x, t) dx dt.

Minimise E[L] by choosing the regression function

y(x) = ( ∫ t p(x, t) dt ) / p(x) = ∫ t p(t | x) dt = E_t[t | x]

(calculus of variations is not required to derive this result; we may work point-wise by fixing an x and using stationarity to solve for y(x). Why is that sufficient?)
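
As a sketch of the point-wise argument hinted at above (fix x, treat y(x) as a single number, differentiate the inner integral and set it to zero):

    \frac{\partial}{\partial y(x)} \int \{y(x) - t\}^2\, p(x, t)\, dt
      = 2 \int \{y(x) - t\}\, p(x, t)\, dt
      = 2\Big( y(x)\, p(x) - \int t\, p(x, t)\, dt \Big) = 0
    \quad\Longrightarrow\quad
    y(x) = \frac{\int t\, p(x, t)\, dt}{p(x)} = \int t\, p(t \mid x)\, dt = \mathbb{E}_t[t \mid x].

(One way to see why point-wise minimisation is sufficient: the inner integral is non-negative for every x, and y(x) may be chosen independently for each x, so minimising it point-wise minimises the double integral.)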


Optimal Predictor for Squared Loss

The regression function which minimises the expected squared loss is given by the mean of the conditional distribution p(t | x).

[Figure: the regression function y(x), with the conditional density p(t | x_0) centred on y(x_0) at a point x_0]


Analysing the Squared Loss (1)

Analyse the expected loss

E[L] = ∫∫ { y(x) − t }² p(x, t) dx dt.

Rewrite the squared loss:

{ y(x) − t }² = { y(x) − E[t | x] + E[t | x] − t }²
              = { y(x) − E[t | x] }² + { E[t | x] − t }² + 2 { y(x) − E[t | x] }{ E[t | x] − t }

Claim:

∫∫ { y(x) − E[t | x] }{ E[t | x] − t } p(x, t) dx dt = 0.


Analysing the Squared Loss (2)

Claim:

∫∫ { y(x) − E[t | x] }{ E[t | x] − t } p(x, t) dx dt = 0.

Separate the factor depending only on x from the integral over t:

∫ { y(x) − E[t | x] } ( ∫ { E[t | x] − t } p(x, t) dt ) dx

Calculate the integral over t:

∫ { E[t | x] − t } p(x, t) dt = E[t | x] p(x) − p(x) ∫ t p(x, t)/p(x) dt
                              = E[t | x] p(x) − p(x) E[t | x]
                              = 0


Analysing the Squared Loss (3)

The expected loss is now

E[L] = ∫ { y(x) − E[t | x] }² p(x) dx + ∫ var[t | x] p(x) dx     (1)

Minimise the first term by choosing y(x) = E[t | x] (as we saw already).
The second term represents the intrinsic variability of the target data and can be regarded as noise. It is independent of the choice of y(x) and cannot be reduced by learning a better y(x).


The Bias-Variance Decomposition (1)

Consider again the squared loss, for which the optimal prediction is given by the conditional expectation h(x)

h(x) = E[t | x] = ∫ t p(t | x) dt.

Since h(x) is unavailable to us, it must be estimated from a (finite) data set D.
D is a finite sample from the unknown joint distribution p(x, t).
Denote the dependency of the learned function on the data by y(x; D).
Evaluate the performance of the algorithm by taking the expectation E_D[L] over all data sets D.


The Bias-Variance Decomposition (2)

Taking the expectation over data sets D, using Eqn 1, and interchanging the order of expectations for the first term:

E_D[E[L]] = ∫ E_D[ { y(x; D) − h(x) }² ] p(x) dx + ∫∫ { h(x) − t }² p(x, t) dx dt

Again, add and subtract the expectation E_D[y(x; D)]:

{ y(x; D) − h(x) }² = { y(x; D) − E_D[y(x; D)] + E_D[y(x; D)] − h(x) }²

and show that the mixed term vanishes under the expectation E_D.


The Bias-Variance Decomposition (3)

Expected loss E_D[L] over all data sets D:

expected loss = (bias)² + variance + noise

where

(bias)² = ∫ { E_D[y(x; D)] − h(x) }² p(x) dx
variance = ∫ E_D[ { y(x; D) − E_D[y(x; D)] }² ] p(x) dx
noise = ∫∫ { h(x) − t }² p(x, t) dx dt.

(bias)²: how accurate is the model across different training sets? (How much does the average prediction over all data sets differ from the desired regression function?)
variance: how sensitive is the model to small changes in the training set? (How much do the solutions for individual data sets vary around their average?)
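
A compact simulation sketch of this decomposition in numpy; the sinusoidal ground truth h(x), the Gaussian basis model, the noise level and the regularisation value are illustrative assumptions, loosely in the spirit of the experiment on the following slides:

    import numpy as np

    rng = np.random.default_rng(5)
    h = lambda x: np.sin(2 * np.pi * x)    # assumed "true" regression function h(x)
    noise_std = 0.3                        # assumed noise level
    lam = np.exp(-1.0)                     # regularisation coefficient (illustrative)

    def gaussian_design(x, centres, s=0.1):
        # Phi[n, j] = exp(-(x_n - mu_j)^2 / (2 s^2))
        return np.exp(-(x[:, None] - centres[None, :])**2 / (2 * s**2))

    centres = np.linspace(0, 1, 24)
    x_test = np.linspace(0, 1, 200)
    Phi_test = gaussian_design(x_test, centres)

    # Fit the same regularised model to L independent data sets D of size N.
    L_sets, N = 100, 25
    preds = np.empty((L_sets, x_test.size))
    for l in range(L_sets):
        x = rng.uniform(0, 1, N)
        t = h(x) + noise_std * rng.normal(size=N)
        Phi = gaussian_design(x, centres)
        w = np.linalg.solve(lam * np.eye(len(centres)) + Phi.T @ Phi, Phi.T @ t)
        preds[l] = Phi_test @ w            # y(x; D) evaluated on a test grid

    y_bar = preds.mean(axis=0)                  # E_D[y(x; D)]
    bias2 = np.mean((y_bar - h(x_test))**2)     # (bias)^2, averaged over x
    variance = np.mean((preds - y_bar)**2)      # variance, averaged over x and D
    print("bias^2:", bias2, " variance:", variance, " noise:", noise_std**2)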


The Bias-Variance Decomposition

Simple models have low variance and high bias.

[Figure: fits with ln λ = 2.6. Left: result of fitting the model to 100 data sets, only 25 shown. Right: average of the 100 fits in red, and the sinusoidal function from which the data were created in green.]


The Bias-Variance Decomposition

Complex models have high variance and low bias.

[Figure: fits with ln λ = −2.4. Left: result of fitting the model to 100 data sets, only 25 shown. Right: average of the 100 fits in red, and the sinusoidal function from which the data were created in green.]


The Bias-Variance Decomposition

Dependence of bias and variance on the model complexity.
Squared bias, variance, their sum, and the test error.
The minimum of (bias)² + variance occurs close to the value of ln λ that gives the minimum test error.

[Figure: (bias)², variance, (bias)² + variance and the test error plotted against ln λ]


Unbiased Estimators

You may have encountered unbiased estimators.
Why guarantee zero bias? To quote the pioneer of Bayesian inference, Edwin Jaynes, from his book Probability Theory: The Logic of Science (2003):


The Bias-Variance Decomposition

Trade-off between bias and variance:
simple models have low variance and high bias
complex models have high variance and low bias
The sum of bias and variance has a minimum at a certain model complexity.
Expected loss E_D[L] over all data sets D:

expected loss = (bias)² + variance + noise.

The noise comes from the data and cannot be removed from the expected loss.
To analyse the bias-variance decomposition, many data sets are needed, which are not always available.