
Statistical Machine Learning
© 2020 Ong & Walder & Webers
Data61 | CSIRO, The Australian National University

Outlines

Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2

Statistical Machine Learning

Christian Walder

Machine Learning Research Group, CSIRO Data61
and
College of Engineering and Computer Science, The Australian National University
Canberra, Semester One, 2020.

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")


Part III

Linear Regression 1


Linear Regression

N = 10
x ≡ (x_1, . . . , x_N)^T
t ≡ (t_1, . . . , t_N)^T
x_i ∈ R, t_i ∈ R, i = 1, . . . , N

[Figure: scatter plot of the N = 10 training points (x, t)]

Predictor y(x, w)?
Performance measure?
Optimal solution w∗?
Recall: projection, inverse


Probabilities, Losses

Gaussian Distribution
Bayes Rule
Expected Loss


Linear Curve Fitting - Least Squares

N = 10
x ≡ (x_1, . . . , x_N)^T
t ≡ (t_1, . . . , t_N)^T
x_i ∈ R, t_i ∈ R, i = 1, . . . , N

y(x, w) = w_1 x + w_0
X ≡ [x 1]
w∗ = (X^T X)⁻¹ X^T t

[Figure: least-squares line fitted to the N = 10 training points (x, t)]

We assume

t = y(x, w) + ε

where y(x, w) is deterministic and ε is Gaussian noise.
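
A minimal sketch of this closed-form fit in numpy, with made-up data standing in for the N = 10 points on the slide (the true slope, bias and noise level below are illustrative assumptions):

    import numpy as np

    # Hypothetical data standing in for the N = 10 points on the slide.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 10)
    t = 0.7 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

    # Design matrix X = [x 1]: one column for the slope w_1, one for the bias w_0.
    X = np.column_stack([x, np.ones_like(x)])

    # Normal equations w* = (X^T X)^{-1} X^T t, solved without forming the inverse.
    w_star = np.linalg.solve(X.T @ X, X.T @ t)
    print("w_1 (slope), w_0 (bias):", w_star)

Solving the linear system is preferred over computing (X^T X)⁻¹ explicitly, for numerical stability.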


Curve fitting - revisited

a priori belief about the parameter w is captured in the prior probability p(w)
observed data D = {t_1, . . . , t_N}
calculate the belief in w after the data D have been observed:

p(w | D) = p(D | w) p(w) / p(D)

p(D | w), viewed as a function of w, is the likelihood function
the likelihood expresses how probable the data are for different values of w; it is not a probability density with respect to w (but it is with respect to D; prove it)


Maximum Likelihood

Consider the linear regression problem, where we have random variables x_n and t_n.
We assume a conditional model t_n | x_n.
We propose a distribution, parameterized by θ:

t_n | x_n ∼ density(θ)

For a given θ, the density defines the probability of observing t_n | x_n.
We are interested in finding the θ that maximises the probability (called the likelihood) of the data.


Likelihood Function - Frequentist versus Bayesian

Likelihood function p(D | w)

Frequentist approach:
w is considered a fixed parameter
its value is defined by some 'estimator'
error bars on the estimated w are obtained from the distribution of possible data sets D

Bayesian approach:
there is only one single data set D
uncertainty in the parameters comes from a probability distribution over w


Frequentist Estimator - Maximum Likelihood

choose the w for which the likelihood p(D | w) (the probability of the observed data) is maximal
the most common heuristic for learning a single fixed w
equivalently: the error function is the negative log of the likelihood function, to be minimised
log is a monotonic function
maximising the likelihood ⇐⇒ minimising the error
Example: a fair-looking coin is tossed three times, always landing on heads. The maximum likelihood estimate of the probability of landing heads will give 1.
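
As a short worked check of the coin example (writing μ for the head probability, a symbol not used on the slide, and assuming independent Bernoulli tosses):

    p(\mathcal{D} \mid \mu) = \mu \cdot \mu \cdot \mu = \mu^{3},
    \qquad
    \frac{d}{d\mu} \ln p(\mathcal{D} \mid \mu) = \frac{3}{\mu} > 0 \ \text{on } (0, 1],
    \qquad\text{so}\qquad
    \mu_{\mathrm{ML}} = 1 .

The likelihood is strictly increasing in μ, so it is maximised at the boundary μ = 1, even though our prior belief is that the coin is fair.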


Bayesian Approach

including prior knowledge is easy (via the prior over w)
the choice of prior is subjective; it allows better results by incorporating domain knowledge
sometimes the choice of prior is motivated by a convenient mathematical form
the prior becomes irrelevant as N → ∞, but helps for small N
need to sum/integrate over the whole parameter space
    advances in sampling (Markov Chain Monte Carlo methods)
    advances in approximation schemes (Variational Bayes, Expectation Propagation)
there is no true w:


Regression

Given a training data set of N observations {x_n} and target values t_n.
Goal: learn to predict the value of one or more target values t given a new value of the input x.
Example: polynomial curve fitting (see Introduction).

[Figure: training points (x, t), with a new input x whose target t is marked by "?"]


Supervised Learning: (non-Bayesian) Point Estimate

Training phase: a model with adjustable parameter w is fitted to the training data x and training targets t; fix the most appropriate w*.
Test phase: the model with fixed parameter w* is applied to the test data x to predict the test target t.

[Figure: block diagram of the training phase and the test phase]


Why Linear Regression?

Analytic solution when minimising the sum of squared errors
Well understood statistical behaviour
Efficient algorithms exist for convex losses and regularizers
But what if the relationship is non-linear?

[Figure: data with a clearly non-linear relationship between x and t]


Linear Basis Function Models

Linear combination of fixed nonlinear basis functions

y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x)

parameter vector w = (w_0, . . . , w_{M−1})^T
basis functions φ(x) = (φ_0(x), . . . , φ_{M−1}(x))^T
convention: φ_0(x) = 1
w_0 is the bias parameter
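
A minimal sketch of this model in numpy; the polynomial choice φ_j(x) = x^j and the example weights are illustrative assumptions (any fixed basis would do):

    import numpy as np

    def phi(x, M):
        """Feature map phi(x) = (phi_0(x), ..., phi_{M-1}(x))^T; here phi_j(x) = x**j."""
        return np.array([x**j for j in range(M)])

    def y(x, w):
        """Linear basis function model y(x, w) = w^T phi(x)."""
        return w @ phi(x, M=len(w))

    w = np.array([0.5, -1.0, 0.3, 0.1])  # example weights; w[0] is the bias w_0
    print(y(2.0, w))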


Polynomial Basis Functions

Scalar input variable x

φ_j(x) = x^j

Limitation: polynomials are global functions of the input variable x, so the learned function will extrapolate poorly.

[Figure: polynomial basis functions x^j on the interval [−1, 1]]


’Gaussian’ Basis Functions

Scalar input variable x

φ_j(x) = exp{ −(x − μ_j)² / (2s²) }

Not a probability distribution.
No normalisation required; it is taken care of by the model parameters w.
Well behaved away from the data (though pulled to zero).

[Figure: Gaussian basis functions with different centres μ_j on the interval [−1, 1]]
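
A small numpy sketch of this basis; the centres μ_j and the width s below are arbitrary illustrative choices:

    import numpy as np

    def gaussian_basis(x, centres, s):
        """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), one column per centre mu_j."""
        x = np.asarray(x, dtype=float).reshape(-1, 1)
        return np.exp(-(x - centres.reshape(1, -1))**2 / (2.0 * s**2))

    centres = np.linspace(-1.0, 1.0, 5)                  # example centres mu_j
    Phi = gaussian_basis(np.linspace(-1, 1, 7), centres, s=0.3)
    print(Phi.shape)                                     # (7, 5): 7 inputs x 5 basis functions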


Sigmoidal Basis Functions

Scalar input variable x

φ_j(x) = σ( (x − μ_j) / s )

where σ(a) is the logistic sigmoid function defined by

σ(a) = 1 / (1 + exp(−a))

σ(a) is related to the hyperbolic tangent tanh(a) by tanh(a) = 2σ(2a) − 1.

[Figure: sigmoidal basis functions with different centres μ_j on the interval [−1, 1]]
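
A small numpy sketch of the sigmoidal basis, with a numerical check of the identity tanh(a) = 2σ(2a) − 1 (the centres and width are illustrative):

    import numpy as np

    def sigmoid(a):
        """Logistic sigmoid sigma(a) = 1 / (1 + exp(-a))."""
        return 1.0 / (1.0 + np.exp(-a))

    def sigmoid_basis(x, centres, s):
        """phi_j(x) = sigma((x - mu_j) / s), one column per centre mu_j."""
        x = np.asarray(x, dtype=float).reshape(-1, 1)
        return sigmoid((x - centres.reshape(1, -1)) / s)

    a = np.linspace(-3, 3, 7)
    print(np.allclose(np.tanh(a), 2 * sigmoid(2 * a) - 1))   # True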


Other Basis Functions

Fourier basis: each basis function represents a specific frequency and has infinite spatial extent.
Wavelets: localised in both space and frequency (also mutually orthogonal, to simplify application).
Splines: piecewise polynomials restricted to regions of the input space, with additional constraints where the pieces meet, e.g. smoothness constraints → conditions on the derivatives.

[Figure: four panels showing linear, quadratic, cubic and quartic spline fits]

Approximate the points {(0, 0), (1, 1), (2, −1), (3, 0), (4, −2), (5, 1)} by different splines.


Maximum Likelihood and Least Squares

No special assumption about the basis functions φ_j(x). In the simplest case, one can think of φ_j(x) = x_j, or φ(x) = x.
Assume the target t is given by

t = y(x, w) + ε

where y(x, w) is deterministic and ε is a zero-mean Gaussian random variable with precision (inverse variance) β.
Thus

p(t | x, w, β) = N(t | y(x, w), β⁻¹)

[Figure: the conditional density p(t | x_0), centred on y(x_0), for the regression function y(x)]


Maximum Likelihood and Least Squares

The likelihood of one target t given the data x was

p(t | x, w, β) = N(t | y(x, w), β⁻¹)

Now consider a set of inputs X with corresponding target values t.
Assume the data are independent and identically distributed (i.i.d.), meaning they are drawn independently and from the same distribution. The likelihood of the targets t is then

p(t | X, w, β) = Π_{n=1}^{N} N(t_n | y(x_n, w), β⁻¹) = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β⁻¹)

From now on we drop the conditioning variable X from the notation, since with supervised learning we do not seek to model the distribution of the input data.


Maximum Likelihood and Least Squares

Consider the logarithm of the likelihood p(t | w, β) (the logarithm is a monotone function!)

ln p(t | w, β) = Σ_{n=1}^{N} ln N(t_n | w^T φ(x_n), β⁻¹)
               = Σ_{n=1}^{N} ln( √(β/(2π)) exp{ −(β/2) (t_n − w^T φ(x_n))² } )
               = (N/2) ln β − (N/2) ln(2π) − β E_D(w)

where the sum-of-squares error function is

E_D(w) = (1/2) Σ_{n=1}^{N} { t_n − w^T φ(x_n) }².

argmax_w ln p(t | w, β) → argmin_w E_D(w)


Maximum Likelihood and Least Squares

Goal: find a more compact representation.
Rewrite the error function

E_D(w) = (1/2) Σ_{n=1}^{N} { t_n − w^T φ(x_n) }² = (1/2) (t − Φw)^T (t − Φw)

where t = (t_1, . . . , t_N)^T, and Φ is the N × M design matrix with entries Φ_{nj} = φ_j(x_n):

Φ = [ φ_0(x_1)   φ_1(x_1)   . . .   φ_{M−1}(x_1)
      φ_0(x_2)   φ_1(x_2)   . . .   φ_{M−1}(x_2)
        ...         ...     . . .       ...
      φ_0(x_N)   φ_1(x_N)   . . .   φ_{M−1}(x_N) ]


Maximum Likelihood and Least Squares

The log likelihood is now

ln p(t | w, β) = (N/2) ln β − (N/2) ln(2π) − β E_D(w)
               = (N/2) ln β − (N/2) ln(2π) − (β/2) (t − Φw)^T (t − Φw)

Find the critical points of ln p(t | w, β). The gradient with respect to w is

∇_w ln p(t | w, β) = β Φ^T (t − Φw).

Setting the gradient to zero gives

0 = Φ^T t − Φ^T Φ w,

which results in

w_ML = (Φ^T Φ)⁻¹ Φ^T t = Φ† t

where Φ† is the Moore-Penrose pseudo-inverse of the matrix Φ.
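
A minimal numpy sketch of this solution on made-up data (the sinusoidal targets and the polynomial basis are illustrative assumptions); np.linalg.lstsq is used because it applies the pseudo-inverse Φ† in a numerically stable way:

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(0, 1, 20)                                   # hypothetical inputs
    t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.shape)  # hypothetical noisy targets

    M = 4
    Phi = np.column_stack([x**j for j in range(M)])   # Phi[n, j] = phi_j(x_n) = x_n**j

    # w_ML = (Phi^T Phi)^{-1} Phi^T t = Phi^dagger t
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    print(w_ml)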


Maximum Likelihood and Least Squares

The log likelihood with the optimal w_ML is now

ln p(t | w_ML, β) = (N/2) ln β − (N/2) ln(2π) − (β/2) (t − Φ w_ML)^T (t − Φ w_ML)

Finding the critical points of ln p(t | w_ML, β) with respect to β,

∂ ln p(t | w_ML, β) / ∂β = 0,

results in

1/β_ML = (1/N) (t − Φ w_ML)^T (t − Φ w_ML)

Note: we can first find the maximum likelihood solution for w, as it does not depend on β. Then we can use w_ML to find the maximum likelihood solution for β.
Could we have chosen to optimise with respect to β first, and then with respect to w?
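
A tiny numpy sketch of this estimate; the residual vector below is a made-up placeholder for t − Φ w_ML from a fitted model:

    import numpy as np

    residual = np.array([0.05, -0.12, 0.08, -0.03, 0.10])   # hypothetical t - Phi @ w_ml

    # 1 / beta_ML = (1/N) ||t - Phi w_ML||^2, i.e. the mean squared residual.
    beta_ml = 1.0 / np.mean(residual**2)
    print("beta_ML:", beta_ml, " noise variance estimate:", 1.0 / beta_ml)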


Sequential Learning - Stochastic Gradient Descent

For large data sets, calculating the maximum likelihood parameters w_ML and β_ML may be costly.
For online applications, the data are never all in memory.
Use a sequential algorithm (online algorithm).
If the error function is a sum over data points, E = Σ_n E_n, then
1. initialise w^(0) to some starting value
2. update the parameter vector at iteration τ + 1 by

w^(τ+1) = w^(τ) − η ∇E_n,

where E_n is the error function after presenting the nth data point, and η is the learning rate.


Sequential Learning - Stochastic Gradient Descent

For the sum-of-squares error function, stochastic gradient descent results in

w^(τ+1) = w^(τ) + η ( t_n − w^(τ)T φ(x_n) ) φ(x_n)

The value of the learning rate must be chosen carefully: too large a learning rate may prevent the algorithm from converging, while too small a learning rate makes the algorithm follow the data unnecessarily slowly.
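
A minimal numpy sketch of this update rule; the data set, polynomial basis, learning rate and number of passes are all illustrative assumptions, not values from the slides:

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(0, 1, 50)
    t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.shape)

    M = 4
    Phi = np.column_stack([x**j for j in range(M)])   # phi_j(x_n) = x_n**j

    w = np.zeros(M)        # w^(0)
    eta = 0.1              # learning rate
    for epoch in range(100):                  # several passes over the data
        for n in rng.permutation(len(x)):     # present one data point at a time
            # w <- w + eta * (t_n - w^T phi(x_n)) phi(x_n)
            w = w + eta * (t[n] - w @ Phi[n]) * Phi[n]
    print(w)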


Regularized Least Squares

Add regularisation in order to prevent overfitting:

E_D(w) + λ E_W(w)

with regularisation coefficient λ.
Simple quadratic regulariser:

E_W(w) = (1/2) w^T w

The solution minimising the regularised error is

w = (λI + Φ^T Φ)⁻¹ Φ^T t
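
A small numpy sketch of the regularised solution w = (λI + Φ^T Φ)⁻¹ Φ^T t, on made-up data with an illustrative choice of λ:

    import numpy as np

    rng = np.random.default_rng(3)
    x = np.linspace(0, 1, 20)
    t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.shape)

    M = 8
    Phi = np.column_stack([x**j for j in range(M)])

    lam = 1e-3   # regularisation coefficient lambda (illustrative)
    # w = (lambda I + Phi^T Phi)^{-1} Phi^T t
    w_reg = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
    print(w_reg)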


Regularized Least Squares

More general regulariser

E_W(w) = (1/2) Σ_{j=1}^{M} |w_j|^q

q = 1 (lasso) leads to a sparse model if λ is large enough.

[Figure: contours of the regulariser Σ_j |w_j|^q for q = 0.5, q = 1, q = 2 and q = 4]


Lagrangian Dual View of the Regulariser

By the Lagrange multiplier method, minimization of the regularized error function

(1/2) Σ_{n=1}^{N} (t_n − w^T φ(x_n))² + (λ/2) Σ_{j=1}^{M} |w_j|^q

is equivalent to minimizing the unregularized sum-of-squares error

(1/2) Σ_{n=1}^{N} (t_n − w^T φ(x_n))²   subject to   Σ_{j=1}^{M} |w_j|^q ≤ η.

This yields the figures on the next slide.


Comparison of Quadratic and Lasso Regulariser

Quadratic regulariser

(1/2) Σ_{j=1}^{M} w_j²

Lasso regulariser

(1/2) Σ_{j=1}^{M} |w_j|

[Figure: for each regulariser, the constraint region in the (w_1, w_2) plane together with the contours of the unregularised error and the optimum w∗]


Multiple Outputs

More than one target variable per data point.
y becomes a vector instead of a scalar. Each dimension can be treated with a different set of basis functions (and that may be necessary if the data in the different target dimensions represent very different types of information).
Here we restrict ourselves to the SAME basis functions:

y(x, w) = W^T φ(x)

where y is a K-dimensional column vector, W is an M × K matrix of model parameters, and φ(x) = (φ_0(x), . . . , φ_{M−1}(x))^T, with φ_0(x) = 1, as before.
Define the target matrix T containing the target vector t_n^T in the nth row.


Multiple Outputs

Suppose the conditional distribution of the target vector is an isotropic Gaussian of the form

p(t | x, W, β) = N(t | W^T φ(x), β⁻¹ I).

The log likelihood is then

ln p(T | X, W, β) = Σ_{n=1}^{N} ln N(t_n | W^T φ(x_n), β⁻¹ I)
                  = (NK/2) ln(β/(2π)) − (β/2) Σ_{n=1}^{N} ‖t_n − W^T φ(x_n)‖²


Multiple Outputs

Maximisation with respect to W results in

W_ML = (Φ^T Φ)⁻¹ Φ^T T.

For each target variable t_k, we get

w_k = (Φ^T Φ)⁻¹ Φ^T t_k = Φ† t_k.

The solution decouples between the different target variables.
This also holds for a general Gaussian noise distribution with an arbitrary covariance matrix.
Why? W defines the mean of the Gaussian noise distribution, and the maximum likelihood solution for the mean of a multivariate Gaussian is independent of the covariance.
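
A small numerical check of the decoupling, on made-up data (the shapes N, M and K are arbitrary illustrative choices):

    import numpy as np

    rng = np.random.default_rng(4)
    N, M, K = 30, 5, 3
    Phi = rng.normal(size=(N, M))       # hypothetical design matrix
    T = rng.normal(size=(N, K))         # hypothetical targets, t_n^T in row n

    W_ml = np.linalg.pinv(Phi) @ T      # W_ML = Phi^dagger T, all outputs at once

    # Solving each output column separately, w_k = Phi^dagger t_k, gives the same result.
    W_cols = np.column_stack([np.linalg.pinv(Phi) @ T[:, k] for k in range(K)])
    print(np.allclose(W_ml, W_cols))    # True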


Loss Function for Regression

Over-fitting results from a large number of basis functions and a relatively small training set.
Regularisation can prevent overfitting, but how do we find the correct value for the regularisation constant λ?
The frequentist viewpoint of model complexity is the bias-variance trade-off.


Loss Function for Regression

Choose an estimator y(x) to estimate the target value t for each input x.
Choose a loss function L(t, y(x)) which measures the difference between the target t and the estimate y(x).
The expected loss is then

E[L] = ∫∫ L(t, y(x)) p(x, t) dx dt

Common choice: squared loss

L(t, y(x)) = { y(x) − t }².

Expected loss for the squared loss function:

E[L] = ∫∫ { y(x) − t }² p(x, t) dx dt.


Loss Function for Regression

Expected loss for the squared loss function

E[L] = ∫∫ { y(x) − t }² p(x, t) dx dt.

Minimise E[L] by choosing the regression function

y(x) = ( ∫ t p(x, t) dt ) / p(x) = ∫ t p(t | x) dt = E_t[t | x]

(calculus of variations is not required to derive this result; we may work point-wise by fixing an x and using stationarity to solve for y(x). Why is that sufficient?)
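
As a sketch of the point-wise argument hinted at above (fix x, treat y(x) as a single number, differentiate the inner integral and set it to zero):

    \frac{\partial}{\partial y(x)} \int \{y(x) - t\}^2\, p(x, t)\, dt
      = 2 \int \{y(x) - t\}\, p(x, t)\, dt
      = 2\Big( y(x)\, p(x) - \int t\, p(x, t)\, dt \Big) = 0
    \quad\Longrightarrow\quad
    y(x) = \frac{\int t\, p(x, t)\, dt}{p(x)} = \int t\, p(t \mid x)\, dt = \mathbb{E}_t[t \mid x].

(One way to see why point-wise minimisation is sufficient: the inner integral is non-negative for every x, and y(x) may be chosen independently for each x, so minimising it point-wise minimises the double integral.)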


Optimal Predictor for Squared Loss

The regression function which minimises the expected squared loss is given by the mean of the conditional distribution p(t | x).

[Figure: the regression function y(x), with the conditional density p(t | x_0) centred on y(x_0) at a point x_0]


Analysing the Squared Loss (1)

Analyse the expected loss

E[L] = ∫∫ { y(x) − t }² p(x, t) dx dt.

Rewrite the squared loss:

{ y(x) − t }² = { y(x) − E[t | x] + E[t | x] − t }²
              = { y(x) − E[t | x] }² + { E[t | x] − t }² + 2 { y(x) − E[t | x] }{ E[t | x] − t }

Claim:

∫∫ { y(x) − E[t | x] }{ E[t | x] − t } p(x, t) dx dt = 0.


Analysing the Squared Loss (2)

Claim:

∫∫ { y(x) − E[t | x] }{ E[t | x] − t } p(x, t) dx dt = 0.

Separate the factor depending only on x from the integral over t:

∫ { y(x) − E[t | x] } ( ∫ { E[t | x] − t } p(x, t) dt ) dx

Calculate the integral over t:

∫ { E[t | x] − t } p(x, t) dt = E[t | x] p(x) − p(x) ∫ t p(x, t)/p(x) dt
                              = E[t | x] p(x) − p(x) E[t | x]
                              = 0


Analysing the Squared Loss (3)

The expected loss is now

E[L] = ∫ { y(x) − E[t | x] }² p(x) dx + ∫ var[t | x] p(x) dx     (1)

Minimise the first term by choosing y(x) = E[t | x] (as we saw already).
The second term represents the intrinsic variability of the target data and can be regarded as noise. It is independent of the choice of y(x) and cannot be reduced by learning a better y(x).


The Bias-Variance Decomposition (1)

Consider again the squared loss, for which the optimal prediction is given by the conditional expectation h(x)

h(x) = E[t | x] = ∫ t p(t | x) dt.

Since h(x) is unavailable to us, it must be estimated from a (finite) data set D.
D is a finite sample from the unknown joint distribution p(x, t).
Denote the dependency of the learned function on the data by y(x; D).
Evaluate the performance of the algorithm by taking the expectation E_D[L] over all data sets D.


The Bias-Variance Decomposition (2)

Taking the expectation over data sets D, using Eqn 1, and interchanging the order of expectations for the first term:

E_D[E[L]] = ∫ E_D[ { y(x; D) − h(x) }² ] p(x) dx + ∫∫ { h(x) − t }² p(x, t) dx dt

Again, add and subtract the expectation E_D[y(x; D)]:

{ y(x; D) − h(x) }² = { y(x; D) − E_D[y(x; D)] + E_D[y(x; D)] − h(x) }²

and show that the mixed term vanishes under the expectation E_D.


The Bias-Variance Decomposition (3)

Expected loss E_D[L] over all data sets D:

expected loss = (bias)² + variance + noise

where

(bias)² = ∫ { E_D[y(x; D)] − h(x) }² p(x) dx
variance = ∫ E_D[ { y(x; D) − E_D[y(x; D)] }² ] p(x) dx
noise = ∫∫ { h(x) − t }² p(x, t) dx dt.

(bias)²: how accurate is the model across different training sets? (How much does the average prediction over all data sets differ from the desired regression function?)
variance: how sensitive is the model to small changes in the training set? (How much do the solutions for individual data sets vary around their average?)
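
A compact simulation sketch of this decomposition in numpy; the sinusoidal ground truth h(x), the Gaussian basis model, the noise level and the regularisation value are illustrative assumptions, loosely in the spirit of the experiment on the following slides:

    import numpy as np

    rng = np.random.default_rng(5)
    h = lambda x: np.sin(2 * np.pi * x)    # assumed "true" regression function h(x)
    noise_std = 0.3                        # assumed noise level
    lam = np.exp(-1.0)                     # regularisation coefficient (illustrative)

    def gaussian_design(x, centres, s=0.1):
        # Phi[n, j] = exp(-(x_n - mu_j)^2 / (2 s^2))
        return np.exp(-(x[:, None] - centres[None, :])**2 / (2 * s**2))

    centres = np.linspace(0, 1, 24)
    x_test = np.linspace(0, 1, 200)
    Phi_test = gaussian_design(x_test, centres)

    # Fit the same regularised model to L independent data sets D of size N.
    L_sets, N = 100, 25
    preds = np.empty((L_sets, x_test.size))
    for l in range(L_sets):
        x = rng.uniform(0, 1, N)
        t = h(x) + noise_std * rng.normal(size=N)
        Phi = gaussian_design(x, centres)
        w = np.linalg.solve(lam * np.eye(len(centres)) + Phi.T @ Phi, Phi.T @ t)
        preds[l] = Phi_test @ w            # y(x; D) evaluated on a test grid

    y_bar = preds.mean(axis=0)                  # E_D[y(x; D)]
    bias2 = np.mean((y_bar - h(x_test))**2)     # (bias)^2, averaged over x
    variance = np.mean((preds - y_bar)**2)      # variance, averaged over x and D
    print("bias^2:", bias2, " variance:", variance, " noise:", noise_std**2)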


The Bias-Variance Decomposition

Simple models have low variance and high bias.

[Figure: fits with ln λ = 2.6. Left: result of fitting the model to 100 data sets, only 25 shown. Right: average of the 100 fits in red, and the sinusoidal function from which the data were created in green.]


The Bias-Variance Decomposition

Complex models have high variance and low bias.

[Figure: fits with ln λ = −2.4. Left: result of fitting the model to 100 data sets, only 25 shown. Right: average of the 100 fits in red, and the sinusoidal function from which the data were created in green.]


The Bias-Variance Decomposition

Dependence of bias and variance on the model complexity.
Squared bias, variance, their sum, and the test error.
The minimum of (bias)² + variance occurs close to the value of ln λ that gives the minimum test error.

[Figure: (bias)², variance, (bias)² + variance and the test error plotted against ln λ]


Unbiased Estimators

You may have encountered unbiased estimators.
Why guarantee zero bias? To quote the pioneer of Bayesian inference, Edwin Jaynes, from his book Probability Theory: The Logic of Science (2003):


The Bias-Variance Decomposition

Trade-off between bias and variance:
simple models have low variance and high bias
complex models have high variance and low bias
The sum of bias and variance has a minimum at a certain model complexity.
Expected loss E_D[L] over all data sets D:

expected loss = (bias)² + variance + noise.

The noise comes from the data and cannot be removed from the expected loss.
To analyse the bias-variance decomposition, many data sets are needed, which are not always available.