Model Selection for Gaussian Processes (slides: homepages.inf.ed.ac.uk/ckiw/talks/nips06.pdf)


Page 1

Model Selection for Gaussian Processes

Chris Williams

Institute for Adaptive and Neural Computation, School of Informatics, University of Edinburgh, UK

December 2006


Page 2

Outline

GP basics

Model selection: covariance functions and parameterizations

Criteria for model selection

Marginal likelihood

Cross-validation

Examples

Thanks to Carl Rasmussen (book co-author)


Page 3

Features

X Bayesian learning

X Higher level regularization

X Predictive uncertainty

X Cross-validation

X Ensemble learning

X Search strategies (no more grid search)

X Feature selection

X More than 2 levels of inference


Page 4

Gaussian Process Basics

For a stochastic process f(x), the mean function is

µ(x) = E [f (x)].

Assume µ(x) ≡ 0 ∀x. Covariance function:

k(x, x′) = E [f (x)f (x′)].

Priors over function-space can be defined directly by choosing a covariance function, e.g. (a small sampling sketch follows below)

E [f (x)f (x′)] = exp(−w |x− x′|)

Gaussian processes are stochastic processes defined by their mean and covariance functions.
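As a small illustration (not from the original slides), the sketch below builds the covariance matrix of the exponential covariance above on a grid of inputs and draws sample functions from the zero-mean prior; the grid, the value of w and the jitter term are arbitrary choices.

```python
import numpy as np

def k_exp(x1, x2, w=1.0):
    """Exponential covariance k(x, x') = exp(-w |x - x'|)."""
    return np.exp(-w * np.abs(x1[:, None] - x2[None, :]))

x = np.linspace(0.0, 1.0, 200)                        # grid of inputs
K = k_exp(x, x, w=5.0)                                # prior covariance matrix
L = np.linalg.cholesky(K + 1e-10 * np.eye(len(x)))    # small jitter for numerical stability
samples = L @ np.random.randn(len(x), 3)              # three draws from N(0, K)
```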


Page 5

Gaussian Process Prediction

A Gaussian process places a prior over functions

Observe data D = {(xi, yi)}_{i=1}^n, obtain a posterior distribution

p(f |D) ∝ p(f )p(D|f )

posterior ∝ prior× likelihood

For a Gaussian likelihood (regression), predictions can be made exactly via matrix computations

For classification, we need approximations (or MCMC)


Page 6

GP Regression

Prediction at x∗ ∼ N(f(x∗), var(x∗)), with

f(x∗) = ∑_{i=1}^n αi k(x∗, xi)

where α = (K + σ²I)⁻¹ y

and

var(x∗) = k(x∗, x∗) − k(x∗)ᵀ (K + σ²I)⁻¹ k(x∗)

with k(x∗) = (k(x∗, x1), . . . , k(x∗, xn))ᵀ
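A minimal numpy sketch of these prediction equations; the one-dimensional squared-exponential kernel k_se and the noise variance sigma2 are illustrative choices, not something fixed by the slide.

```python
import numpy as np

def k_se(A, B, ell=1.0, sf2=1.0):
    """Squared-exponential covariance on 1-D inputs (illustrative choice)."""
    return sf2 * np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell ** 2)

def gp_predict(X, y, Xs, sigma2=0.1):
    """Predictive mean and variance at test inputs Xs, following the equations above."""
    K = k_se(X, X)                                        # n x n training covariance
    Ks = k_se(Xs, X)                                      # m x n cross-covariances k(x*, x_i)
    A = K + sigma2 * np.eye(len(X))                       # K + sigma^2 I
    alpha = np.linalg.solve(A, y)                         # alpha = (K + sigma^2 I)^{-1} y
    mean = Ks @ alpha                                     # f(x*) = sum_i alpha_i k(x*, x_i)
    V = np.linalg.solve(A, Ks.T)                          # (K + sigma^2 I)^{-1} k(x*)
    var = k_se(Xs, Xs).diagonal() - np.sum(Ks * V.T, axis=1)
    return mean, var
```

In practice one would factor K + σ²I once (e.g. by Cholesky) and reuse it for both the mean and variance terms.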


Page 7

GP classification

Response function π(x) = r(f (x)), e.g. logistic or probit

For these choices log likelihood is concave ⇒ unique maximum

Use MAP or EP for inference (unconstrained optimization, c.f. SVMs)

For GPR and GPC, some consistency results are available for non-degenerate kernels


Page 8

Model Selection

Covariance function often has some free parameters θ

We can choose different families of covariance functions

kSE(xp, xq) = σ_f² exp(−½ (xp − xq)ᵀ M (xp − xq)) + σ_n² δpq,

kNN(xp, xq) = σ_f² sin⁻¹( 2 x̃pᵀ M x̃q / √((1 + 2 x̃pᵀ M x̃p)(1 + 2 x̃qᵀ M x̃q)) ) + σ_n² δpq,

where x̃ = (1, x1, . . . , xD)ᵀ
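A sketch of the squared-exponential family above for vector inputs, parameterized by the matrix M; the neural-network covariance follows the same pattern on the augmented input x̃ with a sin⁻¹ and is omitted here. The length-scale values below are placeholders.

```python
import numpy as np

def k_se(Xp, Xq, M, sf2=1.0, sn2=0.0):
    """k_SE(xp, xq) = sf2 * exp(-0.5 (xp - xq)^T M (xp - xq)) + sn2 * delta_pq."""
    d = Xp[:, None, :] - Xq[None, :, :]                  # pairwise differences, shape (np, nq, D)
    quad = np.einsum('pqi,ij,pqj->pq', d, M, d)          # (xp - xq)^T M (xp - xq)
    K = sf2 * np.exp(-0.5 * quad)
    if sn2 > 0 and Xp.shape == Xq.shape and np.allclose(Xp, Xq):
        K = K + sn2 * np.eye(len(Xp))                    # noise term only between identical points
    return K

D = 3
M_iso = np.eye(D) / 0.5 ** 2                             # isotropic: one shared length-scale
M_ard = np.diag(1.0 / np.array([1.0, 0.3, 2.0]) ** 2)    # ARD (next slide): one per input dimension
```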


Page 9

Automatic Relevance Determination

kSE(xp, xq) = σ_f² exp(−½ (xp − xq)ᵀ M (xp − xq)) + σ_n² δpq

Isotropic: M = ℓ⁻² I

ARD: M = diag(ℓ1⁻², ℓ2⁻², . . . , ℓD⁻²)

[Figure: two surface plots of output y against inputs x1 and x2, illustrating the isotropic and ARD cases.]


Page 10

Further modelling flexibility

We can combine covariance functions to make things more general

Example: functional ANOVA, e.g.

k(x, x′) = ∑_{i=1}^D ki(xi, x′i) + ∑_{i=2}^D ∑_{j=1}^{i−1} kij(xi, xj; x′i, x′j)

(a small sketch of this additive construction follows below)

Non-linear warping of the input space (Sampson and Guttorp, 1992)
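A rough numpy sketch of the additive ANOVA construction above, using a squared-exponential base kernel per input dimension and, as one simple choice (the slide does not commit to a particular form), products of the univariate kernels for the pairwise terms kij.

```python
import numpy as np

def k1d(a, b, ell=1.0):
    """Illustrative 1-D base kernel (squared exponential)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def k_anova(Xp, Xq, ell=1.0):
    """Sum of univariate terms k_i plus pairwise interaction terms k_ij."""
    D = Xp.shape[1]
    uni = [k1d(Xp[:, i], Xq[:, i], ell) for i in range(D)]   # k_i(x_i, x'_i)
    K = sum(uni)
    for i in range(1, D):
        for j in range(i):
            K = K + uni[i] * uni[j]                          # k_ij as a product of univariate terms
    return K
```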


Page 11

The Baby and the Bathwater

MacKay (2003, ch. 45): In moving from neural networks to kernel machines did we throw out the baby with the bathwater? i.e. the ability to learn hidden features/representations

But consider M = ΛΛᵀ + diag(ℓ)⁻² for Λ being D × k, for k < D

The k columns of Λ can identify directions in the input space with specially high relevance (Vivarelli and Williams, 1999)


Page 12

Understanding the prior

We can analyze and draw samples from the prior

[Figure: two panels of sample functions drawn from the prior, plotted against input x on [0, 1].]

k(x − x′) = exp(−|x − x′|/ℓ)   and   k(x − x′) = exp(−|x − x′|²/2ℓ²)

with ℓ = 0.1


Page 13

[Figure: output f(x) against input x for σ = 10, σ = 3 and σ = 1.]

Samples from the neural network covariance function


Page 14

Criteria for Model Selection

Bayesian marginal likelihood ("evidence") p(D|model)

Estimate the generalization error (e.g. cross-validation)

Bound the generalization error (e.g. PAC-Bayes)


Page 15

Bayesian Model Selection

For GPR, we can compute the marginal likelihood p(y|X, θ, M) exactly (integrating out f). For GPC it can be approximated using the Laplace approximation or EP

Can also use MCMC to sample

p(Mi |y, X) ∝ p(y|X, Mi) p(Mi), where

p(y|X, Mi) = ∫ p(y|X, θi, Mi) p(θi |Mi) dθi

[Diagram of the hierarchical model relating the data y, the latent function f, and the model M.]


Page 16

Type-II Maximum Likelihood

Type-II maximum likelihood maximizes the marginal likelihood p(y|X, θi, Mi) rather than integrating over θi

p(y|X ,θi ,Mi ) is differentiable wrt θ: no more grid search!

∂/∂θj log p(y|X, θ) = ½ yᵀ K⁻¹ (∂K/∂θj) K⁻¹ y − ½ trace(K⁻¹ ∂K/∂θj)

This was in Williams and Rasmussen (NIPS*95); a numpy sketch of this computation follows below

Can also use MAP estimation or MCMC in θ-space, see e.g. Williams and Barber (1998)
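A numpy sketch of the log marginal likelihood and the gradient expression above. It takes the covariance matrix K (already including the noise term) and a list of derivative matrices ∂K/∂θj; the Cholesky-based solves are an implementation choice.

```python
import numpy as np

def log_marginal_and_grad(y, K, dK_list):
    """Return log p(y|X, theta) and its gradient wrt each theta_j."""
    n = len(y)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))       # alpha = K^{-1} y
    lml = (-0.5 * y @ alpha
           - np.sum(np.log(np.diag(L)))                        # = -0.5 log|K|
           - 0.5 * n * np.log(2.0 * np.pi))
    Kinv = np.linalg.solve(L.T, np.linalg.solve(L, np.eye(n)))
    grads = [0.5 * alpha @ dK @ alpha - 0.5 * np.trace(Kinv @ dK)
             for dK in dK_list]                                # one entry per hyperparameter
    return lml, np.array(grads)
```

These values can be handed to any gradient-based optimizer (negating them for minimization).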


Page 17

[Figure: marginal likelihood p(y|X, Hi) plotted over all possible data sets for simple, intermediate and complex models; and log probability against characteristic length-scale, decomposed into data fit, minus complexity penalty, and the resulting marginal likelihood.]

Marginal likelihood automatically incorporates a trade-off between model fit and model complexity

log p(y|X, θi, Mi) = −½ yᵀ Ky⁻¹ y − ½ log |Ky| − (n/2) log 2π


Page 18

Marginal Likelihood and Local Optima

[Figure: two panels plotting output y against input x, showing two different GP fits of the same data set.]

There can be multiple optima of the marginal likelihood

These correspond to different interpretations of the data


Page 19

Cross-validation for GPR

log p(yi |X, y−i, θ) = −½ log(2πσi²) − (yi − µi)² / (2σi²)

L_LOO(X, y, θ) = ∑_{i=1}^n log p(yi |X, y−i, θ)

Leave-one-out predictions can be made efficiently (e.g. Wahba, 1990; Sundararajan and Keerthi, 2001); see the sketch below

We can also compute derivatives ∂LLOO/∂θj

LOO for squared error ignores predictive variances, and does not determine the overall scale of the covariance function

For GPC, LOO computations are trickier, but the cavity method (Opper and Winther, 2000) seems to be effective
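The efficient LOO computation rests on the standard identities µi = yi − [K⁻¹y]i / [K⁻¹]ii and σi² = 1 / [K⁻¹]ii (see, e.g., Sundararajan and Keerthi, 2001), so all n leave-one-out predictions come from a single factorization. A numpy sketch, assuming K already contains the noise variance:

```python
import numpy as np

def loo_log_pred_density(y, K):
    """Leave-one-out predictive means, variances and summed log density for GP regression."""
    Kinv = np.linalg.inv(K)                  # one inverse serves all n folds
    alpha = Kinv @ y                         # K^{-1} y
    d = np.diag(Kinv)                        # [K^{-1}]_{ii}
    mu = y - alpha / d                       # LOO predictive means
    var = 1.0 / d                            # LOO predictive variances
    logp = -0.5 * np.log(2.0 * np.pi * var) - (y - mu) ** 2 / (2.0 * var)
    return logp.sum(), mu, var
```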


Page 20

Comparing marginal likelihood and LOO-CV

L = ∑_{i=1}^n log p(yi | {yj, j < i}, θ)

L_LOO = ∑_{i=1}^n log p(yi | {yj, j ≠ i}, θ)

Marginal likelihood tells us the probability of the data given the assumptions of the model

LOO-CV gives an estimate of the predictive log probability, whether or not the model assumptions are fulfilled

CV procedures should be more robust against model mis-specification (e.g. Wahba, 1990)


Page 21

Example 1: Mauna Loa CO2

[Figure: atmospheric CO2 concentration (ppm) against year, roughly 1960-2020, with values spanning about 320-420 ppm.]

Fit this data with a sum of four covariance functions, modelling (i) a smooth trend, (ii) a periodic component (with some decay), (iii) medium-term irregularities, and (iv) noise (a kernel sketch follows below)

Optimize and compare models using marginal likelihood
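The particular functional forms below follow the treatment of this data set in Rasmussen and Williams (2006, ch. 5) and are only one way to realize the four components listed above; the hyperparameter dictionary th and its values are placeholders to be set by optimizing the marginal likelihood.

```python
import numpy as np

def k_mauna_loa(t1, t2, th):
    """Sum of four covariance terms on time inputs t (in years)."""
    d = t1[:, None] - t2[None, :]
    trend = th['s1'] ** 2 * np.exp(-0.5 * d ** 2 / th['l1'] ** 2)              # (i) long-term smooth trend
    seasonal = (th['s2'] ** 2 * np.exp(-0.5 * d ** 2 / th['l2'] ** 2)           # (ii) periodic (period 1 year),
                * np.exp(-2.0 * np.sin(np.pi * d) ** 2 / th['l3'] ** 2))        #      allowed to decay over time
    medium = th['s3'] ** 2 * (1.0 + 0.5 * d ** 2 / (th['a'] * th['l4'] ** 2)) ** (-th['a'])  # (iii) rational quadratic
    noise = (th['s4'] ** 2 * np.exp(-0.5 * d ** 2 / th['l5'] ** 2)
             + th['sn'] ** 2 * (np.abs(d) < 1e-12))                             # (iv) correlated + white noise
    return trend + seasonal + medium + noise
```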


Page 22

Example 2: Robot Arm Inverse Dynamics

44,484 training and 4,449 test examples, in 21 dimensions

Map from the positions, velocities and accelerations of 7 joints to torque

Use the SE (Gaussian) covariance function with ARD ⇒ 23 hyperparameters, optimizing the marginal likelihood or L_LOO

Similar accuracy for both SMSE and mean standardized log loss, but marginal likelihood optimization is quicker


Page 23

Multi-task Learning

[Figure: graphical model for multi-task learning over several data sets D.]

Minka and Picard (1999)


Page 24

Summary

X Bayesian learning

X Higher level regularization

X Predictive uncertainty

X Cross-validation

X Ensemble learning

X Search strategies (no more grid search)

X Feature selection

X More than 2 levels of inference

Model selection is much more than just setting parameters in a covariance function


Page 25

Relationship to alignment

A(K, y) = yᵀKy / (n ‖K‖_F)

where y has +1/−1 elements

log A(K, y) = log(yᵀKy) − log tr(K)

log q(y|K) = −½ f̂ᵀK⁻¹f̂ + log p(y|f̂) − ½ log |B|,

where B = I + W^{1/2} K W^{1/2}, f̂ is the maximum of the posterior found by Newton's method, and W is the diagonal matrix W = −∇∇ log p(y|f̂).
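A tiny numpy sketch of the alignment score defined above:

```python
import numpy as np

def alignment(K, y):
    """A(K, y) = y^T K y / (n ||K||_F), with y a vector of +1/-1 labels."""
    n = len(y)
    return (y @ K @ y) / (n * np.linalg.norm(K, 'fro'))
```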
