Computer Vision Group, Prof. Daniel Cremers
Machine Learning for Computer Vision
PD Dr. Rudolph Triebel

9. Gaussian Processes - Regression

Repetition: Regularized Regression

Before, we solved for w using the pseudoinverse.

But: we can kernelize this problem as well!

First step: Matrix inversion lemma


Kernelized Regression

Thus, we have:

w = (\lambda I_D + \Phi^T \Phi)^{-1} \Phi^T t = \Phi^T (\lambda I_N + \Phi \Phi^T)^{-1} t

By defining

K = \Phi \Phi^T, \qquad a = (\lambda I_N + K)^{-1} t

we get:

y(x) = \phi(x)^T w = \phi(x)^T \Phi^T a = k(x)^T (K + \lambda I_N)^{-1} t

(same result as in the last lecture)

This means that the predicted output is a linear combination of the training outputs, where the coefficients depend on the similarities to the training inputs.
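To see the two forms side by side, here is a minimal numpy sketch (not from the lecture) that compares the primal solution with the kernelized one; the polynomial feature map and the synthetic 1-D data are arbitrary illustrative choices.

import numpy as np

# Illustrative feature map phi(x) = [1, x, x^2]; returns an N x D design matrix.
def phi(x):
    return np.stack([np.ones_like(x), x, x**2], axis=1)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=20)
t = np.sin(X) + 0.1 * rng.standard_normal(20)
lam = 0.5                                  # regularization parameter lambda

Phi = phi(X)
N, D = Phi.shape

# Primal: w = (lambda I_D + Phi^T Phi)^{-1} Phi^T t
w = np.linalg.solve(lam * np.eye(D) + Phi.T @ Phi, Phi.T @ t)

# Dual (kernelized): a = (lambda I_N + K)^{-1} t with K = Phi Phi^T
K = Phi @ Phi.T
a = np.linalg.solve(lam * np.eye(N) + K, t)

# Prediction at a new input: phi(x)^T w == k(x)^T a
x_new = np.array([0.7])
y_primal = phi(x_new) @ w                  # uses the explicit weights
y_dual = (phi(x_new) @ Phi.T) @ a          # uses only inner products k(x, x_n)
print(y_primal, y_dual)                    # agree up to round-off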


Motivation

• We have found a way to predict function values of y for new input points x
• As we used regularized regression, we can equivalently find the predictive distribution by marginalizing out the parameters w
• Can we find a closed form for that distribution?
• How can we model the uncertainty of our prediction?
• Can we use that for classification?


Gaussian Marginals and Conditionals

Before we start, we need some formulae:

Assume we have two variables x_a and x_b that are jointly Gaussian distributed, i.e. N(x | \mu, \Sigma) with

x = \begin{pmatrix} x_a \\ x_b \end{pmatrix}, \qquad
\mu = \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}

Then the conditional distribution is

p(x_a | x_b) = N(x_a | \mu_{a|b}, \Sigma_{a|b})

where

\mu_{a|b} = \mu_a + \Sigma_{ab} \Sigma_{bb}^{-1} (x_b - \mu_b)

and

\Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab} \Sigma_{bb}^{-1} \Sigma_{ba}   ("Schur complement")

The marginal is

p(x_a) = N(x_a | \mu_a, \Sigma_{aa})
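As a quick sanity check (not from the slides), the following numpy snippet evaluates these conditioning formulas for an arbitrary 3-dimensional Gaussian; all numbers are made up for illustration.

import numpy as np

# Illustrative joint Gaussian over (x_a, x_b) with dim(x_a) = 1, dim(x_b) = 2.
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.5, 0.4],
                  [0.3, 0.4, 1.0]])

a, b = [0], [1, 2]                 # index sets of x_a and x_b
x_b = np.array([-1.5, 0.2])        # an observed value of x_b

S_aa = Sigma[np.ix_(a, a)]
S_ab = Sigma[np.ix_(a, b)]
S_ba = Sigma[np.ix_(b, a)]
S_bb = Sigma[np.ix_(b, b)]

# mu_{a|b} = mu_a + Sigma_ab Sigma_bb^{-1} (x_b - mu_b)
mu_cond = mu[a] + S_ab @ np.linalg.solve(S_bb, x_b - mu[b])
# Sigma_{a|b} = Sigma_aa - Sigma_ab Sigma_bb^{-1} Sigma_ba  (Schur complement)
Sigma_cond = S_aa - S_ab @ np.linalg.solve(S_bb, S_ba)
# The marginal of x_a is simply N(mu_a, Sigma_aa).
print(mu_cond, Sigma_cond, mu[a], S_aa)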


Gaussian Marginals and Conditionals

Main idea of the proof for the conditional (using the inverse of a block matrix):

\begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}^{-1}
= \begin{pmatrix} I & 0 \\ -\Sigma_{bb}^{-1} \Sigma_{ba} & I \end{pmatrix}
  \begin{pmatrix} (\Sigma / \Sigma_{bb})^{-1} & 0 \\ 0 & \Sigma_{bb}^{-1} \end{pmatrix}
  \begin{pmatrix} I & -\Sigma_{ab} \Sigma_{bb}^{-1} \\ 0 & I \end{pmatrix}

where \Sigma / \Sigma_{bb} = \Sigma_{aa} - \Sigma_{ab} \Sigma_{bb}^{-1} \Sigma_{ba} is the Schur complement. Plugging this into the exponent of the joint Gaussian, the lower block yields a quadratic form that depends only on x_b (the marginal p(x_b)); the rest can be identified with the conditional Normal distribution p(x_a | x_b).

(for details see, e.g., Bishop or Murphy)


Definition

Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

The number of random variables can be infinite!

This means: a GP is a Gaussian distribution over functions!

To specify a GP we need:

mean function: m(x) = E[y(x)]

covariance function: k(x_1, x_2) = E[(y(x_1) - m(x_1))(y(x_2) - m(x_2))]


Example

• Green line: sinusoidal data source
• Blue circles: data points with Gaussian noise
• Red line: mean function of the Gaussian process


How Can We Handle Infinity?

Idea: split the (infinite) number of random variables into a finite and an infinite subset:

x = \begin{pmatrix} x_f \\ x_i \end{pmatrix} \sim
N\left( \begin{pmatrix} \mu_f \\ \mu_i \end{pmatrix},
\begin{pmatrix} \Sigma_f & \Sigma_{fi} \\ \Sigma_{fi}^T & \Sigma_i \end{pmatrix} \right)

where x_f is the finite part and x_i the infinite part. From the marginalization property we get:

p(x_f) = \int p(x_f, x_i) \, dx_i = N(x_f | \mu_f, \Sigma_f)

This means we can use finite vectors.


A Simple Example

In Bayesian linear regression, we had y(x) = \phi(x)^T w with prior probability w \sim N(0, \Sigma_p). This means:

E[y(x)] = \phi(x)^T E[w] = 0

E[y(x_1) y(x_2)] = \phi(x_1)^T E[w w^T] \phi(x_2) = \phi(x_1)^T \Sigma_p \phi(x_2)

Any number of function values y(x_1), \dots, y(x_N) is jointly Gaussian with zero mean. The covariance function of this process is

k(x_1, x_2) = \phi(x_1)^T \Sigma_p \phi(x_2)

In general, any valid kernel function can be used.
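A small numpy experiment (my own illustration, not from the slides) makes this concrete: sampling many weight vectors w ~ N(0, Sigma_p) and evaluating y(x) = phi(x)^T w at a few inputs gives an empirical covariance that matches phi(x_1)^T Sigma_p phi(x_2). The feature map and Sigma_p below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(1)

def phi(x):
    # illustrative feature map with three basis functions
    return np.stack([np.ones_like(x), x, np.sin(x)], axis=1)

Sigma_p = np.diag([1.0, 0.5, 2.0])         # prior covariance of the weights
xs = np.linspace(-2, 2, 5)                 # a handful of input locations
Phi = phi(xs)

K_exact = Phi @ Sigma_p @ Phi.T            # kernel matrix k(x_i, x_j) = phi(x_i)^T Sigma_p phi(x_j)

W = rng.multivariate_normal(np.zeros(3), Sigma_p, size=20000)   # many weight samples
Y = W @ Phi.T                              # each row: y(x) evaluated at the 5 inputs
K_empirical = np.cov(Y, rowvar=False)
print(np.max(np.abs(K_exact - K_empirical)))   # small for enough samples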


The Covariance Function

The most commonly used covariance function (kernel) is the squared exponential, also known as "radial basis function" or "Gaussian kernel":

k(x_p, x_q) = \sigma_f^2 \exp\left( -\frac{1}{2 l^2} (x_p - x_q)^2 \right) + \sigma_n^2 \delta_{pq}

Here \sigma_f^2 is the signal variance, l the length scale, and \sigma_n^2 the noise variance.

Other possibilities exist, e.g. the exponential kernel

k(x_p, x_q) = \exp(-\theta |x_p - x_q|)

which is used in the "Ornstein-Uhlenbeck" process.
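For concreteness, here is a hedged sketch of both kernels as plain Python functions; parameter names follow the slide, and the delta_pq noise term is approximated by adding sigma_n^2 only when the two inputs coincide.

import numpy as np

def squared_exponential(xp, xq, sigma_f=1.0, l=1.0, sigma_n=0.1):
    sq_dist = np.sum((xp - xq) ** 2)
    noise = sigma_n ** 2 if np.allclose(xp, xq) else 0.0   # sigma_n^2 * delta_pq
    return sigma_f ** 2 * np.exp(-0.5 * sq_dist / l ** 2) + noise

def exponential(xp, xq, theta=1.0):
    # kernel of the Ornstein-Uhlenbeck process
    return np.exp(-theta * np.linalg.norm(xp - xq))

print(squared_exponential(np.array([0.0]), np.array([1.0])))
print(exponential(np.array([0.0]), np.array([1.0])))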


Sampling from a GP

Just as we can sample from a Gaussian distribution, we can also generate samples from a GP. Every sample will then be a function!

Process:

1. Choose a number of input points x^*_1, \dots, x^*_M
2. Compute the covariance matrix K where K_{ij} = k(x^*_i, x^*_j)
3. Generate a random Gaussian vector y^* \sim N(0, K)
4. Plot the values y^*_1, \dots, y^*_M versus x^*_1, \dots, x^*_M
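The four steps translate directly into a few lines of numpy. This sketch uses the squared exponential kernel with unit signal variance and length scale, a small jitter term for numerical stability, and matplotlib for the plot; all of these choices are illustrative.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

M = 100
xs = np.linspace(-5, 5, M)                          # 1. choose input points x*_1, ..., x*_M

def k(xp, xq, sigma_f=1.0, l=1.0):
    return sigma_f**2 * np.exp(-0.5 * (xp - xq)**2 / l**2)

K = k(xs[:, None], xs[None, :])                     # 2. covariance matrix K_ij = k(x*_i, x*_j)
K += 1e-8 * np.eye(M)                               #    jitter for numerical stability

samples = rng.multivariate_normal(np.zeros(M), K, size=3)   # 3. draw y* ~ N(0, K), three times

for y_star in samples:                              # 4. plot y* versus x*
    plt.plot(xs, y_star)
plt.show()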


Sampling from a GP

Figure: samples drawn with the squared exponential kernel and with the exponential kernel.


Prediction with a Gaussian Process

Most often we are more interested in predicting new function values for given input data.

We have:

training data x_1, \dots, x_N with outputs y_1, \dots, y_N
test inputs x^*_1, \dots, x^*_M

and we want the test outputs y^*_1, \dots, y^*_M.

The joint probability is

\begin{pmatrix} y \\ y^* \end{pmatrix} \sim
N\left( 0, \begin{pmatrix} K(X, X) & K(X, X^*) \\ K(X^*, X) & K(X^*, X^*) \end{pmatrix} \right)

and we need to compute p(y^* | x^*, X, y).


Prediction with a Gaussian Process

In the case of only one test point x^* we have

K(X, x^*) = \begin{pmatrix} k(x_1, x^*) \\ \vdots \\ k(x_N, x^*) \end{pmatrix} = k_*

Now we compute the conditional distribution

p(y^* | x^*, X, y) = N(y^* | \mu_*, \Sigma_*)

where

\mu_* = k_*^T K^{-1} t

\Sigma_* = k(x^*, x^*) - k_*^T K^{-1} k_*

This defines the predictive distribution.
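As a sketch of these formulas (assuming a squared exponential kernel and letting K already contain the noise term sigma_n^2 I on its diagonal), the predictive mean and variance for a single test point can be computed as follows; the data are synthetic.

import numpy as np

rng = np.random.default_rng(3)

def k(xp, xq, sigma_f=1.0, l=1.0):
    return sigma_f**2 * np.exp(-0.5 * (xp - xq)**2 / l**2)

X = rng.uniform(-5, 5, size=15)                    # training inputs
t = np.sin(X) + 0.1 * rng.standard_normal(15)      # noisy training targets
sigma_n = 0.1

K = k(X[:, None], X[None, :]) + sigma_n**2 * np.eye(len(X))
x_star = 0.5
k_star = k(X, x_star)                              # k* = (k(x_1, x*), ..., k(x_N, x*))

mu_star = k_star @ np.linalg.solve(K, t)           # mu* = k*^T K^{-1} t
var_star = k(x_star, x_star) - k_star @ np.linalg.solve(K, k_star)   # Sigma* = k(x*,x*) - k*^T K^{-1} k*
print(mu_star, var_star)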


Example

Functions sampled from a Gaussian Process prior


Functions sampled from the predictive distribution

The predictive distribution is itself a Gaussian process.

It represents the posterior after observing the data.

The covariance is low in the vicinity of data points.

Varying the Hyperparameters

• 20 data samples
• GP prediction with different kernel hyperparameters: (l, \sigma_f, \sigma_n) = (1, 1, 0.1), (0.3, 1.08, 0.0005), and (3, 1.16, 0.89)



Varying the Hyperparameters

The squared exponential covariance function can be generalized to

k(x_p, x_q) = \sigma_f^2 \exp\left( -\frac{1}{2} (x_p - x_q)^T M (x_p - x_q) \right) + \sigma_n^2 \delta_{pq}

where M can be:

• M = l^{-2} I : this is equal to the above case
• M = \mathrm{diag}(l_1, \dots, l_D)^{-2} : every feature dimension has its own length scale parameter
• M = \Lambda \Lambda^T + \mathrm{diag}(l_1, \dots, l_D)^{-2} : here \Lambda has fewer than D columns
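A minimal sketch (with made-up numbers) of the generalized kernel and the three constructions of M listed above:

import numpy as np

def se_kernel_M(xp, xq, M, sigma_f=1.0, sigma_n=0.0, same_index=False):
    # generalized squared exponential with distance (xp - xq)^T M (xp - xq)
    d = xp - xq
    value = sigma_f**2 * np.exp(-0.5 * d @ M @ d)
    return value + (sigma_n**2 if same_index else 0.0)

D = 2
l = 1.5
M_iso = np.eye(D) / l**2                                 # M = l^{-2} I
M_ard = np.diag(1.0 / np.array([1.0, 3.0])**2)           # M = diag(l_1, ..., l_D)^{-2}
Lam = np.array([[1.0], [-1.0]])                          # Lambda with fewer than D columns
M_lowrank = Lam @ Lam.T + np.diag(1.0 / np.array([6.0, 6.0])**2)

xp, xq = np.array([0.3, -1.0]), np.array([1.0, 0.5])
for M in (M_iso, M_ard, M_lowrank):
    print(se_kernel_M(xp, xq, M))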

Varying the Hyperparameters

Figure: GP predictions (output y over inputs x_1, x_2) for M = I, M = \mathrm{diag}(1, 3)^{-2}, and M = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix} + \mathrm{diag}(6, 6)^{-2}.



Implementation

• Cholesky decomposition is numerically stable
• Can be used to compute the inverse efficiently

Algorithm 1: GP regression
Data: training data (X, y), test data x_*
Input: hyperparameters \sigma_f^2, l, \sigma_n^2

Training phase:
  K_{ij} \leftarrow k(x_i, x_j)
  L \leftarrow \mathrm{cholesky}(K + \sigma_n^2 I)
  \alpha \leftarrow L^T \backslash (L \backslash y)

Test phase:
  E[f_*] \leftarrow k_*^T \alpha
  v \leftarrow L \backslash k_*
  \mathrm{var}[f_*] \leftarrow k(x_*, x_*) - v^T v

Log marginal likelihood:
  \log p(y | X) \leftarrow -\tfrac{1}{2} y^T \alpha - \sum_i \log L_{ii} - \tfrac{N}{2} \log(2\pi)
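The algorithm maps almost line by line onto numpy. The sketch below (my own transcription, not code from the lecture) assumes a squared exponential kernel and 1-D inputs, with synthetic data.

import numpy as np

def se_kernel(A, B, sigma_f, l):
    d2 = (A[:, None] - B[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * d2 / l**2)

def gp_regression(X, y, x_star, sigma_f=1.0, l=1.0, sigma_n=0.1):
    N = len(X)
    K = se_kernel(X, X, sigma_f, l)
    L = np.linalg.cholesky(K + sigma_n**2 * np.eye(N))    # L = cholesky(K + sigma_n^2 I)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # alpha = L^T \ (L \ y)
    k_star = se_kernel(X, np.atleast_1d(x_star), sigma_f, l)[:, 0]
    mean = k_star @ alpha                                 # E[f*] = k*^T alpha
    v = np.linalg.solve(L, k_star)                        # v = L \ k*
    var = sigma_f**2 - v @ v                              # var[f*] = k(x*, x*) - v^T v
    log_ml = (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
              - 0.5 * N * np.log(2 * np.pi))              # log p(y | X)
    return mean, var, log_ml

rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, 30)
y = np.sin(X) + 0.1 * rng.standard_normal(30)
print(gp_regression(X, y, 0.8))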


Estimating the Hyperparameters

To find optimal hyperparameters we need the marginal likelihood:

p(y | X) = \int p(y | f, X) \, p(f | X) \, df

This expression implicitly depends on the hyperparameters, while y and X are given by the training data. It can be computed in closed form, as all terms are Gaussians.

We take the logarithm, compute the derivative with respect to the hyperparameters and set it to 0. This is the training step.
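In practice this maximization is usually done numerically. The sketch below (an assumption on my part, not the lecture's code) optimizes the closed-form log marginal likelihood of a squared exponential kernel over log-hyperparameters with scipy.

import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, X, y):
    sigma_f, l, sigma_n = np.exp(log_params)              # log parameterization keeps them positive
    K = sigma_f**2 * np.exp(-0.5 * (X[:, None] - X[None, :])**2 / l**2)
    K += (sigma_n**2 + 1e-8) * np.eye(len(X))             # noise term plus small jitter
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha + np.log(np.diag(L)).sum()
            + 0.5 * len(X) * np.log(2 * np.pi))           # negative log p(y | X)

rng = np.random.default_rng(1)
X = rng.uniform(-5, 5, 40)
y = np.sin(X) + 0.1 * rng.standard_normal(40)

res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.5]), args=(X, y))
print(np.exp(res.x))                                      # learned (sigma_f, l, sigma_n)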


Estimating the Hyperparameters

The log marginal likelihood is not necessarily concave, i.e. it can have local maxima.

The local maxima can correspond to sub-optimal solutions.

Figure: log marginal likelihood as a function of the characteristic length scale and the noise standard deviation, together with the GP fits (output y over input x) corresponding to two different local maxima.


Automatic Relevance Determination

• We have seen how the covariance function can be generalized using a matrix M
• If M is diagonal, this results in the kernel function

k(x, x') = \sigma_f \exp\left( -\frac{1}{2} \sum_{i=1}^{D} \eta_i (x_i - x'_i)^2 \right)

• We can interpret the \eta_i as weights for each feature dimension
• Thus, if the length scale l_i = 1/\eta_i of an input dimension is large, the input is less relevant
• During training this is done automatically
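A tiny sketch of this ARD kernel (values purely illustrative): a small weight \eta_i, i.e. a large length scale, makes differences in dimension i barely affect the kernel value.

import numpy as np

def ard_kernel(x, x_prime, eta, sigma_f=1.0):
    # eta: one weight per input dimension
    d2 = (x - x_prime) ** 2
    return sigma_f * np.exp(-0.5 * np.sum(eta * d2))

eta = np.array([1.0, 0.1, 4.0])          # the second dimension has a large length scale -> less relevant
x = np.array([0.5, 2.0, -1.0])
x_prime = np.array([0.0, -3.0, -1.2])
print(ard_kernel(x, x_prime, eta))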


Automatic Relevance Determination

During the optimization process that learns the hyperparameters, the reciprocal length scale \eta_i for one of the input dimensions decreases, i.e. the corresponding input dimension is not very relevant!

Figure: 3-dimensional data; the parameters \eta_1, \eta_2, \eta_3 as they evolve during training.

Gaussian Processes - Classification


Gaussian Processes For Classification

In regression we have y \in \mathbb{R}, in binary classification we have y \in \{-1, +1\}.

To use a GP for classification, we can apply a sigmoid function to the posterior obtained from the GP and compute the class probability as:

p(y = +1 | x) = \sigma(f(x))

If the sigmoid function is symmetric, i.e. \sigma(-z) = 1 - \sigma(z), then we have p(y | x) = \sigma(y f(x)).

A typical type of sigmoid function is the logistic sigmoid:

\sigma(z) = \frac{1}{1 + \exp(-z)}


Application of the Sigmoid Function

Figure: a function sampled from a Gaussian Process, and the sigmoid function applied to the GP function.

Another symmetric sigmoid function is the cumulative Gaussian:

\Phi(z) = \int_{-\infty}^{z} N(x | 0, 1) \, dx


Visualization of Sigmoid Functions

The cumulative Gaussian is slightly steeper than the logistic sigmoid


The Latent Variables

In regression, we directly estimated f as f(x) \sim \mathcal{GP}(m(x), k(x, x')) and values of f were observed in the training data. Now only labels +1 or -1 are observed, and f is treated as a set of latent variables.

A major advantage of the Gaussian process classifier over other methods is that it marginalizes over all latent functions rather than maximizing some model parameters.


Class Prediction with a GP

The aim is to compute the predictive distribution

p(y_* = +1 | X, y, x_*) = \int p(y_* | f_*) \, p(f_* | X, y, x_*) \, df_*

where p(y_* | f_*) = \sigma(f_*).


Class Prediction with a GP

The aim is to compute the predictive distribution

p(y_* = +1 | X, y, x_*) = \int p(y_* | f_*) \, p(f_* | X, y, x_*) \, df_*

where we marginalize over the latent variables from the training data:

p(f_* | X, y, x_*) = \int p(f_* | X, x_*, f) \, p(f | X, y) \, df

Here p(f_* | X, x_*, f) is the predictive distribution of the latent variable (known from regression).


Class Prediction with a GP

The aim is to compute the predictive distribution

p(y_* = +1 | X, y, x_*) = \int p(y_* | f_*) \, p(f_* | X, y, x_*) \, df_*

where we marginalize over the latent variables from the training data:

p(f_* | X, y, x_*) = \int p(f_* | X, x_*, f) \, p(f | X, y) \, df

For this we need the posterior over the latent variables:

p(f | X, y) = \frac{p(y | f) \, p(f | X)}{p(y | X)}

with the likelihood (sigmoid) p(y | f), the prior p(f | X), and the normalizer p(y | X).


A Simple Example

• Red: two-class training data
• Green: mean function of p(f | X, y)
• Light blue: sigmoid of the mean function


But There Is A Problem...

• The likelihood term p(y | f) is not a Gaussian!
• This means we cannot compute the posterior in closed form:

p(f | X, y) = \frac{p(y | f) \, p(f | X)}{p(y | X)}

• There are several different solutions in the literature, e.g.:
  • Laplace approximation
  • Expectation Propagation
  • Variational methods


Laplace Approximation

p(f | X, y) \approx q(f | X, y) = N(f | \hat{f}, A^{-1})

where

\hat{f} = \arg\max_f p(f | X, y)

and

A = -\nabla \nabla \log p(f | X, y) |_{f = \hat{f}}

(a second-order Taylor expansion around the mode). To compute \hat{f}, an iterative approach using Newton's method has to be used.

The Hessian matrix A can be computed as

A = K^{-1} + W

where W = -\nabla \nabla \log p(y | f) is a diagonal matrix which depends on the sigmoid function.
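To make the Newton step concrete, here is a hedged numpy sketch for the logistic-sigmoid likelihood with labels y in {-1, +1}: it iterates to the mode f_hat and forms A = K^{-1} + W there. A numerically careful version would follow the Cholesky-based formulation in Rasmussen & Williams; this plain version is for illustration only.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_mode(K, y, n_iter=20):
    # Newton iteration for f_hat = argmax_f p(f | X, y) with a logistic likelihood
    N = len(y)
    f = np.zeros(N)
    for _ in range(n_iter):
        pi = sigmoid(f)
        grad = (y + 1) / 2 - pi                  # gradient of log p(y | f)
        W = np.diag(pi * (1 - pi))               # W = -grad grad log p(y | f), diagonal
        # Newton update: f <- (K^{-1} + W)^{-1} (W f + grad)
        f = np.linalg.solve(np.linalg.inv(K) + W, W @ f + grad)
    A = np.linalg.inv(K) + W                     # Hessian A = K^{-1} + W at (approximately) the mode
    return f, A

# illustrative 1-D toy data
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, 20)
y = np.where(np.sin(X) + 0.3 * rng.standard_normal(20) > 0, 1, -1)
K = np.exp(-0.5 * (X[:, None] - X[None, :])**2) + 1e-6 * np.eye(20)
f_hat, A = laplace_mode(K, y)
print(f_hat[:5])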


Laplace Approximation

• Yellow: a non-Gaussian posterior
• Red: a Gaussian approximation; its mean is the mode of the posterior, and its variance is the inverse of the negative second derivative of the log posterior at the mode

Predictions

Now that we have p(f | X, y), we can compute:

p(f_* | X, y, x_*) = \int p(f_* | X, x_*, f) \, p(f | X, y) \, df

From the regression case we have:

p(f_* | X, x_*, f) = N(f_* | \mu_*, \Sigma_*)

where

\mu_* = k_*^T K^{-1} f   (linear in f)

\Sigma_* = k(x_*, x_*) - k_*^T K^{-1} k_*

This reminds us of a property of Gaussians that we saw earlier!


Gaussian Properties (Rep.)

If we are given this:

I.  p(x) = N(x | \mu, \Sigma_1)
II. p(y | x) = N(y | Ax + b, \Sigma_2)

Then it follows (properties of Gaussians):

III. p(y) = N(y | A\mu + b, \Sigma_2 + A \Sigma_1 A^T)
IV.  p(x | y) = N(x | \Sigma (A^T \Sigma_2^{-1} (y - b) + \Sigma_1^{-1} \mu), \Sigma)

where \Sigma = (\Sigma_1^{-1} + A^T \Sigma_2^{-1} A)^{-1}

Applying this to Laplace

Using these Gaussian properties with the approximate posterior q(f | X, y) = N(f | \hat{f}, A^{-1}), we obtain

E[f_* | X, y, x_*] = k_*^T K^{-1} \hat{f}

V[f_* | X, y, x_*] = k(x_*, x_*) - k_*^T (K + W^{-1})^{-1} k_*

It remains to compute

p(y_* = +1 | X, y, x_*) = \int p(y_* | f_*) \, p(f_* | X, y, x_*) \, df_*

Depending on the kind of sigmoid function, we
• can compute this in closed form (cumulative Gaussian sigmoid)
• have to use sampling methods or analytical approximations (logistic sigmoid)


A Simple Example

• Two-class problem (training data in red and blue)
• Green line: optimal decision boundary
• Black line: GP classifier decision boundary
• Right: posterior probability


Summary

• Gaussian Processes are Normal distributions over functions
• To specify a GP we need a covariance function (kernel) and a mean function
• For regression we can compute the predictive distribution in closed form
• For classification, we use a sigmoid and have to approximate the latent posterior
• More on Gaussian Processes: http://videolectures.net/epsrcws08_rasmussen_lgp/
