Applying Sequential, Sparse Gaussian Processes – an Illustration Based on
SIC2004
Ben Ingram, Neural Computing Research Group, Aston University, Birmingham, UK
Spatial Interpolation Comparison 2004
What is SIC2004? SIC2004 objectives are to:
- generate results that are reliable
- generate results in the smallest amount of time
- generate results automatically
- deal with anomalies
Data provided: background gamma radiation in Germany
Spatial Interpolation Comparison 2004
Radiation data from 10 randomly selected days were given to participants to devise a method that met the criteria of SIC2004.
For each day there were 200 observations made at the locations shown by red circles.
The aim: to predict, as fast and as accurately as possible, at 808 locations (black crosses) given 200 observations for an 11th randomly selected day.
[Figure: two maps of the study area (coordinates in metres, x-axis ×10^4, y-axis ×10^5) showing the 200 observation locations (red circles) and the 808 prediction locations (black crosses)]
Sequential Sparse Gaussian Processes
Gaussian processes are equivalent to kriging [Cornford 2002]
SSGP uses a subset of the dataset, called 'basis vectors', to best approximate the full Gaussian process
Traditional methods require a matrix inversion, an O(n^3) operation; SSGP reduces this to O(nm^2), where m is the number of 'basis vectors' (see the sketch below)
Model complexity is controlled by the number of 'basis vectors', while the important features in the data are retained
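The following Python sketch illustrates the cost argument using a subset-of-regressors form of sparse GP regression. It is an illustration of the idea rather than the SSGP algorithm itself; the names (`rbf`, `sparse_gp_mean`), the simulated data, and the choice of basis vectors are assumptions made for the example.

```python
import numpy as np

def rbf(X1, X2, length_scale=1.0, amplitude=1.0):
    """Squared-exponential covariance between two sets of 2-D points."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return amplitude * np.exp(-0.5 * d2 / length_scale ** 2)

def sparse_gp_mean(X, y, X_star, m=20, noise=0.1):
    """Predictive mean using m 'basis vectors' (here simply the first m
    training points; SSGP instead selects them sequentially)."""
    Xm = X[:m]                         # basis vectors: a subset of the data
    K_mn = rbf(Xm, X)                  # m x n cross-covariance
    K_mm = rbf(Xm, Xm)                 # m x m covariance of the basis vectors
    # Forming K_mn @ K_mn.T costs O(n m^2); solving the m x m system costs
    # O(m^3) -- compare with the O(n^3) inversion of the full n x n matrix.
    A = noise * K_mm + K_mn @ K_mn.T
    alpha = np.linalg.solve(A, K_mn @ y)
    return rbf(X_star, Xm) @ alpha     # predictive mean at the test locations

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))  # 200 observations, as in one SIC2004 day
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
X_star = rng.uniform(0, 10, size=(5, 2))
print(sparse_gp_mean(X, y, X_star, m=30))
```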
Sequential Sparse Gaussian Processes
Bayesian approach: utilizes prior knowledge such as experience, expert knowledge, or previous datasets
Model parameters are described by a prior probability distribution
Likelihood: how likely is it that the parameters w generated the data D?
The posterior distribution of the parameters is proportional to the product of the likelihood and the prior
Bayes rule:
P(w|D) = P(D|w) P(w) / P(D)
where P(w|D) is the posterior, P(D|w) the likelihood, P(w) the prior, and P(D) the normalising constant.
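As a toy numeric illustration of the rule (not part of SSGP), the posterior over a single parameter can be computed on a grid; every value below is made up for the example.

```python
import numpy as np

w = np.linspace(0, 1, 101)                     # candidate parameter values
prior = np.exp(-0.5 * ((w - 0.5) / 0.2) ** 2)  # prior belief P(w)
data = np.array([0.55, 0.60, 0.52])            # observed data D (made up)
# Gaussian likelihood P(D|w) with observation noise 0.1:
lik = np.prod(np.exp(-0.5 * ((data[:, None] - w[None, :]) / 0.1) ** 2), axis=0)
posterior = prior * lik          # proportional to P(w|D)
posterior /= posterior.sum()     # divide by P(D), the normalising constant
print("posterior mean of w:", (w * posterior).sum())
```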
Choosing a Model for SSGP
The Machine Learning community treats estimation of the covariance function differently:
- In geostatistics, the experimental variogram is computed and an appropriate model is fitted
- In machine learning, the model is chosen based on experience or informed intuition
How could the 10 prior datasets be used?
- Assume the data are independent but identically distributed
- Compute experimental variograms for a subset of the data (160 observations) for each of the 10 prior days
- Fit various variogram models and use them in cross-validation for predicting at the 40 withheld locations (see the sketch after this list)
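A minimal sketch of computing an experimental variogram as described above: half the mean squared difference between pairs of observations, binned by separation distance. The function name, bin count, and simulated data are assumptions for illustration.

```python
import numpy as np

def experimental_variogram(coords, values, n_bins=15, max_lag=None):
    """Return bin-centre lags and semivariances for one day's observations."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    sq = 0.5 * (values[:, None] - values[None, :]) ** 2  # semivariance per pair
    iu = np.triu_indices(len(values), k=1)               # count each pair once
    lags, gammas = d[iu], sq[iu]
    max_lag = max_lag or lags.max()
    edges = np.linspace(0, max_lag, n_bins + 1)
    which = np.digitize(lags, edges) - 1                 # bin index per pair
    centres = 0.5 * (edges[:-1] + edges[1:])
    gamma = np.array([gammas[which == b].mean() if (which == b).any() else np.nan
                      for b in range(n_bins)])
    return centres, gamma

rng = np.random.default_rng(1)
coords = rng.uniform(0, 2e5, size=(160, 2))  # 160-point subset, as in the text
values = rng.normal(98, 20, size=160)        # simulated stand-in observations
print(experimental_variogram(coords, values))
```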
Variography
[Figure: experimental variogram (variance vs. lag distance) with fitted Mixture, SqExp (Gaussian), and Exp models]
Several models were fitted, including mixtures of models
The mixture model consistently fitted better
Variography
The experimental variogram was used to select the covariance model for SSGP
There were insufficient observations at smaller lag distances to learn the short-range behaviour
Assume little variation at short separation distances
Use a tighter variance on the hyper-parameters of the squared exponential component (see the sketch below)
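A sketch of what such a mixture covariance might look like in code: an anisotropic squared-exponential component plus an anisotropic exponential component and a nugget term. The default values echo the hyper-parameter table at the end of the talk, but the function itself is illustrative, not the SSGP implementation.

```python
import numpy as np

def mixture_cov(X1, X2, sqexp_range=(0.02, 0.24), sqexp_amp=0.05,
                exp_range=(1.89, 1.11), exp_amp=0.90, nugget=0.30):
    """Anisotropic SqExp + Exp mixture; nugget added on the diagonal only."""
    def scaled_dist(ranges):
        # Different range per coordinate axis makes the covariance anisotropic.
        diff = (X1[:, None, :] - X2[None, :, :]) / np.asarray(ranges)
        return np.linalg.norm(diff, axis=-1)
    k = (sqexp_amp * np.exp(-0.5 * scaled_dist(sqexp_range) ** 2)
         + exp_amp * np.exp(-scaled_dist(exp_range)))
    if X1 is X2:
        k += nugget * np.eye(len(X1))  # nugget: uncorrelated noise variance
    return k

X = np.random.default_rng(2).uniform(0, 1, size=(5, 2))
print(mixture_cov(X, X).round(3))
```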
Boosting
Boosting was used to estimate the 'best' hyper-parameters (nugget, sill and range)
Adjust the hyper-parameters to maximize the likelihood of the training data; an iterative method is used to search for the optimal values (see the sketch after this list)
Boosting assumes that each iterative step towards the optimal hyper-parameters is a linear combination of the individual iterative steps calculated for each day
Leave-one-out cross-validation was used: 9 days were used to estimate the optimal parameters, and the resulting hyper-parameters were used as the mean values for the hyper-parameters on the left-out dataset
Some information about the hyper-parameters is learnt, but the values are not fixed; differing degrees of uncertainty are associated with each hyper-parameter
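A hedged sketch of the underlying likelihood maximisation: a plain numerical optimiser adjusting (range, sill, nugget) to maximise the GP log marginal likelihood for one day's data. This stands in for the boosting-style combination of per-day steps described above, which is not reproduced here; the data and bounds are made up.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, X, y):
    """Negative GP log evidence for an isotropic SqExp + nugget model."""
    rnge, sill, nugget = np.exp(log_params)   # log-space keeps them positive
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    K = sill * np.exp(-0.5 * (d / rnge) ** 2) + nugget * np.eye(len(y))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # 0.5 y^T K^{-1} y + 0.5 log|K|, dropping the constant term
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum()

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(60, 2))           # one small, made-up "day"
y = np.sin(4 * X[:, 0]) + 0.1 * rng.standard_normal(60)
res = minimize(neg_log_marginal_likelihood, x0=np.log([0.3, 1.0, 0.1]),
               args=(X, y), method="L-BFGS-B",
               bounds=[(np.log(1e-2), np.log(10))] * 3)
print("range, sill, nugget:", np.exp(res.x).round(3))
```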
Interpolating using SSGP
Anisotropic covariance functions were used because we believed that the variation was not uniform in all directions
The learnt hyper-parameters were used to set the initial hyper-parameter values for SSGP
How was the number of 'basis vectors' (model complexity) chosen? By cross-validation: accuracy decreases as the number of 'basis vectors' decreases (see the sketch below)
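A sketch of that cross-validation: hold out some locations, vary the number of basis vectors m, and watch the held-out error grow as m shrinks. The sparse predictor is the same illustrative subset-of-regressors form sketched earlier, not SSGP itself, and the 160/40 split echoes the variography experiment above.

```python
import numpy as np

def sparse_predict(Xtr, ytr, Xte, m, noise=0.1):
    """Subset-of-regressors predictive mean with m basis vectors."""
    d2 = lambda A, B: ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    k = lambda A, B: np.exp(-0.5 * d2(A, B))
    Xm = Xtr[:m]
    K_mn = k(Xm, Xtr)
    A = noise * k(Xm, Xm) + K_mn @ K_mn.T
    return k(Xte, Xm) @ np.linalg.solve(A, K_mn @ ytr)

rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
train, test = np.arange(160), np.arange(160, 200)  # 160/40 hold-out split
for m in (10, 20, 40, 80, 160):
    pred = sparse_predict(X[train], y[train], X[test], m)
    rmse = np.sqrt(((pred - y[test]) ** 2).mean())
    print(f"m={m:4d}  held-out RMSE={rmse:.3f}")
```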
Using our method with the competition data
SSGP was used with the 11th day's dataset to predict at the 808 locations
In addition to the data for the 11th day, a 'joker' dataset was given
The 'joker' dataset simulated a radiation leak into the environment, but contestants did not know this until after the contest
SSGP was used with the 'joker' dataset to predict at the same 808 locations
Results
To determine how well SSGP performed, we compared it with some standard machine learning techniques:
- Multi-layer perceptrons (MLP)
- Radial basis functions (RBF)
- Gaussian processes (GP)
The Netlab Matlab toolbox was used for the calculations
N = 808          Min     Max     Mean   Median  Std. dev.
Observed         57.00   180.00  98.01  98.80   20.02
SSGP             68.82   125.41  96.75  98.96   14.41
GP (mixture)     67.07   127.20  96.58  98.65   14.98
GP (sqexp)       69.26   123.77  96.87  99.19   14.32
RBF              68.22   129.55  96.85  98.66   14.58
MLP              66.05   129.13  96.80  98.07   14.65

                 MAE    ME     Pearson's r  RMSE
SSGP             9.10   -1.27  0.788        12.46
GP (mixture)     9.08   -1.44  0.787        12.47
GP (sqexp)       9.47   -1.15  0.776        12.75
RBF              9.49   -1.19  0.776        12.71
MLP              9.48   -1.22  0.775        12.73
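The error statistics in the table above (MAE, ME, Pearson's r, RMSE) are standard and can be reproduced in a few lines; `obs` and `pred` below are simulated stand-ins, not the competition data.

```python
import numpy as np

def error_stats(obs, pred):
    resid = pred - obs
    return {
        "MAE": np.abs(resid).mean(),                # mean absolute error
        "ME": resid.mean(),                         # mean error (bias)
        "Pearson r": np.corrcoef(obs, pred)[0, 1],  # linear correlation
        "RMSE": np.sqrt((resid ** 2).mean()),       # root mean squared error
    }

rng = np.random.default_rng(5)
obs = rng.normal(98, 20, size=808)           # stand-in for observed values
pred = obs + rng.normal(-1.3, 12, size=808)  # stand-in for SSGP predictions
print(error_stats(obs, pred))
```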
Contour Maps
[Figure: contour maps comparing SSGP, GP, and actual values over the study area (coordinates in metres); colour scale from 60 to 190]
Results – Joker dataset

N = 808          Min      Max      Mean    Median  Std. dev.
Observed         57.00    1528.20  105.42  98.95   83.71
SSGP             74.25    634.49   100.78  95.68   39.22
GP (mixture)     87.09    150.81   106.13  102.46  15.51
GP (sqexp)       80.80    161.73   108.51  101.28  22.53
RBF              82.22    160.73   108.22  101.77  21.91
MLP              -129.18  760.02   102.41  94.71   80.03

                 MAE    ME     Pearson's r  RMSE
SSGP             18.55  -4.64  0.856        54.22
GP (mixture)     21.77  0.72   0.350        79.57
GP (sqexp)       22.53  3.09   0.331        79.16
RBF              22.73  3.21   0.334        79.31
MLP              48.41  -3.01  0.384        90.89
Contour Maps - Joker
[Figure: contour maps for the joker dataset comparing SSGP, GP, and actual values (coordinates in metres); colour scale from 60 to 190]
Learnt hyper-parameters
The exponential range parameters break down as the noise parameter becomes large
The squared exponential parameters remain relatively constant between the datasets
                 Original  Joker dataset
SqExp x range    0.02      0.03
SqExp y range    0.24      0.24
SqExp amplitude  0.05      0.05
Exp x range      1.89      0.10
Exp y range      1.11      0.00
Exp amplitude    0.90      60.12
Noise            0.30      41.59
Conclusions
Once the nature of the covariance structure is understood, interpolation with SSGP is completely automatic
There were problems predicting when there were extreme values; this would be expected
Incorporating a robust estimation method for data with anomalies should be investigated
For the 11th day's dataset, SSGP and GP produced similar results, but SSGP is faster
SSGP was devised for large datasets, but it can also improve speed on small datasets
Acknowledgements
Lehel Csato – Developer of SSGP algorithm
SSGP software available from:
http://www.ncrg.aston.ac.uk