Applying Sequential, Sparse Gaussian Processes – an Illustration Based on
SIC2004
Ben Ingram, Neural Computing Research Group, Aston University, Birmingham, UK
Spatial Interpolation Comparison 2004
What is SIC2004? SIC2004 objectives are to:
- generate results that are reliable
- generate results in the smallest amount of time
- generate results automatically
- deal with anomalies
Data provided: background gamma radiation in Germany
Spatial Interpolation Comparison 2004
Radiation data from 10 randomly selected days were given to participants to devise a method that met the criteria of SIC2004.
For each day there were 200 observations made at the locations shown by red circles.
The aim: to predict, as fast and as accurately as possible, at 808 locations (black crosses) given 200 observations for an 11th randomly selected day.
[Figure: two maps of the study area (coordinates in metres, x-axis ×10^4, y-axis ×10^5) showing the 200 observation locations (red circles) and the 808 prediction locations (black crosses)]
Sequential Sparse Gaussian Processes
Gaussian processes are equivalent to kriging [Cornford 2002]
SSGP uses a subset of the dataset, called 'basis vectors', to best approximate the full Gaussian process
Traditional methods require a matrix inversion, an O(n^3) operation; SSGP reduces this to O(nm^2), where m is the number of 'basis vectors' (see the sketch below)
Model complexity is controlled by the number of 'basis vectors', while the important features in the data are retained
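The following Python sketch illustrates the cost argument using a subset-of-regressors form of sparse GP regression. It is an illustration of the idea rather than the SSGP algorithm itself; the names (`rbf`, `sparse_gp_mean`), the simulated data, and the choice of basis vectors are assumptions made for the example.

```python
import numpy as np

def rbf(X1, X2, length_scale=1.0, amplitude=1.0):
    """Squared-exponential covariance between two sets of 2-D points."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return amplitude * np.exp(-0.5 * d2 / length_scale ** 2)

def sparse_gp_mean(X, y, X_star, m=20, noise=0.1):
    """Predictive mean using m 'basis vectors' (here simply the first m
    training points; SSGP instead selects them sequentially)."""
    Xm = X[:m]                         # basis vectors: a subset of the data
    K_mn = rbf(Xm, X)                  # m x n cross-covariance
    K_mm = rbf(Xm, Xm)                 # m x m covariance of the basis vectors
    # Forming K_mn @ K_mn.T costs O(n m^2); solving the m x m system costs
    # O(m^3) -- compare with the O(n^3) inversion of the full n x n matrix.
    A = noise * K_mm + K_mn @ K_mn.T
    alpha = np.linalg.solve(A, K_mn @ y)
    return rbf(X_star, Xm) @ alpha     # predictive mean at the test locations

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))  # 200 observations, as in one SIC2004 day
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
X_star = rng.uniform(0, 10, size=(5, 2))
print(sparse_gp_mean(X, y, X_star, m=30))
```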
Sequential Sparse Gaussian Processes
Bayesian approach: utilizes prior knowledge such as experience, expert knowledge, or previous datasets
Model parameters are described by a prior probability distribution
Likelihood: how likely is it that the parameters w generated the data D?
The posterior distribution of the parameters is proportional to the product of the likelihood and the prior
Bayes rule:
P(w|D) = P(D|w) P(w) / P(D)
where P(w|D) is the posterior, P(D|w) the likelihood, P(w) the prior, and P(D) the normalising constant.
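As a toy numeric illustration of the rule (not part of SSGP), the posterior over a single parameter can be computed on a grid; every value below is made up for the example.

```python
import numpy as np

w = np.linspace(0, 1, 101)                     # candidate parameter values
prior = np.exp(-0.5 * ((w - 0.5) / 0.2) ** 2)  # prior belief P(w)
data = np.array([0.55, 0.60, 0.52])            # observed data D (made up)
# Gaussian likelihood P(D|w) with observation noise 0.1:
lik = np.prod(np.exp(-0.5 * ((data[:, None] - w[None, :]) / 0.1) ** 2), axis=0)
posterior = prior * lik          # proportional to P(w|D)
posterior /= posterior.sum()     # divide by P(D), the normalising constant
print("posterior mean of w:", (w * posterior).sum())
```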
Choosing a Model for SSGP
The Machine Learning community treats estimation of the covariance function differently:
- In geostatistics, the experimental variogram is computed and an appropriate model is fitted
- In machine learning, the model is chosen based on experience or informed intuition
How could the 10 prior datasets be used?
- Assume the data are independent but identically distributed
- Compute experimental variograms for a subset of the data (160 observations) for each of the 10 prior days
- Fit various variogram models and use them in cross-validation for predicting at the 40 withheld locations (see the sketch after this list)
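A minimal sketch of computing an experimental variogram as described above: half the mean squared difference between pairs of observations, binned by separation distance. The function name, bin count, and simulated data are assumptions for illustration.

```python
import numpy as np

def experimental_variogram(coords, values, n_bins=15, max_lag=None):
    """Return bin-centre lags and semivariances for one day's observations."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    sq = 0.5 * (values[:, None] - values[None, :]) ** 2  # semivariance per pair
    iu = np.triu_indices(len(values), k=1)               # count each pair once
    lags, gammas = d[iu], sq[iu]
    max_lag = max_lag or lags.max()
    edges = np.linspace(0, max_lag, n_bins + 1)
    which = np.digitize(lags, edges) - 1                 # bin index per pair
    centres = 0.5 * (edges[:-1] + edges[1:])
    gamma = np.array([gammas[which == b].mean() if (which == b).any() else np.nan
                      for b in range(n_bins)])
    return centres, gamma

rng = np.random.default_rng(1)
coords = rng.uniform(0, 2e5, size=(160, 2))  # 160-point subset, as in the text
values = rng.normal(98, 20, size=160)        # simulated stand-in observations
print(experimental_variogram(coords, values))
```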
Variography
[Figure: experimental variogram (variance vs. lag distance) with fitted Mixture, SqExp (Gaussian), and Exp models]
Several models were fitted, including mixtures of models
The mixture model consistently fitted better
Variography
The experimental variogram was used to select the covariance model for SSGP
There were insufficient observations at smaller lag distances to learn the short-range behaviour
Assume little variation at short separation distances
Use a tighter variance on the hyper-parameters of the squared exponential component (see the sketch below)
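A sketch of what such a mixture covariance might look like in code: an anisotropic squared-exponential component plus an anisotropic exponential component and a nugget term. The default values echo the hyper-parameter table at the end of the talk, but the function itself is illustrative, not the SSGP implementation.

```python
import numpy as np

def mixture_cov(X1, X2, sqexp_range=(0.02, 0.24), sqexp_amp=0.05,
                exp_range=(1.89, 1.11), exp_amp=0.90, nugget=0.30):
    """Anisotropic SqExp + Exp mixture; nugget added on the diagonal only."""
    def scaled_dist(ranges):
        # Different range per coordinate axis makes the covariance anisotropic.
        diff = (X1[:, None, :] - X2[None, :, :]) / np.asarray(ranges)
        return np.linalg.norm(diff, axis=-1)
    k = (sqexp_amp * np.exp(-0.5 * scaled_dist(sqexp_range) ** 2)
         + exp_amp * np.exp(-scaled_dist(exp_range)))
    if X1 is X2:
        k += nugget * np.eye(len(X1))  # nugget: uncorrelated noise variance
    return k

X = np.random.default_rng(2).uniform(0, 1, size=(5, 2))
print(mixture_cov(X, X).round(3))
```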
Boosting
Boosting was used to estimate the 'best' hyper-parameters (nugget, sill and range)
Adjust the hyper-parameters to maximize the likelihood of the training data; an iterative method is used to search for the optimal values (see the sketch after this list)
Boosting assumes that each iterative step towards the optimal hyper-parameters is a linear combination of the individual iterative steps calculated for each day
Leave-one-out cross-validation was used: 9 days were used to estimate the optimal parameters, and the resulting hyper-parameters were used as the mean values for the hyper-parameters on the left-out dataset
Some information about the hyper-parameters is learnt, but the values are not fixed; differing degrees of uncertainty are associated with each hyper-parameter
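A hedged sketch of the underlying likelihood maximisation: a plain numerical optimiser adjusting (range, sill, nugget) to maximise the GP log marginal likelihood for one day's data. This stands in for the boosting-style combination of per-day steps described above, which is not reproduced here; the data and bounds are made up.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, X, y):
    """Negative GP log evidence for an isotropic SqExp + nugget model."""
    rnge, sill, nugget = np.exp(log_params)   # log-space keeps them positive
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    K = sill * np.exp(-0.5 * (d / rnge) ** 2) + nugget * np.eye(len(y))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # 0.5 y^T K^{-1} y + 0.5 log|K|, dropping the constant term
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum()

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(60, 2))           # one small, made-up "day"
y = np.sin(4 * X[:, 0]) + 0.1 * rng.standard_normal(60)
res = minimize(neg_log_marginal_likelihood, x0=np.log([0.3, 1.0, 0.1]),
               args=(X, y), method="L-BFGS-B",
               bounds=[(np.log(1e-2), np.log(10))] * 3)
print("range, sill, nugget:", np.exp(res.x).round(3))
```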
Interpolating using SSGP
Anisotropic covariance functions were used because we believed that the variation was not uniform in all directions
The learnt hyper-parameters were used to set the initial hyper-parameter values for SSGP
How was the number of 'basis vectors' (model complexity) chosen? By cross-validation: accuracy decreases as the number of 'basis vectors' decreases (see the sketch below)
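A sketch of that cross-validation: hold out some locations, vary the number of basis vectors m, and watch the held-out error grow as m shrinks. The sparse predictor is the same illustrative subset-of-regressors form sketched earlier, not SSGP itself, and the 160/40 split echoes the variography experiment above.

```python
import numpy as np

def sparse_predict(Xtr, ytr, Xte, m, noise=0.1):
    """Subset-of-regressors predictive mean with m basis vectors."""
    d2 = lambda A, B: ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    k = lambda A, B: np.exp(-0.5 * d2(A, B))
    Xm = Xtr[:m]
    K_mn = k(Xm, Xtr)
    A = noise * k(Xm, Xm) + K_mn @ K_mn.T
    return k(Xte, Xm) @ np.linalg.solve(A, K_mn @ ytr)

rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
train, test = np.arange(160), np.arange(160, 200)  # 160/40 hold-out split
for m in (10, 20, 40, 80, 160):
    pred = sparse_predict(X[train], y[train], X[test], m)
    rmse = np.sqrt(((pred - y[test]) ** 2).mean())
    print(f"m={m:4d}  held-out RMSE={rmse:.3f}")
```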
Using our method with the competition data
SSGP was used with the 11th day's dataset to predict at the 808 locations
In addition to the data for the 11th day, a 'joker' dataset was given
The 'joker' dataset simulated a radiation leak into the environment, but contestants did not know this until after the contest
SSGP was used with the 'joker' dataset to predict at the same 808 locations
Results
To determine how well SSGP performed, we compared it with some standard machine learning techniques:
- Multi-layer perceptrons (MLP)
- Radial basis functions (RBF)
- Gaussian processes (GP)
The Netlab Matlab toolbox was used for the calculations
N = 808          Min     Max     Mean   Median  Std. dev.
Observed         57.00   180.00  98.01  98.80   20.02
SSGP             68.82   125.41  96.75  98.96   14.41
GP (mixture)     67.07   127.20  96.58  98.65   14.98
GP (sqexp)       69.26   123.77  96.87  99.19   14.32
RBF              68.22   129.55  96.85  98.66   14.58
MLP              66.05   129.13  96.80  98.07   14.65

                 MAE    ME     Pearson's r  RMSE
SSGP             9.10   -1.27  0.788        12.46
GP (mixture)     9.08   -1.44  0.787        12.47
GP (sqexp)       9.47   -1.15  0.776        12.75
RBF              9.49   -1.19  0.776        12.71
MLP              9.48   -1.22  0.775        12.73
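The error statistics in the table above (MAE, ME, Pearson's r, RMSE) are standard and can be reproduced in a few lines; `obs` and `pred` below are simulated stand-ins, not the competition data.

```python
import numpy as np

def error_stats(obs, pred):
    resid = pred - obs
    return {
        "MAE": np.abs(resid).mean(),                # mean absolute error
        "ME": resid.mean(),                         # mean error (bias)
        "Pearson r": np.corrcoef(obs, pred)[0, 1],  # linear correlation
        "RMSE": np.sqrt((resid ** 2).mean()),       # root mean squared error
    }

rng = np.random.default_rng(5)
obs = rng.normal(98, 20, size=808)           # stand-in for observed values
pred = obs + rng.normal(-1.3, 12, size=808)  # stand-in for SSGP predictions
print(error_stats(obs, pred))
```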
Contour Maps
[Figure: contour maps comparing SSGP, GP, and actual values over the study area (coordinates in metres); colour scale from 60 to 190]
Results – Joker dataset

N = 808          Min      Max      Mean    Median  Std. dev.
Observed         57.00    1528.20  105.42  98.95   83.71
SSGP             74.25    634.49   100.78  95.68   39.22
GP (mixture)     87.09    150.81   106.13  102.46  15.51
GP (sqexp)       80.80    161.73   108.51  101.28  22.53
RBF              82.22    160.73   108.22  101.77  21.91
MLP              -129.18  760.02   102.41  94.71   80.03

                 MAE    ME     Pearson's r  RMSE
SSGP             18.55  -4.64  0.856        54.22
GP (mixture)     21.77  0.72   0.350        79.57
GP (sqexp)       22.53  3.09   0.331        79.16
RBF              22.73  3.21   0.334        79.31
MLP              48.41  -3.01  0.384        90.89
Contour Maps - Joker
[Figure: contour maps for the joker dataset comparing SSGP, GP, and actual values (coordinates in metres); colour scale from 60 to 190]
Learnt hyper-parameters
The exponential range parameters break down as the noise parameter becomes large
The squared exponential parameters remain relatively constant between the datasets
                 Original  Joker dataset
SqExp x range    0.02      0.03
SqExp y range    0.24      0.24
SqExp amplitude  0.05      0.05
Exp x range      1.89      0.10
Exp y range      1.11      0.00
Exp amplitude    0.90      60.12
Noise            0.30      41.59
Conclusions
Once the nature of the covariance structure is understood, interpolation with SSGP is completely automatic
There were problems predicting when there were extreme values; this would be expected
Incorporating a robust estimation method for data with anomalies should be investigated
For the 11th day's dataset, SSGP and GP produced similar results, but SSGP is faster
SSGP was devised for large datasets, but it can also improve speed on small datasets
Acknowledgements
Lehel Csato – Developer of SSGP algorithm
SSGP software available from:
http://www.ncrg.aston.ac.uk