Optimal Designs for Gaussian Process Models via Spectral Decomposition
Ofir Harari
Department of Statistics & Actuarial Sciences, Simon Fraser University
Dynamic Computer Experiments, September 2014
Outline
1 Introduction
Gaussian Process Regression
Karhunen-Loève Decomposition of a GP
2 IMSPE-optimal Designs
Application of the K-L Decomposition
Approximate Minimum IMSPE Designs
Adaptive Designs
Extending to Models with Unknown Regression Parameters
3 Other Optimality Criteria
4 Concluding Remarks
Introduction
The Gaussian Process Model
• Suppose that we observe $y = [y(x_1), \ldots, y(x_n)]^T$, where
$$
y(x) = f(x)^T \beta + Z(x).
$$
• Here $f(x)$ is a vector of regression functions evaluated at $x$, and $Z(x)$ is a zero-mean stationary Gaussian process with marginal variance $\sigma^2$ and correlation function $R(\cdot\,; \theta)$.
• An extremely popular modeling approach in spatial statistics, geostatistics, computer experiments and more.
• Sometimes measurement error (i.e. a nugget term) is added.
The Gaussian Process Model (cont.)
Notation:
• $R_{ij} = R(x_i - x_j)$ - the correlation matrix at the design $D = \{x_1, \ldots, x_n\}$.
• $y = [y(x_1), \ldots, y(x_n)]^T$ - the vector of observations at $D$.
• $r(x) = [R(x - x_1), \ldots, R(x - x_n)]^T$ - the vector of correlations at site $x$.
• $F_{ij} = f_j(x_i)$ - the design matrix at $D$.
The Gaussian Process Model (cont.)
Universal Kriging
Assuming $\pi(\beta) \propto 1$,
$$
\hat{y}(x) = E\{y(x) \mid y\} = f^T(x)\, \hat{\beta} + r^T(x)\, R^{-1} \big(y - F \hat{\beta}\big), \qquad (1)
$$
is the minimizer of the mean squared prediction error (MSPE), which will then be
$$
E\big[\{\hat{y}(x) - y(x)\}^2 \,\big|\, y\big] = \operatorname{var}\{y(x) \mid y\}
= \sigma^2 \left( 1 - \begin{bmatrix} f^T(x) & r^T(x) \end{bmatrix}
\begin{bmatrix} 0 & F^T \\ F & R \end{bmatrix}^{-1}
\begin{bmatrix} f(x) \\ r(x) \end{bmatrix} \right).
$$
• If the assumption of Gaussianity is dropped, (1) would still be the BLUP for $y(x)$.
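As a quick illustration, here is a minimal numerical sketch of the universal kriging predictor (1) in Python, assuming an anisotropic squared-exponential correlation (the kernel of the worked example below); all names are illustrative, not from the talk:

```python
import numpy as np

def sq_exp_corr(X1, X2, theta=(1.0, 2.0)):
    """Anisotropic squared-exponential correlation R(x - w; theta)."""
    d2 = sum(t * (X1[:, None, k] - X2[None, :, k]) ** 2
             for k, t in enumerate(theta))
    return np.exp(-d2)

def universal_kriging(X, y, F, x_new, f_new, theta=(1.0, 2.0)):
    """GLS estimate of beta, then the BLUP (1) at the new sites x_new."""
    R = sq_exp_corr(X, X, theta)
    Ri_y = np.linalg.solve(R, y)
    Ri_F = np.linalg.solve(R, F)
    beta_hat = np.linalg.solve(F.T @ Ri_F, F.T @ Ri_y)  # GLS estimator
    r = sq_exp_corr(x_new, X, theta)                    # n_new x n correlations
    return f_new @ beta_hat + r @ np.linalg.solve(R, y - F @ beta_hat)
```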
Spectral Decomposition of a GP
Mercer’s Theorem
If $\mathcal{X}$ is compact and $R$ (the covariance function of a GP) is continuous on $\mathcal{X}^2$, then
$$
R(x, y) = \sum_{i=1}^{\infty} \lambda_i \varphi_i(x) \varphi_i(y),
$$
where the $\varphi_i$'s and the $\lambda_i$'s are the solutions of the homogeneous Fredholm integral equation of the second kind
$$
\int_{\mathcal{X}} R(x, y)\, \varphi_i(y)\, dy = \lambda_i \varphi_i(x),
\qquad \text{with} \qquad
\int_{\mathcal{X}} \varphi_i(x) \varphi_j(x)\, dx = \delta_{ij}.
$$
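The eigenpairs rarely have closed forms. One simple way to approximate them numerically, shown here as an illustrative Nyström-type alternative to the Galerkin-type methods mentioned on the next slide, is to replace the integral by a quadrature sum:

```python
import numpy as np

def nystrom_eigs(corr, nodes, weights, M=17):
    """Discretize sum_j w_j R(x_i, y_j) phi(y_j) = lambda * phi(x_i)
    and solve the equivalent symmetrized matrix eigenproblem."""
    K = corr(nodes, nodes)                      # Gram matrix of the kernel
    s = np.sqrt(weights)
    vals, vecs = np.linalg.eigh(s[:, None] * K * s[None, :])
    idx = np.argsort(vals)[::-1][:M]            # leading M eigenvalues
    lam = vals[idx]
    phi = vecs[:, idx] / s[:, None]             # eigenfunctions at the nodes
    return lam, phi

# Example: uniform grid on X = [-0.5, 0.5]^2 with equal quadrature weights.
g = np.linspace(-0.5, 0.5, 30)
nodes = np.array(np.meshgrid(g, g)).reshape(2, -1).T
weights = np.full(len(nodes), 1.0 / len(nodes))  # vol(X) = 1
corr = lambda A, B: np.exp(-(A[:, None, 0] - B[None, :, 0]) ** 2
                           - 2 * (A[:, None, 1] - B[None, :, 1]) ** 2)
lam, phi = nystrom_eigs(corr, nodes, weights)
```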
Spectral Decomposition of a GP (cont.)
The Karhunen-Loève Decomposition
Under the aforementioned conditions,
$$
Z(x) = \sum_{i=1}^{\infty} \alpha_i \varphi_i(x) \quad \text{(in the MSE sense)},
$$
where the $\alpha_i$'s are independent $N(0, \lambda_i)$ random variables.
• The best approximation (in the MSE sense) of $Z(x)$ is the truncated series in which the eigenvalues are arranged in decreasing order.
• For piecewise analytic $R(x, y)$, $\lambda_k \le c_1 \exp(-c_2 k^{1/d})$ for some constants $c_1$ and $c_2$ (Frauenfelder et al. 2005).
• Numerical solutions involve Galerkin-type methods.
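Continuing the sketch above (with lam, phi and numpy as there), drawing a truncated K-L realization, as in the worked example that follows, is immediate:

```python
# One realization of Z from the first M K-L terms:
# Z(x) = sum_k alpha_k * phi_k(x), with independent alpha_k ~ N(0, lambda_k).
rng = np.random.default_rng(1)
alpha = rng.normal(scale=np.sqrt(lam))  # M independent coefficients
Z = phi @ alpha                         # realization evaluated at the nodes
```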
A Worked Example
$$
R(x, w) = \exp\big\{ -(x_1 - w_1)^2 - 2 (x_2 - w_2)^2 \big\}
$$
[Figure: $\log \lambda_k$ against the index $k$; the spectrum decays rapidly.]
A Worked Example (cont.)
[Figure: contour plots over $[-0.5, 0.5]^2$ of eigenfunctions #1 through #6, with $\lambda$ = 1.3597, 0.7893, 0.5588, 0.328, 0.3244 and 0.1396, respectively.]
A Worked Example (cont.)
A single realization based on the first 17 K-L terms ($\approx$ 99.5% of the process' energy):
[Figure: the realization $y$ plotted over $(X_1, X_2) \in [-0.5, 0.5]^2$.]
IMSPE-optimal Designs
Minimum IMSPE Designs
Integrated Mean Squared Prediction Error
Sacks et al. (1989b) suggested that a suitable design for computer experiments should minimize
$$
\frac{J(D, \hat{y})}{\sigma^2}
= \frac{1}{\sigma^2} \int_{\mathcal{X}} E\big[\{\hat{y}(x) - y(x)\}^2 \,\big|\, y\big]\, dx
= \operatorname{vol}(\mathcal{X}) - \operatorname{tr}\left(
\begin{bmatrix} 0 & F^T \\ F & R \end{bmatrix}^{-1}
\int_{\mathcal{X}}
\begin{bmatrix} f(x) f^T(x) & f(x) r^T(x) \\ r(x) f^T(x) & r(x) r^T(x) \end{bmatrix} dx
\right),
$$
where $\operatorname{vol}(\mathcal{X})$ is the volume of $\mathcal{X}$.
Application of the K-L Decomposition
Consider the standard Gaussian process
$$
y(x) = f(x)^T \beta + Z(x),
$$
where $\beta$ is (for now) a known vector.
• The Kriging predictor would now be $\hat{y}(x) = f(x)^T \beta + r^T(x) R^{-1}(y - F\beta)$.

Proposition
For any compact $\mathcal{X}$ and continuous covariance kernel $R$,
$$
\frac{J(D, \hat{y})}{\sigma^2} = \operatorname{vol}(\mathcal{X}) - \sum_{k=1}^{\infty} \lambda_k^2\, \phi_k^T R^{-1} \phi_k, \qquad (2)
$$
where $\phi_p = [\varphi_p(x_1), \ldots, \varphi_p(x_n)]^T$.
Application of the K-L Decomposition (cont.)
Theorem
1. For any design $D = \{x_1, \ldots, x_n\}$,
$$
J(D, \hat{y}) \;\ge\; \sigma_Z^2 \sum_{k=n+1}^{\infty} \lambda_k. \qquad (3)
$$
2. The lower bound (3) will be achieved if $D$ satisfies
$$
\lambda_k\, \phi_k^T R^{-1} \phi_k =
\begin{cases}
1, & k \le n \\
0, & k > n.
\end{cases}
$$
3. If such an ideal $D$ exists, the prediction $\hat{y}(x)$ at any $x \in \mathcal{X}$ would be identical to the prediction based on the finite-dimensional Bayesian linear regression model $y(x) = \alpha_1 \varphi_1(x) + \cdots + \alpha_n \varphi_n(x)$, where the $\alpha_j \sim N(0, \lambda_j)$ are independent.
Application of the K-L Decomposition (cont.)
• By definition, IMSPE-optimal designs are prediction-oriented. What the Theorem tells us is that, in some sense, they are also useful for variable selection (w.r.t. the K-L basis functions).
• In reality, such ideal designs do not exist, but IMSPE-optimal designs get pretty close:
[Figure: bar plot of $\lambda_k \phi_k^T R^{-1} \phi_k$ against $k$; the values are near 1 for $k \le 17$ and drop toward 0 for $k > 17$.]
Approximate Minimum IMSPE Designs
Approximate IMSPE
$$
\frac{J_M(D, \hat{y})}{\sigma^2}
= \operatorname{vol}(\mathcal{X}) - \sum_{k=1}^{M} \lambda_k^2\, \phi_k^T R^{-1} \phi_k
= \operatorname{vol}(\mathcal{X}) - \operatorname{tr}\big\{\Lambda^2 \Phi^T R^{-1} \Phi\big\},
$$
where
$$
\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_M)
\quad \text{and} \quad
\Phi_{ij} = \varphi_j(x_i), \quad 1 \le i \le n, \; 1 \le j \le M.
$$

Approximate Minimum IMSPE Designs
$$
D^* = \operatorname*{arg\,min}_{D \subset \mathcal{X}} J_M(D, \hat{y})
= \operatorname*{arg\,max}_{D \subset \mathcal{X}} \operatorname{tr}\big\{\Lambda^2 \Phi^T R^{-1} \Phi\big\}
$$
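In practice the search can be run as a continuous optimization over the $n \times d$ design coordinates. A hedged sketch follows; the eigfun argument (an n × M evaluator of the eigenfunctions) and the random start are illustrative assumptions, not the talk's algorithm:

```python
import numpy as np
from scipy.optimize import minimize

def neg_JM_criterion(x_flat, n, corr, lam, eigfun):
    """Negative of tr{Lambda^2 Phi^T R^{-1} Phi} for a candidate design."""
    X = x_flat.reshape(n, -1)
    R = corr(X, X) + 1e-10 * np.eye(n)  # small jitter for numerical stability
    Phi = eigfun(X)                     # Phi_ij = phi_j(x_i), an n x M matrix
    B = np.linalg.solve(R, Phi)         # R^{-1} Phi
    return -np.trace(np.diag(lam ** 2) @ Phi.T @ B)

# e.g., a size-15 design on [-0.5, 0.5]^2 from a random start:
# x0 = np.random.uniform(-0.5, 0.5, size=30)
# res = minimize(neg_JM_criterion, x0, args=(15, corr, lam, eigfun),
#                bounds=[(-0.5, 0.5)] * 30)
```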
A Worked Example (cont.)
Size 7, 15 and 25 approximate minimum IMSPE designs for
$$
R(x, w) = \exp\big\{ -(x_1 - w_1)^2 - 2 (x_2 - w_2)^2 \big\}
$$
Controlling the Relative Truncation Error
Proposition
Under some conditions (which are easily met),
$$
r_M
= \frac{\displaystyle\sum_{k=M+1}^{\infty} \lambda_k^2\, \phi_k^T R^{-1} \phi_k}
{\displaystyle\sum_{j=1}^{\infty} \lambda_j^2\, \phi_j^T R^{-1} \phi_j}
\;\le\;
\frac{\displaystyle\sum_{k=M+1}^{\infty} \lambda_k}{\displaystyle\sum_{j=1}^{\infty} \lambda_j}.
$$
• By conserving enough of the process' energy, we are guaranteed to have a design as close to optimal as we wish.
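Numerically, the ratio can be monitored by truncating both "infinite" sums at some large number of leading terms; a sketch under that assumption:

```python
def relative_truncation_error(lam, Phi, R, M):
    """r_M and its eigenvalue bound, with the infinite sums truncated
    at the len(lam) leading terms supplied by the caller."""
    B = np.linalg.solve(R, Phi)                    # R^{-1} Phi
    q = lam ** 2 * np.einsum('ik,ik->k', Phi, B)   # lam_k^2 phi_k^T R^{-1} phi_k
    return q[M:].sum() / q.sum(), lam[M:].sum() / lam.sum()
```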
Controlling the Relative Truncation Error (cont.)
Relative truncation error vs. sample size for $M = 17$ ($\Longleftrightarrow$ 0.492% energy loss):
[Figure: relative truncation error against sample size for the ideal and true designs; the level $\varepsilon = 0.00492$ is marked.]
Adaptive Designs
• Suppose now that we have the opportunity (or are forced) to run a multi-stage experiment: $n_1$ runs at stage 1, $n_2$ runs at stage 2, and so on.
• We can use this to learn/re-estimate the essential parameters and plug in the new estimates, hopefully improving the design from one stage to the next.
• Such designs are called "adaptive" or "batch-sequential" designs.
Adaptive Designs (cont.)
Denote
$$
R_{\mathrm{aug}} =
\begin{bmatrix}
\underset{n_2 \times n_2}{R_{\mathrm{new}}} & \underset{n_2 \times n_1}{R_{\mathrm{cross}}} \\
\underset{n_1 \times n_2}{R_{\mathrm{cross}}^T} & \underset{n_1 \times n_1}{R_{\mathrm{old}}}
\end{bmatrix},
$$
where $R_{\mathrm{aug}}$, $R_{\mathrm{new}}$, $R_{\mathrm{old}}$ and $R_{\mathrm{cross}}$ are the augmented correlation matrix, the matrix of correlations within the new inputs, the matrix of correlations within the original design, and the cross-correlation matrix, respectively, and
$$
\Phi_{\mathrm{aug}} =
\begin{bmatrix}
\underset{n_2 \times M}{\Phi_{\mathrm{new}}} \\
\underset{n_1 \times M}{\Phi_{\mathrm{old}}}
\end{bmatrix}.
$$
• Parameters may be re-estimated in between batches.
Adaptive Designs (cont.)
Using basic block matrix properties, we look to maximize
$$
\operatorname{tr}\big\{\Lambda^2 \Phi_{\mathrm{aug}}^T R_{\mathrm{aug}}^{-1} \Phi_{\mathrm{aug}}\big\}
= \operatorname{tr}\Big[\Lambda^2 \Big\{
\big(\Phi_{\mathrm{new}}^T - \Phi_{\mathrm{old}}^T R_{\mathrm{old}}^{-1} R_{\mathrm{cross}}^T\big) Q_1^{-1} \Phi_{\mathrm{new}}
+ \big(\Phi_{\mathrm{old}}^T - \Phi_{\mathrm{new}}^T R_{\mathrm{new}}^{-1} R_{\mathrm{cross}}\big) Q_2^{-1} \Phi_{\mathrm{old}}
\Big\}\Big],
$$
where
$$
Q_1 = R_{\mathrm{new}} - R_{\mathrm{cross}} R_{\mathrm{old}}^{-1} R_{\mathrm{cross}}^T
\quad \text{and} \quad
Q_2 = R_{\mathrm{old}} - R_{\mathrm{cross}}^T R_{\mathrm{new}}^{-1} R_{\mathrm{cross}},
$$
and $R_{\mathrm{old}}$ and $\Phi_{\mathrm{old}}$ remain unchanged.
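A sketch of the augmented criterion via this block identity (illustrative; $R_{\mathrm{cross}}$ is $n_2 \times n_1$, as above, and only the new-batch blocks change during the search):

```python
def augmented_criterion(lam, Phi_old, Phi_new, R_old, R_new, R_cross):
    """tr{Lambda^2 Phi_aug^T R_aug^{-1} Phi_aug} via the Schur complements
    Q1 and Q2, avoiding a refactorization of the full (n1+n2) matrix."""
    Q1 = R_new - R_cross @ np.linalg.solve(R_old, R_cross.T)
    Q2 = R_old - R_cross.T @ np.linalg.solve(R_new, R_cross)
    T1 = (Phi_new.T - Phi_old.T @ np.linalg.solve(R_old, R_cross.T)) \
        @ np.linalg.solve(Q1, Phi_new)
    T2 = (Phi_old.T - Phi_new.T @ np.linalg.solve(R_new, R_cross)) \
        @ np.linalg.solve(Q2, Phi_old)
    return np.trace(np.diag(lam ** 2) @ (T1 + T2))
```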
Adaptive Designs (cont.)
Adding 7 new runs to an existing 8-run design ($\lambda = 0.1$):
[Figure: the size 8+7 design on $[-1, 1]^2$, IMSPE = 0.77494, with first- and second-batch points marked.]
Approximate IMSPE Criterion for the Universal Kriging Model
Expanding the covariance kernel using Mercer's Theorem and truncating after the first $M$ terms yields
$$
H_M(D, \hat{y}) = J_M(D, \hat{y}) + \operatorname{tr}\left\{
\big(F^T R^{-1} F\big)^{-1}
\left[ \int_{\mathcal{X}} f(x) f^T(x)\, dx
- 2 A^T \int_{\mathcal{X}} \varphi(x) f^T(x)\, dx
+ A^T A \right] \right\}, \qquad (4)
$$
where $J_M$ is the criterion previously derived for the simple Kriging model, $A = \Lambda \Phi^T R^{-1} F$ and $\varphi(x) = [\varphi_1(x), \ldots, \varphi_M(x)]^T$.
Approximate IMSPE Criterion for the Universal Kriging Model
Expression (4) is somewhat simplified by substituting $f(x) = 1$ and $F = 1_n$:
$$
H_M(D, \hat{y}) = J_M(D, \hat{y}) + \frac{1}{1_n^T R^{-1} 1_n}
\big\{ \operatorname{vol}(\mathcal{X}) - 2 a^T \gamma + a^T a \big\},
$$
where
$$
a = \Lambda \Phi^T R^{-1} 1_n
\quad \text{and} \quad
\gamma = \left[ \int_{\mathcal{X}} \varphi_1(x)\, dx, \ \ldots, \ \int_{\mathcal{X}} \varphi_M(x)\, dx \right]^T.
$$
• The vector of integrals $\gamma$ only needs to be evaluated once.
• For reasonably large $n$, the designs are almost identical to those obtained for the simple Kriging model.
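A sketch of the ordinary-kriging criterion, assuming $\gamma$ has been precomputed once by quadrature (names illustrative):

```python
def HM_criterion(lam, Phi, R, gamma, vol=1.0):
    """J_M plus the intercept-estimation penalty for f(x) = 1, F = 1_n."""
    ones = np.ones(R.shape[0])
    B = np.linalg.solve(R, Phi)                   # R^{-1} Phi
    JM = vol - np.trace(np.diag(lam ** 2) @ Phi.T @ B)
    a = lam * (Phi.T @ np.linalg.solve(R, ones))  # Lambda Phi^T R^{-1} 1_n
    denom = ones @ np.linalg.solve(R, ones)       # 1_n^T R^{-1} 1_n
    return JM + (vol - 2 * a @ gamma + a @ a) / denom
```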
With and Without Estimating the Intercept
[Figure: optimal designs on $[-1, 1]^2$ with and without estimating the intercept, side by side.]
With and Without Estimating the Intercept (cont.)
Deviation from the optimum vs. sample size:
[Figure: $\Delta$ against $n$ for sample sizes 5 through 27.]
Other Optimality Criteria
Bayesian A-optimal Designs
• In classical DOE, a design is called A-optimal if it minimizes $\operatorname{tr}\{\operatorname{var}(\hat{\theta})\}$, where $\hat{\theta}$ is the vector of estimators for the model parameters.
• For our model
$$
y(x) = f(x)^T \beta + Z(x) = f(x)^T \beta + \sum_{k=1}^{\infty} \alpha_k \varphi_k(x),
\qquad \alpha_k \sim N(0, \lambda_k),
$$
we may consider the Bayesian A-optimality criterion
$$
Q(D, \hat{y}) = \sum_{k=1}^{\infty} \operatorname{var}\{\alpha_k \mid y\}.
$$
• When $\beta$ is a known vector, the IMSPE and the A-optimality criteria coincide.
Bayesian A-optimal Designs (cont.)
Proposition
If $\pi(\beta) \propto 1$,
$$
Q(D, \hat{y}) = \operatorname{vol}(\mathcal{X}) - \operatorname{tr}\left(
\begin{bmatrix} 0 & F^T \\ F & R \end{bmatrix}^{-1}
\int_{\mathcal{X}} \begin{bmatrix} 0 & 0 \\ 0 & r(x) r^T(x) \end{bmatrix} dx
\right).
$$
(And hence we don't even need to derive the Mercer expansion.)
[Figure: the IMSPE-optimal and A-optimal designs on $[-1, 1]^2$, side by side.]
Bayesian D-optimal Designs
• In classical DOE, a design is called D-optimal if it minimizes the determinant of $\operatorname{var}(\hat{\theta})$ (typically the inverse Fisher information).
• In our framework, an approximate Bayesian D-optimal design would be
$$
D^* = \operatorname*{arg\,min}_{D} \det\big(\operatorname{var}\{\alpha_1, \ldots, \alpha_M \mid y\}\big),
$$
where
$$
\operatorname{var}\{\alpha_1, \ldots, \alpha_M \mid y\}
= \Lambda - \Lambda \Phi^T \Big[ R^{-1} - R^{-1} F \big(F^T R^{-1} F\big)^{-1} F^T R^{-1} \Big] \Phi \Lambda.
$$
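A sketch of the posterior covariance of $(\alpha_1, \ldots, \alpha_M)$ and the resulting criterion, using the log-determinant for numerical stability (illustrative):

```python
def log_D_criterion(lam, Phi, R, F):
    """log det var{alpha | y} for the approximate Bayesian D-criterion."""
    Ri_Phi = np.linalg.solve(R, Phi)
    Ri_F = np.linalg.solve(R, F)
    # [R^{-1} - R^{-1} F (F^T R^{-1} F)^{-1} F^T R^{-1}] applied to Phi:
    M_Phi = Ri_Phi - Ri_F @ np.linalg.solve(F.T @ Ri_F, F.T @ Ri_Phi)
    L = np.diag(lam)
    post_cov = L - L @ Phi.T @ M_Phi @ L
    return np.linalg.slogdet(post_cov)[1]  # minimize this over designs D
```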
[Figure: the D-optimal, IMSPE-optimal and A-optimal designs on $[-1, 1]^2$, side by side.]
Concluding Remarks
A few comments before we go
• If $Z(x)$ is a zero-mean, non-Gaussian, weakly stationary process with the same correlation structure:
  - the Kriging predictor will no longer be the posterior mean, but will still be the BLUP;
  - the MSPE will no longer be the posterior variance, but all the results still hold.
• Inclusion of measurement error (i.e. a nugget term),
$$
y(x) = f(x)^T \beta + Z(x) + \varepsilon(x), \qquad \varepsilon \sim N(0, \tau^2 I), \qquad \lambda = \tau^2 / \sigma^2,
$$
will simply result in replacing $R$ with $R + \lambda I$ everywhere, although the theory will cease to apply.
  - Great computational benefits for small values of $\lambda$.
Thank you!
References
1. Andrianakis, Ioannis, & Challenor, Peter G. 2012. The effect of the nugget on Gaussian process
emulators of computer models. Computational Statistics and Data Analysis, 56(12), 4215–4228.
2. Atkinson, A.C., Donev, A.N., & Tobias, R.D. 2007. Optimum Experimental Designs, with SAS. Oxford Statistical Science Series.
3. Bursztyn, D., & Steinberg, D. M. 2004. Data analytic tools for understanding random field regression
models. Technometrics, 46(4), 411–420.
4. Cressie, Noel A. C. 1993. Statistics for Spatial Data. Wiley-Interscience.
5. Frauenfelder, Philipp, Schwab, Christoph, & Todor, Radu Alexandru. 2005. Finite Elements for Elliptic
Problems with Stochastic Coefficients. Computer Methods in Applied Mechanics and Engineering,
194(2), 205–228.
6. Guenther, Ronald B., & Lee, John W. 1996. Partial Differential Equations of Mathematical Physics and
Integral Equations. Courier Dover Publications.
7. Jones, Brad, & Johnson, Rachel T. 2009. The Design and Analysis of the Gaussian Process Model.
Quality and Reliability Engineering International, 25(5), 515–524.
8. Journel, A., & Rossi, M. 1989. When Do We Need a Trend Model in Kriging? Mathematical Geology,
21(7), 715–739.
9. Linkletter, Crystal, Bingham, Derek, Hengartner, Nicholas, Higdon, David, & Ye, Kenny Q. 2006.
Variable Selection for Gaussian Process Models in Computer Experiments. Technometrics, 48(4),
478–490.
10. Loeppky, Jason L., Sacks, Jerome, & Welch, William J. 2009. Choosing the Sample Size of a
Computer Experiment: A Practical Guide. Technometrics, 51(4), 366–376.
11. Loeppky, Jason L., Moore, Leslie M., & Williams, Brian J. 2010. Batch Sequential Designs for
Computer Experiments. Journal of Statistical Planning and Inference, 140(6), 1452–1464.
12. Picheny, Victor, Ginsbourger, David, Roustant, Olivier, Haftka, Raphael T., & Kim, Nam H. 2010.
Adaptive Designs of Experiments for Accurate Approximation of a Target Region. Journal of
Mechanical Design, 132(7), 071008.
13. Rasmussen, Carl Edward, & Williams, Christopher K. I. 2006. Gaussian Processes for Machine Learning. The MIT Press.
14. Sacks, Jerome, Welch, William J., Mitchell, Toby J., & Wynn, Henry P. 1989a. Design and Analysis of
Computer Experiments. Statistical Science, 4(4), 409–423.
15. Sacks, Jerome, Schiller, Susannah B., & Welch, William J. 1989b. Designs for Computer Experiments.
Technometrics, 31(1), 41–47.
16. Santner, Thomas J., Williams, Brian J., & Notz, William I. 2003. The Design and Analysis of Computer
Experiments. Springer.
17. Shewry, Michael C., & Wynn, Henry P. 1987. Maximum Entropy Sampling. Journal of Applied
Statistics, 14(2), 165–170.
18. Singhee, Amith, Singhal, Sonia, & Rutenbar, Rob A. 2008. Exploiting Correlation Kernels for Efficient
Handling of Intra-Die Spatial Correlation, with Application to Statistical Timing. Pages 856–861 of:
Proceedings of the conference on Design, Automation and Test in Europe.
19. Wahba, Grace. 1990. Spline Models for Observational Data. Society for Industrial and Applied
Mathematics.