Optimal Designs for Gaussian Process Models via Spectral Decomposition
Ofir Harari
Department of Statistics & Actuarial Sciences, Simon Fraser University
Dynamic Computer Experiments, September 2014
Outline
1 Introduction
Gaussian Process Regression
Karhunen-Loève Decomposition of a GP
2 IMSPE-optimal Designs
Application of the K-L Decomposition
Approximate Minimum IMSPE Designs
Adaptive Designs
Extending to Models with Unknown Regression Parameters
3 Other Optimality Criteria
4 Concluding Remarks
Introduction
The Gaussian Process Model
• Suppose that we observe $y = [y(x_1), \ldots, y(x_n)]^T$, where
$$
y(x) = f(x)^T \beta + Z(x).
$$
• Here $f(x)$ is a vector of regression functions evaluated at $x$, and $Z(x)$ is a zero-mean stationary Gaussian process with marginal variance $\sigma^2$ and correlation function $R(\cdot\,; \theta)$.
• An extremely popular modeling approach in spatial statistics, geostatistics, computer experiments and more.
• Sometimes measurement error (i.e. a nugget term) is added.
The Gaussian Process Model (cont.)
Notation:
• $R_{ij} = R(x_i - x_j)$ - the correlation matrix at the design $D = \{x_1, \ldots, x_n\}$.
• $y = [y(x_1), \ldots, y(x_n)]^T$ - the vector of observations at $D$.
• $r(x) = [R(x - x_1), \ldots, R(x - x_n)]^T$ - the vector of correlations at site $x$.
• $F_{ij} = f_j(x_i)$ - the design matrix at $D$.
The Gaussian Process Model (cont.)
Universal Kriging
Assuming $\pi(\beta) \propto 1$,
$$
\hat{y}(x) = E\{y(x) \mid y\} = f^T(x)\, \hat{\beta} + r^T(x)\, R^{-1} \big(y - F \hat{\beta}\big), \qquad (1)
$$
is the minimizer of the mean squared prediction error (MSPE), which will then be
$$
E\big[\{\hat{y}(x) - y(x)\}^2 \,\big|\, y\big] = \operatorname{var}\{y(x) \mid y\}
= \sigma^2 \left( 1 - \begin{bmatrix} f^T(x) & r^T(x) \end{bmatrix}
\begin{bmatrix} 0 & F^T \\ F & R \end{bmatrix}^{-1}
\begin{bmatrix} f(x) \\ r(x) \end{bmatrix} \right).
$$
• If the assumption of Gaussianity is dropped, (1) would still be the BLUP for $y(x)$.
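As a quick illustration, here is a minimal numerical sketch of the universal kriging predictor (1) in Python, assuming an anisotropic squared-exponential correlation (the kernel of the worked example below); all names are illustrative, not from the talk:

```python
import numpy as np

def sq_exp_corr(X1, X2, theta=(1.0, 2.0)):
    """Anisotropic squared-exponential correlation R(x - w; theta)."""
    d2 = sum(t * (X1[:, None, k] - X2[None, :, k]) ** 2
             for k, t in enumerate(theta))
    return np.exp(-d2)

def universal_kriging(X, y, F, x_new, f_new, theta=(1.0, 2.0)):
    """GLS estimate of beta, then the BLUP (1) at the new sites x_new."""
    R = sq_exp_corr(X, X, theta)
    Ri_y = np.linalg.solve(R, y)
    Ri_F = np.linalg.solve(R, F)
    beta_hat = np.linalg.solve(F.T @ Ri_F, F.T @ Ri_y)  # GLS estimator
    r = sq_exp_corr(x_new, X, theta)                    # n_new x n correlations
    return f_new @ beta_hat + r @ np.linalg.solve(R, y - F @ beta_hat)
```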
Spectral Decomposition of a GP
Mercer’s Theorem
If $\mathcal{X}$ is compact and $R$ (the covariance function of a GP) is continuous on $\mathcal{X}^2$, then
$$
R(x, y) = \sum_{i=1}^{\infty} \lambda_i \varphi_i(x) \varphi_i(y),
$$
where the $\varphi_i$'s and the $\lambda_i$'s are the solutions of the homogeneous Fredholm integral equation of the second kind
$$
\int_{\mathcal{X}} R(x, y)\, \varphi_i(y)\, dy = \lambda_i \varphi_i(x),
\qquad \text{with} \qquad
\int_{\mathcal{X}} \varphi_i(x) \varphi_j(x)\, dx = \delta_{ij}.
$$
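The eigenpairs rarely have closed forms. One simple way to approximate them numerically, shown here as an illustrative Nyström-type alternative to the Galerkin-type methods mentioned on the next slide, is to replace the integral by a quadrature sum:

```python
import numpy as np

def nystrom_eigs(corr, nodes, weights, M=17):
    """Discretize sum_j w_j R(x_i, y_j) phi(y_j) = lambda * phi(x_i)
    and solve the equivalent symmetrized matrix eigenproblem."""
    K = corr(nodes, nodes)                      # Gram matrix of the kernel
    s = np.sqrt(weights)
    vals, vecs = np.linalg.eigh(s[:, None] * K * s[None, :])
    idx = np.argsort(vals)[::-1][:M]            # leading M eigenvalues
    lam = vals[idx]
    phi = vecs[:, idx] / s[:, None]             # eigenfunctions at the nodes
    return lam, phi

# Example: uniform grid on X = [-0.5, 0.5]^2 with equal quadrature weights.
g = np.linspace(-0.5, 0.5, 30)
nodes = np.array(np.meshgrid(g, g)).reshape(2, -1).T
weights = np.full(len(nodes), 1.0 / len(nodes))  # vol(X) = 1
corr = lambda A, B: np.exp(-(A[:, None, 0] - B[None, :, 0]) ** 2
                           - 2 * (A[:, None, 1] - B[None, :, 1]) ** 2)
lam, phi = nystrom_eigs(corr, nodes, weights)
```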
Spectral Decomposition of a GP (cont.)
The Karhunen-Loève Decomposition
Under the aforementioned conditions,
$$
Z(x) = \sum_{i=1}^{\infty} \alpha_i \varphi_i(x) \quad \text{(in the MSE sense)},
$$
where the $\alpha_i$'s are independent $N(0, \lambda_i)$ random variables.
• The best approximation (in the MSE sense) of $Z(x)$ is the truncated series in which the eigenvalues are arranged in decreasing order.
• For piecewise analytic $R(x, y)$, $\lambda_k \le c_1 \exp(-c_2 k^{1/d})$ for some constants $c_1$ and $c_2$ (Frauenfelder et al. 2005).
• Numerical solutions involve Galerkin-type methods.
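Continuing the sketch above (with lam, phi and numpy as there), drawing a truncated K-L realization, as in the worked example that follows, is immediate:

```python
# One realization of Z from the first M K-L terms:
# Z(x) = sum_k alpha_k * phi_k(x), with independent alpha_k ~ N(0, lambda_k).
rng = np.random.default_rng(1)
alpha = rng.normal(scale=np.sqrt(lam))  # M independent coefficients
Z = phi @ alpha                         # realization evaluated at the nodes
```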
A Worked Example
$$
R(x, w) = \exp\big\{ -(x_1 - w_1)^2 - 2 (x_2 - w_2)^2 \big\}
$$
[Figure: $\log \lambda_k$ against the index $k$; the spectrum decays rapidly.]
A Worked Example (cont.)
[Figure: contour plots over $[-0.5, 0.5]^2$ of eigenfunctions #1 through #6, with $\lambda$ = 1.3597, 0.7893, 0.5588, 0.328, 0.3244 and 0.1396, respectively.]
A Worked Example (cont.)
A single realization based on the first 17 K-L terms ($\approx$ 99.5% of the process' energy):
[Figure: the realization $y$ plotted over $(X_1, X_2) \in [-0.5, 0.5]^2$.]
IMSPE-optimal Designs
Minimum IMSPE Designs
Integrated Mean Squared Prediction Error
Sacks et al. (1989b) suggested that a suitable design for computer experiments should minimize
$$
\frac{J(D, \hat{y})}{\sigma^2}
= \frac{1}{\sigma^2} \int_{\mathcal{X}} E\big[\{\hat{y}(x) - y(x)\}^2 \,\big|\, y\big]\, dx
= \operatorname{vol}(\mathcal{X}) - \operatorname{tr}\left(
\begin{bmatrix} 0 & F^T \\ F & R \end{bmatrix}^{-1}
\int_{\mathcal{X}}
\begin{bmatrix} f(x) f^T(x) & f(x) r^T(x) \\ r(x) f^T(x) & r(x) r^T(x) \end{bmatrix} dx
\right),
$$
where $\operatorname{vol}(\mathcal{X})$ is the volume of $\mathcal{X}$.
Application of the K-L Decomposition
Consider the standard Gaussian process
$$
y(x) = f(x)^T \beta + Z(x),
$$
where $\beta$ is (for now) a known vector.
• The Kriging predictor would now be $\hat{y}(x) = f(x)^T \beta + r^T(x) R^{-1}(y - F\beta)$.

Proposition
For any compact $\mathcal{X}$ and continuous covariance kernel $R$,
$$
\frac{J(D, \hat{y})}{\sigma^2} = \operatorname{vol}(\mathcal{X}) - \sum_{k=1}^{\infty} \lambda_k^2\, \phi_k^T R^{-1} \phi_k, \qquad (2)
$$
where $\phi_p = [\varphi_p(x_1), \ldots, \varphi_p(x_n)]^T$.
Application of the K-L Decomposition (cont.)
Theorem
1. For any design $D = \{x_1, \ldots, x_n\}$,
$$
J(D, \hat{y}) \;\ge\; \sigma_Z^2 \sum_{k=n+1}^{\infty} \lambda_k. \qquad (3)
$$
2. The lower bound (3) will be achieved if $D$ satisfies
$$
\lambda_k\, \phi_k^T R^{-1} \phi_k =
\begin{cases}
1, & k \le n \\
0, & k > n.
\end{cases}
$$
3. If such an ideal $D$ exists, the prediction $\hat{y}(x)$ at any $x \in \mathcal{X}$ would be identical to the prediction based on the finite-dimensional Bayesian linear regression model $y(x) = \alpha_1 \varphi_1(x) + \cdots + \alpha_n \varphi_n(x)$, where the $\alpha_j \sim N(0, \lambda_j)$ are independent.
Application of the K-L Decomposition (cont.)
• By definition, IMSPE-optimal designs are prediction-oriented. What the Theorem tells us is that, in some sense, they are also useful for variable selection (w.r.t. the K-L basis functions).
• In reality, such ideal designs do not exist, but IMSPE-optimal designs get pretty close:
[Figure: bar plot of $\lambda_k \phi_k^T R^{-1} \phi_k$ against $k$; the values are near 1 for $k \le 17$ and drop toward 0 for $k > 17$.]
Approximate Minimum IMSPE Designs
Approximate IMSPE
$$
\frac{J_M(D, \hat{y})}{\sigma^2}
= \operatorname{vol}(\mathcal{X}) - \sum_{k=1}^{M} \lambda_k^2\, \phi_k^T R^{-1} \phi_k
= \operatorname{vol}(\mathcal{X}) - \operatorname{tr}\big\{\Lambda^2 \Phi^T R^{-1} \Phi\big\},
$$
where
$$
\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_M)
\quad \text{and} \quad
\Phi_{ij} = \varphi_j(x_i), \quad 1 \le i \le n, \; 1 \le j \le M.
$$

Approximate Minimum IMSPE Designs
$$
D^* = \operatorname*{arg\,min}_{D \subset \mathcal{X}} J_M(D, \hat{y})
= \operatorname*{arg\,max}_{D \subset \mathcal{X}} \operatorname{tr}\big\{\Lambda^2 \Phi^T R^{-1} \Phi\big\}
$$
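In practice the search can be run as a continuous optimization over the $n \times d$ design coordinates. A hedged sketch follows; the eigfun argument (an n × M evaluator of the eigenfunctions) and the random start are illustrative assumptions, not the talk's algorithm:

```python
import numpy as np
from scipy.optimize import minimize

def neg_JM_criterion(x_flat, n, corr, lam, eigfun):
    """Negative of tr{Lambda^2 Phi^T R^{-1} Phi} for a candidate design."""
    X = x_flat.reshape(n, -1)
    R = corr(X, X) + 1e-10 * np.eye(n)  # small jitter for numerical stability
    Phi = eigfun(X)                     # Phi_ij = phi_j(x_i), an n x M matrix
    B = np.linalg.solve(R, Phi)         # R^{-1} Phi
    return -np.trace(np.diag(lam ** 2) @ Phi.T @ B)

# e.g., a size-15 design on [-0.5, 0.5]^2 from a random start:
# x0 = np.random.uniform(-0.5, 0.5, size=30)
# res = minimize(neg_JM_criterion, x0, args=(15, corr, lam, eigfun),
#                bounds=[(-0.5, 0.5)] * 30)
```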
A Worked Example (cont.)
Size 7, 15 and 25 approximate minimum IMSPE designs for
$$
R(x, w) = \exp\big\{ -(x_1 - w_1)^2 - 2 (x_2 - w_2)^2 \big\}
$$
Controlling the Relative Truncation Error
Proposition
Under some conditions (which are easily met),
$$
r_M
= \frac{\displaystyle\sum_{k=M+1}^{\infty} \lambda_k^2\, \phi_k^T R^{-1} \phi_k}
{\displaystyle\sum_{j=1}^{\infty} \lambda_j^2\, \phi_j^T R^{-1} \phi_j}
\;\le\;
\frac{\displaystyle\sum_{k=M+1}^{\infty} \lambda_k}{\displaystyle\sum_{j=1}^{\infty} \lambda_j}.
$$
• By conserving enough of the process' energy, we are guaranteed to have a design as close to optimal as we wish.
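Numerically, the ratio can be monitored by truncating both "infinite" sums at some large number of leading terms; a sketch under that assumption:

```python
def relative_truncation_error(lam, Phi, R, M):
    """r_M and its eigenvalue bound, with the infinite sums truncated
    at the len(lam) leading terms supplied by the caller."""
    B = np.linalg.solve(R, Phi)                    # R^{-1} Phi
    q = lam ** 2 * np.einsum('ik,ik->k', Phi, B)   # lam_k^2 phi_k^T R^{-1} phi_k
    return q[M:].sum() / q.sum(), lam[M:].sum() / lam.sum()
```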
Controlling the Relative Truncation Error (cont.)
Relative truncation error vs. sample size for $M = 17$ ($\Longleftrightarrow$ 0.492% energy loss):
[Figure: relative truncation error against sample size for the ideal and true designs; the level $\varepsilon = 0.00492$ is marked.]
Adaptive Designs
• Suppose now that we have the opportunity (or are forced) to run a multi-stage experiment: $n_1$ runs at stage 1, $n_2$ runs at stage 2, and so on.
• We can use this to learn/re-estimate the essential parameters and plug in the new estimates, hopefully improving the design from one stage to the next.
• Such designs are called "adaptive" or "batch-sequential" designs.
Adaptive Designs (cont.)
Denote
$$
R_{\mathrm{aug}} =
\begin{bmatrix}
\underset{n_2 \times n_2}{R_{\mathrm{new}}} & \underset{n_2 \times n_1}{R_{\mathrm{cross}}} \\
\underset{n_1 \times n_2}{R_{\mathrm{cross}}^T} & \underset{n_1 \times n_1}{R_{\mathrm{old}}}
\end{bmatrix},
$$
where $R_{\mathrm{aug}}$, $R_{\mathrm{new}}$, $R_{\mathrm{old}}$ and $R_{\mathrm{cross}}$ are the augmented correlation matrix, the matrix of correlations within the new inputs, the matrix of correlations within the original design, and the cross-correlation matrix, respectively, and
$$
\Phi_{\mathrm{aug}} =
\begin{bmatrix}
\underset{n_2 \times M}{\Phi_{\mathrm{new}}} \\
\underset{n_1 \times M}{\Phi_{\mathrm{old}}}
\end{bmatrix}.
$$
• Parameters may be re-estimated in between batches.
Adaptive Designs (cont.)
Using basic block matrix properties, we look to maximize
$$
\operatorname{tr}\big\{\Lambda^2 \Phi_{\mathrm{aug}}^T R_{\mathrm{aug}}^{-1} \Phi_{\mathrm{aug}}\big\}
= \operatorname{tr}\Big[\Lambda^2 \Big\{
\big(\Phi_{\mathrm{new}}^T - \Phi_{\mathrm{old}}^T R_{\mathrm{old}}^{-1} R_{\mathrm{cross}}^T\big) Q_1^{-1} \Phi_{\mathrm{new}}
+ \big(\Phi_{\mathrm{old}}^T - \Phi_{\mathrm{new}}^T R_{\mathrm{new}}^{-1} R_{\mathrm{cross}}\big) Q_2^{-1} \Phi_{\mathrm{old}}
\Big\}\Big],
$$
where
$$
Q_1 = R_{\mathrm{new}} - R_{\mathrm{cross}} R_{\mathrm{old}}^{-1} R_{\mathrm{cross}}^T
\quad \text{and} \quad
Q_2 = R_{\mathrm{old}} - R_{\mathrm{cross}}^T R_{\mathrm{new}}^{-1} R_{\mathrm{cross}},
$$
and $R_{\mathrm{old}}$ and $\Phi_{\mathrm{old}}$ remain unchanged.
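A sketch of the augmented criterion via this block identity (illustrative; $R_{\mathrm{cross}}$ is $n_2 \times n_1$, as above, and only the new-batch blocks change during the search):

```python
def augmented_criterion(lam, Phi_old, Phi_new, R_old, R_new, R_cross):
    """tr{Lambda^2 Phi_aug^T R_aug^{-1} Phi_aug} via the Schur complements
    Q1 and Q2, avoiding a refactorization of the full (n1+n2) matrix."""
    Q1 = R_new - R_cross @ np.linalg.solve(R_old, R_cross.T)
    Q2 = R_old - R_cross.T @ np.linalg.solve(R_new, R_cross)
    T1 = (Phi_new.T - Phi_old.T @ np.linalg.solve(R_old, R_cross.T)) \
        @ np.linalg.solve(Q1, Phi_new)
    T2 = (Phi_old.T - Phi_new.T @ np.linalg.solve(R_new, R_cross)) \
        @ np.linalg.solve(Q2, Phi_old)
    return np.trace(np.diag(lam ** 2) @ (T1 + T2))
```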
Adaptive Designs (cont.)
Adding 7 new runs to an existing 8-run design ($\lambda = 0.1$):
[Figure: the size 8+7 design on $[-1, 1]^2$, IMSPE = 0.77494, with first- and second-batch points marked.]
Approximate IMSPE Criterion for the Universal Kriging Model
Expanding the covariance kernel using Mercer's Theorem and truncating after the first $M$ terms yields
$$
H_M(D, \hat{y}) = J_M(D, \hat{y}) + \operatorname{tr}\left\{
\big(F^T R^{-1} F\big)^{-1}
\left[ \int_{\mathcal{X}} f(x) f^T(x)\, dx
- 2 A^T \int_{\mathcal{X}} \varphi(x) f^T(x)\, dx
+ A^T A \right] \right\}, \qquad (4)
$$
where $J_M$ is the criterion previously derived for the simple Kriging model, $A = \Lambda \Phi^T R^{-1} F$ and $\varphi(x) = [\varphi_1(x), \ldots, \varphi_M(x)]^T$.
Approximate IMSPE Criterion for the Universal Kriging Model
Expression (4) is somewhat simplified by substituting $f(x) = 1$ and $F = 1_n$:
$$
H_M(D, \hat{y}) = J_M(D, \hat{y}) + \frac{1}{1_n^T R^{-1} 1_n}
\big\{ \operatorname{vol}(\mathcal{X}) - 2 a^T \gamma + a^T a \big\},
$$
where
$$
a = \Lambda \Phi^T R^{-1} 1_n
\quad \text{and} \quad
\gamma = \left[ \int_{\mathcal{X}} \varphi_1(x)\, dx, \ \ldots, \ \int_{\mathcal{X}} \varphi_M(x)\, dx \right]^T.
$$
• The vector of integrals $\gamma$ only needs to be evaluated once.
• For reasonably large $n$, the designs are almost identical to those obtained for the simple Kriging model.
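A sketch of the ordinary-kriging criterion, assuming $\gamma$ has been precomputed once by quadrature (names illustrative):

```python
def HM_criterion(lam, Phi, R, gamma, vol=1.0):
    """J_M plus the intercept-estimation penalty for f(x) = 1, F = 1_n."""
    ones = np.ones(R.shape[0])
    B = np.linalg.solve(R, Phi)                   # R^{-1} Phi
    JM = vol - np.trace(np.diag(lam ** 2) @ Phi.T @ B)
    a = lam * (Phi.T @ np.linalg.solve(R, ones))  # Lambda Phi^T R^{-1} 1_n
    denom = ones @ np.linalg.solve(R, ones)       # 1_n^T R^{-1} 1_n
    return JM + (vol - 2 * a @ gamma + a @ a) / denom
```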
With and Without Estimating the Intercept
[Figure: optimal designs on $[-1, 1]^2$ with and without estimating the intercept, side by side.]
With and Without Estimating the Intercept (cont.)
Deviation from the optimum vs. sample size:
[Figure: $\Delta$ against $n$ for sample sizes 5 through 27.]
Other Optimality Criteria
Bayesian A-optimal Designs
• In classical DOE, a design is called A-optimal if it minimizes $\operatorname{tr}\{\operatorname{var}(\hat{\theta})\}$, where $\hat{\theta}$ is the vector of estimators for the model parameters.
• For our model
$$
y(x) = f(x)^T \beta + Z(x) = f(x)^T \beta + \sum_{k=1}^{\infty} \alpha_k \varphi_k(x),
\qquad \alpha_k \sim N(0, \lambda_k),
$$
we may consider the Bayesian A-optimality criterion
$$
Q(D, \hat{y}) = \sum_{k=1}^{\infty} \operatorname{var}\{\alpha_k \mid y\}.
$$
• When $\beta$ is a known vector, the IMSPE and the A-optimality criteria coincide.
Bayesian A-optimal Designs (cont.)
Proposition
If $\pi(\beta) \propto 1$,
$$
Q(D, \hat{y}) = \operatorname{vol}(\mathcal{X}) - \operatorname{tr}\left(
\begin{bmatrix} 0 & F^T \\ F & R \end{bmatrix}^{-1}
\int_{\mathcal{X}} \begin{bmatrix} 0 & 0 \\ 0 & r(x) r^T(x) \end{bmatrix} dx
\right).
$$
(And hence we don't even need to derive the Mercer expansion.)
[Figure: the IMSPE-optimal and A-optimal designs on $[-1, 1]^2$, side by side.]
Bayesian D-optimal Designs
• In classical DOE, a design is called D-optimal if it minimizes the determinant of $\operatorname{var}(\hat{\theta})$ (typically the inverse Fisher information).
• In our framework, an approximate Bayesian D-optimal design would be
$$
D^* = \operatorname*{arg\,min}_{D} \det\big(\operatorname{var}\{\alpha_1, \ldots, \alpha_M \mid y\}\big),
$$
where
$$
\operatorname{var}\{\alpha_1, \ldots, \alpha_M \mid y\}
= \Lambda - \Lambda \Phi^T \Big[ R^{-1} - R^{-1} F \big(F^T R^{-1} F\big)^{-1} F^T R^{-1} \Big] \Phi \Lambda.
$$
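A sketch of the posterior covariance of $(\alpha_1, \ldots, \alpha_M)$ and the resulting criterion, using the log-determinant for numerical stability (illustrative):

```python
def log_D_criterion(lam, Phi, R, F):
    """log det var{alpha | y} for the approximate Bayesian D-criterion."""
    Ri_Phi = np.linalg.solve(R, Phi)
    Ri_F = np.linalg.solve(R, F)
    # [R^{-1} - R^{-1} F (F^T R^{-1} F)^{-1} F^T R^{-1}] applied to Phi:
    M_Phi = Ri_Phi - Ri_F @ np.linalg.solve(F.T @ Ri_F, F.T @ Ri_Phi)
    L = np.diag(lam)
    post_cov = L - L @ Phi.T @ M_Phi @ L
    return np.linalg.slogdet(post_cov)[1]  # minimize this over designs D
```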
[Figure: the D-optimal, IMSPE-optimal and A-optimal designs on $[-1, 1]^2$, side by side.]
Concluding Remarks
A few comments before we go
• If $Z(x)$ is a zero-mean, non-Gaussian, weakly stationary process with the same correlation structure:
  - the Kriging predictor will no longer be the posterior mean, but will still be the BLUP;
  - the MSPE will no longer be the posterior variance, but all the results still hold.
• Inclusion of measurement error (i.e. a nugget term),
$$
y(x) = f(x)^T \beta + Z(x) + \varepsilon(x), \qquad \varepsilon \sim N(0, \tau^2 I), \qquad \lambda = \tau^2 / \sigma^2,
$$
will simply result in replacing $R$ with $R + \lambda I$ everywhere, although the theory will cease to apply.
  - Great computational benefits for small values of $\lambda$.
Thank you!
References
1. Andrianakis, Ioannis, & Challenor, Peter G. 2012. The effect of the nugget on Gaussian process
emulators of computer models. Computational Statistics and Data Analysis, 56(12), 4215–4228.
2. Atkinson, A.C., Donev, A.N., & Tobias, R.D. 2007. Optimum Experimental Designs, with SAS. Oxford Statistical Science Series.
3. Bursztyn, D., & Steinberg, D. M. 2004. Data analytic tools for understanding random field regression
models. Technometrics, 46(4), 411–420.
4. Cressie, Noel A. C. 1993. Statistics for Spatial Data. Wiley-Interscience.
5. Frauenfelder, Philipp, Schwab, Christoph, & Todor, Radu Alexandru. 2005. Finite Elements for Elliptic
Problems with Stochastic Coefficients. Computer Methods in Applied Mechanics and Engineering,
194(2), 205–228.
6. Guenther, Ronald B., & Lee, John W. 1996. Partial Differential Equations of Mathematical Physics and
Integral Equations. Courier Dover Publications.
7. Jones, Brad, & Johnson, Rachel T. 2009. The Design and Analysis of the Gaussian Process Model.
Quality and Reliability Engineering International, 25(5), 515–524.
8. Journel, A., & Rossi, M. 1989. When Do We Need a Trend Model in Kriging? Mathematical Geology,
21(7), 715–739.
9. Linkletter, Crystal, Bingham, Derek, Hengartner, Nicholas, Higdon, David, & Ye, Kenny Q. 2006.
Variable Selection for Gaussian Process Models in Computer Experiments. Technometrics, 48(4),
478–490.
10. Loeppky, Jason L., Sacks, Jerome, & Welch, William J. 2009. Choosing the Sample Size of a
Computer Experiment: A Practical Guide. Technometrics, 51(4), 366–376.
11. Loeppky, Jason L., Moore, Leslie M., & Williams, Brian J. 2010. Batch Sequential Designs for
Computer Experiments. Journal of Statistical Planning and Inference, 140(6), 1452–1464.
12. Picheny, Victor, Ginsbourger, David, Roustant, Olivier, Haftka, Raphael T., & Kim, Nam H. 2010.
Adaptive Designs of Experiments for Accurate Approximation of a Target Region. Journal of
Mechanical Design, 132(7), 071008.
13. Rasmussen, Carl Edward, & Williams, Christopher K. I. 2006. Gaussian Processes for Machine Learning. The MIT Press.
14. Sacks, Jerome, Welch, William J., Mitchell, Toby J., & Wynn, Henry P. 1989a. Design and Analysis of
Computer Experiments. Statistical Science, 4(4), 409–423.
15. Sacks, Jerome, Schiller, Susannah B., & Welch, William J. 1989b. Designs for Computer Experiments.
Technometrics, 31(1), 41–47.
16. Santner, Thomas J., Williams, Brian J., & Notz, William I. 2003. The Design and Analysis of Computer
Experiments. Springer.
17. Shewry, Michael C., & Wynn, Henry P. 1987. Maximum Entropy Sampling. Journal of Applied
Statistics, 14(2), 165–170.
18. Singhee, Amith, Singhal, Sonia, & Rutenbar, Rob A. 2008. Exploiting Correlation Kernels for Efficient
Handling of Intra-Die Spatial Correlation, with Application to Statistical Timing. Pages 856–861 of:
Proceedings of the conference on Design, Automation and Test in Europe.
19. Wahba, Grace. 1990. Spline Models for Observational Data. Society for Industrial and Applied
Mathematics.