Spatially Adaptive Bayesian Penalized Splines With Heteroscedastic Errors

Ciprian M. Crainiceanu, David Ruppert, Raymond J. Carroll, Adarsh Joshi, and Billy Goodner

Penalized splines have become an increasingly popular tool for nonparametric smoothing because of their use of low-rank spline bases, which makes computations tractable while maintaining accuracy as good as smoothing splines. This article extends penalized spline methodology by both modeling the variance function nonparametrically and using a spatially adaptive smoothing parameter. This combination is needed for satisfactory inference and can be implemented effectively by Bayesian MCMC. The variance process controlling the spatially adaptive shrinkage of the mean and the variance of the heteroscedastic error process are modeled as log-penalized splines. We discuss the choice of priors and extensions of the methodology, in particular, to multivariate smoothing. A fully Bayesian approach provides the joint posterior distribution of all parameters, in particular, of the error standard deviation and penalty functions. MATLAB, C, and FORTRAN programs implementing our methodology are publicly available.

Key Words: Heteroscedasticity; MCMC; Multivariate smoothing; Regression splines; Spatially adaptive penalty; Thin-plate splines; Variance functions.

Ciprian M. Crainiceanu is Assistant Professor, Department of Biostatistics, Johns Hopkins University, 615 N. Wolfe St. E3636, Baltimore, MD 21205 (E-mail: ccrainic@jhsph.edu). David Ruppert is Andrew Schultz Jr. Professor of Engineering and Professor of Statistical Science, School of Operations Research and Industrial Engineering, Cornell University, Rhodes Hall, Ithaca, NY 14853 (E-mail: dr24@cornell.edu). Raymond J. Carroll is Distinguished Professor of Statistics and Professor of Nutrition and Toxicology, Department of Statistics, 447 Blocker Building, Texas A&M University, College Station, TX 77843-3143 (E-mail: carroll@stat.tamu.edu). Adarsh Joshi is a graduate student, Department of Statistics, 447 Blocker Building, Texas A&M University, College Station, TX 77843-3143 (E-mail: adarsh@stat.tamu.edu). Billy Goodner is a graduate student, Department of Statistics, 447 Blocker Building, Texas A&M University, College Station, TX 77843-3143 (E-mail: bgoodner@stat.tamu.edu).

© 2007 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America. Journal of Computational and Graphical Statistics, Volume 16, Number 2, Pages 265-288. DOI: 10.1198/106186007X208768

1. INTRODUCTION

Penalized splines (Eilers and Marx 1996; Marx and Eilers 1998; Ruppert, Wand, and Carroll 2003) have become a popular nonparametric tool. Their success is due mainly to the use of low-rank bases, which makes computations tractable. Also, penalized splines can be viewed as mixed models and fit with widely available statistical software (Ngo




and Wand 2004; Crainiceanu, Ruppert, and Wand 2005; Wood 2006). Bayesian penalized splines (Ruppert, Wand, and Carroll 2003; Lang and Brezger 2004; Crainiceanu, Ruppert, and Wand 2005) use a stochastic process model as a prior for the regression function. It is typical to assume that both this process and the errors are homoscedastic.

The penalized spline methodology has been extended to heteroscedastic errors using an iterative frequentist approach (Ruppert, Wand, and Carroll 2003). We develop the corresponding Bayesian inferential methodology using a simple, previously unknown solution. Spatially adaptive penalty parameters have also been used before (Ruppert and Carroll 2000; Eilers and Marx 2002; Baladandayuthapani, Mallick, and Carroll 2005; Lang and Brezger 2004; Jullion and Lambert 2006; Pintore, Speckman, and Holmes 2006). However, this is the first article to combine these features. We show that this combination is important and that spatially varying penalties can adapt to spatial heterogeneity of both the mean function and the error variance. Implementation of this extension was not straightforward, but we developed and implemented an algorithm that works well in practice.

Our methodology can be extended to almost any penalized spline model, for example additive, varying coefficient, interaction, and multivariate smoothing models. To illustrate this, we also study multivariate low-rank thin-plate splines. As in the univariate case, we model the mean, the variance, and the smoothing functions nonparametrically and estimate them from the data using a fully Bayesian approach. The computational advantage of low-rank over full-rank smoothers becomes even greater in more than one dimension.

Section 2 provides a quick introduction to penalized splines. Section 3 discusses the choice of priors for penalized spline regression. Section 4 describes how to obtain simultaneous credible bounds for mean and variance functions and their derivatives. Section 5 compares our proposed methodology to the one proposed by Baladandayuthapani et al. (2005) for adaptive univariate penalized spline regression. Section 6 describes the extension to multivariate smoothing, and Section 7 compares our multivariate methodology with other adaptive surface fitting methods. Section 8 discusses an example of bivariate smoothing with heteroscedastic errors and spatially adaptive smoothing. Section 9 presents the full conditional distributions and discusses our implementation of models in MATLAB, FORTRAN, and C. Section 11 provides our conclusions.

2. PENALIZED REGRESSION SPLINES

Consider the regression equation y_i = m(x_i) + ε_i, i = 1, ..., n, where the ε_i are independent and the mean is modeled as

    m(x) = m(x; θ_X) = β_0 + β_1 x + ... + β_p x^p + Σ_{k=1}^{K_m} b_k (x − κ_k^m)_+^p,

where β = (β_0, ..., β_p)^T, b = (b_1, ..., b_{K_m})^T, θ_X = (β^T, b^T)^T, κ_1^m < κ_2^m < ... < κ_{K_m}^m are fixed knots, and a_+^p denotes {max(a, 0)}^p. Following Gray (1994) and Ruppert (2002), we take K_m large enough (e.g., 20) to ensure the desired flexibility. To avoid overfitting, the b_k ~ N{0, σ_b^2(κ_k^m)} and ε_i ~ N{0, σ_ε^2(x_i)} are shrunk towards zero by an amount controlled by σ_b^2(·) and σ_ε^2(·), which vary across the range of x.


The penalized spline model can be written as

    y_i = β_0 + β_1 x_i + ... + β_p x_i^p + Σ_{k=1}^{K_m} b_k (x_i − κ_k^m)_+^p + ε_i,
    ε_i ~ N{0, σ_ε^2(x_i)},  i = 1, ..., n,
    b_k ~ N{0, σ_b^2(κ_k^m)},  k = 1, ..., K_m,    (2.1)

where ε_i and b_k are mutually independent given σ_ε^2(·) and σ_b^2(·). In Sections 2.1 and 2.2, σ_ε^2(x_i) and σ_b^2(κ_k^m) are modeled using log-spline models. Denote by X the n × (p + 1) matrix with ith row X_i = (1, x_i, ..., x_i^p), by Z the n × K_m matrix with ith row Z_i = {(x_i − κ_1^m)_+^p, ..., (x_i − κ_{K_m}^m)_+^p}, and let Y = (y_1, ..., y_n)^T and ε = (ε_1, ..., ε_n)^T. Model (2.1) can be rewritten in matrix form as

    Y = Xβ + Zb + ε,  E{(b^T, ε^T)^T} = 0,  cov{(b^T, ε^T)^T} = blockdiag(Σ_b, Σ_ε),    (2.2)

where the joint conditional distribution of b and ε given Σ_b and Σ_ε is assumed normal. The β parameters are treated as fixed effects. The covariance matrices Σ_b and Σ_ε are diagonal, with the vectors {σ_b^2(κ_1^m), ..., σ_b^2(κ_{K_m}^m)} and {σ_ε^2(x_1), ..., σ_ε^2(x_n)} as main diagonals, respectively. For this model, E(Y) = Xβ and cov(Y) = Σ_ε + Z Σ_b Z^T.
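The article's software is in MATLAB, C, and FORTRAN; purely as an illustrative sketch, the design matrices of (2.2) and the resulting penalized least-squares fit with a single global penalty λ (the non-adaptive special case) can be coded as follows. The function names are ours, not the authors':

```python
def trunc_poly_design(x, knots, p):
    """Design matrices of (2.2): X has rows (1, x_i, ..., x_i^p);
    Z has rows {(x_i - kappa_k)_+^p} over the K_m knots."""
    X = [[xi ** j for j in range(p + 1)] for xi in x]
    Z = [[max(xi - k, 0.0) ** p for k in knots] for xi in x]
    return X, Z

def solve(A, rhs):
    """Gaussian elimination with partial pivoting (small dense systems only)."""
    n = len(A)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    out = [0.0] * n
    for r in range(n - 1, -1, -1):
        out[r] = (M[r][n] - sum(M[r][j] * out[j] for j in range(r + 1, n))) / M[r][r]
    return out

def penalized_fit(x, y, knots, p, lam):
    """Minimize ||y - X beta - Z b||^2 + lam * ||b||^2: a ridge regression
    in which only the spline coefficients b are penalized."""
    X, Z = trunc_poly_design(x, knots, p)
    C = [xr + zr for xr, zr in zip(X, Z)]       # C = [X Z]
    q = len(C[0])
    CtC = [[sum(Ci[a] * Ci[c2] for Ci in C) for c2 in range(q)] for a in range(q)]
    Cty = [sum(Ci[a] * yi for Ci, yi in zip(C, y)) for a in range(q)]
    for j in range(p + 1, q):                   # penalty acts only on the b_k
        CtC[j][j] += lam
    return solve(CtC, Cty)                      # (beta_0..beta_p, b_1..b_K)
```

For a fixed λ this is the familiar ridge-regression solution; the Bayesian model of Section 3 replaces the fixed λ with the prior structure on the precisions.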

2.1 Error Variance Function Estimation

Ignoring heteroscedasticity may lead to incorrect inferences and inefficient estimation, especially when the response standard deviation varies over several orders of magnitude. Moreover, understanding how variability changes with the predictor may be of intrinsic interest (Carroll 2003). Transformations can stabilize the response variance when the conditional response variance is a function of the conditional expectation (Carroll and Ruppert 1988), but in other cases one needs to estimate it by modeling the variance as a function of the predictors.

We model the variance function as a loglinear model:

    log σ_ε^2(x_i) = γ_0 + ... + γ_p x_i^p + Σ_{s=1}^{K_ε} c_s (x_i − κ_s^ε)_+^p,
    c_s ~ N(0, σ_c^2),  s = 1, ..., K_ε,    (2.3)

where the γ parameters are fixed effects and κ_1^ε < ... < κ_{K_ε}^ε are knots. The normal distribution of the c_s parameters shrinks them towards 0 and ensures stable estimation.
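For intuition, data with a log-spline variance function of the form (2.3) can be simulated directly. The following Python sketch is illustrative only; the function names are ours, not from the article's software:

```python
import math
import random

def log_var(x, gamma, c, knots, p):
    """log sigma_eps^2(x) as in (2.3): polynomial part plus truncated-power part."""
    out = sum(g * x ** j for j, g in enumerate(gamma))
    out += sum(cs * max(x - k, 0.0) ** p for cs, k in zip(c, knots))
    return out

def simulate(n, m, gamma, c, knots, p, seed=1):
    """Draw y_i = m(x_i) + eps_i, eps_i ~ N{0, sigma_eps^2(x_i)},
    on an equally spaced grid in [0, 1]."""
    rng = random.Random(seed)
    xs = [i / (n - 1) for i in range(n)]
    sds = [math.exp(0.5 * log_var(x, gamma, c, knots, p)) for x in xs]
    ys = [m(x) + rng.gauss(0.0, s) for x, s in zip(xs, sds)]
    return xs, ys
```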

2.2 Smoothly Varying Local Penalties

Following Baladandayuthapani et al. (2005), we model σ_b^2(·) as

    log σ_b^2(x) = δ_0 + ... + δ_q x^q + Σ_{s=1}^{K_b} d_s (x − κ_s^b)_+^q,
    d_s ~ N(0, σ_d^2),  s = 1, ..., K_b,    (2.4)

where the δ's are fixed effects, the d_s's are mutually independent, and {κ_s^b}_{s=1}^{K_b} are knots. The particular case of a global smoothing parameter corresponds to the case when the spline function is a constant, that is, log σ_b^2(κ_s^m) = δ_0.


Ruppert and Carroll (2000) proposed a "local penalty" model similar to (2.4). Baladandayuthapani, Mallick, and Carroll (2005) developed a Bayesian version of Ruppert and Carroll's (2000) estimator and showed that their estimator is similar to Ruppert and Carroll's in terms of mean square error and outperforms other Bayesian methods. Lang, Fronk, and Fahrmeir (2002) considered locally adaptive dynamic models and found that their method is roughly comparable to the method of Ruppert and Carroll (2000) in terms of mean square error (MSE) and coverage probability. Krivobokova, Crainiceanu, and Kauermann (2007) used a model similar to (2.4) for low-rank thin-plate splines and applied Laplace approximations instead of Bayesian MCMC.

In Lang and Brezger's (2004) prior for the b_k, the nonconstant variance is independent from knot to knot, since the variances of the b_k are assumed to be independent inverse-gammas. In contrast, our model allows dependence, so that if the variance is high at one knot then it is high at neighboring knots. Stated differently, the Lang and Brezger model is one of random heteroscedasticity, and ours of systematic heteroscedasticity. Both types of priors are sensible and will have applications, but in any specific application it is likely that one of the two will be better. For example, Lang and Brezger found that for "Doppler type" functions their estimator is not quite as good as Ruppert and Carroll's, and that the coverage of their credible intervals is not so accurate either. It is not hard to understand why: Doppler-type functions that oscillate more slowly going from left to right are consistent with a prior variance for the b_k that is monotonically decreasing, exactly the type of prior included in our model but not Lang and Brezger's.

2.3 Choice of Spline Basis, Number and Spacing of Knots, and Penalties

For concreteness, we have made some specific choices about the spline bases, number and knot spacings, and form of the penalty. In terms of model formulation, the choice of spline basis in the model is not important, since an equivalent basis gives the same model, and the basis used in computations need not be the same as the one used to express the model. However, spline bases are very different in terms of computational stability. For example, cubic thin-plate splines (Ruppert, Wand, and Carroll 2003, pp. 72-74) have better numerical properties than the truncated polynomial basis. In more than one dimension, the tensor product of truncated polynomials proved unstable and we preferred using low-rank radial smoothers (see Section 6).

The penalty we use is somewhat different from that of Eilers and Marx (1996), and also somewhat different from the penalties used by smoothing splines. We believe that the penalty parameter is the crucial choice, so we have concentrated on spatially adaptive modeling of the penalty parameter.

In this article, we use knots to model the mean, variance, and spatially adaptive smoothing parameter. The methodology described here does not depend conceptually on the number of knots. For example, one could use a knot at each observation for each of the three functions. Although we use quantile knot spacings, the best type of knot spacing is controversial. Eilers and Marx (2004) stated that "Equally spaced knots are always to be preferred." In contrast, Ruppert, Wand, and Carroll (2003) used quantile spacing in all of their examples, though they did not make any categorical statement that quantile spacing is always to be preferred. Since knot spacing is not the main focus of this article, and our results were not sensitive to it, we do not discuss this issue here.

In this article, we have used quantile knot spacing for the univariate case and a space-filling design (Nychka and Saltzman 1998) for the multivariate case, which is based on the maximal separation principle and generalizes quantile knot spacing to more than one dimension. For the multivariate case we also used equally spaced knots in simulations, to make our methods more comparable to other published methods.
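Quantile knot spacing places the K interior knots at equally spaced sample quantiles of x. One simple order-statistic version, using the k/(K + 1) quantile convention (an illustrative sketch, one common choice rather than a prescription from the article):

```python
def quantile_knots(x, K):
    """K interior knots at the k/(K+1) sample quantiles of x, k = 1..K,
    using simple order statistics (integer index arithmetic, no interpolation)."""
    xs = sorted(x)
    n = len(xs)
    return [xs[min(n - 1, (k * n) // (K + 1))] for k in range(1, K + 1)]
```

With clustered data, quantile spacing automatically places more knots where observations are dense; equally spaced knots ignore the design density.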

3. PRIOR SPECIFICATION

Any smoother depends heavily on the choice of smoothing parameter, and for penalized splines the smoothing parameter is the ratio of the error variance to the prior variance on the mean (Ruppert, Wand, and Carroll 2003). The smoothness of the fit depends on how these variances are estimated. For example, Crainiceanu and Ruppert (2004) showed that in finite samples the (RE)ML estimator of the smoothing parameter is biased towards oversmoothing, and Kauermann (2002) obtained corresponding asymptotic results for smoothing splines.

In Bayesian models, the estimates of the variance components are known to be sensitive to the prior specification; for example, see Gelman (2006). To study the effect of this sensitivity upon Bayesian penalized splines, consider model (2.1) with one smoothing parameter and homoscedastic errors, so that σ_b^2 and σ_ε^2 are constant. In terms of the precision parameters τ_b = 1/σ_b^2 and τ_ε = 1/σ_ε^2, the smoothing parameter is λ = τ_ε/τ_b = σ_b^2/σ_ε^2, and a small (large) λ corresponds to oversmoothing (undersmoothing).

3.1 Priors on the Fixed Effects Parameters

It is standard to assume that the fixed effects parameters β_i are a priori independent, with prior distributions either [β_i] ∝ 1 or β_i ~ N(0, σ_β^2), where σ_β^2 is very large. In our applications we used σ_β^2 = 10^6, which we recommend if x and y have been standardized, or at least have standard deviations with order of magnitude one.

For the fixed effects γ and δ used in the log-spline models (2.3) and (2.4), we also used independent N(0, 10^6) priors. When this prior is not consistent with the true value of the parameter, a possible strategy is to fit the model using a given set of priors and obtain the estimators γ̂, σ̂_γ, δ̂, and σ̂_δ. We could then use independent priors N(γ̂, 10^6 σ̂_γ^2) and N(δ̂, 10^6 σ̂_δ^2) for γ and δ, respectively.

3.2 Priors on the Precision Parameters

As just mentioned, the priors for the precisions τ_b and τ_ε are crucial. We now show how critically the choice of prior for τ_b may depend upon the scaling of the variables. The gamma family of priors for the precisions is conjugate. If [τ_b] ~ Gamma(A_b, B_b) and, independently of τ_b, [τ_ε] ~ Gamma(A_ε, B_ε), where Gamma(A, B) has mean A/B and variance A/B^2, then

    [τ_b | Y, β, b, τ_ε] ~ Gamma(A_b + K_m/2, B_b + ||b||^2/2)    (3.1)

and

    [τ_ε | Y, β, b, τ_b] ~ Gamma(A_ε + n/2, B_ε + ||Y − Xβ − Zb||^2/2).

Also,

    E(τ_b | Y, β, b, τ_ε) = (A_b + K_m/2) / (B_b + ||b||^2/2),
    var(τ_b | Y, β, b, τ_ε) = (A_b + K_m/2) / (B_b + ||b||^2/2)^2,

and similarly for τ_ε.

The prior does not influence the posterior distribution of τ_b when A_b and B_b are small compared to K_m/2 and ||b||^2/2, respectively. Since the number of knots is K_m ≥ 1, and in most problems considered K_m ≥ 5, it is safe to choose A_b ≤ 0.01. When B_b ≪ ||b||^2/2, the posterior distribution is practically unaffected by the prior assumptions. When B_b increases compared to ||b||^2/2, the conditional distribution is increasingly affected by the prior assumptions. E(τ_b | Y, β, b, τ_ε) is decreasing in B_b, indicating that large B_b compared to ||b||^2/2 corresponds to undersmoothing. Since the posterior variance of τ_b is also decreasing in B_b, a poor choice of B_b will likely result in underestimating the variability of the smoothing parameter λ = τ_ε/τ_b, causing confidence intervals for the mean function to be too narrow. The condition B_b ≪ ||b||^2/2 shows that the "noninformativeness" of the gamma prior depends essentially on the scale of the problem.

To show the potential severity of these effects, we consider the model (2.1) with a global smoothing parameter and homoscedastic errors. We used quadratic splines with 30 knots and the light detection and ranging (LIDAR) data for illustration. The LIDAR data were obtained from atmospheric monitoring of pollutants and consist of the range, the distance traveled before light is reflected back to a source, and the log ratio, the logarithm of the ratio of received light from two laser sources. For more details see Ruppert, Wand, Holst, and Hössjer (1997).

Figure 1 shows the effect of four mean-one Gamma priors for the precision of the truncated polynomial parameters. The variances of these priors are 10, 10^3, 10^6, and 10^10, respectively. Obviously, the first two inferences provide severely undersmoothed, almost indistinguishable posterior means. The third graph is much smoother, but still exhibits roughness, especially in the right-hand side of the plot, while the fourth graph displays a pleasing smooth pattern, consistent with our frequentist inference. Using either prior distribution, one obtains that the posterior mean of ||b||^2/2 is of order 10^-6 to 10^-5. This explains why values of B_b larger than 10^-6 proved inappropriate for this problem.

Figure 1. LIDAR data with unstandardized covariate: effect of four mean-one Gamma priors for the precision of the truncated polynomial parameters. The variances of these priors are 10, 10^3, 10^6, and 10^10, respectively.

The size of ||b||^2/2 depends upon the scaling of the x and y variables; in the case of the LIDAR data, ||b||^2/2 is small because the standard deviation of x is much larger than the standard deviation of y. If y is rescaled to a_y y and x to a_x x, then the regression function becomes a_y m(a_x x), whose pth derivative is a_y a_x^p m^(p)(a_x x), so that ||b||^2/2 is rescaled by the factor a_y^2 a_x^{2p}. Thus ||b||^2/2 is particularly sensitive to the scaling of x.

The size of ||b||^2/2 also depends on K_m. Because the integral of m^(p) over the range of x will be approximately Σ_{k=1}^{K_m} b_k ≈ √(K_m) σ_b, we can expect that σ_b^2 ∝ (K_m)^{-1}, and the smoothing parameter should be proportional to K_m. For the LIDAR data, the GCV-chosen smoothing parameter is 0.0095, 0.0205, 0.0440, and 0.0831 for K_m equal to 15, 30, 60, and 120, respectively, so, as expected, the smoothing parameter approximately doubles as K_m doubles.

Practical experience with LMMs for longitudinal or clustered data should be applied with caution to penalized splines. In a mixed effects model, ||b||^2 is an estimator of K_m σ_b^2. For longitudinal data, K_m σ_b^2 would generally be large, because K_m is the number of subjects with constant subject-effect variance σ_b^2. As just discussed, for a penalized spline, K_m σ_b^2 should be nearly independent of K_m and could be quite small.

Figure 2 presents the same type of results as Figure 1, for Gamma priors for the precision parameter τ_b with the mean held fixed at 10^-6 and variances equal to 10, 10^3, 10^6, and 10^10, respectively. These prior distributions have a much smaller effect on the posterior mean of the regression function. The fit seems to be undersmoothed when the variance is 10. Clearly, when the variance increases the fit becomes smooth, indicating that a value of the variance larger than 10^3 will produce a reasonable fit.

A similar discussion holds true for τ_ε, but now large B_ε corresponds to oversmoothing, and τ_ε does not depend on the scaling of x. In applications, it is less likely that B_ε is comparable in size to ||Y − Xβ − Zb||^2, because the latter is an estimator of nσ_ε^2. If σ̂_ε^2 is an estimator of σ_ε^2, a good rule of thumb is to use values of B_ε smaller than nσ̂_ε^2/100. This rule should work well when σ̂_ε^2 does not have an extremely large variance.

Figure 2. LIDAR data with unstandardized covariate: effect of four Gamma priors with mean 10^-6 for the precision of the truncated polynomial parameters. The variances of these priors are 10, 10^3, 10^6, and 10^10, respectively.

Alternatives to gamma priors are discussed by, for example, Natarajan and Kass (2000) and Gelman (2006). These have the advantage of requiring less care in the choice of the hyperparameters. However, we find that, with reasonable care, the conjugate gamma priors can be used in practice. Nonetheless, exploration of other prior families for penalized splines would be well worthwhile, though beyond the scope of this article.

4. SIMULTANEOUS CREDIBLE BOUNDS

Simultaneous credible intervals have been discussed before in the literature; for example, Besag, Green, Higdon, and Mengersen (1995). The idea can be used to produce simultaneous credible bounds for functions estimated smoothly. To show this, let f(·) be either m(·), log σ_ε^2(·), log σ_b^2(·), or a derivative of order q, 1 ≤ q ≤ p, of one of these functions. It is straightforward to use MCMC output to construct simultaneous credible bounds on f over an arbitrary finite interval [x_1, x_N]. Typically x_1 and x_N would be the smallest and largest observed values of x.

Let x_1 < x_2 < ... < x_N be a fine grid of points on this interval. Let Ê{f(x_i)} and ŜD{f(x_i)} be the posterior mean and standard deviation of f(x_i) estimated from an MCMC sample. Let M_α be the (1 − α) sample quantile of

    max_{1≤i≤N} |[f(x_i) − Ê{f(x_i)}] / ŜD{f(x_i)}|

computed from the realizations of f in the MCMC sample. Then I(x_i) = Ê{f(x_i)} ± M_α ŜD{f(x_i)}, i = 1, ..., N, are simultaneous credible intervals. For N large, the upper and lower limits of these intervals can be connected to form simultaneous credible bands. In the remainder of this article we only report simultaneous credible bounds for the functions and their derivatives. In the examples considered in the following, these bounds tend to be roughly 30-50% wider than pointwise credible bounds.
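The construction above is easy to implement from a matrix of MCMC draws of f on the grid. A sketch (illustrative Python with a hypothetical function name; assumes the posterior SD is nonzero at every grid point):

```python
def simultaneous_bands(samples, alpha=0.05):
    """Simultaneous credible bands of Section 4 from MCMC output.
    `samples`: list of MCMC draws, each a list of f evaluated on the grid x_1..x_N."""
    R, N = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / R for i in range(N)]
    sd = [(sum((s[i] - mean[i]) ** 2 for s in samples) / (R - 1)) ** 0.5
          for i in range(N)]
    # per-draw maximum absolute standardized deviation, then its (1-alpha) quantile
    maxdev = sorted(max(abs(s[i] - mean[i]) / sd[i] for i in range(N)) for s in samples)
    m_alpha = maxdev[min(R - 1, int((1 - alpha) * R))]
    lower = [mean[i] - m_alpha * sd[i] for i in range(N)]
    upper = [mean[i] + m_alpha * sd[i] for i in range(N)]
    return lower, upper
```

Replacing the per-draw maximum with the pointwise |deviation| quantile at each x_i recovers ordinary pointwise intervals, which is why the simultaneous bands are uniformly wider.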

5. COMPARISON WITH OTHER UNIVARIATE SMOOTHERS

Baladandayuthapani et al. (2005) presented a comprehensive simulation study of their Bayesian spatially adaptive penalized spline model, which is obtained as a particular case of our full model with constant error variance. The results of this study indicate that the method is comparable to or better than several other spatially adaptive Bayesian methods on a variety of datasets. Results also showed that the method produced practically indistinguishable inferences from those obtained using the fast frequentist method of Ruppert and Carroll (2000). Since we are unaware of any other estimation methodology that jointly estimates the nonparametric adaptive mean and nonparametric variance models, we will compare our methodology with that of Ruppert and Carroll (2000). We will compare the frequentist properties of our Bayesian methodology, such as MSE and coverage probabilities, calculated by simulating from existing benchmark models.

Consider the regression model

    y_i = m(x_i) + ε_i,

where the ε_i ~ N(0, σ_ε^2) are independent errors. The first function considered was the Doppler function described by Donoho and Johnstone (1994),

    m(x) = √{x(1 − x)} sin{2π(1 + 0.05)/(x + 0.05)}.

For the Doppler function we considered n = 2048 equally spaced x's in [0, 1] and σ_ε^2 = 1.

Ruppert and Carroll (2000) proposed the following family of models, related to the Doppler but allowing for more shape flexibility:

    m(x; j) = √{x(1 − x)} sin{2π(1 + 2^{(9−4j)/5})/(x + 2^{(9−4j)/5})},

where j = 3 and j = 6 correspond to low and severe spatial heterogeneity, respectively. For this function we considered n = 400 equally spaced x's on [0, 1] and σ_ε^2(x) = 0.04. We will refer to this as the RC_j function, where j controls the amount of spatial heterogeneity.

Ruppert and Carroll (2000) and Baladandayuthapani et al. (2005) also considered the following function:

    m(x) = exp{−400(x − 0.6)^2} + (5/3) exp{−500(x − 0.75)^2} + 2 exp{−500(x − 0.9)^2},    (5.1)


which is roughly constant on [0, 0.5] and has three sharp peaks at 0.6, 0.75, and 0.9. For this function we considered n = 1000 equally spaced x's on [0, 1] and two different scenarios for the standard deviation of the error process: (a) σ_ε(x) = 0.5 for homoscedastic errors, and (b) σ_ε(x_i) = 0.5 − 0.8x_i + 1.6(x_i − 0.5)_+. We will refer to this as the "three-bump" function.

Table 1. Average mean square error (AMSE) for different mean functions, based on 100 simulations. The column "True variance" indicates whether the data-generating process is homoscedastic or heteroscedastic. The columns "Fitted error variance" refer to the type of assumption used for fitting the data: "Homoscedastic" uses the Ruppert and Carroll (2000) fitting algorithm, while "Heteroscedastic" uses the Bayesian inferential algorithm described in this article. The mean and log-variance functions were modeled using degree 2 splines with K_m = K_ε = 40 equally spaced knots. The shrinkage process was modeled using degree 1 splines with K_b = 4 subknots.

                                   Fitted error variance
    Function    True variance    Homoscedastic    Heteroscedastic
    Doppler     Constant         0.197            0.218
    RC_{j=3}    Constant         0.011            0.012
    RC_{j=6}    Constant         0.061            0.065
    3 Bumps     Constant         0.054            0.055
    3 Bumps     Variable         0.031            0.026
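Purely for reference, the benchmark mean functions of this section and the heteroscedastic scenario (b) transcribe directly into code. This Python sketch is ours, not part of the article's software:

```python
import math

def doppler(x):
    """Doppler function of Donoho and Johnstone (1994)."""
    return math.sqrt(x * (1 - x)) * math.sin(2 * math.pi * 1.05 / (x + 0.05))

def rc(x, j):
    """Ruppert and Carroll (2000) family; j = 3 low, j = 6 severe heterogeneity."""
    e = 2 ** ((9 - 4 * j) / 5)
    return math.sqrt(x * (1 - x)) * math.sin(2 * math.pi * (1 + e) / (x + e))

def three_bump(x):
    """Three-bump function (5.1): roughly constant on [0, 0.5], peaks at 0.6, 0.75, 0.9."""
    return (math.exp(-400 * (x - 0.6) ** 2)
            + (5 / 3) * math.exp(-500 * (x - 0.75) ** 2)
            + 2 * math.exp(-500 * (x - 0.9) ** 2))

def sd_heterosc(x):
    """Scenario (b): sigma_eps(x) = 0.5 - 0.8x + 1.6(x - 0.5)_+."""
    return 0.5 - 0.8 * x + 1.6 * max(x - 0.5, 0.0)
```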

We used 100 simulations from these models and fit a spatially adaptive penalized spline that uses a constant variance. For the three-bump function we also fitted a model with nonconstant variance, modeled as a log-penalized spline. For the mean and log-variance functions we used quadratic penalized splines with K_m = K_ε = 40 knots. For the log-variance corresponding to the shrinkage process we used K_b = 4 knots.

We calculated the mean square error for each simulated dataset as

    MSE = Σ_{i=1}^{n} {m̂(x_i) − m(x_i)}^2 / n,

where m̂(x) is the posterior mean of m(x) using a given model. Table 1 provides a summary of the average MSE (AMSE) results over 100 simulations, comparing the method that allows for heteroscedastic errors with the one that does not. Remarkably, our method that estimates a nonconstant variance performed well compared to methods that assume a constant variance when the variance is constant. In contrast, when the true variance process is not constant, our method performed better.

Our method was slightly outperformed by the Ruppert and Carroll (2000) method in the Doppler case. This is probably due to the fact that the Ruppert and Carroll method assumes a constant variance for the Doppler function, when the variance used for simulations was σ_ε^2 = 1. However, both methods outperformed the method of Baladandayuthapani et al. (2005), who reported an AMSE of 0.024 for the Doppler function.

For the three-bump function with constant variance σ_ε(x) = 0.5, the two models produced practically indistinguishable MSEs. Note that our AMSE over all x's using 100 simulations was AMSE = 0.0054, which is smaller than the 0.0061 reported by Baladandayuthapani et al. (2005). The coverage probabilities of the 95% credible intervals were very similar for the two models for each value of the covariate, with a slight advantage in favor of the model using homoscedastic errors. The average coverage probability was 94.7% for the model with homoscedastic errors and 93.5% for the model with heteroscedastic errors. These coverage probabilities are very similar to the ones reported by Baladandayuthapani et al. (2005) in their Figure 3.

Figure 3. Comparison of pointwise coverage probabilities of the 95% credible intervals in 500 simulations. Data were simulated from the model described in Section 5 with heteroscedastic error variance. Estimation was done using two methods: Bayesian adaptive penalized splines with homoscedastic errors (dashed) and Bayesian adaptive penalized splines with heteroscedastic errors (solid). The coverage probabilities have been smoothed to remove Monte Carlo variability.

For case (b), σ_ε(x_i) = 0.5 − 0.8x_i + 1.6(x_i − 0.5)_+, the heteroscedastic model substantially outperformed the homoscedastic model, both in terms of AMSE and coverage probabilities. In this situation it would be misleading to compare only the average coverage probability for the 95% credible intervals. Indeed, these averages are very close, 94.1% for the heteroscedastic and 93.0% for the homoscedastic method, but they are obtained from very different sets of pointwise coverage probabilities. Figure 3 displays the coverage probabilities for these two methods. Note that for the heteroscedastic method the coverage probability in the [0, 0.5] interval is close to 95%. In the same interval, the coverage probability for the homoscedastic method starts from around 0.8 and increases until it crosses the 95% target around 0.2. Moreover, in the interval [0.3, 0.5] this coverage probability is estimated to be 1. This is due to the fact that on this interval σ_ε(x) decreases linearly from 0.5 to 0.1. Since the homoscedastic method assumes a constant variance, the credible intervals will tend to be shorter than nominal in regions of higher variability (x close to zero), thus producing lower coverage probabilities. In regions of smaller variability (x close to 0.5), the credible intervals will tend to be much wider, thus producing extremely large


coverage probabilities.

In the interval [0.55, 0.65] the homoscedastic method slightly outperforms the heteroscedastic method. This seems to be the effect of a "lucky" combination of two factors. As we discussed, the size of the credible intervals for the homoscedastic method in a neighborhood of 0.5 is much larger than nominal. However, at the same point the mean function changes from a constant to a rapidly oscillating function, and the coverage probabilities drop roughly at the same rate. In the interval [0.8, 1] one can notice a phenomenon very similar to the one described for the interval [0, 0.5]. While the function oscillates more rapidly in this region, the heteroscedastic adaptive method produces credible intervals with coverage probabilities close to the nominal 95%. However, the homoscedastic method does not take into account the increased variability and produces credible intervals that are too short.

For these examples, both methods performed roughly similarly on simulated data sets that did not require variance estimation. However, when the error variance is not constant, the method using log-penalized splines to estimate the error variance considerably outperforms the method that does not.
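The pointwise coverage probabilities compared above are simply, at each covariate value, the fraction of simulations in which the credible interval covers the true function value. A minimal Python sketch (the function and variable names are ours, not from the paper's software):

```python
def pointwise_coverage(lower, upper, truth):
    """Fraction of simulations whose credible interval [lower, upper]
    covers the true function value, at each grid point.
    lower, upper: per-simulation lists of bounds (n_sim x n_grid),
    truth: list of true function values (n_grid)."""
    n_sim = len(lower)
    cover = [0.0] * len(truth)
    for s in range(n_sim):
        for j, t in enumerate(truth):
            if lower[s][j] <= t <= upper[s][j]:
                cover[j] += 1.0
    return [c / n_sim for c in cover]

# toy check: two simulations, three grid points
lo = [[0.0, 0.0, 2.0], [0.0, 1.5, 0.0]]
hi = [[2.0, 1.0, 3.0], [2.0, 2.0, 0.5]]
truth = [1.0, 1.2, 1.0]
cov = pointwise_coverage(lo, hi, truth)
```

Averaging `cov` over the grid gives the single "average coverage probability" numbers quoted above, which is why that summary can hide very different pointwise behavior.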

6. LOW RANK MULTIVARIATE SMOOTHING

In this section we generalize the ideas in Section 2 to multivariate smoothing while preserving the appealing geometric interpretation of the smoother. Consider the regression $y_i = m(\mathbf{x}_i) + \varepsilon_i$, where $m(\cdot)$ is a smooth function of $L$ covariates. We will use radial basis functions, which have the advantage of being more numerically stable than the tensor product of truncated polynomials in more than one dimension. Suppose that $\mathbf{x}_i \in \mathbb{R}^L$, $1 \le i \le n$, are $n$ vectors of covariates and $\boldsymbol{\kappa}_k \in \mathbb{R}^L$, $1 \le k \le K_m$, are $K_m$ knots. Consider the following distance function

$$
C(\mathbf{r}) = \begin{cases} \|\mathbf{r}\|^{2M-L}, & L \text{ odd,} \\ \|\mathbf{r}\|^{2M-L}\log\|\mathbf{r}\|, & L \text{ even,} \end{cases}
$$

where $\|\cdot\|$ denotes the Euclidean norm in $\mathbb{R}^L$ and the integer $M$ controls the smoothness of $C(\cdot)$. Let $\mathbf{X}$ be the matrix with $i$th row $\mathbf{X}_i = [1, \mathbf{x}_i^T]$, let $\mathbf{Z}_{K_m} = \{C(\|\mathbf{x}_i - \boldsymbol{\kappa}_k\|)\}_{1\le i\le n,\,1\le k\le K_m}$ and $\boldsymbol{\Omega}_{K_m} = \{C(\|\boldsymbol{\kappa}_k - \boldsymbol{\kappa}_{k'}\|)\}_{1\le k\le K_m,\,1\le k'\le K_m}$, and define $\mathbf{Z} = \mathbf{Z}_{K_m}\boldsymbol{\Omega}_{K_m}^{-1/2}$, where $\boldsymbol{\Omega}_{K_m}^{1/2}$ is the principal square root of $\boldsymbol{\Omega}_{K_m}$. With these notations, the low rank approximation of thin plate spline regression can be obtained as the BLUP in the LMM (Kammann and Wand 2003; Ruppert, Wand, and Carroll 2003)

$$
\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{b} + \boldsymbol{\varepsilon}, \quad E\begin{pmatrix}\mathbf{b}\\ \boldsymbol{\varepsilon}\end{pmatrix} = \begin{pmatrix}\mathbf{0}\\ \mathbf{0}\end{pmatrix}, \quad \operatorname{cov}\begin{pmatrix}\mathbf{b}\\ \boldsymbol{\varepsilon}\end{pmatrix} = \begin{pmatrix}\sigma_b^2\mathbf{I}_{K_m} & \mathbf{0}\\ \mathbf{0} & \sigma_\varepsilon^2\mathbf{I}_n\end{pmatrix}. \tag{6.1}
$$
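Constructing the low-rank radial design matrix can be sketched in a few lines. The sketch below assumes the thin-plate setup described above; the helper names are ours, and the SVD-based matrix square root is one common numerical choice when $\boldsymbol{\Omega}_{K_m}$ is not positive definite (implementations differ on this detail):

```python
import numpy as np

def radial_C(r, L, M):
    # C(r) = ||r||^(2M-L) for L odd; ||r||^(2M-L) log||r|| for L even
    if r == 0.0:
        return 0.0
    if L % 2 == 1:
        return r ** (2 * M - L)
    return r ** (2 * M - L) * np.log(r)

def thin_plate_Z(x, knots, L=2, M=2):
    """Low-rank radial design Z = Z_K Omega_K^{-1/2}.
    x: (n, L) covariates; knots: (K, L) knot locations."""
    ZK = np.array([[radial_C(np.linalg.norm(xi - kk), L, M) for kk in knots]
                   for xi in x])
    Om = np.array([[radial_C(np.linalg.norm(k1 - k2), L, M) for k2 in knots]
                   for k1 in knots])
    # SVD-based "square root" of the (possibly indefinite) knot matrix
    U, d, Vt = np.linalg.svd(Om)
    sqrt_Om = U @ np.diag(np.sqrt(d)) @ Vt
    return ZK @ np.linalg.pinv(sqrt_Om)

x = np.array([[0.1, 0.2], [0.3, 0.7], [0.5, 0.5], [0.8, 0.1], [0.9, 0.9]])
knots = np.array([[0.25, 0.25], [0.5, 0.75], [0.75, 0.25]])
Z = thin_plate_Z(x, knots)
```

With $L = 2$ and $M = 2$ this gives the familiar bivariate thin-plate basis $C(r) = r^2\log r$; the resulting $\mathbf{Z}$ has one column per knot.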

This model contains only one variance component, $\sigma_b^2$, for controlling the shrinkage of $\mathbf{b}$, which is equivalent to one global smoothing parameter $\lambda = \sigma_b^2/\sigma_\varepsilon^2$, and implicitly assumes homoscedastic errors. To relax these assumptions, we consider a new set of knots $\boldsymbol{\kappa}_1^*, \ldots, \boldsymbol{\kappa}_{K_b}^*$ and define $\mathbf{X}^*$ and $\mathbf{Z}^*$ similarly with the corresponding definition of the matrices $\mathbf{X}$ and $\mathbf{Z}$, where the $x$-covariates are replaced by the knots $\boldsymbol{\kappa}_k$ and the knots are replaced by the subknots $\boldsymbol{\kappa}_k^*$. Consider the following model for the response:
$$
\begin{cases}
y_i = \beta_0 + \sum_{j=1}^{L}\beta_j x_{ij} + \sum_{k=1}^{K_m} b_k z_{ik} + \varepsilon_i, & \varepsilon_i \sim N\{0, \sigma_\varepsilon^2(\mathbf{x}_i)\}, \quad i = 1, \ldots, n,\\
b_k \sim N\{0, \sigma_b^2(\boldsymbol{\kappa}_k)\}, & k = 1, \ldots, K_m,
\end{cases} \tag{6.2}
$$

where the log error variance $\log\{\sigma_\varepsilon^2(\mathbf{x}_i)\}$ is modeled as a low rank log-thin plate spline
$$
\log\{\sigma_\varepsilon^2(\mathbf{x}_i)\} = \gamma_0 + \sum_{j=1}^{L}\gamma_j x_{ij} + \sum_{k=1}^{K_\varepsilon} c_k z_{ik}, \quad c_k \sim N(0, \sigma_c^2), \quad k = 1, \ldots, K_\varepsilon, \tag{6.3}
$$

and the random coefficient variance $\log\{\sigma_b^2(\boldsymbol{\kappa}_k)\}$ is modeled as another low rank log-thin plate spline
$$
\log\{\sigma_b^2(\boldsymbol{\kappa}_k)\} = \delta_0 + \sum_{j=1}^{L}\delta_j x^*_{kj} + \sum_{j=1}^{K_b} d_j z^*_{kj}, \quad d_j \sim N(0, \sigma_d^2), \quad j = 1, \ldots, K_b, \tag{6.4}
$$

where $x^*_{kj}$ and $z^*_{kj}$ are the entries of the $\mathbf{X}^*$ and $\mathbf{Z}^*$ matrices, respectively, and the $c_k$ and $d_j$ are assumed mutually independent.

In the case of low rank smoothers, the set of knots for the covariates and the subknots for modeling the shrinkage process have to be chosen. One possibility is to use equally spaced knots and subknots. We use this approach in our simulation study in Section 7 to make our method comparable, from this point of view, with previously published methods. Another possibility is to select the knots and subknots using a space filling design (Nychka and Saltzman 1998), which is based on the maximal separation principle. This avoids wasting knots and is likely to lead to better approximations in sparse regions of the data. The cover.design() function from the R package fields (Fields Development Team 2006) provides software for space filling knot selection. We use this approach in Section 8 in the Noshiro elevation example.
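The maximal separation idea can be illustrated with a simple greedy farthest-point rule: repeatedly add the candidate location farthest from the knots chosen so far. This is only a stand-in in the same spirit as a space-filling design, not the algorithm used by cover.design():

```python
import math

def farthest_point_knots(points, K):
    """Greedy maximal-separation heuristic: start from the first point,
    then repeatedly add the candidate farthest (in minimum distance)
    from the knots selected so far."""
    knots = [points[0]]
    while len(knots) < K:
        best = max((p for p in points if p not in knots),
                   key=lambda p: min(math.dist(p, k) for k in knots))
        knots.append(best)
    return knots

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
sel = farthest_point_knots(pts, 3)
```

On this toy grid the rule picks well-separated corners before interior points, which is the qualitative behavior one wants from a space-filling knot set.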

7. COMPARISONS WITH OTHER ADAPTIVE SURFACE FITTING METHODS

Lang and Brezger (2004) compared their adaptive surface smoothing method with several methods used in a simulation study by Smith and Kohn (1997): MARS (Friedman 1991), "locfit" (Cleveland and Grosse 1991), "tps" (bivariate cubic thin plate splines with a single smoothing parameter), tensor product cubic smoothing splines with five smoothing parameters, and a parametric linear interaction model. We will use the following regression functions, also used by Lang and Brezger:

• $m_1(x_1, x_2) = x_1 \sin(4\pi x_2)$, where $x_1$ and $x_2$ are distributed independently uniform on $[0, 1]$;

• $m_2(x_1, x_2) = 1.5\exp(-8x_1^2) + 3.5\exp(-8x_2^2)$, where $x_1$ and $x_2$ are distributed independently normal with mean 0.5 and variance 0.1.


Function $m_1(\cdot)$ contains complex interactions and moderate spatial variability, while $m_2(\cdot)$ has only main effects and is much smoother. We used a sample size $n = 300$, $\sigma = \mathrm{range}(f)/4$, and 250 simulations from each model.
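The simulation design for $m_1$ can be sketched as follows. The function names are ours, and we approximate $\mathrm{range}(m_1)$ by 2 (since $m_1$ spans roughly $[-1, 1]$ on the unit square), so $\sigma = 2/4 = 0.5$ under that labeled assumption:

```python
import math, random

def m1(x1, x2):
    # m1(x1, x2) = x1 sin(4 pi x2)
    return x1 * math.sin(4.0 * math.pi * x2)

def simulate(n, seed=1):
    """Draw (x1, x2) ~ Uniform[0,1]^2 and y = m1 + Gaussian noise with
    sigma = range(m1)/4, taking range(m1) to be about 2."""
    rng = random.Random(seed)
    sigma = 2.0 / 4.0
    data = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        data.append((x1, x2, m1(x1, x2) + rng.gauss(0.0, sigma)))
    return data

data = simulate(300)
```

Each of the 250 Monte Carlo replications in the study corresponds to one such draw of $n = 300$ points.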

We will compare the performance of our method with that of Lang and Brezger's, which performed at least as well as or better than other adaptive methods in a variety of examples. To avoid dependence of the results on knot choice, we used the same choice as the one described by Lang and Brezger: 12 × 12 equally spaced knots for the mean function. However, we used thin-plate splines instead of tensor products of B-splines. We modeled the smoothing parameter using a low rank log-thin plate spline with an equally spaced 5 × 5 knot grid. We considered two estimators: one with constant error variance (CEV) and one with estimated error variance using a log-thin plate spline (EEV) with the same 12 × 12 knot grid as for the mean function. The CEV estimator is the one that can directly be compared with the estimators considered by Lang and Brezger (2004). Because Lang and Brezger presented only comparative results using boxplots, we will use their results as read from their Figure 5. The performance of all estimators was measured by the log of the mean squared error, $\log(\mathrm{MSE})$, where $\mathrm{MSE}(\widehat{f}) = \frac{1}{n}\sum_{i=1}^{n}\{\widehat{f}(x_i) - f(x_i)\}^2$.

For the CEV estimator and function $f_1(\cdot)$, corresponding to moderate spatial heterogeneity, we obtained a median log(MSE) of −3.67, with an interquartile range [−3.80, −3.53] and a range [−4.21, −3.13]. These values are better than the best values reported for Lang and Brezger's methods in their Figure 5(b). Lang and Brezger reported that their method, denoted "P-spline," outperforms all other methods in this case and achieved a median log(MSE) of −3.50. Remarkably, the EEV estimator performed almost as well as the CEV estimator in this case and also outperformed the other methods. The median log(MSE) for EEV was −3.59, with an interquartile range [−3.74, −3.43] and a range [−4.14, −3.02]. These results are consistent with univariate results reported by Ruppert and Carroll (2000) and Baladandayuthapani, Mallick, and Carroll (2005), who found that allowing the smoothing parameter to vary smoothly outperforms other adaptive techniques.

For function $f_1(\cdot)$, Figure 4 displays the coverage probabilities for the 95% credible intervals calculated over a 20 × 20 equally spaced grid in $[0, 1]^2$. For each grid point we calculated the frequency with which the 95% pointwise credible interval covers the true value of the function at the grid point. As expected, these coverage probabilities show strong spatial correlation, with lower coverage probabilities along the ridges of the sinus function. Coverage probability is lowest when $x_1$ is in the [0.2, 0.5] range. The signal-to-noise ratio in this region is about half the signal-to-noise ratio in the region corresponding to $x_1$ close to 1. This explains why the coverage probability is smaller in this region. Interestingly, the zeros of the true function, appearing at $x_2 = 0.25, 0.50, 0.75$, are covered with at least 95% probability. Another interesting feature is the lower coverage probabilities near the north, south, and east boundaries. These features of the pointwise coverage probability map are consistent with the frequentist analysis of Krivobokova et al. (2007), but are not consistent with the coverage probabilities reported by Lang and Brezger for their Bayesian P-spline method.

For our CEV estimator and function $f_2(\cdot)$, corresponding to very low spatial heterogeneity, we obtained a median log(MSE) of −5.95, with an interquartile range [−6.12, −5.66] and a range [−6.51, −5.41]. The results of Lang and Brezger in their Figure 5(a) indicate that their "P-spline" method achieved a median log(MSE) of −6.00. Our method performs roughly similarly to the Bayesian P-spline method of Lang and Brezger and to the cubic thin plate spline (tps), being outperformed only by Lang and Brezger's adaptive P-spline method with two main effects, "P-spline main." We could also add two main effects to our bivariate adaptive smoother to improve MSE for this example, but this is not our concern in this article. Again, the EEV estimator performed very similarly to our CEV estimator.

Figure 4. Coverage probability of the pointwise 95% credible intervals for the mean function for function $f_1(x_1, x_2) = x_1\sin(4\pi x_2)$ with a constant error standard deviation $\sigma = \mathrm{range}(f)/4$. Probabilities are computed on a 20 × 20 equally spaced grid of points in $[0, 1]^2$. The model used was a thin plate spline with $K = 100$ knots for the mean function, a log-thin plate spline with $K = 100$ knots for the variance function, and a log-thin plate spline with $K^* = 16$ knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was $m = 2$.

The last simulation study was done using the function $f_1(\cdot, \cdot)$, where $x_1, x_2$ are independent, uniformly distributed in $[0, 1]$, with an error standard deviation function
$$
\sigma_\varepsilon(x_1, x_2) = \frac{r}{32} + \frac{3r}{32}\,x_1^2,
$$
where $r = \mathrm{range}(f_1)$. We compared our CEV and EEV estimators described at the beginning of this section and, as expected, the EEV estimator outperformed the CEV both in terms of log(MSE) and coverage probabilities. More precisely, for log(MSE) we obtained a median of −5.26 for CEV and −5.35 for EEV, an interquartile range [−5.37, −5.15] for CEV and [−5.48, −5.24] for EEV, and a range [−5.72, −4.65] for CEV and [−5.84, −4.84] for EEV. For both estimators the general patterns for coverage probabilities were the ones displayed in Figure 4. EEV outperformed CEV in terms of coverage probabilities. For example, the regions where the nominal level is exceeded for both estimators have a roughly similar shape and location, but the EEV coverage probabilities tend to be closer to their nominal level.

Figure 5. Locations of sampling (dots) with 100 knots (circles) in the Noshiro example.

Note that our model incorporating nonparametric variance estimation is more general than the models considered in the literature. However, its performance for function $m_1(\cdot, \cdot)$, with moderate spatial variability and constant variance, was better than all methods considered by Lang and Brezger in their simulation study (see their Figure 5, panel b). Our method has good performance for the function $m_2(\cdot, \cdot)$, being the second best when compared to the methods considered by Lang and Brezger (see their Figure 5, panel a). For the case when the variance is not constant, our adaptive method that estimates the variance nonparametrically outperformed our adaptive method that estimates a constant variance.

8. THE NOSHIRO EXAMPLE

Noshiro, Japan, was the site of a major earthquake. Much of the damage from the quake was due to soil movement. Professor Thomas O'Rourke, a geotechnical engineer, was investigating the factors that might help predict soil movement during a future quake. One factor thought to be of importance was the slope of the land. To estimate the slope, Ruppert (1997) used a dataset with 799 observations, where x was longitude and latitude and y was elevation at locations in Noshiro. The object of primary interest is the gradient, at least its magnitude and possibly its direction.

We used the thin-plate spline model (6.2) for the mean of y with $K_m = 100$ knots selected by the maximum separation criterion implemented in the function cover.design() of the R package fields (Nychka, Haaland, O'Connell, and Ellner 1998). The sampling locations and the knots used for fitting are displayed in Figure 5. The log-thin plate spline described in (6.3), with the same $K_\varepsilon = 100$ knots, was used to model the variance of the error process, $\sigma_\varepsilon^2(\mathbf{x}_i)$. The log-thin plate spline described in (6.4), with $K_b = 16$ subknots, was used to model the variance $\sigma_b^2(\boldsymbol{\kappa}_k)$, which controls the amount of shrinkage of the $b$ parameters. The subknots were obtained using the cover.design() function applied to the 100 knots.

Figure 6. Posterior mean of the mean regression function for the Noshiro example. The model used was a thin plate spline with $K = 100$ knots for the mean function, a log-thin plate spline with $K = 100$ knots for the variance function, and a log-thin plate spline with $K^* = 16$ knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was $m = 2$.

Figure 6 displays the posterior mean of the mean response function. There is a rather sharp peak around (0.42, 0.4), with a slow decay from the peak in the general south direction and a much faster decay in all other directions. The function becomes smoother towards the boundary. This is exactly the type of function for which adaptive spatial smoothing can substantially improve the fit.

Figure 7 shows that in large areas of the map $\sigma_\varepsilon(\cdot)$ is smooth and takes small values. However, in a neighborhood of the peak of the mean function, $\sigma_\varepsilon(\cdot)$ displays two relatively sharp maxima, located NNW and SSE of the peak. We also performed a two-stage frequentist analysis. In the first stage we used a low rank thin plate spline for the mean function, and the results from this regression were used to obtain residuals. We then fitted another low rank thin plate spline to the absolute values of these residuals. While this map was not identical to the one in Figure 7, it did exhibit the same patterns of heteroscedasticity.

The process $\log\{\sigma_b^2(\boldsymbol{\kappa}_k)\}$ controlling the shrinkage of the $b_k$ is displayed in Figure 8.


Figure 7. Posterior mean of the standard deviation $\sigma_\varepsilon(\mathbf{x}_i)$ of the error process for the Noshiro example. The model used was a thin plate spline with $K = 100$ knots for the mean function, a log-thin plate spline with $K = 100$ knots for the variance function, and a log-thin plate spline with $K^* = 16$ knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was $m = 2$.

Figure 8. Posterior mean of the shrinkage process $-\log\{\sigma_b^2(\boldsymbol{\kappa}_k)\}$. The model used was a thin plate spline with $K = 100$ knots for the mean function, a log-thin plate spline with $K = 100$ knots for the variance function, and a log-thin plate spline with $K^* = 16$ knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was $m = 2$.


Smaller values of this function correspond to less shrinkage of the $b_k$ towards zero and more local behavior of the smoother. Figure 8 indicates that in a neighborhood of the peak of the mean function the shrinkage is smaller, to allow the mean function to change rapidly. Away from the peak, the shrinkage is larger, corresponding to a smoother mean.

9. IMPLEMENTATION USING MCMC

We will now provide some details for MCMC simulations of model (2.1), where the variances $\sigma_\varepsilon^2(x_i)$ and $\sigma_b^2(\kappa_k)$ are modeled by Equations (2.3) and (2.4). The implementation for other models, for example for the multivariate smoothing in Section 6, is similar. Consider independent normal priors for the coefficients of the monomials, $\beta_i \sim N(0, \sigma_{0\beta}^2)$, $\gamma_i \sim N(0, \sigma_{0\gamma}^2)$, $i = 0, \ldots, p$, and $\delta_i \sim N(0, \sigma_{0\delta}^2)$, $i = 0, \ldots, q$, and independent inverse Gamma priors for the variance components, $\sigma_c^2 \sim \mathrm{IGamma}(a_c, b_c)$ and $\sigma_d^2 \sim \mathrm{IGamma}(a_d, b_d)$. Using these priors, many full conditionals of the posterior distribution are easy to derive, while a few have complex multivariate forms. Our implementation of the MCMC using multivariate Metropolis-Hastings steps proved to be unstable, with poor mixing properties. A simple and reliable solution was to change the model by adding error terms to the log-spline models, that is,

$$
\begin{cases}
\log\{\sigma_\varepsilon^2(x_i)\} = \gamma_0 + \cdots + \gamma_p x_i^p + \sum_{s=1}^{K_\varepsilon} c_s\big(x_i - \kappa_s^\varepsilon\big)_+^p + u_i,\\[4pt]
\log\{\sigma_b^2(\kappa_k^m)\} = \delta_0 + \cdots + \delta_q\big(\kappa_k^m\big)^q + \sum_{s=1}^{K_b} d_s\big(\kappa_k^m - \kappa_s^b\big)_+^q + v_k,
\end{cases} \tag{9.1}
$$

where $u_i \sim N(0, \sigma_u^2)$ and $v_k \sim N(0, \sigma_v^2)$. This idea was also used for $\sigma_b^2$ by Baladandayuthapani, Mallick, and Carroll (2005). We fixed the values $\sigma_u^2 = \sigma_v^2 = 0.01$, as these variances appear nonidentifiable and a standard deviation of 0.1 is small on a log-scale. This device makes computations feasible by replacing sampling from complex full conditionals with simple univariate MH steps.

Define, as before, $\boldsymbol{\varepsilon}$, $\mathbf{b}$, and $\boldsymbol{\theta}_X$, and define $\boldsymbol{\theta}_\varepsilon = (\boldsymbol{\gamma}^T, \mathbf{c}^T)^T$, $\boldsymbol{\theta}_b = (\boldsymbol{\delta}^T, \mathbf{d}^T)^T$, $\mathbf{C}_X = (\mathbf{X}\ \mathbf{Z})$, $\mathbf{C}_\varepsilon = (\mathbf{X}_\varepsilon\ \mathbf{Z}_\varepsilon)$, and $\mathbf{C}_b = (\mathbf{X}_b\ \mathbf{Z}_b)$, where $\mathbf{X}$, $\mathbf{X}_\varepsilon$, $\mathbf{X}_b$ contain the monomials and $\mathbf{Z}$, $\mathbf{Z}_\varepsilon$, $\mathbf{Z}_b$ contain the truncated polynomials of the spline models for $m$, $\log\sigma_\varepsilon^2$, and $\log\sigma_b^2$, respectively. Also denote by
$$
\boldsymbol{\Sigma}_{0X} = \begin{bmatrix} \sigma_{0\beta}^2 \mathbf{I}_{p+1} & \mathbf{0} \\ \mathbf{0} & \boldsymbol{\Sigma}_b \end{bmatrix}, \quad
\boldsymbol{\Sigma}_{0\varepsilon} = \begin{bmatrix} \sigma_{0\gamma}^2 \mathbf{I}_{p+1} & \mathbf{0} \\ \mathbf{0} & \sigma_c^2 \mathbf{I}_{K_\varepsilon} \end{bmatrix}, \quad
\boldsymbol{\Sigma}_{0b} = \begin{bmatrix} \sigma_{0\delta}^2 \mathbf{I}_{q+1} & \mathbf{0} \\ \mathbf{0} & \sigma_d^2 \mathbf{I}_{K_b} \end{bmatrix}.
$$

The full conditionals of the posterior are detailed below:

1. $[\boldsymbol{\theta}_X \mid \cdot] \sim N\big(\mathbf{M}_X \mathbf{C}_X^T \boldsymbol{\Sigma}_\varepsilon^{-1}\mathbf{Y},\ \mathbf{M}_X\big)$, where $\mathbf{M}_X = \big(\mathbf{C}_X^T \boldsymbol{\Sigma}_\varepsilon^{-1}\mathbf{C}_X + \boldsymbol{\Sigma}_{0X}^{-1}\big)^{-1}$.

2. $[\boldsymbol{\theta}_\varepsilon \mid \cdot] \sim N\big(\mathbf{M}_\varepsilon \mathbf{C}_\varepsilon^T \mathbf{Y}_\varepsilon/\sigma_u^2,\ \mathbf{M}_\varepsilon\big)$, where $\mathbf{Y}_\varepsilon = [\log(\sigma_{\varepsilon 1}^2), \ldots, \log(\sigma_{\varepsilon n}^2)]^T$ and $\mathbf{M}_\varepsilon = \big(\mathbf{C}_\varepsilon^T \mathbf{C}_\varepsilon/\sigma_u^2 + \boldsymbol{\Sigma}_{0\varepsilon}^{-1}\big)^{-1}$.
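The Gaussian conditionals above share a ridge-type form: posterior covariance $(\mathbf{C}^T\boldsymbol{\Sigma}^{-1}\mathbf{C} + \boldsymbol{\Sigma}_0^{-1})^{-1}$ and mean obtained by applying it to $\mathbf{C}^T\boldsymbol{\Sigma}^{-1}\mathbf{Y}$. A numpy sketch assuming diagonal error and prior covariances (function name ours):

```python
import numpy as np

def normal_full_conditional(C, Y, Sigma_eps_diag, Sigma0_diag):
    """Posterior N(mean, M) for theta in Y = C theta + eps,
    eps ~ N(0, diag(Sigma_eps_diag)), prior theta ~ N(0, diag(Sigma0_diag)):
        M    = (C' Sigma^-1 C + Sigma0^-1)^-1
        mean = M C' Sigma^-1 Y"""
    W = C.T / Sigma_eps_diag          # C' Sigma^-1, via column scaling
    M = np.linalg.inv(W @ C + np.diag(1.0 / Sigma0_diag))
    return M @ (W @ Y), M

# toy check: with a nearly flat prior the mean approaches the OLS fit
C = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = np.array([1.0, 2.0, 3.0])
post_mean, post_cov = normal_full_conditional(C, Y, np.ones(3), np.full(2, 1e12))
```

With unit error variances and a very diffuse prior the update reduces to ordinary least squares, which is a convenient sanity check on an implementation.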


Figure 9. Computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with $K_m$ knots, with the log of the variance of the shrinkage process modeled as another penalized spline with $K_b = 4$ subknots. The log of the error variance is modeled as another penalized spline with $K_\varepsilon = K_m$ knots. Different symbols correspond to increasing numbers of knots: dots ($K_m = 10$), asterisks ($K_m = 20$), circles ($K_m = 30$), x's ($K_m = 40$).

3. $[\boldsymbol{\theta}_b \mid \cdot] \sim N\big(\mathbf{M}_b \mathbf{C}_b^T \mathbf{Y}_b/\sigma_v^2,\ \mathbf{M}_b\big)$, where $\mathbf{Y}_b = [\log(\sigma_1^2), \ldots, \log(\sigma_{K_m}^2)]^T$ and $\mathbf{M}_b = \big(\mathbf{C}_b^T \mathbf{C}_b/\sigma_v^2 + \boldsymbol{\Sigma}_{0b}^{-1}\big)^{-1}$.

4. $[\sigma_c^2 \mid \cdot] \sim \mathrm{IGamma}\big(a_c + K_\varepsilon/2,\ b_c + \|\mathbf{c}\|^2/2\big)$.

5. $[\sigma_d^2 \mid \cdot] \sim \mathrm{IGamma}\big(a_d + K_b/2,\ b_d + \|\mathbf{d}\|^2/2\big)$.

6. $[\sigma_{\varepsilon i}^2 \mid \cdot] \propto \sigma_{\varepsilon i}^{-3}\exp\big[-(y_i - \mu_i)^2/(2\sigma_{\varepsilon i}^2) - \{\log(\sigma_{\varepsilon i}^2) - \eta_i\}^2/(2\sigma_u^2)\big]$, where $\mu_i$ and $\eta_i$ are the $i$th components of $\mathbf{C}_X\boldsymbol{\theta}_X$ and $\mathbf{C}_\varepsilon\boldsymbol{\theta}_\varepsilon$, respectively.

7. $[\sigma_k^2 \mid \cdot] \propto \sigma_k^{-3}\exp\big[-b_k^2/(2\sigma_k^2) - \{\log(\sigma_k^2) - \zeta_k\}^2/(2\sigma_v^2)\big]$, where $\zeta_k$ is the $k$th component of $\mathbf{C}_b\boldsymbol{\theta}_b$.

All the above conditionals have an explicit form, with the exception of the $n + K_m$ one-dimensional conditionals in 6 and 7. For these distributions we use the Metropolis-Hastings algorithm with a normal proposal distribution centered at the current value and with small variance.
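One such univariate step, for conditional 6, can be sketched as a random-walk Metropolis-Hastings update. This is an illustrative sketch, not the paper's code; the step size and starting value are our choices:

```python
import math, random

def log_target(s, y, mu, eta, sig_u2):
    # unnormalized log density of sigma_eps_i^2 = s (conditional 6):
    # sigma^-3 = s^(-3/2), plus the Gaussian data term and the
    # log-normal term tying s to the log-spline fit eta
    return (-1.5 * math.log(s)
            - (y - mu) ** 2 / (2.0 * s)
            - (math.log(s) - eta) ** 2 / (2.0 * sig_u2))

def mh_sigma2(y, mu, eta, sig_u2=0.01, n_iter=2000, step=0.05, seed=7):
    """Random-walk MH for one sigma_eps_i^2 with a normal proposal
    centered at the current value; proposals <= 0 are rejected."""
    rng = random.Random(seed)
    s = math.exp(eta)                  # start at the log-spline fit
    out, acc = [], 0
    for _ in range(n_iter):
        prop = s + rng.gauss(0.0, step)
        if prop > 0:
            a = log_target(prop, y, mu, eta, sig_u2) - log_target(s, y, mu, eta, sig_u2)
            if math.log(rng.random() + 1e-300) < a:
                s, acc = prop, acc + 1
        out.append(s)
    return out, acc / n_iter

samples, rate = mh_sigma2(y=1.0, mu=0.8, eta=math.log(0.04))
```

The same skeleton, with $b_k^2$ in place of $(y_i - \mu_i)^2$ and $\zeta_k$ in place of $\eta_i$, handles conditional 7.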

10. IMPLEMENTATION AND COMPUTATIONAL PERFORMANCE

Our MCMC simulation algorithm provided reliable results and is reasonably fast. To illustrate this, Figure 9 displays computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with $K_m$ knots, with the log of the variance of the shrinkage process modeled as another penalized spline with $K_b = 4$ subknots. The log of the error variance is modeled as another penalized spline with $K_\varepsilon = K_m$ knots. Different symbols correspond to increasing numbers of knots: dots ($K_m = 10$), asterisks ($K_m = 20$), circles ($K_m = 30$), x's ($K_m = 40$). Computation time increases roughly linearly with the number of observations, with the same intercept 0 and larger slopes for larger numbers of knots. Note that even with $n = 1000$ observations and $K_m = K_\varepsilon = 40$, simulation time is about two minutes.

Figure 10. MCMC histories for eight parameters of interest, including the log of the standard deviation of the error process and the log of the standard deviation of the shrinkage process, as well as some parameters modeling the different functions. The model uses 40 equally spaced knots for the mean and error variance functions and 4 subknots for the adaptive shrinkage process. Chains correspond to the Doppler function example, using quadratic truncated polynomials to fit the mean and log variance functions and a linear truncated polynomial basis for fitting the shrinkage process. For clarity of presentation, only every 20th simulated observation is displayed.

For the models and datasets presented in this article, 10,000 simulations proved sufficient for accurate and reliable inference. Some parameters, such as the mean and variance functions, tended to exhibit much better mixing properties than the variance components $\sigma_c^2$ and $\sigma_d^2$ controlling the shrinkage processes. This is probably due to the reduced amount of information in the data about these parameters. The MCMC convergence and mixing properties were assessed by visual inspection of the chain histories of many parameters of interest. Since we started with good starting values of the parameters, convergence is fast. Figure 10 displays the histories of eight parameters of interest for the model fitted to the Doppler function described in Section 5. For clarity we display only every 100th sampled value of the chain, but the same patterns would be observed in the full chain history. Similar good mixing properties have been noted in all the other examples presented in this article.

In the search for a good software platform, we implemented our algorithms in MATLAB, C, and FORTRAN. Programs can be downloaded from www.biostat.jhsph.edu/~ccrainic. For our application and implementation there were no serious differences between C and FORTRAN, which performed better than MATLAB by reducing computation time two to three times. The reported times in Figure 9 are for our implementation in C.

Given the complexity of the model, we were interested in improving the mixing properties of the Markov chains. The computational trick described in Section 9 was the sine qua non method that allowed us to replace an intractable multivariate MH step with several tractable univariate MH steps. We used normal random walk proposal distributions centered at the current value, with variance tuned to provide acceptance rates between 30% and 50%.
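The paper does not spell out its tuning procedure, but one simple pilot-tuning rule that targets an acceptance band like the one above is to shrink the proposal step when acceptance is too low and grow it when too high (a heuristic sketch; names are ours):

```python
def tune_step(run_chain, step=0.05, target=(0.30, 0.50), max_rounds=20):
    """Pilot tuning of a random-walk proposal step size.
    run_chain(step) must run a short pilot chain and return its
    acceptance rate; the step is adjusted until the rate falls
    inside the target band or max_rounds is exhausted."""
    for _ in range(max_rounds):
        rate = run_chain(step)
        if rate < target[0]:
            step *= 0.7        # too few acceptances: smaller moves
        elif rate > target[1]:
            step /= 0.7        # too many acceptances: bolder moves
        else:
            break
    return step

# toy acceptance model: rate falls as the step grows
tuned = tune_step(lambda st: 1.0 / (1.0 + 10.0 * st))
```

In practice `run_chain` would wrap a short pilot run of the univariate MH updates of Section 9, one tuned step size per conditional.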

11. COMMENTS

Our joint models for heterogeneous mean and nonconstant variance are designed to extract the maximum amount of information from complex data and suggest simpler parametric models. Bayesian inference based on MCMC offers a natural and powerful framework for model fitting. However, MCMC in this context is not straightforward, and implementation depends essentially on the ability to transform intractable simulations from high dimensional distributions into tractable one dimensional updates, as described in Section 9. Our implementation provides accurate, fast, and reproducible results, and software is available. Actual simulation time increases only linearly with the number of observations, and mixing properties are adequate provided that proposal distribution variances are tuned for optimal acceptance rates.

ACKNOWLEDGMENTS

Carroll's research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

[Received September 2004. Revised October 2006.]

REFERENCES

Baladandayuthapani, V., Mallick, B. K., and Carroll, R. J. (2005), "Spatially Adaptive Bayesian Regression Splines," Journal of Computational and Graphical Statistics, 14, 378–394.

Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computation and Stochastic Systems" (with discussion), Statistical Science, 10, 3–66.

Carroll, R. J. (2003), "Variances Are Not Always Nuisance Parameters: The 2002 R. A. Fisher Lecture," Biometrics, 59, 211–220.

Carroll, R. J., and Ruppert, D. (1988), Transformation and Weighting in Regression, New York: Chapman and Hall.

Cleveland, W. S., and Grosse, E. (1991), "Computational Methods for Local Regression," Statistics and Computing, 1, 47–62.

Crainiceanu, C., and Ruppert, D. (2004), "Likelihood Ratio Tests in Linear Mixed Models With One Variance Component," Journal of the Royal Statistical Society, Series B, 66, 165–185.

Crainiceanu, C. M., Ruppert, D., and Wand, M. P. (2005), "Bayesian Analysis for Penalized Spline Regression Using WinBUGS," Journal of Statistical Software, 14. Available online at http://www.jstatsoft.org/v14/i14.

Donoho, D. L., and Johnstone, I. M. (1994), "Ideal Spatial Adaptation by Wavelet Shrinkage," Biometrika, 81, 425–455.

Eilers, P. H. C., and Marx, B. D. (1996), "Flexible Smoothing With B-Splines and Penalties" (with discussion), Statistical Science, 11, 89–121.

(2002), "Generalized Linear Additive Smooth Structures," Journal of Computational and Graphical Statistics, 11, 758–783.

(2004), "Splines, Knots, and Penalties," unpublished manuscript.

Fields Development Team (2006), fields: Tools for Spatial Data, National Center for Atmospheric Research, Boulder, CO. Available online at http://www.cgd.ucar.edu/Software/Fields.

Friedman, J. H. (1991), "Multivariate Adaptive Regression Splines" (with discussion), The Annals of Statistics, 19, 1–141.

Gelman, A. (2006), "Prior Distributions for Variance Parameters in Hierarchical Models," Bayesian Analysis, 1, 515–533.

Gray, R. J. (1994), "Spline-Based Tests in Survival Analysis," Biometrics, 50, 640–652.

Holst, U., Hössjer, O., Björklund, C., Ragnarson, P., and Edner, H. (1996), "Locally Weighted Least Squares Kernel Regression and Statistical Evaluation of LIDAR Measurements," Environmetrics, 7, 401–416.

Jullion, A., and Lambert, P. (2006), Discussion Paper DP0534, UCL.

Kammann, E. E., and Wand, M. P. (2003), "Geoadditive Models," Applied Statistics, 52, 1–18.

Kauermann, G. (2002), "A Note on Bandwidth Selection for Penalised Spline Smoothing," Technical Report 02-13, Department of Statistics, University of Glasgow.

Krivobokova, T., Crainiceanu, C. M., and Kauermann, G. (2007), "Fast Adaptive Penalized Splines," Journal of Computational and Graphical Statistics, to appear.

Lang, S., and Brezger, A. (2004), "Bayesian P-Splines," Journal of Computational and Graphical Statistics, 13, 183–212.

Lang, S., Fronk, E. M., and Fahrmeir, L. (2002), "Function Estimation with Locally Adaptive Dynamic Models," Computational Statistics, 17, 479–499.

Marx, B. D., and Eilers, P. H. C. (1998), "Direct Generalized Additive Modeling With Penalized Likelihood," Computational Statistics & Data Analysis, 28, 193–209.

Natarajan, R., and Kass, R. E. (2000), "Reference Bayesian Methods for Generalized Linear Mixed Models," Journal of the American Statistical Association, 95, 227–237.

Ngo, L., and Wand, M. P. (2004), "Smoothing With Mixed Model Software," Journal of Statistical Software, 9. Available online at http://www.jstatsoft.org/v09/i01.

Nychka, D., Haaland, P., O'Connell, M., and Ellner, S. (1998), "FUNFITS: Data Analysis and Statistical Tools for Estimating Functions," in Case Studies in Environmental Statistics, Lecture Notes in Statistics, vol. 132, eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 159–179. Available online at http://www.cgd.ucar.edu/stats/Software/Funfits.

Nychka, D., and Saltzman, N. (1998), "Design of Air Quality Monitoring Networks," in Case Studies in Environmental Statistics, Lecture Notes in Statistics, vol. 132, eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 51–76.

Pintore, A., Speckman, P., and Holmes, C. C. (2006), "Spatially Adaptive Smoothing Splines," Biometrika, 93, 113–125.

Ruppert, D. (1997), "Local Polynomial Regression and its Applications in Environmental Statistics," in Statistics for the Environment (vol. 3), eds. V. Barnett and F. Turkman, Chichester: Wiley, pp. 155–173.

(2002), "Selecting the Number of Knots for Penalized Splines," Journal of Computational and Graphical Statistics, 11, 735–757.

Ruppert, D., and Carroll, R. J. (2000), "Spatially-Adaptive Penalties for Spline Fitting," Australian and New Zealand Journal of Statistics, 42, 205–223.

Ruppert, D., Wand, M. P., Holst, U., and Hössjer, O. (1997), Technometrics, 39, 262–273.

Ruppert, D., Wand, M. P., and Carroll, R. J. (2003), Semiparametric Regression, Cambridge: Cambridge University Press.

Smith, M., and Kohn, R. (1997), "A Bayesian Approach to Nonparametric Bivariate Regression," Journal of the American Statistical Association, 92, 1522–1535.

Wood, S. (2006), Generalized Additive Models: An Introduction with R, New York: Chapman & Hall/CRC.



and Wand 2004; Crainiceanu, Ruppert, and Wand 2005; Wood 2006). Bayesian penalized splines (Ruppert, Wand, and Carroll 2003; Lang and Brezger 2004; Crainiceanu, Ruppert, and Wand 2005) use a stochastic process model as a prior for the regression function. It is typical to assume that both this process and the errors are homoscedastic.

The penalized spline methodology has been extended to heteroscedastic errors using an iterative frequentist approach (Ruppert, Wand, and Carroll 2003). We develop the corresponding Bayesian inferential methodology using a simple, previously unknown solution. Spatially adaptive penalty parameters have also been used before (Ruppert and Carroll 2000; Eilers and Marx 2002; Baladandayuthapani, Mallick, and Carroll 2005; Lang and Brezger 2004; Jullion and Lambert 2006; Pintore, Speckman, and Holmes 2006). However, this is the first article to combine these features. We show that this combination is important and that spatially varying penalties can adapt to spatial heterogeneity of both the mean function and the error variance. Implementation of this extension was not straightforward, but we developed and implemented an algorithm that works well in practice.

Our methodology can be extended to almost any penalized spline model, for example additive, varying coefficient, interaction, and multivariate smoothing models. To illustrate this, we also study multivariate low-rank thin-plate splines. As in the univariate case, we model the mean, the variance, and the smoothing functions nonparametrically and estimate them from the data using a fully Bayesian approach. The computational advantage of low rank over full rank smoothers becomes even greater in more than one dimension.

Section 2 provides a quick introduction to penalized splines. Section 3 discusses the choice of priors for penalized spline regression. Section 4 describes how to obtain simultaneous credible bounds for mean and variance functions and their derivatives. Section 5 compares our proposed methodology to the one proposed by Baladandayuthapani et al. (2005) for adaptive univariate penalized spline regression. Section 6 describes the extension to multivariate smoothing, and Section 7 compares our multivariate methodology with other adaptive surface fitting methods. Section 8 discusses an example of bivariate smoothing with heteroscedastic errors and spatially adaptive smoothing. Section 9 presents the full conditional distributions and discusses our implementation of the models in MATLAB, FORTRAN, and C. Section 11 provides our conclusions.

2. PENALIZED REGRESSION SPLINES

Consider the regression equation y_i = m(x_i) + ε_i, i = 1, ..., n, where the ε_i are independent and the mean is modeled as

\[
m(x) = m(x; \theta_X) = \beta_0 + \beta_1 x + \cdots + \beta_p x^p + \sum_{k=1}^{K_m} b_k (x - \kappa_k^m)_+^p,
\]

where β = (β_0, ..., β_p)^T, b = (b_1, ..., b_{K_m})^T, θ_X = (β^T, b^T)^T, κ_1^m < κ_2^m < ⋯ < κ_{K_m}^m are fixed knots, and a_+^p denotes {max(a, 0)}^p. Following Gray (1994) and Ruppert (2002), we take K_m large enough (e.g., 20) to ensure the desired flexibility. To avoid overfitting, the b_k ∼ N{0, σ²_b(κ_k^m)} and ε_i ∼ N{0, σ²_ε(x_i)} are shrunk towards zero by an amount controlled by σ²_b(·) and σ²_ε(·), which vary across the range of x.


The penalized spline model can be written as

\[
y_i = \beta_0 + \beta_1 x_i + \cdots + \beta_p x_i^p + \sum_{k=1}^{K_m} b_k (x_i - \kappa_k^m)_+^p + \varepsilon_i,
\qquad
\begin{aligned}
&\varepsilon_i \sim N\{0, \sigma_\varepsilon^2(x_i)\}, \quad i = 1, \ldots, n,\\
&b_k \sim N\{0, \sigma_b^2(\kappa_k^m)\}, \quad k = 1, \ldots, K_m,
\end{aligned}
\tag{2.1}
\]

where the ε_i and b_k are mutually independent given σ²_ε(·) and σ²_b(·). In Sections 2.1 and 2.2, σ²_ε(x_i) and σ²_b(κ_k^m) are modeled using log-spline models. Denote by X the n × (p + 1) matrix with ith row X_i = (1, x_i, ..., x_i^p), by Z the n × K_m matrix with ith row Z_i = {(x_i − κ_1^m)_+^p, ..., (x_i − κ_{K_m}^m)_+^p}, and let Y = (y_1, ..., y_n)^T and ε = (ε_1, ..., ε_n)^T. Model (2.1) can be rewritten in matrix form as

\[
Y = X\beta + Zb + \varepsilon,
\qquad
E\begin{pmatrix} b \\ \varepsilon \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix},
\qquad
\operatorname{cov}\begin{pmatrix} b \\ \varepsilon \end{pmatrix} = \begin{pmatrix} \Sigma_b & 0 \\ 0 & \Sigma_\varepsilon \end{pmatrix},
\tag{2.2}
\]

where the joint conditional distribution of b and ε given Σ_b and Σ_ε is assumed normal. The β parameters are treated as fixed effects. The covariance matrices Σ_b and Σ_ε are diagonal, with the vectors {σ²_b(κ_1^m), ..., σ²_b(κ_{K_m}^m)} and {σ²_ε(x_1), ..., σ²_ε(x_n)} as main diagonals, respectively. For this model, E(Y) = Xβ and cov(Y) = Σ_ε + Z Σ_b Z^T.

2.1 Error Variance Function Estimation

Ignoring heteroscedasticity may lead to incorrect inferences and inefficient estimation, especially when the response standard deviation varies over several orders of magnitude. Moreover, understanding how variability changes with the predictor may be of intrinsic interest (Carroll 2003). Transformations can stabilize the response variance when the conditional response variance is a function of the conditional expectation (Carroll and Ruppert 1988), but in other cases one needs to estimate it by modeling the variance as a function of the predictors.

We model the variance function as a loglinear model,

\[
\log \sigma_\varepsilon^2(x_i) = \gamma_0 + \gamma_1 x_i + \cdots + \gamma_p x_i^p + \sum_{s=1}^{K_\varepsilon} c_s (x_i - \kappa_s^\varepsilon)_+^p,
\qquad
c_s \sim N(0, \sigma_c^2), \quad s = 1, \ldots, K_\varepsilon,
\tag{2.3}
\]

where the γ parameters are fixed effects and κ_1^ε < ⋯ < κ_{K_ε}^ε are knots. The normal distribution of the c_s parameters shrinks them towards 0 and ensures stable estimation.

2.2 Smoothly Varying Local Penalties

Following Baladandayuthapani et al. (2005), we model σ²_b(·) as

\[
\log \sigma_b^2(x) = \delta_0 + \delta_1 x + \cdots + \delta_q x^q + \sum_{s=1}^{K_b} d_s (x - \kappa_s^b)_+^q,
\qquad
d_s \sim N(0, \sigma_d^2), \quad s = 1, \ldots, K_b,
\tag{2.4}
\]

where the δ's are fixed effects, the d_s's are mutually independent, and {κ_s^b}_{s=1}^{K_b} are knots. The particular case of a global smoothing parameter corresponds to the case when the spline function is a constant, that is, log σ²_b(κ_s^m) = δ_0.


Ruppert and Carroll (2000) proposed a "local penalty" model similar to (2.4). Baladandayuthapani, Mallick, and Carroll (2005) developed a Bayesian version of Ruppert and Carroll's (2000) estimator and showed that their estimator is similar to Ruppert and Carroll's in terms of mean square error and outperforms other Bayesian methods. Lang, Fronk, and Fahrmeir (2002) considered locally adaptive dynamic models and found that their method is roughly comparable to the method of Ruppert and Carroll (2000) in terms of mean square error (MSE) and coverage probability. Krivobokova, Crainiceanu, and Kauermann (2007) used a model similar to (2.4) for low-rank thin-plate splines and applied Laplace approximations instead of Bayesian MCMC.

In Lang and Brezger's (2004) prior for the b_k, the nonconstant variance is independent from knot to knot, since the variances of the b_k are assumed to be independent inverse-gammas. In contrast, our model allows dependence, so that if the variance is high at one knot then it is high at neighboring knots. Stated differently, the Lang and Brezger model is one of random heteroscedasticity and ours of systematic heteroscedasticity. Both types of priors are sensible and will have applications, but in any specific application it is likely that one of the two will be better. For example, Lang and Brezger found that for "Doppler type" functions their estimator is not quite as good as Ruppert and Carroll's, and that the coverage of their credible intervals is not so accurate either. It is not hard to understand why: Doppler-type functions, which oscillate more slowly going from left to right, are consistent with a prior variance for the b_k that is monotonically decreasing, exactly the type of prior included in our model but not in Lang and Brezger's.

2.3 Choice of Spline Basis, Number and Spacing of Knots, and Penalties

For concreteness, we have made some specific choices about the spline bases, number and spacing of knots, and form of the penalty. In terms of model formulation, the choice of spline basis is not important, since an equivalent basis gives the same model, and the basis used in computations need not be the same as the one used to express the model. However, spline bases are very different in terms of computational stability. For example, cubic thin-plate splines (Ruppert, Wand, and Carroll 2003, pp. 72–74) have better numerical properties than the truncated polynomial basis. In more than one dimension, the tensor product of truncated polynomials proved unstable, and we preferred using low-rank radial smoothers (see Section 6).

The penalty we use is somewhat different from that of Eilers and Marx (1996) and also somewhat different from the penalties used by smoothing splines. We believe that the penalty parameter is the crucial choice, so we have concentrated on spatially adaptive modeling of the penalty parameter.

In this article we use knots to model the mean, the variance, and the spatially adaptive smoothing parameter. The methodology described here does not depend conceptually on the number of knots; for example, one could use a knot at each observation for each of the three functions. Although we use quantile knot spacings, the best type of knot spacing is controversial. Eilers and Marx (2004) stated that "Equally spaced knots are always to be preferred." In contrast, Ruppert, Wand, and Carroll (2003) used quantile spacing in all of their examples,


though they did not make any categorical statement that quantile spacing is always to be preferred. Since knot spacing is not the main focus of this article and our results were not sensitive to it, we do not discuss this issue here.

In this article we have used quantile knot spacing for the univariate case and the space filling design (Nychka and Saltzman 1998) for the multivariate case, which is based on the maximal separation principle and generalizes quantile knot spacing to more than one dimension. For the multivariate case we also used equally spaced knots in simulations, to make our methods more comparable to other published methods.

3. PRIOR SPECIFICATION

Any smoother depends heavily on the choice of smoothing parameter, and for penalized splines the smoothing parameter is the ratio of the error variance to the prior variance on the mean (Ruppert, Wand, and Carroll 2003). The smoothness of the fit depends on how these variances are estimated. For example, Crainiceanu and Ruppert (2004) showed that in finite samples the (RE)ML estimator of the smoothing parameter is biased towards oversmoothing, and Kauermann (2002) obtained corresponding asymptotic results for smoothing splines.

In Bayesian models, the estimates of the variance components are known to be sensitive to the prior specification; for example, see Gelman (2006). To study the effect of this sensitivity upon Bayesian penalized splines, consider model (2.1) with one smoothing parameter and homoscedastic errors, so that σ²_b and σ²_ε are constant. In terms of the precision parameters τ_b = 1/σ²_b and τ_ε = 1/σ²_ε, the smoothing parameter is λ = τ_ε/τ_b = σ²_b/σ²_ε, and a small (large) λ corresponds to oversmoothing (undersmoothing).

3.1 Priors on the Fixed Effects Parameters

It is standard to assume that the fixed effects parameters β_i are a priori independent, with prior distributions either [β_i] ∝ 1 or β_i ∼ N(0, σ²_β), where σ²_β is very large. In our applications we used σ²_β = 10⁶, which we recommend if x and y have been standardized, or at least have standard deviations with order of magnitude one.

For the fixed effects γ and δ used in the log-spline models (2.3) and (2.4), we also used independent N(0, 10⁶) priors. When this prior is not consistent with the true value of the parameter, a possible strategy is to fit the model using a given set of priors and obtain the estimators γ̂, σ̂_γ, δ̂, and σ̂_δ. We could then use the independent priors N(γ̂, 10⁶ σ̂²_γ) and N(δ̂, 10⁶ σ̂²_δ) for γ and δ, respectively.

3.2 Priors on the Precision Parameters

As just mentioned, the priors for the precisions τ_b and τ_ε are crucial. We now show how critically the choice of the prior for τ_b may depend upon the scaling of the variables. The gamma family of priors for the precisions is conjugate. If [τ_b] ∼ Gamma(A_b, B_b) and, independently of


τ_b, [τ_ε] ∼ Gamma(A_ε, B_ε), where Gamma(A, B) has mean A/B and variance A/B², then

\[
[\tau_b \mid Y, \beta, b, \tau_\varepsilon] \sim \operatorname{Gamma}\left(A_b + \frac{K_m}{2},\; B_b + \frac{\|b\|^2}{2}\right)
\tag{3.1}
\]

and

\[
[\tau_\varepsilon \mid Y, \beta, b, \tau_b] \sim \operatorname{Gamma}\left(A_\varepsilon + \frac{n}{2},\; B_\varepsilon + \frac{\|Y - X\beta - Zb\|^2}{2}\right).
\]

Also,

\[
E(\tau_b \mid Y, \beta, b, \tau_\varepsilon) = \frac{A_b + K_m/2}{B_b + \|b\|^2/2},
\qquad
\operatorname{var}(\tau_b \mid Y, \beta, b, \tau_\varepsilon) = \frac{A_b + K_m/2}{(B_b + \|b\|^2/2)^2},
\]

and similarly for τ_ε. The prior does not influence the posterior distribution of τ_b when A_b and B_b are small compared to K_m/2 and ‖b‖²/2, respectively. Since the number of knots is K_m ≥ 1, and in most problems considered K_m ≥ 5, it is safe to choose A_b ≤ 0.01. When B_b ≪ ‖b‖²/2, the posterior distribution is practically unaffected by the prior assumptions. When B_b increases relative to ‖b‖²/2, the conditional distribution is increasingly affected by the prior assumptions. E(τ_b | Y, β, b, τ_ε) is decreasing in B_b, indicating that large B_b compared to ‖b‖²/2 corresponds to undersmoothing. Since the posterior variance of τ_b is also decreasing in B_b, a poor choice of B_b will likely result in underestimating the variability of the smoothing parameter λ = τ_ε/τ_b, causing confidence intervals for the mean function to be too narrow. The condition B_b ≪ ‖b‖²/2 shows that the "noninformativeness" of the gamma prior depends essentially on the scale of the problem.

To show the potential severity of these effects, we consider model (2.1) with a global smoothing parameter and homoscedastic errors. We used quadratic splines with 30 knots and the light detection and ranging (LIDAR) data for illustration. The LIDAR data were obtained from atmospheric monitoring of pollutants and consist of range, the distance traveled before light is reflected back to its source, and log ratio, the logarithm of the ratio of received light from two laser sources. For more details see Ruppert, Wand, Holst, and Hössjer (1997).

Figure 1 shows the effect of four mean-one Gamma priors for the precision of the truncated polynomial parameters. The variances of these priors are 10, 10³, 10⁶, and 10¹⁰, respectively. Obviously, the first two priors produce severely undersmoothed, almost indistinguishable posterior means. The third fit is much smoother but still exhibits roughness, especially in the right-hand side of the plot, while the fourth displays a pleasing smooth pattern consistent with our frequentist inference. Using either prior distribution, one obtains that the posterior mean of ‖b‖²/2 is of order 10⁻⁶ to 10⁻⁵. This explains why values of B_b larger than 10⁻⁶ proved inappropriate for this problem.

The size of ‖b‖²/2 depends upon the scaling of the x and y variables; in the case of the LIDAR data, ‖b‖²/2 is small because the standard deviation of x is much larger than the standard deviation of y. If y is rescaled to a_y y and x to a_x x, then the regression function becomes a_y m(a_x x), whose pth derivative is a_y a_x^p m^{(p)}(a_x x), so that ‖b‖²/2 is


Figure 1. LIDAR data with unstandardized covariate: effect of four mean-one Gamma priors for the precision of the truncated polynomial parameters. The variances of these priors are 10, 10³, 10⁶, and 10¹⁰, respectively.

rescaled by the factor a²_y a_x^{2p}. Thus ‖b‖²/2 is particularly sensitive to the scaling of x.

The size of ‖b‖²/2 also depends on K_m. Because the integral of m^{(p)} over the range of x will be approximately Σ_{k=1}^{K_m} b_k ≈ √K_m σ_b, we can expect that σ²_b ∝ K_m^{−1}, and the smoothing parameter should be proportional to K_m. For the LIDAR data, the GCV-chosen smoothing parameter is 0.0095, 0.0205, 0.0440, and 0.0831 for K_m equal to 15, 30, 60, and 120, respectively, so, as expected, the smoothing parameter approximately doubles as K_m doubles.

Practical experience with LMMs for longitudinal or clustered data should be applied with caution to penalized splines. In a mixed effects model, ‖b‖² is an estimator of K_m σ²_b. For longitudinal data, K_m σ²_b would generally be large because K_m is the number of subjects, with constant subject effect variance σ²_b. As just discussed, for a penalized spline K_m σ²_b should be nearly independent of K_m and could be quite small.

Figure 2 presents the same type of results as Figure 1 for Gamma priors for the precision parameter τ_b, with the mean held fixed at 10⁻⁶ and variances equal to 10, 10³, 10⁶, and 10¹⁰, respectively. These prior distributions have a much smaller effect on the posterior mean of the regression function. The fit seems to be undersmoothed when the variance is 10. Clearly, when the variance increases the fit becomes smooth, indicating that a value of the variance larger than 10³ will produce a reasonable fit.

A similar discussion holds true for τ_ε, but now large B_ε corresponds to oversmoothing, and τ_ε does not depend on the scaling of x. In applications it is less likely that B_ε is comparable in size to ‖Y − Xβ − Zb‖², because the latter is an estimator of nσ²_ε. If σ̂²_ε is an estimator of σ²_ε, a good rule of thumb is to use values of B_ε smaller than nσ̂²_ε/100. This


Figure 2. LIDAR data with unstandardized covariate: effect of four Gamma priors with mean 10⁻⁶ for the precision of the truncated polynomial parameters. The variances of these priors are 10, 10³, 10⁶, and 10¹⁰, respectively.

rule should work well when σ̂²_ε does not have an extremely large variance.

Alternatives to gamma priors are discussed by, for example, Natarajan and Kass (2000) and Gelman (2006). These have the advantage of requiring less care in the choice of the hyperparameters. However, we find that, with reasonable care, the conjugate gamma priors can be used in practice. Nonetheless, exploration of other prior families for penalized splines would be well worthwhile, though beyond the scope of this article.

4. SIMULTANEOUS CREDIBLE BOUNDS

Simultaneous credible intervals have been discussed before in the literature, for example, by Besag, Green, Higdon, and Mengersen (1995). The idea can be used to produce simultaneous credible bounds for smoothly estimated functions. To show this, let f(·) be either m(·), log σ²_ε(·), log σ²_b(·), or a derivative of order q, 1 ≤ q ≤ p, of one of these functions. It is straightforward to use MCMC output to construct simultaneous credible bounds on f over an arbitrary finite interval [x_1, x_N]. Typically x_1 and x_N would be the smallest and largest observed values of x.

Let x_1 < x_2 < ⋯ < x_N be a fine grid of points on this interval. Let Ê{f(x_i)} and ŜD{f(x_i)} be the posterior mean and standard deviation of f(x_i) estimated from an MCMC sample. Let M̂_α be the (1 − α) sample quantile of max_{1≤i≤N} |[f(x_i) − Ê{f(x_i)}]/ŜD{f(x_i)}| computed from the realizations of f in the MCMC sample. Then I(x_i) = Ê{f(x_i)} ± M̂_α ŜD{f(x_i)}, i = 1, ..., N, are simultaneous credible intervals. For N


large, the upper and lower limits of these intervals can be connected to form simultaneous credible bands. In the remainder of this article we report only simultaneous credible bounds for the functions and their derivatives. In the examples considered below, these bounds tend to be roughly 30–50% wider than pointwise credible bounds.
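The construction above translates directly into code. The following sketch (ours, assuming the MCMC realizations of f are stored as an S × N array) computes Ê{f(x_i)}, ŜD{f(x_i)}, and M̂_α.

```python
import numpy as np

def simultaneous_band(f_draws, alpha=0.05):
    """(1 - alpha) simultaneous credible band from MCMC draws.

    f_draws: (S, N) array; row s holds f(x_1), ..., f(x_N) for draw s.
    """
    mean = f_draws.mean(axis=0)
    sd = f_draws.std(axis=0, ddof=1)
    # for each draw, the maximum standardized deviation over the grid
    dev = np.abs((f_draws - mean) / sd).max(axis=1)
    m_alpha = np.quantile(dev, 1.0 - alpha)  # the quantile M_alpha of Section 4
    return mean - m_alpha * sd, mean + m_alpha * sd, m_alpha

# toy check on white-noise "draws": the simultaneous multiplier must exceed
# the pointwise normal multiplier 1.96, so the band is wider at every point
rng = np.random.default_rng(0)
draws = rng.normal(size=(5000, 50))
lo, hi, m_alpha = simultaneous_band(draws)
print(m_alpha > 1.96)  # True
```

The gap between M̂_α and 1.96 in this toy example is the 30–50% widening mentioned above.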

5. COMPARISON WITH OTHER UNIVARIATE SMOOTHERS

Baladandayuthapani et al. (2005) presented a comprehensive simulation study of their Bayesian spatially adaptive penalized spline model, which is obtained as a particular case of our full model with constant error variance. The results of this study indicate that the method is comparable to or better than several other spatially adaptive Bayesian methods on a variety of datasets. Results also showed that the method produced practically indistinguishable inferences from those obtained using the fast frequentist method of Ruppert and Carroll (2000). Since we are unaware of any other estimation methodology that jointly estimates the nonparametric adaptive mean and nonparametric variance models, we will compare our methodology with that of Ruppert and Carroll (2000). We will compare the frequentist properties of our Bayesian methodology, such as MSE and coverage probabilities, calculated by simulating from existing benchmark models.

Consider the regression model

\[
y_i = m(x_i) + \varepsilon_i,
\]

where ε_i ∼ N(0, σ²_ε) are independent errors. The first function considered was the Doppler function described by Donoho and Johnstone (1994),

\[
m(x) = \sqrt{x(1 - x)} \, \sin\left\{\frac{2\pi(1 + 0.05)}{x + 0.05}\right\}.
\]

For the Doppler function we considered n = 2048 equally spaced x's in [0, 1] and σ²_ε = 1.

Ruppert and Carroll (2000) proposed the following family of models, related to the Doppler function but allowing for more shape flexibility,

\[
m(x; j) = \sqrt{x(1 - x)} \, \sin\left\{\frac{2\pi\left(1 + 2^{(9 - 4j)/5}\right)}{x + 2^{(9 - 4j)/5}}\right\},
\]

where j = 3 and j = 6 correspond to low and severe spatial heterogeneity, respectively. For this function we considered n = 400 equally spaced x's on [0, 1] and σ²_ε(x) = 0.04. We will refer to this as the RC_j function, where j controls the amount of spatial heterogeneity.

Ruppert and Carroll (2000) and Baladandayuthapani et al. (2005) also considered the following function,

\[
m(x) = \exp\{-400(x - 0.6)^2\} + \frac{5}{3}\exp\{-500(x - 0.75)^2\} + 2\exp\{-500(x - 0.9)^2\},
\tag{5.1}
\]


Table 1. Average mean square error (AMSE) for different mean functions, based on 100 simulations. The column "True variance" indicates whether the data-generating process is homoscedastic or heteroscedastic. The columns "Fitted error variance" refer to the type of assumption used for fitting the data: "Homoscedastic" uses the Ruppert and Carroll (2000) fitting algorithm, while "Heteroscedastic" uses the Bayesian inferential algorithm described in this article. The mean and log-variance functions were modeled using degree 2 splines with K_m = K_ε = 40 equally spaced knots. The shrinkage process was modeled using degree 1 splines with K_b = 4 subknots.

                                    Fitted error variance
  Function     True variance     Homoscedastic    Heteroscedastic
  Doppler      Constant          0.0197           0.0218
  RC_{j=3}     Constant          0.0011           0.0012
  RC_{j=6}     Constant          0.0061           0.0065
  3 Bumps      Constant          0.0054           0.0055
  3 Bumps      Variable          0.0031           0.0026

which is roughly constant on [0, 0.5] and has three sharp peaks at 0.6, 0.75, and 0.9. For this function we considered n = 1000 equally spaced x's on [0, 1] and two different scenarios for the standard deviation of the error process: (a) σ_ε(x) = 0.5 for homoscedastic errors, and (b) σ_ε(x_i) = 0.5 − 0.8x + 1.6(x − 0.5)_+. We will refer to this as the "three-bump" function.

We used 100 simulations from these models and fit a spatially adaptive penalized spline that assumes a constant variance. For the three-bump function we also fitted a model with nonconstant variance, modeled as a log-penalized spline. For the mean and log-variance functions we used quadratic penalized splines with K_m = K_ε = 40 knots. For the log-variance corresponding to the shrinkage process we used K_b = 4 knots.
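For reference, the benchmark mean functions and the heteroscedastic error scenario can be coded directly from the formulas above; this sketch is our illustration of the simulation design, not the authors' simulation code.

```python
import numpy as np

def doppler(x):
    # Donoho and Johnstone (1994)
    return np.sqrt(x * (1 - x)) * np.sin(2 * np.pi * 1.05 / (x + 0.05))

def rc(x, j):
    # Ruppert and Carroll (2000) family; j = 3 (low) to j = 6 (severe)
    e = 2.0 ** ((9 - 4 * j) / 5.0)
    return np.sqrt(x * (1 - x)) * np.sin(2 * np.pi * (1 + e) / (x + e))

def three_bumps(x):
    # equation (5.1): roughly constant on [0, 0.5], peaks at 0.6, 0.75, 0.9
    return (np.exp(-400 * (x - 0.6) ** 2)
            + (5.0 / 3.0) * np.exp(-500 * (x - 0.75) ** 2)
            + 2.0 * np.exp(-500 * (x - 0.9) ** 2))

def three_bumps_sd(x):
    # scenario (b): sigma(x) = 0.5 - 0.8 x + 1.6 (x - 0.5)_+
    return 0.5 - 0.8 * x + 1.6 * np.maximum(x - 0.5, 0.0)

# one simulated dataset from the heteroscedastic three-bump scenario
x = np.linspace(0.0, 1.0, 1000)
rng = np.random.default_rng(2)
y = three_bumps(x) + rng.normal(size=x.size) * three_bumps_sd(x)
```

Note that the standard deviation in scenario (b) falls linearly from 0.5 to 0.1 on [0, 0.5] and rises back to 0.5 at x = 1, which drives the coverage results discussed below.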

We calculated the mean square error for each simulated dataset as

\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \{\widehat{m}(x_i) - m(x_i)\}^2,
\]

where m̂(x) is the posterior mean of m(x) using a given model. Table 1 provides a summary of the average MSE (AMSE) results over 100 simulations, comparing the method that allows for heteroscedastic errors with the one that does not. Remarkably, our method that estimates a nonconstant variance performed well compared to methods that assume a constant variance, even when the variance is constant. In contrast, when the true variance process is not constant, our method performed better.

Our method was slightly outperformed by the Ruppert and Carroll (2000) method in the Doppler case. This is probably because the Ruppert and Carroll method assumes a constant variance for the Doppler function, and the variance used for simulations was indeed the constant σ²_ε = 1. However, both methods outperformed the method of Baladandayuthapani et al. (2005), who reported an AMSE of 0.024 for the Doppler function.

For the three-bump function with constant variance σ_ε(x) = 0.5, the two models produced practically indistinguishable MSEs. Note that our AMSE over all x's using 100 simulations was AMSE = 0.0054, which is smaller than the 0.0061 reported by Baladandayuthapani et al. (2005). The coverage probabilities of the 95% credible intervals were very similar for the two models for each value of the covariate, with a slight advantage in favor of the model using homoscedastic errors. The average coverage probability was 94.7% for the model with homoscedastic errors and 93.5% for the model with heteroscedastic errors. These coverage probabilities are very similar to the ones reported by Baladandayuthapani et al. (2005) in their Figure 3.

Figure 3. Comparison of pointwise coverage probabilities of the 95% credible intervals in 500 simulations. Data were simulated from the model described in Section 5 with heteroscedastic error variance. Estimation was done using two methods: Bayesian adaptive penalized splines with homoscedastic errors (dashed) and Bayesian adaptive penalized splines with heteroscedastic errors (solid). The coverage probabilities have been smoothed to remove Monte Carlo variability.

For case (b), σ_ε(x_i) = 0.5 − 0.8x + 1.6(x − 0.5)_+, the heteroscedastic model substantially outperformed the homoscedastic model, both in terms of AMSE and coverage probabilities. In this situation it would be misleading to compare only the average coverage probability for the 95% credible intervals. Indeed, these averages are very close, 94.1% for the heteroscedastic and 93.0% for the homoscedastic method, but they are obtained from very different sets of pointwise coverage probabilities. Figure 3 displays the coverage probabilities for these two methods. Note that for the heteroscedastic method the coverage probability in the [0, 0.5] interval is close to 95%. In the same interval, the coverage probability for the homoscedastic method starts from around 0.80 and increases until it crosses the 95% target around x = 0.2. Moreover, in the interval [0.3, 0.5] this coverage probability is estimated to be 1. This is due to the fact that on this interval σ_ε(x) decreases linearly from 0.5 to 0.1. Since the homoscedastic method assumes a constant variance, the credible intervals will tend to be shorter than nominal in regions of higher variability (x close to zero), thus producing lower coverage probabilities. In regions of smaller variability (x close to 0.5), the credible intervals will tend to be much wider, thus producing extremely large coverage probabilities.

In the interval [0.55, 0.65], the homoscedastic method slightly outperforms the heteroscedastic method. This seems to be the effect of a "lucky" combination of two factors. As we discussed, the size of the credible intervals for the homoscedastic method in a neighborhood of 0.5 is much larger than nominal. However, at the same point the mean function changes from a constant to a rapidly oscillating function, and the coverage probabilities drop at roughly the same rate. In the interval [0.8, 1] one can notice a phenomenon very similar to the one described for the interval [0, 0.5]. While the function oscillates more rapidly in this region, the heteroscedastic adaptive method produces credible intervals with coverage probabilities close to the nominal 95%. However, the homoscedastic method does not take into account the increased variability and produces credible intervals that are too short.

For these examples, both methods performed roughly similarly on simulated data sets that did not require variance estimation. However, when the error variance is not constant, the method using log-penalized splines to estimate the error variance considerably outperforms the method that does not.

6. LOW RANK MULTIVARIATE SMOOTHING

In this section we generalize the ideas in Section 2 to multivariate smoothing, while preserving the appealing geometric interpretation of the smoother. Consider the regression y_i = m(x_i) + ε_i, where m(·) is a smooth function of L covariates. We will use radial basis functions, which have the advantage of being more numerically stable than the tensor product of truncated polynomials in more than one dimension. Suppose that x_i ∈ R^L, 1 ≤ i ≤ n, are n vectors of covariates and κ_k ∈ R^L, 1 ≤ k ≤ K_m, are K_m knots. Consider the function

\[
C(r) =
\begin{cases}
\|r\|^{2M - L} & \text{for } L \text{ odd},\\
\|r\|^{2M - L} \log \|r\| & \text{for } L \text{ even},
\end{cases}
\]

where ‖·‖ denotes the Euclidean norm in R^L and the integer M controls the smoothness of C(·). Let X be the matrix with ith row X_i = [1, x_i^T], let Z_{K_m} = {C(‖x_i − κ_k‖)}_{1≤i≤n, 1≤k≤K_m} and Ω_{K_m} = {C(‖κ_k − κ_{k′}‖)}_{1≤k,k′≤K_m}, and define Z = Z_{K_m} Ω_{K_m}^{−1/2}, where Ω_{K_m}^{1/2} is the principal square root of Ω_{K_m}. With these notations, the low-rank approximation of thin-plate spline regression can be obtained as the BLUP in the LMM (Kammann and Wand 2003; Ruppert, Wand, and Carroll 2003)

\[
Y = X\beta + Zb + \varepsilon,
\qquad
E\begin{pmatrix} b \\ \varepsilon \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix},
\qquad
\operatorname{cov}\begin{pmatrix} b \\ \varepsilon \end{pmatrix} = \begin{pmatrix} \sigma_b^2 I_{K_m} & 0 \\ 0 & \sigma_\varepsilon^2 I_n \end{pmatrix}.
\tag{6.1}
\]
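The mapping from knots to the transformed design matrix Z = Z_{K_m} Ω_{K_m}^{−1/2} can be sketched as follows. This is our illustration: the symmetric square root via an eigendecomposition with absolute eigenvalues is our numerical choice, guarding against negative eigenvalues of Ω (which is only conditionally positive definite), and is not a prescription from the article.

```python
import numpy as np

def radial_design(x, knots, M=2):
    """Low-rank thin-plate design matrices X and Z = Z_K Omega_K^{-1/2}.

    x: (n, L) covariates; knots: (K, L). Uses C(r) = ||r||^(2M - L) for L odd
    and ||r||^(2M - L) log ||r|| for L even.
    """
    n, L = x.shape

    def C(d):
        out = d ** (2 * M - L)
        if L % 2 == 0:
            # d^k log d with the convention C(0) = 0 (inner where keeps log safe)
            out = out * np.where(d > 0, np.log(np.where(d > 0, d, 1.0)), 0.0)
        return out

    Z_K = C(np.linalg.norm(x[:, None, :] - knots[None, :, :], axis=2))
    Omega = C(np.linalg.norm(knots[:, None, :] - knots[None, :, :], axis=2))
    # symmetric square root via eigendecomposition; absolute values are our
    # guard against the negative eigenvalues of the conditionally p.d. Omega
    w, Q = np.linalg.eigh(Omega)
    Omega_half = (Q * np.sqrt(np.abs(w))) @ Q.T
    Z = Z_K @ np.linalg.inv(Omega_half)
    X = np.column_stack([np.ones(n), x])
    return X, Z

rng = np.random.default_rng(4)
x = rng.uniform(size=(50, 2))
knots = np.array([[0.1, 0.1], [0.9, 0.2], [0.5, 0.5], [0.2, 0.8], [0.8, 0.9]])
X, Z = radial_design(x, knots)
print(X.shape, Z.shape)  # (50, 3) (50, 5)
```

The transformation by Ω^{−1/2} is what makes the penalty on b a simple ridge penalty, so that (6.1) has the standard LMM covariance σ²_b I_{K_m}.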

This model contains only one variance component, σ²_b, controlling the shrinkage of b, which is equivalent to one global smoothing parameter λ = σ²_b/σ²_ε, and it implicitly assumes homoscedastic errors. To relax these assumptions, we consider a new set of knots κ*_1, ..., κ*_{K_b} and define X* and Z* similarly, with the corresponding definitions of the matrices X and Z where the x-covariates are replaced by the knots κ_k and the knots are replaced by the subknots κ*_k. Consider the following model for the response,

\[
y_i = \beta_0 + \sum_{j=1}^{L} \beta_j x_{ij} + \sum_{k=1}^{K_m} b_k z_{ik} + \varepsilon_i,
\qquad
\begin{aligned}
&\varepsilon_i \sim N\{0, \sigma_\varepsilon^2(x_i)\}, \quad i = 1, \ldots, n,\\
&b_k \sim N\{0, \sigma_b^2(\kappa_k)\}, \quad k = 1, \ldots, K_m,
\end{aligned}
\tag{6.2}
\]

where the error variance log σ²_ε(x_i) is modeled as a low-rank log-thin-plate spline,

\[
\log \sigma_\varepsilon^2(x_i) = \gamma_0 + \sum_{j=1}^{L} \gamma_j x_{ij} + \sum_{k=1}^{K_\varepsilon} c_k z_{ik},
\qquad
c_k \sim N(0, \sigma_c^2), \quad k = 1, \ldots, K_\varepsilon,
\tag{6.3}
\]

and the random coefficient variance log σ²_b(κ_k) is modeled as another low-rank log-thin-plate spline,

\[
\log \sigma_b^2(\kappa_k) = \delta_0 + \sum_{j=1}^{L} \delta_j x^*_{kj} + \sum_{j=1}^{K_b} d_j z^*_{kj},
\qquad
d_j \sim N(0, \sigma_d^2), \quad j = 1, \ldots, K_b,
\tag{6.4}
\]

where x*_{kj} and z*_{kj} are the entries of the X* and Z* matrices, respectively, and the c_k and d_j are assumed mutually independent.

In the case of low-rank smoothers, the set of knots for the covariates and the subknots for modeling the shrinkage process have to be chosen. One possibility is to use equally spaced knots and subknots. We use this approach in our simulation study in Section 7, to make our method comparable from this point of view with previously published methods. Another possibility is to select the knots and subknots using the space filling design (Nychka and Saltzman 1998), which is based on the maximal separation principle. This avoids wasting knots and is likely to lead to better approximations in sparse regions of the data. The cover.design() function from the R package fields (Fields Development Team 2006) provides software for space filling knot selection. We use this approach in Section 8, in the Noshiro elevation example.
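The maximal separation idea can be approximated by a simple greedy farthest-point rule; the sketch below is our illustrative stand-in, not the algorithm implemented by cover.design().

```python
import numpy as np

def farthest_point_knots(x, K, seed=0):
    """Greedy farthest-point selection of K knots from candidate sites x.

    Each new knot is the candidate farthest from all knots chosen so far,
    a simple approximation to the maximal separation principle.
    """
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(x.shape[0]))]
    # distance from each candidate to its nearest chosen knot
    d = np.linalg.norm(x - x[chosen[0]], axis=1)
    for _ in range(K - 1):
        nxt = int(np.argmax(d))
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(x - x[nxt], axis=1))
    return x[chosen]

sites = np.random.default_rng(3).uniform(size=(500, 2))
knots = farthest_point_knots(sites, K=25)
```

Because each knot maximizes the distance to its predecessors, the selected knots are well separated and cover sparse regions of the data, which is the property motivating space-filling designs.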

7. COMPARISONS WITH OTHER ADAPTIVE SURFACE FITTING METHODS

bull m1(x1 x2) = x1 sin(4πx2) where x1 and x2 are distributed independently uniformon [0 1]

bull m2(x1 x2) = 15 exp(minus8x21 ) + 35 exp(minus8x2

2 ) where x1 and x2 are distributedindependently normal with mean 05 and variance 01

278 C M Crainiceanu et al

Function m1(middot) contains complex interactions and moderate spatial variability whilem2(middot) has only main effects and is much smoother We used a sample size n = 300σ = 14 range(f ) and 250 simulations from each model

We will compare the performance of our method with that of Lang and Brezgerrsquos whichperformed at least as good or better than other adaptive methods in a variety of examplesTo avoid dependence of results on knot choice we used the same choice as the one describedby Lang and Brezger 12 times 12 equally spaced knots for the mean function However weused thin-plate splines instead of tensor products of B-splines We modeled the smoothingparameter using a low rank log-thin plate spline with an equally spaced 5 times 5 knot grid Weconsidered two estimators one with constant error variance (CEV) and one with estimatederror variance using a log-thin plate spline (EEV) with the same 12 times 12 knot grid as forthe mean function The CEV estimator is the one that can directly be compared with theestimators considered by Lang and Brezger (2004) Because Lang and Brezger presentedonly comparative results using boxplots we will use their results as read from their Figure5 The performance of all estimators was measured by the log of the mean squared errorlog(MSE) where MSE(f ) = 1n

sumni=1f (xi) minus f (xi)2

For the CEV estimator and function f1(middot) corresponding to moderate spatial heterogene-ity we obtained a median log(MSE) of minus367 with an interquartile range [minus380minus353]and a range [minus421minus313] These values are better than the best values reported by Langand Brezgerrsquos methods in their Figure 5(b) Lang and Brezger reported that their method de-noted ldquoP-splinerdquo outperforms all other methods in this case and achieved a median log(MSE)of minus350 Remarkably the EEV estimator performed almost as well as the CEV estimatorin this case and also outperformed the other methods The median log(MSE) for EEV wasof minus359 with an interquartile range [minus374minus343] and a range [minus414minus302] Theseresults are consistent with univariate results reported by Ruppert and Carroll (2000) andBaladandayuthapani Mallick and Carroll (2005) who found that allowing the smoothingparameter to vary smoothly outperforms other adaptive techniques

For function f1(middot) Figure 4 displays the coverage probabilities for the 95 credibleintervals calculated over a 20 times 20 equally spaced grid in [0 1]2 For one grid point wecalculated the frequency with which the 95 pointwise credible interval covers the truevalue of the function at the grid point As expected these coverage probabilities showstrong spatial correlation with lower coverage probabilities along the ridges of the sinusfunction Coverage probability is lowest when x1 is in the [02 05] range The signal-to-noise in this region is about half the signal-to-noise ratio in the region corresponding to x1

close to 1 This explains why the coverage probability is smaller in this region Interestinglythe zeros of the true function appearing at x2 = 025 050 075 are covered with at least95 probability Another interesting feature is the lower coverage probabilities near thenorth south and east boundaries These features of the pointwise coverage probabilitymap are consistent with the frequentist analysis of Krivobokova et al (2007) but are notconsistent with the coverage probabilities reported by Lang and Brezger for their BayesianP-spline method

For our CEV estimator and function f2(·, ·), corresponding to very low spatial heterogeneity, we obtained a median log(MSE) of −5.95, with an interquartile range [−6.12, −5.66]


Figure 4. Coverage probability of the pointwise 95% credible intervals for the mean function, for function f1(x1, x2) = x1 sin(4πx2) with a constant error standard deviation σ = range(f)/4. Probabilities are computed on a 20 × 20 equally spaced grid of points in [0, 1]². The model used was a thin plate spline with K = 100 knots for the mean function, a log-thin plate spline with K = 100 knots for the variance function, and a log-thin plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

and a range [−6.51, −5.41]. The results of Lang and Brezger in their Figure 5(a) indicate that their "P-spline" method achieved a median log(MSE) of −6.00. Our method performs roughly similarly to the Bayesian P-spline method of Lang and Brezger and to the cubic thin plate spline (tps), being outperformed only by Lang and Brezger's adaptive P-spline method with two main effects, "P-spline main." We could also add two main effects to our bivariate adaptive smoother to improve MSE for this example, but this is not our concern in this article. Again, the EEV estimator performed very similarly to our CEV estimator.

The last simulation study was done using the function f1(·, ·), where x1, x2 are independent, uniformly distributed in [0, 1], with the error standard deviation function

σ_ε(x1, x2) = r/32 + (3r/32) x1²,

where r = range(f1). We compared our CEV and EEV estimators described at the beginning of this section and, as expected, the EEV estimator outperformed the CEV both in terms of log(MSE) and coverage probabilities. More precisely, for log(MSE) we obtained a median of −5.26 for CEV and −5.35 for EEV, an interquartile range [−5.37, −5.15] for CEV and [−5.48, −5.24] for EEV, and a range [−5.72, −4.65] for CEV and [−5.84, −4.84] for EEV. For both estimators the general patterns for coverage probabilities were the ones displayed in Figure 4. EEV outperformed CEV in terms of coverage probabilities. For example, the regions where the nominal level is exceeded for both estimators have a roughly similar


Figure 5. Locations of sampling (dots) with 100 knots (circles) in the Noshiro example.

shape and location, but the EEV coverage probabilities tend to be closer to their nominal level.
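The design of this last simulation study is simple to reproduce. The Python sketch below generates one dataset from it, using f1(x1, x2) = x1 sin(4πx2) from the Figure 4 caption; the grid resolution used to approximate range(f1), the sample size, and the seed are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

def f1(x1, x2):
    # Test function with moderate spatial heterogeneity (see the Figure 4 caption)
    return x1 * np.sin(4 * np.pi * x2)

# Approximate r = range(f1) on a fine grid over [0, 1]^2
g = np.linspace(0.0, 1.0, 201)
G1, G2 = np.meshgrid(g, g)
vals = f1(G1, G2)
r = vals.max() - vals.min()

def sigma_eps(x1, x2):
    # Error standard deviation function of the last simulation study
    return r / 32 + (3 * r / 32) * x1 ** 2

# One simulated dataset: x1, x2 independent uniform on [0, 1]
n = 500
x1 = rng.uniform(size=n)
x2 = rng.uniform(size=n)
y = f1(x1, x2) + sigma_eps(x1, x2) * rng.standard_normal(n)
```

On [0, 1]², range(f1) = 2, so the error standard deviation runs from r/32 at x1 = 0 to r/8 at x1 = 1, a fourfold change.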

Note that our model incorporating nonparametric variance estimation is more general than the models considered in the literature. Nevertheless, its performance for function f1(·, ·), with moderate spatial variability and constant variance, was better than that of all methods considered by Lang and Brezger in their simulation study (see their Figure 5, panel b). Our method also performs well for function f2(·, ·), being second best compared with the methods considered by Lang and Brezger (see their Figure 5, panel a). For the case when the variance is not constant, our adaptive method that estimates the variance nonparametrically outperformed our adaptive method that assumes a constant variance.

8. THE NOSHIRO EXAMPLE

Noshiro, Japan, was the site of a major earthquake. Much of the damage from the quake was due to soil movement. Professor Thomas O'Rourke, a geotechnical engineer, was investigating the factors that might help predict soil movement during a future quake. One factor thought to be of importance was the slope of the land. To estimate the slope, Ruppert (1997) used a dataset with 799 observations, where x was (longitude, latitude) and y was elevation at locations in Noshiro. The object of primary interest is the gradient, at least its magnitude and possibly its direction.

We used the thin-plate spline model (6.2) for the mean of y, with Km = 100 knots selected by the maximum separation criterion implemented in the function cover.design() of the R package fields (Nychka, Haaland, O'Connell, and Ellner 1998). The sampling


Figure 6. Posterior mean of the mean regression function for the Noshiro example. The model used was a thin plate spline with K = 100 knots for the mean function, a log-thin plate spline with K = 100 knots for the variance function, and a log-thin plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

locations and the knots used for fitting are displayed in Figure 5. The log-thin plate spline described in (6.3), with the same Kε = 100 knots, was used to model the variance of the error process, σ²_ε(x_i). The log-thin plate spline described in (6.4), with Kb = 16 subknots, was used to model the variance σ²_b(κ_k), which controls the amount of shrinkage of the b parameters. The subknots were obtained by applying the cover.design() function to the 100 knots.

Figure 6 displays the posterior mean of the mean response function. There is a rather sharp peak around (0.42, 0.4), with a slow decay from the peak in the general south direction and a much faster decay in all other directions. The function becomes smoother towards the boundary. This is exactly the type of function for which adaptive spatial smoothing can substantially improve the fit.

Figure 7 shows that in large areas of the map σ_ε(·) is smooth, with small values. However, in a neighborhood of the peak of the mean function, σ_ε(·) displays two relatively sharp maxima, located NNW and SSE of the peak. We also performed a two-stage frequentist analysis. In the first stage we used a low-rank thin plate spline for the mean function, and the results from this regression were used to obtain residuals. We then fitted another low-rank thin plate spline to the absolute values of these residuals. While this map was not identical to the one in Figure 7, it did exhibit the same patterns of heteroscedasticity.
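The two-stage check is easy to prototype. The sketch below illustrates it in one dimension with a ridge-penalized truncated-polynomial spline, our own simplification of the low-rank thin plate splines used in the example; the degree, knot count, penalty, and simulated data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)  # illustrative

def pspline_fit(x, y, degree=2, num_knots=10, lam=1.0):
    # Ridge-penalized truncated-polynomial spline fit (penalty on the
    # truncated-polynomial coefficients only), with quantile-spaced knots
    knots = np.quantile(x, np.linspace(0, 1, num_knots + 2)[1:-1])
    C = np.hstack([np.vander(x, degree + 1, increasing=True),
                   np.maximum(x[:, None] - knots, 0.0) ** degree])
    D = np.diag([0.0] * (degree + 1) + [lam] * num_knots)
    coef = np.linalg.solve(C.T @ C + D, C.T @ y)
    return C @ coef

# Simulated data with a standard deviation that grows with x
n = 400
x = np.sort(rng.uniform(size=n))
sd = 0.05 + 0.25 * x
y = np.sin(2 * np.pi * x) + sd * rng.standard_normal(n)

fit = pspline_fit(x, y)
resid = y - fit                         # stage 1: mean fit and residuals
sd_hat = pspline_fit(x, np.abs(resid))  # stage 2: smooth the |residuals|
```

The smoothed absolute residuals recover the increasing pattern of the true standard deviation, up to a multiplicative constant (E|ε| = √(2/π) σ for normal errors).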

The process log σ²_b(κ_k), controlling the shrinkage of the b_k, is displayed in Figure 8.


Figure 7. Posterior mean of the standard deviation σ_ε(x_i) of the error process for the Noshiro example. The model used was a thin plate spline with K = 100 knots for the mean function, a log-thin plate spline with K = 100 knots for the variance function, and a log-thin plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

Figure 8. Posterior mean of the shrinkage process −log σ²_b(κ_k). The model used was a thin plate spline with K = 100 knots for the mean function, a log-thin plate spline with K = 100 knots for the variance function, and a log-thin plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.


Smaller values of this function correspond to less shrinkage of the b_k towards zero and more local behavior of the smoother. Figure 8 indicates that in a neighborhood of the peak of the mean function the shrinkage is smaller, to allow the mean function to change rapidly. Away from the peak the shrinkage is larger, corresponding to a smoother mean.

9. IMPLEMENTATION USING MCMC

We now provide some details for MCMC simulation of model (2.1), where the variances σ²_ε(x_i) and σ²_b(κ_k) are modeled by Equations (2.3) and (2.4). The implementation for other models, for example for the multivariate smoothing in Section 6, is similar. Consider independent normal priors for the coefficients of the monomials, β_i ~ N(0, σ²_{0β}), γ_i ~ N(0, σ²_{0γ}), i = 0, ..., p, and δ_i ~ N(0, σ²_{0δ}), i = 0, ..., q, and independent inverse-gamma priors for the variance components, σ²_c ~ IGamma(a_c, b_c) and σ²_d ~ IGamma(a_d, b_d). Using these priors, many full conditionals of the posterior distribution are easy to derive, while a few have complex multivariate forms. Our implementation of the MCMC using multivariate Metropolis–Hastings steps proved to be unstable, with poor mixing properties. A simple and reliable solution was to change the model by adding error terms to the log-spline models, that is,

log σ²_ε(x_i) = γ_0 + ··· + γ_p x_i^p + Σ_{s=1}^{K_ε} c_s (x_i − κ^ε_s)^p_+ + u_i,
log σ²_b(κ^m_k) = δ_0 + ··· + δ_q (κ^m_k)^q + Σ_{s=1}^{K_b} d_s (κ^m_k − κ^b_s)^q_+ + v_k,    (9.1)

where u_i ~ N(0, σ²_u) and v_k ~ N(0, σ²_v). This idea was also used for σ²_b by Baladandayuthapani, Mallick, and Carroll (2005). We fixed the values σ²_u = σ²_v = 0.01, as these variances appear nonidentifiable and a standard deviation of 0.1 is small on a log scale. This device makes computations feasible by replacing sampling from complex full conditionals with simple univariate MH steps.

Define ε, b, θ_X as before, and define θ_ε = (γ^T, c^T)^T, θ_b = (δ^T, d^T)^T, C_X = (X, Z), C_ε = (X_ε, Z_ε), and C_b = (X_b, Z_b), where X, X_ε, X_b contain the monomials and Z, Z_ε, Z_b contain the truncated polynomials of the spline models for m, log σ²_ε, and log σ²_b, respectively. Also denote the block-diagonal prior covariance matrices

Σ_{0X} = diag(σ²_{0β} I_{p+1}, Σ_b),   Σ_{0ε} = diag(σ²_{0γ} I_{p+1}, σ²_c I_{K_ε}),   Σ_{0b} = diag(σ²_{0δ} I_{q+1}, σ²_d I_{K_b}).

The full conditionals of the posterior are detailed below.

1. [θ_X] ~ N(M_X C_X^T Σ_ε^{−1} Y, M_X), where M_X = (C_X^T Σ_ε^{−1} C_X + Σ_{0X}^{−1})^{−1}.

2. [θ_ε] ~ N(M_ε C_ε^T Y_ε / σ²_u, M_ε), where Y_ε = [log(σ²_{ε1}), ..., log(σ²_{εn})]^T and M_ε = (C_ε^T C_ε / σ²_u + Σ_{0ε}^{−1})^{−1}.


Figure 9. Computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with Km knots, with the log of the variance of the shrinkage process modeled as another penalized spline with Kb = 4 subknots. The log of the error variance is modeled as another penalized spline with Kε = Km knots. Different symbols correspond to increasing numbers of knots: dots (Km = 10), asterisks (Km = 20), circles (Km = 30), x's (Km = 40).

3. [θ_b] ~ N(M_b C_b^T Y_b / σ²_v, M_b), where Y_b = [log(σ²_1), ..., log(σ²_{K_m})]^T and M_b = (C_b^T C_b / σ²_v + Σ_{0b}^{−1})^{−1}.

4. [σ²_c] ~ IGamma(a_c + K_ε/2, b_c + ||c||²/2).

5. [σ²_d] ~ IGamma(a_d + K_b/2, b_d + ||d||²/2).

6. [σ²_{εi}] ∝ σ^{−3}_{εi} exp{−(y_i − μ_i)² / (2σ²_{εi}) − [log(σ²_{εi}) − η_i]² / (2σ²_u)}, where μ_i and η_i are the ith components of C_X θ_X and C_ε θ_ε, respectively.

7. [σ²_k] ∝ σ^{−3}_k exp{−b²_k / (2σ²_k) − [log(σ²_k) − ζ_k]² / (2σ²_v)}, where ζ_k is the kth component of C_b θ_b.

All the above conditionals have an explicit form, with the exception of the n + K_m one-dimensional conditionals in steps 6 and 7. For these distributions we use the Metropolis–Hastings algorithm with a normal proposal distribution centered at the current value and with small variance.
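A minimal sketch of one of these univariate Metropolis–Hastings updates (step 6, for a single observation i) is given below; the values of y_i, μ_i, η_i, the step size, and the chain length are our own illustrative choices, with σ²_u = 0.01 as in this section:

```python
import math
import random

random.seed(1)  # illustrative

def log_target(s, y, mu, eta, sigma_u2):
    # Unnormalized log density of [sigma^2_{eps,i}] from step 6:
    # sigma^{-3} exp{-(y - mu)^2 / (2 s) - [log(s) - eta]^2 / (2 sigma_u^2)}
    if s <= 0.0:
        return -math.inf
    return (-1.5 * math.log(s)
            - (y - mu) ** 2 / (2.0 * s)
            - (math.log(s) - eta) ** 2 / (2.0 * sigma_u2))

def mh_update(s, y, mu, eta, sigma_u2, step=0.05):
    # Symmetric normal random-walk proposal centered at the current value
    prop = random.gauss(s, step)
    delta = (log_target(prop, y, mu, eta, sigma_u2)
             - log_target(s, y, mu, eta, sigma_u2))
    if random.random() < math.exp(min(0.0, delta)):
        return prop, 1
    return s, 0

# Illustrative values (not from the paper): y_i = 0.3, mu_i = 0, eta_i = log(0.04)
s, accepted = 0.04, 0
for _ in range(2000):
    s, a = mh_update(s, y=0.3, mu=0.0, eta=math.log(0.04), sigma_u2=0.01)
    accepted += a
rate = accepted / 2000.0
```

Because negative proposals fall outside the support they are simply rejected, and the symmetric proposal keeps the acceptance ratio valid.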

10. IMPLEMENTATION AND COMPUTATIONAL PERFORMANCE

Our MCMC simulation algorithm provided reliable results and is reasonably fast. To illustrate this, Figure 9 displays computation time (in minutes) versus sample size for 10,000


Figure 10. MCMC histories for eight parameters of interest, including the log of the standard deviation of the error process and the log of the standard deviation of the shrinkage process, as well as some parameters modeling the different functions. The model uses 40 equally spaced knots for the mean and error variance functions and 4 subknots for the adaptive shrinkage process. Chains correspond to the Doppler function example, using quadratic truncated polynomials to fit the mean and log-variance functions and a linear truncated polynomial basis for fitting the shrinkage process. For clarity of presentation, only every 20th simulated observation is displayed.

simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with Km knots, with the log of the variance of the shrinkage process modeled as another penalized spline with Kb = 4 subknots. The log of the error variance is modeled as another penalized spline with Kε = Km knots. Different symbols correspond to increasing numbers of knots: dots (Km = 10), asterisks (Km = 20), circles (Km = 30), x's (Km = 40). Computation time increases roughly linearly with the number of observations, with the same intercept 0 and larger slopes for larger numbers of knots. Note that even with n = 1000 observations and Km = Kε = 40, simulation time is about two minutes.

For the models and datasets presented in this article, 10,000 simulations proved sufficient for accurate and reliable inference. Some parameters, such as the mean and variance functions, tended to exhibit much better mixing properties than the variance components σ²_c and σ²_d controlling the shrinkage processes. This is probably due to the reduced amount of information in the data about these parameters. The MCMC convergence and mixing properties were assessed by visual inspection of the chain histories of many parameters of interest. Since we started with good starting values of the parameters, convergence is fast. Figure 10 displays the histories of eight parameters of interest for the model fitted to the


Doppler function described in Section 5. For clarity we display only every 100th sampled value of the chain, but the same patterns would be observed in the full chain history. Similarly good mixing properties have been noted in all the other examples presented in this article.

In the search for a good software platform, we implemented our algorithms in MATLAB, C, and FORTRAN. Programs can be downloaded from www.biostat.jhsph.edu/~ccrainic. For our application and implementation there were no serious differences between C and FORTRAN, both of which performed better than MATLAB, reducing computation time by a factor of two to three. The reported times in Figure 9 are for our implementation in C.

Given the complexity of the model, we were interested in improving the mixing properties of the Markov chains. The computational trick described in Section 9 was the sine qua non that allowed us to replace an intractable multivariate MH step with several tractable univariate MH steps. We used normal random-walk proposal distributions centered at the current value, with variance tuned to provide acceptance rates between 30% and 50%.
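A batch-tuning rule of this kind can be sketched as follows; the one-dimensional target (a standard normal, standing in for any of the univariate full conditionals), the batch size, and the adaptation factor are our own illustrative choices, not the paper's:

```python
import math
import random

random.seed(2)  # illustrative

def tune_step(step, rate, low=0.30, high=0.50, factor=1.5):
    # Enlarge the proposal step when accepting too often, shrink it when
    # accepting too rarely, aiming for a rate between `low` and `high`
    if rate > high:
        return step * factor
    if rate < low:
        return step / factor
    return step

def batch(x, step, n=200):
    # One batch of random-walk MH targeting a standard normal; returns the
    # new state and the batch acceptance rate
    acc = 0
    for _ in range(n):
        prop = random.gauss(x, step)
        delta = (x * x - prop * prop) / 2.0  # log-density ratio for N(0, 1)
        if random.random() < math.exp(min(0.0, delta)):
            x, acc = prop, acc + 1
    return x, acc / n

x, step = 0.0, 50.0  # deliberately poor initial step size
for _ in range(25):
    x, rate = batch(x, step)
    step = tune_step(step, rate)
```

Starting from a deliberately poor step size, the rule shrinks the proposal until the batch acceptance rate settles into the 30%–50% band; adaptation of this kind should be confined to burn-in so that the final chain uses a fixed proposal.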

11. COMMENTS

Our joint models for a heterogeneous mean and nonconstant variance are designed to extract the maximum amount of information from complex data and to suggest simpler parametric models. Bayesian inference based on MCMC offers a natural and powerful framework for model fitting. However, MCMC in this context is not straightforward, and implementation depends essentially on the ability to transform intractable simulations from high-dimensional distributions into tractable one-dimensional updates, as described in Section 9. Our implementation provides accurate, fast, and reproducible results, and software is available. Actual simulation time increases only linearly with the number of observations, and mixing properties are adequate, provided that proposal distribution variances are tuned for optimal acceptance rates.

ACKNOWLEDGMENTS

Carroll's research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

[Received September 2004. Revised October 2006.]

REFERENCES

Baladandayuthapani, V., Mallick, B. K., and Carroll, R. J. (2005), "Spatially Adaptive Bayesian Regression Splines," Journal of Computational and Graphical Statistics, 14, 378–394.

Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computation and Stochastic Systems" (with discussion), Statistical Science, 10, 3–66.

Carroll, R. J. (2003), "Variances Are Not Always Nuisance Parameters: The 2002 R. A. Fisher Lecture," Biometrics, 59, 211–220.

Carroll, R. J., and Ruppert, D. (1988), Transformation and Weighting in Regression, New York: Chapman and Hall.


Cleveland, W. S., and Grosse, E. (1991), "Computational Methods for Local Regression," Statistics and Computing, 1, 47–62.

Crainiceanu, C., and Ruppert, D. (2004), "Likelihood Ratio Tests in Linear Mixed Models With One Variance Component," Journal of the Royal Statistical Society, Series B, 66, 165–185.

Crainiceanu, C. M., Ruppert, D., and Wand, M. P. (2005), "Bayesian Analysis for Penalized Spline Regression Using WinBUGS," Journal of Statistical Software, 14. Available online at http://www.jstatsoft.org/v14/i14.

Donoho, D. L., and Johnstone, I. M. (1994), "Ideal Spatial Adaptation by Wavelet Shrinkage," Biometrika, 81, 425–455.

Eilers, P. H. C., and Marx, B. D. (1996), "Flexible Smoothing With B-Splines and Penalties" (with discussion), Statistical Science, 11, 89–121.

——— (2002), "Generalized Linear Additive Smooth Structures," Journal of Computational and Graphical Statistics, 11, 758–783.

——— (2004), "Splines, Knots, and Penalties," unpublished manuscript.

Fields Development Team (2006), fields: Tools for Spatial Data, National Center for Atmospheric Research, Boulder, CO. Available online at http://www.cgd.ucar.edu/Software/Fields.

Friedman, J. H. (1991), "Multivariate Adaptive Regression Splines" (with discussion), The Annals of Statistics, 19, 1–141.

Gelman, A. (2006), "Prior Distributions for Variance Parameters in Hierarchical Models," Bayesian Analysis, 1, 515–533.

Gray, R. J. (1994), "Spline-Based Tests in Survival Analysis," Biometrics, 50, 640–652.

Holst, U., Hössjer, O., Björklund, C., Ragnarson, P., and Edner, H. (1996), "Locally Weighted Least Squares Kernel Regression and Statistical Evaluation of LIDAR Measurements," Environmetrics, 7, 401–416.

Jullion, A., and Lambert, P. (2006), Discussion Paper DP0534, UCL.

Kammann, E. E., and Wand, M. P. (2003), "Geoadditive Models," Applied Statistics, 52, 1–18.

Kauermann, G. (2002), "A Note on Bandwidth Selection for Penalised Spline Smoothing," Technical Report 02-13, Department of Statistics, University of Glasgow.

Krivobokova, T., Crainiceanu, C. M., and Kauermann, G. (2007), "Fast Adaptive Penalized Splines," Journal of Computational and Graphical Statistics, to appear.

Lang, S., and Brezger, A. (2004), "Bayesian P-Splines," Journal of Computational and Graphical Statistics, 13, 183–212.

Lang, S., Fronk, E. M., and Fahrmeir, L. (2002), "Function Estimation with Locally Adaptive Dynamic Models," Computational Statistics, 17, 479–499.

Marx, B. D., and Eilers, P. H. C. (1998), "Direct Generalized Additive Modeling With Penalized Likelihood," Computational Statistics & Data Analysis, 28, 193–209.

Natarajan, R., and Kass, R. E. (2000), "Reference Bayesian Methods for Generalized Linear Mixed Models," Journal of the American Statistical Association, 95, 227–237.

Ngo, L., and Wand, M. P. (2004), "Smoothing With Mixed Model Software," Journal of Statistical Software, 9. Available online at http://www.jstatsoft.org/v09/i01.

Nychka, D., Haaland, P., O'Connell, M., and Ellner, S. (1998), "FUNFITS: Data Analysis and Statistical Tools for Estimating Functions," in Case Studies in Environmental Statistics, Lecture Notes in Statistics, vol. 132, eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 159–179. Available online at http://www.cgd.ucar.edu/stats/Software/Funfits.

Nychka, D., and Saltzman, N. (1998), "Design of Air Quality Monitoring Networks," in Case Studies in Environmental Statistics, Lecture Notes in Statistics, vol. 132, eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 51–76.


Pintore, A., Speckman, P., and Holmes, C. C. (2006), "Spatially Adaptive Smoothing Splines," Biometrika, 93, 113–125.

Ruppert, D. (1997), "Local Polynomial Regression and its Applications in Environmental Statistics," in Statistics for the Environment (vol. 3), eds. V. Barnett and F. Turkman, Chichester: Wiley, pp. 155–173.

——— (2002), "Selecting the Number of Knots for Penalized Splines," Journal of Computational and Graphical Statistics, 11, 735–757.

Ruppert, D., and Carroll, R. J. (2000), "Spatially-Adaptive Penalties for Spline Fitting," Australian and New Zealand Journal of Statistics, 42, 205–223.

Ruppert, D., Wand, M. P., Holst, U., and Hössjer, O. (1997), Technometrics, 39, 262–273.

Ruppert, D., Wand, M. P., and Carroll, R. J. (2003), Semiparametric Regression, Cambridge: Cambridge University Press.

Smith, M., and Kohn, R. (1997), "A Bayesian Approach to Nonparametric Bivariate Regression," Journal of the American Statistical Association, 92, 1522–1535.

Wood, S. (2006), Generalized Additive Models: An Introduction with R, New York: Chapman & Hall/CRC.


The penalized spline model can be written as

y_i = β_0 + β_1 x_i + ··· + β_p x_i^p + Σ_{k=1}^{K_m} b_k (x_i − κ^m_k)^p_+ + ε_i,
ε_i ~ N{0, σ²_ε(x_i)}, i = 1, ..., n,
b_k ~ N{0, σ²_b(κ^m_k)}, k = 1, ..., K_m,    (2.1)

where the ε_i and b_k are mutually independent given σ²_ε(·) and σ²_b(·). In Sections 2.1 and 2.2, σ²_ε(x_i) and σ²_b(κ^m_k) are modeled using log-spline models. Denote by X the n × (p+1) matrix with ith row X_i = (1, x_i, ..., x_i^p), by Z the n × K_m matrix with ith row Z_i = {(x_i − κ^m_1)^p_+, ..., (x_i − κ^m_{K_m})^p_+}, and let Y = (y_1, ..., y_n)^T and ε = (ε_1, ..., ε_n)^T. Model (2.1) can be rewritten in matrix form as

Y = Xβ + Zb + ε,   E(b, ε) = (0, 0),   cov(b, ε) = diag(Σ_b, Σ_ε),    (2.2)

where the joint conditional distribution of b and ε given Σ_b and Σ_ε is assumed normal. The β parameters are treated as fixed effects. The covariance matrices Σ_b and Σ_ε are diagonal, with the vectors σ²_b(κ^m_1), ..., σ²_b(κ^m_{K_m}) and σ²_ε(x_1), ..., σ²_ε(x_n) as main diagonals, respectively. For this model, E(Y) = Xβ and cov(Y) = Σ_ε + Z Σ_b Z^T.
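As an illustration, the design matrices of model (2.2) can be assembled as follows for the truncated polynomial basis; the degree, the number of knots, and the quantile knot placement (see Section 2.3) are our own illustrative choices:

```python
import numpy as np

def spline_design(x, degree=2, num_knots=5):
    """Build X (monomials) and Z (truncated polynomials) for model (2.2).

    Knots are placed at equally spaced sample quantiles of x.
    """
    x = np.asarray(x, dtype=float)
    knots = np.quantile(x, np.linspace(0, 1, num_knots + 2)[1:-1])
    # X: n x (p+1) matrix with ith row (1, x_i, ..., x_i^p)
    X = np.vander(x, degree + 1, increasing=True)
    # Z: n x K_m matrix with entries (x_i - kappa_k)_+^p
    Z = np.maximum(x[:, None] - knots[None, :], 0.0) ** degree
    return X, Z, knots

x = np.linspace(0, 1, 11)
X, Z, knots = spline_design(x, degree=2, num_knots=3)
```

With x = 0, 0.1, ..., 1, quadratic degree, and three knots, this places knots at the 0.25, 0.50, and 0.75 quantiles and returns an 11 × 3 matrix X and an 11 × 3 matrix Z.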

2.1 Error Variance Function Estimation

Ignoring heteroscedasticity may lead to incorrect inferences and inefficient estimation, especially when the response standard deviation varies over several orders of magnitude. Moreover, understanding how variability changes with the predictor may be of intrinsic interest (Carroll 2003). Transformations can stabilize the response variance when the conditional response variance is a function of the conditional expectation (Carroll and Ruppert 1988), but in other cases one needs to estimate it by modeling the variance as a function of the predictors.

We model the variance function as a loglinear model,

log σ²_ε(x_i) = γ_0 + ··· + γ_p x_i^p + Σ_{s=1}^{K_ε} c_s (x_i − κ^ε_s)^p_+,
c_s ~ N(0, σ²_c), s = 1, ..., K_ε,    (2.3)

where the γ parameters are fixed effects and κ^ε_1 < ··· < κ^ε_{K_ε} are knots. The normal distribution of the c_s parameters shrinks them towards 0 and ensures stable estimation.

2.2 Smoothly Varying Local Penalties

Following Baladandayuthapani et al. (2005), we model σ²_b(·) as

log σ²_b(x) = δ_0 + ··· + δ_q x^q + Σ_{s=1}^{K_b} d_s (x − κ^b_s)^q_+,
d_s ~ N(0, σ²_d), s = 1, ..., K_b,    (2.4)

where the δ's are fixed effects, the d_s's are mutually independent, and {κ^b_s}_{s=1}^{K_b} are knots. The particular case of a global smoothing parameter corresponds to the spline function being a constant, that is, log σ²_b(κ^m_s) = δ_0.


Ruppert and Carroll (2000) proposed a "local penalty" model similar to (2.4). Baladandayuthapani, Mallick, and Carroll (2005) developed a Bayesian version of Ruppert and Carroll's (2000) estimator and showed that their estimator is similar to Ruppert and Carroll's in terms of mean squared error and outperforms other Bayesian methods. Lang, Fronk, and Fahrmeir (2002) considered locally adaptive dynamic models and found that their method is roughly comparable to the method of Ruppert and Carroll (2000) in terms of mean squared error (MSE) and coverage probability. Krivobokova, Crainiceanu, and Kauermann (2007) used a model similar to (2.4) for low-rank thin plate splines and applied Laplace approximations instead of Bayesian MCMC.

In Lang and Brezger's (2004) prior for the b_k, the nonconstant variance is independent from knot to knot, since the variances of the b_k are assumed to be independent inverse-gammas. In contrast, our model allows dependence, so that if the variance is high at one knot then it is high at neighboring knots. Stated differently, the Lang and Brezger model is one of random heteroscedasticity, and ours of systematic heteroscedasticity. Both types of priors are sensible and will have applications, but in any specific application it is likely that one of the two will be better. For example, Lang and Brezger found that for "Doppler-type" functions their estimator is not quite as good as Ruppert and Carroll's, and that the coverage of their credible intervals is not so accurate either. It is not hard to understand why: Doppler-type functions, which oscillate more slowly going from left to right, are consistent with a prior variance for the b_k that is monotonically decreasing, exactly the type of prior included in our model but not in Lang and Brezger's.

2.3 Choice of Spline Basis, Number and Spacing of Knots, and Penalties

For concreteness, we have made some specific choices about the spline bases, number and spacing of knots, and form of the penalty. In terms of model formulation, the choice of spline basis is not important, since an equivalent basis gives the same model, and the basis used in computations need not be the same as the one used to express the model. However, spline bases are very different in terms of computational stability. For example, cubic thin plate splines (Ruppert, Wand, and Carroll 2003, pp. 72–74) have better numerical properties than the truncated polynomial basis. In more than one dimension, the tensor product of truncated polynomials proved unstable, and we preferred using low-rank radial smoothers (see Section 6).

The penalty we use is somewhat different from that of Eilers and Marx (1996) and also somewhat different from the penalties used by smoothing splines. We believe that the penalty parameter is the crucial choice, so we have concentrated on spatially adaptive modeling of the penalty parameter.

In this article we use knots to model the mean, the variance, and the spatially adaptive smoothing parameter. The methodology described here does not depend conceptually on the number of knots; for example, one could use a knot at each observation for each of the three functions. Although we use quantile knot spacings, the best type of knot spacing is controversial. Eilers and Marx (2004) stated that "equally spaced knots are always to be preferred." In contrast, Ruppert, Wand, and Carroll (2003) used quantile spacing in all of their examples,


though they did not make any categorical statement that quantile spacing is always to be preferred. Since knot spacing is not the main focus of this article, and our results were not sensitive to it, we do not discuss this issue here.

In this article we have used quantile knot spacing for the univariate case and a space-filling design (Nychka and Saltzman 1998) for the multivariate case, which is based on the maximal separation principle and generalizes quantile knot spacing to more than one dimension. For the multivariate case we also used equally spaced knots in simulations, to make our methods more comparable to other published methods.

3. PRIOR SPECIFICATION

Any smoother depends heavily on the choice of smoothing parameter, and for penalized splines the smoothing parameter is the ratio of the error variance to the prior variance on the mean (Ruppert, Wand, and Carroll 2003). The smoothness of the fit depends on how these variances are estimated. For example, Crainiceanu and Ruppert (2004) showed that in finite samples the (RE)ML estimator of the smoothing parameter is biased towards oversmoothing, and Kauermann (2002) obtained corresponding asymptotic results for smoothing splines.

In Bayesian models the estimates of the variance components are known to be sensitive to the prior specification; for example, see Gelman (2006). To study the effect of this sensitivity upon Bayesian penalized splines, consider model (2.1) with one smoothing parameter and homoscedastic errors, so that σ²_b and σ²_ε are constant. In terms of the precision parameters τ_b = 1/σ²_b and τ_ε = 1/σ²_ε, the smoothing parameter is λ = τ_ε/τ_b = σ²_b/σ²_ε, and a small (large) λ corresponds to oversmoothing (undersmoothing).

3.1 Priors on the Fixed Effects Parameters

It is standard to assume that the fixed effects parameters β_i are a priori independent, with prior distributions either [β_i] ∝ 1 or β_i ~ N(0, σ²_β), where σ²_β is very large. In our applications we used σ²_β = 10⁶, which we recommend if x and y have been standardized, or at least have standard deviations of order of magnitude one.

For the fixed effects γ and δ used in the log-spline models (2.3) and (2.4) we also used independent N(0, 10⁶) priors. When this prior is not consistent with the true value of the parameter, a possible strategy is to fit the model using a given set of priors and obtain the estimators γ̂, σ̂_γ, δ̂, and σ̂_δ. We could then use independent priors N(γ̂, 10⁶ σ̂²_γ) and N(δ̂, 10⁶ σ̂²_δ) for γ and δ, respectively.

3.2 Priors on the Precision Parameters

As just mentioned, the priors for the precisions τ_b and τ_ε are crucial. We now show how critically the choice of prior for τ_b may depend upon the scaling of the variables. The gamma family of priors for the precisions is conjugate. If [τ_b] ~ Gamma(A_b, B_b) and, independently of τ_b, [τ_ε] ~ Gamma(A_ε, B_ε), where Gamma(A, B) has mean A/B and variance A/B², then

[τ_b | Y, β, b, τ_ε] ~ Gamma(A_b + K_m/2, B_b + ||b||²/2),    (3.1)

and

[τ_ε | Y, β, b, τ_b] ~ Gamma(A_ε + n/2, B_ε + ||Y − Xβ − Zb||²/2).

Also,

E(τ_b | Y, β, b, τ_ε) = (A_b + K_m/2) / (B_b + ||b||²/2),
var(τ_b | Y, β, b, τ_ε) = (A_b + K_m/2) / (B_b + ||b||²/2)²,

and similarly for τ_ε. The prior does not influence the posterior distribution of τ_b when both A_b and B_b are small compared to K_m/2 and ||b||²/2, respectively. Since the number of knots is K_m ≥ 1, and in most problems considered K_m ≥ 5, it is safe to choose A_b ≤ 0.01. When B_b << ||b||²/2, the posterior distribution is practically unaffected by the prior assumptions; as B_b increases relative to ||b||²/2, the conditional distribution is increasingly affected by them. E(τ_b | Y, β, b, τ_ε) is decreasing in B_b, indicating that a B_b that is large compared to ||b||²/2 corresponds to undersmoothing. Since the posterior variance of τ_b is also decreasing in B_b, a poor choice of B_b will likely result in underestimating the variability of the smoothing parameter λ = τ_ε/τ_b, causing confidence intervals for the mean function to be too narrow. The condition B_b << ||b||²/2 shows that the "noninformativeness" of the gamma prior depends essentially on the scale of the problem.
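The magnitude of this effect is easy to check numerically. In the sketch below, K_m = 30 and ||b||²/2 = 10⁻⁶ (the order of magnitude reported below for the LIDAR fit) are illustrative values; once B_b is not small relative to ||b||²/2, the conditional mean of τ_b, and hence the implied amount of smoothing, is driven almost entirely by the prior:

```python
# Conditional mean of tau_b from (3.1): E = (A_b + K_m/2) / (B_b + ||b||^2/2)
Km = 30
half_b2 = 1e-6     # ||b||^2 / 2, illustrative order of magnitude (LIDAR example)
Ab = 0.01          # small shape parameter, as recommended in the text

def tau_b_mean(Bb):
    return (Ab + Km / 2) / (Bb + half_b2)

# B_b spanning "negligible" to "dominant" relative to ||b||^2/2
means = [tau_b_mean(Bb) for Bb in (1e-10, 1e-6, 1e-2)]
```

The conditional mean drops by several orders of magnitude across these three choices of B_b, even though all three would look "small" in a typical mixed-model application.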

To show the potential severity of these effects, we consider model (2.1) with a global smoothing parameter and homoscedastic errors. We used quadratic splines with 30 knots and the light detection and ranging (LIDAR) data for illustration. The LIDAR data were obtained from atmospheric monitoring of pollutants and consist of the range, or distance traveled before light is reflected back to its source, and the log ratio, the logarithm of the ratio of received light from two laser sources. For more details see Ruppert, Wand, Holst, and Hössjer (1997).

Figure 1 shows the effect of four mean-one gamma priors for the precision of the truncated polynomial parameters. The variances of these priors are 10, 10³, 10⁶, and 10¹⁰, respectively. Obviously, the first two priors produce severely undersmoothed, almost indistinguishable posterior means. The third graph is much smoother, but still exhibits roughness, especially in the right-hand side of the plot, while the fourth graph displays a pleasingly smooth pattern consistent with our frequentist inference. Using either prior distribution, one obtains that the posterior mean of ||b||²/2 is of order 10⁻⁶ to 10⁻⁵. This explains why values of B_b larger than 10⁻⁶ proved inappropriate for this problem.

The size of $\|b\|_2^2$ depends upon the scaling of the $x$ and $y$ variables, and in the case of the LIDAR data $\|b\|_2^2$ is small because the standard deviation of $x$ is much larger than the standard deviation of $y$. If $y$ is rescaled to $a_y y$ and $x$ to $a_x x$, then the regression function becomes $a_y m(a_x x)$, whose $p$th derivative is $a_y a_x^p m^{(p)}(a_x x)$, so that $\|b\|_2^2$ is rescaled by the factor $a_y^2 a_x^{2p}$. Thus $\|b\|_2^2$ is particularly sensitive to the scaling of $x$.

Figure 1. LIDAR data with unstandardized covariate: effect of four mean-one Gamma priors for the precision of the truncated polynomial parameters. The variances of these priors are 10, $10^3$, $10^6$, and $10^{10}$, respectively.

The size of $\|b\|_2^2$ also depends on $K_m$. Because the integral of $m^{(p)}$ over the range of $x$ will be approximately $\sum_{k=1}^{K_m} b_k \approx \sqrt{K_m}\,\sigma_b$, we can expect that $\sigma_b^2 \propto K_m^{-1}$, and the smoothing parameter should be proportional to $K_m$. For the LIDAR data, the GCV-chosen smoothing parameter is 0.0095, 0.0205, 0.0440, and 0.0831 for $K_m$ equal to 15, 30, 60, and 120, respectively, so, as expected, the smoothing parameter approximately doubles as $K_m$ doubles.

Practical experience with LMMs for longitudinal or clustered data should be applied with caution to penalized splines. In a mixed effects model, $\|b\|^2$ is an estimator of $K_m \sigma_b^2$. For longitudinal data, $K_m \sigma_b^2$ would generally be large because $K_m$ is the number of subjects with constant subject effect variance $\sigma_b^2$. As just discussed, for a penalized spline $K_m \sigma_b^2$ should be nearly independent of $K_m$ and could be quite small.

Figure 2 presents the same type of results as Figure 1 for Gamma priors for the precision parameter $\tau_b$, with the mean held fixed at $10^{-6}$ and variances equal to 10, $10^3$, $10^6$, and $10^{10}$, respectively. These prior distributions have a much smaller effect on the posterior mean of the regression function. The fit seems to be undersmoothed when the variance is 10. Clearly, when the variance increases the fit becomes smooth, indicating that a value of the variance larger than $10^3$ will produce a reasonable fit.

A similar discussion holds true for $\tau_\varepsilon$, but now large $B_\varepsilon$ corresponds to oversmoothing, and $\tau_\varepsilon$ does not depend on the scaling of $x$. In applications it is less likely that $B_\varepsilon$ is comparable in size to $\|Y - X\boldsymbol{\beta} - Z\mathbf{b}\|^2$, because the latter is an estimator of $n\sigma_\varepsilon^2$. If $\hat\sigma_\varepsilon^2$ is an estimator of $\sigma_\varepsilon^2$, a good rule of thumb is to use values of $B_\varepsilon$ smaller than $n\hat\sigma_\varepsilon^2/100$. This rule should work well when $\hat\sigma_\varepsilon^2$ does not have an extremely large variance.

Figure 2. LIDAR data with unstandardized covariate: effect of four Gamma priors with mean $10^{-6}$ for the precision of the truncated polynomial parameters. The variances of these priors are 10, $10^3$, $10^6$, and $10^{10}$, respectively.

Alternatives to gamma priors are discussed by, for example, Natarajan and Kass (2000) and Gelman (2006). These have the advantage of requiring less care in the choice of the hyperparameters. However, we find that, with reasonable care, the conjugate gamma priors can be used in practice. Nonetheless, exploration of other prior families for penalized splines would be well worthwhile, though beyond the scope of this article.

4. SIMULTANEOUS CREDIBLE BOUNDS

Simultaneous credible intervals have been discussed before in the literature, for example, by Besag, Green, Higdon, and Mengersen (1995). The idea can be used to produce simultaneous credible bounds for smoothly estimated functions. To show this, let $f(\cdot)$ be either $m(\cdot)$, $\log\sigma_\varepsilon^2(\cdot)$, $\log\sigma_b^2(\cdot)$, or a derivative of order $q$, $1 \le q \le p$, of one of these functions. It is straightforward to use MCMC output to construct simultaneous credible bounds on $f$ over an arbitrary finite interval $[x_1, x_N]$. Typically $x_1$ and $x_N$ would be the smallest and largest observed values of $x$.

Let $x_1 < x_2 < \cdots < x_N$ be a fine grid of points on this interval. Let $\hat{E}\{f(x_i)\}$ and $\widehat{\mathrm{SD}}\{f(x_i)\}$ be the posterior mean and standard deviation of $f(x_i)$, estimated from an MCMC sample. Let $M_\alpha$ be the $(1-\alpha)$ sample quantile of $\max_{1\le i\le N} \left| \left[ f(x_i) - \hat{E}\{f(x_i)\} \right] / \widehat{\mathrm{SD}}\{f(x_i)\} \right|$ computed from the realizations of $f$ in the MCMC sample. Then $I(x_i) = \hat{E}\{f(x_i)\} \pm M_\alpha \widehat{\mathrm{SD}}\{f(x_i)\}$, $i = 1, \ldots, N$, are simultaneous credible intervals. For $N$ large, the upper and lower limits of these intervals can be connected to form simultaneous credible bands. In the remainder of this article we only report simultaneous credible bounds for the functions and their derivatives. In the examples considered in the following, these bounds tend to be roughly 30%–50% wider than pointwise credible bounds.
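The construction above is straightforward to code. The sketch below (Python with NumPy, for illustration; it is not our MATLAB/C/FORTRAN implementation) computes $\hat{E}\{f(x_i)\} \pm M_\alpha \widehat{\mathrm{SD}}\{f(x_i)\}$ from a matrix of MCMC realizations of $f$ on the grid:

```python
import numpy as np

def simultaneous_band(F, alpha=0.05):
    # F: (S x N) matrix whose rows are MCMC realizations of
    # (f(x_1), ..., f(x_N)) on a fine grid.
    mean = F.mean(axis=0)
    sd = F.std(axis=0, ddof=1)
    # max over the grid of the standardized deviation, one value per draw
    max_dev = np.max(np.abs(F - mean) / sd, axis=1)
    m_alpha = np.quantile(max_dev, 1.0 - alpha)  # (1 - alpha) sample quantile
    return mean - m_alpha * sd, mean + m_alpha * sd

# toy example with fake "MCMC output" standing in for posterior draws
rng = np.random.default_rng(0)
F = rng.normal(size=(2000, 50)) + np.linspace(0.0, 1.0, 50)
lo, hi = simultaneous_band(F)
```

On this toy example the simultaneous band is wider than the pointwise $\pm 1.96\,\widehat{\mathrm{SD}}$ band at every grid point, consistent with the 30%–50% widening noted above.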

5. COMPARISON WITH OTHER UNIVARIATE SMOOTHERS

Baladandayuthapani et al. (2005) presented a comprehensive simulation study of their Bayesian spatially adaptive penalized spline model, which is obtained as a particular case of our full model with constant error variance. The results of this study indicate that the method is comparable to or better than several other spatially adaptive Bayesian methods on a variety of datasets. Results also showed that the method produced practically indistinguishable inferences from those obtained using the fast frequentist method of Ruppert and Carroll (2000). Since we are unaware of any other estimation methodology that jointly estimates the nonparametric adaptive mean and nonparametric variance models, we will compare our methodology with that of Ruppert and Carroll (2000). We will compare the frequentist properties of our Bayesian methodology, such as MSE and coverage probabilities, calculated by simulating from existing benchmark models.

Consider the regression model

\[ y_i = m(x_i) + \varepsilon_i, \]

where the $\varepsilon_i \sim N(0, \sigma_\varepsilon^2)$ are independent errors. The first function considered was the Doppler function described by Donoho and Johnstone (1994),

\[ m(x) = \sqrt{x(1-x)} \, \sin\left\{ \frac{2\pi(1 + 0.05)}{x + 0.05} \right\}. \]

For the Doppler function we considered $n = 2048$ equally spaced $x$'s on $[0, 1]$ and $\sigma_\varepsilon^2 = 1$.
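For concreteness, this test bed can be sketched in a few lines of Python (illustrative only; the random number generator and seed are our own choices):

```python
import numpy as np

# Doppler function of Donoho and Johnstone (1994) as used in Section 5:
# n = 2048 equally spaced x's on [0, 1], independent N(0, 1) errors.
def doppler(x, eps=0.05):
    return np.sqrt(x * (1.0 - x)) * np.sin(2.0 * np.pi * (1.0 + eps) / (x + eps))

n = 2048
x = np.linspace(0.0, 1.0, n)
rng = np.random.default_rng(1)        # assumed seed, for reproducibility
y = doppler(x) + rng.normal(scale=1.0, size=n)  # sigma_eps^2 = 1
```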

Ruppert and Carroll (2000) proposed the following family of models, related to the Doppler function but allowing for more shape flexibility:

\[ m(x; j) = \sqrt{x(1-x)} \, \sin\left\{ \frac{2\pi\left(1 + 2^{(9-4j)/5}\right)}{x + 2^{(9-4j)/5}} \right\}, \]

where $j = 3$ and $j = 6$ correspond to low and severe spatial heterogeneity, respectively. For this function we considered $n = 400$ equally spaced $x$'s on $[0, 1]$ and $\sigma_\varepsilon^2(x) = 0.04$. We will refer to this as the RC$_j$ function, where $j$ controls the amount of spatial heterogeneity.

Ruppert and Carroll (2000) and Baladandayuthapani et al. (2005) also considered the following function:

\[ m(x) = \exp\{-400(x - 0.6)^2\} + \frac{5}{3} \exp\{-500(x - 0.75)^2\} + 2 \exp\{-500(x - 0.9)^2\}, \tag{5.1} \]


Table 1. Average mean square error (AMSE) for different mean functions based on 100 simulations. The column "True variance" indicates whether the data-generating process is homoscedastic or heteroscedastic. The columns under "Fitted error variance" refer to the type of assumption used for fitting the data: "Homoscedastic" uses the Ruppert and Carroll (2000) fitting algorithm, while "Heteroscedastic" uses the Bayesian inferential algorithm described in this article. The mean and log-variance functions were modeled using degree 2 splines with $K_m = K_\varepsilon = 40$ equally spaced knots. The shrinkage process was modeled using degree 1 splines with $K_b = 4$ subknots.

                                 Fitted error variance
Function      True variance      Homoscedastic    Heteroscedastic
Doppler       Constant           0.0197           0.0218
RC$_{j=3}$    Constant           0.0011           0.0012
RC$_{j=6}$    Constant           0.0061           0.0065
3 Bumps       Constant           0.0054           0.0055
3 Bumps       Variable           0.0031           0.0026

which is roughly constant on $[0, 0.5]$ and has three sharp peaks at 0.6, 0.75, and 0.9. For this function we considered $n = 1000$ equally spaced $x$'s on $[0, 1]$ and two different scenarios for the standard deviation of the error process: (a) $\sigma_\varepsilon(x) = 0.5$ for homoscedastic errors, and (b) $\sigma_\varepsilon(x) = 0.5 - 0.8x + 1.6(x - 0.5)_+$ for heteroscedastic errors. We will refer to this as the "three-bump" function.
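The three-bump test bed, including the heteroscedastic standard deviation of scenario (b), can be sketched as follows (Python, for illustration; the generator and seed are assumptions):

```python
import numpy as np

# "Three-bump" benchmark (5.1) with the error standard deviation of
# scenario (b) in Section 5.
def three_bump(x):
    return (np.exp(-400.0 * (x - 0.6) ** 2)
            + (5.0 / 3.0) * np.exp(-500.0 * (x - 0.75) ** 2)
            + 2.0 * np.exp(-500.0 * (x - 0.9) ** 2))

def sigma_eps(x):
    # 0.5 - 0.8 x + 1.6 (x - 0.5)_+ : falls to 0.1 at x = 0.5, then rises
    return 0.5 - 0.8 * x + 1.6 * np.maximum(x - 0.5, 0.0)

n = 1000
x = np.linspace(0.0, 1.0, n)
rng = np.random.default_rng(2)
y = three_bump(x) + sigma_eps(x) * rng.normal(size=n)
```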

We used 100 simulations from these models and fit a spatially adaptive penalized spline that uses a constant variance. For the three-bump function we also fitted a model with nonconstant variance modeled as a log-penalized spline. For the mean and log-variance functions we used quadratic penalized splines with $K_m = K_\varepsilon = 40$ knots. For the log-variance corresponding to the shrinkage process we used $K_b = 4$ knots.

We calculated the mean square error for each simulated dataset as

\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \{\hat{m}(x_i) - m(x_i)\}^2, \]

where $\hat{m}(x)$ is the posterior mean of $m(x)$ under a given model. Table 1 provides a summary of the average MSE (AMSE) results over 100 simulations, comparing the method that allows for heteroscedastic errors with the one that does not. Remarkably, our method that estimates a nonconstant variance performed well compared to the method that assumes a constant variance even when the variance is in fact constant. In contrast, when the true variance process is not constant, our method performed better.

Our method was slightly outperformed by the Ruppert and Carroll (2000) method in the Doppler case. This is probably because the Ruppert and Carroll method assumes a constant variance, and for the Doppler function the variance used in the simulations was indeed $\sigma_\varepsilon^2 = 1$. However, both methods outperformed the method of Baladandayuthapani et al. (2005), who reported an AMSE of 0.024 for the Doppler function.

For the three-bump function with constant variance $\sigma_\varepsilon(x) = 0.5$, the two models produced practically indistinguishable MSEs. Note that our AMSE over all $x$'s using 100 simulations was 0.0054, which is smaller than the 0.0061 reported by Baladandayuthapani et al. (2005). The coverage probabilities of the 95% credible intervals were very similar for the two models for each value of the covariate, with a slight advantage in favor of the model using homoscedastic errors. The average coverage probability was 94.7% for the model with homoscedastic errors and 93.5% for the model with heteroscedastic errors. These coverage probabilities are very similar to the ones reported by Baladandayuthapani et al. (2005) in their Figure 3.

Figure 3. Comparison of pointwise coverage probabilities of the 95% credible intervals in 500 simulations. Data were simulated from the model described in Section 5 with heteroscedastic error variance. Estimation was done using two methods: Bayesian adaptive penalized splines with homoscedastic errors (dashed) and Bayesian adaptive penalized splines with heteroscedastic errors (solid). The coverage probabilities have been smoothed to remove Monte Carlo variability.

For case (b), $\sigma_\varepsilon(x) = 0.5 - 0.8x + 1.6(x - 0.5)_+$, the heteroscedastic model substantially outperformed the homoscedastic model, both in terms of AMSE and coverage probabilities. In this situation it would be misleading to compare only the average coverage probability for the 95% credible intervals. Indeed, these averages are very close, 94.1% for the heteroscedastic and 93.0% for the homoscedastic method, but they are obtained from very different sets of pointwise coverage probabilities. Figure 3 displays the coverage probabilities for these two methods. Note that for the heteroscedastic method the coverage probability on the interval $[0, 0.5]$ is close to 95%. On the same interval, the coverage probability for the homoscedastic method starts from around 0.8 and increases until it crosses the 95% target around 0.2. Moreover, on the interval $[0.3, 0.5]$ this coverage probability is estimated to be 1. This is due to the fact that on this interval $\sigma_\varepsilon(x)$ decreases linearly from 0.5 to 0.1. Since the homoscedastic method assumes a constant variance, the credible intervals will tend to be shorter than nominal in regions of higher variability ($x$ close to zero), thus producing lower coverage probabilities. In regions of smaller variability ($x$ close to 0.5), the credible intervals will tend to be much wider, thus producing extremely large coverage probabilities.

In the interval $[0.55, 0.65]$ the homoscedastic method slightly outperforms the heteroscedastic method. This seems to be the effect of a "lucky" combination of two factors. As we discussed, the credible intervals for the homoscedastic method in a neighborhood of 0.5 are much wider than nominal. However, at the same point the mean function changes from a constant to a rapidly oscillating function, and the coverage probabilities drop at roughly the same rate. In the interval $[0.8, 1]$ one can notice a phenomenon very similar to the one described for the interval $[0, 0.5]$. While the function oscillates more rapidly in this region, the heteroscedastic adaptive method produces credible intervals with coverage probabilities close to the nominal 95%. However, the homoscedastic method does not take into account the increased variability and produces credible intervals that are too short.

For these examples, both methods performed roughly similarly on simulated data sets that did not require variance estimation. However, when the error variance is not constant, the method using log-penalized splines to estimate the error variance considerably outperforms the method that does not.

6. LOW RANK MULTIVARIATE SMOOTHING

In this section we generalize the ideas in Section 2 to multivariate smoothing while preserving the appealing geometric interpretation of the smoother. Consider the regression $y_i = m(\mathbf{x}_i) + \varepsilon_i$, where $m(\cdot)$ is a smooth function of $L$ covariates. We will use radial basis functions, which have the advantage of being more numerically stable than the tensor product of truncated polynomials in more than one dimension. Suppose that $\mathbf{x}_i \in \mathbb{R}^L$, $1 \le i \le n$, are $n$ vectors of covariates and $\boldsymbol{\kappa}_k \in \mathbb{R}^L$, $1 \le k \le K_m$, are $K_m$ knots. Consider the following distance function:

\[ C(\mathbf{r}) = \begin{cases} \|\mathbf{r}\|^{2M-L} & \text{for } L \text{ odd}, \\ \|\mathbf{r}\|^{2M-L} \log\|\mathbf{r}\| & \text{for } L \text{ even}, \end{cases} \]

where $\|\cdot\|$ denotes the Euclidean norm in $\mathbb{R}^L$ and the integer $M$ controls the smoothness of $C(\cdot)$. Let $X$ be the matrix with $i$th row $X_i = [1, \mathbf{x}_i^T]$, let $Z_{K_m} = \{C(\|\mathbf{x}_i - \boldsymbol{\kappa}_k\|)\}_{1 \le i \le n,\, 1 \le k \le K_m}$ and $\Omega_{K_m} = \{C(\|\boldsymbol{\kappa}_k - \boldsymbol{\kappa}_{k'}\|)\}_{1 \le k \le K_m,\, 1 \le k' \le K_m}$, and define $Z = Z_{K_m} \Omega_{K_m}^{-1/2}$, where $\Omega_{K_m}^{-1/2}$ is the principal square root of $\Omega_{K_m}^{-1}$. With these notations, the low rank approximation of thin plate spline regression can be obtained as the BLUP in the LMM (Kammann and Wand 2003; Ruppert, Wand, and Carroll 2003):

\[ Y = X\boldsymbol{\beta} + Z\mathbf{b} + \boldsymbol{\varepsilon}, \qquad E\begin{pmatrix} \mathbf{b} \\ \boldsymbol{\varepsilon} \end{pmatrix} = \begin{pmatrix} \mathbf{0} \\ \mathbf{0} \end{pmatrix}, \qquad \operatorname{cov}\begin{pmatrix} \mathbf{b} \\ \boldsymbol{\varepsilon} \end{pmatrix} = \begin{pmatrix} \sigma_b^2 I_{K_m} & 0 \\ 0 & \sigma_\varepsilon^2 I_n \end{pmatrix}. \tag{6.1} \]

This model contains only one variance component, $\sigma_b^2$, controlling the shrinkage of $\mathbf{b}$, which is equivalent to one global smoothing parameter $\lambda = \sigma_b^2/\sigma_\varepsilon^2$, and it implicitly assumes homoscedastic errors. To relax these assumptions we consider a new set of knots $\boldsymbol{\kappa}_1^*, \ldots, \boldsymbol{\kappa}_{K_b}^*$ and define $X^*$ and $Z^*$ analogously to the matrices $X$ and $Z$, with the $x$-covariates replaced by the knots $\boldsymbol{\kappa}_k$ and the knots replaced by the subknots $\boldsymbol{\kappa}_k^*$. Consider the following model for the response:

\[ \begin{cases} y_i = \beta_0 + \sum_{j=1}^{L} \beta_j x_{ij} + \sum_{k=1}^{K_m} b_k z_{ik} + \varepsilon_i, \\ \varepsilon_i \sim N\{0, \sigma_\varepsilon^2(\mathbf{x}_i)\}, \quad i = 1, \ldots, n, \\ b_k \sim N\{0, \sigma_b^2(\boldsymbol{\kappa}_k)\}, \quad k = 1, \ldots, K_m, \end{cases} \tag{6.2} \]

where the error variance $\log\sigma_\varepsilon^2(\mathbf{x}_i)$ is modeled as a low rank log-thin plate spline,

\[ \log\sigma_\varepsilon^2(\mathbf{x}_i) = \gamma_0 + \sum_{j=1}^{L} \gamma_j x_{ij} + \sum_{k=1}^{K_\varepsilon} c_k z_{ik}, \qquad c_k \sim N(0, \sigma_c^2), \quad k = 1, \ldots, K_\varepsilon, \tag{6.3} \]

and the random coefficient variance $\log\sigma_b^2(\boldsymbol{\kappa}_k)$ is modeled as another low rank log-thin plate spline,

\[ \log\sigma_b^2(\boldsymbol{\kappa}_k) = \delta_0 + \sum_{j=1}^{L} \delta_j x_{kj}^* + \sum_{j=1}^{K_b} d_j z_{kj}^*, \qquad d_j \sim N(0, \sigma_d^2), \quad j = 1, \ldots, K_b, \tag{6.4} \]

where $x_{kj}^*$ and $z_{kj}^*$ are the entries of the $X^*$ and $Z^*$ matrices, respectively, and the $c_k$ and $d_j$ are assumed mutually independent.

In the case of low rank smoothers, the set of knots for the covariates and the subknots for modeling the shrinkage process have to be chosen. One possibility is to use equally spaced knots and subknots. We use this approach in our simulation study in Section 7 to make our method comparable, from this point of view, with previously published methods. Another possibility is to select the knots and subknots using a space-filling design (Nychka and Saltzman 1998), which is based on the maximal separation principle. This avoids wasting knots and is likely to lead to better approximations in sparse regions of the data. The cover.design() function from the R package fields (Fields Development Team 2006) provides software for space-filling knot selection. We use this approach in Section 8 in the Noshiro elevation example.

7. COMPARISONS WITH OTHER ADAPTIVE SURFACE FITTING METHODS

Lang and Brezger (2004) compared their adaptive surface smoothing method with several methods used in a simulation study by Smith and Kohn (1997): MARS (Friedman 1991), "locfit" (Cleveland and Grosse 1991), "tps" (bivariate cubic thin plate splines with a single smoothing parameter), tensor product cubic smoothing splines with five smoothing parameters, and a parametric linear interaction model. We will use the following regression functions, also used by Lang and Brezger:

• $m_1(x_1, x_2) = x_1 \sin(4\pi x_2)$, where $x_1$ and $x_2$ are distributed independently uniform on $[0, 1]$;

• $m_2(x_1, x_2) = 1.5 \exp(-8x_1^2) + 3.5 \exp(-8x_2^2)$, where $x_1$ and $x_2$ are distributed independently normal with mean 0.5 and variance 0.1.


Function $m_1(\cdot)$ contains complex interactions and moderate spatial variability, while $m_2(\cdot)$ has only main effects and is much smoother. We used a sample size of $n = 300$, $\sigma = \operatorname{range}(f)/4$, and 250 simulations from each model.

We will compare the performance of our method with that of Lang and Brezger's, which performed at least as well as or better than other adaptive methods in a variety of examples. To avoid dependence of the results on knot choice, we used the same choice as the one described by Lang and Brezger: $12 \times 12$ equally spaced knots for the mean function. However, we used thin-plate splines instead of tensor products of B-splines. We modeled the smoothing parameter using a low rank log-thin plate spline with an equally spaced $5 \times 5$ knot grid. We considered two estimators: one with constant error variance (CEV) and one with estimated error variance using a log-thin plate spline (EEV) with the same $12 \times 12$ knot grid as for the mean function. The CEV estimator is the one that can be directly compared with the estimators considered by Lang and Brezger (2004). Because Lang and Brezger presented only comparative results using boxplots, we will use their results as read from their Figure 5. The performance of all estimators was measured by the log of the mean squared error, $\log(\mathrm{MSE})$, where $\mathrm{MSE}(\hat f) = \frac{1}{n} \sum_{i=1}^{n} \{\hat f(x_i) - f(x_i)\}^2$.

For the CEV estimator and function $f_1(\cdot)$, corresponding to moderate spatial heterogeneity, we obtained a median $\log(\mathrm{MSE})$ of $-3.67$, with an interquartile range $[-3.80, -3.53]$ and a range $[-4.21, -3.13]$. These values are better than the best values reported for Lang and Brezger's methods in their Figure 5(b). Lang and Brezger reported that their method, denoted "P-spline," outperforms all other methods in this case, achieving a median $\log(\mathrm{MSE})$ of $-3.50$. Remarkably, the EEV estimator performed almost as well as the CEV estimator in this case and also outperformed the other methods. The median $\log(\mathrm{MSE})$ for EEV was $-3.59$, with an interquartile range $[-3.74, -3.43]$ and a range $[-4.14, -3.02]$. These results are consistent with univariate results reported by Ruppert and Carroll (2000) and Baladandayuthapani, Mallick, and Carroll (2005), who found that allowing the smoothing parameter to vary smoothly outperforms other adaptive techniques.

For function $f_1(\cdot)$, Figure 4 displays the coverage probabilities for the 95% credible intervals calculated over a $20 \times 20$ equally spaced grid in $[0, 1]^2$. For each grid point we calculated the frequency with which the 95% pointwise credible interval covers the true value of the function at that point. As expected, these coverage probabilities show strong spatial correlation, with lower coverage probabilities along the ridges of the sine function. Coverage probability is lowest when $x_1$ is in the $[0.2, 0.5]$ range. The signal-to-noise ratio in this region is about half the signal-to-noise ratio in the region corresponding to $x_1$ close to 1, which explains why the coverage probability is smaller here. Interestingly, the zeros of the true function, appearing at $x_2 = 0.25, 0.50, 0.75$, are covered with at least 95% probability. Another interesting feature is the lower coverage probabilities near the north, south, and east boundaries. These features of the pointwise coverage probability map are consistent with the frequentist analysis of Krivobokova et al. (2007), but are not consistent with the coverage probabilities reported by Lang and Brezger for their Bayesian P-spline method.

Figure 4. Coverage probability of the pointwise 95% credible intervals for the mean function, for $f_1(x_1, x_2) = x_1 \sin(4\pi x_2)$ with a constant error standard deviation $\sigma = \operatorname{range}(f)/4$. Probabilities are computed on a $20 \times 20$ equally spaced grid of points in $[0, 1]^2$. The model used was a thin plate spline with $K = 100$ knots for the mean function, a log-thin plate spline with $K = 100$ knots for the variance function, and a log-thin plate spline with $K^* = 16$ knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was $M = 2$.

For our CEV estimator and function $f_2(\cdot)$, corresponding to very low spatial heterogeneity, we obtained a median $\log(\mathrm{MSE})$ of $-5.95$, with an interquartile range $[-6.12, -5.66]$ and a range $[-6.51, -5.41]$. The results of Lang and Brezger in their Figure 5(a) indicate that their "P-spline" method achieved a median $\log(\mathrm{MSE})$ of $-6.00$. Our method performs roughly similarly to the Bayesian P-spline method of Lang and Brezger and to the cubic thin plate spline (tps), being outperformed only by Lang and Brezger's adaptive P-spline method with two main effects, "P-spline main." We could also add two main effects to our bivariate adaptive smoother to improve MSE for this example, but this is not our concern in this article. Again, the EEV estimator performed very similarly to our CEV estimator.

The last simulation study was done using the function $f_1(\cdot, \cdot)$, where $x_1, x_2$ are independent uniformly distributed on $[0, 1]$, with error standard deviation function

\[ \sigma_\varepsilon(x_1, x_2) = \frac{r}{32} + \frac{3r}{32} x_1^2, \]

where $r = \operatorname{range}(f_1)$. We compared our CEV and EEV estimators described at the beginning of this section and, as expected, the EEV estimator outperformed the CEV estimator both in terms of $\log(\mathrm{MSE})$ and coverage probabilities. More precisely, for $\log(\mathrm{MSE})$ we obtained a median of $-5.26$ for CEV and $-5.35$ for EEV, an interquartile range of $[-5.37, -5.15]$ for CEV and $[-5.48, -5.24]$ for EEV, and a range of $[-5.72, -4.65]$ for CEV and $[-5.84, -4.84]$ for EEV. For both estimators the general patterns of the coverage probabilities were the ones displayed in Figure 4, and EEV outperformed CEV in terms of coverage probabilities. For example, the regions where the nominal level is exceeded for both estimators have a roughly similar shape and location, but the EEV coverage probabilities tend to be closer to their nominal level.

Figure 5. Locations of sampling (dots) with 100 knots (circles) in the Noshiro example.

Note that our model incorporating nonparametric variance estimation is more general than the models considered in the literature. However, its performance for function $m_1(\cdot, \cdot)$, with moderate spatial variability and constant variance, was better than that of all methods considered by Lang and Brezger in their simulation study (see their Figure 5, panel b). Our method also has good performance for the function $m_2(\cdot, \cdot)$, being the second best when compared to the methods considered by Lang and Brezger (see their Figure 5, panel a). For the case when the variance is not constant, our adaptive method that estimates the variance nonparametrically outperformed our adaptive method that assumes a constant variance.

8. THE NOSHIRO EXAMPLE

Noshiro, Japan, was the site of a major earthquake. Much of the damage from the quake was due to soil movement. Professor Thomas O'Rourke, a geotechnical engineer, was investigating the factors that might help predict soil movement during a future quake. One factor thought to be of importance was the slope of the land. To estimate the slope, Ruppert (1997) used a dataset with 799 observations, where $x$ was longitude and latitude and $y$ was elevation at locations in Noshiro. The object of primary interest is the gradient, at least its magnitude and possibly its direction.

We used the thin-plate spline model (6.2) for the mean of $y$, with $K_m = 100$ knots selected by the maximum separation criterion implemented in the function cover.design() of the R package fields (Nychka, Haaland, O'Connell, and Ellner 1998). The sampling locations and the knots used for fitting are displayed in Figure 5. The log-thin plate spline described in (6.3), with the same $K_\varepsilon = 100$ knots, was used to model the variance of the error process, $\sigma_\varepsilon^2(\mathbf{x}_i)$. The log-thin plate spline described in (6.4), with $K_b = 16$ subknots, was used to model the variance $\sigma_b^2(\boldsymbol{\kappa}_k)$, which controls the amount of shrinkage of the $b$ parameters. The subknots were obtained using the cover.design() function applied to the 100 knots.

Figure 6. Posterior mean of the mean regression function for the Noshiro example. The model used was a thin plate spline with $K = 100$ knots for the mean function, a log-thin plate spline with $K = 100$ knots for the variance function, and a log-thin plate spline with $K^* = 16$ knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was $M = 2$.

Figure 6 displays the posterior mean of the mean response function. There is a rather sharp peak around $(0.42, 0.4)$, with a slow decay from the peak in the general south direction and a much faster decay in all other directions. The function becomes smoother towards the boundary. This is exactly the type of function for which adaptive spatial smoothing can substantially improve the fit.

Figure 7 shows that in large areas of the map $\sigma_\varepsilon(\cdot)$ is smooth and takes small values. However, in a neighborhood of the peak of the mean function, $\sigma_\varepsilon(\cdot)$ displays two relatively sharp maxima, located NNW and SSE of the peak. We also performed a two-stage frequentist analysis. In the first stage we used a low rank thin plate spline for the mean function, and the results from this regression were used to obtain residuals. We then fitted another low rank thin plate spline to the absolute values of these residuals. While this map was not identical to the one in Figure 7, it did exhibit the same patterns of heteroscedasticity.

The process $\log\sigma_b^2(\boldsymbol{\kappa}_k)$, controlling the shrinkage of the $b_k$, is displayed in Figure 8.


Figure 7. Posterior mean of the standard deviation $\sigma_\varepsilon(\mathbf{x}_i)$ of the error process for the Noshiro example. The model used was a thin plate spline with $K = 100$ knots for the mean function, a log-thin plate spline with $K = 100$ knots for the variance function, and a log-thin plate spline with $K^* = 16$ knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was $M = 2$.

Figure 8. Posterior mean of the shrinkage process $-\log\sigma_b^2(\boldsymbol{\kappa}_k)$. The model used was a thin plate spline with $K = 100$ knots for the mean function, a log-thin plate spline with $K = 100$ knots for the variance function, and a log-thin plate spline with $K^* = 16$ knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was $M = 2$.


Smaller values of this function correspond to less shrinkage of $b_k$ towards zero and more local behavior of the smoother. Figure 8 indicates that in a neighborhood of the peak of the mean function the shrinkage is smaller, to allow the mean function to change rapidly. Away from the peak, the shrinkage is larger, corresponding to a smoother mean.

9. IMPLEMENTATION USING MCMC

We will now provide some details for MCMC simulation of model (2.1), where the variances $\sigma_\varepsilon^2(x_i)$ and $\sigma_b^2(\kappa_k)$ are modeled by Equations (2.3) and (2.4). The implementation for other models, for example for the multivariate smoothing in Section 6, is similar. Consider independent normal priors for the coefficients of the monomials, $\beta_i \sim N(0, \sigma_{0\beta}^2)$, $\gamma_i \sim N(0, \sigma_{0\gamma}^2)$, $i = 0, \ldots, p$, and $\delta_i \sim N(0, \sigma_{0\delta}^2)$, $i = 0, \ldots, q$, and independent inverse Gamma priors for the variance components, $\sigma_c^2 \sim \mathrm{IGamma}(a_c, b_c)$ and $\sigma_d^2 \sim \mathrm{IGamma}(a_d, b_d)$. Using these priors, many full conditionals of the posterior distribution are easy to derive, while a few have complex multivariate forms. Our implementation of the MCMC using multivariate Metropolis–Hastings steps proved to be unstable, with poor mixing properties. A simple and reliable solution was to change the model by adding error terms to the log-spline models, that is,

\[ \begin{aligned} \log\{\sigma_\varepsilon^2(x_i)\} &= \gamma_0 + \cdots + \gamma_p x_i^p + \sum_{s=1}^{K_\varepsilon} c_s (x_i - \kappa_s^\varepsilon)_+^p + u_i, \\ \log\{\sigma_b^2(\kappa_k^m)\} &= \delta_0 + \cdots + \delta_q (\kappa_k^m)^q + \sum_{s=1}^{K_b} d_s (\kappa_k^m - \kappa_s^b)_+^q + v_k, \end{aligned} \tag{9.1} \]

where $u_i \sim N(0, \sigma_u^2)$ and $v_k \sim N(0, \sigma_v^2)$. This idea was also used for $\sigma_b^2$ by Baladandayuthapani, Mallick, and Carroll (2005). We fixed the values $\sigma_u^2 = \sigma_v^2 = 0.01$, as these variances appear nonidentifiable and a standard deviation of 0.1 is small on the log scale. This device makes computations feasible by replacing sampling from complex full conditionals with simple univariate MH steps.

Define, as before, $\boldsymbol{\varepsilon}$, $\mathbf{b}$, and $\boldsymbol{\theta}_X$, and define $\boldsymbol{\theta}_\varepsilon = (\boldsymbol{\gamma}^T, \mathbf{c}^T)^T$, $\boldsymbol{\theta}_b = (\boldsymbol{\delta}^T, \mathbf{d}^T)^T$, $C_X = [X\; Z]$, $C_\varepsilon = [X_\varepsilon\; Z_\varepsilon]$, and $C_b = [X_b\; Z_b]$, where $X$, $X_\varepsilon$, $X_b$ contain the monomials and $Z$, $Z_\varepsilon$, $Z_b$ contain the truncated polynomials of the spline models for $m$, $\log\sigma_\varepsilon^2$, and $\log\sigma_b^2$, respectively. Also denote

\[ \Sigma_{0X} = \begin{bmatrix} \sigma_{0\beta}^2 I_{p+1} & 0 \\ 0 & \Sigma_b \end{bmatrix}, \qquad \Sigma_{0\varepsilon} = \begin{bmatrix} \sigma_{0\gamma}^2 I_{p+1} & 0 \\ 0 & \sigma_c^2 I_{K_\varepsilon} \end{bmatrix}, \qquad \Sigma_{0b} = \begin{bmatrix} \sigma_{0\delta}^2 I_{q+1} & 0 \\ 0 & \sigma_d^2 I_{K_b} \end{bmatrix}, \]

where $\Sigma_b = \operatorname{diag}\{\sigma_b^2(\kappa_1^m), \ldots, \sigma_b^2(\kappa_{K_m}^m)\}$.

The full conditionals of the posterior are detailed below.

1. $[\boldsymbol{\theta}_X] \sim N\!\left(M_X C_X^T \Sigma_\varepsilon^{-1} Y,\; M_X\right)$, where $M_X = \left(C_X^T \Sigma_\varepsilon^{-1} C_X + \Sigma_{0X}^{-1}\right)^{-1}$ and $\Sigma_\varepsilon = \operatorname{diag}\{\sigma_{\varepsilon 1}^2, \ldots, \sigma_{\varepsilon n}^2\}$.

2. $[\boldsymbol{\theta}_\varepsilon] \sim N\!\left(M_\varepsilon C_\varepsilon^T Y_\varepsilon / \sigma_u^2,\; M_\varepsilon\right)$, where $Y_\varepsilon = [\log(\sigma_{\varepsilon 1}^2), \ldots, \log(\sigma_{\varepsilon n}^2)]^T$ and $M_\varepsilon = \left(C_\varepsilon^T C_\varepsilon / \sigma_u^2 + \Sigma_{0\varepsilon}^{-1}\right)^{-1}$.

3. $[\boldsymbol{\theta}_b] \sim N\!\left(M_b C_b^T Y_b / \sigma_v^2,\; M_b\right)$, where $Y_b = [\log(\sigma_1^2), \ldots, \log(\sigma_{K_m}^2)]^T$ and $M_b = \left(C_b^T C_b / \sigma_v^2 + \Sigma_{0b}^{-1}\right)^{-1}$.

4. $[\sigma_c^2] \sim \mathrm{IGamma}\!\left(a_c + K_\varepsilon/2,\; b_c + \|\mathbf{c}\|^2/2\right)$.

5. $[\sigma_d^2] \sim \mathrm{IGamma}\!\left(a_d + K_b/2,\; b_d + \|\mathbf{d}\|^2/2\right)$.

6. $[\sigma_{\varepsilon i}^2] \propto \sigma_{\varepsilon i}^{-3} \exp\!\left\{ -(y_i - \mu_i)^2/(2\sigma_{\varepsilon i}^2) - [\log(\sigma_{\varepsilon i}^2) - \eta_i]^2/(2\sigma_u^2) \right\}$, where $\mu_i$ and $\eta_i$ are the $i$th components of $C_X\boldsymbol{\theta}_X$ and $C_\varepsilon\boldsymbol{\theta}_\varepsilon$, respectively.

7. $[\sigma_k^2] \propto \sigma_k^{-3} \exp\!\left\{ -b_k^2/(2\sigma_k^2) - [\log(\sigma_k^2) - \zeta_k]^2/(2\sigma_v^2) \right\}$, where $\zeta_k$ is the $k$th component of $C_b\boldsymbol{\theta}_b$.

Figure 9. Computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with $K_m$ knots, with the log of the variance of the shrinkage process modeled as another penalized spline with $K_b = 4$ subknots. The log of the error variance is modeled as another penalized spline with $K_\varepsilon = K_m$ knots. Different symbols correspond to increasing numbers of knots: dots ($K_m = 10$), asterisks ($K_m = 20$), circles ($K_m = 30$), and x's ($K_m = 40$).

All of the above conditionals have explicit forms, with the exception of the $n + K_m$ one-dimensional conditionals in 6 and 7. For these distributions we use the Metropolis–Hastings algorithm with a normal proposal distribution centered at the current value and with a small variance.
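For example, the update of a single error variance $\sigma_{\varepsilon i}^2$ in conditional 6 can be sketched as a random-walk Metropolis–Hastings step on $\theta = \log\sigma_{\varepsilon i}^2$; applying the normal proposal on the log scale is one reading of the recipe above, and the step size below is an assumed tuning value (Python, illustrative, not our MATLAB/C/FORTRAN code):

```python
import numpy as np

def mh_update_log_s2(t, resid, eta, sigma_u2, step, rng):
    # One random-walk MH update for theta = log(sigma_eps_i^2).
    # Conditional 6 transformed to the log scale (including the Jacobian
    # of sigma^2 -> theta) has log density, up to an additive constant:
    #   -theta/2 - resid^2 / (2 e^theta) - (theta - eta)^2 / (2 sigma_u^2),
    # where resid = y_i - mu_i and eta is the i-th component of C_eps theta_eps.
    def lp(th):
        return (-0.5 * th - resid ** 2 / (2.0 * np.exp(th))
                - (th - eta) ** 2 / (2.0 * sigma_u2))
    prop = t + step * rng.normal()
    if np.log(rng.uniform()) < lp(prop) - lp(t):
        return prop
    return t

# short demonstration run for a single observation (assumed toy values)
rng = np.random.default_rng(0)
resid, eta, sigma_u2 = 0.1, np.log(0.01), 0.01
chain = np.empty(2000)
t = eta
for s in range(2000):
    t = mh_update_log_s2(t, resid, eta, sigma_u2, step=0.2, rng=rng)
    chain[s] = t
```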

10. IMPLEMENTATION AND COMPUTATIONAL PERFORMANCE

Our MCMC simulation algorithm provided reliable results and is reasonably fast. To illustrate this, Figure 9 displays computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with $K_m$ knots, with the log of the variance of the shrinkage process modeled as another penalized spline with $K_b = 4$ subknots. The log of the error variance is modeled as another penalized spline with $K_\varepsilon = K_m$ knots. Different symbols correspond to increasing numbers of knots: dots ($K_m = 10$), asterisks ($K_m = 20$), circles ($K_m = 30$), and x's ($K_m = 40$). Computation time increases roughly linearly with the number of observations, with the same intercept of 0 and larger slopes for larger numbers of knots. Note that even with $n = 1000$ observations and $K_m = K_\varepsilon = 40$, simulation time is about two minutes.

Figure 10. MCMC histories for eight parameters of interest, including the log of the standard deviation of the error process and the log of the standard deviation of the shrinkage process, as well as some parameters modeling the different functions. The model uses 40 equally spaced knots for the mean and error variance functions and 4 subknots for the adaptive shrinkage process. Chains correspond to the Doppler function example, using quadratic truncated polynomials to fit the mean and log-variance functions and a linear truncated polynomial basis for fitting the shrinkage process. For clarity of presentation, only every 20th simulated observation is displayed.

For the models and datasets presented in this article, 10,000 simulations proved sufficient for accurate and reliable inference. Some parameters, such as the mean and variance functions, tended to exhibit much better mixing properties than the variance components $\sigma_c^2$ and $\sigma_d^2$ controlling the shrinkage processes. This is probably due to the reduced amount of information in the data about these parameters. The MCMC convergence and mixing properties were assessed by visual inspection of the chain histories of many parameters of interest. Since we started with good starting values of the parameters, convergence is fast. Figure 10 displays the histories of eight parameters of interest for the model fitted to the Doppler function described in Section 5. For clarity, we display only every 100th sampled value of the chain, but the same patterns would be observed in the full chain history. Similar good mixing properties have been noted in all the other examples presented in this article.

In the search for a good software platform, we implemented our algorithms in MATLAB, C, and FORTRAN. Programs can be downloaded from www.biostat.jhsph.edu/~ccrainic. For our application and implementation there were no serious differences between C and FORTRAN, which performed better than MATLAB by reducing computation time two to three times. The reported times in Figure 9 are for our implementation in C.

Given the complexity of the model, we were interested in improving the mixing properties of the Markov chains. The computational trick described in Section 9 was the sine qua non method that allowed us to replace an intractable multivariate MH step with several tractable univariate MH steps. We used normal random walk proposal distributions centered at the current value, with variance tuned to provide acceptance rates between 30% and 50%.
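A simple burn-in heuristic for steering a proposal standard deviation into such an acceptance band might look as follows. The multiplicative adjustment factors are our own assumption for illustration, not a recipe from the paper.

```python
def tune_prop_sd(prop_sd, acc_rate, low=0.30, high=0.50):
    """Adjust a random-walk proposal sd so that the empirical acceptance
    rate moves into [low, high]; called periodically during burn-in only,
    so the final chain remains a valid Metropolis-Hastings sampler."""
    if acc_rate < low:
        return prop_sd * 0.8    # rejecting too often: take smaller steps
    if acc_rate > high:
        return prop_sd * 1.25   # accepting too often: take larger steps
    return prop_sd
```

In practice one would track the acceptance rate over, say, each block of a few hundred burn-in iterations and call this after each block.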

11. COMMENTS

Our joint models for heterogeneous mean and nonconstant variance are designed to extract the maximum amount of information from complex data and suggest simpler parametric models. Bayesian inference based on MCMC offers a natural and powerful framework for model fitting. However, MCMC in this context is not straightforward, and implementation depends essentially on the ability to transform intractable simulations from high-dimensional distributions into tractable one-dimensional updates, as described in Section 9. Our implementation provides accurate, fast, and reproducible results, and software is available. Actual simulation time increases only linearly with the number of observations, and mixing properties are adequate provided that proposal distribution variances are tuned for optimal acceptance rates.

ACKNOWLEDGMENTS

Carroll's research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

[Received September 2004. Revised October 2006.]

REFERENCES

Baladandayuthapani, V., Mallick, B. K., and Carroll, R. J. (2005), "Spatially Adaptive Bayesian Regression Splines," Journal of Computational and Graphical Statistics, 14, 378–394.

Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computation and Stochastic Systems" (with discussion), Statistical Science, 10, 3–66.

Carroll, R. J. (2003), "Variances Are Not Always Nuisance Parameters: The 2002 R. A. Fisher Lecture," Biometrics, 59, 211–220.

Carroll, R. J., and Ruppert, D. (1988), Transformation and Weighting in Regression, New York: Chapman and Hall.


Cleveland, W. S., and Grosse, E. (1991), "Computational Methods for Local Regression," Statistics and Computing, 1, 47–62.

Crainiceanu, C., and Ruppert, D. (2004), "Likelihood Ratio Tests in Linear Mixed Models With One Variance Component," Journal of the Royal Statistical Society, Series B, 66, 165–185.

Crainiceanu, C. M., Ruppert, D., and Wand, M. P. (2005), "Bayesian Analysis for Penalized Spline Regression Using WinBUGS," Journal of Statistical Software, 14, available online at http://www.jstatsoft.org/v14/i14.

Donoho, D. L., and Johnstone, I. M. (1994), "Ideal Spatial Adaptation by Wavelet Shrinkage," Biometrika, 81, 425–455.

Eilers, P. H. C., and Marx, B. D. (1996), "Flexible Smoothing With B-Splines and Penalties" (with discussion), Statistical Science, 11, 89–121.

(2002), "Generalized Linear Additive Smooth Structures," Journal of Computational and Graphical Statistics, 11, 758–783.

(2004), "Splines, Knots, and Penalties," unpublished manuscript.

Fields Development Team (2006), fields: Tools for Spatial Data, National Center for Atmospheric Research, Boulder, CO, available online at http://www.cgd.ucar.edu/Software/Fields.

Friedman, J. H. (1991), "Multivariate Adaptive Regression Splines" (with discussion), The Annals of Statistics, 19, 1–141.

Gelman, A. (2006), "Prior Distributions for Variance Parameters in Hierarchical Models," Bayesian Analysis, 1, 515–533.

Gray, R. J. (1994), "Spline-Based Tests in Survival Analysis," Biometrics, 50, 640–652.

Holst, U., Hössjer, O., Björklund, C., Ragnarson, P., and Edner, H. (1996), "Locally Weighted Least Squares Kernel Regression and Statistical Evaluation of LIDAR Measurements," Environmetrics, 7, 401–416.

Jullion, A., and Lambert, P. (2006), Discussion Paper DP0534, UCL.

Kammann, E. E., and Wand, M. P. (2003), "Geoadditive Models," Applied Statistics, 52, 1–18.

Kauermann, G. (2002), "A Note on Bandwidth Selection for Penalised Spline Smoothing," Technical Report 02-13, Department of Statistics, University of Glasgow.

Krivobokova, T., Crainiceanu, C. M., and Kauermann, G. (2007), "Fast Adaptive Penalized Splines," Journal of Computational and Graphical Statistics, to appear.

Lang, S., and Brezger, A. (2004), "Bayesian P-Splines," Journal of Computational and Graphical Statistics, 13, 183–212.

Lang, S., Fronk, E. M., and Fahrmeir, L. (2002), "Function Estimation with Locally Adaptive Dynamic Models," Computational Statistics, 17, 479–499.

Marx, B. D., and Eilers, P. H. C. (1998), "Direct Generalized Additive Modeling With Penalized Likelihood," Computational Statistics & Data Analysis, 28, 193–209.

Natarajan, R., and Kass, R. E. (2000), "Reference Bayesian Methods for Generalized Linear Mixed Models," Journal of the American Statistical Association, 95, 227–237.

Ngo, L., and Wand, M. P. (2004), "Smoothing With Mixed Model Software," Journal of Statistical Software, 9, available online at http://www.jstatsoft.org/v09/i01.

Nychka, D., Haaland, P., O'Connell, M., and Ellner, S. (1998), "FUNFITS: Data Analysis and Statistical Tools for Estimating Functions," in Case Studies in Environmental Statistics, Lecture Notes in Statistics (vol. 132), eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 159–179, online at http://www.cgd.ucar.edu/stats/Software/Funfits.

Nychka, D., and Saltzman, N. (1998), "Design of Air Quality Monitoring Networks," in Case Studies in Environmental Statistics, Lecture Notes in Statistics (vol. 132), eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 51–76.


Pintore, A., Speckman, P., and Holmes, C. C. (2006), "Spatially Adaptive Smoothing Splines," Biometrika, 93, 113–125.

Ruppert, D. (1997), "Local Polynomial Regression and its Applications in Environmental Statistics," in Statistics for the Environment (vol. 3), eds. V. Barnett and F. Turkman, Chichester: Wiley, pp. 155–173.

(2002), "Selecting the Number of Knots for Penalized Splines," Journal of Computational and Graphical Statistics, 11, 735–757.

Ruppert, D., and Carroll, R. J. (2000), "Spatially-Adaptive Penalties for Spline Fitting," Australian and New Zealand Journal of Statistics, 42, 205–223.

Ruppert, D., Wand, M. P., Holst, U., and Hössjer, O. (1997), Technometrics, 39, 262–273.

Ruppert, D., Wand, M. P., and Carroll, R. J. (2003), Semiparametric Regression, Cambridge: Cambridge University Press.

Smith, M., and Kohn, R. (1997), "A Bayesian Approach to Nonparametric Bivariate Regression," Journal of the American Statistical Association, 92, 1522–1535.

Wood, S. (2006), Generalized Additive Models: An Introduction with R, New York: Chapman & Hall/CRC.



Ruppert and Carroll (2000) proposed a "local penalty" model similar to (2.4). Baladandayuthapani, Mallick, and Carroll (2005) developed a Bayesian version of Ruppert and Carroll's (2000) estimator and showed that their estimator is similar to Ruppert and Carroll's in terms of mean square error and outperforms other Bayesian methods. Lang, Fronk, and Fahrmeir (2002) considered locally adaptive dynamic models and found that their method is roughly comparable to the method of Ruppert and Carroll (2000) in terms of mean square error (MSE) and coverage probability. Krivobokova, Crainiceanu, and Kauermann (2007) used a model similar to (2.4) for low rank thin-plate splines and applied Laplace approximations instead of Bayesian MCMC.

In Lang and Brezger's (2004) prior for the $b_k$, the nonconstant variance is independent from knot to knot, since the variances of the $b_k$ are assumed to be independent inverse-gammas. In contrast, our model allows dependence, so that if the variance is high at one knot, then it is high at neighboring knots. Stated differently, the Lang and Brezger model is one of random heteroscedasticity and ours of systematic heteroscedasticity. Both types of priors are sensible and will have applications, but in any specific application it is likely that one of the two will be better. For example, Lang and Brezger found that for "Doppler type" functions their estimator is not quite as good as Ruppert and Carroll's and that the coverage of their credible intervals is not so accurate either. It is not hard to understand why: Doppler-type functions that oscillate more slowly going from left to right are consistent with a prior variance for the $b_k$ that is monotonically decreasing, exactly the type of prior included in our model but not in Lang and Brezger's.

2.3 Choice of Spline Basis, Number and Spacing of Knots, and Penalties

For concreteness, we have made some specific choices about the spline bases, number and spacing of knots, and form of the penalty. In terms of model formulation, the choice of spline basis in the model is not important, since an equivalent basis gives the same model, and the basis used in computations need not be the same as the one used to express the model. However, spline bases are very different in terms of computational stability. For example, cubic thin-plate splines (Ruppert, Wand, and Carroll 2003, pp. 72–74) have better numerical properties than the truncated polynomial basis. In more than one dimension, the tensor product of truncated polynomials proved unstable, and we preferred using low rank radial smoothers (see Section 6).

The penalty we use is somewhat different from that of Eilers and Marx (1996) and also somewhat different from the penalties used by smoothing splines. We believe that the penalty parameter is the crucial choice, so we have concentrated on spatially adaptive modeling of the penalty parameter.

In this article we use knots to model the mean, variance, and spatially adaptive smoothing parameter. The methodology described here does not depend conceptually on the number of knots. For example, one could use a knot at each observation for each of the three functions. Although we use quantile knot spacings, the best type of knot spacing is controversial. Eilers and Marx (2004) stated that "Equally spaced knots are always to be preferred." In contrast, Ruppert, Wand, and Carroll (2003) used quantile spacing in all of their examples,


though they did not make any categorical statement that quantile spacing is always to be preferred. Since knot spacing is not the main focus of this article and our results were not sensitive to it, we do not discuss this issue here.

In this article we have used quantile knot spacing for the univariate case and the space filling design (Nychka and Saltzman 1998) for the multivariate case, which is based on the maximal separation principle and generalizes quantile knot spacing to more than one dimension. For the multivariate case we also used equally spaced knots in simulations to make our methods more comparable to other published methods.

3. PRIOR SPECIFICATION

Any smoother depends heavily on the choice of smoothing parameter, and for penalized splines the smoothing parameter is the ratio of the error variance to the prior variance on the mean (Ruppert, Wand, and Carroll 2003). The smoothness of the fit depends on how these variances are estimated. For example, Crainiceanu and Ruppert (2004) showed that in finite samples the (RE)ML estimator of the smoothing parameter is biased towards oversmoothing, and Kauermann (2002) obtained corresponding asymptotic results for smoothing splines.

In Bayesian models the estimates of the variance components are known to be sensitive to the prior specification; see, for example, Gelman (2006). To study the effect of this sensitivity upon Bayesian penalized splines, consider model (2.1) with one smoothing parameter and homoscedastic errors, so that $\sigma_b^2$ and $\sigma_\varepsilon^2$ are constant. In terms of the precision parameters $\tau_b = 1/\sigma_b^2$ and $\tau_\varepsilon = 1/\sigma_\varepsilon^2$, the smoothing parameter is $\lambda = \tau_\varepsilon/\tau_b = \sigma_b^2/\sigma_\varepsilon^2$, and a small (large) $\lambda$ corresponds to oversmoothing (undersmoothing).

3.1 Priors on the Fixed Effects Parameters

It is standard to assume that the fixed effects parameters $\beta_i$ are a priori independent with prior distributions either $[\beta_i] \propto 1$ or $\beta_i \sim N(0, \sigma_\beta^2)$, where $\sigma_\beta^2$ is very large. In our applications we used $\sigma_\beta^2 = 10^6$, which we recommend if $x$ and $y$ have been standardized or at least have standard deviations with order of magnitude one.

For the fixed effects $\gamma$ and $\delta$ used in the log-spline models (2.3) and (2.4), we also used independent $N(0, 10^6)$ priors. When this prior is not consistent with the true value of the parameter, a possible strategy is to fit the model using a given set of priors and obtain the estimators $\hat{\gamma}$, $\hat{\sigma}_\gamma$, $\hat{\delta}$, and $\hat{\sigma}_\delta$. We could then use independent priors $N(\hat{\gamma}, 10^6\hat{\sigma}_\gamma^2)$ and $N(\hat{\delta}, 10^6\hat{\sigma}_\delta^2)$ for $\gamma$ and $\delta$, respectively.

3.2 Priors on the Precision Parameters

As just mentioned, the priors for the precisions $\tau_b$ and $\tau_\varepsilon$ are crucial. We now show how critically the choice of $\tau_b$ may depend upon the scaling of the variables. The gamma family of priors for the precisions is conjugate. If $[\tau_b] \sim \mathrm{Gamma}(A_b, B_b)$ and, independently of $\tau_b$, $[\tau_\varepsilon] \sim \mathrm{Gamma}(A_\varepsilon, B_\varepsilon)$, where $\mathrm{Gamma}(A, B)$ has mean $A/B$ and variance $A/B^2$, then

$$[\tau_b \mid Y, \boldsymbol{\beta}, b, \tau_\varepsilon] \sim \mathrm{Gamma}\left(A_b + \frac{K_m}{2},\ B_b + \frac{\|b\|^2}{2}\right), \qquad (3.1)$$

and

$$[\tau_\varepsilon \mid Y, \boldsymbol{\beta}, b, \tau_b] \sim \mathrm{Gamma}\left(A_\varepsilon + \frac{n}{2},\ B_\varepsilon + \frac{\|Y - X\boldsymbol{\beta} - Zb\|^2}{2}\right).$$

Also,

$$E(\tau_b \mid Y, \boldsymbol{\beta}, b, \tau_\varepsilon) = \frac{A_b + K_m/2}{B_b + \|b\|^2/2}, \qquad \mathrm{var}(\tau_b \mid Y, \boldsymbol{\beta}, b, \tau_\varepsilon) = \frac{A_b + K_m/2}{(B_b + \|b\|^2/2)^2},$$

and similarly for $\tau_\varepsilon$. The prior does not influence the posterior distribution of $\tau_b$ when both $A_b$ and $B_b$ are small compared to $K_m/2$ and $\|b\|^2/2$, respectively. Since the number of knots is $K_m \ge 1$, and in most problems considered $K_m \ge 5$, it is safe to choose $A_b \le 0.01$. When $B_b \ll \|b\|^2/2$, the posterior distribution is practically unaffected by the prior assumptions. When $B_b$ increases compared to $\|b\|^2/2$, the conditional distribution is increasingly affected by the prior assumptions. $E(\tau_b \mid Y, \boldsymbol{\beta}, b, \tau_\varepsilon)$ is decreasing in $B_b$, indicating that a large $B_b$ compared to $\|b\|^2/2$ corresponds to undersmoothing. Since the posterior variance of $\tau_b$ is also decreasing in $B_b$, a poor choice of $B_b$ will likely result in underestimating the variability of the smoothing parameter $\lambda = \tau_\varepsilon/\tau_b$, causing confidence intervals for the mean function to be too narrow. The condition $B_b \ll \|b\|^2/2$ shows that the "noninformativeness" of the gamma prior depends essentially on the scale of the problem.
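The sensitivity argument can be checked numerically. The sketch below uses hypothetical numbers chosen only to mimic the LIDAR scale discussed next, where $\|b\|^2/2$ is of order $10^{-6}$ to $10^{-5}$; the function name and values are ours.

```python
import numpy as np

def tau_b_conditional(A_b, B_b, b):
    # Full conditional (3.1): Gamma(A_b + Km/2, B_b + ||b||^2 / 2);
    # returns its mean A/B and variance A/B^2.
    shape = A_b + len(b) / 2.0
    rate = B_b + 0.5 * np.sum(b ** 2)
    return shape / rate, shape / rate ** 2

b = np.full(30, 1e-3)                                # ||b||^2 / 2 = 1.5e-5
mean_diffuse, _ = tau_b_conditional(1e-3, 1e-8, b)   # B_b << ||b||^2 / 2
mean_inform, _ = tau_b_conditional(1e-3, 1e-2, b)    # B_b >> ||b||^2 / 2
# A large B_b drags E(tau_b | .) down, i.e., inflates sigma_b^2 and
# undersmooths... I mean undersmoothes the fit, exactly as argued above.
```

Here `mean_diffuse` is several orders of magnitude larger than `mean_inform`, illustrating how strongly $B_b$ can dominate when $\|b\|^2/2$ is tiny.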

To show the potential severity of these effects, we consider model (2.1) with a global smoothing parameter and homoscedastic errors. We used quadratic splines with 30 knots and the light detection and ranging (LIDAR) data for illustration. The LIDAR data were obtained from atmospheric monitoring of pollutants and consist of the range, or distance traveled before light is reflected back to a source, and the log ratio, the logarithm of the ratio of received light from two laser sources. For more details see Ruppert, Wand, Holst, and Hössjer (1997).

Figure 1 shows the effect of four mean-one Gamma priors for the precision of the truncated polynomial parameters. The variances of these priors are 10, $10^3$, $10^6$, and $10^{10}$, respectively. Obviously, the first two inferences provide severely undersmoothed, almost indistinguishable posterior means. The third graph is much smoother but still exhibits roughness, especially in the right-hand side of the plot, while the fourth graph displays a pleasing smooth pattern consistent with our frequentist inference. Using either prior distribution, one obtains that the posterior mean of $\|b\|^2/2$ is of order $10^{-6}$ to $10^{-5}$. This explains why values of $B_b$ larger than $10^{-6}$ proved inappropriate for this problem.

The size of $\|b\|^2/2$ depends upon the scaling of the $x$ and $y$ variables, and in the case of the LIDAR data $\|b\|^2/2$ is small because the standard deviation of $x$ is much larger than the standard deviation of $y$. If $y$ is rescaled to $a_y y$ and $x$ to $a_x x$, then the regression function becomes $a_y m(a_x x)$, whose $p$th derivative is $a_y a_x^p m^{(p)}(a_x x)$, so that $\|b\|^2/2$ is rescaled by the factor $a_y^2 a_x^{2p}$. Thus $\|b\|^2/2$ is particularly sensitive to the scaling of $x$.

Figure 1. LIDAR data with unstandardized covariate: effect of four mean-one Gamma priors for the precision of the truncated polynomial parameters. The variances of these priors are 10, $10^3$, $10^6$, and $10^{10}$, respectively.

The size of $\|b\|^2/2$ also depends on $K_m$. Because the integral of $m^{(p)}$ over the range of $x$ will be approximately $\sum_{k=1}^{K_m} b_k \approx \sqrt{K_m}\,\sigma_b$, we can expect that $\sigma_b^2 \propto K_m^{-1}$, and the smoothing parameter should be proportional to $K_m$. For the LIDAR data, the GCV-chosen smoothing parameter is 0.0095, 0.0205, 0.0440, and 0.0831 for $K_m$ equal to 15, 30, 60, and 120, respectively, so as expected the smoothing parameter approximately doubles as $K_m$ doubles.

Practical experience with LMMs for longitudinal or clustered data should be applied with caution to penalized splines. In a mixed effects model, $\|b\|^2$ is an estimator of $K_m \sigma_b^2$. For longitudinal data, $K_m \sigma_b^2$ would generally be large because $K_m$ is the number of subjects, with constant subject effect variance $\sigma_b^2$. As just discussed, for a penalized spline $K_m \sigma_b^2$ should be nearly independent of $K_m$ and could be quite small.

Figure 2 presents the same type of results as Figure 1 for Gamma priors for the precision parameter $\tau_b$, with the mean held fixed at $10^{-6}$ and variances equal to 10, $10^3$, $10^6$, and $10^{10}$, respectively. These prior distributions have a much smaller effect on the posterior mean of the regression function. The fit seems to be undersmoothed when the variance is 10. Clearly, when the variance increases the fit becomes smooth, indicating that a value of the variance larger than $10^3$ will produce a reasonable fit.

A similar discussion holds true for $\tau_\varepsilon$, but now a large $B_\varepsilon$ corresponds to oversmoothing, and $\tau_\varepsilon$ does not depend on the scaling of $x$. In applications it is less likely that $B_\varepsilon$ is comparable in size to $\|Y - X\boldsymbol{\beta} - Zb\|^2$, because the latter is an estimator of $n\sigma_\varepsilon^2$. If $\hat{\sigma}_\varepsilon^2$ is an estimator of $\sigma_\varepsilon^2$, a good rule of thumb is to use values of $B_\varepsilon$ smaller than $n\hat{\sigma}_\varepsilon^2/100$. This rule should work well when $\hat{\sigma}_\varepsilon^2$ does not have an extremely large variance.

Figure 2. LIDAR data with unstandardized covariate: effect of four Gamma priors with mean $10^{-6}$ for the precision of the truncated polynomial parameters. The variances of these priors are 10, $10^3$, $10^6$, and $10^{10}$, respectively.

Alternatives to gamma priors are discussed by, for example, Natarajan and Kass (2000) and Gelman (2006). These have the advantage of requiring less care in the choice of the hyperparameters. However, we find that with reasonable care the conjugate gamma priors can be used in practice. Nonetheless, exploration of other prior families for penalized splines would be well worthwhile, though beyond the scope of this article.

4. SIMULTANEOUS CREDIBLE BOUNDS

Simultaneous credible intervals have been discussed before in the literature, for example, by Besag, Green, Higdon, and Mengersen (1995). The idea can be used to produce simultaneous credible bounds for functions estimated smoothly. To show this, let $f(\cdot)$ be either $m(\cdot)$, $\log\sigma_\varepsilon^2(\cdot)$, $\log\sigma_b^2(\cdot)$, or a derivative of order $q$, $1 \le q \le p$, of one of these functions. It is straightforward to use MCMC output to construct simultaneous credible bounds on $f$ over an arbitrary finite interval $[x_1, x_N]$. Typically $x_1$ and $x_N$ would be the smallest and largest observed values of $x$.

Let $x_1 < x_2 < \cdots < x_N$ be a fine grid of points on this interval. Let $\widehat{E}\{f(x_i)\}$ and $\widehat{SD}\{f(x_i)\}$ be the posterior mean and standard deviation of $f(x_i)$ estimated from an MCMC sample. Let $M_\alpha$ be the $(1 - \alpha)$ sample quantile of $\max_{1 \le i \le N}\left|\left[f(x_i) - \widehat{E}\{f(x_i)\}\right]/\widehat{SD}\{f(x_i)\}\right|$ computed from the realizations of $f$ in the MCMC sample. Then $I(x_i) = \widehat{E}\{f(x_i)\} \pm M_\alpha \widehat{SD}\{f(x_i)\}$, $i = 1, \ldots, N$, are simultaneous credible intervals. For $N$ large, the upper and lower limits of these intervals can be connected to form simultaneous credible bands. In the remainder of this article we only report simultaneous credible bounds for the functions and their derivatives. In the examples considered in the following, these bounds tend to be roughly 30–50% wider than pointwise credible bounds.
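Given a matrix of MCMC realizations of $f$ on the grid, the construction above takes only a few lines. This is a sketch; the function and variable names are ours.

```python
import numpy as np

def simultaneous_bands(f_draws, alpha=0.05):
    """f_draws: (S, N) array of S MCMC realizations of f on an N-point grid.
    Returns (lower, upper) simultaneous (1 - alpha) credible bounds using the
    (1 - alpha) quantile of the maximum absolute standardized deviation."""
    mean = f_draws.mean(axis=0)
    sd = f_draws.std(axis=0, ddof=1)
    z_max = np.abs((f_draws - mean) / sd).max(axis=1)  # one max per draw
    m_alpha = np.quantile(z_max, 1.0 - alpha)
    return mean - m_alpha * sd, mean + m_alpha * sd

# Toy check on iid normal "posterior draws" over a 50-point grid
rng = np.random.default_rng(0)
lower, upper = simultaneous_bands(rng.standard_normal((2000, 50)))
```

Because $M_\alpha$ exceeds the pointwise normal quantile, these bands are wider than pointwise intervals, consistent with the 30–50% figure quoted above.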

5. COMPARISON WITH OTHER UNIVARIATE SMOOTHERS

Baladandayuthapani et al. (2005) presented a comprehensive simulation study of their Bayesian spatially adaptive penalized spline model, which is obtained as a particular case of our full model with constant error variance. The results of this study indicate that the method is comparable to or better than several other spatially adaptive Bayesian methods on a variety of datasets. Results also showed that the method produced practically indistinguishable inferences from those obtained using the fast frequentist method of Ruppert and Carroll (2000). Since we are unaware of any other estimation methodology that jointly estimates the nonparametric adaptive mean and nonparametric variance models, we will compare our methodology with that of Ruppert and Carroll (2000). We will compare the frequentist properties of our Bayesian methodology, such as MSE and coverage probabilities, calculated by simulating from existing benchmark models.

Consider the regression model

$$y_i = m(x_i) + \varepsilon_i,$$

where the $\varepsilon_i \sim N(0, \sigma_\varepsilon^2)$ are independent errors. The first function considered was the Doppler function described by Donoho and Johnstone (1994),

$$m(x) = \sqrt{x(1 - x)}\,\sin\left\{\frac{2\pi(1 + 0.05)}{x + 0.05}\right\}.$$

For the Doppler function we considered $n = 2048$ equally spaced $x$'s on $[0, 1]$ and $\sigma_\varepsilon^2 = 1$.
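For reference, this Doppler test bed is easy to reproduce. A sketch in Python/NumPy; the random seed is an arbitrary choice of ours.

```python
import numpy as np

def doppler(x):
    # Doppler function of Donoho and Johnstone (1994)
    return np.sqrt(x * (1 - x)) * np.sin(2 * np.pi * (1 + 0.05) / (x + 0.05))

n = 2048
x = np.linspace(0.0, 1.0, n)              # equally spaced design on [0, 1]
rng = np.random.default_rng(1)
y = doppler(x) + rng.standard_normal(n)   # sigma_eps^2 = 1
```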

Ruppert and Carroll (2000) proposed the following family of models, related to the Doppler function but allowing for more shape flexibility:

$$m(x; j) = \sqrt{x(1 - x)}\,\sin\left\{\frac{2\pi(1 + 2^{(9 - 4j)/5})}{x + 2^{(9 - 4j)/5}}\right\},$$

where $j = 3$ and $j = 6$ correspond to low and severe spatial heterogeneity, respectively. For this function we considered $n = 400$ equally spaced $x$'s on $[0, 1]$ and $\sigma_\varepsilon^2(x) = 0.04$. We will refer to this as the RC$_j$ function, where $j$ controls the amount of spatial heterogeneity.

Ruppert and Carroll (2000) and Baladandayuthapani et al. (2005) also considered the following function:

$$m(x) = \exp\{-400(x - 0.6)^2\} + \frac{5}{3}\exp\{-500(x - 0.75)^2\} + 2\exp\{-500(x - 0.9)^2\}, \qquad (5.1)$$


Table 1. Average mean square error (AMSE) for different mean functions, based on 100 simulations. The column "True variance" indicates whether the data-generating process is homoscedastic or heteroscedastic. The columns "Fitted error variance" refer to the type of assumption used for fitting the data: "Homoscedastic" uses the Ruppert and Carroll (2000) fitting algorithm, while "Heteroscedastic" uses the Bayesian inferential algorithm described in this article. The mean and log-variance functions were modeled using degree 2 splines with $K_m = K_\varepsilon = 40$ equally spaced knots. The shrinkage process was modeled using degree 1 splines with $K_b = 4$ subknots.

                                      Fitted error variance
Function      True variance     Homoscedastic    Heteroscedastic
Doppler       Constant          0.0197           0.0218
RC(j=3)       Constant          0.0011           0.0012
RC(j=6)       Constant          0.0061           0.0065
3 Bumps       Constant          0.0054           0.0055
3 Bumps       Variable          0.0031           0.0026

which is roughly constant on $[0, 0.5]$ and has three sharp peaks at 0.6, 0.75, and 0.9. For this function we considered $n = 1000$ equally spaced $x$'s on $[0, 1]$ and two different scenarios for the standard deviation of the error process: (a) $\sigma_\varepsilon(x) = 0.5$ for homoscedastic errors, and (b) $\sigma_\varepsilon(x_i) = 0.5 - 0.8x_i + 1.6(x_i - 0.5)_+$. We will refer to this as the "three-bump" function.
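The three-bump test bed, including the heteroscedastic scenario (b), can be reproduced directly from these definitions. A sketch; the seed is ours.

```python
import numpy as np

def three_bump(x):
    # Mean function (5.1): flat on [0, 0.5], peaks near 0.6, 0.75, 0.9
    return (np.exp(-400 * (x - 0.6) ** 2)
            + (5.0 / 3.0) * np.exp(-500 * (x - 0.75) ** 2)
            + 2.0 * np.exp(-500 * (x - 0.9) ** 2))

def sigma_eps(x):
    # Scenario (b): 0.5 - 0.8 x + 1.6 (x - 0.5)_+ ; the sd falls linearly
    # from 0.5 at x = 0 to 0.1 at x = 0.5, then rises back to 0.5 at x = 1
    return 0.5 - 0.8 * x + 1.6 * np.maximum(x - 0.5, 0.0)

x = np.linspace(0.0, 1.0, 1000)
rng = np.random.default_rng(2)
y = three_bump(x) + sigma_eps(x) * rng.standard_normal(x.size)
```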

We used 100 simulations from these models and fit a spatially adaptive penalized spline that uses a constant variance. For the three-bump function we also fitted a model with nonconstant variance modeled as a log-penalized spline. For the mean and log-variance functions we used quadratic penalized splines with $K_m = K_\varepsilon = 40$ knots. For the log-variance corresponding to the shrinkage process we used $K_b = 4$ knots.

We calculated the mean square error for each simulated dataset as

$$\mathrm{MSE} = \sum_{i=1}^{n} \{\widehat{m}(x_i) - m(x_i)\}^2 / n,$$

where $\widehat{m}(x)$ is the posterior mean of $m(x)$ using a given model. Table 1 provides a summary of the average MSE (AMSE) results over 100 simulations, comparing the method that allows for heteroscedastic errors with the one that does not. Remarkably, our method that estimates a nonconstant variance performed well compared to methods that assume a constant variance when the variance is constant. In contrast, when the true variance process is not constant, our method performed better.

Our method was slightly outperformed by the Ruppert and Carroll (2000) method in the Doppler case. This is probably because the Ruppert and Carroll method assumes a constant variance, and for the Doppler function the variance used for simulations was indeed constant, $\sigma_\varepsilon^2 = 1$. However, both methods outperformed the method of Baladandayuthapani et al. (2005), who reported an AMSE of 0.024 for the Doppler function.

Figure 3. Comparison of pointwise coverage probabilities of the 95% credible intervals in 500 simulations. Data were simulated from the model described in Section 5 with heteroscedastic error variance. Estimation was done using two methods: Bayesian adaptive penalized splines with homoscedastic errors (dashed) and Bayesian adaptive penalized splines with heteroscedastic errors (solid). The coverage probabilities have been smoothed to remove Monte Carlo variability.

For the three-bump function with constant variance $\sigma_\varepsilon(x) = 0.5$, the two models produced practically indistinguishable MSEs. Note that our AMSE over all $x$'s using 100 simulations was AMSE = 0.0054, which is smaller than the 0.0061 reported by Baladandayuthapani et al. (2005). The coverage probabilities of the 95% credible intervals were very similar for the two models for each value of the covariate, with a slight advantage in favor of the model using homoscedastic errors. The average coverage probability was 94.7% for the model with homoscedastic errors and 93.5% for the model with heteroscedastic errors. These coverage probabilities are very similar to the ones reported by Baladandayuthapani et al. (2005) in their Figure 3.

For case (b), $\sigma_\varepsilon(x_i) = 0.5 - 0.8x_i + 1.6(x_i - 0.5)_+$, the heteroscedastic model substantially outperformed the homoscedastic model, both in terms of AMSE and coverage probabilities. In this situation it would be misleading to compare only the average coverage probability for the 95% credible intervals. Indeed, these averages are very close, 94.1% for the heteroscedastic and 93.0% for the homoscedastic method, but they are obtained from very different sets of pointwise coverage probabilities. Figure 3 displays the coverage probabilities for these two methods. Note that for the heteroscedastic method the coverage probability in the $[0, 0.5]$ interval is close to 95%. In the same interval, the coverage probability for the homoscedastic method starts from around 0.8 and increases until it crosses the 95% target around 0.2. Moreover, in the interval $[0.3, 0.5]$ this coverage probability is estimated to be 1. This is due to the fact that on this interval $\sigma_\varepsilon(x)$ decreases linearly from 0.5 to 0.1. Since the homoscedastic method assumes a constant variance, the credible intervals will tend to be shorter than nominal in regions of higher variability ($x$ close to zero), thus producing lower coverage probabilities. In regions of smaller variability ($x$ close to 0.5), the credible intervals will tend to be much wider, thus producing extremely large


coverage probabilities.

In the interval $[0.55, 0.65]$ the homoscedastic method slightly outperforms the heteroscedastic method. This seems to be the effect of a "lucky" combination of two factors. As we discussed, the size of the credible intervals for the homoscedastic method in a neighborhood of 0.5 is much larger than nominal. However, at the same point the mean function changes from a constant to a rapidly oscillating function, and the coverage probabilities drop roughly at the same rate. In the interval $[0.8, 1]$ one can notice a phenomenon very similar to the one described for the interval $[0, 0.5]$. While the function oscillates more rapidly in this region, the heteroscedastic adaptive method produces credible intervals with coverage probabilities close to the nominal 95%. However, the homoscedastic method does not take into account the increased variability and produces credible intervals that are too short.

For these examples, both methods performed roughly similarly on simulated datasets that did not require variance estimation. However, when the error variance is not constant, the method using log-penalized splines to estimate the error variance considerably outperforms the method that does not.

6. LOW RANK MULTIVARIATE SMOOTHING

In this section we generalize the ideas in Section 2 to multivariate smoothing while preserving the appealing geometric interpretation of the smoother. Consider the regression $y_i = m(\mathbf{x}_i) + \varepsilon_i$, where $m(\cdot)$ is a smooth function of $L$ covariates. We will use radial basis functions, which have the advantage of being more numerically stable than the tensor product of truncated polynomials in more than one dimension. Suppose that $\mathbf{x}_i \in \mathbb{R}^L$, $1 \le i \le n$, are $n$ vectors of covariates and $\boldsymbol{\kappa}_k \in \mathbb{R}^L$, $1 \le k \le K_m$, are $K_m$ knots. Consider the following distance function:

$$C(\mathbf{r}) = \begin{cases} \|\mathbf{r}\|^{2M - L} & \text{for } L \text{ odd}, \\ \|\mathbf{r}\|^{2M - L}\log\|\mathbf{r}\| & \text{for } L \text{ even}, \end{cases}$$

where $\|\cdot\|$ denotes the Euclidean norm in $\mathbb{R}^L$ and the integer $M$ controls the smoothness of $C(\cdot)$. Let $X$ be the matrix with $i$th row $X_i = [1, \mathbf{x}_i^T]$, let $Z_{K_m} = \{C(\|\mathbf{x}_i - \boldsymbol{\kappa}_k\|)\}_{1 \le i \le n,\, 1 \le k \le K_m}$ and $\Omega_{K_m} = \{C(\|\boldsymbol{\kappa}_k - \boldsymbol{\kappa}_{k'}\|)\}_{1 \le k \le K_m,\, 1 \le k' \le K_m}$, and define $Z = Z_{K_m}\Omega_{K_m}^{-1/2}$, where $\Omega_{K_m}^{-1/2}$ is the principal square root of $\Omega_{K_m}$. With these notations, the low rank approximation of thin-plate spline regression can be obtained as the BLUP in the LMM (Kammann and Wand 2003; Ruppert, Wand, and Carroll 2003):

$$Y = X\boldsymbol{\beta} + Zb + \boldsymbol{\varepsilon}, \quad E\begin{pmatrix} b \\ \boldsymbol{\varepsilon} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \mathrm{cov}\begin{pmatrix} b \\ \boldsymbol{\varepsilon} \end{pmatrix} = \begin{pmatrix} \sigma_b^2 I_{K_m} & 0 \\ 0 & \sigma_\varepsilon^2 I_n \end{pmatrix}. \qquad (6.1)$$
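Constructing $Z = Z_{K_m}\Omega_{K_m}^{-1/2}$ is mostly linear algebra. A sketch follows; we use an SVD-based matrix square root, which is one common implementation choice, and the function and variable names are ours.

```python
import numpy as np

def radial_design(x, knots, M=2):
    """Low-rank radial (thin-plate) design Z = Z_K Omega_K^{-1/2}.
    x: (n, L) covariates, knots: (Km, L). Uses C(r) = ||r||^(2M - L),
    with an extra log||r|| factor when L is even; C(0) is set to 0."""
    L = x.shape[1]

    def C(d):
        with np.errstate(divide="ignore", invalid="ignore"):
            out = d ** (2 * M - L)
            if L % 2 == 0:
                out = out * np.log(d)
        return np.where(d == 0.0, 0.0, out)

    ZK = C(np.linalg.norm(x[:, None, :] - knots[None, :, :], axis=2))
    Omega = C(np.linalg.norm(knots[:, None, :] - knots[None, :, :], axis=2))
    U, s, Vt = np.linalg.svd(Omega)           # Omega^{-1/2} via SVD
    return ZK @ (U @ np.diag(1.0 / np.sqrt(s)) @ Vt)

# 1-D example (L = 1, so C(r) = r^3): 20 design points, 5 knots
x = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
knots = np.linspace(0.1, 0.9, 5).reshape(-1, 1)
Z = radial_design(x, knots)
```

With $Z$ in hand, model (6.1) is an ordinary linear mixed model and can be fit by any LMM machinery.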

This model contains only one variance component, σ_b², for controlling the shrinkage of b, which is equivalent to one global smoothing parameter λ = σ_b²/σ_ε², and it implicitly assumes homoscedastic errors. To relax these assumptions we consider a new set of knots κ*_1, …, κ*_{K_b} and define X* and Z* similarly, with the corresponding definitions of the matrices


X and Z where the x-covariates are replaced by the knots κ_k and the knots are replaced by the subknots κ*_k. Consider the following model for the response:

$$y_i = \beta_0 + \sum_{j=1}^{L} \beta_j x_{ij} + \sum_{k=1}^{K_m} b_k z_{ik} + \varepsilon_i, \qquad \varepsilon_i \sim N\{0, \sigma_\varepsilon^2(x_i)\},\ i = 1, \ldots, n, \qquad b_k \sim N\{0, \sigma_b^2(\kappa_k)\},\ k = 1, \ldots, K_m, \tag{6.2}$$

where the log of the error variance, log σ_ε²(x_i), is modeled as a low rank log-thin plate spline

$$\log\sigma_\varepsilon^2(x_i) = \gamma_0 + \sum_{j=1}^{L} \gamma_j x_{ij} + \sum_{k=1}^{K_\varepsilon} c_k z_{ik}, \qquad c_k \sim N(0, \sigma_c^2),\ k = 1, \ldots, K_\varepsilon, \tag{6.3}$$

and the log of the random coefficient variance, log σ_b²(κ_k), is modeled as another low rank log-thin plate spline

$$\log\sigma_b^2(\kappa_k) = \delta_0 + \sum_{j=1}^{L} \delta_j x^*_{kj} + \sum_{j=1}^{K_b} d_j z^*_{kj}, \qquad d_j \sim N(0, \sigma_d^2),\ j = 1, \ldots, K_b, \tag{6.4}$$

where x*_{kj} and z*_{kj} are the entries of the X* and Z* matrices, respectively, and the c_k and d_j are assumed mutually independent.

In the case of low rank smoothers, the set of knots for the covariates and the subknots for modeling the shrinkage process have to be chosen. One possibility is to use equally spaced knots and subknots. We use this approach in our simulation study in Section 7 to make our method comparable, from this point of view, with previously published methods. Another possibility is to select the knots and subknots using the space filling design (Nychka and Saltzman 1998), which is based on the maximal separation principle. This avoids wasting knots and is likely to lead to better approximations in sparse regions of the data. The cover.design() function from the R package fields (Fields Development Team 2006) provides software for space filling knot selection. We use this approach in Section 8 in the Noshiro elevation example.
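A greedy farthest-point rule gives the flavor of maximal-separation knot selection. This sketch is an assumption in the spirit of cover.design(), not that function's exact algorithm:

```python
import numpy as np

def space_filling_knots(x, K, seed=0):
    """Greedily pick K rows of x that are approximately maximally separated."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(x)))]             # arbitrary starting point
    d = np.linalg.norm(x - x[idx[0]], axis=1)     # distance to nearest knot
    for _ in range(K - 1):
        nxt = int(np.argmax(d))                   # farthest point from current set
        idx.append(nxt)
        d = np.minimum(d, np.linalg.norm(x - x[nxt], axis=1))
    return x[idx]

rng = np.random.default_rng(1)
pts = rng.uniform(size=(500, 2))                  # illustrative sampling locations
knots = space_filling_knots(pts, 100)
```

Because each new knot is the point farthest from the existing set, sparse regions of the data receive knots rather than being ignored.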

7. COMPARISONS WITH OTHER ADAPTIVE SURFACE FITTING METHODS

Lang and Brezger (2004) compared their adaptive surface smoothing method with several methods used in a simulation study by Smith and Kohn (1997): MARS (Friedman 1991), "locfit" (Cleveland and Grosse 1991), "tps" (bivariate cubic thin plate splines with a single smoothing parameter), tensor product cubic smoothing splines with five smoothing parameters, and a parametric linear interaction model. We will use the following regression functions, also used by Lang and Brezger:

• m_1(x_1, x_2) = x_1 sin(4πx_2), where x_1 and x_2 are distributed independently uniform on [0, 1];

• m_2(x_1, x_2) = 1.5 exp(−8x_1²) + 3.5 exp(−8x_2²), where x_1 and x_2 are distributed independently normal with mean 0.5 and variance 0.1.


Function m_1(·) contains complex interactions and moderate spatial variability, while m_2(·) has only main effects and is much smoother. We used a sample size n = 300, σ = range(f)/4, and 250 simulations from each model.
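The simulation setup above can be sketched as follows (sample size matches the text; the seed is an arbitrary illustrative choice, and we simulate from m_1 with σ = range(f)/4):

```python
import numpy as np

def m1(x1, x2):
    # complex interaction, moderate spatial variability
    return x1 * np.sin(4 * np.pi * x2)

def m2(x1, x2):
    # main effects only, much smoother
    return 1.5 * np.exp(-8 * x1 ** 2) + 3.5 * np.exp(-8 * x2 ** 2)

rng = np.random.default_rng(2)
n = 300
x1, x2 = rng.uniform(size=n), rng.uniform(size=n)   # uniform design for m1
f = m1(x1, x2)
sigma = (f.max() - f.min()) / 4                     # sigma = range(f)/4
y = f + sigma * rng.standard_normal(n)
```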

We will compare the performance of our method with that of Lang and Brezger's, which performed at least as well as other adaptive methods in a variety of examples. To avoid dependence of results on knot choice, we used the same choice as the one described by Lang and Brezger: 12 × 12 equally spaced knots for the mean function. However, we used thin-plate splines instead of tensor products of B-splines. We modeled the smoothing parameter using a low rank log-thin plate spline with an equally spaced 5 × 5 knot grid. We considered two estimators: one with constant error variance (CEV) and one with estimated error variance using a log-thin plate spline (EEV) with the same 12 × 12 knot grid as for the mean function. The CEV estimator is the one that can directly be compared with the estimators considered by Lang and Brezger (2004). Because Lang and Brezger presented only comparative results using boxplots, we will use their results as read from their Figure 5. The performance of all estimators was measured by the log of the mean squared error, log(MSE), where

$$\mathrm{MSE}(\hat f) = \frac{1}{n}\sum_{i=1}^{n}\{\hat f(x_i) - f(x_i)\}^2.$$
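In code, the log(MSE) criterion is simply (fhat below is a hypothetical stand-in for a fitted surface, used only to exercise the function):

```python
import numpy as np

def log_mse(fhat, f):
    """log of the mean squared error of a fit evaluated at the design points."""
    return np.log(np.mean((fhat - f) ** 2))

f = np.array([0.0, 1.0, 2.0])
fhat = f + 0.1          # a fit with constant error 0.1, so MSE is 0.01
val = log_mse(fhat, f)
```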

For the CEV estimator and function f_1(·), corresponding to moderate spatial heterogeneity, we obtained a median log(MSE) of −3.67 with an interquartile range [−3.80, −3.53] and a range [−4.21, −3.13]. These values are better than the best values reported for Lang and Brezger's methods in their Figure 5(b). Lang and Brezger reported that their method, denoted "P-spline," outperforms all other methods in this case and achieved a median log(MSE) of −3.50. Remarkably, the EEV estimator performed almost as well as the CEV estimator in this case and also outperformed the other methods. The median log(MSE) for EEV was −3.59 with an interquartile range [−3.74, −3.43] and a range [−4.14, −3.02]. These results are consistent with univariate results reported by Ruppert and Carroll (2000) and Baladandayuthapani, Mallick, and Carroll (2005), who found that allowing the smoothing parameter to vary smoothly outperforms other adaptive techniques.

For function f_1(·), Figure 4 displays the coverage probabilities for the 95% credible intervals calculated over a 20 × 20 equally spaced grid in [0, 1]². For each grid point we calculated the frequency with which the 95% pointwise credible interval covers the true value of the function at the grid point. As expected, these coverage probabilities show strong spatial correlation, with lower coverage probabilities along the ridges of the sinus function. Coverage probability is lowest when x_1 is in the [0.2, 0.5] range. The signal-to-noise ratio in this region is about half the signal-to-noise ratio in the region corresponding to x_1 close to 1. This explains why the coverage probability is smaller in this region. Interestingly, the zeros of the true function, appearing at x_2 = 0.25, 0.50, 0.75, are covered with at least 95% probability. Another interesting feature is the lower coverage probabilities near the north, south, and east boundaries. These features of the pointwise coverage probability map are consistent with the frequentist analysis of Krivobokova et al. (2007), but are not consistent with the coverage probabilities reported by Lang and Brezger for their Bayesian P-spline method.
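A pointwise coverage map like the one in Figure 4 can be computed from posterior draws as in this sketch. The draws below are synthetic stand-ins, not output of the actual sampler:

```python
import numpy as np

def coverage(draws, truth, level=0.95):
    """draws: (n_sim, n_draws, n_grid) posterior samples of f at grid points.
    Returns, for each grid point, the fraction of simulated data sets whose
    credible interval (posterior quantiles) covers the true value."""
    a = (1 - level) / 2
    lo = np.quantile(draws, a, axis=1)       # (n_sim, n_grid) lower endpoints
    hi = np.quantile(draws, 1 - a, axis=1)   # (n_sim, n_grid) upper endpoints
    return np.mean((lo <= truth) & (truth <= hi), axis=0)

rng = np.random.default_rng(3)
truth = np.zeros(10)                                 # toy "true" surface values
draws = rng.standard_normal((200, 500, 10))          # well-calibrated toy posterior
cov = coverage(draws, truth)
```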

For our CEV estimator and function f_2(·), corresponding to very low spatial heterogeneity, we obtained a median log(MSE) of −5.95 with an interquartile range [−6.12, −5.66]


Figure 4. Coverage probability of the pointwise 95% credible intervals for the mean function, for function f_1(x_1, x_2) = x_1 sin(4πx_2) with a constant error standard deviation σ = range(f)/4. Probabilities are computed on a 20 × 20 equally spaced grid of points in [0, 1]². The model used was a thin plate spline with K = 100 knots for the mean function, a log-thin plate spline with K = 100 knots for the variance function, and a log-thin plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

and a range [−6.51, −5.41]. The results of Lang and Brezger in their Figure 5(a) indicate that their "P-spline" method achieved a median log(MSE) of −6.00. Our method performs roughly similarly to the Bayesian P-spline method of Lang and Brezger and to the cubic thin plate spline (tps), being outperformed only by Lang and Brezger's adaptive P-spline method with two main effects, "P-spline main." We could also add two main effects to our bivariate adaptive smoother to improve MSE for this example, but this is not our concern in this article. Again, the EEV estimator performed very similarly to our CEV estimator.

The last simulation study was done using the function f_1(·, ·), where x_1, x_2 are independent uniformly distributed in [0, 1], with an error standard deviation function

$$\sigma_\varepsilon(x_1, x_2) = \frac{r}{32} + \frac{3r}{32}\,x_1^2,$$

where r = range(f_1). We compared our CEV and EEV estimators described at the beginning of this section, and, as expected, the EEV estimator outperformed the CEV estimator both in terms of log(MSE) and coverage probabilities. More precisely, for log(MSE) we obtained a median of −5.26 for CEV and −5.35 for EEV, an interquartile range [−5.37, −5.15] for CEV and [−5.48, −5.24] for EEV, and a range [−5.72, −4.65] for CEV and [−5.84, −4.84] for EEV. For both estimators the general patterns for coverage probabilities were the ones displayed in Figure 4. EEV outperformed CEV in terms of coverage probabilities. For example, the regions where the nominal level is exceeded for both estimators have a roughly similar


Figure 5. Locations of sampling (dots) with 100 knots (circles) in the Noshiro example.

shape and location, but the EEV coverage probabilities tend to be closer to their nominal level.

Note that our model incorporating nonparametric variance estimation is more general than the models considered in the literature. However, its performance for function m_1(·, ·), with moderate spatial variability and constant variance, was better than that of all methods considered by Lang and Brezger in their simulation study (see their Figure 5, panel b). Our method has good performance for the function m_2(·, ·), being the second best when compared to methods considered by Lang and Brezger (see their Figure 5, panel a). For the case when the variance is not constant, our adaptive method that estimates the variance nonparametrically outperformed our adaptive method that estimates a constant variance.

8. THE NOSHIRO EXAMPLE

Noshiro, Japan, was the site of a major earthquake. Much of the damage from the quake was due to soil movement. Professor Thomas O'Rourke, a geotechnical engineer, was investigating the factors that might help predict soil movement during a future quake. One factor thought to be of importance was the slope of the land. To estimate the slope, Ruppert (1997) used a dataset with 799 observations, where x was longitude and latitude and y was elevation at locations in Noshiro. The object of primary interest is the gradient, at least its magnitude and possibly its direction.

We used the thin-plate spline model (6.2) for the mean of y, with K_m = 100 knots selected by the maximum separation criterion implemented in the function cover.design() of the R package fields (Nychka, Haaland, O'Connell, and Ellner 1998). The sampling


Figure 6. Posterior mean of the mean regression function for the Noshiro example. The model used was a thin plate spline with K = 100 knots for the mean function, a log-thin plate spline with K = 100 knots for the variance function, and a log-thin plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

locations and the knots used for fitting are displayed in Figure 5. The log-thin plate spline described in (6.3), with the same K_ε = 100 knots, was used to model the variance of the error process, σ_ε²(x_i). The log-thin plate spline described in (6.4), with K_b = 16 subknots, was used to model the variance σ_b²(κ_k), which controls the amount of shrinkage of the b parameters. The subknots were obtained using the cover.design() function applied to the 100 knots.

Figure 6 displays the posterior mean of the mean response function. There is a rather sharp peak around (0.42, 0.4), with a slow decay from the peak in the general south direction and a much faster decay in all other directions. The function becomes smoother towards the boundary. This is exactly the type of function for which adaptive spatial smoothing can substantially improve the fit.

Figure 7 shows that in large areas of the map σ_ε(·) is smooth, with small values. However, in a neighborhood of the peak of the mean function, σ_ε(·) displays two relatively sharp maxima, located NNW and SSE of the peak. We also performed a two-stage frequentist analysis. In the first stage we used a low rank thin plate spline for the mean function, and the results from this regression were used to obtain residuals. We then fitted another low rank thin plate spline to the absolute values of these residuals. While this map was not identical to the one in Figure 7, it did exhibit the same patterns of heteroscedasticity.

The process log σ_b²(κ_k) controlling the shrinkage of the b_k is displayed in Figure 8.


Figure 7. Posterior mean of the standard deviation σ_ε(x_i) of the error process for the Noshiro example. The model used was a thin plate spline with K = 100 knots for the mean function, a log-thin plate spline with K = 100 knots for the variance function, and a log-thin plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

Figure 8. Posterior mean of the shrinkage process −log σ_b²(κ_k). The model used was a thin plate spline with K = 100 knots for the mean function, a log-thin plate spline with K = 100 knots for the variance function, and a log-thin plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.


Smaller values of this function correspond to less shrinkage of b_k towards zero and more local behavior of the smoother. Figure 8 indicates that in a neighborhood of the peak of the mean function the shrinkage is smaller, to allow the mean function to change rapidly. Away from the peak the shrinkage is larger, corresponding to a smoother mean.

9. IMPLEMENTATION USING MCMC

We will now provide some details for MCMC simulations of model (2.1), where the variances σ_ε²(x_i) and σ_b²(κ_k) are modeled by Equations (2.3) and (2.4). The implementation for other models, for example for the multivariate smoothing in Section 6, is similar. Consider independent normal priors for the coefficients of the monomials, β_i ~ N(0, σ_{0β}²), γ_i ~ N(0, σ_{0γ}²), i = 0, …, p, δ_i ~ N(0, σ_{0δ}²), i = 0, …, q, and independent inverse gamma priors for the variance components, σ_c² ~ IGamma(a_c, b_c) and σ_d² ~ IGamma(a_d, b_d). Using these priors, many full conditionals of the posterior distribution are easy to derive, while a few have complex multivariate forms. Our implementation of the MCMC using multivariate Metropolis-Hastings steps proved to be unstable, with poor mixing properties. A simple and reliable solution was to change the model by adding error terms to the log-spline models, that is,

$$\log\sigma_\varepsilon^2(x_i) = \gamma_0 + \cdots + \gamma_p x_i^p + \sum_{s=1}^{K_\varepsilon} c_s \left(x_i - \kappa_s^\varepsilon\right)_+^p + u_i,$$
$$\log\sigma_b^2(\kappa_k^m) = \delta_0 + \cdots + \delta_q \left(\kappa_k^m\right)^q + \sum_{s=1}^{K_b} d_s \left(\kappa_k^m - \kappa_s^b\right)_+^q + v_k, \tag{9.1}$$

where u_i ~ N(0, σ_u²) and v_k ~ N(0, σ_v²). This idea was also used for σ_b² by Baladandayuthapani, Mallick, and Carroll (2005). We fixed the values σ_u² = σ_v² = 0.01, as these variances appear nonidentifiable and a standard deviation of 0.1 is small on a log scale. This device makes computations feasible by replacing sampling from complex full conditionals with simple univariate MH steps.

Define ε, b, and θ_X as before, and define θ_ε = (γ^T, c^T)^T, θ_b = (δ^T, d^T)^T, C_X = [X Z], C_ε = [X_ε Z_ε], C_b = [X_b Z_b], where X, X_ε, X_b contain the monomials and Z, Z_ε, Z_b contain the truncated polynomials of the spline models for m, log σ_ε², and log σ_b², respectively. Also denote

$$\Sigma_{0X} = \begin{bmatrix} \sigma_{0\beta}^2 I_{p+1} & 0 \\ 0 & \Sigma_b \end{bmatrix}, \qquad \Sigma_{0\varepsilon} = \begin{bmatrix} \sigma_{0\gamma}^2 I_{p+1} & 0 \\ 0 & \sigma_c^2 I_{K_\varepsilon} \end{bmatrix}, \qquad \Sigma_{0b} = \begin{bmatrix} \sigma_{0\delta}^2 I_{q+1} & 0 \\ 0 & \sigma_d^2 I_{K_m} \end{bmatrix},$$

where $\Sigma_b = \mathrm{diag}\{\sigma_b^2(\kappa_k)\}$.

The full conditionals of the posterior are detailed below

1. $[\theta_X] \sim N\left(M_X C_X^T \Sigma_\varepsilon^{-1} Y,\ M_X\right)$, where $M_X = \left(C_X^T \Sigma_\varepsilon^{-1} C_X + \Sigma_{0X}^{-1}\right)^{-1}$ and $\Sigma_\varepsilon = \mathrm{diag}\{\sigma_\varepsilon^2(x_i)\}$.

2. $[\theta_\varepsilon] \sim N\left(M_\varepsilon C_\varepsilon^T Y_\varepsilon/\sigma_u^2,\ M_\varepsilon\right)$, where $Y_\varepsilon = [\log(\sigma_{\varepsilon 1}^2), \ldots, \log(\sigma_{\varepsilon n}^2)]^T$ and $M_\varepsilon = \left(C_\varepsilon^T C_\varepsilon/\sigma_u^2 + \Sigma_{0\varepsilon}^{-1}\right)^{-1}$.
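Step 2 (and, with the obvious substitutions, steps 1 and 3) is an ordinary multivariate normal Gibbs draw. A sketch via the Cholesky factor of the conditional precision, with illustrative dimensions and prior variances:

```python
import numpy as np

def draw_theta(C, Ylog, sigma2_u, Sigma0_inv, rng):
    """Draw from N(M C^T Ylog / s2u, M), M = (C^T C / s2u + Sigma0^-1)^-1."""
    prec = C.T @ C / sigma2_u + Sigma0_inv
    mean = np.linalg.solve(prec, C.T @ Ylog / sigma2_u)
    # z = L^{-T} eps has covariance (L L^T)^{-1} = prec^{-1}
    Lc = np.linalg.cholesky(prec)
    z = np.linalg.solve(Lc.T, rng.standard_normal(len(mean)))
    return mean + z

rng = np.random.default_rng(4)
n, p = 400, 6
C = rng.standard_normal((n, p))
Ylog = C @ np.ones(p) + 0.1 * rng.standard_normal(n)  # toy log-variance data
Sigma0_inv = np.eye(p) / 1e6                          # diffuse N(0, 10^6) prior
theta = draw_theta(C, Ylog, sigma2_u=0.01, Sigma0_inv=Sigma0_inv, rng=rng)
```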


Figure 9. Computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with K_m knots, with the log of the variance of the shrinkage process modeled as another penalized spline with K_b = 4 subknots. The log of the error variance is modeled as another penalized spline with K_ε = K_m knots. Different symbols correspond to increasing numbers of knots: dots (K_m = 10), asterisks (K_m = 20), circles (K_m = 30), x's (K_m = 40).

3. $[\theta_b] \sim N\left(M_b C_b^T Y_b/\sigma_v^2,\ M_b\right)$, where $Y_b = [\log(\sigma_1^2), \ldots, \log(\sigma_{K_m}^2)]^T$ and $M_b = \left(C_b^T C_b/\sigma_v^2 + \Sigma_{0b}^{-1}\right)^{-1}$.

4. $[\sigma_c^2] \sim \mathrm{IGamma}\left(a_c + K_\varepsilon/2,\ b_c + \|c\|^2/2\right)$.

5. $[\sigma_d^2] \sim \mathrm{IGamma}\left(a_d + K_b/2,\ b_d + \|d\|^2/2\right)$.
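Steps 4 and 5 are conjugate inverse gamma draws, which can be sketched as follows (the hyperparameters and coefficient vector are illustrative):

```python
import numpy as np

def draw_inv_gamma(a, b, rng):
    """One draw from IGamma(a, b), density proportional to x^(-a-1) exp(-b/x)."""
    return 1.0 / rng.gamma(shape=a, scale=1.0 / b)

rng = np.random.default_rng(5)
a_c, b_c = 0.01, 0.01                 # diffuse hyperparameters
c = rng.standard_normal(50)           # spline coefficients, K_eps = 50
sigma2_c = draw_inv_gamma(a_c + len(c) / 2, b_c + c @ c / 2, rng)
```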

6. $[\sigma_{\varepsilon i}^2] \propto \sigma_{\varepsilon i}^{-3} \exp\left\{-\frac{(y_i - \mu_i)^2}{2\sigma_{\varepsilon i}^2} - \frac{[\log(\sigma_{\varepsilon i}^2) - \eta_i]^2}{2\sigma_u^2}\right\}$, where $\mu_i$ and $\eta_i$ are the $i$th components of $C_X\theta_X$ and $C_\varepsilon\theta_\varepsilon$, respectively.

7. $[\sigma_k^2] \propto \sigma_k^{-3} \exp\left\{-\frac{b_k^2}{2\sigma_k^2} - \frac{[\log(\sigma_k^2) - \zeta_k]^2}{2\sigma_v^2}\right\}$, where $\zeta_k$ is the $k$th component of $C_b\theta_b$.

All the above conditionals have an explicit form, with the exception of the n + K_m one-dimensional conditionals in 6 and 7. For these distributions we use the Metropolis-Hastings algorithm with a normal proposal distribution centered at the current value and with small variance.
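A sketch of the univariate Metropolis-Hastings update in step 6 for a single σ_εi². For convenience the random walk here is on log σ² (a variant of the proposal on the variance itself described in the text), which requires adding the Jacobian term; all numeric values are illustrative:

```python
import numpy as np

def log_target(t, resid2, eta, sigma2_u):
    """Log density of t = log(sigma2) under conditional 6 plus the Jacobian e^t."""
    s = np.exp(t)
    return -1.5 * t - resid2 / (2 * s) - (t - eta) ** 2 / (2 * sigma2_u) + t

def mh_step(t, resid2, eta, sigma2_u, step, rng):
    prop = t + step * rng.standard_normal()      # random walk on log sigma^2
    logr = log_target(prop, resid2, eta, sigma2_u) - log_target(t, resid2, eta, sigma2_u)
    if np.log(rng.uniform()) < logr:
        return prop, True
    return t, False

rng = np.random.default_rng(6)
t, acc = 0.0, 0
for _ in range(2000):
    t, a = mh_step(t, resid2=0.25, eta=np.log(0.25), sigma2_u=0.01, step=0.15, rng=rng)
    acc += a
rate = acc / 2000
```

With σ_u² = 0.01 the log-variance is tightly pinned near η, so the chain settles close to log(0.25) here.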

10. IMPLEMENTATION AND COMPUTATIONAL PERFORMANCE

Our MCMC simulation algorithm provided reliable results and is reasonably fast. To illustrate this, Figure 9 displays computation time (in minutes) versus sample size for 10,000


Figure 10. MCMC histories for eight parameters of interest, including the log of the standard deviation of the error process and the log of the standard deviation of the shrinkage process, as well as some parameters modeling the different functions. The model uses 40 equally spaced knots for the mean and error variance functions and 4 subknots for the adaptive shrinkage process. Chains correspond to the Doppler function example, using quadratic truncated polynomials to fit the mean and log variance functions and a linear truncated polynomial basis for fitting the shrinkage process. For clarity of presentation, only every 20th simulated observation is displayed.

simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with K_m knots, with the log of the variance of the shrinkage process modeled as another penalized spline with K_b = 4 subknots. The log of the error variance is modeled as another penalized spline with K_ε = K_m knots. Different symbols correspond to increasing numbers of knots: dots (K_m = 10), asterisks (K_m = 20), circles (K_m = 30), x's (K_m = 40). Computation time increases roughly linearly with the number of observations, with the same intercept 0 and larger slopes for larger numbers of knots. Note that even with n = 1,000 observations and K_m = K_ε = 40, simulation time is about two minutes.

For the models and datasets presented in this article, 10,000 simulations proved sufficient for accurate and reliable inference. Some parameters, such as the mean and variance functions, tended to exhibit much better mixing properties than the variance components σ_c² and σ_d² controlling the shrinkage processes. This is probably due to the reduced amount of information in the data about these parameters. The MCMC convergence and mixing properties were assessed by visual inspection of the chain histories of many parameters of interest. Since we started with good starting values of the parameters, convergence is fast. Figure 10 displays the histories of eight parameters of interest for the model fitted to the


Doppler function described in Section 5. For clarity we display only every 100th sampled value of the chain, but the same patterns would be observed in the full chain history. Similar good mixing properties have been noted in all the other examples presented in this article.

In the search for a good software platform, we implemented our algorithms in MATLAB, C, and FORTRAN. Programs can be downloaded from www.biostat.jhsph.edu/~ccrainic. For our application and implementation there were no serious differences between C and FORTRAN, which performed better than MATLAB, reducing computation time by a factor of two to three. The reported times in Figure 9 are for our implementation in C.

Given the complexity of the model, we were interested in improving the mixing properties of the Markov chains. The computational trick described in Section 9 was the sine qua non device that allowed us to replace an intractable multivariate MH step with several tractable univariate MH steps. We used normal random-walk proposal distributions centered at the current value, with variance tuned to provide acceptance rates between 30% and 50%.
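The tuning can be sketched as a simple batch adaptation during burn-in; the scale factors 0.8 and 1.25 and the stand-in standard normal target are illustrative choices, not the authors' exact scheme:

```python
import numpy as np

def tune(step, rate, lo=0.30, hi=0.50):
    """Nudge the random-walk step until acceptance lands in [lo, hi]."""
    if rate < lo:
        return step * 0.8        # too few acceptances: take smaller moves
    if rate > hi:
        return step * 1.25       # too many acceptances: take larger moves
    return step

rng = np.random.default_rng(7)
x, step = 0.0, 5.0               # deliberately poor initial step size
for batch in range(50):
    acc = 0
    for _ in range(100):
        prop = x + step * rng.standard_normal()
        # standard normal log target: accept with prob exp(logpi(prop)-logpi(x))
        if np.log(rng.uniform()) < 0.5 * (x * x - prop * prop):
            x, acc = prop, acc + 1
    step = tune(step, acc / 100)
```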

11. COMMENTS

Our joint models for a heterogeneous mean and nonconstant variance are designed to extract the maximum amount of information from complex data and to suggest simpler parametric models. Bayesian inference based on MCMC offers a natural and powerful framework for model fitting. However, MCMC in this context is not straightforward, and implementation depends essentially on the ability to transform intractable simulations from high dimensional distributions into tractable one dimensional updates, as described in Section 9. Our implementation provides accurate, fast, and reproducible results, and software is available. Actual simulation time increases only linearly with the number of observations, and mixing properties are adequate provided that the proposal distribution variances are tuned for optimal acceptance rates.

ACKNOWLEDGMENTS

Carroll's research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

[Received September 2004. Revised October 2006.]

REFERENCES

Baladandayuthapani, V., Mallick, B. K., and Carroll, R. J. (2005), "Spatially Adaptive Bayesian Regression Splines," Journal of Computational and Graphical Statistics, 14, 378–394.

Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computation and Stochastic Systems" (with discussion), Statistical Science, 10, 3–66.

Carroll, R. J. (2003), "Variances Are Not Always Nuisance Parameters: The 2002 R. A. Fisher Lecture," Biometrics, 59, 211–220.

Carroll, R. J., and Ruppert, D. (1988), Transformation and Weighting in Regression, New York: Chapman and Hall.

Cleveland, W. S., and Grosse, E. (1991), "Computational Methods for Local Regression," Statistics and Computing, 1, 47–62.

Crainiceanu, C., and Ruppert, D. (2004), "Likelihood Ratio Tests in Linear Mixed Models With One Variance Component," Journal of the Royal Statistical Society, Series B, 66, 165–185.

Crainiceanu, C. M., Ruppert, D., and Wand, M. P. (2005), "Bayesian Analysis for Penalized Spline Regression Using WinBUGS," Journal of Statistical Software, 14. Available online at http://www.jstatsoft.org/v14/i14.

Donoho, D. L., and Johnstone, I. M. (1994), "Ideal Spatial Adaptation by Wavelet Shrinkage," Biometrika, 81, 425–455.

Eilers, P. H. C., and Marx, B. D. (1996), "Flexible Smoothing With B-Splines and Penalties" (with discussion), Statistical Science, 11, 89–121.

Eilers, P. H. C., and Marx, B. D. (2002), "Generalized Linear Additive Smooth Structures," Journal of Computational and Graphical Statistics, 11, 758–783.

Eilers, P. H. C., and Marx, B. D. (2004), "Splines, Knots and Penalties," unpublished manuscript.

Fields Development Team (2006), fields: Tools for Spatial Data, National Center for Atmospheric Research, Boulder, CO. Available online at http://www.cgd.ucar.edu/Software/Fields.

Friedman, J. H. (1991), "Multivariate Adaptive Regression Splines" (with discussion), The Annals of Statistics, 19, 1–141.

Gelman, A. (2006), "Prior Distributions for Variance Parameters in Hierarchical Models," Bayesian Analysis, 1, 515–533.

Gray, R. J. (1994), "Spline-Based Tests in Survival Analysis," Biometrics, 50, 640–652.

Holst, U., Hössjer, O., Björklund, C., Ragnarson, P., and Edner, H. (1996), "Locally Weighted Least Squares Kernel Regression and Statistical Evaluation of LIDAR Measurements," Environmetrics, 7, 401–416.

Jullion, A., and Lambert, P. (2006), Discussion Paper DP0534, UCL.

Kammann, E. E., and Wand, M. P. (2003), "Geoadditive Models," Applied Statistics, 52, 1–18.

Kauermann, G. (2002), "A Note on Bandwidth Selection for Penalised Spline Smoothing," Technical Report 02-13, Department of Statistics, University of Glasgow.

Krivobokova, T., Crainiceanu, C. M., and Kauermann, G. (2007), "Fast Adaptive Penalized Splines," Journal of Computational and Graphical Statistics, to appear.

Lang, S., and Brezger, A. (2004), "Bayesian P-Splines," Journal of Computational and Graphical Statistics, 13, 183–212.

Lang, S., Fronk, E. M., and Fahrmeir, L. (2002), "Function Estimation With Locally Adaptive Dynamic Models," Computational Statistics, 17, 479–499.

Marx, B. D., and Eilers, P. H. C. (1998), "Direct Generalized Additive Modeling With Penalized Likelihood," Computational Statistics & Data Analysis, 28, 193–209.

Natarajan, R., and Kass, R. E. (2000), "Reference Bayesian Methods for Generalized Linear Mixed Models," Journal of the American Statistical Association, 95, 227–237.

Ngo, L., and Wand, M. P. (2004), "Smoothing With Mixed Model Software," Journal of Statistical Software, 9. Available online at http://www.jstatsoft.org/v09/i01.

Nychka, D., Haaland, P., O'Connell, M., and Ellner, S. (1998), "FUNFITS: Data Analysis and Statistical Tools for Estimating Functions," in Case Studies in Environmental Statistics, Lecture Notes in Statistics, vol. 132, eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 159–179. Online at http://www.cgd.ucar.edu/stats/Software/Funfits.

Nychka, D., and Saltzman, N. (1998), "Design of Air Quality Monitoring Networks," in Case Studies in Environmental Statistics, Lecture Notes in Statistics, vol. 132, eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 51–76.

Pintore, A., Speckman, P., and Holmes, C. C. (2006), "Spatially Adaptive Smoothing Splines," Biometrika, 93, 113–125.

Ruppert, D. (1997), "Local Polynomial Regression and its Applications in Environmental Statistics," in Statistics for the Environment (Vol. 3), eds. V. Barnett and F. Turkman, Chichester: Wiley, pp. 155–173.

Ruppert, D. (2002), "Selecting the Number of Knots for Penalized Splines," Journal of Computational and Graphical Statistics, 11, 735–757.

Ruppert, D., and Carroll, R. J. (2000), "Spatially-Adaptive Penalties for Spline Fitting," Australian and New Zealand Journal of Statistics, 42, 205–223.

Ruppert, D., Wand, M. P., Holst, U., and Hössjer, O. (1997), "Local Polynomial Variance-Function Estimation," Technometrics, 39, 262–273.

Ruppert, D., Wand, M. P., and Carroll, R. J. (2003), Semiparametric Regression, Cambridge: Cambridge University Press.

Smith, M., and Kohn, R. (1997), "A Bayesian Approach to Nonparametric Bivariate Regression," Journal of the American Statistical Association, 92, 1522–1535.

Wood, S. (2006), Generalized Additive Models: An Introduction with R, New York: Chapman & Hall/CRC.



though they did not make any categorical statement that quantile spacing is always to be preferred. Since knot spacing is not the main focus of this article and our results were not sensitive to it, we do not discuss this issue here.

In this article we have used quantile knot spacing for the univariate case and the space filling design (Nychka and Saltzman 1998) for the multivariate case, which is based on the maximal separation principle and generalizes quantile knot spacing to more than one dimension. For the multivariate case we also used equally spaced knots in simulations, to make our methods more comparable to other published methods.

3. PRIOR SPECIFICATION

Any smoother depends heavily on the choice of smoothing parameter, and for penalized splines the smoothing parameter is the ratio of the error variance to the prior variance on the mean (Ruppert, Wand, and Carroll 2003). The smoothness of the fit depends on how these variances are estimated. For example, Crainiceanu and Ruppert (2004) showed that in finite samples the (RE)ML estimator of the smoothing parameter is biased towards oversmoothing, and Kauermann (2002) obtained corresponding asymptotic results for smoothing splines.

In Bayesian models the estimates of the variance components are known to be sensitive to the prior specification; for example, see Gelman (2006). To study the effect of this sensitivity upon Bayesian penalized splines, consider model (2.1) with one smoothing parameter and homoscedastic errors, so that σ_b² and σ_ε² are constant. In terms of the precision parameters τ_b = 1/σ_b² and τ_ε = 1/σ_ε², the smoothing parameter is λ = τ_ε/τ_b = σ_b²/σ_ε², and a small (large) λ corresponds to oversmoothing (undersmoothing).

It is standard to assume that the fixed effects parameters βi are apriori independentwith prior distributions either [βi] prop 1 or βi prop N(0 σ 2

β ) where σ 2β is very large In our

applications we used σ 2β = 106 which we recommend if x and y have been standardized

or at least have standard deviations with order of magnitude oneFor the fixed effects γ and δ used in the log-spline models (23) and (24) we also used

independent N(0 106) priors When this prior is not consistent with the true value of theparameter a possible strategy is to fit the model using a given set of priors and obtainthe estimators γ σγ δ and σδ We could then use independent priors N(γ 106σ 2

γ ) andN(δ 106σ 2

δ) for γ and δ respectively

32 Priors on the Precision Parameters

As just mentioned the priors for the precisions τb and τε are crucial We now show howcritically the choice of τb may depend upon the scaling of the variables The gamma familyof priors for the precisions is conjugate If [τb] sim Gamma(Ab Bb) and independently of

270 C M Crainiceanu et al

τb [τε] sim Gamma(Aε Bε) where Gamma(AB) has mean AB and variance AB2 then

[τb|Y βββb τε] sim Gamma

(Ab + Km

2 Bb + ||b||2

2

) (31)

and

[τε |Y βββb τε] prop Gamma

(Aε + n

2 Bε + ||Y minus Xβββ minus Zb||2

2

)

Also

E(τb|Y βββb τε) = Ab + Km2

Bb + ||b||22 var(τb|Y βββb τε) = Ab + Km2

(Bb + ||b||22)2

and similarly for τε The prior does not influence the posterior distribution of τε when both Ab and Bb are

small compared to Km2 and ||b||22 respectively Since the number of knots is Km ge 1and in most problems considered Km ge 5 it is safe to choose Ab le 001 When Bb ltlt

||b||22 the posterior distribution is practically unaffected by the prior assumptions WhenBb increases compared to ||b||22 the conditional distribution is increasingly affectedby the prior assumptions E(τb|Y βββb τε) is decreasing in Bb indicating that large Bb

compared to ||b||22 corresponds to undersmoothing Since the posterior variance of τb isalso decreasing in Bb a poor choice of Bb will likely result in underestimating the variabilityof the smoothing parameter λ = τετb causing confidence intervals for the mean functionto be too narrow The condition Bb ltlt ||b||22 shows that the ldquononinformativenessrdquo ofthe gamma prior depends essentially on the scale of the problem

To show the potential severity of these effects, we consider model (2.1) with a global smoothing parameter and homoscedastic errors. We used quadratic splines with 30 knots and the light detection and ranging (LIDAR) data for illustration. The LIDAR data were obtained from atmospheric monitoring of pollutants and consist of range, the distance traveled before light is reflected back to a source, and log ratio, the logarithm of the ratio of received light from two laser sources. For more details see Ruppert, Wand, Holst, and Hössjer (1997).

Figure 1 shows the effect of four mean-one Gamma priors for the precision of the truncated polynomial parameters. The variances of these priors are 10, 10³, 10⁶, and 10¹⁰, respectively. Obviously, the first two inferences provide severely undersmoothed, almost indistinguishable posterior means. The third graph is much smoother but still exhibits roughness, especially in the right-hand side of the plot, while the fourth graph displays a pleasing smooth pattern consistent with our frequentist inference. Using either prior distribution, one obtains that the posterior mean of ||b||²/2 is of order 10⁻⁶ to 10⁻⁵. This explains why values of B_b larger than 10⁻⁶ proved inappropriate for this problem.

The size of $\|\mathbf{b}\|^2$ depends upon the scaling of the $x$ and $y$ variables, and in the case of the LIDAR data $\|\mathbf{b}\|^2$ is small because the standard deviation of $x$ is much larger than the standard deviation of $y$. If $y$ is rescaled to $a_y y$ and $x$ to $a_x x$, then the regression function becomes $a_y m(a_x x)$, whose $p$th derivative is $a_y a_x^p m^{(p)}(a_x x)$, so that $\|\mathbf{b}\|^2$ is rescaled by the factor $a_y^2 a_x^{2p}$. Thus $\|\mathbf{b}\|^2$ is particularly sensitive to the scaling of $x$.

Figure 1. LIDAR data with unstandardized covariate: effect of four mean-one Gamma priors for the precision of the truncated polynomial parameters. The variances of these priors are $10$, $10^3$, $10^6$, and $10^{10}$, respectively.

The size of $\|\mathbf{b}\|^2$ also depends on $K_m$. Because the integral of $m^{(p)}$ over the range of $x$ will be approximately $\sum_{k=1}^{K_m} b_k \approx \sqrt{K_m}\,\sigma_b$, we can expect that $\sigma_b^2 \propto K_m^{-1}$ and the smoothing parameter should be proportional to $K_m$. For the LIDAR data, the GCV-chosen smoothing parameter is 0.0095, 0.0205, 0.0440, and 0.0831 for $K_m$ equal to 15, 30, 60, and 120, respectively, so, as expected, the smoothing parameter approximately doubles as $K_m$ doubles.

Practical experience with LMMs for longitudinal or clustered data should be applied with caution to penalized splines. In a mixed effects model, $\|\mathbf{b}\|^2$ is an estimator of $K_m \sigma_b^2$. For longitudinal data, $K_m \sigma_b^2$ would generally be large because $K_m$ is the number of subjects, with constant subject-effect variance $\sigma_b^2$. As just discussed, for a penalized spline $K_m \sigma_b^2$ should be nearly independent of $K_m$ and could be quite small.

Figure 2 presents the same type of results as Figure 1 for Gamma priors for the precision parameter $\tau_b$, with the mean held fixed at $10^{-6}$ and variances equal to $10$, $10^3$, $10^6$, and $10^{10}$, respectively. These prior distributions have a much smaller effect on the posterior mean of the regression function. The fit seems to be undersmoothed when the variance is 10. Clearly, as the variance increases the fit becomes smooth, indicating that a value of the variance larger than $10^3$ will produce a reasonable fit.

A similar discussion holds true for $\tau_\varepsilon$, but now large $B_\varepsilon$ corresponds to oversmoothing, and $\tau_\varepsilon$ does not depend on the scaling of $x$. In applications, it is less likely that $B_\varepsilon$ is comparable in size to $\|\mathbf{Y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{Z}\mathbf{b}\|^2$, because the latter is an estimator of $n\sigma_\varepsilon^2$. If $\widehat{\sigma}_\varepsilon^2$ is an estimator of $\sigma_\varepsilon^2$, a good rule of thumb is to use values of $B_\varepsilon$ smaller than $n\widehat{\sigma}_\varepsilon^2/100$. This rule should work well when $\widehat{\sigma}_\varepsilon^2$ does not have an extremely large variance.

Figure 2. LIDAR data with unstandardized covariate: effect of four Gamma priors with mean $10^{-6}$ for the precision of the truncated polynomial parameters. The variances of these priors are $10$, $10^3$, $10^6$, and $10^{10}$, respectively.

Alternatives to gamma priors are discussed by, for example, Natarajan and Kass (2000) and Gelman (2006). These have the advantage of requiring less care in the choice of the hyperparameters. However, we find that, with reasonable care, the conjugate gamma priors can be used in practice. Nonetheless, exploration of other prior families for penalized splines would be well worthwhile, though beyond the scope of this article.

4. SIMULTANEOUS CREDIBLE BOUNDS

Simultaneous credible intervals have been discussed before in the literature, for example, by Besag, Green, Higdon, and Mengersen (1995). The idea can be used to produce simultaneous credible bounds for smoothly estimated functions. To show this, let $f(\cdot)$ be either $m(\cdot)$, $\log\sigma_\varepsilon^2(\cdot)$, $\log\sigma_b^2(\cdot)$, or a derivative of order $q$, $1 \le q \le p$, of one of these functions. It is straightforward to use MCMC output to construct simultaneous credible bounds on $f$ over an arbitrary finite interval $[x_1, x_N]$. Typically, $x_1$ and $x_N$ would be the smallest and largest observed values of $x$.

Let $x_1 < x_2 < \cdots < x_N$ be a fine grid of points on this interval. Let $E\{f(x_i)\}$ and $SD\{f(x_i)\}$ be the posterior mean and standard deviation of $f(x_i)$ estimated from an MCMC sample. Let $M_\alpha$ be the $(1-\alpha)$ sample quantile of $\max_{1\le i\le N}\left|\,[f(x_i) - E\{f(x_i)\}]/SD\{f(x_i)\}\,\right|$ computed from the realizations of $f$ in the MCMC sample. Then $I(x_i) = E\{f(x_i)\} \pm M_\alpha\, SD\{f(x_i)\}$, $i = 1, \ldots, N$, are simultaneous credible intervals. For $N$ large, the upper and lower limits of these intervals can be connected to form simultaneous credible bands. In the remainder of this article we only report simultaneous credible bounds for the functions and their derivatives. In the examples considered in the following, these bounds tend to be roughly 30&ndash;50% wider than pointwise credible bounds.
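The $M_\alpha$ construction is a few lines of code given the MCMC draws of $f$ on the grid; a sketch (array layout and names are ours):

```python
import numpy as np

def simultaneous_band(f_draws, alpha=0.05):
    """Simultaneous (1 - alpha) credible band from MCMC output.

    f_draws : (S, N) array; row s holds one MCMC realization of f on the grid.
    Returns (lower, upper, m_alpha).
    """
    mean = f_draws.mean(axis=0)                # E{f(x_i)} estimated from the sample
    sd = f_draws.std(axis=0, ddof=1)           # SD{f(x_i)}
    # max over the grid of the absolute standardized deviation, one per draw
    max_dev = np.abs((f_draws - mean) / sd).max(axis=1)
    m_alpha = np.quantile(max_dev, 1 - alpha)  # (1 - alpha) sample quantile
    return mean - m_alpha * sd, mean + m_alpha * sd, m_alpha

# toy illustration: independent standard normal "draws" on a 50-point grid
rng = np.random.default_rng(0)
lo, hi, m = simultaneous_band(rng.normal(size=(5000, 50)))
```

Because $M_\alpha$ exceeds the pointwise 95% normal quantile, the resulting band is wider than the pointwise intervals, as expected.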

5. COMPARISON WITH OTHER UNIVARIATE SMOOTHERS

Baladandayuthapani et al. (2005) presented a comprehensive simulation study of their Bayesian spatially adaptive penalized spline model, which is obtained as a particular case of our full model with constant error variance. The results of this study indicate that the method is comparable to, or better than, several other spatially adaptive Bayesian methods on a variety of datasets. Results also showed that the method produced inferences practically indistinguishable from those obtained using the fast frequentist method of Ruppert and Carroll (2000). Since we are unaware of any other estimation methodology that jointly estimates the nonparametric adaptive mean and nonparametric variance models, we will compare our methodology with that of Ruppert and Carroll (2000). We will compare the frequentist properties of our Bayesian methodology, such as MSE and coverage probabilities, calculated by simulating from existing benchmark models.

Consider the regression model
$$y_i = m(x_i) + \varepsilon_i,$$
where $\varepsilon_i \sim N(0, \sigma_\varepsilon^2)$ are independent errors. The first function considered was the Doppler function described by Donoho and Johnstone (1994),
$$m(x) = \sqrt{x(1-x)}\,\sin\left\{\frac{2\pi(1+0.05)}{x+0.05}\right\}.$$
For the Doppler function we considered $n = 2048$ equally spaced $x$'s in $[0, 1]$ and $\sigma_\varepsilon^2 = 1$.

Ruppert and Carroll (2000) proposed the following family of models, related to the Doppler function but allowing for more shape flexibility:
$$m(x; j) = \sqrt{x(1-x)}\,\sin\left\{\frac{2\pi\left(1+2^{(9-4j)/5}\right)}{x+2^{(9-4j)/5}}\right\},$$
where $j = 3$ and $j = 6$ correspond to low and severe spatial heterogeneity, respectively. For this function we considered $n = 400$ equally spaced $x$'s on $[0, 1]$ and $\sigma_\varepsilon^2(x) = 0.04$. We will refer to this as the RC$_j$ function, where $j$ controls the amount of spatial heterogeneity.

Ruppert and Carroll (2000) and Baladandayuthapani et al. (2005) also considered the following function:
$$m(x) = \exp\{-400(x-0.6)^2\} + \frac{5}{3}\exp\{-500(x-0.75)^2\} + 2\exp\{-500(x-0.9)^2\}, \qquad (5.1)$$


Table 1. Average mean square error (AMSE) for different mean functions, based on 100 simulations. The column "True variance" indicates whether the data-generating process is homoscedastic or heteroscedastic. The columns "Fitted error variance" refer to the type of assumption used for fitting the data: "Homoscedastic" uses the Ruppert and Carroll (2000) fitting algorithm, while "Heteroscedastic" uses the Bayesian inferential algorithm described in this article. The mean and log-variance functions were modeled using degree-2 splines with $K_m = K_\varepsilon = 40$ equally spaced knots. The shrinkage process was modeled using degree-1 splines with $K_b = 4$ subknots.

                                    Fitted error variance
    Function      True variance    Homoscedastic    Heteroscedastic
    Doppler       Constant         0.197            0.218
    RC (j = 3)    Constant         0.011            0.012
    RC (j = 6)    Constant         0.061            0.065
    3 Bumps       Constant         0.054            0.055
    3 Bumps       Variable         0.031            0.026

which is roughly constant on $[0, 0.5]$ and has three sharp peaks at 0.6, 0.75, and 0.9. For this function we considered $n = 1000$ equally spaced $x$'s on $[0, 1]$ and two different scenarios for the standard deviation of the error process: (a) $\sigma_\varepsilon(x) = 0.5$ for homoscedastic errors, and (b) $\sigma_\varepsilon(x) = 0.5 - 0.8x + 1.6(x - 0.5)_+$. We will refer to this as the "three-bump" function.
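For concreteness, the three benchmark mean functions and the heteroscedastic standard deviation above can be written down directly; a sketch (function names are ours):

```python
import numpy as np

def doppler(x):
    # Donoho-Johnstone Doppler function
    return np.sqrt(x * (1 - x)) * np.sin(2 * np.pi * (1 + 0.05) / (x + 0.05))

def rc(x, j):
    # Ruppert-Carroll family; j = 3 (low) and j = 6 (severe) spatial heterogeneity
    e = 2 ** ((9 - 4 * j) / 5)
    return np.sqrt(x * (1 - x)) * np.sin(2 * np.pi * (1 + e) / (x + e))

def three_bumps(x):
    # Equation (5.1): roughly constant on [0, 0.5], peaks at 0.6, 0.75, 0.9
    return (np.exp(-400 * (x - 0.6) ** 2)
            + (5 / 3) * np.exp(-500 * (x - 0.75) ** 2)
            + 2 * np.exp(-500 * (x - 0.9) ** 2))

def sigma_three_bumps(x):
    # scenario (b): sigma_eps(x) = 0.5 - 0.8 x + 1.6 (x - 0.5)_+
    return 0.5 - 0.8 * x + 1.6 * np.maximum(x - 0.5, 0.0)

# simulate one dataset as in the text: n = 1000 equally spaced x's on [0, 1]
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 1000)
y = three_bumps(x) + sigma_three_bumps(x) * rng.normal(size=x.size)
```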

We used 100 simulations from these models and fit a spatially adaptive penalized spline that uses a constant variance. For the three-bump function we also fitted a model with nonconstant variance, modeled as a log-penalized spline. For the mean and log-variance functions we used quadratic penalized splines with $K_m = K_\varepsilon = 40$ knots. For the log-variance corresponding to the shrinkage process we used $K_b = 4$ subknots.

We calculated the mean square error for each simulated dataset as
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left\{\widehat{m}(x_i) - m(x_i)\right\}^2,$$
where $\widehat{m}(x)$ is the posterior mean of $m(x)$ under a given model. Table 1 provides a summary of the average MSE (AMSE) results over the 100 simulations, comparing the method that allows for heteroscedastic errors with the one that does not. Remarkably, our method that estimates a nonconstant variance performed well compared to methods that assume a constant variance, even when the variance is constant. In contrast, when the true variance process is not constant, our method performed better.
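The MSE/AMSE computation is a one-liner given the fitted curves; a sketch under assumed array shapes (names are ours):

```python
import numpy as np

def amse(m_hat, m_true):
    """Average MSE over simulated datasets.

    m_hat  : (n_sim, n) array; row s is the posterior-mean fit from simulation s.
    m_true : (n,) array; the true mean function at the design points.
    """
    mse_per_sim = np.mean((m_hat - m_true) ** 2, axis=1)  # MSE for each dataset
    return mse_per_sim.mean()                             # average over simulations
```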

Our method was slightly outperformed by the Ruppert and Carroll (2000) method in the Doppler case. This is probably because the Ruppert and Carroll method assumes a constant variance for the Doppler function, and the variance used for the simulations was $\sigma_\varepsilon^2 = 1$. However, both methods outperformed the method of Baladandayuthapani et al. (2005), who reported an AMSE of 0.024 for the Doppler function.

For the three-bump function with constant variance $\sigma_\varepsilon(x) = 0.5$, the two models produced practically indistinguishable MSEs. Note that our AMSE over all $x$'s using 100 simulations was AMSE = 0.0054, which is smaller than the 0.0061 reported by Baladandayuthapani et al. (2005). The coverage probabilities of the 95% credible intervals were very similar for the two models for each value of the covariate, with a slight advantage in favor of the model using homoscedastic errors. The average coverage probability was 94.7% for the model with homoscedastic errors and 93.5% for the model with heteroscedastic errors. These coverage probabilities are very similar to the ones reported by Baladandayuthapani et al. (2005) in their Figure 3.

Figure 3. Comparison of pointwise coverage probabilities of the 95% credible intervals in 500 simulations. Data were simulated from the model described in Section 5 with heteroscedastic error variance. Estimation was done using two methods: Bayesian adaptive penalized splines with homoscedastic errors (dashed) and Bayesian adaptive penalized splines with heteroscedastic errors (solid). The coverage probabilities have been smoothed to remove Monte Carlo variability.

For case (b), $\sigma_\varepsilon(x) = 0.5 - 0.8x + 1.6(x - 0.5)_+$, the heteroscedastic model substantially outperformed the homoscedastic model, both in terms of AMSE and coverage probabilities. In this situation it would be misleading to compare only the average coverage probability for the 95% credible intervals. Indeed, these averages are very close, 94.1% for the heteroscedastic and 93.0% for the homoscedastic method, but they are obtained from very different sets of pointwise coverage probabilities. Figure 3 displays the coverage probabilities for these two methods. Note that for the heteroscedastic method the coverage probability in the $[0, 0.5]$ interval is close to 95%. In the same interval, the coverage probability for the homoscedastic method starts from around 0.8 and increases until it crosses the 95% target around 0.2. Moreover, in the interval $[0.3, 0.5]$ this coverage probability is estimated to be 1. This is due to the fact that on this interval $\sigma_\varepsilon(x)$ decreases linearly from 0.5 to 0.1. Since the homoscedastic method assumes a constant variance, the credible intervals will tend to be shorter than nominal in regions of higher variability ($x$ close to zero), thus producing lower coverage probabilities. In regions of smaller variability ($x$ close to 0.5), the credible intervals will tend to be much wider, thus producing extremely large coverage probabilities.

In the interval $[0.55, 0.65]$ the homoscedastic method slightly outperforms the heteroscedastic method. This seems to be the effect of a "lucky" combination of two factors. As we discussed, the size of the credible intervals for the homoscedastic method in a neighborhood of 0.5 is much larger than nominal. However, at the same point the mean function changes from a constant to a rapidly oscillating function, and the coverage probabilities drop roughly at the same rate. In the interval $[0.8, 1]$ one can notice a phenomenon very similar to the one described for the interval $[0, 0.5]$. While the function oscillates more rapidly in this region, the heteroscedastic adaptive method produces credible intervals with coverage probabilities close to the nominal 95%. However, the homoscedastic method does not take into account the increased variability and produces credible intervals that are too short.

For these examples, both methods performed roughly similarly on simulated datasets that did not require variance estimation. However, when the error variance is not constant, the method using log-penalized splines to estimate the error variance considerably outperforms the method that does not.

6. LOW RANK MULTIVARIATE SMOOTHING

In this section we generalize the ideas in Section 2 to multivariate smoothing, while preserving the appealing geometric interpretation of the smoother. Consider the regression $y_i = m(\mathbf{x}_i) + \varepsilon_i$, where $m(\cdot)$ is a smooth function of $L$ covariates. We will use radial basis functions, which have the advantage of being more numerically stable than the tensor product of truncated polynomials in more than one dimension. Suppose that $\mathbf{x}_i \in \mathbb{R}^L$, $1 \le i \le n$, are $n$ vectors of covariates and $\boldsymbol{\kappa}_k \in \mathbb{R}^L$, $1 \le k \le K_m$, are $K_m$ knots. Consider the following distance function:
$$C(\mathbf{r}) = \begin{cases} \|\mathbf{r}\|^{2M-L} & \text{for } L \text{ odd,} \\ \|\mathbf{r}\|^{2M-L}\log\|\mathbf{r}\| & \text{for } L \text{ even,} \end{cases}$$
where $\|\cdot\|$ denotes the Euclidean norm in $\mathbb{R}^L$ and the integer $M$ controls the smoothness of $C(\cdot)$. Denote by $\mathbf{X}$ the matrix with $i$th row $\mathbf{X}_i = [1, \mathbf{x}_i^T]$, let $\mathbf{Z}_{K_m} = \left[C(\|\mathbf{x}_i - \boldsymbol{\kappa}_k\|)\right]_{1\le i\le n,\ 1\le k\le K_m}$ and $\boldsymbol{\Omega}_{K_m} = \left[C(\|\boldsymbol{\kappa}_k - \boldsymbol{\kappa}_{k'}\|)\right]_{1\le k\le K_m,\ 1\le k'\le K_m}$, and define $\mathbf{Z} = \mathbf{Z}_{K_m}\boldsymbol{\Omega}_{K_m}^{-1/2}$, where $\boldsymbol{\Omega}_{K_m}^{1/2}$ is the principal square root of $\boldsymbol{\Omega}_{K_m}$. With these notations, the low rank approximation of thin plate spline regression can be obtained as the BLUP in the LMM (Kammann and Wand 2003; Ruppert, Wand, and Carroll 2003):
$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{b} + \boldsymbol{\varepsilon}, \qquad E\begin{pmatrix} \mathbf{b} \\ \boldsymbol{\varepsilon} \end{pmatrix} = \begin{pmatrix} \mathbf{0} \\ \mathbf{0} \end{pmatrix}, \qquad \mathrm{cov}\begin{pmatrix} \mathbf{b} \\ \boldsymbol{\varepsilon} \end{pmatrix} = \begin{pmatrix} \sigma_b^2 I_{K_m} & 0 \\ 0 & \sigma_\varepsilon^2 I_n \end{pmatrix}. \qquad (6.1)$$
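The design matrices above can be assembled in a few lines; a sketch (names are ours, and the SVD-based matrix square root is a common numerical device for forming $\boldsymbol{\Omega}_{K_m}^{-1/2}$, not prescribed by the text):

```python
import numpy as np

def tps_radial(r, M, L):
    # C(r) from the display above, applied elementwise to a distance matrix r
    if L % 2 == 1:
        return r ** (2 * M - L)
    with np.errstate(divide="ignore", invalid="ignore"):
        val = r ** (2 * M - L) * np.log(r)
    return np.where(r == 0.0, 0.0, val)   # lim_{r -> 0} r^{2M-L} log r = 0

def tps_design(x, knots, M=2):
    """Low-rank thin-plate design: X with rows [1, x_i^T] and Z = Z_Km Omega^{-1/2}."""
    n, L = x.shape
    X = np.column_stack([np.ones(n), x])
    Z_K = tps_radial(np.linalg.norm(x[:, None, :] - knots[None, :, :], axis=2), M, L)
    Omega = tps_radial(np.linalg.norm(knots[:, None, :] - knots[None, :, :], axis=2), M, L)
    U, s, Vt = np.linalg.svd(Omega)
    Omega_half = U @ np.diag(np.sqrt(s)) @ Vt     # matrix square root via SVD
    Z = np.linalg.solve(Omega_half.T, Z_K.T).T    # Z = Z_K Omega^{-1/2}
    return X, Z

rng = np.random.default_rng(2)
x = rng.uniform(size=(200, 2))        # L = 2 covariates
knots = rng.uniform(size=(10, 2))     # K_m = 10 knots
X, Z = tps_design(x, knots)
```

With these $\mathbf{X}$ and $\mathbf{Z}$, the fit is the BLUP in the LMM (6.1), which any mixed-model software can compute.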

This model contains only one variance component, $\sigma_b^2$, controlling the shrinkage of $\mathbf{b}$, which is equivalent to one global smoothing parameter $\lambda = \sigma_b^2/\sigma_\varepsilon^2$, and it implicitly assumes homoscedastic errors. To relax these assumptions, we consider a new set of knots $\boldsymbol{\kappa}_1^*, \ldots, \boldsymbol{\kappa}_{K_b}^*$ and define $\mathbf{X}^*$ and $\mathbf{Z}^*$ similarly to the matrices $\mathbf{X}$ and $\mathbf{Z}$, where the $x$-covariates are replaced by the knots $\boldsymbol{\kappa}_k$ and the knots are replaced by the subknots $\boldsymbol{\kappa}_k^*$. Consider the following model for the response:
$$y_i = \beta_0 + \sum_{j=1}^{L}\beta_j x_{ij} + \sum_{k=1}^{K_m} b_k z_{ik} + \varepsilon_i, \quad
\varepsilon_i \sim N\{0, \sigma_\varepsilon^2(\mathbf{x}_i)\},\ i = 1, \ldots, n, \quad
b_k \sim N\{0, \sigma_b^2(\boldsymbol{\kappa}_k)\},\ k = 1, \ldots, K_m, \qquad (6.2)$$
where the error variance $\log\sigma_\varepsilon^2(\mathbf{x}_i)$ is modeled as a low rank log-thin plate spline,
$$\log\sigma_\varepsilon^2(\mathbf{x}_i) = \gamma_0 + \sum_{j=1}^{L}\gamma_j x_{ij} + \sum_{k=1}^{K_\varepsilon} c_k z_{ik}, \qquad c_k \sim N(0, \sigma_c^2),\ k = 1, \ldots, K_\varepsilon, \qquad (6.3)$$
and the random coefficient variance $\log\sigma_b^2(\boldsymbol{\kappa}_k)$ is modeled as another low rank log-thin plate spline,
$$\log\sigma_b^2(\boldsymbol{\kappa}_k) = \delta_0 + \sum_{j=1}^{L}\delta_j x_{kj}^* + \sum_{j=1}^{K_b} d_j z_{kj}^*, \qquad d_j \sim N(0, \sigma_d^2),\ j = 1, \ldots, K_b, \qquad (6.4)$$
where $x_{kj}^*$ and $z_{kj}^*$ are the entries of the $\mathbf{X}^*$ and $\mathbf{Z}^*$ matrices, respectively, and the $c_k$ and $d_j$ are assumed mutually independent.

In the case of low rank smoothers, the set of knots for the covariates and the subknots for modeling the shrinkage process have to be chosen. One possibility is to use equally spaced knots and subknots. We use this approach in our simulation study in Section 7, to make our method comparable, from this point of view, with previously published methods. Another possibility is to select the knots and subknots using a space filling design (Nychka and Saltzman 1998), which is based on the maximal separation principle. This avoids wasting knots and is likely to lead to better approximations in sparse regions of the data. The cover.design() function from the R package fields (Fields Development Team 2006) provides software for space filling knot selection. We use this approach in Section 8, in the Noshiro elevation example.
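A greedy farthest-point rule illustrates the maximal separation idea; note this is only a crude stand-in for the swapping algorithm actually used by cover.design():

```python
import numpy as np

def greedy_maximin_knots(x, n_knots, start=0):
    """Greedy farthest-point selection of n_knots rows of x.

    A crude stand-in for a space-filling design: each new knot maximizes the
    distance to the knots already chosen, which avoids wasting knots in dense
    regions of the data.
    """
    idx = [start]                                # arbitrary starting point
    d = np.linalg.norm(x - x[start], axis=1)     # distance to nearest chosen knot
    for _ in range(n_knots - 1):
        j = int(np.argmax(d))                    # farthest point from current knots
        idx.append(j)
        d = np.minimum(d, np.linalg.norm(x - x[j], axis=1))
    return x[np.array(idx)]

rng = np.random.default_rng(3)
locs = rng.uniform(size=(200, 2))                # candidate locations
knots = greedy_maximin_knots(locs, 10)
```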

7. COMPARISONS WITH OTHER ADAPTIVE SURFACE FITTING METHODS

Lang and Brezger (2004) compared their adaptive surface smoothing method with several methods used in a simulation study by Smith and Kohn (1997): MARS (Friedman 1991), "locfit" (Cleveland and Grosse 1991), "tps" (bivariate cubic thin plate splines with a single smoothing parameter), tensor product cubic smoothing splines with five smoothing parameters, and a parametric linear interaction model. We will use the following regression functions, also used by Lang and Brezger:

&bull; $m_1(x_1, x_2) = x_1\sin(4\pi x_2)$, where $x_1$ and $x_2$ are independently uniformly distributed on $[0, 1]$;

&bull; $m_2(x_1, x_2) = 1.5\exp(-8x_1^2) + 3.5\exp(-8x_2^2)$, where $x_1$ and $x_2$ are independently normally distributed with mean 0.5 and variance 0.1.


Function $m_1(\cdot)$ contains complex interactions and moderate spatial variability, while $m_2(\cdot)$ has only main effects and is much smoother. We used a sample size of $n = 300$, $\sigma = \mathrm{range}(f)/4$, and 250 simulations from each model.

We will compare the performance of our method with that of Lang and Brezger's, which performed at least as well as, or better than, other adaptive methods in a variety of examples. To avoid dependence of the results on knot choice, we used the same choice as the one described by Lang and Brezger: $12 \times 12$ equally spaced knots for the mean function. However, we used thin-plate splines instead of tensor products of B-splines. We modeled the smoothing parameter using a low rank log-thin plate spline with an equally spaced $5 \times 5$ knot grid. We considered two estimators: one with constant error variance (CEV), and one with estimated error variance using a log-thin plate spline (EEV) with the same $12 \times 12$ knot grid as for the mean function. The CEV estimator is the one that can be compared directly with the estimators considered by Lang and Brezger (2004). Because Lang and Brezger presented only comparative results using boxplots, we will use their results as read from their Figure 5. The performance of all estimators was measured by the log of the mean squared error, $\log(\mathrm{MSE})$, where $\mathrm{MSE}(\widehat{f}) = (1/n)\sum_{i=1}^{n}\{\widehat{f}(x_i) - f(x_i)\}^2$.

For the CEV estimator and function $f_1(\cdot)$, corresponding to moderate spatial heterogeneity, we obtained a median $\log(\mathrm{MSE})$ of $-3.67$, with an interquartile range of $[-3.80, -3.53]$ and a range of $[-4.21, -3.13]$. These values are better than the best values reported for Lang and Brezger's methods in their Figure 5(b). Lang and Brezger reported that their method, denoted "P-spline," outperforms all other methods in this case and achieved a median $\log(\mathrm{MSE})$ of $-3.50$. Remarkably, the EEV estimator performed almost as well as the CEV estimator in this case and also outperformed the other methods. The median $\log(\mathrm{MSE})$ for EEV was $-3.59$, with an interquartile range of $[-3.74, -3.43]$ and a range of $[-4.14, -3.02]$. These results are consistent with the univariate results reported by Ruppert and Carroll (2000) and Baladandayuthapani, Mallick, and Carroll (2005), who found that allowing the smoothing parameter to vary smoothly outperforms other adaptive techniques.

For function $f_1(\cdot)$, Figure 4 displays the coverage probabilities of the 95% credible intervals calculated over a $20 \times 20$ equally spaced grid in $[0, 1]^2$. For each grid point we calculated the frequency with which the 95% pointwise credible interval covers the true value of the function at that point. As expected, these coverage probabilities show strong spatial correlation, with lower coverage probabilities along the ridges of the sinus function. Coverage probability is lowest when $x_1$ is in the $[0.2, 0.5]$ range. The signal-to-noise ratio in this region is about half the signal-to-noise ratio in the region corresponding to $x_1$ close to 1, which explains why the coverage probability is smaller in this region. Interestingly, the zeros of the true function, appearing at $x_2 = 0.25, 0.50, 0.75$, are covered with at least 95% probability. Another interesting feature is the lower coverage probabilities near the north, south, and east boundaries. These features of the pointwise coverage probability map are consistent with the frequentist analysis of Krivobokova et al. (2007), but are not consistent with the coverage probabilities reported by Lang and Brezger for their Bayesian P-spline method.

For our CEV estimator and function $f_2(\cdot)$, corresponding to very low spatial heterogeneity, we obtained a median $\log(\mathrm{MSE})$ of $-5.95$, with an interquartile range of $[-6.12, -5.66]$ and a range of $[-6.51, -5.41]$. The results of Lang and Brezger in their Figure 5(a) indicate that their "P-spline" method achieved a median $\log(\mathrm{MSE})$ of $-6.00$. Our method performs roughly similarly to the Bayesian P-spline method of Lang and Brezger and to the cubic thin plate spline (tps), being outperformed only by Lang and Brezger's adaptive P-spline method with two main effects, "P-spline main." We could also add two main effects to our bivariate adaptive smoother to improve MSE for this example, but this is not our concern in this article. Again, the EEV estimator performed very similarly to our CEV estimator.

Figure 4. Coverage probability of the pointwise 95% credible intervals for the mean function $f_1(x_1, x_2) = x_1\sin(4\pi x_2)$ with a constant error standard deviation $\sigma = \mathrm{range}(f)/4$. Probabilities are computed on a $20 \times 20$ equally spaced grid of points in $[0, 1]^2$. The model used was a thin plate spline with $K = 100$ knots for the mean function, a log-thin plate spline with $K = 100$ knots for the variance function, and a log-thin plate spline with $K^* = 16$ knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was $m = 2$.

The last simulation study was done using the function $f_1(\cdot, \cdot)$, where $x_1, x_2$ are independent, uniformly distributed in $[0, 1]$, with an error standard deviation function
$$\sigma_\varepsilon(x_1, x_2) = \frac{r}{32} + \frac{3r}{32}\,x_1^2,$$
where $r = \mathrm{range}(f_1)$. We compared our CEV and EEV estimators described at the beginning of this section, and, as expected, the EEV estimator outperformed the CEV estimator both in terms of $\log(\mathrm{MSE})$ and coverage probabilities. More precisely, for $\log(\mathrm{MSE})$ we obtained a median of $-5.26$ for CEV and $-5.35$ for EEV, an interquartile range of $[-5.37, -5.15]$ for CEV and $[-5.48, -5.24]$ for EEV, and a range of $[-5.72, -4.65]$ for CEV and $[-5.84, -4.84]$ for EEV. For both estimators the general patterns for coverage probabilities were the ones displayed in Figure 4. EEV outperformed CEV in terms of coverage probabilities. For example, the regions where the nominal level is exceeded for both estimators have a roughly similar shape and location, but the EEV coverage probabilities tend to be closer to their nominal level.

Figure 5. Locations of sampling (dots) with 100 knots (circles) in the Noshiro example.

Note that our model incorporating nonparametric variance estimation is more general than the models considered in the literature. Nevertheless, its performance for function $m_1(\cdot, \cdot)$, with moderate spatial variability and constant variance, was better than that of all methods considered by Lang and Brezger in their simulation study (see their Figure 5, panel b). Our method also has good performance for the function $m_2(\cdot, \cdot)$, being the second best when compared to the methods considered by Lang and Brezger (see their Figure 5, panel a). For the case when the variance is not constant, our adaptive method that estimates the variance nonparametrically outperformed our adaptive method that estimates a constant variance.

8. THE NOSHIRO EXAMPLE

Noshiro, Japan, was the site of a major earthquake, and much of the damage from the quake was due to soil movement. Professor Thomas O'Rourke, a geotechnical engineer, was investigating the factors that might help predict soil movement during a future quake. One factor thought to be of importance was the slope of the land. To estimate the slope, Ruppert (1997) used a dataset with 799 observations, where $\mathbf{x}$ was longitude and latitude and $y$ was elevation at locations in Noshiro. The object of primary interest is the gradient, at least its magnitude and possibly its direction.

We used the thin-plate spline model (6.2) for the mean of $y$, with $K_m = 100$ knots selected by the maximum separation criterion implemented in the function cover.design() of the R package fields (Nychka, Haaland, O'Connell, and Ellner 1998). The sampling locations and the knots used for fitting are displayed in Figure 5. The log-thin plate spline described in (6.3), with the same $K_\varepsilon = 100$ knots, was used to model the variance of the error process, $\sigma_\varepsilon^2(\mathbf{x}_i)$. The log-thin plate spline described in (6.4), with $K_b = 16$ subknots, was used to model the variance $\sigma_b^2(\boldsymbol{\kappa}_k)$, which controls the amount of shrinkage of the $b$ parameters. The subknots were obtained using the cover.design() function applied to the 100 knots.

Figure 6. Posterior mean of the mean regression function for the Noshiro example. The model used was a thin plate spline with $K = 100$ knots for the mean function, a log-thin plate spline with $K = 100$ knots for the variance function, and a log-thin plate spline with $K^* = 16$ knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was $m = 2$.

Figure 6 displays the posterior mean of the mean response function. There is a rather sharp peak around $(0.42, 0.4)$, with a slow decay from the peak in the general south direction and a much faster decay in all other directions. The function becomes smoother towards the boundary. This is exactly the type of function for which adaptive spatial smoothing can substantially improve the fit.

Figure 7 shows that in large areas of the map $\sigma_\varepsilon(\cdot)$ is smooth, with small values. However, in a neighborhood of the peak of the mean function, $\sigma_\varepsilon(\cdot)$ displays two relatively sharp maxima, located NNW and SSE of the peak. We also performed a two-stage frequentist analysis. In the first stage we used a low rank thin plate spline for the mean function, and the results from this regression were used to obtain residuals. We then fitted another low rank thin plate spline to the absolute values of these residuals. While this map was not identical to the one in Figure 7, it did exhibit the same patterns of heteroscedasticity.

The process logσ 2b (κκκk) controlling the shrinkage of the bk is displayed in Figure 8

282 C M Crainiceanu et al

Figure 7 Posterior mean of the standard deviation σε(xi ) of the error process function for the Noshiro exampleThe model used was a thin plate spline with K = 100 knots for the mean function a log-thin plate spline withK = 100 knots for the variance function and a log-thin plate spline with Klowast = 16 knots for the spatially adaptiveshrinkage parameter The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2

Figure 8 Posterior mean of the shrinkage process minus logσ 2b(κκκk) The model used was a thin plate spline with

K = 100 knots for the mean function a log-thin plate spline with K = 100 knots for the variance function and alog-thin plate spline with Klowast = 16 knots for the spatially adaptive shrinkage parameter The parameter controllingthe degree of smoothness of the thin-plate spline basis was m = 2

Spatially Adaptive Bayesian Penalized Splines 283

Smaller values of this function correspond to less shrinkage of bk towards zero and morelocal behavior of the smoother Figure 8 indicates that in a neighborhood of the peak of themean function the shrinkage is smaller to allow the mean function to change rapidly Awayfrom the peak the shrinkage is larger corresponding to a smoother mean

9. IMPLEMENTATION USING MCMC

We will now provide some details for MCMC simulations of model (2.1), where the variances $\sigma_\varepsilon^2(x_i)$ and $\sigma_b^2(\kappa_k)$ are modeled by Equations (2.3) and (2.4). The implementation for other models, for example for the multivariate smoothing in Section 6, is similar. Consider independent normal priors for the coefficients of the monomials, $\beta_i \sim N(0, \sigma_{0\beta}^2)$, $\gamma_i \sim N(0, \sigma_{0\gamma}^2)$, $i = 0, \ldots, p$, and $\delta_i \sim N(0, \sigma_{0\delta}^2)$, $i = 0, \ldots, q$, and independent inverse Gamma priors for the variance components, $\sigma_c^2 \sim \mathrm{IGamma}(a_c, b_c)$ and $\sigma_d^2 \sim \mathrm{IGamma}(a_d, b_d)$. Using these priors, many full conditionals of the posterior distribution are easy to derive, while a few have complex multivariate forms. Our implementation of the MCMC using multivariate Metropolis-Hastings steps proved to be unstable, with poor mixing properties. A simple and reliable solution was to change the model by adding error terms to the log-spline models, that is,

$$\log\sigma_\varepsilon^2(x_i) = \gamma_0 + \cdots + \gamma_p x_i^p + \sum_{s=1}^{K_\varepsilon} c_s (x_i - \kappa_s^\varepsilon)_+^p + u_i, \qquad
\log\sigma_b^2(\kappa_k^m) = \delta_0 + \cdots + \delta_q (\kappa_k^m)^q + \sum_{s=1}^{K_b} d_s (\kappa_k^m - \kappa_s^b)_+^q + v_k, \qquad (9.1)$$

where $u_i \sim N(0, \sigma_u^2)$ and $v_k \sim N(0, \sigma_v^2)$. This idea was also used for $\sigma_b^2$ by Baladandayuthapani, Mallick, and Carroll (2005). We fixed the values $\sigma_u^2 = \sigma_v^2 = 0.01$, as these variances appear nonidentifiable and a standard deviation of 0.1 is small on a log scale. This device makes computations feasible by replacing sampling from complex full conditionals with simple univariate MH steps.

Define, as before, $\boldsymbol{\Sigma}_\varepsilon$, $\boldsymbol{\Sigma}_b$, and $\boldsymbol{\theta}_X$, and define $\boldsymbol{\theta}_\varepsilon = (\boldsymbol{\gamma}^T, \mathbf{c}^T)^T$, $\boldsymbol{\theta}_b = (\boldsymbol{\delta}^T, \mathbf{d}^T)^T$, $\mathbf{C}_X = (\mathbf{X}, \mathbf{Z})$, $\mathbf{C}_\varepsilon = (\mathbf{X}_\varepsilon, \mathbf{Z}_\varepsilon)$, and $\mathbf{C}_b = (\mathbf{X}_b, \mathbf{Z}_b)$, where $\mathbf{X}$, $\mathbf{X}_\varepsilon$, $\mathbf{X}_b$ contain the monomials and $\mathbf{Z}$, $\mathbf{Z}_\varepsilon$, $\mathbf{Z}_b$ contain the truncated polynomials of the spline models for $m$, $\log\sigma_\varepsilon^2$, and $\log\sigma_b^2$, respectively. Also denote
$$\boldsymbol{\Sigma}_{0X} = \begin{bmatrix} \sigma_{0\beta}^2 I_{p+1} & 0 \\ 0 & \boldsymbol{\Sigma}_b \end{bmatrix}, \qquad
\boldsymbol{\Sigma}_{0\varepsilon} = \begin{bmatrix} \sigma_{0\gamma}^2 I_{p+1} & 0 \\ 0 & \sigma_c^2 I_{K_\varepsilon} \end{bmatrix}, \qquad
\boldsymbol{\Sigma}_{0b} = \begin{bmatrix} \sigma_{0\delta}^2 I_{q+1} & 0 \\ 0 & \sigma_d^2 I_{K_m} \end{bmatrix}.$$

The full conditionals of the posterior are detailed below:

1. $[\boldsymbol{\theta}_X] \sim N\left(M_X C_X^T \boldsymbol{\Sigma}_\varepsilon^{-1}\mathbf{Y},\ M_X\right)$, where $M_X = \left(C_X^T \boldsymbol{\Sigma}_\varepsilon^{-1} C_X + \boldsymbol{\Sigma}_{0X}^{-1}\right)^{-1}$.

2. $[\boldsymbol{\theta}_\varepsilon] \sim N\left(M_\varepsilon C_\varepsilon^T \mathbf{Y}_\varepsilon/\sigma_u^2,\ M_\varepsilon\right)$, where $\mathbf{Y}_\varepsilon = [\log(\sigma_{\varepsilon 1}^2), \ldots, \log(\sigma_{\varepsilon n}^2)]^T$ and $M_\varepsilon = \left(C_\varepsilon^T C_\varepsilon/\sigma_u^2 + \boldsymbol{\Sigma}_{0\varepsilon}^{-1}\right)^{-1}$.

3. $[\boldsymbol{\theta}_b] \sim N\left(M_b C_b^T \mathbf{Y}_b/\sigma_v^2,\ M_b\right)$, where $\mathbf{Y}_b = [\log(\sigma_1^2), \ldots, \log(\sigma_{K_m}^2)]^T$ and $M_b = \left(C_b^T C_b/\sigma_v^2 + \boldsymbol{\Sigma}_{0b}^{-1}\right)^{-1}$.

4. $[\sigma_c^2] \sim \mathrm{IGamma}\left(a_c + K_\varepsilon/2,\ b_c + \|\mathbf{c}\|^2/2\right)$.

5. $[\sigma_d^2] \sim \mathrm{IGamma}\left(a_d + K_b/2,\ b_d + \|\mathbf{d}\|^2/2\right)$.

6. $[\sigma_{\varepsilon i}^2] \propto \sigma_{\varepsilon i}^{-3}\exp\left\{-(y_i - \mu_i)^2/(2\sigma_{\varepsilon i}^2) - [\log(\sigma_{\varepsilon i}^2) - \eta_i]^2/(2\sigma_u^2)\right\}$, where $\mu_i$ and $\eta_i$ are the $i$th components of $C_X\boldsymbol{\theta}_X$ and $C_\varepsilon\boldsymbol{\theta}_\varepsilon$, respectively.

7. $[\sigma_k^2] \propto \sigma_k^{-3}\exp\left\{-b_k^2/(2\sigma_k^2) - [\log(\sigma_k^2) - \zeta_k]^2/(2\sigma_v^2)\right\}$, where $\zeta_k$ is the $k$th component of $C_b\boldsymbol{\theta}_b$.

All of the above conditionals have an explicit form, with the exception of the $n + K_m$ one-dimensional conditionals in 6 and 7. For these distributions we use the Metropolis-Hastings algorithm with a normal proposal distribution centered at the current value and with a small variance.

Figure 9. Computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with $K_m$ knots, with the log of the variance of the shrinkage process modeled as another penalized spline with $K_b = 4$ subknots. The log of the error variance is modeled as another penalized spline with $K_\varepsilon = K_m$ knots. Different symbols correspond to increasing numbers of knots: dots ($K_m = 10$), asterisks ($K_m = 20$), circles ($K_m = 30$), x's ($K_m = 40$).

10. IMPLEMENTATION AND COMPUTATIONAL PERFORMANCE

Our MCMC simulation algorithm provided reliable results and is reasonably fast. To illustrate this, Figure 9 displays computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with $K_m$ knots, with the log of the variance of the shrinkage process modeled as another penalized spline with $K_b = 4$ subknots. The log of the error variance is modeled as another penalized spline with $K_\varepsilon = K_m$ knots. Different symbols correspond to increasing numbers of knots: dots ($K_m = 10$), asterisks ($K_m = 20$), circles ($K_m = 30$), x's ($K_m = 40$). Computation time increases roughly linearly with the number of observations, with the same zero intercept and larger slopes for larger numbers of knots. Note that even with $n = 1000$ observations and $K_m = K_\varepsilon = 40$, simulation time is about two minutes.

Figure 10. MCMC histories for eight parameters of interest, including the log of the standard deviation of the error process and the log of the standard deviation of the shrinkage process, as well as some parameters modeling the different functions. The model uses 40 equally spaced knots for the mean and error variance functions and 4 subknots for the adaptive shrinkage process. Chains correspond to the Doppler function example, using quadratic truncated polynomials to fit the mean and log-variance functions and a linear truncated polynomial basis for fitting the shrinkage process. For clarity of presentation, only every 20th simulated observation is displayed.

For the models and datasets presented in this article, 10,000 simulations proved sufficient for accurate and reliable inference. Some parameters, such as the mean and variance functions, tended to exhibit much better mixing properties than the variance components $\sigma_c^2$ and $\sigma_d^2$ controlling the shrinkage processes. This is probably due to the reduced amount of information in the data about these parameters. The MCMC convergence and mixing properties were assessed by visual inspection of the chain histories of many parameters of interest. Since we started with good starting values of the parameters, convergence is fast. Figure 10 displays the histories of eight parameters of interest for the model fitted to the Doppler function described in Section 5. For clarity we display only every 100th sampled value of the chain, but the same patterns would be observed in the full chain history. Similar good mixing properties have been noted in all the other examples presented in this article.

In the search for a good software platform, we implemented our algorithms in MATLAB, C, and FORTRAN. Programs can be downloaded from www.biostat.jhsph.edu/~ccrainic. For our application and implementation there were no serious differences between C and FORTRAN, both of which performed better than MATLAB, reducing computation time by a factor of two to three. The reported times in Figure 9 are for our implementation in C.

Given the complexity of the model, we were interested in improving the mixing properties of the Markov chains. The computational trick described in Section 9 was the sine qua non method that allowed us to replace an intractable multivariate MH step with several tractable univariate MH steps. We used normal random-walk proposal distributions centered at the current value, with variance tuned to provide acceptance rates between 30% and 50%.

11. COMMENTS

Our joint models for heterogeneous mean and nonconstant variance are designed to extract the maximum amount of information from complex data and suggest simpler parametric models. Bayesian inference based on MCMC offers a natural and powerful framework for model fitting. However, MCMC in this context is not straightforward, and implementation depends essentially on the ability to transform intractable simulations from high-dimensional distributions into tractable one-dimensional updates, as described in Section 9. Our implementation provides accurate, fast, and reproducible results, and software is available. Actual simulation time increases only linearly with the number of observations, and mixing properties are adequate provided that proposal distribution variances are tuned for optimal acceptance rates.

ACKNOWLEDGMENTS

Carroll's research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

[Received September 2004. Revised October 2006.]

REFERENCES

Baladandayuthapani, V., Mallick, B. K., and Carroll, R. J. (2005), "Spatially Adaptive Bayesian Regression Splines," Journal of Computational and Graphical Statistics, 14, 378–394.

Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computation and Stochastic Systems" (with discussion), Statistical Science, 10, 3–66.

Carroll, R. J. (2003), "Variances Are Not Always Nuisance Parameters: The 2002 R. A. Fisher Lecture," Biometrics, 59, 211–220.

Carroll, R. J., and Ruppert, D. (1988), Transformation and Weighting in Regression, New York: Chapman and Hall.


Cleveland, W. S., and Grosse, E. (1991), "Computational Methods for Local Regression," Statistics and Computing, 1, 47–62.

Crainiceanu, C., and Ruppert, D. (2004), "Likelihood Ratio Tests in Linear Mixed Models With One Variance Component," Journal of the Royal Statistical Society, Series B, 66, 165–185.

Crainiceanu, C. M., Ruppert, D., and Wand, M. P. (2005), "Bayesian Analysis for Penalized Spline Regression Using WinBUGS," Journal of Statistical Software, 14. Available online at http://www.jstatsoft.org/v14/i14.

Donoho, D. L., and Johnstone, I. M. (1994), "Ideal Spatial Adaptation by Wavelet Shrinkage," Biometrika, 81, 425–455.

Eilers, P. H. C., and Marx, B. D. (1996), "Flexible Smoothing With B-Splines and Penalties" (with discussion), Statistical Science, 11, 89–121.

——— (2002), "Generalized Linear Additive Smooth Structures," Journal of Computational and Graphical Statistics, 11, 758–783.

——— (2004), "Splines, Knots, and Penalties," unpublished manuscript.

Fields Development Team (2006), fields: Tools for Spatial Data, Boulder, CO: National Center for Atmospheric Research. Available online at http://www.cgd.ucar.edu/Software/Fields.

Friedman, J. H. (1991), "Multivariate Adaptive Regression Splines" (with discussion), The Annals of Statistics, 19, 1–141.

Gelman, A. (2006), "Prior Distributions for Variance Parameters in Hierarchical Models," Bayesian Analysis, 1, 515–533.

Gray, R. J. (1994), "Spline-Based Tests in Survival Analysis," Biometrics, 50, 640–652.

Holst, U., Hossjer, O., Bjorklund, C., Ragnarson, P., and Edner, H. (1996), "Locally Weighted Least Squares Kernel Regression and Statistical Evaluation of LIDAR Measurements," Environmetrics, 7, 401–416.

Jullion, A., and Lambert, P. (2006), Discussion Paper DP0534, UCL.

Kammann, E. E., and Wand, M. P. (2003), "Geoadditive Models," Applied Statistics, 52, 1–18.

Kauermann, G. (2002), "A Note on Bandwidth Selection for Penalised Spline Smoothing," Technical Report 02-13, Department of Statistics, University of Glasgow.

Krivobokova, T., Crainiceanu, C. M., and Kauermann, G. (2007), "Fast Adaptive Penalized Splines," Journal of Computational and Graphical Statistics, to appear.

Lang, S., and Brezger, A. (2004), "Bayesian P-Splines," Journal of Computational and Graphical Statistics, 13, 183–212.

Lang, S., Fronk, E. M., and Fahrmeir, L. (2002), "Function Estimation With Locally Adaptive Dynamic Models," Computational Statistics, 17, 479–499.

Marx, B. D., and Eilers, P. H. C. (1998), "Direct Generalized Additive Modeling With Penalized Likelihood," Computational Statistics & Data Analysis, 28, 193–209.

Natarajan, R., and Kass, R. E. (2000), "Reference Bayesian Methods for Generalized Linear Mixed Models," Journal of the American Statistical Association, 95, 227–237.

Ngo, L., and Wand, M. P. (2004), "Smoothing With Mixed Model Software," Journal of Statistical Software, 9. Available online at http://www.jstatsoft.org/v09/i01.

Nychka, D., Haaland, P., O'Connell, M., and Ellner, S. (1998), "FUNFITS: Data Analysis and Statistical Tools for Estimating Functions," in Case Studies in Environmental Statistics, Lecture Notes in Statistics, vol. 132, eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 159–179. Available online at http://www.cgd.ucar.edu/stats/Software/Funfits.

Nychka, D., and Saltzman, N. (1998), "Design of Air Quality Monitoring Networks," in Case Studies in Environmental Statistics, Lecture Notes in Statistics, vol. 132, eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 51–76.


Pintore, A., Speckman, P., and Holmes, C. C. (2006), "Spatially Adaptive Smoothing Splines," Biometrika, 93, 113–125.

Ruppert, D. (1997), "Local Polynomial Regression and its Applications in Environmental Statistics," in Statistics for the Environment (vol. 3), eds. V. Barnett and F. Turkman, Chichester: Wiley, pp. 155–173.

——— (2002), "Selecting the Number of Knots for Penalized Splines," Journal of Computational and Graphical Statistics, 11, 735–757.

Ruppert, D., and Carroll, R. J. (2000), "Spatially-Adaptive Penalties for Spline Fitting," Australian and New Zealand Journal of Statistics, 42, 205–223.

Ruppert, D., Wand, M. P., Holst, U., and Hossjer, O. (1997), Technometrics, 39, 262–273.

Ruppert, D., Wand, M. P., and Carroll, R. J. (2003), Semiparametric Regression, Cambridge: Cambridge University Press.

Smith, M., and Kohn, R. (1997), "A Bayesian Approach to Nonparametric Bivariate Regression," Journal of the American Statistical Association, 92, 1522–1535.

Wood, S. (2006), Generalized Additive Models: An Introduction With R, New York: Chapman & Hall/CRC.



If $[\tau_b] \sim \mathrm{Gamma}(A_b, B_b)$ and $[\tau_\varepsilon] \sim \mathrm{Gamma}(A_\varepsilon, B_\varepsilon)$, where $\mathrm{Gamma}(A, B)$ has mean $A/B$ and variance $A/B^2$, then

$$[\tau_b \mid \mathbf{Y}, \boldsymbol{\beta}, \mathbf{b}, \tau_\varepsilon] \sim \mathrm{Gamma}\left(A_b + \frac{K_m}{2},\; B_b + \frac{\|\mathbf{b}\|^2}{2}\right) \qquad (3.1)$$

and

$$[\tau_\varepsilon \mid \mathbf{Y}, \boldsymbol{\beta}, \mathbf{b}, \tau_b] \sim \mathrm{Gamma}\left(A_\varepsilon + \frac{n}{2},\; B_\varepsilon + \frac{\|\mathbf{Y} - X\boldsymbol{\beta} - Z\mathbf{b}\|^2}{2}\right).$$

Also,

$$E(\tau_b \mid \mathbf{Y}, \boldsymbol{\beta}, \mathbf{b}, \tau_\varepsilon) = \frac{A_b + K_m/2}{B_b + \|\mathbf{b}\|^2/2}, \qquad \mathrm{var}(\tau_b \mid \mathbf{Y}, \boldsymbol{\beta}, \mathbf{b}, \tau_\varepsilon) = \frac{A_b + K_m/2}{(B_b + \|\mathbf{b}\|^2/2)^2},$$

and similarly for $\tau_\varepsilon$.

The prior does not influence the posterior distribution of $\tau_b$ when both $A_b$ and $B_b$ are small compared to $K_m/2$ and $\|\mathbf{b}\|^2/2$, respectively. Since the number of knots is $K_m \ge 1$, and in most problems considered $K_m \ge 5$, it is safe to choose $A_b \le 0.01$. When $B_b \ll \|\mathbf{b}\|^2/2$, the posterior distribution is practically unaffected by the prior assumptions. When $B_b$ increases compared to $\|\mathbf{b}\|^2/2$, the conditional distribution is increasingly affected by the prior assumptions. $E(\tau_b \mid \mathbf{Y}, \boldsymbol{\beta}, \mathbf{b}, \tau_\varepsilon)$ is decreasing in $B_b$, indicating that large $B_b$ compared to $\|\mathbf{b}\|^2/2$ corresponds to undersmoothing. Since the posterior variance of $\tau_b$ is also decreasing in $B_b$, a poor choice of $B_b$ will likely result in underestimating the variability of the smoothing parameter $\lambda = \tau_\varepsilon/\tau_b$, causing confidence intervals for the mean function to be too narrow. The condition $B_b \ll \|\mathbf{b}\|^2/2$ shows that the "noninformativeness" of the gamma prior depends essentially on the scale of the problem.
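The sensitivity of these conditional moments can be made concrete with a few lines of arithmetic. The sketch below uses illustrative values of $B_b$ and $\|b\|^2/2$ chosen by us (the latter of the order $10^{-6}$, as in the LIDAR example) to show the posterior mean of $\tau_b$ collapsing once $B_b$ dominates $\|b\|^2/2$.

```python
def tau_b_posterior_moments(A_b, B_b, K_m, half_b_norm2):
    """Posterior mean and variance of tau_b from the conditional
    Gamma(A_b + K_m/2, B_b + ||b||^2/2), rate parameterization."""
    shape = A_b + K_m / 2.0
    rate = B_b + half_b_norm2
    return shape / rate, shape / rate ** 2

# half_b_norm2 = ||b||^2/2 of order 1e-6 (illustrative, LIDAR-like scale)
for B_b in (1e-10, 1e-6, 1e-2):
    mean, var = tau_b_posterior_moments(A_b=0.001, B_b=B_b, K_m=30,
                                        half_b_norm2=1e-6)
    print(f"B_b={B_b:.0e}: E(tau_b)={mean:.3g}, var(tau_b)={var:.3g}")
```

A prior rate $B_b$ four orders of magnitude above $\|b\|^2/2$ shrinks both the posterior mean and variance of $\tau_b$ by a corresponding factor, which is the undersmoothing-with-overconfidence effect described above.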

To show the potential severity of these effects, we consider the model (2.1) with a global smoothing parameter and homoscedastic errors. We used quadratic splines with 30 knots and the light detection and ranging (LIDAR) data for illustration. The LIDAR data were obtained from atmospheric monitoring of pollutants and consist of range, the distance traveled before light is reflected back to its source, and log ratio, the logarithm of the ratio of received light from two laser sources. For more details see Ruppert, Wand, Holst, and Hossjer (1997).

Figure 1 shows the effect of four mean-one Gamma priors for the precision of the truncated polynomial parameters. The variances of these priors are $10$, $10^3$, $10^6$, and $10^{10}$, respectively. Obviously, the first two inferences provide severely undersmoothed, almost indistinguishable posterior means. The third graph is much smoother, but still exhibits roughness, especially in the right-hand side of the plot, while the fourth graph displays a pleasing smooth pattern consistent with our frequentist inference. Using either prior distribution, one obtains that the posterior mean of $\|\mathbf{b}\|^2/2$ is of order $10^{-6}$ to $10^{-5}$. This explains why values of $B_b$ larger than $10^{-6}$ proved inappropriate for this problem.

The size of $\|\mathbf{b}\|^2/2$ depends upon the scaling of the $x$ and $y$ variables, and in the case of the LIDAR data $\|\mathbf{b}\|^2/2$ is small because the standard deviation of $x$ is much larger than the standard deviation of $y$. If $y$ is rescaled to $a_y y$ and $x$ to $a_x x$, then the regression function becomes $a_y m(a_x x)$, whose $p$th derivative is $a_y a_x^p m^{(p)}(a_x x)$, so that $\|\mathbf{b}\|^2/2$ is rescaled by the factor $a_y^2 a_x^{2p}$. Thus, $\|\mathbf{b}\|^2/2$ is particularly sensitive to the scaling of $x$.

Figure 1. LIDAR data with unstandardized covariate: effect of four mean-one Gamma priors for the precision of the truncated polynomial parameters. The variances of these priors are $10$, $10^3$, $10^6$, and $10^{10}$, respectively.

The size of $\|\mathbf{b}\|^2/2$ also depends on $K_m$. Because the integral of $m^{(p)}$ over the range of $x$ will be approximately $\sum_{k=1}^{K_m} b_k \approx \sqrt{K_m}\,\sigma_b$, we can expect that $\sigma_b^2 \propto K_m^{-1}$, and the smoothing parameter should be proportional to $K_m$. For the LIDAR data the GCV-chosen smoothing parameter is 0.0095, 0.0205, 0.0440, and 0.0831 for $K_m$ equal to 15, 30, 60, and 120, respectively, so, as expected, the smoothing parameter approximately doubles as $K_m$ doubles.

Practical experience with LMMs for longitudinal or clustered data should be applied with caution to penalized splines. In a mixed effects model, $\|\mathbf{b}\|^2$ is an estimator of $K_m\sigma_b^2$. For longitudinal data, $K_m\sigma_b^2$ would generally be large because $K_m$ is the number of subjects, with constant subject effect variance $\sigma_b^2$. As just discussed, for a penalized spline $K_m\sigma_b^2$ should be nearly independent of $K_m$ and could be quite small.

Figure 2 presents the same type of results as Figure 1 for Gamma priors for the precision parameter $\tau_b$, with the mean held fixed at $10^{-6}$ and variances equal to $10$, $10^3$, $10^6$, and $10^{10}$, respectively. These prior distributions have a much smaller effect on the posterior mean of the regression function. The fit seems to be undersmoothed when the variance is 10. Clearly, when the variance increases the fit becomes smooth, indicating that a value of the variance larger than $10^3$ will produce a reasonable fit.

A similar discussion holds true for $\tau_\varepsilon$, but now large $B_\varepsilon$ corresponds to oversmoothing, and $\tau_\varepsilon$ does not depend on the scaling of $x$. In applications it is less likely that $B_\varepsilon$ is comparable in size to $\|\mathbf{Y} - X\boldsymbol{\beta} - Z\mathbf{b}\|^2$, because the latter is an estimator of $n\sigma_\varepsilon^2$. If $\hat\sigma_\varepsilon^2$ is an estimator of $\sigma_\varepsilon^2$, a good rule of thumb is to use values of $B_\varepsilon$ smaller than $n\hat\sigma_\varepsilon^2/100$. This rule should work well when $\hat\sigma_\varepsilon^2$ does not have an extremely large variance.

Figure 2. LIDAR data with unstandardized covariate: effect of four Gamma priors with mean $10^{-6}$ for the precision of the truncated polynomial parameters. The variances of these priors are $10$, $10^3$, $10^6$, and $10^{10}$, respectively.

Alternatives to gamma priors are discussed by, for example, Natarajan and Kass (2000) and Gelman (2006). These have the advantage of requiring less care in the choice of the hyperparameters. However, we find that, with reasonable care, the conjugate gamma priors can be used in practice. Nonetheless, exploration of other prior families for penalized splines would be well worthwhile, though beyond the scope of this article.

4. SIMULTANEOUS CREDIBLE BOUNDS

Simultaneous credible intervals have been discussed before in the literature, for example, by Besag, Green, Higdon, and Mengersen (1995). The idea can be used to produce simultaneous credible bounds for smoothly estimated functions. To show this, let $f(\cdot)$ be either $m(\cdot)$, $\log\sigma_\varepsilon^2(\cdot)$, $\log\sigma_b^2(\cdot)$, or a derivative of order $q$, $1 \le q \le p$, of one of these functions. It is straightforward to use MCMC output to construct simultaneous credible bounds on $f$ over an arbitrary finite interval $[x_1, x_N]$. Typically $x_1$ and $x_N$ would be the smallest and largest observed values of $x$.

Let $x_1 < x_2 < \cdots < x_N$ be a fine grid of points on this interval. Let $\hat E\{f(x_i)\}$ and $\widehat{\mathrm{SD}}\{f(x_i)\}$ be the posterior mean and standard deviation of $f(x_i)$ estimated from an MCMC sample. Let $M_\alpha$ be the $(1-\alpha)$ sample quantile of $\max_{1\le i\le N}\bigl|[f(x_i) - \hat E\{f(x_i)\}]/\widehat{\mathrm{SD}}\{f(x_i)\}\bigr|$ computed from the realizations of $f$ in the MCMC sample. Then $I(x_i) = \hat E\{f(x_i)\} \pm M_\alpha\,\widehat{\mathrm{SD}}\{f(x_i)\}$, $i = 1, \ldots, N$, are simultaneous credible intervals. For $N$ large, the upper and lower limits of these intervals can be connected to form simultaneous credible bands. In the remainder of this article we report only simultaneous credible bounds for the functions and their derivatives. In the examples considered in the following, these bounds tend to be roughly 30–50% wider than pointwise credible bounds.
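The max-statistic construction above is straightforward to code from MCMC output. A minimal sketch (the helper name is ours) for an $S \times N$ matrix of posterior draws:

```python
import numpy as np

def simultaneous_bands(samples, alpha=0.05):
    """Simultaneous credible bands from MCMC output.

    samples: (S, N) array whose row s holds f(x_1), ..., f(x_N) for
    MCMC draw s. Returns (lower, upper) arrays built as
    Ehat{f(x_i)} +/- M_alpha * SDhat{f(x_i)}, where M_alpha is the
    (1 - alpha) sample quantile of the maximum absolute standardized
    deviation over the grid.
    """
    mean = samples.mean(axis=0)
    sd = samples.std(axis=0, ddof=1)
    z_max = np.abs((samples - mean) / sd).max(axis=1)  # max over grid, per draw
    m_alpha = np.quantile(z_max, 1.0 - alpha)
    return mean - m_alpha * sd, mean + m_alpha * sd
```

Because $M_\alpha$ is computed from the maximum over the whole grid, the resulting bands are necessarily wider than pointwise $\pm 1.96\,\mathrm{SD}$ intervals, consistent with the 30–50% inflation noted above.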

5. COMPARISON WITH OTHER UNIVARIATE SMOOTHERS

Baladandayuthapani et al. (2005) presented a comprehensive simulation study of their Bayesian spatially adaptive penalized spline model, which is obtained as a particular case of our full model with constant error variance. The results of this study indicate that the method is comparable to or better than several other spatially adaptive Bayesian methods on a variety of datasets. Results also showed that the method produced practically indistinguishable inferences from those obtained using the fast frequentist method of Ruppert and Carroll (2000). Since we are unaware of any other estimation methodology that jointly estimates the nonparametric adaptive mean and nonparametric variance models, we will compare our methodology with that of Ruppert and Carroll (2000). We will compare the frequentist properties of our Bayesian methodology, such as MSE and coverage probabilities, calculated by simulating from existing benchmark models.

Consider the regression model

$$y_i = m(x_i) + \varepsilon_i,$$

where $\varepsilon_i \sim N(0, \sigma_\varepsilon^2)$ are independent errors. The first function considered was the Doppler function described by Donoho and Johnstone (1994),

$$m(x) = \sqrt{x(1-x)}\,\sin\left\{\frac{2\pi(1 + 0.05)}{x + 0.05}\right\}.$$

For the Doppler function we considered $n = 2048$ equally spaced $x$'s in $[0, 1]$ and $\sigma_\varepsilon^2 = 1$.

Ruppert and Carroll (2000) proposed the following family of models, related to the Doppler but allowing for more shape flexibility:

$$m(x; j) = \sqrt{x(1-x)}\,\sin\left\{\frac{2\pi\bigl(1 + 2^{(9-4j)/5}\bigr)}{x + 2^{(9-4j)/5}}\right\},$$

where $j = 3$ and $j = 6$ correspond to low and severe spatial heterogeneity, respectively. For this function we considered $n = 400$ equally spaced $x$'s on $[0, 1]$ and $\sigma_\varepsilon^2(x) = 0.04$. We will refer to this as the RC$j$ function, where $j$ controls the amount of spatial heterogeneity.

Ruppert and Carroll (2000) and Baladandayuthapani et al. (2005) also considered the following function,

$$m(x) = \exp\{-400(x - 0.6)^2\} + \frac{5}{3}\exp\{-500(x - 0.75)^2\} + 2\exp\{-500(x - 0.9)^2\}, \qquad (5.1)$$


Table 1. Average mean square error (AMSE) for different mean functions, based on 100 simulations. The column "True variance" indicates whether the data-generating process is homoscedastic or heteroscedastic. The columns under "Fitted error variance" refer to the type of assumption used for fitting the data: "Homoscedastic" uses the Ruppert and Carroll (2000) fitting algorithm, while "Heteroscedastic" uses the Bayesian inferential algorithm described in this article. The mean and log-variance functions were modeled using degree-2 splines with $K_m = K_\varepsilon = 40$ equally spaced knots. The shrinkage process was modeled using degree-1 splines with $K_b = 4$ subknots.

                               Fitted error variance
Function     True variance     Homoscedastic     Heteroscedastic
Doppler      Constant          0.0197            0.0218
RC j=3       Constant          0.0011            0.0012
RC j=6       Constant          0.0061            0.0065
3 Bumps      Constant          0.0054            0.0055
3 Bumps      Variable          0.0031            0.0026

which is roughly constant on $[0, 0.5]$ and has three sharp peaks, at 0.6, 0.75, and 0.9. For this function we considered $n = 1000$ equally spaced $x$'s on $[0, 1]$ and two different scenarios for the standard deviation of the error process: (a) $\sigma_\varepsilon(x) = 0.5$ for homoscedastic errors, and (b) $\sigma_\varepsilon(x) = 0.5 - 0.8x + 1.6(x - 0.5)_+$. We will refer to this as the "three-bump" function.
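For concreteness, the benchmark mean functions and the heteroscedastic standard deviation of scenario (b) can be written down directly (a sketch; the function names are ours):

```python
import numpy as np

def doppler(x):
    """Donoho-Johnstone Doppler function."""
    return np.sqrt(x * (1 - x)) * np.sin(2 * np.pi * (1 + 0.05) / (x + 0.05))

def rc(x, j):
    """Ruppert-Carroll family; j = 3 (low) to j = 6 (severe) heterogeneity."""
    e = 2.0 ** ((9 - 4 * j) / 5.0)
    return np.sqrt(x * (1 - x)) * np.sin(2 * np.pi * (1 + e) / (x + e))

def three_bumps(x):
    """Roughly flat on [0, 0.5], with sharp peaks at 0.6, 0.75, and 0.9."""
    return (np.exp(-400 * (x - 0.6) ** 2)
            + (5.0 / 3.0) * np.exp(-500 * (x - 0.75) ** 2)
            + 2.0 * np.exp(-500 * (x - 0.9) ** 2))

def sigma_eps(x):
    """Scenario (b) error standard deviation: 0.5 - 0.8x + 1.6(x - 0.5)_+."""
    return 0.5 - 0.8 * x + 1.6 * np.maximum(x - 0.5, 0.0)
```

Note that `sigma_eps` decreases linearly from 0.5 at $x = 0$ to 0.1 at $x = 0.5$, then rises back to 0.5 at $x = 1$, which is the pattern driving the coverage results discussed below in this section.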

We used 100 simulations from these models and fit a spatially adaptive penalized spline that uses a constant variance. For the three-bump function we also fitted a model with nonconstant variance modeled as a log-penalized spline. For the mean and log-variance functions we used quadratic penalized splines with $K_m = K_\varepsilon = 40$ knots. For the log-variance corresponding to the shrinkage process we used $K_b = 4$ knots.

We calculated the mean square error for each simulated dataset as

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\{\hat m(x_i) - m(x_i)\}^2,$$

where $\hat m(x)$ is the posterior mean of $m(x)$ using a given model. Table 1 provides a summary of the average MSE (AMSE) results over 100 simulations, comparing the method that allows for heteroscedastic errors with the one that does not. Remarkably, our method that estimates a nonconstant variance performed well compared to methods that assume a constant variance when the variance is constant. In contrast, when the true variance process is not constant, our method performed better.

Our method was slightly outperformed by the Ruppert and Carroll (2000) method in the Doppler case. This is probably due to the fact that the Ruppert and Carroll method assumes a constant variance for the Doppler function, and the variance used for simulations was $\sigma_\varepsilon^2 = 1$. However, both methods outperformed the method of Baladandayuthapani et al. (2005), who reported an AMSE of 0.024 for the Doppler function.

For the three-bump function with constant variance $\sigma_\varepsilon(x) = 0.5$, the two models produced practically indistinguishable MSEs. Note that our AMSE over all $x$'s using 100 simulations was AMSE = 0.0054, which is smaller than the 0.0061 reported by Baladandayuthapani et al. (2005). The coverage probabilities of the 95% credible intervals were very similar for the two models for each value of the covariate, with a slight advantage in favor of the model using homoscedastic errors. The average coverage probability was 94.7% for the model with homoscedastic errors and 93.5% for the model with heteroscedastic errors. These coverage probabilities are very similar to the ones reported by Baladandayuthapani et al. (2005) in their Figure 3.

Figure 3. Comparison of pointwise coverage probabilities of the 95% credible intervals in 500 simulations. Data were simulated from the model described in Section 5 with heteroscedastic error variance. Estimation was done using two methods: Bayesian adaptive penalized splines with homoscedastic errors (dashed) and Bayesian adaptive penalized splines with heteroscedastic errors (solid). The coverage probabilities have been smoothed to remove Monte Carlo variability.

For case (b), $\sigma_\varepsilon(x) = 0.5 - 0.8x + 1.6(x - 0.5)_+$, the heteroscedastic model substantially outperformed the homoscedastic model, both in terms of AMSE and coverage probabilities. In this situation it would be misleading to compare only the average coverage probability for the 95% credible intervals. Indeed, these averages are very close, 94.1% for the heteroscedastic and 93.0% for the homoscedastic method, but they are obtained from very different sets of pointwise coverage probabilities. Figure 3 displays the coverage probabilities for these two methods. Note that for the heteroscedastic method the coverage probability on the $[0, 0.5]$ interval is close to 95%. In the same interval, the coverage probability for the homoscedastic method starts from around 0.8 and increases until it crosses the 95% target around $x = 0.2$. Moreover, on the interval $[0.3, 0.5]$ this coverage probability is estimated to be 1. This is due to the fact that on this interval $\sigma_\varepsilon(x)$ decreases linearly from 0.5 to 0.1. Since the homoscedastic method assumes a constant variance, the credible intervals will tend to be shorter than nominal in regions of higher variability ($x$ close to zero), thus producing lower coverage probabilities. In regions of smaller variability ($x$ close to 0.5), the credible intervals will tend to be much wider, thus producing extremely large coverage probabilities.

On the interval $[0.55, 0.65]$ the homoscedastic method slightly outperforms the heteroscedastic method. This seems to be the effect of a "lucky" combination of two factors. As we discussed, the size of the credible intervals for the homoscedastic method in a neighborhood of 0.5 is much larger than nominal. However, at the same point the mean function changes from a constant to a rapidly oscillating function, and the coverage probabilities drop at roughly the same rate. On the interval $[0.8, 1]$ one can notice a phenomenon very similar to the one described for the interval $[0, 0.5]$. While the function oscillates more rapidly in this region, the heteroscedastic adaptive method produces credible intervals with coverage probabilities close to the nominal 95%. However, the homoscedastic method does not take into account the increased variability and produces credible intervals that are too short.

For these examples, both methods performed roughly similarly on simulated datasets that did not require variance estimation. However, when the error variance is not constant, the method using log-penalized splines to estimate the error variance considerably outperforms the method that does not.

6. LOW RANK MULTIVARIATE SMOOTHING

In this section we generalize the ideas in Section 2 to multivariate smoothing, while preserving the appealing geometric interpretation of the smoother. Consider the regression $y_i = m(\mathbf{x}_i) + \varepsilon_i$, where $m(\cdot)$ is a smooth function of $L$ covariates. We will use radial basis functions, which have the advantage of being more numerically stable than the tensor product of truncated polynomials in more than one dimension. Suppose that $\mathbf{x}_i \in \mathbb{R}^L$, $1 \le i \le n$, are $n$ vectors of covariates and $\boldsymbol{\kappa}_k \in \mathbb{R}^L$, $1 \le k \le K_m$, are $K_m$ knots. Consider the following function:

$$C(\mathbf{r}) = \begin{cases} \|\mathbf{r}\|^{2M-L} & \text{for } L \text{ odd}, \\ \|\mathbf{r}\|^{2M-L}\log\|\mathbf{r}\| & \text{for } L \text{ even}, \end{cases}$$

where $\|\cdot\|$ denotes the Euclidean norm in $\mathbb{R}^L$ and the integer $M$ controls the smoothness of $C(\cdot)$. Let $X$ be the matrix with $i$th row $X_i = [1, \mathbf{x}_i^T]$, let $Z_{K_m} = \{C(\|\mathbf{x}_i - \boldsymbol{\kappa}_k\|)\}_{1\le i\le n,\,1\le k\le K_m}$ and $\Omega_{K_m} = \{C(\|\boldsymbol{\kappa}_k - \boldsymbol{\kappa}_{k'}\|)\}_{1\le k\le K_m,\,1\le k'\le K_m}$, and define $Z = Z_{K_m}\Omega_{K_m}^{-1/2}$, where $\Omega_{K_m}^{1/2}$ is the principal square root of $\Omega_{K_m}$. With these notations, the low-rank approximation of thin-plate spline regression can be obtained as the BLUP in the LMM (Kammann and Wand 2003; Ruppert, Wand, and Carroll 2003)

$$\mathbf{Y} = X\boldsymbol{\beta} + Z\mathbf{b} + \boldsymbol{\varepsilon}, \quad E\begin{pmatrix}\mathbf{b}\\ \boldsymbol{\varepsilon}\end{pmatrix} = \begin{pmatrix}\mathbf{0}\\ \mathbf{0}\end{pmatrix}, \quad \mathrm{cov}\begin{pmatrix}\mathbf{b}\\ \boldsymbol{\varepsilon}\end{pmatrix} = \begin{pmatrix}\sigma_b^2 I_{K_m} & 0\\ 0 & \sigma_\varepsilon^2 I_n\end{pmatrix}. \qquad (6.1)$$
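The construction of $Z$ from the radial basis can be sketched as follows. This is our own helper, not the authors' code; it takes the inverse square root of $\Omega_{K_m}$ via the SVD, a common practical choice (stated as an assumption) since thin-plate radial matrices are only conditionally positive definite.

```python
import numpy as np

def tps_design(X, knots, M=2):
    """Sketch of the low-rank thin-plate design Z = Z_Km * Omega_Km^{-1/2}.

    X: (n, L) covariate matrix; knots: (Km, L) knot locations. Uses the
    radial function C(r) = r^(2M-L) log(r) for L even and r^(2M-L) for
    L odd, with the convention C(0) = 0. Tiny singular values are
    clipped for numerical safety.
    """
    n, L = X.shape

    def C(R):
        out = np.zeros_like(R)
        pos = R > 0
        if L % 2 == 0:
            out[pos] = R[pos] ** (2 * M - L) * np.log(R[pos])
        else:
            out[pos] = R[pos] ** (2 * M - L)
        return out

    # (n, Km) radial design and (Km, Km) knot-to-knot matrix
    Z_K = C(np.linalg.norm(X[:, None, :] - knots[None, :, :], axis=2))
    Omega = C(np.linalg.norm(knots[:, None, :] - knots[None, :, :], axis=2))
    U, d, Vt = np.linalg.svd(Omega)
    d = np.maximum(d, 1e-10 * d.max())      # numerical safeguard
    return Z_K @ (U @ np.diag(1.0 / np.sqrt(d)) @ Vt)
```

With $Z$ in hand, model (6.1) is an ordinary linear mixed model and can be fit with standard mixed-model machinery.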

This model contains only one variance component, $\sigma_b^2$, controlling the shrinkage of $\mathbf{b}$, which is equivalent to one global smoothing parameter $\lambda = \sigma_b^2/\sigma_\varepsilon^2$, and implicitly assumes homoscedastic errors. To relax these assumptions, we consider a new set of knots $\boldsymbol{\kappa}_1^*, \ldots, \boldsymbol{\kappa}_{K_b}^*$ and define $X^*$ and $Z^*$ similarly, with the corresponding definitions of the matrices $X$ and $Z$ in which the $x$-covariates are replaced by the knots $\boldsymbol{\kappa}_k$ and the knots are replaced by the subknots $\boldsymbol{\kappa}_k^*$. Consider the following model for the response:

$$\begin{cases} y_i = \beta_0 + \sum_{j=1}^{L}\beta_j x_{ij} + \sum_{k=1}^{K_m} b_k z_{ik} + \varepsilon_i, \\ \varepsilon_i \sim N\{0, \sigma_\varepsilon^2(\mathbf{x}_i)\}, \quad i = 1, \ldots, n, \\ b_k \sim N\{0, \sigma_b^2(\boldsymbol{\kappa}_k)\}, \quad k = 1, \ldots, K_m, \end{cases} \qquad (6.2)$$

where the error variance $\log\sigma_\varepsilon^2(\mathbf{x}_i)$ is modeled as a low-rank log-thin-plate spline,

$$\begin{cases} \log\sigma_\varepsilon^2(\mathbf{x}_i) = \gamma_0 + \sum_{j=1}^{L}\gamma_j x_{ij} + \sum_{k=1}^{K_\varepsilon} c_k z_{ik}, \\ c_k \sim N(0, \sigma_c^2), \quad k = 1, \ldots, K_\varepsilon, \end{cases} \qquad (6.3)$$

and the random coefficient variance $\log\sigma_b^2(\boldsymbol{\kappa}_k)$ is modeled as another low-rank log-thin-plate spline,

$$\begin{cases} \log\sigma_b^2(\boldsymbol{\kappa}_k) = \delta_0 + \sum_{j=1}^{L}\delta_j x_{kj}^* + \sum_{j=1}^{K_b} d_j z_{kj}^*, \\ d_j \sim N(0, \sigma_d^2), \quad j = 1, \ldots, K_b, \end{cases} \qquad (6.4)$$

where $x_{kj}^*$ and $z_{kj}^*$ are the entries of the $X^*$ and $Z^*$ matrices, respectively, and the $c_k$'s and $d_j$'s are assumed mutually independent.

In the case of low-rank smoothers, the set of knots for the covariates and the subknots for modeling the shrinkage process have to be chosen. One possibility is to use equally spaced knots and subknots. We use this approach in our simulation study in Section 7 to make our method comparable, from this point of view, with previously published methods. Another possibility is to select the knots and subknots using a space-filling design (Nychka and Saltzman 1998), which is based on the maximal separation principle. This avoids wasting knots and is likely to lead to better approximations in sparse regions of the data. The cover.design() function from the R package fields (Fields Development Team 2006) provides software for space-filling knot selection. We use this approach in Section 8 in the Noshiro elevation example.
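The maximal-separation idea can be illustrated with a greedy farthest-point sketch. This is not the cover.design() algorithm, just a simple stand-in of our own to convey the principle of spreading knots over the covariate space.

```python
import numpy as np

def farthest_point_knots(X, K, seed=0):
    """Greedy farthest-point knot selection: a simple stand-in for the
    maximal-separation idea behind space-filling designs (the actual
    analysis used cover.design() from the R package fields)."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]          # random starting site
    d = np.linalg.norm(X - X[chosen[0]], axis=1)  # distance to nearest knot
    for _ in range(K - 1):
        nxt = int(np.argmax(d))                   # site farthest from all knots
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return X[chosen]
```

Each step picks the candidate site farthest from the knots already chosen, so knots never cluster and sparse regions of the data are not starved of basis functions.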

7. COMPARISONS WITH OTHER ADAPTIVE SURFACE FITTING METHODS

Lang and Brezger (2004) compared their adaptive surface smoothing method with several methods used in a simulation study by Smith and Kohn (1997): MARS (Friedman 1991), "locfit" (Cleveland and Grosse 1991), "tps" (bivariate cubic thin-plate splines with a single smoothing parameter), tensor product cubic smoothing splines with five smoothing parameters, and a parametric linear interaction model. We will use the following regression functions, also used by Lang and Brezger:

• $m_1(x_1, x_2) = x_1\sin(4\pi x_2)$, where $x_1$ and $x_2$ are distributed independently uniform on $[0, 1]$;

• $m_2(x_1, x_2) = 1.5\exp(-8x_1^2) + 3.5\exp(-8x_2^2)$, where $x_1$ and $x_2$ are distributed independently normal with mean 0.5 and variance 0.1.

Function $m_1(\cdot)$ contains complex interactions and moderate spatial variability, while $m_2(\cdot)$ has only main effects and is much smoother. We used a sample size $n = 300$, $\sigma = \mathrm{range}(f)/4$, and 250 simulations from each model.

We will compare the performance of our method with that of Lang and Brezger, which performed at least as well as or better than other adaptive methods in a variety of examples. To avoid dependence of results on knot choice, we used the same choice as the one described by Lang and Brezger: 12 × 12 equally spaced knots for the mean function. However, we used thin-plate splines instead of tensor products of B-splines. We modeled the smoothing parameter using a low-rank log-thin-plate spline with an equally spaced 5 × 5 knot grid. We considered two estimators: one with constant error variance (CEV), and one with estimated error variance using a log-thin-plate spline (EEV) with the same 12 × 12 knot grid as for the mean function. The CEV estimator is the one that can be directly compared with the estimators considered by Lang and Brezger (2004). Because Lang and Brezger presented only comparative results using boxplots, we will use their results as read from their Figure 5. The performance of all estimators was measured by the log of the mean squared error, log(MSE), where $\mathrm{MSE}(\hat f) = \frac{1}{n}\sum_{i=1}^{n}\{\hat f(x_i) - f(x_i)\}^2$.

For the CEV estimator and function $f_1(\cdot)$, corresponding to moderate spatial heterogeneity, we obtained a median log(MSE) of −3.67, with an interquartile range of [−3.80, −3.53] and a range of [−4.21, −3.13]. These values are better than the best values reported for Lang and Brezger's methods in their Figure 5(b). Lang and Brezger reported that their method, denoted "P-spline," outperforms all other methods in this case and achieved a median log(MSE) of −3.50. Remarkably, the EEV estimator performed almost as well as the CEV estimator in this case and also outperformed the other methods. The median log(MSE) for EEV was −3.59, with an interquartile range of [−3.74, −3.43] and a range of [−4.14, −3.02]. These results are consistent with the univariate results reported by Ruppert and Carroll (2000) and Baladandayuthapani, Mallick, and Carroll (2005), who found that allowing the smoothing parameter to vary smoothly outperforms other adaptive techniques.

For function $f_1(\cdot)$, Figure 4 displays the coverage probabilities for the 95% credible intervals calculated over a 20 × 20 equally spaced grid in $[0, 1]^2$. For each grid point we calculated the frequency with which the 95% pointwise credible interval covers the true value of the function at that point. As expected, these coverage probabilities show strong spatial correlation, with lower coverage probabilities along the ridges of the sinus function. Coverage probability is lowest when $x_1$ is in the $[0.2, 0.5]$ range. The signal-to-noise ratio in this region is about half the signal-to-noise ratio in the region corresponding to $x_1$ close to 1. This explains why the coverage probability is smaller in this region. Interestingly, the zeros of the true function, appearing at $x_2 = 0.25, 0.50, 0.75$, are covered with at least 95% probability. Another interesting feature is the lower coverage probabilities near the north, south, and east boundaries. These features of the pointwise coverage probability map are consistent with the frequentist analysis of Krivobokova et al. (2007), but are not consistent with the coverage probabilities reported by Lang and Brezger for their Bayesian P-spline method.

For our CEV estimator and function f2(·), corresponding to very low spatial heterogeneity, we obtained a median log(MSE) of −5.95 with an interquartile range [−6.12, −5.66]

Spatially Adaptive Bayesian Penalized Splines 279

Figure 4. Coverage probability of the pointwise 95% credible intervals for the mean function, for f1(x1, x2) = x1 sin(4πx2) with a constant error standard deviation σ = range(f)/4. Probabilities are computed on a 20 × 20 equally spaced grid of points in [0, 1]². The model used was a thin-plate spline with K = 100 knots for the mean function, a log-thin-plate spline with K = 100 knots for the variance function, and a log-thin-plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

and a range [−6.51, −5.41]. The results of Lang and Brezger in their Figure 5(a) indicate that their "P-spline" method achieved a median log(MSE) of −6.00. Our method performs roughly similarly to the Bayesian P-spline method of Lang and Brezger and to the cubic thin-plate spline (tps), being outperformed only by Lang and Brezger's adaptive P-spline method with two main effects ("P-spline main"). We could also add two main effects to our bivariate adaptive smoother to improve MSE for this example, but this is not our concern in this article. Again, the EEV estimator performed very similarly to our CEV estimator.

The last simulation study used the function f1(·, ·), where x1, x2 are independent and uniformly distributed in [0, 1], with the error standard deviation function

σε(x1, x2) = r/32 + (3r/32) x1²,

where r = range(f1). We compared our CEV and EEV estimators described at the beginning of this section and, as expected, the EEV estimator outperformed the CEV estimator both in terms of log(MSE) and coverage probabilities. More precisely, for log(MSE) we obtained a median of −5.26 for CEV and −5.35 for EEV, an interquartile range of [−5.37, −5.15] for CEV and [−5.48, −5.24] for EEV, and a range of [−5.72, −4.65] for CEV and [−5.84, −4.84] for EEV. For both estimators the general patterns of the coverage probabilities were the ones displayed in Figure 4. EEV outperformed CEV in terms of coverage probabilities. For example, the regions where the nominal level is exceeded for both estimators have a roughly similar


Figure 5. Locations of sampling (dots) with 100 knots (circles) in the Noshiro example.

shape and location, but the EEV coverage probabilities tend to be closer to their nominal level.

Note that our model, incorporating nonparametric variance estimation, is more general than the models considered in the literature. Nevertheless, its performance for function f1(·, ·), with moderate spatial variability and constant variance, was better than that of all methods considered by Lang and Brezger in their simulation study (see their Figure 5, panel b). Our method also has good performance for function f2(·, ·), being second best compared to the methods considered by Lang and Brezger (see their Figure 5, panel a). For the case when the variance is not constant, our adaptive method that estimates the variance nonparametrically outperformed our adaptive method that assumes a constant variance.

8. THE NOSHIRO EXAMPLE

Noshiro, Japan, was the site of a major earthquake. Much of the damage from the quake was due to soil movement. Professor Thomas O'Rourke, a geotechnical engineer, was investigating the factors that might help predict soil movement during a future quake. One factor thought to be of importance was the slope of the land. To estimate the slope, Ruppert (1997) used a dataset with 799 observations, where x was longitude and latitude and y was elevation at locations in Noshiro. The object of primary interest is the gradient, at least its magnitude and possibly its direction.

We used the thin-plate spline model (6.2) for the mean of y, with Km = 100 knots selected by the maximum separation criterion implemented in the function cover.design() of the R package fields (Nychka, Haaland, O'Connell, and Ellner 1998). The sampling


Figure 6. Posterior mean of the mean regression function for the Noshiro example. The model used was a thin-plate spline with K = 100 knots for the mean function, a log-thin-plate spline with K = 100 knots for the variance function, and a log-thin-plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

locations and the knots used for fitting are displayed in Figure 5. The log-thin-plate spline described in (6.3), with the same Kε = 100 knots, was used to model the variance of the error process, σ²_ε(xi). The log-thin-plate spline described in (6.4), with Kb = 16 subknots, was used to model the variance σ²_b(κk), which controls the amount of shrinkage of the b parameters. The subknots were obtained using the cover.design() function applied to the 100 knots.

Figure 6 displays the posterior mean of the mean response function. There is a rather sharp peak around (0.42, 0.4), with a slow decay from the peak in the general south direction and a much faster decay in all other directions. The function becomes smoother towards the boundary. This is exactly the type of function for which adaptive spatial smoothing can substantially improve the fit.

Figure 7 shows that in large areas of the map σε(·) is smooth, with small values. However, in a neighborhood of the peak of the mean function, σε(·) displays two relatively sharp maxima located NNW and SSE of the peak. We also performed a two-stage frequentist analysis. In the first stage we used a low-rank thin-plate spline for the mean function, and the results from this regression were used to obtain residuals. We then fitted another low-rank thin-plate spline to the absolute values of these residuals. While this map was not identical to the one in Figure 7, it did exhibit the same patterns of heteroscedasticity.

The process log σ²_b(κk), controlling the shrinkage of the bk, is displayed in Figure 8.


Figure 7. Posterior mean of the standard deviation σε(xi) of the error process for the Noshiro example. The model used was a thin-plate spline with K = 100 knots for the mean function, a log-thin-plate spline with K = 100 knots for the variance function, and a log-thin-plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

Figure 8. Posterior mean of the shrinkage process −log σ²_b(κk). The model used was a thin-plate spline with K = 100 knots for the mean function, a log-thin-plate spline with K = 100 knots for the variance function, and a log-thin-plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.


Smaller values of this function correspond to less shrinkage of the bk towards zero and to more local behavior of the smoother. Figure 8 indicates that in a neighborhood of the peak of the mean function the shrinkage is smaller, allowing the mean function to change rapidly. Away from the peak, the shrinkage is larger, corresponding to a smoother mean.

9. IMPLEMENTATION USING MCMC

We will now provide some details for MCMC simulations of model (2.1), where the variances σ²_ε(xi) and σ²_b(κk) are modeled by Equations (2.3) and (2.4). The implementation for other models, for example for the multivariate smoothing in Section 6, is similar. Consider independent normal priors for the coefficients of the monomials, βi ~ N(0, σ²_0β), γi ~ N(0, σ²_0γ), i = 0, …, p, δi ~ N(0, σ²_0δ), i = 0, …, q, and independent inverse Gamma priors for the variance components, σ²_c ~ IGamma(a_c, b_c) and σ²_d ~ IGamma(a_d, b_d). Using these priors, many full conditionals of the posterior distribution are easy to derive, while a few have complex multivariate forms. Our implementation of the MCMC using multivariate Metropolis–Hastings steps proved to be unstable, with poor mixing properties. A simple and reliable solution was to change the model by adding error terms to the log-spline models, that is,

log σ²_ε(xi) = γ0 + ··· + γp xi^p + Σ_{s=1}^{Kε} cs (xi − κ^ε_s)^p_+ + ui,
log σ²_b(κ^m_k) = δ0 + ··· + δq (κ^m_k)^q + Σ_{s=1}^{Kb} ds (κ^m_k − κ^b_s)^q_+ + vk,     (9.1)

where ui ~ N(0, σ²_u) and vk ~ N(0, σ²_v). This idea was also used for σ²_b by Baladandayuthapani, Mallick, and Carroll (2005). We fixed the values σ²_u = σ²_v = 0.01, as these variances appear nonidentifiable and a standard deviation of 0.1 is small on a log scale. This device makes computations feasible by replacing sampling from complex full conditionals with simple univariate MH steps.

Define, as before, Σε, Σb, and θX, and define θε = (γᵀ, cᵀ)ᵀ, θb = (δᵀ, dᵀ)ᵀ, CX = (X, Z), Cε = (Xε, Zε), Cb = (Xb, Zb), where X, Xε, Xb contain the monomials and Z, Zε, Zb contain the truncated polynomials of the spline models for m, log σ²_ε, and log σ²_b, respectively. Also denote by

Σ_0X = [σ²_0β I_{p+1}, 0; 0, Σb],   Σ_0ε = [σ²_0γ I_{p+1}, 0; 0, σ²_c I_{Kε}],   Σ_0b = [σ²_0δ I_{q+1}, 0; 0, σ²_d I_{Km}]

the corresponding block-diagonal prior covariance matrices. The full conditionals of the posterior are detailed below.

1. [θX] ~ N(MX CXᵀ Σε⁻¹ Y, MX), where MX = (CXᵀ Σε⁻¹ CX + Σ_0X⁻¹)⁻¹.

2. [θε] ~ N(Mε Cεᵀ Yε/σ²_u, Mε), where Yε = [log(σ²_ε1), …, log(σ²_εn)]ᵀ and Mε = (Cεᵀ Cε/σ²_u + Σ_0ε⁻¹)⁻¹.


Figure 9. Computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with Km knots, with the log of the variance of the shrinkage process modeled as another penalized spline with Kb = 4 subknots. The log of the error variance is modeled as another penalized spline with Kε = Km knots. Different symbols correspond to increasing numbers of knots: dots (Km = 10), asterisks (Km = 20), circles (Km = 30), x's (Km = 40).

3. [θb] ~ N(Mb Cbᵀ Yb/σ²_v, Mb), where Yb = [log(σ²_1), …, log(σ²_Km)]ᵀ and Mb = (Cbᵀ Cb/σ²_v + Σ_0b⁻¹)⁻¹.

4. [σ²_c] ~ IGamma(a_c + Kε/2, b_c + ||c||²/2).

5. [σ²_d] ~ IGamma(a_d + Kb/2, b_d + ||d||²/2).

6. [σ²_εi] ∝ σ_εi⁻³ exp{−(yi − μi)²/(2σ²_εi) − [log(σ²_εi) − ηi]²/(2σ²_u)}, where μi and ηi are the ith components of CX θX and Cε θε, respectively.

7. [σ²_k] ∝ σ_k⁻³ exp{−b²_k/(2σ²_k) − [log(σ²_k) − ζk]²/(2σ²_v)}, where ζk is the kth component of Cb θb.

All the above conditionals have an explicit form, with the exception of the n + Km one-dimensional conditionals in 6 and 7. For these distributions we use the Metropolis–Hastings algorithm with a normal proposal distribution centered at the current value and with small variance.

10. IMPLEMENTATION AND COMPUTATIONAL PERFORMANCE

Our MCMC simulation algorithm provided reliable results and is reasonably fast. To illustrate this, Figure 9 displays computation time (in minutes) versus sample size for 10,000


Figure 10. MCMC histories for eight parameters of interest, including the log of the standard deviation of the error process and the log of the standard deviation of the shrinkage process, as well as some parameters modeling the different functions. The model uses 40 equally spaced knots for the mean and error variance functions and 4 subknots for the adaptive shrinkage process. Chains correspond to the Doppler function example, using quadratic truncated polynomials to fit the mean and log-variance functions and a linear truncated polynomial basis for fitting the shrinkage process. For clarity of presentation, only every 20th simulated observation is displayed.

simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with Km knots, with the log of the variance of the shrinkage process modeled as another penalized spline with Kb = 4 subknots. The log of the error variance is modeled as another penalized spline with Kε = Km knots. Different symbols correspond to increasing numbers of knots: dots (Km = 10), asterisks (Km = 20), circles (Km = 30), x's (Km = 40). Computation time increases roughly linearly with the number of observations, with the same intercept 0 and larger slopes for larger numbers of knots. Note that even with n = 1000 observations and Km = Kε = 40, simulation time is about two minutes.

For the models and datasets presented in this article, 10,000 simulations proved sufficient for accurate and reliable inference. Some parameters, such as the mean and variance functions, tended to exhibit much better mixing properties than the variance components σ²_c and σ²_d controlling the shrinkage processes. This is probably due to the reduced amount of information in the data about these parameters. The MCMC convergence and mixing properties were assessed by visual inspection of the chain histories of many parameters of interest. Since we started with good starting values of the parameters, convergence is fast. Figure 10 displays the histories of eight parameters of interest for the model fitted to the


Doppler function described in Section 5. For clarity we display only every 100th sampled value of the chain, but the same patterns would be observed in the full chain history. Similarly good mixing properties have been noted in all the other examples presented in this article.

In the search for a good software platform, we implemented our algorithms in MATLAB, C, and FORTRAN. Programs can be downloaded from www.biostat.jhsph.edu/~ccrainic. For our application and implementation there were no serious differences between C and FORTRAN, both of which performed better than MATLAB, reducing computation time by a factor of two to three. The times reported in Figure 9 are for our implementation in C.

Given the complexity of the model, we were interested in improving the mixing properties of the Markov chains. The computational trick described in Section 9 was the sine qua non that allowed us to replace an intractable multivariate MH step with several tractable univariate MH steps. We used normal random-walk proposal distributions centered at the current value, with variance tuned to provide acceptance rates between 30% and 50%.

11. COMMENTS

Our joint models for a heterogeneous mean and nonconstant variance are designed to extract the maximum amount of information from complex data and to suggest simpler parametric models. Bayesian inference based on MCMC offers a natural and powerful framework for model fitting. However, MCMC in this context is not straightforward: implementation depends essentially on the ability to transform intractable simulations from high-dimensional distributions into tractable one-dimensional updates, as described in Section 9. Our implementation provides accurate, fast, and reproducible results, and software is available. Actual simulation time increases only linearly with the number of observations, and mixing properties are adequate provided that the proposal distribution variances are tuned for optimal acceptance rates.

ACKNOWLEDGMENTS

Carroll's research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

[Received September 2004. Revised October 2006.]

REFERENCES

Baladandayuthapani, V., Mallick, B. K., and Carroll, R. J. (2005), "Spatially Adaptive Bayesian Regression Splines," Journal of Computational and Graphical Statistics, 14, 378–394.

Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computation and Stochastic Systems" (with discussion), Statistical Science, 10, 3–66.

Carroll, R. J. (2003), "Variances Are Not Always Nuisance Parameters: The 2002 R. A. Fisher Lecture," Biometrics, 59, 211–220.

Carroll, R. J., and Ruppert, D. (1988), Transformation and Weighting in Regression, New York: Chapman and Hall.


Cleveland, W. S., and Grosse, E. (1991), "Computational Methods for Local Regression," Statistics and Computing, 1, 47–62.

Crainiceanu, C., and Ruppert, D. (2004), "Likelihood Ratio Tests in Linear Mixed Models With One Variance Component," Journal of the Royal Statistical Society, Series B, 66, 165–185.

Crainiceanu, C. M., Ruppert, D., and Wand, M. P. (2005), "Bayesian Analysis for Penalized Spline Regression Using WinBUGS," Journal of Statistical Software, 14. Available online at http://www.jstatsoft.org/v14/i14.

Donoho, D. L., and Johnstone, I. M. (1994), "Ideal Spatial Adaptation by Wavelet Shrinkage," Biometrika, 81, 425–455.

Eilers, P. H. C., and Marx, B. D. (1996), "Flexible Smoothing With B-Splines and Penalties" (with discussion), Statistical Science, 11, 89–121.

——— (2002), "Generalized Linear Additive Smooth Structures," Journal of Computational and Graphical Statistics, 11, 758–783.

——— (2004), "Splines, Knots, and Penalties," unpublished manuscript.

Fields Development Team (2006), fields: Tools for Spatial Data, National Center for Atmospheric Research, Boulder, CO. Available online at http://www.cgd.ucar.edu/Software/Fields.

Friedman, J. H. (1991), "Multivariate Adaptive Regression Splines" (with discussion), The Annals of Statistics, 19, 1–141.

Gelman, A. (2006), "Prior Distributions for Variance Parameters in Hierarchical Models," Bayesian Analysis, 1, 515–533.

Gray, R. J. (1994), "Spline-Based Tests in Survival Analysis," Biometrics, 50, 640–652.

Holst, U., Hössjer, O., Björklund, C., Ragnarson, P., and Edner, H. (1996), "Locally Weighted Least Squares Kernel Regression and Statistical Evaluation of LIDAR Measurements," Environmetrics, 7, 401–416.

Jullion, A., and Lambert, P. (2006), Discussion Paper DP0534, UCL.

Kammann, E. E., and Wand, M. P. (2003), "Geoadditive Models," Applied Statistics, 52, 1–18.

Kauermann, G. (2002), "A Note on Bandwidth Selection for Penalised Spline Smoothing," Technical Report 02-13, Department of Statistics, University of Glasgow.

Krivobokova, T., Crainiceanu, C. M., and Kauermann, G. (2007), "Fast Adaptive Penalized Splines," Journal of Computational and Graphical Statistics, to appear.

Lang, S., and Brezger, A. (2004), "Bayesian P-Splines," Journal of Computational and Graphical Statistics, 13, 183–212.

Lang, S., Fronk, E. M., and Fahrmeir, L. (2002), "Function Estimation with Locally Adaptive Dynamic Models," Computational Statistics, 17, 479–499.

Marx, B. D., and Eilers, P. H. C. (1998), "Direct Generalized Additive Modeling With Penalized Likelihood," Computational Statistics & Data Analysis, 28, 193–209.

Natarajan, R., and Kass, R. E. (2000), "Reference Bayesian Methods for Generalized Linear Mixed Models," Journal of the American Statistical Association, 95, 227–237.

Ngo, L., and Wand, M. P. (2004), "Smoothing With Mixed Model Software," Journal of Statistical Software, 9. Available online at http://www.jstatsoft.org/v09/i01.

Nychka, D., Haaland, P., O'Connell, M., and Ellner, S. (1998), "FUNFITS: Data Analysis and Statistical Tools for Estimating Functions," in Case Studies in Environmental Statistics, Lecture Notes in Statistics, vol. 132, eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 159–179. Available online at http://www.cgd.ucar.edu/stats/Software/Funfits.

Nychka, D., and Saltzman, N. (1998), "Design of Air Quality Monitoring Networks," in Case Studies in Environmental Statistics, Lecture Notes in Statistics, vol. 132, eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 51–76.


Pintore, A., Speckman, P., and Holmes, C. C. (2006), "Spatially Adaptive Smoothing Splines," Biometrika, 93, 113–125.

Ruppert, D. (1997), "Local Polynomial Regression and its Applications in Environmental Statistics," in Statistics for the Environment (vol. 3), eds. V. Barnett and F. Turkman, Chichester: Wiley, pp. 155–173.

——— (2002), "Selecting the Number of Knots for Penalized Splines," Journal of Computational and Graphical Statistics, 11, 735–757.

Ruppert, D., and Carroll, R. J. (2000), "Spatially-Adaptive Penalties for Spline Fitting," Australian and New Zealand Journal of Statistics, 42, 205–223.

Ruppert, D., Wand, M. P., Holst, U., and Hössjer, O. (1997), "Local Polynomial Variance-Function Estimation," Technometrics, 39, 262–273.

Ruppert, D., Wand, M. P., and Carroll, R. J. (2003), Semiparametric Regression, Cambridge: Cambridge University Press.

Smith, M., and Kohn, R. (1997), "A Bayesian Approach to Nonparametric Bivariate Regression," Journal of the American Statistical Association, 92, 1522–1535.

Wood, S. (2006), Generalized Additive Models: An Introduction with R, New York: Chapman & Hall/CRC.



Figure 1. LIDAR data with unstandardized covariate: effect of four mean-one Gamma priors for the precision of the truncated polynomial parameters. The variances of these priors are 10, 10³, 10⁶, and 10¹⁰, respectively.

rescaled by the factor a_y²/a_x^{2p}. Thus ||b||² is particularly sensitive to the scaling of x.

The size of ||b||² also depends on Km. Because the integral of m^{(p)} over the range of x will be approximately Σ_{k=1}^{Km} bk ≈ √Km σb, we can expect that σ²_b ∝ Km⁻¹, and the smoothing parameter should be proportional to Km. For the LIDAR data, the GCV-chosen smoothing parameter is 0.0095, 0.0205, 0.0440, and 0.0831 for Km equal to 15, 30, 60, and 120, respectively; so, as expected, the smoothing parameter approximately doubles as Km doubles.

Practical experience with LMMs for longitudinal or clustered data should be applied with caution to penalized splines. In a mixed effects model, ||b||² is an estimator of Km σ²_b. For longitudinal data Km σ²_b would generally be large, because Km is the number of subjects, with constant subject-effect variance σ²_b. As just discussed, for a penalized spline Km σ²_b should be nearly independent of Km and could be quite small.

Figure 2 presents the same type of results as Figure 1 for Gamma priors on the precision parameter τb, with the mean held fixed at 10⁻⁶ and variances equal to 10, 10³, 10⁶, and 10¹⁰, respectively. These prior distributions have a much smaller effect on the posterior mean of the regression function. The fit seems to be undersmoothed when the variance is 10. Clearly, as the variance increases the fit becomes smoother, indicating that a value of the variance larger than 10³ will produce a reasonable fit.

A similar discussion holds for τε, but now large Bε corresponds to oversmoothing, and τε does not depend on the scaling of x. In applications it is less likely that Bε is comparable in size to ||Y − Xβ − Zb||², because the latter is an estimator of nσ²_ε. If σ̂²_ε is an estimator of σ²_ε, a good rule of thumb is to use values of Bε smaller than nσ̂²_ε/100. This


Figure 2. LIDAR data with unstandardized covariate: effect of four Gamma priors with mean 10⁻⁶ for the precision of the truncated polynomial parameters. The variances of these priors are 10, 10³, 10⁶, and 10¹⁰, respectively.

rule should work well when σ̂²_ε does not have an extremely large variance.

Alternatives to Gamma priors are discussed by, for example, Natarajan and Kass (2000) and Gelman (2006). These have the advantage of requiring less care in the choice of the hyperparameters. However, we find that, with reasonable care, the conjugate Gamma priors can be used in practice. Nonetheless, exploration of other prior families for penalized splines would be well worthwhile, though beyond the scope of this article.

4. SIMULTANEOUS CREDIBLE BOUNDS

Simultaneous credible intervals have been discussed before in the literature, for example, by Besag, Green, Higdon, and Mengersen (1995). The idea can be used to produce simultaneous credible bounds for smoothly estimated functions. To show this, let f(·) be either m(·), log σ²_ε(·), log σ²_b(·), or a derivative of order q, 1 ≤ q ≤ p, of one of these functions. It is straightforward to use MCMC output to construct simultaneous credible bounds on f over an arbitrary finite interval [x1, xN]. Typically x1 and xN would be the smallest and largest observed values of x.

Let x1 < x2 < ··· < xN be a fine grid of points on this interval. Let Ê{f(xi)} and ŜD{f(xi)} be the posterior mean and standard deviation of f(xi), estimated from an MCMC sample. Let Mα be the (1 − α) sample quantile of max_{1≤i≤N} |[f(xi) − Ê{f(xi)}]/ŜD{f(xi)}|, computed from the realizations of f in the MCMC sample. Then I(xi) = Ê{f(xi)} ± Mα ŜD{f(xi)}, i = 1, …, N, are simultaneous credible intervals. For N


large, the upper and lower limits of these intervals can be connected to form simultaneous credible bands. In the remainder of this article we report only simultaneous credible bounds for the functions and their derivatives. In the examples considered in the following, these bounds tend to be roughly 30–50% wider than pointwise credible bounds.

5. COMPARISON WITH OTHER UNIVARIATE SMOOTHERS

Baladandayuthapani et al. (2005) presented a comprehensive simulation study of their Bayesian spatially adaptive penalized spline model, which is obtained as a particular case of our full model with constant error variance. The results of this study indicate that the method is comparable to or better than several other spatially adaptive Bayesian methods on a variety of datasets. Results also showed that the method produced inferences practically indistinguishable from those obtained using the fast frequentist method of Ruppert and Carroll (2000). Since we are unaware of any other estimation methodology that jointly estimates the nonparametric adaptive mean and nonparametric variance models, we compare our methodology with that of Ruppert and Carroll (2000). We compare the frequentist properties of our Bayesian methodology, such as MSE and coverage probabilities, calculated by simulating from existing benchmark models.

Consider the regression model

yi = m(xi) + εi,

where the εi ~ N(0, σ²_ε) are independent errors. The first function considered was the Doppler function described by Donoho and Johnstone (1994),

m(x) = √{x(1 − x)} sin{2π(1 + 0.05)/(x + 0.05)}.

For the Doppler function we considered n = 2048 equally spaced x's in [0, 1] and σ²_ε = 1.

Ruppert and Carroll (2000) proposed the following family of models, related to the Doppler function but allowing for more shape flexibility:

m(x; j) = √{x(1 − x)} sin{2π(1 + 2^{(9−4j)/5})/(x + 2^{(9−4j)/5})},

where j = 3 and j = 6 correspond to low and severe spatial heterogeneity, respectively. For this function we considered n = 400 equally spaced x's on [0, 1] and σ²_ε(x) = 0.04. We will refer to this as the RCj function, where j controls the amount of spatial heterogeneity.

Ruppert and Carroll (2000) and Baladandayuthapani et al. (2005) also considered the following function:

m(x) = exp{−400(x − 0.6)²} + (5/3) exp{−500(x − 0.75)²} + 2 exp{−500(x − 0.9)²},     (5.1)


Table 1. Average mean square error (AMSE) for different mean functions, based on 100 simulations. The column "True variance" indicates whether the data-generating process is homoscedastic or heteroscedastic. The columns under "Fitted error variance" refer to the assumption used for fitting the data: "Homoscedastic" uses the Ruppert and Carroll (2000) fitting algorithm, while "Heteroscedastic" uses the Bayesian inferential algorithm described in this article. The mean and log-variance functions were modeled using degree-2 splines with Km = Kε = 40 equally spaced knots. The shrinkage process was modeled using degree-1 splines with Kb = 4 subknots.

                                Fitted error variance
  Function    True variance    Homoscedastic    Heteroscedastic
  Doppler     Constant         0.197            0.218
  RC j=3      Constant         0.011            0.012
  RC j=6      Constant         0.061            0.065
  3 Bumps     Constant         0.054            0.055
  3 Bumps     Variable         0.031            0.026

which is roughly constant on [0, 0.5] and has three sharp peaks at 0.6, 0.75, and 0.9. For this function we considered n = 1000 equally spaced x's on [0, 1] and two different scenarios for the standard deviation of the error process: (a) σε(x) = 0.5, for homoscedastic errors, and (b) σε(x) = 0.5 − 0.8x + 1.6(x − 0.5)+. We will refer to this as the "three-bump" function.

We used 100 simulations from these models and fit a spatially adaptive penalized spline that assumes a constant variance. For the three-bump function we also fitted a model with nonconstant variance, modeled as a log-penalized spline. For the mean and log-variance functions we used quadratic penalized splines with Km = Kε = 40 knots. For the log-variance corresponding to the shrinkage process we used Kb = 4 knots.

We calculated the mean square error for each simulated dataset as

MSE = n⁻¹ Σ_{i=1}^{n} {m̂(xi) − m(xi)}²,

where m̂(x) is the posterior mean of m(x) under a given model. Table 1 provides a summary of the average MSE (AMSE) results over the 100 simulations, comparing the method that allows for heteroscedastic errors with the one that does not. Remarkably, our method that estimates a nonconstant variance performed well compared to methods that assume a constant variance even when the variance is constant. In contrast, when the true variance process is not constant, our method performed better.

Our method was slightly outperformed by the Ruppert and Carroll (2000) method in the Doppler case. This is probably due to the fact that the Ruppert and Carroll method assumes a constant variance for the Doppler function, and the variance used for the simulations was indeed constant, σ²_ε = 1. However, both methods outperformed the method of Baladandayuthapani et al. (2005), who reported an AMSE of 0.024 for the Doppler function.

For the three-bump function with constant variance σε(x) = 0.5, the two models produced practically indistinguishable MSEs. Note that our AMSE over all x's, using 100 simulations, was 0.0054, which is smaller than the 0.0061 reported by Baladandayuthapani et al. (2005). The coverage probabilities of the 95% credible intervals were very similar for the two models for each value of the covariate, with a slight advantage in favor of the model using homoscedastic errors. The average coverage probability was 94.7% for the model with homoscedastic errors and 93.5% for the model with heteroscedastic errors. These coverage probabilities are very similar to the ones reported by Baladandayuthapani et al. (2005) in their Figure 3.

Figure 3. Comparison of pointwise coverage probabilities of the 95% credible intervals in 500 simulations. Data were simulated from the model described in Section 5 with heteroscedastic error variance. Estimation was done using two methods: Bayesian adaptive penalized splines with homoscedastic errors (dashed) and Bayesian adaptive penalized splines with heteroscedastic errors (solid). The coverage probabilities have been smoothed to remove Monte Carlo variability.

For case (b), σε(x) = 0.5 − 0.8x + 1.6(x − 0.5)+, the heteroscedastic model substantially outperformed the homoscedastic model, both in terms of AMSE and coverage probabilities. In this situation it would be misleading to compare only the average coverage probability for the 95% credible intervals. Indeed, these averages are very close, 94.1% for the heteroscedastic and 93.0% for the homoscedastic method, but they are obtained from very different sets of pointwise coverage probabilities. Figure 3 displays the coverage probabilities for these two methods. Note that for the heteroscedastic method the coverage probability in the [0, 0.5] interval is close to 95%. In the same interval, the coverage probability for the homoscedastic method starts from around 0.80 and increases until it crosses the 95% target around 0.2. Moreover, in the interval [0.3, 0.5] this coverage probability is estimated to be 1. This is due to the fact that on this interval σε(x) decreases linearly from 0.5 to 0.1. Since the homoscedastic method assumes a constant variance, its credible intervals will tend to be shorter than nominal in regions of higher variability (x close to zero), thus producing lower coverage probabilities. In regions of smaller variability (x close to 0.5) the credible intervals will tend to be much wider, thus producing extremely large

276 C M Crainiceanu et al

coverage probabilities.

In the interval [0.55, 0.65] the homoscedastic method slightly outperforms the heteroscedastic method. This seems to be the effect of a "lucky" combination of two factors. As we discussed, the size of the credible intervals for the homoscedastic method in a neighborhood of 0.5 is much larger than nominal. However, at the same point the mean function changes from a constant to a rapidly oscillating function, and the coverage probabilities drop roughly at the same rate. In the interval [0.8, 1] one can notice a phenomenon very similar to the one described for the interval [0, 0.5]. While the function oscillates more rapidly in this region, the heteroscedastic adaptive method produces credible intervals with coverage probabilities close to the nominal 95%. However, the homoscedastic method does not take into account the increased variability and produces credible intervals that are too short.

For these examples, both methods performed roughly similarly on simulated datasets that did not require variance estimation. However, when the error variance is not constant, the method using log-penalized splines to estimate the error variance considerably outperforms the method that does not.
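The pointwise coverage probabilities discussed above can be estimated with a few lines of code. The sketch below is our own Python/NumPy illustration (not the authors' code), assuming the per-simulation credible-interval endpoints have already been computed on a common grid:

```python
import numpy as np

def pointwise_coverage(lower, upper, truth):
    """Pointwise coverage of credible intervals across simulations.

    lower, upper : (nsim, ngrid) interval endpoints, one row per simulated dataset
    truth        : (ngrid,) true function values on the grid
    Returns an (ngrid,) vector of empirical coverage probabilities.
    """
    covered = (lower <= truth) & (truth <= upper)
    return covered.mean(axis=0)
```

The resulting curve of coverage probabilities can then be smoothed, as in Figure 3, to remove Monte Carlo variability.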

6. LOW RANK MULTIVARIATE SMOOTHING

In this section we generalize the ideas in Section 2 to multivariate smoothing while preserving the appealing geometric interpretation of the smoother. Consider the regression y_i = m(x_i) + ε_i, where m(·) is a smooth function of L covariates. We will use radial basis functions, which have the advantage of being more numerically stable than the tensor product of truncated polynomials in more than one dimension. Suppose that x_i ∈ R^L, 1 ≤ i ≤ n, are n vectors of covariates and κ_k ∈ R^L, 1 ≤ k ≤ K_m, are K_m knots. Consider the following distance function:

C(r) = ||r||^(2M−L) for L odd,  C(r) = ||r||^(2M−L) log ||r|| for L even,

where ||·|| denotes the Euclidean norm in R^L and the integer M controls the smoothness of C(·). Let X be the matrix with ith row X_i = [1, x_i^T], let Z_{K_m} = [C(||x_i − κ_k||)]_{1≤i≤n, 1≤k≤K_m}, let Ω_{K_m} = [C(||κ_k − κ_k′||)]_{1≤k,k′≤K_m}, and define Z = Z_{K_m} Ω_{K_m}^{−1/2}, where Ω_{K_m}^{−1/2} is the principal square root of Ω_{K_m}. With these notations, the low-rank approximation of thin-plate spline regression can be obtained as the BLUP in the LMM (Kammann and Wand 2003; Ruppert, Wand, and Carroll 2003)

Y = Xβ + Zb + ε,  E[(b^T, ε^T)^T] = 0,  cov[(b^T, ε^T)^T] = blockdiag(σ²_b I_{K_m}, σ²_ε I_n).  (6.1)
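For concreteness, the construction of X and Z above can be sketched in a few lines of Python/NumPy (our illustration, not the authors' published MATLAB/C/FORTRAN code; the clipping of small eigenvalue magnitudes when forming Ω^{−1/2} is an added numerical safeguard of our own):

```python
import numpy as np

def tps_basis(x, knots, M=2):
    """Low-rank thin-plate spline design matrices X = [1 x] and
    Z = Z_K * Omega^(-1/2), following the construction in the text."""
    n, L = x.shape

    def C(dist):
        # C(r) = r^(2M-L) for L odd, r^(2M-L) * log(r) for L even, with C(0) = 0
        safe = np.where(dist > 0, dist, 1.0)
        out = safe ** (2 * M - L)
        if L % 2 == 0:
            out = out * np.log(safe)
        return np.where(dist > 0, out, 0.0)

    Z_K = C(np.linalg.norm(x[:, None, :] - knots[None, :, :], axis=2))
    Omega = C(np.linalg.norm(knots[:, None, :] - knots[None, :, :], axis=2))

    # square root of the symmetric matrix Omega via its eigendecomposition;
    # taking |eigenvalues| and clipping tiny ones is our own stabilization
    w, V = np.linalg.eigh(Omega)
    w = np.maximum(np.abs(w), 1e-10)
    Omega_inv_half = V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    X = np.column_stack([np.ones(n), x])
    return X, Z_K @ Omega_inv_half
```

The matrices X and Z returned here can be plugged directly into the mixed-model representation (6.1).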

This model contains only one variance component, σ²_b, controlling the shrinkage of b, which is equivalent to one global smoothing parameter λ = σ²_b/σ²_ε, and it implicitly assumes homoscedastic errors. To relax these assumptions we consider a new set of knots κ*_1, …, κ*_{K_b} and define X* and Z* similarly, with the corresponding definition of the matrices X and Z, where the x-covariates are replaced by the knots κ_k and the knots are replaced by the subknots κ*_k. Consider the following model for the response:

y_i = β_0 + Σ_{j=1}^L β_j x_{ij} + Σ_{k=1}^{K_m} b_k z_{ik} + ε_i,  ε_i ~ N{0, σ²_ε(x_i)},  i = 1, …, n,
b_k ~ N{0, σ²_b(κ_k)},  k = 1, …, K_m,  (6.2)

where the log error variance log σ²_ε(x_i) is modeled as a low-rank log-thin-plate spline,

log σ²_ε(x_i) = γ_0 + Σ_{j=1}^L γ_j x_{ij} + Σ_{k=1}^{K_ε} c_k z_{ik},  c_k ~ N(0, σ²_c),  k = 1, …, K_ε,  (6.3)

and the log random-coefficient variance log σ²_b(κ_k) is modeled as another low-rank log-thin-plate spline,

log σ²_b(κ_k) = δ_0 + Σ_{j=1}^L δ_j x*_{kj} + Σ_{j=1}^{K_b} d_j z*_{kj},  d_j ~ N(0, σ²_d),  j = 1, …, K_b,  (6.4)

where x*_{kj} and z*_{kj} are the entries of the X* and Z* matrices, respectively, and the c_k and d_j are assumed mutually independent.

In the case of low-rank smoothers, the set of knots for the covariates and the subknots for modeling the shrinkage process have to be chosen. One possibility is to use equally spaced knots and subknots; we use this approach in our simulation study in Section 7 to keep our method comparable, in this respect, with previously published methods. Another possibility is to select the knots and subknots using a space-filling design (Nychka and Saltzman 1998), which is based on the maximal-separation principle. This avoids wasting knots and is likely to lead to better approximations in sparse regions of the data. The cover.design() function from the R package fields (Fields Development Team 2006) provides software for space-filling knot selection; we use this approach in Section 8 in the Noshiro elevation example.
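A simple stand-in for such space-filling selection is the greedy farthest-point heuristic sketched below. This is an illustration only: cover.design() in fields uses a more refined point-swapping algorithm, and the function name and random starting point here are our own choices.

```python
import numpy as np

def space_filling_knots(x, K, seed=0):
    """Greedy farthest-point selection of K knots from the rows of x,
    approximating the maximal-separation principle."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(x.shape[0]))]      # random starting point
    d = np.linalg.norm(x - x[chosen[0]], axis=1)  # distance to nearest chosen knot
    for _ in range(K - 1):
        nxt = int(np.argmax(d))                   # point farthest from all knots so far
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(x - x[nxt], axis=1))
    return x[chosen]
```

Each added knot maximizes the distance to the current knot set, which naturally places knots in sparse regions of the data.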

7. COMPARISONS WITH OTHER ADAPTIVE SURFACE FITTING METHODS

Lang and Brezger (2004) compared their adaptive surface smoothing method with several methods used in a simulation study by Smith and Kohn (1997): MARS (Friedman 1991), "locfit" (Cleveland and Grosse 1991), "tps" (bivariate cubic thin-plate splines with a single smoothing parameter), tensor-product cubic smoothing splines with five smoothing parameters, and a parametric linear interaction model. We will use the following regression functions, also used by Lang and Brezger:

• m_1(x_1, x_2) = x_1 sin(4πx_2), where x_1 and x_2 are distributed independently uniform on [0, 1];

• m_2(x_1, x_2) = 1.5 exp(−8x_1²) + 3.5 exp(−8x_2²), where x_1 and x_2 are distributed independently normal with mean 0.5 and variance 0.1.
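The two test surfaces can be coded directly. The data-generation step below follows the recipe in the text, with σ = range(f)/4 (the value stated in the Figure 4 caption); the random seed and generator are our own choices:

```python
import numpy as np

def m1(x1, x2):
    """m1(x1, x2) = x1 * sin(4*pi*x2): interactions, moderate spatial variability."""
    return x1 * np.sin(4 * np.pi * x2)

def m2(x1, x2):
    """m2(x1, x2) = 1.5 exp(-8 x1^2) + 3.5 exp(-8 x2^2): main effects only."""
    return 1.5 * np.exp(-8 * x1 ** 2) + 3.5 * np.exp(-8 * x2 ** 2)

# simulate one dataset for m1 (uniform covariates, as in the text)
rng = np.random.default_rng(42)
n = 300
x1, x2 = rng.uniform(size=n), rng.uniform(size=n)
f = m1(x1, x2)
sigma = (f.max() - f.min()) / 4   # sigma = range(f)/4
y = f + sigma * rng.normal(size=n)
```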


Function m_1(·) contains complex interactions and moderate spatial variability, while m_2(·) has only main effects and is much smoother. We used a sample size n = 300, σ = range(f)/4, and 250 simulations from each model.

We will compare the performance of our method with that of Lang and Brezger's, which performed at least as well as or better than other adaptive methods in a variety of examples. To avoid dependence of the results on knot choice, we used the same choice as the one described by Lang and Brezger: 12 × 12 equally spaced knots for the mean function. However, we used thin-plate splines instead of tensor products of B-splines. We modeled the smoothing parameter using a low-rank log-thin-plate spline with an equally spaced 5 × 5 knot grid. We considered two estimators: one with constant error variance (CEV) and one with estimated error variance using a log-thin-plate spline (EEV) with the same 12 × 12 knot grid as for the mean function. The CEV estimator is the one that can be compared directly with the estimators considered by Lang and Brezger (2004). Because Lang and Brezger presented only comparative results using boxplots, we will use their results as read from their Figure 5. The performance of all estimators was measured by the log of the mean squared error, log(MSE), where

MSE(f̂) = (1/n) Σ_{i=1}^n {f̂(x_i) − f(x_i)}².
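In code, this criterion is essentially a one-liner (a Python sketch; the function name is ours):

```python
import numpy as np

def log_mse(fhat, f):
    """log(MSE) with MSE = (1/n) * sum_i (fhat(x_i) - f(x_i))^2."""
    fhat, f = np.asarray(fhat, dtype=float), np.asarray(f, dtype=float)
    return float(np.log(np.mean((fhat - f) ** 2)))
```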

For the CEV estimator and function f_1(·), corresponding to moderate spatial heterogeneity, we obtained a median log(MSE) of −3.67, with an interquartile range of [−3.80, −3.53] and a range of [−4.21, −3.13]. These values are better than the best values reported for Lang and Brezger's methods in their Figure 5(b). Lang and Brezger reported that their method, denoted "P-spline," outperforms all other methods in this case and achieved a median log(MSE) of −3.50. Remarkably, the EEV estimator performed almost as well as the CEV estimator in this case and also outperformed the other methods. The median log(MSE) for EEV was −3.59, with an interquartile range of [−3.74, −3.43] and a range of [−4.14, −3.02]. These results are consistent with univariate results reported by Ruppert and Carroll (2000) and Baladandayuthapani, Mallick, and Carroll (2005), who found that allowing the smoothing parameter to vary smoothly outperforms other adaptive techniques.

For function f_1(·), Figure 4 displays the coverage probabilities of the 95% credible intervals calculated over a 20 × 20 equally spaced grid in [0, 1]². For each grid point we calculated the frequency with which the 95% pointwise credible interval covers the true value of the function at that point. As expected, these coverage probabilities show strong spatial correlation, with lower coverage probabilities along the ridges of the sinus function. Coverage probability is lowest when x_1 is in the [0.2, 0.5] range. The signal-to-noise ratio in this region is about half the signal-to-noise ratio in the region corresponding to x_1 close to 1, which explains why the coverage probability is smaller there. Interestingly, the zeros of the true function, appearing at x_2 = 0.25, 0.50, 0.75, are covered with at least 95% probability. Another interesting feature is the lower coverage probabilities near the north, south, and east boundaries. These features of the pointwise coverage probability map are consistent with the frequentist analysis of Krivobokova et al. (2007), but are not consistent with the coverage probabilities reported by Lang and Brezger for their Bayesian P-spline method.

For our CEV estimator and function f_2(·), corresponding to very low spatial heterogeneity, we obtained a median log(MSE) of −5.95, with an interquartile range of [−6.12, −5.66]


Figure 4. Coverage probability of the pointwise 95% credible intervals for the mean function, for function f_1(x_1, x_2) = x_1 sin(4πx_2) with a constant error standard deviation σ = range(f)/4. Probabilities are computed on a 20 × 20 equally spaced grid of points in [0, 1]². The model used was a thin-plate spline with K = 100 knots for the mean function, a log-thin-plate spline with K = 100 knots for the variance function, and a log-thin-plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

and a range of [−6.51, −5.41]. The results of Lang and Brezger in their Figure 5(a) indicate that their "P-spline" method achieved a median log(MSE) of −6.00. Our method performs roughly similarly to the Bayesian P-spline method of Lang and Brezger and to the cubic thin-plate spline (tps), being outperformed only by Lang and Brezger's adaptive P-spline method with two main effects, "P-spline main." We could also add two main effects to our bivariate adaptive smoother to improve the MSE for this example, but this is not our concern in this article. Again, the EEV estimator performed very similarly to our CEV estimator.

The last simulation study was done using the function f_1(·, ·), where x_1, x_2 are independent, uniformly distributed in [0, 1], with the error standard deviation function

σ_ε(x_1, x_2) = r/32 + (3r/32) x_1²,

where r = range(f_1). We compared our CEV and EEV estimators described at the beginning of this section and, as expected, the EEV estimator outperformed the CEV estimator both in terms of log(MSE) and coverage probabilities. More precisely, for log(MSE) we obtained a median of −5.26 for CEV and −5.35 for EEV, an interquartile range of [−5.37, −5.15] for CEV and [−5.48, −5.24] for EEV, and a range of [−5.72, −4.65] for CEV and [−5.84, −4.84] for EEV. For both estimators the general patterns for the coverage probabilities were the ones displayed in Figure 4. EEV outperformed CEV in terms of coverage probabilities. For example, the regions where the nominal level is exceeded for both estimators have a roughly similar


Figure 5. Locations of sampling (dots) with 100 knots (circles) in the Noshiro example.

shape and location, but the EEV coverage probabilities tend to be closer to their nominal level.

Note that our model incorporating nonparametric variance estimation is more general than the models considered in the literature. Nevertheless, its performance for function m_1(·, ·), with moderate spatial variability and constant variance, was better than all methods considered by Lang and Brezger in their simulation study (see their Figure 5, panel b). Our method also has good performance for the function m_2(·, ·), being second best when compared to the methods considered by Lang and Brezger (see their Figure 5, panel a). For the case when the variance is not constant, our adaptive method that estimates the variance nonparametrically outperformed our adaptive method that assumes a constant variance.

8. THE NOSHIRO EXAMPLE

Noshiro, Japan, was the site of a major earthquake. Much of the damage from the quake was due to soil movement. Professor Thomas O'Rourke, a geotechnical engineer, was investigating the factors that might help predict soil movement during a future quake. One factor thought to be of importance was the slope of the land. To estimate the slope, Ruppert (1997) used a dataset with 799 observations, where x was longitude and latitude and y was elevation at locations in Noshiro. The object of primary interest is the gradient, at least its magnitude and possibly its direction.

We used the thin-plate spline model (6.2) for the mean of y with K_m = 100 knots selected by the maximal-separation criterion implemented in the function cover.design() of the R package fields (Nychka, Haaland, O'Connell, and Ellner 1998). The sampling


Figure 6. Posterior mean of the mean regression function for the Noshiro example. The model used was a thin-plate spline with K = 100 knots for the mean function, a log-thin-plate spline with K = 100 knots for the variance function, and a log-thin-plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

locations and the knots used for fitting are displayed in Figure 5. The log-thin-plate spline described in (6.3), with the same K_ε = 100 knots, was used to model the variance of the error process, σ²_ε(x_i). The log-thin-plate spline described in (6.4), with K_b = 16 subknots, was used to model the variance σ²_b(κ_k), which controls the amount of shrinkage of the b parameters. The subknots were obtained using the cover.design() function applied to the 100 knots.

Figure 6 displays the posterior mean of the mean response function. There is a rather sharp peak around (0.42, 0.4), with a slow decay from the peak in the general south direction and a much faster decay in all other directions. The function becomes smoother towards the boundary. This is exactly the type of function for which adaptive spatial smoothing can substantially improve the fit.

Figure 7 shows that in large areas of the map σ_ε(·) is smooth, with small values. However, in a neighborhood of the peak of the mean function, σ_ε(·) displays two relatively sharp maxima, located NNW and SSE of the peak. We also performed a two-stage frequentist analysis. In the first stage we used a low-rank thin-plate spline for the mean function, and the results from this regression were used to obtain residuals. We then fitted another low-rank thin-plate spline to the absolute values of these residuals. While this map was not identical to the one in Figure 7, it did exhibit the same patterns of heteroscedasticity.

The process log σ²_b(κ_k), controlling the shrinkage of the b_k, is displayed in Figure 8.


Figure 7. Posterior mean of the standard deviation σ_ε(x_i) of the error process for the Noshiro example. The model used was a thin-plate spline with K = 100 knots for the mean function, a log-thin-plate spline with K = 100 knots for the variance function, and a log-thin-plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

Figure 8. Posterior mean of the shrinkage process −log σ²_b(κ_k). The model used was a thin-plate spline with K = 100 knots for the mean function, a log-thin-plate spline with K = 100 knots for the variance function, and a log-thin-plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.


Smaller values of this function correspond to less shrinkage of the b_k towards zero and more local behavior of the smoother. Figure 8 indicates that, in a neighborhood of the peak of the mean function, the shrinkage is smaller, allowing the mean function to change rapidly. Away from the peak the shrinkage is larger, corresponding to a smoother mean.

9. IMPLEMENTATION USING MCMC

We will now provide some details for MCMC simulations of model (2.1), where the variances σ²_ε(x_i) and σ²_b(κ_k) are modeled by Equations (2.3) and (2.4). The implementation for other models, for example for the multivariate smoothing in Section 6, is similar. Consider independent normal priors for the coefficients of the monomials, β_i ~ N(0, σ²_{0β}), γ_i ~ N(0, σ²_{0γ}), i = 0, …, p, δ_i ~ N(0, σ²_{0δ}), i = 0, …, q, and independent inverse-gamma priors for the variance components, σ²_c ~ IGamma(a_c, b_c) and σ²_d ~ IGamma(a_d, b_d). Using these priors, many full conditionals of the posterior distribution are easy to derive, while a few have complex multivariate forms. Our implementation of the MCMC using multivariate Metropolis–Hastings steps proved to be unstable, with poor mixing properties. A simple and reliable solution was to change the model by adding error terms to the log-spline models, that is,

log σ²_ε(x_i) = γ_0 + ··· + γ_p x_i^p + Σ_{s=1}^{K_ε} c_s (x_i − κ^ε_s)^p_+ + u_i,
log σ²_b(κ^m_k) = δ_0 + ··· + δ_q (κ^m_k)^q + Σ_{s=1}^{K_b} d_s (κ^m_k − κ^b_s)^q_+ + v_k,  (9.1)

where u_i ~ N(0, σ²_u) and v_k ~ N(0, σ²_v). This idea was also used for σ²_b by Baladandayuthapani, Mallick, and Carroll (2005). We fixed the values σ²_u = σ²_v = 0.01, because these variances appear nonidentifiable and a standard deviation of 0.1 is small on the log scale. This device makes computations feasible by replacing sampling from complex full conditionals with simple univariate MH steps.

Define, as before, Σ_ε, Σ_b, and θ_X, and define θ_ε = (γ^T, c^T)^T, θ_b = (δ^T, d^T)^T, C_X = [X Z], C_ε = [X_ε Z_ε], and C_b = [X_b Z_b], where X, X_ε, X_b contain the monomials and Z, Z_ε, Z_b contain the truncated polynomials of the spline models for m, log σ²_ε, and log σ²_b, respectively. Also denote

Σ_{0X} = blockdiag(σ²_{0β} I_{p+1}, Σ_b),  Σ_{0ε} = blockdiag(σ²_{0γ} I_{p+1}, σ²_c I_{K_ε}),  Σ_{0b} = blockdiag(σ²_{0δ} I_{q+1}, σ²_d I_{K_b}).

The full conditionals of the posterior are detailed below:

1. [θ_X | ·] ~ N(M_X C_X^T Σ_ε^{−1} Y, M_X), where M_X = (C_X^T Σ_ε^{−1} C_X + Σ_{0X}^{−1})^{−1}.

2. [θ_ε | ·] ~ N(M_ε C_ε^T Y_ε/σ²_u, M_ε), where Y_ε = [log(σ²_{ε,1}), …, log(σ²_{ε,n})]^T and M_ε = (C_ε^T C_ε/σ²_u + Σ_{0ε}^{−1})^{−1}.

3. [θ_b | ·] ~ N(M_b C_b^T Y_b/σ²_v, M_b), where Y_b = [log(σ²_1), …, log(σ²_{K_m})]^T and M_b = (C_b^T C_b/σ²_v + Σ_{0b}^{−1})^{−1}.

4. [σ²_c | ·] ~ IGamma(a_c + K_ε/2, b_c + ||c||²/2).

5. [σ²_d | ·] ~ IGamma(a_d + K_b/2, b_d + ||d||²/2).

6. [σ²_{ε,i} | ·] ∝ σ_{ε,i}^{−3} exp[−(y_i − μ_i)²/(2σ²_{ε,i}) − {log(σ²_{ε,i}) − η_i}²/(2σ²_u)], where μ_i and η_i are the ith components of C_X θ_X and C_ε θ_ε, respectively.

7. [σ²_k | ·] ∝ σ_k^{−3} exp[−b_k²/(2σ²_k) − {log(σ²_k) − ζ_k}²/(2σ²_v)], where ζ_k is the kth component of C_b θ_b.

All of the above conditionals have an explicit form, with the exception of the n + K_m one-dimensional conditionals in 6 and 7. For these distributions we use the Metropolis–Hastings algorithm with a normal proposal distribution centered at the current value and with small variance.

Figure 9. Computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with K_m knots, with the log of the variance of the shrinkage process modeled as another penalized spline with K_b = 4 subknots. The log of the error variance is modeled as another penalized spline with K_ε = K_m knots. Different symbols correspond to increasing numbers of knots: dots (K_m = 10), asterisks (K_m = 20), circles (K_m = 30), x's (K_m = 40).

10. IMPLEMENTATION AND COMPUTATIONAL PERFORMANCE

Our MCMC simulation algorithm provided reliable results and is reasonably fast. To illustrate this, Figure 9 displays computation time (in minutes) versus sample size for 10,000


Figure 10. MCMC histories for eight parameters of interest, including the log of the standard deviation of the error process and the log of the standard deviation of the shrinkage process, as well as some parameters modeling the different functions. The model uses 40 equally spaced knots for the mean and error variance functions and 4 subknots for the adaptive shrinkage process. Chains correspond to the Doppler function example, using quadratic truncated polynomials to fit the mean and log-variance functions and a linear truncated polynomial basis to fit the shrinkage process. For clarity of presentation, only every 20th simulated observation is displayed.

simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with K_m knots, with the log of the variance of the shrinkage process modeled as another penalized spline with K_b = 4 subknots. The log of the error variance is modeled as another penalized spline with K_ε = K_m knots. Different symbols correspond to increasing numbers of knots: dots (K_m = 10), asterisks (K_m = 20), circles (K_m = 30), x's (K_m = 40). Computation time increases roughly linearly with the number of observations, with the same intercept 0 and larger slopes for larger numbers of knots. Note that even with n = 1000 observations and K_m = K_ε = 40, simulation time is about two minutes.

For the models and datasets presented in this article, 10,000 simulations proved sufficient for accurate and reliable inference. Some parameters, such as the mean and variance functions, tended to exhibit much better mixing properties than the variance components σ²_c and σ²_d controlling the shrinkage processes. This is probably due to the reduced amount of information in the data about these parameters. The MCMC convergence and mixing properties were assessed by visual inspection of the chain histories of many parameters of interest. Since we started with good starting values of the parameters, convergence is fast. Figure 10 displays the histories of eight parameters of interest for the model fitted to the


Doppler function described in Section 5. For clarity we display only every 100th sampled value of the chain, but the same patterns would be observed in the full chain history. Similarly good mixing properties were observed in all the other examples presented in this article.

In the search for a good software platform, we implemented our algorithms in MATLAB, C, and FORTRAN. Programs can be downloaded from www.biostat.jhsph.edu/~ccrainic. For our application and implementation there were no serious differences between C and FORTRAN, both of which performed better than MATLAB, reducing computation time two to three times. The reported times in Figure 9 are for our implementation in C.

Given the complexity of the model, we were interested in improving the mixing properties of the Markov chains. The computational trick described in Section 9 was the sine qua non that allowed us to replace an intractable multivariate MH step with several tractable univariate MH steps. We used normal random-walk proposal distributions centered at the current value, with variances tuned to provide acceptance rates between 30% and 50%.
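A crude burn-in rule for landing in that acceptance band can be sketched as follows. The multipliers 0.8 and 1.25 are our own illustrative choices, not the authors' tuning scheme:

```python
def tune_step(step, accept_rate, low=0.30, high=0.50):
    """Rescale a random-walk proposal sd during burn-in so that the
    empirical acceptance rate lands in the [low, high] band."""
    if accept_rate < low:
        return step * 0.8    # rejecting too often: take smaller steps
    if accept_rate > high:
        return step * 1.25   # accepting too often: take bolder steps
    return step
```

Such adjustments should be made only during burn-in, since adapting the proposal indefinitely would invalidate the stationary distribution of the chain.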

11. COMMENTS

Our joint models for a heterogeneous mean and nonconstant variance are designed to extract the maximum amount of information from complex data and to suggest simpler parametric models. Bayesian inference based on MCMC offers a natural and powerful framework for model fitting. However, MCMC in this context is not straightforward, and implementation depends essentially on the ability to transform intractable simulations from high-dimensional distributions into tractable one-dimensional updates, as described in Section 9. Our implementation provides accurate, fast, and reproducible results, and software is available. Actual simulation time increases only linearly with the number of observations, and mixing properties are adequate provided that the proposal distribution variances are tuned for optimal acceptance rates.

ACKNOWLEDGMENTS

Carroll's research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

[Received September 2004. Revised October 2006.]

REFERENCES

Baladandayuthapani, V., Mallick, B. K., and Carroll, R. J. (2005), "Spatially Adaptive Bayesian Regression Splines," Journal of Computational and Graphical Statistics, 14, 378–394.

Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computation and Stochastic Systems" (with discussion), Statistical Science, 10, 3–66.

Carroll, R. J. (2003), "Variances Are Not Always Nuisance Parameters: The 2002 R. A. Fisher Lecture," Biometrics, 59, 211–220.

Carroll, R. J., and Ruppert, D. (1988), Transformation and Weighting in Regression, New York: Chapman and Hall.


Cleveland, W. S., and Grosse, E. (1991), "Computational Methods for Local Regression," Statistics and Computing, 1, 47–62.

Crainiceanu, C., and Ruppert, D. (2004), "Likelihood Ratio Tests in Linear Mixed Models With One Variance Component," Journal of the Royal Statistical Society, Series B, 66, 165–185.

Crainiceanu, C. M., Ruppert, D., and Wand, M. P. (2005), "Bayesian Analysis for Penalized Spline Regression Using WinBUGS," Journal of Statistical Software, 14. Available online at http://www.jstatsoft.org/v14/i14.

Donoho, D. L., and Johnstone, I. M. (1994), "Ideal Spatial Adaptation by Wavelet Shrinkage," Biometrika, 81, 425–455.

Eilers, P. H. C., and Marx, B. D. (1996), "Flexible Smoothing With B-Splines and Penalties" (with discussion), Statistical Science, 11, 89–121.

——— (2002), "Generalized Linear Additive Smooth Structures," Journal of Computational and Graphical Statistics, 11, 758–783.

——— (2004), "Splines, Knots, and Penalties," unpublished manuscript.

Fields Development Team (2006), fields: Tools for Spatial Data, National Center for Atmospheric Research, Boulder, CO. Available online at http://www.cgd.ucar.edu/Software/Fields.

Friedman, J. H. (1991), "Multivariate Adaptive Regression Splines" (with discussion), The Annals of Statistics, 19, 1–141.

Gelman, A. (2006), "Prior Distributions for Variance Parameters in Hierarchical Models," Bayesian Analysis, 1, 515–533.

Gray, R. J. (1994), "Spline-Based Tests in Survival Analysis," Biometrics, 50, 640–652.

Holst, U., Hössjer, O., Björklund, C., Ragnarson, P., and Edner, H. (1996), "Locally Weighted Least Squares Kernel Regression and Statistical Evaluation of LIDAR Measurements," Environmetrics, 7, 401–416.

Jullion, A., and Lambert, P. (2006), Discussion paper DP0534, UCL.

Kammann, E. E., and Wand, M. P. (2003), "Geoadditive Models," Applied Statistics, 52, 1–18.

Kauermann, G. (2002), "A Note on Bandwidth Selection for Penalised Spline Smoothing," Technical Report 02-13, Department of Statistics, University of Glasgow.

Krivobokova, T., Crainiceanu, C. M., and Kauermann, G. (2007), "Fast Adaptive Penalized Splines," Journal of Computational and Graphical Statistics, to appear.

Lang, S., and Brezger, A. (2004), "Bayesian P-Splines," Journal of Computational and Graphical Statistics, 13, 183–212.

Lang, S., Fronk, E. M., and Fahrmeir, L. (2002), "Function Estimation with Locally Adaptive Dynamic Models," Computational Statistics, 17, 479–499.

Marx, B. D., and Eilers, P. H. C. (1998), "Direct Generalized Additive Modeling With Penalized Likelihood," Computational Statistics & Data Analysis, 28, 193–209.

Natarajan, R., and Kass, R. E. (2000), "Reference Bayesian Methods for Generalized Linear Mixed Models," Journal of the American Statistical Association, 95, 227–237.

Ngo, L., and Wand, M. P. (2004), "Smoothing With Mixed Model Software," Journal of Statistical Software, 9. Available online at http://www.jstatsoft.org/v09/i01.

Nychka, D., Haaland, P., O'Connell, M., and Ellner, S. (1998), "FUNFITS: Data Analysis and Statistical Tools for Estimating Functions," in Case Studies in Environmental Statistics, Lecture Notes in Statistics, vol. 132, eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 159–179. Available online at http://www.cgd.ucar.edu/stats/Software/Funfits.

Nychka, D., and Saltzman, N. (1998), "Design of Air Quality Monitoring Networks," in Case Studies in Environmental Statistics, Lecture Notes in Statistics, vol. 132, eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 51–76.


Pintore, A., Speckman, P., and Holmes, C. C. (2006), "Spatially Adaptive Smoothing Splines," Biometrika, 93, 113–125.

Ruppert, D. (1997), "Local Polynomial Regression and its Applications in Environmental Statistics," in Statistics for the Environment (vol. 3), eds. V. Barnett and F. Turkman, Chichester: Wiley, pp. 155–173.

——— (2002), "Selecting the Number of Knots for Penalized Splines," Journal of Computational and Graphical Statistics, 11, 735–757.

Ruppert, D., and Carroll, R. J. (2000), "Spatially-Adaptive Penalties for Spline Fitting," Australian and New Zealand Journal of Statistics, 42, 205–223.

Ruppert, D., Wand, M. P., Holst, U., and Hössjer, O. (1997), "Local Polynomial Variance-Function Estimation," Technometrics, 39, 262–273.

Ruppert, D., Wand, M. P., and Carroll, R. J. (2003), Semiparametric Regression, Cambridge: Cambridge University Press.

Smith, M., and Kohn, R. (1997), "A Bayesian Approach to Nonparametric Bivariate Regression," Journal of the American Statistical Association, 92, 1522–1535.

Wood, S. (2006), Generalized Additive Models: An Introduction with R, New York: Chapman & Hall/CRC.



Figure 2. LIDAR data with unstandardized covariate: effect of four gamma priors with mean 10⁻⁶ for the precision of the truncated polynomial parameters. The variances of these priors are 10, 10³, 10⁶, and 10¹⁰, respectively.

rule should work well when σ²_ε does not have an extremely large variance.

Alternatives to gamma priors are discussed by, for example, Natarajan and Kass (2000) and Gelman (2006). These have the advantage of requiring less care in the choice of hyperparameters. However, we find that, with reasonable care, the conjugate gamma priors can be used in practice. Nonetheless, exploration of other prior families for penalized splines would be well worthwhile, though beyond the scope of this article.

4. SIMULTANEOUS CREDIBLE BOUNDS

Simultaneous credible intervals have been discussed before in the literature, for example, by Besag, Green, Higdon, and Mengersen (1995). The idea can be used to produce simultaneous credible bounds for smoothly estimated functions. To show this, let f(·) be either m(·), log σ²_ε(·), log σ²_b(·), or a derivative of order q, 1 ≤ q ≤ p, of one of these functions. It is straightforward to use MCMC output to construct simultaneous credible bounds on f over an arbitrary finite interval [x_1, x_N]. Typically x_1 and x_N would be the smallest and largest observed values of x.

Let x_1 < x_2 < ··· < x_N be a fine grid of points on this interval. Let Ê{f(x_i)} and ŜD{f(x_i)} be the posterior mean and standard deviation of f(x_i) estimated from an MCMC sample. Let M_α be the (1 − α) sample quantile of max_{1≤i≤N} |[f(x_i) − Ê{f(x_i)}]/ŜD{f(x_i)}| computed from the realizations of f in the MCMC sample. Then I(x_i) = Ê{f(x_i)} ± M_α ŜD{f(x_i)}, i = 1, …, N, are simultaneous credible intervals. For N


large, the upper and lower limits of these intervals can be connected to form simultaneous credible bands. In the remainder of this article we report only simultaneous credible bounds for the functions and their derivatives. In the examples considered below, these bounds tend to be roughly 30–50% wider than pointwise credible bounds.
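The construction above translates directly into code. The sketch below (our Python illustration) operates on an nsim × N matrix of sampled function values on the grid:

```python
import numpy as np

def simultaneous_band(samples, alpha=0.05):
    """(1 - alpha) simultaneous credible band from MCMC draws of f.

    samples : (nsim, N) array, each row one sampled curve on the grid
    Returns (lower, upper), each of shape (N,).
    """
    mean = samples.mean(axis=0)
    sd = samples.std(axis=0, ddof=1)
    # maximum absolute standardized deviation of each sampled curve
    dev = np.max(np.abs(samples - mean) / sd, axis=1)
    m_alpha = np.quantile(dev, 1 - alpha)           # the M_alpha of the text
    return mean - m_alpha * sd, mean + m_alpha * sd
```

By construction, at least a fraction 1 − α of the sampled curves lie entirely inside the band, which is what makes the coverage simultaneous rather than pointwise.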

5. COMPARISON WITH OTHER UNIVARIATE SMOOTHERS

Baladandayuthapani et al. (2005) presented a comprehensive simulation study of their Bayesian spatially adaptive penalized spline model, which is obtained as a particular case of our full model with constant error variance. The results of this study indicate that the method is comparable to or better than several other spatially adaptive Bayesian methods on a variety of datasets. Results also showed that the method produced inferences practically indistinguishable from those obtained using the fast frequentist method of Ruppert and Carroll (2000). Since we are unaware of any other estimation methodology that jointly estimates an adaptive nonparametric mean and a nonparametric variance model, we compare our methodology with that of Ruppert and Carroll (2000). We compare the frequentist properties of our Bayesian methodology, such as MSE and coverage probabilities, calculated by simulating from existing benchmark models.

Consider the regression model

$$y_i = m(x_i) + \varepsilon_i,$$

where the ε_i ~ N(0, σ²_ε) are independent errors. The first function considered was the Doppler function described by Donoho and Johnstone (1994),

$$m(x) = \sqrt{x(1-x)}\,\sin\left\{\frac{2\pi(1 + 0.05)}{x + 0.05}\right\}.$$

For the Doppler function we considered n = 2048 equally spaced x's on [0, 1] and σ²_ε = 1.

Ruppert and Carroll (2000) proposed the following family of models, related to the Doppler but allowing for more shape flexibility:

$$m(x; j) = \sqrt{x(1-x)}\,\sin\left\{\frac{2\pi\left(1 + 2^{(9-4j)/5}\right)}{x + 2^{(9-4j)/5}}\right\},$$

where j = 3 and j = 6 correspond to low and severe spatial heterogeneity, respectively. For this function we considered n = 400 equally spaced x's on [0, 1] and σ²_ε(x) = 0.04. We will refer to this as the RC_j function, where j controls the amount of spatial heterogeneity.

Ruppert and Carroll (2000) and Baladandayuthapani et al. (2005) also considered the following function:

$$m(x) = \exp\{-400(x - 0.6)^2\} + \frac{5}{3}\exp\{-500(x - 0.75)^2\} + 2\exp\{-500(x - 0.9)^2\}, \qquad (5.1)$$


Table 1. Average mean square error (AMSE) for different mean functions, based on 100 simulations. The column "True variance" indicates whether the data-generating process is homoscedastic or heteroscedastic. The columns "Fitted error variance" refer to the type of assumption used for fitting the data: "Homoscedastic" uses the Ruppert and Carroll (2000) fitting algorithm, while "Heteroscedastic" uses the Bayesian inferential algorithm described in this article. The mean and log-variance functions were modeled using degree 2 splines with K_m = K_ε = 40 equally spaced knots. The shrinkage process was modeled using degree 1 splines with K_b = 4 subknots.

                                   Fitted error variance
    Function     True variance     Homoscedastic     Heteroscedastic
    Doppler      Constant          0.0197            0.0218
    RC_{j=3}     Constant          0.0011            0.0012
    RC_{j=6}     Constant          0.0061            0.0065
    3 Bumps      Constant          0.0054            0.0055
    3 Bumps      Variable          0.0031            0.0026

which is roughly constant on [0, 0.5] and has three sharp peaks, at 0.6, 0.75, and 0.9. For this function we considered n = 1000 equally spaced x's on [0, 1] and two different scenarios for the standard deviation of the error process: (a) σ_ε(x) = 0.5 for homoscedastic errors, and (b) σ_ε(x) = 0.5 − 0.8x + 1.6(x − 0.5)₊. We will refer to this as the "three-bump" function.
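For concreteness, the benchmark settings above can be simulated as follows. This is an illustrative Python sketch of our own (the function names are ours, not from the article):

```python
import numpy as np

def doppler(x):
    # Donoho and Johnstone (1994) Doppler function
    return np.sqrt(x * (1 - x)) * np.sin(2 * np.pi * (1 + 0.05) / (x + 0.05))

def rc(x, j):
    # Ruppert and Carroll (2000) family; j controls the spatial heterogeneity
    e = 2 ** ((9 - 4 * j) / 5)
    return np.sqrt(x * (1 - x)) * np.sin(2 * np.pi * (1 + e) / (x + e))

def three_bumps(x):
    # equation (5.1): roughly flat on [0, 0.5], peaks at 0.6, 0.75, and 0.9
    return (np.exp(-400 * (x - 0.6) ** 2)
            + (5 / 3) * np.exp(-500 * (x - 0.75) ** 2)
            + 2 * np.exp(-500 * (x - 0.9) ** 2))

def sd_hetero(x):
    # case (b): decreases linearly from 0.5 to 0.1 on [0, 0.5], then rises back
    return 0.5 - 0.8 * x + 1.6 * np.clip(x - 0.5, 0, None)

# one dataset from the heteroscedastic three-bump scenario (n = 1000)
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 1000)
y = three_bumps(x) + rng.normal(0, sd_hetero(x))
```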

We used 100 simulations from these models and fit a spatially adaptive penalized spline that assumes a constant variance. For the three-bump function we also fitted a model with nonconstant variance, modeled as a log-penalized spline. For the mean and log-variance functions we used quadratic penalized splines with K_m = K_ε = 40 knots. For the log-variance corresponding to the shrinkage process we used K_b = 4 subknots.

We calculated the mean square error for each simulated dataset as

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\{\widehat{m}(x_i) - m(x_i)\}^2,$$

where m̂(x) is the posterior mean of m(x) under a given model. Table 1 provides a summary of the average MSE (AMSE) results over 100 simulations, comparing the method that allows for heteroscedastic errors with the one that does not. Remarkably, when the true variance is constant, our method, which estimates a nonconstant variance, performed nearly as well as methods that assume a constant variance. When the true variance process is not constant, our method performed better.

Our method was slightly outperformed by the Ruppert and Carroll (2000) method in the Doppler case. This is probably because the Ruppert and Carroll method assumes a constant variance, and for the Doppler function the variance used in the simulations was indeed constant, σ²_ε = 1. However, both methods outperformed the method of Baladandayuthapani et al. (2005), who reported an AMSE of 0.024 for the Doppler function.

For the three-bump function with constant variance σ_ε(x) = 0.5, the two models produced practically indistinguishable MSEs. Note that our AMSE over all x's using 100 simulations was AMSE = 0.0054, which is smaller than the 0.0061 reported by Baladandayuthapani et al. (2005). The coverage probabilities of the 95% credible intervals were very similar for the two models at each value of the covariate, with a slight advantage in favor of the model using homoscedastic errors. The average coverage probability was 94.7% for the model with homoscedastic errors and 93.5% for the model with heteroscedastic errors. These coverage probabilities are very similar to the ones reported by Baladandayuthapani et al. (2005) in their Figure 3.

Figure 3. Comparison of pointwise coverage probabilities of the 95% credible intervals in 500 simulations. Data were simulated from the model described in Section 5 with heteroscedastic error variance. Estimation was done using two methods: Bayesian adaptive penalized splines with homoscedastic errors (dashed) and Bayesian adaptive penalized splines with heteroscedastic errors (solid). The coverage probabilities have been smoothed to remove Monte Carlo variability.

For case (b), σ_ε(x) = 0.5 − 0.8x + 1.6(x − 0.5)₊, the heteroscedastic model substantially outperformed the homoscedastic model, both in terms of AMSE and coverage probabilities. In this situation it would be misleading to compare only the average coverage probability of the 95% credible intervals. Indeed, these averages are very close, 94.1% for the heteroscedastic and 93.0% for the homoscedastic method, but they are obtained from very different sets of pointwise coverage probabilities. Figure 3 displays the coverage probabilities for these two methods. Note that for the heteroscedastic method the coverage probability on the interval [0, 0.5] is close to 95%. On the same interval, the coverage probability for the homoscedastic method starts from around 0.8 and increases until it crosses the 95% target around x = 0.2. Moreover, on the interval [0.3, 0.5] this coverage probability is estimated to be 1. This is due to the fact that on this interval σ_ε(x) decreases linearly from 0.5 to 0.1. Since the homoscedastic method assumes a constant variance, the credible intervals will tend to be too short in regions of higher variability (x close to zero), thus producing lower coverage probabilities. In regions of smaller variability (x close to 0.5) the credible intervals will tend to be much wider, thus producing extremely large


coverage probabilities.

On the interval [0.55, 0.65] the homoscedastic method slightly outperforms the heteroscedastic method. This seems to be the effect of a "lucky" combination of two factors. As we discussed, the credible intervals for the homoscedastic method in a neighborhood of 0.5 are much wider than necessary. However, at the same point the mean function changes from a constant to a rapidly oscillating function, and the coverage probabilities drop at roughly the same rate. On the interval [0.8, 1] one can notice a phenomenon very similar to the one described for the interval [0, 0.5]. While the function oscillates more rapidly in this region, the heteroscedastic adaptive method produces credible intervals with coverage probabilities close to the nominal 95%. However, the homoscedastic method does not take into account the increased variability and produces credible intervals that are too short.

For these examples, both methods performed roughly similarly on simulated datasets that did not require variance estimation. However, when the error variance is not constant, the method using log-penalized splines to estimate the error variance considerably outperforms the method that does not.

6. LOW RANK MULTIVARIATE SMOOTHING

In this section we generalize the ideas in Section 2 to multivariate smoothing while preserving the appealing geometric interpretation of the smoother. Consider the regression y_i = m(x_i) + ε_i, where m(·) is a smooth function of L covariates. We will use radial basis functions, which have the advantage of being more numerically stable than the tensor product of truncated polynomials in more than one dimension. Suppose that x_i ∈ R^L, 1 ≤ i ≤ n, are n vectors of covariates and κ_k ∈ R^L, 1 ≤ k ≤ K_m, are K_m knots. Consider the following distance function:

$$C(\mathbf{r}) = \begin{cases} \|\mathbf{r}\|^{2M-L} & \text{for } L \text{ odd,} \\ \|\mathbf{r}\|^{2M-L}\log\|\mathbf{r}\| & \text{for } L \text{ even,} \end{cases}$$

where ‖·‖ denotes the Euclidean norm in R^L and the integer M controls the smoothness of C(·). Let X be the matrix with ith row X_i = [1, x_iᵀ], let Z_{K_m} = {C(‖x_i − κ_k‖)}_{1≤i≤n, 1≤k≤K_m} and Ω_{K_m} = {C(‖κ_k − κ_{k′}‖)}_{1≤k,k′≤K_m}, and define Z = Z_{K_m} Ω_{K_m}^{−1/2}, where Ω_{K_m}^{1/2} is the principal square root of Ω_{K_m}. With these notations, the low-rank approximation of thin-plate spline regression can be obtained as the BLUP in the LMM (Kammann and Wand 2003; Ruppert, Wand, and Carroll 2003)

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{b} + \boldsymbol{\varepsilon}, \quad E\begin{pmatrix}\mathbf{b}\\ \boldsymbol{\varepsilon}\end{pmatrix} = \begin{pmatrix}\mathbf{0}\\ \mathbf{0}\end{pmatrix}, \quad \mathrm{cov}\begin{pmatrix}\mathbf{b}\\ \boldsymbol{\varepsilon}\end{pmatrix} = \begin{pmatrix}\sigma^2_b I_{K_m} & 0\\ 0 & \sigma^2_\varepsilon I_n\end{pmatrix}. \qquad (6.1)$$
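The construction of X and Z can be sketched in a few lines of numpy. This is an illustrative sketch of ours; in particular, the SVD-based inverse square root is a pragmatic stand-in for Ω^{−1/2}, since Ω is symmetric but only conditionally positive definite:

```python
import numpy as np

def radial_C(r, L, M=2):
    # C(r) = r^(2M-L) for L odd; r^(2M-L) * log(r) for L even (r = Euclidean norm)
    if L % 2 == 1:
        return r ** (2 * M - L)
    out = np.zeros_like(r)
    pos = r > 0                 # C(0) = 0 handles the knot-knot diagonal terms
    out[pos] = r[pos] ** (2 * M - L) * np.log(r[pos])
    return out

def tps_design(x, knots, M=2):
    """X and Z for the low-rank thin-plate spline smoother (knots distinct)."""
    n, L = x.shape
    X = np.column_stack([np.ones(n), x])   # ith row is [1, x_i^T]
    ZK = radial_C(np.linalg.norm(x[:, None] - knots[None, :], axis=2), L, M)
    Om = radial_C(np.linalg.norm(knots[:, None] - knots[None, :], axis=2), L, M)
    # Omega is symmetric but only conditionally positive definite; an SVD-based
    # inverse square root is a simple, numerically stable stand-in here
    U, s, Vt = np.linalg.svd(Om)
    return X, ZK @ (U @ np.diag(1.0 / np.sqrt(s)) @ Vt)

# toy usage: n = 40 bivariate covariates, K = 8 knots
rng = np.random.default_rng(2)
X, Z = tps_design(rng.uniform(size=(40, 2)), rng.uniform(size=(8, 2)))
```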

This model contains only one variance component, σ²_b, controlling the shrinkage of b, which is equivalent to one global smoothing parameter λ = σ²_b/σ²_ε, and it implicitly assumes homoscedastic errors. To relax these assumptions we consider a new set of knots κ*₁, …, κ*_{K_b}, and define X* and Z* similarly to X and Z, with the x-covariates replaced by the knots κ_k and the knots replaced by the subknots κ*_k. Consider the following model for the response:

$$y_i = \beta_0 + \sum_{j=1}^{L}\beta_j x_{ij} + \sum_{k=1}^{K_m} b_k z_{ik} + \varepsilon_i, \quad \varepsilon_i \sim N\{0, \sigma^2_\varepsilon(\mathbf{x}_i)\},\; i = 1, \ldots, n, \quad b_k \sim N\{0, \sigma^2_b(\boldsymbol{\kappa}_k)\},\; k = 1, \ldots, K_m, \qquad (6.2)$$

where the log error variance log σ²_ε(x_i) is modeled as a low-rank log-thin plate spline,

$$\log\sigma^2_\varepsilon(\mathbf{x}_i) = \gamma_0 + \sum_{j=1}^{L}\gamma_j x_{ij} + \sum_{k=1}^{K_\varepsilon} c_k z_{ik}, \quad c_k \sim N(0, \sigma^2_c),\; k = 1, \ldots, K_\varepsilon, \qquad (6.3)$$

and the random-coefficient variance log σ²_b(κ_k) is modeled as another low-rank log-thin plate spline,

$$\log\sigma^2_b(\boldsymbol{\kappa}_k) = \delta_0 + \sum_{j=1}^{L}\delta_j x^*_{kj} + \sum_{j=1}^{K_b} d_j z^*_{kj}, \quad d_j \sim N(0, \sigma^2_d),\; j = 1, \ldots, K_b, \qquad (6.4)$$

where x*_{kj} and z*_{kj} are the entries of the X* and Z* matrices, respectively, and the c_k and d_j are assumed mutually independent.

In the case of low-rank smoothers, the set of knots for the covariates and the subknots for modeling the shrinkage process have to be chosen. One possibility is to use equally spaced knots and subknots. We use this approach in our simulation study in Section 7 to make our method comparable, from this point of view, with previously published methods. Another possibility is to select the knots and subknots using a space-filling design (Nychka and Saltzman 1998), which is based on the maximal separation principle. This avoids wasting knots and is likely to lead to better approximations in sparse regions of the data. The cover.design() function from the R package fields (fields Development Team 2006) provides software for space-filling knot selection. We use this approach in Section 8 in the Noshiro elevation example.
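As a rough illustration of the maximal separation idea behind space-filling designs (this greedy farthest-point rule is our sketch, not the algorithm used by the fields package):

```python
import numpy as np

def farthest_point_knots(x, K, seed=0):
    """Greedy space-filling selection: each new knot is the candidate point
    farthest (in Euclidean distance) from the knots chosen so far."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(x)))]              # arbitrary starting point
    d = np.linalg.norm(x - x[idx[0]], axis=1)      # distance to nearest chosen knot
    for _ in range(K - 1):
        nxt = int(np.argmax(d))                    # best-separated remaining point
        idx.append(nxt)
        d = np.minimum(d, np.linalg.norm(x - x[nxt], axis=1))
    return x[np.array(idx)]

# toy usage: select 100 knots from 799 scattered sampling locations
rng = np.random.default_rng(3)
pts = rng.uniform(size=(799, 2))
knots = farthest_point_knots(pts, 100)
```

Like cover.design(), this keeps knots well separated and places them in sparse regions, although the greedy rule does not optimize the same coverage criterion.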

7. COMPARISONS WITH OTHER ADAPTIVE SURFACE FITTING METHODS

Lang and Brezger (2004) compared their adaptive surface smoothing method with several methods used in a simulation study by Smith and Kohn (1997): MARS (Friedman 1991), "locfit" (Cleveland and Grosse 1991), "tps" (bivariate cubic thin-plate splines with a single smoothing parameter), tensor-product cubic smoothing splines with five smoothing parameters, and a parametric linear interaction model. We will use the following regression functions, also used by Lang and Brezger:

• m₁(x₁, x₂) = x₁ sin(4πx₂), where x₁ and x₂ are distributed independently uniform on [0, 1];

• m₂(x₁, x₂) = 1.5 exp(−8x₁²) + 3.5 exp(−8x₂²), where x₁ and x₂ are distributed independently normal with mean 0.5 and variance 0.1.


Function m₁(·) contains complex interactions and moderate spatial variability, while m₂(·) has only main effects and is much smoother. We used a sample size of n = 300, σ = range(f)/4, and 250 simulations from each model.

We will compare the performance of our method with that of Lang and Brezger's, which performed at least as well as, or better than, other adaptive methods in a variety of examples. To avoid dependence of the results on knot choice, we used the same choice as the one described by Lang and Brezger: 12 × 12 equally spaced knots for the mean function. However, we used thin-plate splines instead of tensor products of B-splines. We modeled the smoothing parameter using a low-rank log-thin plate spline with an equally spaced 5 × 5 knot grid. We considered two estimators: one with constant error variance (CEV), and one with estimated error variance using a log-thin plate spline (EEV) with the same 12 × 12 knot grid as for the mean function. The CEV estimator is the one that can be compared directly with the estimators considered by Lang and Brezger (2004). Because Lang and Brezger presented only comparative results using boxplots, we will use their results as read from their Figure 5. The performance of all estimators was measured by the log of the mean squared error, log(MSE), where

$$\mathrm{MSE}(\widehat{f}) = \frac{1}{n}\sum_{i=1}^{n}\{\widehat{f}(x_i) - f(x_i)\}^2.$$

For the CEV estimator and function f₁(·), corresponding to moderate spatial heterogeneity, we obtained a median log(MSE) of −3.67, with an interquartile range of [−3.80, −3.53] and a range of [−4.21, −3.13]. These values are better than the best values reported for Lang and Brezger's methods in their Figure 5(b). Lang and Brezger reported that their method, denoted "P-spline," outperforms all other methods in this case, achieving a median log(MSE) of −3.50. Remarkably, the EEV estimator performed almost as well as the CEV estimator in this case and also outperformed the other methods. The median log(MSE) for EEV was −3.59, with an interquartile range of [−3.74, −3.43] and a range of [−4.14, −3.02]. These results are consistent with univariate results reported by Ruppert and Carroll (2000) and Baladandayuthapani, Mallick, and Carroll (2005), who found that allowing the smoothing parameter to vary smoothly outperforms other adaptive techniques.

For function f₁(·), Figure 4 displays the coverage probabilities of the 95% credible intervals calculated over a 20 × 20 equally spaced grid in [0, 1]². For each grid point we calculated the frequency with which the 95% pointwise credible interval covers the true value of the function at that point. As expected, these coverage probabilities show strong spatial correlation, with lower coverage probabilities along the ridges of the sinus function. Coverage probability is lowest when x₁ is in the [0.2, 0.5] range. The signal-to-noise ratio in this region is about half the signal-to-noise ratio in the region corresponding to x₁ close to 1, which explains why the coverage probability is smaller there. Interestingly, the zeros of the true function, appearing at x₂ = 0.25, 0.50, 0.75, are covered with at least 95% probability. Another interesting feature is the lower coverage probabilities near the north, south, and east boundaries. These features of the pointwise coverage probability map are consistent with the frequentist analysis of Krivobokova et al. (2007), but are not consistent with the coverage probabilities reported by Lang and Brezger for their Bayesian P-spline method.

For our CEV estimator and function f₂(·), corresponding to very low spatial heterogeneity, we obtained a median log(MSE) of −5.95, with an interquartile range of [−6.12, −5.66] and a range of [−6.51, −5.41]. The results of Lang and Brezger in their Figure 5(a) indicate that their "P-spline" method achieved a median log(MSE) of −6.00. Our method performs roughly similarly to the Bayesian P-spline method of Lang and Brezger and to the cubic thin-plate spline (tps), being outperformed only by Lang and Brezger's adaptive P-spline method with two main effects, "P-spline main." We could also add two main effects to our bivariate adaptive smoother to improve MSE for this example, but this is not our concern in this article. Again, the EEV estimator performed very similarly to our CEV estimator.

Figure 4. Coverage probability of the pointwise 95% credible intervals for the mean function, for f₁(x₁, x₂) = x₁ sin(4πx₂) with a constant error standard deviation σ = range(f)/4. Probabilities are computed on a 20 × 20 equally spaced grid of points in [0, 1]². The model used was a thin-plate spline with K = 100 knots for the mean function, a log-thin plate spline with K = 100 knots for the variance function, and a log-thin plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

The last simulation study was done using the function f₁(·, ·), where x₁ and x₂ are independent, uniformly distributed on [0, 1], with error standard deviation function

$$\sigma_\varepsilon(x_1, x_2) = \frac{r}{32} + \frac{3r}{32}\,x_1^2,$$

where r = range(f₁). We compared our CEV and EEV estimators described at the beginning of this section and, as expected, the EEV estimator outperformed the CEV estimator both in terms of log(MSE) and coverage probabilities. More precisely, for log(MSE) we obtained a median of −5.26 for CEV and −5.35 for EEV, an interquartile range of [−5.37, −5.15] for CEV and [−5.48, −5.24] for EEV, and a range of [−5.72, −4.65] for CEV and [−5.84, −4.84] for EEV. For both estimators the general patterns of the coverage probabilities were the ones displayed in Figure 4. EEV outperformed CEV in terms of coverage probabilities. For example, the regions where the nominal level is exceeded for both estimators have a roughly similar


Figure 5. Locations of sampling (dots) with 100 knots (circles) in the Noshiro example.

shape and location, but the EEV coverage probabilities tend to be closer to their nominal level.

Note that our model incorporating nonparametric variance estimation is more general than the models considered in the literature. However, its performance for function m₁(·, ·), with moderate spatial variability and constant variance, was better than that of all methods considered by Lang and Brezger in their simulation study (see their Figure 5, panel b). Our method also has good performance for the function m₂(·, ·), being second best when compared to the methods considered by Lang and Brezger (see their Figure 5, panel a). For the case when the variance is not constant, our adaptive method that estimates the variance nonparametrically outperformed our adaptive method that assumes a constant variance.

8. THE NOSHIRO EXAMPLE

Noshiro, Japan, was the site of a major earthquake. Much of the damage from the quake was due to soil movement. Professor Thomas O'Rourke, a geotechnical engineer, was investigating the factors that might help predict soil movement during a future quake. One factor thought to be of importance was the slope of the land. To estimate the slope, Ruppert (1997) used a dataset with 799 observations, where x was longitude and latitude and y was elevation at locations in Noshiro. The object of primary interest is the gradient, at least its magnitude and possibly its direction.

We used the thin-plate spline model (6.2) for the mean of y, with K_m = 100 knots selected by the maximum separation criterion implemented in the function cover.design() of the R package fields (Nychka, Haaland, O'Connell, and Ellner 1998). The sampling locations and the knots used for fitting are displayed in Figure 5. The log-thin plate spline described in (6.3), with the same K_ε = 100 knots, was used to model the variance of the error process, σ²_ε(x_i). The log-thin plate spline described in (6.4), with K_b = 16 subknots, was used to model the variance σ²_b(κ_k), which controls the amount of shrinkage of the b parameters. The subknots were obtained by applying the cover.design() function to the 100 knots.

Figure 6. Posterior mean of the mean regression function for the Noshiro example. The model used was a thin-plate spline with K = 100 knots for the mean function, a log-thin plate spline with K = 100 knots for the variance function, and a log-thin plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

Figure 6 displays the posterior mean of the mean response function. There is a rather sharp peak around (0.42, 0.4), with a slow decay from the peak in the general south direction and a much faster decay in all other directions. The function becomes smoother towards the boundary. This is exactly the type of function for which adaptive spatial smoothing can substantially improve the fit.

Figure 7 shows that in large areas of the map σ_ε(·) is smooth and takes small values. However, in a neighborhood of the peak of the mean function, σ_ε(·) displays two relatively sharp maxima, located NNW and SSE of the peak. We also performed a two-stage frequentist analysis. In the first stage we used a low-rank thin-plate spline for the mean function, and the results from this regression were used to obtain residuals. We then fitted another low-rank thin-plate spline to the absolute values of these residuals. While this map was not identical to the one in Figure 7, it did exhibit the same patterns of heteroscedasticity.

The process log σ²_b(κ_k), controlling the shrinkage of the b_k, is displayed in Figure 8. Smaller values of this function correspond to less shrinkage of b_k towards zero and more local behavior of the smoother. Figure 8 indicates that in a neighborhood of the peak of the mean function the shrinkage is smaller, allowing the mean function to change rapidly. Away from the peak the shrinkage is larger, corresponding to a smoother mean.

Figure 7. Posterior mean of the standard deviation σ_ε(x_i) of the error process for the Noshiro example. The model used was a thin-plate spline with K = 100 knots for the mean function, a log-thin plate spline with K = 100 knots for the variance function, and a log-thin plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

Figure 8. Posterior mean of the shrinkage process −log σ²_b(κ_k). The model used was a thin-plate spline with K = 100 knots for the mean function, a log-thin plate spline with K = 100 knots for the variance function, and a log-thin plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

9. IMPLEMENTATION USING MCMC

We will now provide some details for MCMC simulation of model (2.1), where the variances σ²_ε(x_i) and σ²_b(κ_k) are modeled by Equations (2.3) and (2.4). The implementation for other models, for example for the multivariate smoothing in Section 6, is similar. Consider independent normal priors for the coefficients of the monomials, β_i ~ N(0, σ²_{0β}), γ_i ~ N(0, σ²_{0γ}), i = 0, …, p, δ_i ~ N(0, σ²_{0δ}), i = 0, …, q, and independent inverse gamma priors for the variance components, σ²_c ~ IGamma(a_c, b_c) and σ²_d ~ IGamma(a_d, b_d). Using these priors, many full conditionals of the posterior distribution are easy to derive, while a few have complex multivariate forms. Our implementation of the MCMC using multivariate Metropolis-Hastings steps proved to be unstable, with poor mixing properties. A simple and reliable solution was to change the model by adding error terms to the log-spline models, that is,

$$\log\sigma^2_\varepsilon(x_i) = \gamma_0 + \cdots + \gamma_p x_i^p + \sum_{s=1}^{K_\varepsilon} c_s (x_i - \kappa^\varepsilon_s)_+^p + u_i,$$

$$\log\sigma^2_b(\kappa^m_k) = \delta_0 + \cdots + \delta_q (\kappa^m_k)^q + \sum_{s=1}^{K_b} d_s (\kappa^m_k - \kappa^b_s)_+^q + v_k, \qquad (9.1)$$

where u_i ~ N(0, σ²_u) and v_k ~ N(0, σ²_v). This idea was also used for σ²_b by Baladandayuthapani, Mallick, and Carroll (2005). We fixed the values σ²_u = σ²_v = 0.01, because these variances appear to be nonidentifiable and a standard deviation of 0.1 is small on a log scale. This device makes computations feasible by replacing sampling from complex full conditionals with simple univariate Metropolis-Hastings (MH) steps.

Define, as before, Σ_ε = diag{σ²_ε(x₁), …, σ²_ε(x_n)}, Σ_b = diag{σ²_b(κ₁), …, σ²_b(κ_{K_m})}, and θ_X, and define θ_ε = (γᵀ, cᵀ)ᵀ, θ_b = (δᵀ, dᵀ)ᵀ, C_X = (X, Z), C_ε = (X_ε, Z_ε), C_b = (X_b, Z_b), where X, X_ε, X_b contain the monomials and Z, Z_ε, Z_b contain the truncated polynomials of the spline models for m, log σ²_ε, and log σ²_b, respectively. Also denote

$$\Sigma_{0X} = \begin{bmatrix} \sigma^2_{0\beta} I_{p+1} & 0 \\ 0 & \Sigma_b \end{bmatrix}, \quad \Sigma_{0\varepsilon} = \begin{bmatrix} \sigma^2_{0\gamma} I_{p+1} & 0 \\ 0 & \sigma^2_c I_{K_\varepsilon} \end{bmatrix}, \quad \Sigma_{0b} = \begin{bmatrix} \sigma^2_{0\delta} I_{q+1} & 0 \\ 0 & \sigma^2_d I_{K_b} \end{bmatrix}.$$

The full conditionals of the posterior are detailed below:

1. [θ_X] ~ N(M_X C_Xᵀ Σ_ε^{−1} Y, M_X), where M_X = (C_Xᵀ Σ_ε^{−1} C_X + Σ_{0X}^{−1})^{−1}.

2. [θ_ε] ~ N(M_ε C_εᵀ Y_ε/σ²_u, M_ε), where Y_ε = [log(σ²_{ε1}), …, log(σ²_{εn})]ᵀ and M_ε = (C_εᵀ C_ε/σ²_u + Σ_{0ε}^{−1})^{−1}.

3. [θ_b] ~ N(M_b C_bᵀ Y_b/σ²_v, M_b), where Y_b = [log(σ²_{b1}), …, log(σ²_{bK_m})]ᵀ and M_b = (C_bᵀ C_b/σ²_v + Σ_{0b}^{−1})^{−1}.

4. [σ²_c] ~ IGamma(a_c + K_ε/2, b_c + ‖c‖²/2).

5. [σ²_d] ~ IGamma(a_d + K_b/2, b_d + ‖d‖²/2).

6. [σ²_{εi}] ∝ σ_{εi}^{−3} exp[−(y_i − μ_i)²/(2σ²_{εi}) − {log(σ²_{εi}) − η_i}²/(2σ²_u)], where μ_i and η_i are the ith components of C_X θ_X and C_ε θ_ε, respectively.

7. [σ²_{bk}] ∝ σ_{bk}^{−3} exp[−b_k²/(2σ²_{bk}) − {log(σ²_{bk}) − ζ_k}²/(2σ²_v)], where ζ_k is the kth component of C_b θ_b.

Figure 9. Computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with K_m knots, with the log of the variance of the shrinkage process modeled as another penalized spline with K_b = 4 subknots. The log of the error variance is modeled as another penalized spline with K_ε = K_m knots. Different symbols correspond to increasing numbers of knots: dots (K_m = 10), asterisks (K_m = 20), circles (K_m = 30), x's (K_m = 40).

All of the above conditionals have an explicit form, with the exception of the n + K_m one-dimensional conditionals in 6 and 7. For these distributions we use the Metropolis-Hastings algorithm with a normal proposal distribution centered at the current value and with small variance.
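A single update of the type used for conditional 6 can be sketched as a univariate random-walk MH step. This is an illustrative Python sketch (the residual and η_i values below are hypothetical, and the step size is our choice, not a value from the article):

```python
import math
import numpy as np

def log_cond(v, resid, eta, sigma_u2=0.01):
    # log of conditional 6 up to a constant; note sigma_eps^{-3} = v^{-3/2}
    if v <= 0:
        return -np.inf
    return (-1.5 * math.log(v) - resid ** 2 / (2 * v)
            - (math.log(v) - eta) ** 2 / (2 * sigma_u2))

def mh_step(v, resid, eta, rng, step_sd=0.05):
    # normal random-walk proposal centered at the current value, small variance
    prop = v + rng.normal(0, step_sd)
    if np.log(rng.uniform()) < log_cond(prop, resid, eta) - log_cond(v, resid, eta):
        return prop
    return v

# short chain for a single observation; resid and eta are hypothetical values
rng = np.random.default_rng(5)
v, resid, eta = 1.0, 0.3, math.log(0.25)
chain = []
for _ in range(5000):
    v = mh_step(v, resid, eta, rng)
    chain.append(v)
```

With σ²_u = 0.01 the log-spline term dominates, so the chain concentrates near exp(η_i); proposals outside (0, ∞) have zero posterior density and are automatically rejected.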

10. IMPLEMENTATION AND COMPUTATIONAL PERFORMANCE

Our MCMC simulation algorithm provided reliable results and is reasonably fast. To illustrate this, Figure 9 displays computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with K_m knots, with the log of the variance of the shrinkage process modeled as another penalized spline with K_b = 4 subknots. The log of the error variance is modeled as another penalized spline with K_ε = K_m knots. Different symbols correspond to increasing numbers of knots: dots (K_m = 10), asterisks (K_m = 20), circles (K_m = 30), x's (K_m = 40). Computation time increases roughly linearly with the number of observations, with the same zero intercept and larger slopes for larger numbers of knots. Note that even with n = 1000 observations and K_m = K_ε = 40, simulation time is about two minutes.

Figure 10. MCMC histories for eight parameters of interest, including the log of the standard deviation of the error process and the log of the standard deviation of the shrinkage process, as well as some parameters modeling the different functions. The model uses 40 equally spaced knots for the mean and error variance functions and 4 subknots for the adaptive shrinkage process. Chains correspond to the Doppler function example, using quadratic truncated polynomials to fit the mean and log-variance functions and a linear truncated polynomial basis for fitting the shrinkage process. For clarity of presentation, only every 20th simulated observation is displayed.

For the models and datasets presented in this article, 10,000 simulations proved sufficient for accurate and reliable inference. Some parameters, such as the mean and variance functions, tended to exhibit much better mixing properties than the variance components σ²_c and σ²_d controlling the shrinkage processes. This is probably due to the reduced amount of information in the data about these parameters. The MCMC convergence and mixing properties were assessed by visual inspection of the chain histories of many parameters of interest. Since we started with good starting values of the parameters, convergence is fast. Figure 10 displays the histories of eight parameters of interest for the model fitted to the Doppler function described in Section 5. For clarity we display only every 100th sampled value of the chain, but the same patterns would be observed in the full chain history. Similarly good mixing properties were noted in all the other examples presented in this article.

In the search for a good software platform, we implemented our algorithms in MATLAB, C, and FORTRAN. Programs can be downloaded from www.biostat.jhsph.edu/~ccrainic. For our application and implementation there were no serious differences between C and FORTRAN, both of which performed better than MATLAB, reducing computation time two to three times. The reported times in Figure 9 are for our implementation in C.

Given the complexity of the model, we were interested in improving the mixing properties of the Markov chains. The computational trick described in Section 9 was the sine qua non device that allowed us to replace an intractable multivariate MH step with several tractable univariate MH steps. We used normal random-walk proposal distributions centered at the current value, with variance tuned to provide acceptance rates between 30% and 50%.

11. COMMENTS

Our joint models for a heterogeneous mean and nonconstant variance are designed to extract the maximum amount of information from complex data and to suggest simpler parametric models. Bayesian inference based on MCMC offers a natural and powerful framework for model fitting. However, MCMC in this context is not straightforward, and implementation depends essentially on the ability to transform intractable simulations from high-dimensional distributions into tractable one-dimensional updates, as described in Section 9. Our implementation provides accurate, fast, and reproducible results, and software is publicly available. Simulation time increases only linearly with the number of observations, and mixing properties are adequate provided that the proposal distribution variances are tuned for optimal acceptance rates.

ACKNOWLEDGMENTS

Carroll's research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

[Received September 2004. Revised October 2006.]

REFERENCES

Baladandayuthapani V Mallick B K and Carroll R J (2005) ldquoSpatially Adaptive Bayesian RegressionSplinesrdquo Journal of Computational and Graphical Statistics 14 378ndash394

Besag J Green P Higdon D and Mengersen K (1995) ldquoBayesian Computation and Stochastic Systemsrdquo(with discussions) Statistical Science 10 3ndash66

Carroll R J (2003) ldquoVariances Are Not Always Nuisance Parameters The 2002 R A Fisher Lecturerdquo Biometrics59 211ndash220

Carroll R J and Ruppert D (1988) Transformation and Weighting in Regression New York Chapman andHall

Spatially Adaptive Bayesian Penalized Splines 287

Cleveland, W. S., and Grosse, E. (1991), "Computational Methods for Local Regression," Statistics and Computing, 1, 47-62.

Crainiceanu, C., and Ruppert, D. (2004), "Likelihood Ratio Tests in Linear Mixed Models With One Variance Component," Journal of the Royal Statistical Society, Series B, 66, 165-185.

Crainiceanu, C. M., Ruppert, D., and Wand, M. P. (2005), "Bayesian Analysis for Penalized Spline Regression Using WinBUGS," Journal of Statistical Software, 14. Available online at http://www.jstatsoft.org/v14/i14.

Donoho, D. L., and Johnstone, I. M. (1994), "Ideal Spatial Adaptation by Wavelet Shrinkage," Biometrika, 81, 425-455.

Eilers, P. H. C., and Marx, B. D. (1996), "Flexible Smoothing With B-Splines and Penalties" (with discussion), Statistical Science, 11, 89-121.

Eilers, P. H. C., and Marx, B. D. (2002), "Generalized Linear Additive Smooth Structures," Journal of Computational and Graphical Statistics, 11, 758-783.

Eilers, P. H. C., and Marx, B. D. (2004), "Splines, Knots, and Penalties," unpublished manuscript.

Fields Development Team (2006), fields: Tools for Spatial Data, National Center for Atmospheric Research, Boulder, CO. Available online at http://www.cgd.ucar.edu/Software/Fields.

Friedman, J. H. (1991), "Multivariate Adaptive Regression Splines" (with discussion), The Annals of Statistics, 19, 1-141.

Gelman, A. (2006), "Prior Distributions for Variance Parameters in Hierarchical Models," Bayesian Analysis, 1, 515-533.

Gray, R. J. (1994), "Spline-Based Tests in Survival Analysis," Biometrics, 50, 640-652.

Holst, U., Hössjer, O., Björklund, C., Ragnarson, P., and Edner, H. (1996), "Locally Weighted Least Squares Kernel Regression and Statistical Evaluation of LIDAR Measurements," Environmetrics, 7, 401-416.

Jullion, A., and Lambert, P. (2006), Discussion Paper DP0534, UCL.

Kammann, E. E., and Wand, M. P. (2003), "Geoadditive Models," Applied Statistics, 52, 1-18.

Kauermann, G. (2002), "A Note on Bandwidth Selection for Penalised Spline Smoothing," Technical Report 02-13, Department of Statistics, University of Glasgow.

Krivobokova, T., Crainiceanu, C. M., and Kauermann, G. (2007), "Fast Adaptive Penalized Splines," Journal of Computational and Graphical Statistics, to appear.

Lang, S., and Brezger, A. (2004), "Bayesian P-Splines," Journal of Computational and Graphical Statistics, 13, 183-212.

Lang, S., Fronk, E. M., and Fahrmeir, L. (2002), "Function Estimation With Locally Adaptive Dynamic Models," Computational Statistics, 17, 479-499.

Marx, B. D., and Eilers, P. H. C. (1998), "Direct Generalized Additive Modeling With Penalized Likelihood," Computational Statistics & Data Analysis, 28, 193-209.

Natarajan, R., and Kass, R. E. (2000), "Reference Bayesian Methods for Generalized Linear Mixed Models," Journal of the American Statistical Association, 95, 227-237.

Ngo, L., and Wand, M. P. (2004), "Smoothing With Mixed Model Software," Journal of Statistical Software, 9. Available online at http://www.jstatsoft.org/v09/i01.

Nychka, D., Haaland, P., O'Connell, M., and Ellner, S. (1998), "FUNFITS: Data Analysis and Statistical Tools for Estimating Functions," in Case Studies in Environmental Statistics, Lecture Notes in Statistics, vol. 132, eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 159-179. Available online at http://www.cgd.ucar.edu/stats/Software/Funfits.

Nychka, D., and Saltzman, N. (1998), "Design of Air Quality Monitoring Networks," in Case Studies in Environmental Statistics, Lecture Notes in Statistics, vol. 132, eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 51-76.


Pintore, A., Speckman, P., and Holmes, C. C. (2006), "Spatially Adaptive Smoothing Splines," Biometrika, 93, 113-125.

Ruppert, D. (1997), "Local Polynomial Regression and its Applications in Environmental Statistics," in Statistics for the Environment (vol. 3), eds. V. Barnett and F. Turkman, Chichester: Wiley, pp. 155-173.

Ruppert, D. (2002), "Selecting the Number of Knots for Penalized Splines," Journal of Computational and Graphical Statistics, 11, 735-757.

Ruppert, D., and Carroll, R. J. (2000), "Spatially-Adaptive Penalties for Spline Fitting," Australian and New Zealand Journal of Statistics, 42, 205-223.

Ruppert, D., Wand, M. P., Holst, U., and Hössjer, O. (1997), "Local Polynomial Variance-Function Estimation," Technometrics, 39, 262-273.

Ruppert, D., Wand, M. P., and Carroll, R. J. (2003), Semiparametric Regression, Cambridge: Cambridge University Press.

Smith, M., and Kohn, R. (1997), "A Bayesian Approach to Nonparametric Bivariate Regression," Journal of the American Statistical Association, 92, 1522-1535.

Wood, S. (2006), Generalized Additive Models: An Introduction with R, New York: Chapman & Hall/CRC.



large, the upper and lower limits of these intervals can be connected to form simultaneous credible bands. In the remainder of this article we report only simultaneous credible bounds for the functions and their derivatives. In the examples considered in the following, these bounds tend to be roughly 30-50% wider than pointwise credible bounds.

5. COMPARISON WITH OTHER UNIVARIATE SMOOTHERS

Baladandayuthapani et al. (2005) presented a comprehensive simulation study of their Bayesian spatially adaptive penalized spline model, which is obtained as a particular case of our full model with constant error variance. The results of this study indicate that the method is comparable to, or better than, several other spatially adaptive Bayesian methods on a variety of datasets. Results also showed that the method produced inferences practically indistinguishable from those obtained using the fast frequentist method of Ruppert and Carroll (2000). Since we are unaware of any other estimation methodology that jointly estimates a nonparametric adaptive mean and a nonparametric variance model, we compare our methodology with that of Ruppert and Carroll (2000). We compare the frequentist properties of our Bayesian methodology, such as MSE and coverage probabilities, calculated by simulating from existing benchmark models.

Consider the regression model
$$y_i = m(x_i) + \varepsilon_i,$$
where $\varepsilon_i \sim N(0, \sigma^2_\varepsilon)$ are independent errors. The first function considered was the Doppler function described by Donoho and Johnstone (1994),
$$m(x) = \sqrt{x(1-x)}\,\sin\!\left\{\frac{2\pi(1 + 0.05)}{x + 0.05}\right\}.$$
For the Doppler function we considered $n = 2048$ equally spaced $x$'s on $[0, 1]$ and $\sigma^2_\varepsilon = 1$.
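For concreteness, the Doppler benchmark can be sketched in a few lines of Python (our own illustration, not the authors' publicly available MATLAB/C/FORTRAN code; all names here are ours):

```python
import numpy as np

def doppler(x):
    """Doppler test function of Donoho and Johnstone (1994)."""
    return np.sqrt(x * (1.0 - x)) * np.sin(2.0 * np.pi * (1.0 + 0.05) / (x + 0.05))

# Benchmark design: n = 2048 equally spaced x's on [0, 1], error variance 1.
rng = np.random.default_rng(0)
n = 2048
x = np.linspace(0.0, 1.0, n)
y = doppler(x) + rng.standard_normal(n)  # sigma_eps^2 = 1
```

The envelope $\sqrt{x(1-x)}$ bounds $|m(x)|$ by $1/2$, so the signal is weak relative to unit error variance; the difficulty of the example comes from the rapid oscillation near zero.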

Ruppert and Carroll (2000) proposed the following family of models, related to the Doppler but allowing for more shape flexibility:
$$m(x; j) = \sqrt{x(1-x)}\,\sin\!\left\{\frac{2\pi\{1 + 2^{(9-4j)/5}\}}{x + 2^{(9-4j)/5}}\right\},$$
where $j = 3$ and $j = 6$ correspond to low and severe spatial heterogeneity, respectively. For this function we considered $n = 400$ equally spaced $x$'s on $[0, 1]$ and $\sigma^2_\varepsilon(x) = 0.04$. We will refer to this as the RC$_j$ function, where $j$ controls the amount of spatial heterogeneity.

Ruppert and Carroll (2000) and Baladandayuthapani et al. (2005) also considered the following function:
$$m(x) = \exp\{-400(x - 0.6)^2\} + \frac{5}{3}\exp\{-500(x - 0.75)^2\} + 2\exp\{-500(x - 0.9)^2\}, \quad (5.1)$$


Table 1. Average mean square error (AMSE) for different mean functions based on 100 simulations. The column "True variance" indicates whether the data-generating process is homoscedastic or heteroscedastic. The columns "Fitted error variance" refer to the type of assumption used for fitting the data: "Homoscedastic" uses the Ruppert and Carroll (2000) fitting algorithm, while "Heteroscedastic" uses the Bayesian inferential algorithm described in this article. The mean and log-variance functions were modeled using degree 2 splines with $K_m = K_\varepsilon = 40$ equally spaced knots. The shrinkage process was modeled using degree 1 splines with $K_b = 4$ subknots.

                                Fitted error variance
Function     True variance     Homoscedastic     Heteroscedastic
Doppler      Constant          0.197             0.218
RC j=3       Constant          0.011             0.012
RC j=6       Constant          0.061             0.065
3 Bumps      Constant          0.054             0.055
3 Bumps      Variable          0.031             0.026

which is roughly constant on $[0, 0.5]$ and has three sharp peaks at 0.6, 0.75, and 0.9. For this function we considered $n = 1000$ equally spaced $x$'s on $[0, 1]$ and two different scenarios for the standard deviation of the error process: (a) $\sigma_\varepsilon(x) = 0.5$ for homoscedastic errors, and (b) $\sigma_\varepsilon(x) = 0.5 - 0.8x + 1.6(x - 0.5)_+$. We will refer to this as the "three-bump" function.
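The three-bump mean (5.1) and the scenario (b) standard deviation are easy to code and sanity-check; a minimal Python sketch (our own, with hypothetical names):

```python
import numpy as np

def three_bump(x):
    """Mean function (5.1): roughly constant on [0, 0.5], peaks at 0.6, 0.75, 0.9."""
    return (np.exp(-400.0 * (x - 0.6) ** 2)
            + (5.0 / 3.0) * np.exp(-500.0 * (x - 0.75) ** 2)
            + 2.0 * np.exp(-500.0 * (x - 0.9) ** 2))

def sigma_eps(x):
    """Scenario (b) error s.d.: 0.5 - 0.8x + 1.6(x - 0.5)_+. Piecewise linear,
    decreasing from 0.5 at x = 0 to 0.1 at x = 0.5, then rising to 0.5 at x = 1."""
    return 0.5 - 0.8 * x + 1.6 * np.maximum(x - 0.5, 0.0)
```

The kink at $x = 0.5$ puts the smallest noise level exactly where the mean starts to oscillate, which is what makes this benchmark discriminate between homoscedastic and heteroscedastic fits.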

We used 100 simulations from these models and fit a spatially adaptive penalized spline that uses a constant variance. For the three-bump function we also fitted a model with nonconstant variance, modeled as a log-penalized spline. For the mean and log-variance functions we used quadratic penalized splines with $K_m = K_\varepsilon = 40$ knots. For the log-variance corresponding to the shrinkage process we used $K_b = 4$ knots.

We calculated the mean square error for each simulated dataset as
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\{\hat m(x_i) - m(x_i)\}^2,$$

where $\hat m(x)$ is the posterior mean of $m(x)$ under a given model. Table 1 provides a summary of the average MSE (AMSE) results over 100 simulations, comparing the method that allows for heteroscedastic errors with the one that does not. Remarkably, our method that estimates a nonconstant variance performed well compared to methods that assume a constant variance even when the variance is constant. In contrast, when the true variance process is not constant, our method performed better.

Our method was slightly outperformed by the Ruppert and Carroll (2000) method in the Doppler case. This is probably because the Ruppert and Carroll method assumes a constant variance for the Doppler function, and the variance used for simulations was indeed $\sigma^2_\varepsilon = 1$. However, both methods outperformed the method of Baladandayuthapani et al. (2005), who reported an AMSE of 0.024 for the Doppler function.

For the three-bump function with constant variance $\sigma_\varepsilon(x) = 0.5$, the two models produced practically indistinguishable MSEs. Note that our AMSE over all $x$'s using 100 simulations was AMSE = 0.0054, which is smaller than the 0.0061 reported by Baladandayuthapani et al. (2005). The coverage probabilities of the 95% credible intervals were very similar for the two models for each value of the covariate, with a slight advantage in favor of the model using homoscedastic errors. The average coverage probability was 94.7% for the model with homoscedastic errors and 93.5% for the model with heteroscedastic errors. These coverage probabilities are very similar to the ones reported by Baladandayuthapani et al. (2005) in their Figure 3.

Figure 3. Comparison of pointwise coverage probabilities of the 95% credible intervals in 500 simulations. Data were simulated from the model described in Section 5 with heteroscedastic error variance. Estimation was done using two methods: Bayesian adaptive penalized splines with homoscedastic errors (dashed) and Bayesian adaptive penalized splines with heteroscedastic errors (solid). The coverage probabilities have been smoothed to remove Monte Carlo variability.

For case (b), $\sigma_\varepsilon(x) = 0.5 - 0.8x + 1.6(x - 0.5)_+$, the heteroscedastic model substantially outperformed the homoscedastic model, both in terms of AMSE and coverage probabilities. In this situation it would be misleading to compare only the average coverage probability for the 95% credible intervals. Indeed, these averages are very close, 94.1% for the heteroscedastic and 93.0% for the homoscedastic method, but they are obtained from very different sets of pointwise coverage probabilities. Figure 3 displays the coverage probabilities for these two methods. Note that for the heteroscedastic method the coverage probability on the $[0, 0.5]$ interval is close to 95%. On the same interval, the coverage probability for the homoscedastic method starts from around 0.8 and increases until it crosses the 95% target around $x = 0.2$. Moreover, on the interval $[0.3, 0.5]$ this coverage probability is estimated to be 1. This is because on this interval $\sigma_\varepsilon(x)$ decreases linearly from 0.5 to 0.1. Since the homoscedastic method assumes a constant variance, the credible intervals will tend to be shorter than nominal in regions of higher variability ($x$ close to zero), thus producing lower coverage probabilities. In regions of smaller variability ($x$ close to 0.5), the credible intervals will tend to be much wider, thus producing extremely large


coverage probabilities.

In the interval $[0.55, 0.65]$, the homoscedastic method slightly outperforms the heteroscedastic method. This seems to be the effect of a "lucky" combination of two factors. As we discussed, the size of the credible intervals for the homoscedastic method in a neighborhood of 0.5 is much larger than nominal. However, at the same point the mean function changes from a constant to a rapidly oscillating function, and the coverage probabilities drop at roughly the same rate. In the interval $[0.8, 1]$ one can notice a phenomenon very similar to the one described for the interval $[0, 0.5]$. While the function oscillates more rapidly in this region, the heteroscedastic adaptive method produces credible intervals with coverage probabilities close to the nominal 95%. However, the homoscedastic method does not take into account the increased variability and produces credible intervals that are too short.

For these examples, both methods performed roughly similarly on simulated datasets that did not require variance estimation. However, when the error variance is not constant, the method using log-penalized splines to estimate the error variance considerably outperforms the method that does not.

6. LOW RANK MULTIVARIATE SMOOTHING

In this section we generalize the ideas in Section 2 to multivariate smoothing while preserving the appealing geometric interpretation of the smoother. Consider the regression $y_i = m(\mathbf{x}_i) + \varepsilon_i$, where $m(\cdot)$ is a smooth function of $L$ covariates. We will use radial basis functions, which have the advantage of being more numerically stable than the tensor product of truncated polynomials in more than one dimension. Suppose that $\mathbf{x}_i \in \mathbb{R}^L$, $1 \le i \le n$, are $n$ vectors of covariates and $\boldsymbol{\kappa}_k \in \mathbb{R}^L$, $1 \le k \le K_m$, are $K_m$ knots. Consider the following distance function:
$$C(\mathbf{r}) = \begin{cases} \|\mathbf{r}\|^{2M-L} & \text{for } L \text{ odd,} \\ \|\mathbf{r}\|^{2M-L}\log\|\mathbf{r}\| & \text{for } L \text{ even,} \end{cases}$$
where $\|\cdot\|$ denotes the Euclidean norm in $\mathbb{R}^L$ and the integer $M$ controls the smoothness of $C(\cdot)$. Let $\mathbf{X}$ be the matrix with $i$th row $\mathbf{X}_i = [1, \mathbf{x}_i^T]$, let $\mathbf{Z}_{K_m} = \{C(\|\mathbf{x}_i - \boldsymbol{\kappa}_k\|)\}_{1\le i\le n,\,1\le k\le K_m}$ and $\boldsymbol{\Omega}_{K_m} = \{C(\|\boldsymbol{\kappa}_k - \boldsymbol{\kappa}_{k'}\|)\}_{1\le k, k'\le K_m}$, and define $\mathbf{Z} = \mathbf{Z}_{K_m}\boldsymbol{\Omega}_{K_m}^{-1/2}$, where $\boldsymbol{\Omega}_{K_m}^{1/2}$ is the principal square root of $\boldsymbol{\Omega}_{K_m}$. With these notations, the low rank approximation of thin plate spline regression can be obtained as the BLUP in the LMM (Kammann and Wand 2003; Ruppert, Wand, and Carroll 2003):
$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{b} + \boldsymbol{\varepsilon}, \quad E\begin{pmatrix}\mathbf{b}\\ \boldsymbol{\varepsilon}\end{pmatrix} = \begin{pmatrix}\mathbf{0}\\ \mathbf{0}\end{pmatrix}, \quad \mathrm{cov}\begin{pmatrix}\mathbf{b}\\ \boldsymbol{\varepsilon}\end{pmatrix} = \begin{pmatrix}\sigma^2_b \mathbf{I}_{K_m} & \mathbf{0}\\ \mathbf{0} & \sigma^2_\varepsilon \mathbf{I}_n\end{pmatrix}. \quad (6.1)$$

This model contains only one variance component, $\sigma^2_b$, for controlling the shrinkage of $\mathbf{b}$, which is equivalent to one global smoothing parameter $\lambda = \sigma^2_b/\sigma^2_\varepsilon$, and it implicitly assumes homoscedastic errors. To relax these assumptions, we consider a new set of knots $\boldsymbol{\kappa}^*_1, \ldots, \boldsymbol{\kappa}^*_{K_b}$ and define $\mathbf{X}^*$ and $\mathbf{Z}^*$ similarly, with the corresponding definition of the matrices $\mathbf{X}$ and $\mathbf{Z}$ where the $x$-covariates are replaced by the knots $\boldsymbol{\kappa}_k$ and the knots are replaced by the subknots $\boldsymbol{\kappa}^*_k$. Consider the following model for the response:
$$y_i = \beta_0 + \sum_{j=1}^{L}\beta_j x_{ij} + \sum_{k=1}^{K_m} b_k z_{ik} + \varepsilon_i, \quad \varepsilon_i \sim N\{0, \sigma^2_\varepsilon(\mathbf{x}_i)\},\ i = 1, \ldots, n, \quad b_k \sim N\{0, \sigma^2_b(\boldsymbol{\kappa}_k)\},\ k = 1, \ldots, K_m, \quad (6.2)$$
where the error variance $\log\sigma^2_\varepsilon(\mathbf{x}_i)$ is modeled as a low rank log-thin plate spline,
$$\log\sigma^2_\varepsilon(\mathbf{x}_i) = \gamma_0 + \sum_{j=1}^{L}\gamma_j x_{ij} + \sum_{k=1}^{K_\varepsilon} c_k z_{ik}, \quad c_k \sim N(0, \sigma^2_c),\ k = 1, \ldots, K_\varepsilon, \quad (6.3)$$
and the random coefficient variance $\log\sigma^2_b(\boldsymbol{\kappa}_k)$ is modeled as another low rank log-thin plate spline,
$$\log\sigma^2_b(\boldsymbol{\kappa}_k) = \delta_0 + \sum_{j=1}^{L}\delta_j x^*_{kj} + \sum_{j=1}^{K_b} d_j z^*_{kj}, \quad d_j \sim N(0, \sigma^2_d),\ j = 1, \ldots, K_b, \quad (6.4)$$
where $x^*_{kj}$ and $z^*_{kj}$ are the entries of the $\mathbf{X}^*$ and $\mathbf{Z}^*$ matrices, respectively, and the $c_k$'s and $d_j$'s are assumed mutually independent.

In the case of low rank smoothers, the set of knots for the covariates and the subknots for modeling the shrinkage process have to be chosen. One possibility is to use equally spaced knots and subknots. We use this approach in our simulation study in Section 7 to make our method comparable, from this point of view, with previously published methods. Another possibility is to select the knots and subknots using a space filling design (Nychka and Saltzman 1998), which is based on the maximal separation principle. This avoids wasting knots and is likely to lead to better approximations in sparse regions of the data. The cover.design() function from the R package fields (Fields Development Team 2006) provides software for space filling knot selection. We use this approach in Section 8, in the Noshiro elevation example.
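A greedy farthest-point rule gives the flavor of maximal-separation knot selection. The sketch below (Python, our own stand-in for illustration only; cover.design() uses a different, swap-based optimization) repeatedly picks the data point farthest from the knots chosen so far:

```python
import numpy as np

def farthest_point_knots(x, K, start=0):
    """Pick K knots from the rows of x by repeatedly choosing the point
    farthest (in Euclidean distance) from the knots selected so far."""
    x = np.asarray(x, dtype=float)
    idx = [start]
    d = np.linalg.norm(x - x[start], axis=1)  # distance to current knot set
    for _ in range(K - 1):
        j = int(np.argmax(d))                 # farthest remaining point
        idx.append(j)
        d = np.minimum(d, np.linalg.norm(x - x[j], axis=1))
    return x[idx]

rng = np.random.default_rng(2)
pts = rng.uniform(size=(200, 2))
knots = farthest_point_knots(pts, K=10)
```

Because already-selected points have distance zero to the knot set, the greedy rule never repeats a knot and naturally spreads knots into sparse regions of the design.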

7. COMPARISONS WITH OTHER ADAPTIVE SURFACE FITTING METHODS

Lang and Brezger (2004) compared their adaptive surface smoothing method with several methods used in a simulation study by Smith and Kohn (1997): MARS (Friedman 1991), "locfit" (Cleveland and Grosse 1991), "tps" (bivariate cubic thin plate splines with a single smoothing parameter), tensor product cubic smoothing splines with five smoothing parameters, and a parametric linear interaction model. We will use the following regression functions, also used by Lang and Brezger:

• $m_1(x_1, x_2) = x_1\sin(4\pi x_2)$, where $x_1$ and $x_2$ are distributed independently uniform on $[0, 1]$;

• $m_2(x_1, x_2) = 1.5\exp(-8x_1^2) + 3.5\exp(-8x_2^2)$, where $x_1$ and $x_2$ are distributed independently normal with mean 0.5 and variance 0.1.


Function $m_1(\cdot)$ contains complex interactions and moderate spatial variability, while $m_2(\cdot)$ has only main effects and is much smoother. We used a sample size $n = 300$, $\sigma = \mathrm{range}(f)/4$, and 250 simulations from each model.

We will compare the performance of our method with that of Lang and Brezger's, which performed at least as well as, or better than, other adaptive methods in a variety of examples. To avoid dependence of the results on knot choice, we used the same choice as the one described by Lang and Brezger: $12 \times 12$ equally spaced knots for the mean function. However, we used thin-plate splines instead of tensor products of B-splines. We modeled the smoothing parameter using a low rank log-thin plate spline with an equally spaced $5 \times 5$ knot grid. We considered two estimators: one with constant error variance (CEV) and one with estimated error variance using a log-thin plate spline (EEV) with the same $12 \times 12$ knot grid as for the mean function. The CEV estimator is the one that can be compared directly with the estimators considered by Lang and Brezger (2004). Because Lang and Brezger presented only comparative results using boxplots, we will use their results as read from their Figure 5. The performance of all estimators was measured by the log of the mean squared error, $\log(\mathrm{MSE})$, where
$$\mathrm{MSE}(\hat f) = \frac{1}{n}\sum_{i=1}^{n}\{\hat f(\mathbf{x}_i) - f(\mathbf{x}_i)\}^2.$$

For the CEV estimator and function $f_1(\cdot)$, corresponding to moderate spatial heterogeneity, we obtained a median $\log(\mathrm{MSE})$ of $-3.67$, with an interquartile range $[-3.80, -3.53]$ and a range $[-4.21, -3.13]$. These values are better than the best values reported for Lang and Brezger's methods in their Figure 5(b). Lang and Brezger reported that their method, denoted "P-spline," outperforms all other methods in this case and achieved a median $\log(\mathrm{MSE})$ of $-3.50$. Remarkably, the EEV estimator performed almost as well as the CEV estimator in this case and also outperformed the other methods. The median $\log(\mathrm{MSE})$ for EEV was $-3.59$, with an interquartile range $[-3.74, -3.43]$ and a range $[-4.14, -3.02]$. These results are consistent with univariate results reported by Ruppert and Carroll (2000) and Baladandayuthapani, Mallick, and Carroll (2005), who found that allowing the smoothing parameter to vary smoothly outperforms other adaptive techniques.

For function $f_1(\cdot)$, Figure 4 displays the coverage probabilities for the 95% credible intervals calculated over a $20 \times 20$ equally spaced grid in $[0, 1]^2$. For each grid point we calculated the frequency with which the 95% pointwise credible interval covers the true value of the function at that point. As expected, these coverage probabilities show strong spatial correlation, with lower coverage probabilities along the ridges of the sinus function. Coverage probability is lowest when $x_1$ is in the $[0.2, 0.5]$ range. The signal-to-noise ratio in this region is about half the signal-to-noise ratio in the region corresponding to $x_1$ close to 1, which explains why the coverage probability is smaller there. Interestingly, the zeros of the true function, occurring at $x_2 = 0.25, 0.50, 0.75$, are covered with at least 95% probability. Another interesting feature is the lower coverage probabilities near the north, south, and east boundaries. These features of the pointwise coverage probability map are consistent with the frequentist analysis of Krivobokova et al. (2007), but are not consistent with the coverage probabilities reported by Lang and Brezger for their Bayesian P-spline method.

For our CEV estimator and function $f_2(\cdot)$, corresponding to very low spatial heterogeneity, we obtained a median $\log(\mathrm{MSE})$ of $-5.95$, with an interquartile range $[-6.12, -5.66]$ and a range $[-6.51, -5.41]$. The results of Lang and Brezger in their Figure 5(a) indicate that their "P-spline" method achieved a median $\log(\mathrm{MSE})$ of $-6.00$. Our method performs roughly similarly to the Bayesian P-spline method of Lang and Brezger and to the cubic thin plate spline (tps), being outperformed only by Lang and Brezger's adaptive P-spline method with two main effects, "P-spline main." We could also add two main effects to our bivariate adaptive smoother to improve MSE for this example, but this is not our concern in this article. Again, the EEV estimator performed very similarly to our CEV estimator.

Figure 4. Coverage probability of the pointwise 95% credible intervals for the mean function, for $f_1(x_1, x_2) = x_1\sin(4\pi x_2)$ with a constant error standard deviation $\sigma = \mathrm{range}(f)/4$. Probabilities are computed on a $20 \times 20$ equally spaced grid of points in $[0, 1]^2$. The model used was a thin plate spline with $K = 100$ knots for the mean function, a log-thin plate spline with $K = 100$ knots for the variance function, and a log-thin plate spline with $K^* = 16$ knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was $M = 2$.

The last simulation study was done using the function $f_1(\cdot,\cdot)$, where $x_1, x_2$ are independent, uniformly distributed on $[0, 1]$, with error standard deviation function
$$\sigma_\varepsilon(x_1, x_2) = \frac{r}{32} + \frac{3r}{32}\,x_1^2,$$
where $r = \mathrm{range}(f_1)$. We compared our CEV and EEV estimators described at the beginning of this section and, as expected, the EEV estimator outperformed the CEV both in terms of $\log(\mathrm{MSE})$ and coverage probabilities. More precisely, for $\log(\mathrm{MSE})$ we obtained a median of $-5.26$ for CEV and $-5.35$ for EEV, an interquartile range $[-5.37, -5.15]$ for CEV and $[-5.48, -5.24]$ for EEV, and a range $[-5.72, -4.65]$ for CEV and $[-5.84, -4.84]$ for EEV. For both estimators the general patterns for coverage probabilities were the ones displayed in Figure 4. EEV outperformed CEV in terms of coverage probabilities. For example, the regions where the nominal level is exceeded for both estimators have a roughly similar


Figure 5. Locations of sampling (dots) with 100 knots (circles) in the Noshiro example.

shape and location, but the EEV coverage probabilities tend to be closer to their nominal level.

Note that our model incorporating nonparametric variance estimation is more general than the models considered in the literature. However, its performance for function $m_1(\cdot,\cdot)$, with moderate spatial variability and constant variance, was better than all methods considered by Lang and Brezger in their simulation study (see their Figure 5, panel (b)). Our method also performs well for the function $m_2(\cdot,\cdot)$, being second best when compared to the methods considered by Lang and Brezger (see their Figure 5, panel (a)). For the case when the variance is not constant, our adaptive method that estimates the variance nonparametrically outperformed our adaptive method that estimates a constant variance.

8. THE NOSHIRO EXAMPLE

Noshiro, Japan, was the site of a major earthquake. Much of the damage from the quake was due to soil movement. Professor Thomas O'Rourke, a geotechnical engineer, was investigating the factors that might help predict soil movement during a future quake. One factor thought to be of importance was the slope of the land. To estimate the slope, Ruppert (1997) used a dataset with 799 observations, where $\mathbf{x}$ was longitude and latitude and $y$ was elevation at locations in Noshiro. The object of primary interest is the gradient, at least its magnitude and possibly its direction.

We used the thin-plate spline model (6.2) for the mean of $y$, with $K_m = 100$ knots selected by the maximum separation criterion implemented in the function cover.design() of the R package fields (Nychka, Haaland, O'Connell, and Ellner 1998). The sampling locations and the knots used for fitting are displayed in Figure 5. The log-thin plate spline described in (6.3), with the same $K_\varepsilon = 100$ knots, was used to model the variance of the error process, $\sigma^2_\varepsilon(\mathbf{x}_i)$. The log-thin plate spline described in (6.4), with $K_b = 16$ subknots, was used to model the variance $\sigma^2_b(\boldsymbol{\kappa}_k)$, which controls the amount of shrinkage of the $b$ parameters. The subknots were obtained using the cover.design() function applied to the 100 knots.

Figure 6. Posterior mean of the mean regression function for the Noshiro example. The model used was a thin plate spline with $K = 100$ knots for the mean function, a log-thin plate spline with $K = 100$ knots for the variance function, and a log-thin plate spline with $K^* = 16$ knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was $M = 2$.

Figure 6 displays the posterior mean of the mean response function. There is a rather sharp peak around $(0.42, 0.4)$, with a slow decay from the peak in the general south direction and a much faster decay in all other directions. The function becomes smoother towards the boundary. This is exactly the type of function for which adaptive spatial smoothing can substantially improve the fit.

Figure 7 shows that in large areas of the map $\sigma_\varepsilon(\cdot)$ is smooth and takes small values. However, in a neighborhood of the peak of the mean function, $\sigma_\varepsilon(\cdot)$ displays two relatively sharp maxima, located NNW and SSE of the peak. We also performed a two-stage frequentist analysis. In the first stage we used a low rank thin plate spline for the mean function, and the results from this regression were used to obtain residuals. We then fitted another low rank thin plate spline to the absolute values of these residuals. While this map was not identical to the one in Figure 7, it did exhibit the same patterns of heteroscedasticity.

The process $\log\sigma^2_b(\boldsymbol{\kappa}_k)$ controlling the shrinkage of the $b_k$ is displayed in Figure 8.


Figure 7. Posterior mean of the standard deviation $\sigma_\varepsilon(\mathbf{x}_i)$ of the error process for the Noshiro example. The model used was a thin plate spline with $K = 100$ knots for the mean function, a log-thin plate spline with $K = 100$ knots for the variance function, and a log-thin plate spline with $K^* = 16$ knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was $M = 2$.

Figure 8. Posterior mean of the shrinkage process $-\log\sigma^2_b(\boldsymbol{\kappa}_k)$. The model used was a thin plate spline with $K = 100$ knots for the mean function, a log-thin plate spline with $K = 100$ knots for the variance function, and a log-thin plate spline with $K^* = 16$ knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was $M = 2$.


Smaller values of this function correspond to less shrinkage of the $b_k$ towards zero and more local behavior of the smoother. Figure 8 indicates that in a neighborhood of the peak of the mean function the shrinkage is smaller, to allow the mean function to change rapidly. Away from the peak the shrinkage is larger, corresponding to a smoother mean.

9. IMPLEMENTATION USING MCMC

We now provide some details for MCMC simulation of model (2.1), where the variances $\sigma^2_\varepsilon(x_i)$ and $\sigma^2_b(\kappa_k)$ are modeled by Equations (2.3) and (2.4). The implementation for other models, for example for the multivariate smoothing in Section 6, is similar. Consider independent normal priors for the coefficients of the monomials, $\beta_i \sim N(0, \sigma^2_{0\beta})$, $\gamma_i \sim N(0, \sigma^2_{0\gamma})$, $i = 0, \ldots, p$, $\delta_i \sim N(0, \sigma^2_{0\delta})$, $i = 0, \ldots, q$, and independent inverse gamma priors for the variance components, $\sigma^2_c \sim \mathrm{IGamma}(a_c, b_c)$ and $\sigma^2_d \sim \mathrm{IGamma}(a_d, b_d)$. Using these priors, many full conditionals of the posterior distribution are easy to derive, while a few have complex multivariate forms. Our implementation of the MCMC using multivariate Metropolis-Hastings steps proved to be unstable, with poor mixing properties. A simple and reliable solution was to change the model by adding error terms to the log-spline models, that is,

$$\begin{aligned} \log\{\sigma^2_\varepsilon(x_i)\} &= \gamma_0 + \cdots + \gamma_p x_i^p + \sum_{s=1}^{K_\varepsilon} c_s (x_i - \kappa^\varepsilon_s)_+^p + u_i, \\ \log\{\sigma^2_b(\kappa^m_k)\} &= \delta_0 + \cdots + \delta_q (\kappa^m_k)^q + \sum_{s=1}^{K_b} d_s (\kappa^m_k - \kappa^b_s)_+^q + v_k, \end{aligned} \quad (9.1)$$

where $u_i \sim N(0, \sigma^2_u)$ and $v_k \sim N(0, \sigma^2_v)$. This idea was also used for $\sigma^2_b$ by Baladandayuthapani, Mallick, and Carroll (2005). We fixed the values $\sigma^2_u = \sigma^2_v = 0.01$, as these variances appear nonidentifiable and a standard deviation of 0.1 is small on a log scale. This device makes computations feasible by replacing sampling from complex full conditionals with simple univariate MH steps.

Define $\boldsymbol{\varepsilon}$ and $\mathbf{b}$ as before, let $\boldsymbol{\theta}_X = (\boldsymbol{\beta}^T, \mathbf{b}^T)^T$, and define $\boldsymbol{\theta}_\varepsilon = (\boldsymbol{\gamma}^T, \mathbf{c}^T)^T$, $\boldsymbol{\theta}_b = (\boldsymbol{\delta}^T, \mathbf{d}^T)^T$, $\mathbf{C}_X = [\mathbf{X}\ \mathbf{Z}]$, $\mathbf{C}_\varepsilon = [\mathbf{X}_\varepsilon\ \mathbf{Z}_\varepsilon]$, $\mathbf{C}_b = [\mathbf{X}_b\ \mathbf{Z}_b]$, where $\mathbf{X}$, $\mathbf{X}_\varepsilon$, $\mathbf{X}_b$ contain the monomials and $\mathbf{Z}$, $\mathbf{Z}_\varepsilon$, $\mathbf{Z}_b$ contain the truncated polynomials of the spline models for $m$, $\log\sigma^2_\varepsilon$, and $\log\sigma^2_b$, respectively. Also denote
$$\boldsymbol{\Sigma}_{0X} = \begin{bmatrix}\sigma^2_{0\beta}\mathbf{I}_{p+1} & \mathbf{0}\\ \mathbf{0} & \boldsymbol{\Sigma}_b\end{bmatrix}, \quad \boldsymbol{\Sigma}_{0\varepsilon} = \begin{bmatrix}\sigma^2_{0\gamma}\mathbf{I}_{p+1} & \mathbf{0}\\ \mathbf{0} & \sigma^2_c\mathbf{I}_{K_\varepsilon}\end{bmatrix}, \quad \boldsymbol{\Sigma}_{0b} = \begin{bmatrix}\sigma^2_{0\delta}\mathbf{I}_{q+1} & \mathbf{0}\\ \mathbf{0} & \sigma^2_d\mathbf{I}_{K_b}\end{bmatrix}.$$

The full conditionals of the posterior are detailed below.

1. $[\boldsymbol{\theta}_X] \sim N(\mathbf{M}_X\mathbf{C}_X^T\boldsymbol{\Sigma}_\varepsilon^{-1}\mathbf{Y},\ \mathbf{M}_X)$, where $\mathbf{M}_X = (\mathbf{C}_X^T\boldsymbol{\Sigma}_\varepsilon^{-1}\mathbf{C}_X + \boldsymbol{\Sigma}_{0X}^{-1})^{-1}$.

2. $[\boldsymbol{\theta}_\varepsilon] \sim N(\mathbf{M}_\varepsilon\mathbf{C}_\varepsilon^T\mathbf{Y}_\varepsilon/\sigma^2_u,\ \mathbf{M}_\varepsilon)$, where $\mathbf{Y}_\varepsilon = [\log(\sigma^2_{\varepsilon 1}), \ldots, \log(\sigma^2_{\varepsilon n})]^T$ and $\mathbf{M}_\varepsilon = (\mathbf{C}_\varepsilon^T\mathbf{C}_\varepsilon/\sigma^2_u + \boldsymbol{\Sigma}_{0\varepsilon}^{-1})^{-1}$.

3. $[\boldsymbol{\theta}_b] \sim N(\mathbf{M}_b\mathbf{C}_b^T\mathbf{Y}_b/\sigma^2_v,\ \mathbf{M}_b)$, where $\mathbf{Y}_b = [\log(\sigma^2_1), \ldots, \log(\sigma^2_{K_m})]^T$ and $\mathbf{M}_b = (\mathbf{C}_b^T\mathbf{C}_b/\sigma^2_v + \boldsymbol{\Sigma}_{0b}^{-1})^{-1}$.

4. $[\sigma^2_c] \sim \mathrm{IGamma}(a_c + K_\varepsilon/2,\ b_c + \|\mathbf{c}\|^2/2)$.

5. $[\sigma^2_d] \sim \mathrm{IGamma}(a_d + K_b/2,\ b_d + \|\mathbf{d}\|^2/2)$.

6. $[\sigma^2_{\varepsilon i}] \propto \sigma_{\varepsilon i}^{-3}\exp\left\{-\dfrac{(y_i - \mu_i)^2}{2\sigma^2_{\varepsilon i}} - \dfrac{[\log(\sigma^2_{\varepsilon i}) - \eta_i]^2}{2\sigma^2_u}\right\}$, where $\mu_i$ and $\eta_i$ are the $i$th components of $\mathbf{C}_X\boldsymbol{\theta}_X$ and $\mathbf{C}_\varepsilon\boldsymbol{\theta}_\varepsilon$, respectively.

7. $[\sigma^2_k] \propto \sigma_k^{-3}\exp\left\{-\dfrac{b_k^2}{2\sigma^2_k} - \dfrac{[\log(\sigma^2_k) - \zeta_k]^2}{2\sigma^2_v}\right\}$, where $\zeta_k$ is the $k$th component of $\mathbf{C}_b\boldsymbol{\theta}_b$.

All of the above conditionals have an explicit form, with the exception of the $n + K_m$ one-dimensional conditionals in items 6 and 7. For these distributions we use the Metropolis-Hastings algorithm with a normal proposal distribution centered at the current value and with small variance.

Figure 9. Computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with $K_m$ knots, with the log of the variance of the shrinkage process modeled as another penalized spline with $K_b = 4$ subknots. The log of the error variance is modeled as another penalized spline with $K_\varepsilon = K_m$ knots. Different symbols correspond to increasing numbers of knots: dots ($K_m = 10$), asterisks ($K_m = 20$), circles ($K_m = 30$), x's ($K_m = 40$).
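Conditionals 6 and 7 are one-dimensional and can be updated by a simple random-walk Metropolis-Hastings step. A Python sketch of the update for $[\sigma^2_{\varepsilon i}]$ (our own illustration of the scheme; we propose directly on $\sigma^2$ and reject non-positive proposals, which is a valid MH move for a target supported on the positive half-line):

```python
import numpy as np

def log_target(s2, y, mu, eta, s2_u=0.01):
    """Log of full conditional 6 for sigma^2_eps,i, up to an additive constant."""
    return (-1.5 * np.log(s2)
            - (y - mu) ** 2 / (2.0 * s2)
            - (np.log(s2) - eta) ** 2 / (2.0 * s2_u))

def mh_step(s2, y, mu, eta, rng, prop_sd=0.05):
    """One random-walk MH step with a normal proposal centered at the current value."""
    prop = s2 + prop_sd * rng.standard_normal()
    if prop <= 0.0:                      # outside the support: reject
        return s2
    logr = log_target(prop, y, mu, eta) - log_target(s2, y, mu, eta)
    return prop if np.log(rng.uniform()) < logr else s2

rng = np.random.default_rng(42)
s2, draws = 0.25, []
for _ in range(2000):
    s2 = mh_step(s2, y=0.3, mu=0.0, eta=np.log(0.25), rng=rng)
    draws.append(s2)
```

Because $\sigma^2_u = 0.01$ keeps the conditional tightly concentrated around $\exp(\eta_i)$, a small proposal variance suffices, which is exactly what makes these univariate updates cheap and stable.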

10. IMPLEMENTATION AND COMPUTATIONAL PERFORMANCE

Our MCMC simulation algorithm provided reliable results and is reasonably fast. To illustrate this, Figure 9 displays computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with $K_m$ knots, with the log of the variance of the shrinkage process modeled as another penalized spline with $K_b = 4$ subknots. The log of the error variance is modeled as another penalized spline with $K_\varepsilon = K_m$ knots. Different symbols correspond to increasing numbers of knots: dots ($K_m = 10$), asterisks ($K_m = 20$), circles ($K_m = 30$), x's ($K_m = 40$). Computation time increases roughly linearly with the number of observations, with the same intercept of zero and larger slopes for larger numbers of knots. Note that even with $n = 1000$ observations and $K_m = K_\varepsilon = 40$, simulation time is about two minutes.

Figure 10. MCMC histories for eight parameters of interest, including the log of the standard deviation of the error process and the log of the standard deviation of the shrinkage process, as well as some parameters modeling the different functions. The model uses 40 equally spaced knots for the mean and error variance functions and 4 subknots for the adaptive shrinkage process. Chains correspond to the Doppler function example, using quadratic truncated polynomials to fit the mean and log-variance functions and a linear truncated polynomial basis for fitting the shrinkage process. For clarity of presentation, only every 20th simulated observation is displayed.

For the models and datasets presented in this article, 10,000 simulations proved sufficient for accurate and reliable inference. Some parameters, such as the mean and variance functions, tended to exhibit much better mixing properties than the variance components σ_c² and σ_d² controlling the shrinkage processes. This is probably due to the reduced amount of information in the data about these parameters. MCMC convergence and mixing were assessed by visual inspection of the chain histories of many parameters of interest. Because we used good starting values of the parameters, convergence is fast. Figure 10 displays the histories of eight parameters of interest for the model fitted to the

286 C M Crainiceanu et al

Doppler function described in Section 5. For clarity, we display only every 100th sampled value of the chain, but the same patterns would be observed in the full chain history. Similar good mixing properties were noted in all the other examples presented in this article.

In the search for a good software platform, we implemented our algorithms in MATLAB, C, and FORTRAN. Programs can be downloaded from http://www.biostat.jhsph.edu/~ccrainic. For our application and implementation there were no serious differences between C and FORTRAN, both of which performed better than MATLAB, reducing computation time by a factor of two to three. The times reported in Figure 9 are for our implementation in C.

Given the complexity of the model, we were interested in improving the mixing properties of the Markov chains. The computational trick described in Section 9 was the sine qua non that allowed us to replace an intractable multivariate MH step with several tractable univariate MH steps. We used normal random-walk proposal distributions centered at the current value, with variances tuned to provide acceptance rates between 30% and 50%.
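To make the tuning concrete, here is a small sketch (our own illustration in Python, not the authors' MATLAB/C/FORTRAN code) of a random-walk Metropolis sampler whose proposal standard deviation is rescaled during burn-in until the acceptance rate lands in the 30–50% band; the standard normal target is only a stand-in for any univariate conditional:

```python
import numpy as np

# Our own illustration (not the authors' code): a normal random-walk
# Metropolis sampler whose proposal standard deviation is rescaled during
# burn-in to push the acceptance rate into the 30-50% band.
def tuned_rw_metropolis(logpost, x0, n_burn=2000, n_keep=2000, seed=0):
    rng = np.random.default_rng(seed)
    x, sd = x0, 1.0
    accepted, window = 0, 100
    draws = []
    for t in range(n_burn + n_keep):
        prop = x + sd * rng.standard_normal()
        if np.log(rng.uniform()) < logpost(prop) - logpost(x):
            x, accepted = prop, accepted + 1
        if t < n_burn and (t + 1) % window == 0:
            rate = accepted / window
            if rate < 0.30:      # rejecting too often: take smaller steps
                sd *= 0.8
            elif rate > 0.50:    # accepting too often: take larger steps
                sd *= 1.25
            accepted = 0
        if t == n_burn:
            accepted = 0         # start counting the post-burn-in rate
        if t >= n_burn:
            draws.append(x)
    return np.array(draws), accepted / n_keep, sd

# stand-in univariate target: a standard normal log-density (up to a constant)
draws, rate, sd = tuned_rw_metropolis(lambda z: -0.5 * z * z, x0=0.0)
```

On smooth unimodal targets this simple multiplicative rescaling typically settles near the commonly cited 44% optimum for one-dimensional random-walk proposals.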

11 COMMENTS

Our joint models for heterogeneous mean and nonconstant variance are designed to extract the maximum amount of information from complex data and to suggest simpler parametric models. Bayesian inference based on MCMC offers a natural and powerful framework for model fitting. However, MCMC in this context is not straightforward, and implementation depends essentially on the ability to transform intractable simulations from high-dimensional distributions into tractable one-dimensional updates, as described in Section 9. Our implementation provides accurate, fast, and reproducible results, and software is available. Actual simulation time increases only linearly with the number of observations, and mixing properties are adequate provided that proposal distribution variances are tuned for optimal acceptance rates.

ACKNOWLEDGMENTS

Carroll's research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

[Received September 2004. Revised October 2006.]

REFERENCES

Baladandayuthapani, V., Mallick, B. K., and Carroll, R. J. (2005), "Spatially Adaptive Bayesian Regression Splines," Journal of Computational and Graphical Statistics, 14, 378–394.

Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computation and Stochastic Systems" (with discussion), Statistical Science, 10, 3–66.

Carroll, R. J. (2003), "Variances Are Not Always Nuisance Parameters: The 2002 R. A. Fisher Lecture," Biometrics, 59, 211–220.

Carroll, R. J., and Ruppert, D. (1988), Transformation and Weighting in Regression, New York: Chapman and Hall.

Cleveland, W. S., and Grosse, E. (1991), "Computational Methods for Local Regression," Statistics and Computing, 1, 47–62.

Crainiceanu, C., and Ruppert, D. (2004), "Likelihood Ratio Tests in Linear Mixed Models With One Variance Component," Journal of the Royal Statistical Society, Series B, 66, 165–185.

Crainiceanu, C. M., Ruppert, D., and Wand, M. P. (2005), "Bayesian Analysis for Penalized Spline Regression Using WinBUGS," Journal of Statistical Software, 14. Available online at http://www.jstatsoft.org/v14/i14.

Donoho, D. L., and Johnstone, I. M. (1994), "Ideal Spatial Adaptation by Wavelet Shrinkage," Biometrika, 81, 425–455.

Eilers, P. H. C., and Marx, B. D. (1996), "Flexible Smoothing With B-Splines and Penalties" (with discussion), Statistical Science, 11, 89–121.

——— (2002), "Generalized Linear Additive Smooth Structures," Journal of Computational and Graphical Statistics, 11, 758–783.

——— (2004), "Splines, Knots, and Penalties," unpublished manuscript.

Fields Development Team (2006), fields: Tools for Spatial Data, National Center for Atmospheric Research, Boulder, CO. Available online at http://www.cgd.ucar.edu/Software/Fields.

Friedman, J. H. (1991), "Multivariate Adaptive Regression Splines" (with discussion), The Annals of Statistics, 19, 1–141.

Gelman, A. (2006), "Prior Distributions for Variance Parameters in Hierarchical Models," Bayesian Analysis, 1, 515–533.

Gray, R. J. (1994), "Spline-Based Tests in Survival Analysis," Biometrics, 50, 640–652.

Holst, U., Hössjer, O., Björklund, C., Ragnarson, P., and Edner, H. (1996), "Locally Weighted Least Squares Kernel Regression and Statistical Evaluation of LIDAR Measurements," Environmetrics, 7, 401–416.

Jullion, A., and Lambert, P. (2006), Discussion Paper DP0534, UCL.

Kammann, E. E., and Wand, M. P. (2003), "Geoadditive Models," Applied Statistics, 52, 1–18.

Kauermann, G. (2002), "A Note on Bandwidth Selection for Penalised Spline Smoothing," Technical Report 02-13, Department of Statistics, University of Glasgow.

Krivobokova, T., Crainiceanu, C. M., and Kauermann, G. (2007), "Fast Adaptive Penalized Splines," Journal of Computational and Graphical Statistics, to appear.

Lang, S., and Brezger, A. (2004), "Bayesian P-Splines," Journal of Computational and Graphical Statistics, 13, 183–212.

Lang, S., Fronk, E. M., and Fahrmeir, L. (2002), "Function Estimation With Locally Adaptive Dynamic Models," Computational Statistics, 17, 479–499.

Marx, B. D., and Eilers, P. H. C. (1998), "Direct Generalized Additive Modeling With Penalized Likelihood," Computational Statistics & Data Analysis, 28, 193–209.

Natarajan, R., and Kass, R. E. (2000), "Reference Bayesian Methods for Generalized Linear Mixed Models," Journal of the American Statistical Association, 95, 227–237.

Ngo, L., and Wand, M. P. (2004), "Smoothing With Mixed Model Software," Journal of Statistical Software, 9. Available online at http://www.jstatsoft.org/v09/i01.

Nychka, D., Haaland, P., O'Connell, M., and Ellner, S. (1998), "FUNFITS: Data Analysis and Statistical Tools for Estimating Functions," in Case Studies in Environmental Statistics, Lecture Notes in Statistics (vol. 132), eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer Verlag, pp. 159–179. Available online at http://www.cgd.ucar.edu/stats/Software/Funfits.

Nychka, D., and Saltzman, N. (1998), "Design of Air Quality Monitoring Networks," in Case Studies in Environmental Statistics, Lecture Notes in Statistics (vol. 132), eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer Verlag, pp. 51–76.

Pintore, A., Speckman, P., and Holmes, C. C. (2006), "Spatially Adaptive Smoothing Splines," Biometrika, 93, 113–125.

Ruppert, D. (1997), "Local Polynomial Regression and its Applications in Environmental Statistics," in Statistics for the Environment (vol. 3), eds. V. Barnett and F. Turkman, Chichester: Wiley, pp. 155–173.

——— (2002), "Selecting the Number of Knots for Penalized Splines," Journal of Computational and Graphical Statistics, 11, 735–757.

Ruppert, D., and Carroll, R. J. (2000), "Spatially-Adaptive Penalties for Spline Fitting," Australian and New Zealand Journal of Statistics, 42, 205–223.

Ruppert, D., Wand, M. P., Holst, U., and Hössjer, O. (1997), "Local Polynomial Variance-Function Estimation," Technometrics, 39, 262–273.

Ruppert, D., Wand, M. P., and Carroll, R. J. (2003), Semiparametric Regression, Cambridge: Cambridge University Press.

Smith, M., and Kohn, R. (1997), "A Bayesian Approach to Nonparametric Bivariate Regression," Journal of the American Statistical Association, 92, 1522–1535.

Wood, S. (2006), Generalized Additive Models: An Introduction With R, New York: Chapman & Hall/CRC.



Table 1. Average mean square error (AMSE) for different mean functions, based on 100 simulations. The column "True variance" indicates whether the data-generating process is homoscedastic or heteroscedastic. The columns under "Fitted error variance" refer to the assumption used when fitting the data: "Homoscedastic" uses the Ruppert and Carroll (2000) fitting algorithm, while "Heteroscedastic" uses the Bayesian inferential algorithm described in this article. The mean and log-variance functions were modeled using degree-2 splines with K_m = K_ε = 40 equally spaced knots. The shrinkage process was modeled using degree-1 splines with K_b = 4 subknots.

                                 Fitted error variance
Function      True variance      Homoscedastic      Heteroscedastic
Doppler       Constant           0.0197             0.0218
RC, j = 3     Constant           0.0011             0.0012
RC, j = 6     Constant           0.0061             0.0065
3 Bumps       Constant           0.0054             0.0055
3 Bumps       Variable           0.0031             0.0026

which is roughly constant on [0, 0.5] and has three sharp peaks, at 0.6, 0.75, and 0.9. For this function we considered n = 1000 equally spaced x's on [0, 1] and two different scenarios for the standard deviation of the error process: (a) σ_ε(x) = 0.5 for homoscedastic errors, and (b) σ_ε(x_i) = 0.5 − 0.8x_i + 1.6(x_i − 0.5)_+. We will refer to this as the "three-bump" function.
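The data-generating design just described can be sketched as follows (our illustration, not the authors' code; the three-bump mean function itself is passed in by the caller, since its formula is defined earlier in the paper):

```python
import numpy as np

# Sketch of the simulation design above (our code): n = 1000 equally spaced
# x's on [0, 1], with either sigma_eps(x) = 0.5 (homoscedastic) or
# sigma_eps(x) = 0.5 - 0.8x + 1.6(x - 0.5)_+ (heteroscedastic). The
# three-bump mean function is supplied by the caller.
def sigma_eps(x):
    return 0.5 - 0.8 * x + 1.6 * np.clip(x - 0.5, 0.0, None)

def simulate(mean_fn, n=1000, heteroscedastic=True, seed=1):
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n)
    sd = sigma_eps(x) if heteroscedastic else np.full(n, 0.5)
    y = mean_fn(x) + sd * rng.standard_normal(n)
    return x, y

x, y = simulate(np.sin)  # np.sin is only a placeholder mean function
```

Note that σ_ε dips to 0.1 at x = 0.5 and returns to 0.5 at the endpoints, which is the shape the coverage discussion below relies on.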

We used 100 simulations from these models and fit a spatially adaptive penalized spline that assumes constant variance. For the three-bump function we also fitted a model with nonconstant variance, modeled as a log-penalized spline. For the mean and log-variance functions we used quadratic penalized splines with K_m = K_ε = 40 knots. For the log-variance corresponding to the shrinkage process we used K_b = 4 knots.

We calculated the mean square error for each simulated dataset as

MSE = (1/n) Σ_{i=1}^{n} {m̂(x_i) − m(x_i)}²,

where m̂(x) is the posterior mean of m(x) under a given model. Table 1 provides a summary of the average MSE (AMSE) results over the 100 simulations, comparing the method that allows for heteroscedastic errors with the one that does not. Remarkably, our method that estimates a nonconstant variance performed well compared to the method that assumes a constant variance even when the variance is constant. In contrast, when the true variance process is not constant, our method performed better.
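In code, the MSE of one fit and the AMSE over simulated datasets are (a minimal sketch, our code):

```python
import numpy as np

# Minimal sketch: MSE of one fitted curve, and AMSE across simulated datasets.
def mse(m_hat, m_true):
    m_hat, m_true = np.asarray(m_hat, float), np.asarray(m_true, float)
    return float(np.mean((m_hat - m_true) ** 2))

def amse(fits, m_true):
    # `fits` holds one fitted curve (posterior mean of m) per simulated dataset
    return float(np.mean([mse(f, m_true) for f in fits]))
```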

Our method was slightly outperformed by the Ruppert and Carroll (2000) method in the Doppler case. This is probably because the Ruppert and Carroll method assumes a constant variance for the Doppler function, and the variance used for the simulations was σ_ε² = 1. However, both methods outperformed the method of Baladandayuthapani et al. (2005), who reported an AMSE of 0.024 for the Doppler function.

For the three-bump function with constant variance σ_ε(x) = 0.5, the two models produced practically indistinguishable MSEs. Note that our AMSE over all x's using 100 simulations was AMSE = 0.0054, which is smaller than the 0.0061 reported by Baladandayuthapani et al. (2005). The coverage probabilities of the 95% credible intervals were very similar for the two models for each value of the covariate, with a slight advantage in favor of the model using homoscedastic errors. The average coverage probability was 94.7% for the model with homoscedastic errors and 93.5% for the model with heteroscedastic errors. These coverage probabilities are very similar to the ones reported by Baladandayuthapani et al. (2005) in their Figure 3.

Figure 3. Comparison of pointwise coverage probabilities of the 95% credible intervals in 500 simulations. Data were simulated from the model described in Section 5 with heteroscedastic error variance. Estimation was done using two methods: Bayesian adaptive penalized splines with homoscedastic errors (dashed) and Bayesian adaptive penalized splines with heteroscedastic errors (solid). The coverage probabilities have been smoothed to remove Monte Carlo variability.

For case (b), σ_ε(x_i) = 0.5 − 0.8x_i + 1.6(x_i − 0.5)_+, the heteroscedastic model substantially outperformed the homoscedastic model, both in terms of AMSE and in terms of coverage probabilities. In this situation it would be misleading to compare only the average coverage probability of the 95% credible intervals. Indeed, these averages are very close, 94.1% for the heteroscedastic and 93.0% for the homoscedastic method, but they are obtained from very different sets of pointwise coverage probabilities. Figure 3 displays the coverage probabilities for these two methods. Note that for the heteroscedastic method the coverage probability on the interval [0, 0.5] is close to 95%. On the same interval, the coverage probability for the homoscedastic method starts from around 0.8 and increases until it crosses the 95% target around 0.2. Moreover, on the interval [0.3, 0.5] this coverage probability is estimated to be 1. This is because on this interval σ_ε(x) decreases linearly from 0.5 to 0.1. Since the homoscedastic method assumes a constant variance, the credible intervals will tend to be shorter than nominal in regions of higher variability (x close to zero), thus producing lower coverage probabilities. In regions of smaller variability (x close to 0.5), the credible intervals will tend to be much wider, thus producing extremely large


coverage probabilities.

In the interval [0.55, 0.65] the homoscedastic method slightly outperforms the heteroscedastic method. This seems to be the effect of a "lucky" combination of two factors. As we discussed, the size of the credible intervals for the homoscedastic method in a neighborhood of 0.5 is much larger than nominal. However, at the same point the mean function changes from a constant to a rapidly oscillating function, and the coverage probabilities drop at roughly the same rate. In the interval [0.8, 1] one can notice a phenomenon very similar to the one described for the interval [0, 0.5]. While the function oscillates more rapidly in this region, the heteroscedastic adaptive method produces credible intervals with coverage probabilities close to the nominal 95%. The homoscedastic method, however, does not take into account the increased variability and produces credible intervals that are too short.

For these examples, both methods performed roughly similarly on simulated datasets that did not require variance estimation. However, when the error variance is not constant, the method using log-penalized splines to estimate the error variance considerably outperforms the method that does not.
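The pointwise coverage calculations behind Figure 3 can be sketched as follows (our own helper; `draws` is an (S × n) array of posterior samples of m(·) over a grid, and the names are hypothetical):

```python
import numpy as np

# Sketch of the pointwise coverage computation (array names are our own):
# per grid point, check whether the equal-tail 95% credible interval built
# from posterior samples covers the true value of the function.
def pointwise_coverage(draws, m_true, level=0.95):
    lo, hi = np.quantile(draws, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return (lo <= m_true) & (m_true <= hi)

# toy check: samples centered exactly on the truth are certainly covered
rng = np.random.default_rng(2)
truth = np.array([0.0, 5.0, -5.0])
draws = rng.standard_normal((4000, 3)) + truth
cov = pointwise_coverage(draws, truth)
```

Averaging the resulting indicators over simulated datasets gives the coverage curves that are then smoothed to remove Monte Carlo variability.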

6 LOW RANK MULTIVARIATE SMOOTHING

In this section we generalize the ideas in Section 2 to multivariate smoothing, while preserving the appealing geometric interpretation of the smoother. Consider the regression y_i = m(x_i) + ε_i, where m(·) is a smooth function of L covariates. We use radial basis functions, which have the advantage of being more numerically stable than tensor products of truncated polynomials in more than one dimension. Suppose that x_i ∈ R^L, 1 ≤ i ≤ n, are n vectors of covariates and that κ_k ∈ R^L, 1 ≤ k ≤ K_m, are K_m knots. Consider the following radial basis function:

C(r) = ||r||^{2M−L} for L odd, and C(r) = ||r||^{2M−L} log||r|| for L even,

where ||·|| denotes the Euclidean norm in R^L and the integer M controls the smoothness of C(·). Let X be the matrix with ith row X_i = [1, x_i^T], let Z_{K_m} = [C(||x_i − κ_k||)]_{1≤i≤n, 1≤k≤K_m}, let Ω_{K_m} = [C(||κ_k − κ_k'||)]_{1≤k,k'≤K_m}, and define Z = Z_{K_m} Ω_{K_m}^{−1/2}, where Ω_{K_m}^{−1/2} is the principal square root of Ω_{K_m}. With these notations, the low-rank approximation to thin-plate spline regression can be obtained as the BLUP in the linear mixed model (Kammann and Wand 2003; Ruppert, Wand, and Carroll 2003)

Y = Xβ + Zb + ε,   E(b, ε) = (0, 0),   cov(b, ε) = diag(σ_b² I_{K_m}, σ_ε² I_n).    (6.1)
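The construction of X and Z = Z_{K_m} Ω_{K_m}^{−1/2}, and a ridge-form fit of (6.1) for a fixed smoothing parameter λ, can be sketched as below (our reconstruction, not the authors' code; because the thin-plate kernel is only conditionally positive definite, we assume an SVD-based matrix square root as the numerical device):

```python
import numpy as np

def radial(r, L, M=2):
    # C(r) = ||r||^(2M-L) for L odd; ||r||^(2M-L) log||r|| for L even, C(0) = 0
    r = np.asarray(r, dtype=float)
    base = r ** (2 * M - L)
    if L % 2 == 0:
        base = np.where(r > 0.0, base * np.log(np.where(r > 0.0, r, 1.0)), 0.0)
    return base

def tps_design(x, knots, M=2):
    L = x.shape[1]
    X = np.column_stack([np.ones(len(x)), x])                 # rows [1, x_i^T]
    ZK = radial(np.linalg.norm(x[:, None, :] - knots[None, :, :], axis=2), L, M)
    OK = radial(np.linalg.norm(knots[:, None, :] - knots[None, :, :], axis=2), L, M)
    # matrix square root of Omega_K via SVD (the kernel is only conditionally
    # positive definite, so an SVD-based root is the usual numerical device)
    U, s, Vt = np.linalg.svd(OK)
    Z = ZK @ (U * (1.0 / np.sqrt(np.maximum(s, 1e-10)))) @ Vt  # Z_K Omega^(-1/2)
    return X, Z

def penalized_fit(y, X, Z, lam):
    # ridge-form fit of (6.1): minimize ||y - Xb - Zu||^2 + (1/lam) ||u||^2
    C = np.hstack([X, Z])
    pen = np.r_[np.zeros(X.shape[1]), np.full(Z.shape[1], 1.0 / lam)]
    coef = np.linalg.solve(C.T @ C + np.diag(pen), C.T @ y)
    return C @ coef

rng = np.random.default_rng(5)
x = rng.uniform(size=(80, 2))
knots = x[:10]
X, Z = tps_design(x, knots)
y = np.sin(2 * np.pi * x[:, 0]) + 0.1 * rng.standard_normal(80)
fit = penalized_fit(y, X, Z, lam=10.0)
```

The penalized least-squares form makes the mixed-model correspondence explicit: λ plays the role of σ_b²/σ_ε² below.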

This model contains only one variance component, σ_b², controlling the shrinkage of b, which is equivalent to one global smoothing parameter λ = σ_b²/σ_ε², and it implicitly assumes homoscedastic errors. To relax these assumptions, we consider a new set of subknots κ*_1, ..., κ*_{K_b} and define X* and Z* analogously to X and Z, with the x-covariates replaced by the knots κ_k and the knots replaced by the subknots κ*_k. Consider the following model for the response:

y_i = β_0 + Σ_{j=1}^{L} β_j x_{ij} + Σ_{k=1}^{K_m} b_k z_{ik} + ε_i,
ε_i ~ N{0, σ_ε²(x_i)}, i = 1, ..., n,
b_k ~ N{0, σ_b²(κ_k)}, k = 1, ..., K_m,    (6.2)

where the error variance log σ_ε²(x_i) is modeled as a low-rank log thin-plate spline,

log σ_ε²(x_i) = γ_0 + Σ_{j=1}^{L} γ_j x_{ij} + Σ_{k=1}^{K_ε} c_k z_{ik},
c_k ~ N(0, σ_c²), k = 1, ..., K_ε,    (6.3)

and the random-coefficient variance log σ_b²(κ_k) is modeled as another low-rank log thin-plate spline,

log σ_b²(κ_k) = δ_0 + Σ_{j=1}^{L} δ_j x*_{kj} + Σ_{j=1}^{K_b} d_j z*_{kj},
d_j ~ N(0, σ_d²), j = 1, ..., K_b,    (6.4)

where x*_{kj} and z*_{kj} are the entries of the X* and Z* matrices, respectively, and the c_k and d_j are assumed mutually independent.

For low-rank smoothers, the set of knots for the covariates and the subknots for modeling the shrinkage process have to be chosen. One possibility is to use equally spaced knots and subknots; we use this approach in our simulation study in Section 7 to make our method comparable, from this point of view, with previously published methods. Another possibility is to select the knots and subknots using a space-filling design (Nychka and Saltzman 1998), which is based on the maximal separation principle. This avoids wasting knots and is likely to lead to better approximations in sparse regions of the data. The cover.design() function from the R package fields (Fields Development Team 2006) provides software for space-filling knot selection. We use this approach in Section 8, in the Noshiro elevation example.
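The maximal separation idea can be illustrated with a greedy farthest-point rule (our approximation only; cover.design() itself optimizes a coverage criterion by swapping candidate points):

```python
import numpy as np

# Greedy farthest-point sketch of the maximal separation idea (illustration
# only; cover.design() in the fields package optimizes a coverage criterion
# by swapping candidate points).
def space_filling_knots(x, k, seed=0):
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(x)))]
    dist = np.linalg.norm(x - x[chosen[0]], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(dist))   # candidate farthest from the current set
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(x - x[nxt], axis=1))
    return np.array(chosen)

pts = np.random.default_rng(3).uniform(size=(500, 2))
idx = space_filling_knots(pts, 25)
```

Because each new knot maximizes its distance to the knots already chosen, sparse regions of the data are never starved of knots.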

7 COMPARISONS WITH OTHER ADAPTIVE SURFACE FITTING METHODS

Lang and Brezger (2004) compared their adaptive surface smoothing method with several methods used in a simulation study by Smith and Kohn (1997): MARS (Friedman 1991), "locfit" (Cleveland and Grosse 1991), "tps" (bivariate cubic thin-plate splines with a single smoothing parameter), tensor-product cubic smoothing splines with five smoothing parameters, and a parametric linear interaction model. We will use the following regression functions, also used by Lang and Brezger:

• m1(x1, x2) = x1 sin(4πx2), where x1 and x2 are independently uniformly distributed on [0, 1];

• m2(x1, x2) = 1.5 exp(−8x1²) + 3.5 exp(−8x2²), where x1 and x2 are independently normally distributed with mean 0.5 and variance 0.1.


Function m1(·) contains complex interactions and moderate spatial variability, while m2(·) has only main effects and is much smoother. We used a sample size of n = 300, σ = range(f)/4, and 250 simulations from each model.
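For concreteness, the two test surfaces and the noise rule σ = range(f)/4 can be written out as below (a sketch; the constants 1.5 and 3.5 in m2 are our reading of the garbled extraction):

```python
import numpy as np

# The two Lang-Brezger test surfaces as we read them from the garbled
# extraction (the constants 1.5 and 3.5 in m2 are reconstructed), plus the
# noise rule sigma = range(f)/4 used in the simulations.
def m1(x1, x2):
    return x1 * np.sin(4.0 * np.pi * x2)

def m2(x1, x2):
    return 1.5 * np.exp(-8.0 * x1 ** 2) + 3.5 * np.exp(-8.0 * x2 ** 2)

def noise_sd(f_vals):
    return (np.max(f_vals) - np.min(f_vals)) / 4.0
```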

We will compare the performance of our method with that of Lang and Brezger's, which performed at least as well as, or better than, other adaptive methods in a variety of examples. To avoid dependence of the results on knot choice, we used the same choice as the one described by Lang and Brezger: 12 × 12 equally spaced knots for the mean function. However, we used thin-plate splines instead of tensor products of B-splines. We modeled the smoothing parameter using a low-rank log thin-plate spline with an equally spaced 5 × 5 knot grid. We considered two estimators: one with constant error variance (CEV) and one with estimated error variance using a log thin-plate spline (EEV) with the same 12 × 12 knot grid as for the mean function. The CEV estimator is the one that can be compared directly with the estimators considered by Lang and Brezger (2004). Because Lang and Brezger presented only comparative results using boxplots, we use their results as read from their Figure 5. The performance of all estimators was measured by the log of the mean squared error, log(MSE), where MSE(f̂) = (1/n) Σ_{i=1}^{n} {f̂(x_i) − f(x_i)}².

For the CEV estimator and function f1(·), corresponding to moderate spatial heterogeneity, we obtained a median log(MSE) of −3.67, with an interquartile range of [−3.80, −3.53] and a range of [−4.21, −3.13]. These values are better than the best values reported for Lang and Brezger's methods in their Figure 5(b). Lang and Brezger reported that their method, denoted "P-spline," outperforms all other methods in this case and achieved a median log(MSE) of −3.50. Remarkably, the EEV estimator performed almost as well as the CEV estimator in this case and also outperformed the other methods. The median log(MSE) for EEV was −3.59, with an interquartile range of [−3.74, −3.43] and a range of [−4.14, −3.02]. These results are consistent with the univariate results reported by Ruppert and Carroll (2000) and Baladandayuthapani, Mallick, and Carroll (2005), who found that allowing the smoothing parameter to vary smoothly outperforms other adaptive techniques.

For function f1(·), Figure 4 displays the coverage probabilities of the 95% credible intervals calculated over a 20 × 20 equally spaced grid in [0, 1]². For each grid point we calculated the frequency with which the 95% pointwise credible interval covers the true value of the function at that point. As expected, these coverage probabilities show strong spatial correlation, with lower coverage probabilities along the ridges of the sine function. Coverage probability is lowest when x1 is in the [0.2, 0.5] range. The signal-to-noise ratio in this region is about half the signal-to-noise ratio in the region corresponding to x1 close to 1, which explains why the coverage probability is smaller there. Interestingly, the zeros of the true function, appearing at x2 = 0.25, 0.50, 0.75, are covered with at least 95% probability. Another interesting feature is the lower coverage probabilities near the north, south, and east boundaries. These features of the pointwise coverage probability map are consistent with the frequentist analysis of Krivobokova et al. (2007), but are not consistent with the coverage probabilities reported by Lang and Brezger for their Bayesian P-spline method.

For our CEV estimator and function f2(·), corresponding to very low spatial heterogeneity, we obtained a median log(MSE) of −5.95, with an interquartile range of [−6.12, −5.66]


Figure 4. Coverage probability of the pointwise 95% credible intervals for the mean function, for function f1(x1, x2) = x1 sin(4πx2) with a constant error standard deviation σ = range(f)/4. Probabilities are computed on a 20 × 20 equally spaced grid of points in [0, 1]². The model used was a thin-plate spline with K = 100 knots for the mean function, a log thin-plate spline with K = 100 knots for the variance function, and a log thin-plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

and a range of [−6.51, −5.41]. The results of Lang and Brezger in their Figure 5(a) indicate that their "P-spline" method achieved a median log(MSE) of −6.00. Our method performs roughly similarly to the Bayesian P-spline method of Lang and Brezger and to the cubic thin-plate spline (tps), being outperformed only by Lang and Brezger's adaptive P-spline method with two main effects, "P-spline main." We could also add two main effects to our bivariate adaptive smoother to improve the MSE for this example, but this is not our concern in this article. Again, the EEV estimator performed very similarly to our CEV estimator.

The last simulation study used the function f1(·,·), where x1 and x2 are independent and uniformly distributed on [0, 1], with error standard deviation function

σ_ε(x1, x2) = r/32 + (3r/32) x1²,

where r = range(f1). We compared the CEV and EEV estimators described at the beginning of this section and, as expected, the EEV estimator outperformed the CEV estimator both in terms of log(MSE) and in terms of coverage probabilities. More precisely, for log(MSE) we obtained medians of −5.26 for CEV and −5.35 for EEV, interquartile ranges of [−5.37, −5.15] for CEV and [−5.48, −5.24] for EEV, and ranges of [−5.72, −4.65] for CEV and [−5.84, −4.84] for EEV. For both estimators the general patterns of the coverage probabilities were the ones displayed in Figure 4, and EEV outperformed CEV in terms of coverage probabilities. For example, the regions where the nominal level is exceeded for both estimators have a roughly similar


Figure 5. Locations of sampling (dots) with 100 knots (circles) in the Noshiro example.

shape and location, but the EEV coverage probabilities tend to be closer to their nominal level.

Note that our model incorporating nonparametric variance estimation is more general than the models considered in the literature. Even so, its performance for function m1(·,·), with moderate spatial variability and constant variance, was better than that of all methods considered by Lang and Brezger in their simulation study (see their Figure 5, panel b). Our method also has good performance for the function m2(·,·), being second best among the methods considered by Lang and Brezger (see their Figure 5, panel a). For the case when the variance is not constant, our adaptive method that estimates the variance nonparametrically outperformed our adaptive method that assumes a constant variance.

8 THE NOSHIRO EXAMPLE

Noshiro, Japan, was the site of a major earthquake. Much of the damage from the quake was due to soil movement. Professor Thomas O'Rourke, a geotechnical engineer, was investigating the factors that might help predict soil movement during a future quake. One factor thought to be of importance was the slope of the land. To estimate the slope, Ruppert (1997) used a dataset with 799 observations, where x was longitude and latitude and y was elevation at locations in Noshiro. The object of primary interest is the gradient, at least its magnitude and possibly its direction.

We used the thin-plate spline model (6.2) for the mean of y, with K_m = 100 knots selected by the maximum separation criterion implemented in the function cover.design() of the R package fields (Nychka, Haaland, O'Connell, and Ellner 1998). The sampling


Figure 6. Posterior mean of the mean regression function for the Noshiro example. The model used was a thin-plate spline with K = 100 knots for the mean function, a log thin-plate spline with K = 100 knots for the variance function, and a log thin-plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

locations and the knots used for fitting are displayed in Figure 5. The log thin-plate spline described in (6.3), with the same K_ε = 100 knots, was used to model the variance of the error process, σ_ε²(x_i). The log thin-plate spline described in (6.4), with K_b = 16 subknots, was used to model the variance σ_b²(κ_k), which controls the amount of shrinkage of the b parameters. The subknots were obtained by applying the cover.design() function to the 100 knots.

Figure 6 displays the posterior mean of the mean response function. There is a rather sharp peak around (0.42, 0.4), with a slow decay from the peak in the general south direction and a much faster decay in all other directions. The function becomes smoother towards the boundary. This is exactly the type of function for which adaptive spatial smoothing can substantially improve the fit.

Figure 7 shows that over large areas of the map σ_ε(·) is smooth and takes small values. However, in a neighborhood of the peak of the mean function, σ_ε(·) displays two relatively sharp maxima, located NNW and SSE of the peak. We also performed a two-stage frequentist analysis. In the first stage we used a low-rank thin-plate spline for the mean function, and the results from this regression were used to obtain residuals. We then fitted another low-rank thin-plate spline to the absolute values of these residuals. While this map was not identical to the one in Figure 7, it did exhibit the same patterns of heteroscedasticity.

The process log σ_b²(κ_k), controlling the shrinkage of the b_k, is displayed in Figure 8.


Figure 7. Posterior mean of the standard deviation σ_ε(x_i) of the error process for the Noshiro example. The model used was a thin-plate spline with K = 100 knots for the mean function, a log thin-plate spline with K = 100 knots for the variance function, and a log thin-plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

Figure 8. Posterior mean of the shrinkage process −log σ_b²(κ_k). The model used was a thin-plate spline with K = 100 knots for the mean function, a log thin-plate spline with K = 100 knots for the variance function, and a log thin-plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.


Smaller values of this function correspond to less shrinkage of the b_k towards zero and more local behavior of the smoother. Figure 8 indicates that in a neighborhood of the peak of the mean function the shrinkage is smaller, allowing the mean function to change rapidly. Away from the peak the shrinkage is larger, corresponding to a smoother mean.

9 IMPLEMENTATION USING MCMC

We now provide some details on the MCMC simulation of model (2.1), where the variances σ_ε²(x_i) and σ_b²(κ_k) are modeled by Equations (2.3) and (2.4). The implementation for other models, for example for the multivariate smoothing in Section 6, is similar. Consider independent normal priors for the coefficients of the monomials, β_i ~ N(0, σ_{0β}²), γ_i ~ N(0, σ_{0γ}²), i = 0, ..., p, and δ_i ~ N(0, σ_{0δ}²), i = 0, ..., q, and independent inverse gamma priors for the variance components, σ_c² ~ IGamma(a_c, b_c) and σ_d² ~ IGamma(a_d, b_d). Using these priors, many full conditionals of the posterior distribution are easy to derive, while a few have complex multivariate forms. Our implementation of the MCMC using multivariate Metropolis-Hastings steps proved to be unstable, with poor mixing properties. A simple and reliable solution was to change the model by adding error terms to the log-spline models, that is,

log{σ_ε²(x_i)} = γ_0 + ··· + γ_p x_i^p + Σ_{s=1}^{K_ε} c_s (x_i − κ_s^ε)_+^p + u_i,
log{σ_b²(κ_k^m)} = δ_0 + ··· + δ_q (κ_k^m)^q + Σ_{s=1}^{K_b} d_s (κ_k^m − κ_s^b)_+^q + v_k,    (9.1)

where u_i ~ N(0, σ_u²) and v_k ~ N(0, σ_v²). This idea was also used for σ_b² by Baladandayuthapani, Mallick, and Carroll (2005). We fixed the values σ_u² = σ_v² = 0.01, as these variances appear nonidentifiable and a standard deviation of 0.1 is small on a log scale. This device makes computations feasible by replacing sampling from complex full conditionals with simple univariate MH steps.
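The truncated-polynomial design matrices appearing in (9.1) can be sketched as follows (our illustration):

```python
import numpy as np

# Sketch of the degree-p truncated-polynomial design in (9.1): monomial
# columns 1, x, ..., x^p and spline columns (x - kappa_s)_+^p.
def trunc_poly_design(x, knots, p=2):
    x = np.asarray(x, dtype=float)
    X = np.column_stack([x ** j for j in range(p + 1)])
    Z = np.clip(x[:, None] - np.asarray(knots, dtype=float)[None, :], 0.0, None) ** p
    return X, Z

X, Z = trunc_poly_design(np.linspace(0.0, 1.0, 7), knots=[0.25, 0.5, 0.75], p=2)
```

The same construction, with q in place of p and the subknots κ^b in place of κ^ε, gives the design for the shrinkage-process spline.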

Define ε, b, and θ_X as before, and define θ_ε = (γ^T, c^T)^T, θ_b = (δ^T, d^T)^T, C_X = (X, Z), C_ε = (X_ε, Z_ε), and C_b = (X_b, Z_b), where X, X_ε, X_b contain the monomials and Z, Z_ε, Z_b contain the truncated polynomials of the spline models for m, log σ_ε², and log σ_b², respectively. Also denote

Σ_{0X} = diag(σ_{0β}² I_{p+1}, Σ_b),   Σ_{0ε} = diag(σ_{0γ}² I_{p+1}, σ_c² I_{K_ε}),   Σ_{0b} = diag(σ_{0δ}² I_{q+1}, σ_d² I_{K_m}),

where Σ_b = diag{σ_b²(κ_k)} is the prior covariance matrix of b. The full conditionals of the posterior are detailed below.

1. [θ_X | ·] ~ N(M_X C_X^T Σ_ε^{−1} Y, M_X), where M_X = (C_X^T Σ_ε^{−1} C_X + Σ_{0X}^{−1})^{−1} and Σ_ε = diag{σ_ε²(x_i)};

2. [θ_ε | ·] ~ N(M_ε C_ε^T Y_ε/σ_u², M_ε), where Y_ε = [log(σ²_{ε1}), ..., log(σ²_{εn})]^T and M_ε = (C_ε^T C_ε/σ_u² + Σ_{0ε}^{−1})^{−1};


Figure 9. Computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with K_m knots, the log of the variance of the shrinkage process is modeled as another penalized spline with K_b = 4 subknots, and the log of the error variance is modeled as another penalized spline with K_ε = K_m knots. Different symbols correspond to increasing numbers of knots: dots (K_m = 10), asterisks (K_m = 20), circles (K_m = 30), x's (K_m = 40).

3 [θθθb] sim N(MbCT

b Ybσ2v Mb

)where Yb =

[log(σ 2

1 ) log(σ 2Km

)]T

and Mb =(CT

b Cbσ2v +minus1

0b )minus1

4 [σ 2c ] sim IGamma

(ac + Kε2 bc + ||c||22

)

5 [σ 2d ] sim IGamma

(ad + Kb2 bd + ||d||22

)

6 [σ 2εi] prop σminus3

εi exp

minus(yi minus microi)

2(2σ 2εi ) minus

[log(σ 2

εi ) minus ηi)]2

(2σ 2u )

where microi

and ηi are the ith components of CXθθθX and Cεθθθε respectively

7 [σ 2k ] prop σminus3

k expminusb2

k(2σ2k ) minus [

log(σ 2k ) minus ζk)

]2(σ 2

v ) where ζk is the kth com-

ponent of Cbθθθb

All the above conditionals have an explicit form with the exception of the n + Km

one-dimensional conditionals from 6 and 7 For these distributions we use the Metropolis-Hastings algorithm with a normal proposal distribution centered at the current value andsmall variance

10 IMPLEMENTATION AND COMPUTATIONALPERFORMANCE

Our MCMC simulation algorithm provided reliable results and is reasonably fast To il-lustrate this Figure 9 displays computation time (in minutes) versus sample size for 10000

Spatially Adaptive Bayesian Penalized Splines 285

Figure 10 MCMC histories for eight parameters of interest including the log of standard deviation of the errorprocess and log of standard deviation of the shrinkage process as well as some parameters modeling the differentfunctions The model uses 40 equally spaced knots for the mean and error variance functions and 4 subknotsfor the adaptive shrinkage process Chains correspond to the Doppler function example using quadratic truncatedpolynomials to fit the mean and log variance function and linear truncated polynomial basis for fitting the shrinkageprocess For clarity of presentation only every 20th simulated observation is displayed

simulations from the joint posterior distribution given the data The mean function is mod-eled as a penalized spline with Km knots with the log of the variance of the shrinkageprocess modeled as another penalized spline with Kb = 4 subknots The log of the errorvariance is modeled as another penalized spline with Kε = Km knots Different symbolscorrespond to increasing number of knots dots (Km = 10) asterisks (Km = 20) circles(Km = 30) xrsquos (Km = 40) Computation time increases roughly linearly with the numberof observations with the same intercept 0 and larger slopes for larger number of knotsNote that even with n = 1000 observations and Km = Kε = 40 simulation time is abouttwo minutes

For the models and datasets presented in this article, 10,000 simulations proved sufficient for accurate and reliable inference. Some parameters, such as the mean and variance functions, tended to exhibit much better mixing properties than the variance components σ²_c and σ²_d controlling the shrinkage processes. This is probably due to the reduced amount of information in the data about these parameters. MCMC convergence and mixing properties were assessed by visual inspection of the chain histories of many parameters of interest. Because we started with good starting values, convergence is fast. Figure 10 displays the histories of eight parameters of interest for the model fitted to the Doppler function described in Section 5. For clarity we display only every 100th sampled value of the chain, but the same patterns would be observed in the full chain history. Similar good mixing properties have been noted in all the other examples presented in this article.

In the search for a good software platform, we implemented our algorithms in MATLAB, C, and FORTRAN. Programs can be downloaded from www.biostat.jhsph.edu/~ccrainic. For our application and implementation there were no serious differences between C and FORTRAN, both of which performed better than MATLAB, reducing computation time by a factor of two to three. The times reported in Figure 9 are for our implementation in C.

Given the complexity of the model, we were interested in improving the mixing properties of the Markov chains. The computational trick described in Section 9 was the sine qua non device that allowed us to replace an intractable multivariate MH step with several tractable univariate MH steps. We used normal random-walk proposal distributions centered at the current value, with variances tuned to produce acceptance rates between 30% and 50%.
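As a concrete illustration of this tuning strategy, the following sketch adapts the random-walk proposal standard deviation during burn-in until the empirical acceptance rate falls in the 30-50% band. This is an illustration only, not the authors' code; the update rule and the adjustment factor are hypothetical choices.

```python
import numpy as np

def tune_proposal_sd(step, acc_rate, lo=0.30, hi=0.50, factor=1.2):
    # crude burn-in rule: widen the proposal when accepting too often,
    # narrow it when accepting too rarely, leave it alone inside the band
    if acc_rate > hi:
        return step * factor
    if acc_rate < lo:
        return step / factor
    return step
```

In practice one would apply such a rule every few hundred burn-in iterations and freeze the step size before collecting posterior samples, so that the final chain remains a valid Markov chain.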

11. COMMENTS

Our joint models for heterogeneous mean and nonconstant variance are designed to extract the maximum amount of information from complex data and to suggest simpler parametric models. Bayesian inference based on MCMC offers a natural and powerful framework for model fitting. However, MCMC in this context is not straightforward, and implementation depends essentially on the ability to transform intractable simulations from high-dimensional distributions into tractable one-dimensional updates, as described in Section 9. Our implementation provides accurate, fast, and reproducible results, and software is publicly available. Simulation time increases only linearly with the number of observations, and mixing properties are adequate provided that proposal distribution variances are tuned for optimal acceptance rates.

ACKNOWLEDGMENTS

Carroll's research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

[Received September 2004. Revised October 2006.]

REFERENCES

Baladandayuthapani, V., Mallick, B. K., and Carroll, R. J. (2005), "Spatially Adaptive Bayesian Regression Splines," Journal of Computational and Graphical Statistics, 14, 378-394.

Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computation and Stochastic Systems" (with discussion), Statistical Science, 10, 3-66.

Carroll, R. J. (2003), "Variances Are Not Always Nuisance Parameters: The 2002 R. A. Fisher Lecture," Biometrics, 59, 211-220.

Carroll, R. J., and Ruppert, D. (1988), Transformation and Weighting in Regression, New York: Chapman and Hall.

Cleveland, W. S., and Grosse, E. (1991), "Computational Methods for Local Regression," Statistics and Computing, 1, 47-62.

Crainiceanu, C., and Ruppert, D. (2004), "Likelihood Ratio Tests in Linear Mixed Models With One Variance Component," Journal of the Royal Statistical Society, Series B, 66, 165-185.

Crainiceanu, C. M., Ruppert, D., and Wand, M. P. (2005), "Bayesian Analysis for Penalized Spline Regression Using WinBUGS," Journal of Statistical Software, 14. Available online at http://www.jstatsoft.org/v14/i14.

Donoho, D. L., and Johnstone, I. M. (1994), "Ideal Spatial Adaptation by Wavelet Shrinkage," Biometrika, 81, 425-455.

Eilers, P. H. C., and Marx, B. D. (1996), "Flexible Smoothing With B-Splines and Penalties" (with discussion), Statistical Science, 11, 89-121.

Eilers, P. H. C., and Marx, B. D. (2002), "Generalized Linear Additive Smooth Structures," Journal of Computational and Graphical Statistics, 11, 758-783.

Eilers, P. H. C., and Marx, B. D. (2004), "Splines, Knots, and Penalties," unpublished manuscript.

Fields Development Team (2006), fields: Tools for Spatial Data, National Center for Atmospheric Research, Boulder, CO. Available online at http://www.cgd.ucar.edu/Software/Fields.

Friedman, J. H. (1991), "Multivariate Adaptive Regression Splines" (with discussion), The Annals of Statistics, 19, 1-141.

Gelman, A. (2006), "Prior Distributions for Variance Parameters in Hierarchical Models," Bayesian Analysis, 1, 515-533.

Gray, R. J. (1994), "Spline-Based Tests in Survival Analysis," Biometrics, 50, 640-652.

Holst, U., Hössjer, O., Björklund, C., Ragnarson, P., and Edner, H. (1996), "Locally Weighted Least Squares Kernel Regression and Statistical Evaluation of LIDAR Measurements," Environmetrics, 7, 401-416.

Jullion, A., and Lambert, P. (2006), Discussion Paper DP0534, UCL.

Kammann, E. E., and Wand, M. P. (2003), "Geoadditive Models," Applied Statistics, 52, 1-18.

Kauermann, G. (2002), "A Note on Bandwidth Selection for Penalised Spline Smoothing," Technical Report 02-13, Department of Statistics, University of Glasgow.

Krivobokova, T., Crainiceanu, C. M., and Kauermann, G. (2007), "Fast Adaptive Penalized Splines," Journal of Computational and Graphical Statistics, to appear.

Lang, S., and Brezger, A. (2004), "Bayesian P-Splines," Journal of Computational and Graphical Statistics, 13, 183-212.

Lang, S., Fronk, E. M., and Fahrmeir, L. (2002), "Function Estimation with Locally Adaptive Dynamic Models," Computational Statistics, 17, 479-499.

Marx, B. D., and Eilers, P. H. C. (1998), "Direct Generalized Additive Modeling With Penalized Likelihood," Computational Statistics & Data Analysis, 28, 193-209.

Natarajan, R., and Kass, R. E. (2000), "Reference Bayesian Methods for Generalized Linear Mixed Models," Journal of the American Statistical Association, 95, 227-237.

Ngo, L., and Wand, M. P. (2004), "Smoothing With Mixed Model Software," Journal of Statistical Software, 9. Available online at http://www.jstatsoft.org/v09/i01.

Nychka, D., Haaland, P., O'Connell, M., and Ellner, S. (1998), "FUNFITS: Data Analysis and Statistical Tools for Estimating Functions," in Case Studies in Environmental Statistics, Lecture Notes in Statistics, vol. 132, eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer Verlag, pp. 159-179. Available online at http://www.cgd.ucar.edu/stats/Software/Funfits.

Nychka, D., and Saltzman, N. (1998), "Design of Air Quality Monitoring Networks," in Case Studies in Environmental Statistics, Lecture Notes in Statistics, vol. 132, eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer Verlag, pp. 51-76.

Pintore, A., Speckman, P., and Holmes, C. C. (2006), "Spatially Adaptive Smoothing Splines," Biometrika, 93, 113-125.

Ruppert, D. (1997), "Local Polynomial Regression and its Applications in Environmental Statistics," in Statistics for the Environment (vol. 3), eds. V. Barnett and F. Turkman, Chichester: Wiley, pp. 155-173.

Ruppert, D. (2002), "Selecting the Number of Knots for Penalized Splines," Journal of Computational and Graphical Statistics, 11, 735-757.

Ruppert, D., and Carroll, R. J. (2000), "Spatially-Adaptive Penalties for Spline Fitting," Australian and New Zealand Journal of Statistics, 42, 205-223.

Ruppert, D., Wand, M. P., Holst, U., and Hössjer, O. (1997), "Local Polynomial Variance-Function Estimation," Technometrics, 39, 262-273.

Ruppert, D., Wand, M. P., and Carroll, R. J. (2003), Semiparametric Regression, Cambridge: Cambridge University Press.

Smith, M., and Kohn, R. (1997), "A Bayesian Approach to Nonparametric Bivariate Regression," Journal of the American Statistical Association, 92, 1522-1535.

Wood, S. (2006), Generalized Additive Models: An Introduction with R, New York: Chapman & Hall/CRC.


Figure 3. Comparison of pointwise coverage probabilities of the 95% credible intervals in 500 simulations. Data were simulated from the model described in Section 5 with heteroscedastic error variance. Estimation was done using two methods: Bayesian adaptive penalized splines with homoscedastic errors (dashed) and Bayesian adaptive penalized splines with heteroscedastic errors (solid). The coverage probabilities have been smoothed to remove Monte Carlo variability.

lations was AMSE = 0.0054, which is smaller than the 0.0061 reported by Baladandayuthapani et al. (2005). The coverage probabilities of the 95% credible intervals were very similar for the two models for each value of the covariate, with a slight advantage in favor of the model using homoscedastic errors. The average coverage probability was 94.7% for the model with homoscedastic errors and 93.5% for the model with heteroscedastic errors. These coverage probabilities are very similar to the ones reported by Baladandayuthapani et al. (2005) in their Figure 3.

For case (b), σ_ε(x_i) = 0.5 − 0.8x + 1.6(x − 0.5)_+, the heteroscedastic model substantially outperformed the homoscedastic model, both in terms of AMSE and coverage probabilities. In this situation it would be misleading to compare only the average coverage probability for the 95% credible intervals. Indeed, these averages are very close, 94.1% for the heteroscedastic and 93.0% for the homoscedastic method, but they are obtained from very different sets of pointwise coverage probabilities. Figure 3 displays the coverage probabilities for these two methods. Note that for the heteroscedastic method the coverage probability on the interval [0, 0.5] is close to 95%. On the same interval, the coverage probability for the homoscedastic method starts from around 0.8 and increases until it crosses the 95% target around 0.2. Moreover, on the interval [0.3, 0.5] this coverage probability is estimated to be 1. This is due to the fact that on this interval σ_ε(x) decreases linearly from 0.5 to 0.1. Since the homoscedastic method assumes a constant variance, the credible intervals will tend to be shorter than nominal in regions of higher variability (x close to zero), thus producing lower coverage probabilities. In regions of smaller variability (x close to 0.5), the credible intervals will tend to be much wider, thus producing extremely large coverage probabilities.

In the interval [0.55, 0.65] the homoscedastic method slightly outperforms the heteroscedastic method. This seems to be the effect of a "lucky" combination of two factors. As we discussed, the size of the credible intervals for the homoscedastic method in a neighborhood of 0.5 is much larger than nominal. However, at the same point the mean function changes from a constant to a rapidly oscillating function, and the coverage probabilities drop at roughly the same rate. In the interval [0.8, 1] one can notice a phenomenon very similar to the one described for the interval [0, 0.5]. While the function oscillates more rapidly in this region, the heteroscedastic adaptive method produces credible intervals with coverage probabilities close to the nominal 95%. However, the homoscedastic method does not take into account the increased variability and produces credible intervals that are too short.

For these examples, both methods performed roughly similarly on simulated datasets that did not require variance estimation. However, when the error variance is not constant, the method using log-penalized splines to estimate the error variance considerably outperforms the method that does not.
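The pointwise coverage probabilities discussed above are straightforward to compute from simulation output. The sketch below is a generic illustration in our own notation, not the authors' implementation; the array layout is an assumption. At each covariate value it counts how often the credible interval covers the truth across replicates:

```python
import numpy as np

def pointwise_coverage(lower, upper, f_true):
    # lower, upper: (n_sim, n_grid) credible-interval bounds, one row per
    # simulated dataset; f_true: (n_grid,) true mean function on the grid
    covered = (lower <= f_true) & (f_true <= upper)
    return covered.mean(axis=0)  # empirical coverage at each grid point
```

Smoothing the resulting curve, as done for Figure 3, removes the Monte Carlo variability of these empirical frequencies.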

6. LOW RANK MULTIVARIATE SMOOTHING

In this section we generalize the ideas in Section 2 to multivariate smoothing while preserving the appealing geometric interpretation of the smoother. Consider the regression y_i = m(x_i) + ε_i, where m(·) is a smooth function of L covariates. We will use radial basis functions, which have the advantage of being more numerically stable than the tensor product of truncated polynomials in more than one dimension. Suppose that x_i ∈ R^L, 1 ≤ i ≤ n, are n vectors of covariates and κ_k ∈ R^L, 1 ≤ k ≤ K_m, are K_m knots. Consider the following distance function:

$$
C(\mathbf{r}) =
\begin{cases}
\|\mathbf{r}\|^{2M-L}, & \text{for } L \text{ odd},\\
\|\mathbf{r}\|^{2M-L}\log\|\mathbf{r}\|, & \text{for } L \text{ even},
\end{cases}
$$

where ‖·‖ denotes the Euclidean norm in R^L and the integer M controls the smoothness of C(·). Let X be the matrix with ith row X_i = [1, x_i^T], let Z_{K_m} = [C(‖x_i − κ_k‖)]_{1≤i≤n, 1≤k≤K_m} and Ω_{K_m} = [C(‖κ_k − κ_{k′}‖)]_{1≤k≤K_m, 1≤k′≤K_m}, and define Z = Z_{K_m} Ω_{K_m}^{−1/2}, where Ω_{K_m}^{−1/2} is the principal square root of Ω_{K_m}. With these notations, the low rank approximation of thin plate spline regression can be obtained as the BLUP in the LMM (Kammann and Wand 2003; Ruppert, Wand, and Carroll 2003):

$$
\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{b} + \boldsymbol{\varepsilon},\qquad
E\begin{pmatrix}\mathbf{b}\\ \boldsymbol{\varepsilon}\end{pmatrix} = \begin{pmatrix}\mathbf{0}\\ \mathbf{0}\end{pmatrix},\qquad
\operatorname{cov}\begin{pmatrix}\mathbf{b}\\ \boldsymbol{\varepsilon}\end{pmatrix} = \begin{pmatrix}\sigma^2_b \mathbf{I}_{K_m} & \mathbf{0}\\ \mathbf{0} & \sigma^2_\varepsilon \mathbf{I}_n\end{pmatrix}.
\tag{6.1}
$$
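To make this construction concrete, here is a sketch of how X and Z could be assembled. This is our own illustration, not the authors' MATLAB/C/FORTRAN code. Because Ω_{K_m} is only conditionally positive definite, the sketch uses a symmetric eigendecomposition with absolute eigenvalues as a pragmatic matrix square root, a common device in low-rank thin-plate implementations and an assumption on our part:

```python
import numpy as np

def tps_basis(r, L, M=2):
    # C(r) = ||r||^(2M-L) for odd L, and ||r||^(2M-L) * log||r|| for even L
    p = 2 * M - L
    if L % 2 == 1:
        return r ** p
    out = np.zeros_like(r)
    pos = r > 0
    out[pos] = r[pos] ** p * np.log(r[pos])
    return out

def low_rank_tps_design(x, knots, M=2):
    # x: (n, L) covariates; knots: (K, L); returns the fixed-effects matrix
    # X = [1, x] and the random-effects matrix Z = Z_K * Omega_K^(-1/2)
    n, L = x.shape
    X = np.column_stack([np.ones(n), x])
    ZK = tps_basis(np.linalg.norm(x[:, None, :] - knots[None, :, :], axis=2), L, M)
    Omega = tps_basis(np.linalg.norm(knots[:, None, :] - knots[None, :, :], axis=2), L, M)
    # symmetric "square root" built from absolute eigenvalues, since Omega
    # is generally indefinite (only conditionally positive definite)
    w, V = np.linalg.eigh(Omega)
    Omega_half = (V * np.sqrt(np.abs(w))) @ V.T
    Z = np.linalg.solve(Omega_half, ZK.T).T
    return X, Z
```

With X and Z in hand, the smoother is just a mixed-model (BLUP) fit of model (6.1).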

This model contains only one variance component, σ²_b, controlling the shrinkage of b, which is equivalent to one global smoothing parameter λ = σ²_b/σ²_ε, and it implicitly assumes homoscedastic errors. To relax these assumptions we consider a new set of knots κ*_1, …, κ*_{K_b}, and define X* and Z* similarly to the matrices X and Z, where the x-covariates are replaced by the knots κ_k and the knots are replaced by the subknots κ*_k. Consider the following model for the response:

$$
y_i = \beta_0 + \sum_{j=1}^{L} \beta_j x_{ij} + \sum_{k=1}^{K_m} b_k z_{ik} + \varepsilon_i,\qquad
\varepsilon_i \sim N\{0, \sigma^2_\varepsilon(\mathbf{x}_i)\},\; i = 1, \ldots, n,\qquad
b_k \sim N\{0, \sigma^2_b(\boldsymbol{\kappa}_k)\},\; k = 1, \ldots, K_m,
\tag{6.2}
$$

where the error variance log σ²_ε(x_i) is modeled as a low rank log-thin plate spline,

$$
\log\sigma^2_\varepsilon(\mathbf{x}_i) = \gamma_0 + \sum_{j=1}^{L} \gamma_j x_{ij} + \sum_{k=1}^{K_\varepsilon} c_k z_{ik},\qquad
c_k \sim N(0, \sigma^2_c),\; k = 1, \ldots, K_\varepsilon,
\tag{6.3}
$$

and the random coefficient variance log σ²_b(κ_k) is modeled as another low rank log-thin plate spline,

$$
\log\sigma^2_b(\boldsymbol{\kappa}_k) = \delta_0 + \sum_{j=1}^{L} \delta_j x^*_{kj} + \sum_{j=1}^{K_b} d_j z^*_{kj},\qquad
d_j \sim N(0, \sigma^2_d),\; j = 1, \ldots, K_b,
\tag{6.4}
$$

where x*_{kj} and z*_{kj} are the entries of the X* and Z* matrices, respectively, and the c_k and d_j are assumed mutually independent.

In the case of low rank smoothers, the set of knots for the covariates and the subknots for modeling the shrinkage process have to be chosen. One possibility is to use equally spaced knots and subknots. We use this approach in our simulation study in Section 7 to make our method comparable, from this point of view, with previously published methods. Another possibility is to select the knots and subknots using the space filling design (Nychka and Saltzman 1998), which is based on the maximal separation principle. This avoids wasting knots and is likely to lead to better approximations in sparse regions of the data. The cover.design() function from the R package fields (Fields Development Team 2006) provides software for space filling knot selection. We use this approach in Section 8, in the Noshiro elevation example.
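A simple greedy stand-in for maximal-separation knot selection picks each new knot as the data point farthest from all knots chosen so far. This is an illustrative sketch only; cover.design() in the R package fields optimizes a more sophisticated coverage criterion with swapping steps:

```python
import numpy as np

def farthest_point_knots(x, K, seed=0):
    # greedy farthest-point selection of K knots from the rows of x (n, L)
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(x.shape[0]))]
    dist = np.linalg.norm(x - x[chosen[0]], axis=1)
    while len(chosen) < K:
        nxt = int(np.argmax(dist))  # point farthest from the current knot set
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(x - x[nxt], axis=1))
    return x[np.array(chosen)]
```

Like the space filling design, this spreads knots out and avoids placing several of them in dense clusters of sampling locations.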

7. COMPARISONS WITH OTHER ADAPTIVE SURFACE FITTING METHODS

Lang and Brezger (2004) compared their adaptive surface smoothing method with several methods used in a simulation study by Smith and Kohn (1997): MARS (Friedman 1991), "locfit" (Cleveland and Grosse 1991), "tps" (bivariate cubic thin plate splines with a single smoothing parameter), tensor product cubic smoothing splines with five smoothing parameters, and a parametric linear interaction model. We will use the following regression functions, also used by Lang and Brezger:

• m1(x1, x2) = x1 sin(4πx2), where x1 and x2 are distributed independently uniform on [0, 1];

• m2(x1, x2) = 1.5 exp(−8x1²) + 3.5 exp(−8x2²), where x1 and x2 are distributed independently normal with mean 0.5 and variance 0.1.

Function m1(·) contains complex interactions and moderate spatial variability, while m2(·) has only main effects and is much smoother. We used a sample size n = 300, an error standard deviation σ = range(f)/4, and 250 simulations from each model.
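The simulation setup above can be sketched as follows. This is our own illustrative code; the noise level σ = range(f)/4 and the test function are taken from the text, while everything else about the implementation is assumed:

```python
import numpy as np

def m1(x1, x2):
    # first test function: complex interactions, moderate spatial variability
    return x1 * np.sin(4 * np.pi * x2)

def log_mse(f_hat, f_true):
    # performance metric used in the comparisons: log of mean squared error
    return float(np.log(np.mean((f_hat - f_true) ** 2)))

rng = np.random.default_rng(0)
n = 300
x1, x2 = rng.uniform(size=n), rng.uniform(size=n)
f = m1(x1, x2)
sigma = (f.max() - f.min()) / 4  # sigma = range(f)/4
y = f + sigma * rng.standard_normal(n)
```

Each of the 250 replicates regenerates (x1, x2, y), fits the competing smoothers, and records log_mse of the fitted surface at the design points.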

We will compare the performance of our method with that of Lang and Brezger's, which performed at least as well as or better than other adaptive methods in a variety of examples. To avoid dependence of results on knot choice, we used the same choice as the one described by Lang and Brezger: 12 × 12 equally spaced knots for the mean function. However, we used thin-plate splines instead of tensor products of B-splines. We modeled the smoothing parameter using a low rank log-thin plate spline with an equally spaced 5 × 5 knot grid. We considered two estimators: one with constant error variance (CEV) and one with estimated error variance using a log-thin plate spline (EEV) with the same 12 × 12 knot grid as for the mean function. The CEV estimator is the one that can be directly compared with the estimators considered by Lang and Brezger (2004). Because Lang and Brezger presented only comparative results using boxplots, we will use their results as read from their Figure 5. The performance of all estimators was measured by the log of the mean squared error, log(MSE), where MSE(f̂) = (1/n) Σ_{i=1}^{n} {f̂(x_i) − f(x_i)}².

For the CEV estimator and function f1(·), corresponding to moderate spatial heterogeneity, we obtained a median log(MSE) of −3.67, with an interquartile range of [−3.80, −3.53] and a range of [−4.21, −3.13]. These values are better than the best values reported for Lang and Brezger's methods in their Figure 5(b). Lang and Brezger reported that their method, denoted "P-spline," outperforms all other methods in this case and achieved a median log(MSE) of −3.50. Remarkably, the EEV estimator performed almost as well as the CEV estimator in this case and also outperformed the other methods. The median log(MSE) for EEV was −3.59, with an interquartile range of [−3.74, −3.43] and a range of [−4.14, −3.02]. These results are consistent with univariate results reported by Ruppert and Carroll (2000) and Baladandayuthapani, Mallick, and Carroll (2005), who found that allowing the smoothing parameter to vary smoothly outperforms other adaptive techniques.

For function f1(·), Figure 4 displays the coverage probabilities of the 95% credible intervals calculated over a 20 × 20 equally spaced grid in [0, 1]². For each grid point, we calculated the frequency with which the 95% pointwise credible interval covers the true value of the function at that point. As expected, these coverage probabilities show strong spatial correlation, with lower coverage probabilities along the ridges of the sinus function. Coverage probability is lowest when x1 is in the [0.2, 0.5] range. The signal-to-noise ratio in this region is about half the signal-to-noise ratio in the region corresponding to x1 close to 1, which explains why the coverage probability is smaller there. Interestingly, the zeros of the true function, appearing at x2 = 0.25, 0.50, 0.75, are covered with at least 95% probability. Another interesting feature is the lower coverage probabilities near the north, south, and east boundaries. These features of the pointwise coverage probability map are consistent with the frequentist analysis of Krivobokova et al. (2007), but are not consistent with the coverage probabilities reported by Lang and Brezger for their Bayesian P-spline method.

For our CEV estimator and function f2(·), corresponding to very low spatial heterogeneity, we obtained a median log(MSE) of −5.95, with an interquartile range of [−6.12, −5.66] and a range of [−6.51, −5.41]. The results of Lang and Brezger in their Figure 5(a) indicate that their "P-spline" method achieved a median log(MSE) of −6.00. Our method performs roughly similarly to the Bayesian P-spline method of Lang and Brezger and to the cubic thin plate spline (tps), being outperformed only by Lang and Brezger's adaptive P-spline method with two main effects, "P-spline main." We could also add two main effects to our bivariate adaptive smoother to improve MSE for this example, but this is not our concern in this article. Again, the EEV estimator performed very similarly to our CEV estimator.

Figure 4. Coverage probability of the pointwise 95% credible intervals for the mean function, for function f1(x1, x2) = x1 sin(4πx2) with a constant error standard deviation σ = range(f)/4. Probabilities are computed on a 20 × 20 equally spaced grid of points in [0, 1]². The model used was a thin plate spline with K = 100 knots for the mean function, a log-thin plate spline with K = 100 knots for the variance function, and a log-thin plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

The last simulation study was done using the function f1(·, ·), where x1 and x2 are independent, uniformly distributed on [0, 1], with error standard deviation function

$$
\sigma_\varepsilon(x_1, x_2) = \frac{r}{32} + \frac{3r}{32}\, x_1^2,
$$

where r = range(f1). We compared our CEV and EEV estimators described at the beginning of this section and, as expected, the EEV estimator outperformed the CEV, both in terms of log(MSE) and coverage probabilities. More precisely, for log(MSE) we obtained a median of −5.26 for CEV and −5.35 for EEV, an interquartile range of [−5.37, −5.15] for CEV and [−5.48, −5.24] for EEV, and a range of [−5.72, −4.65] for CEV and [−5.84, −4.84] for EEV. For both estimators the general patterns for coverage probabilities were the ones displayed in Figure 4. EEV outperformed CEV in terms of coverage probabilities. For example, the regions where the nominal level is exceeded for both estimators have a roughly similar


Figure 5. Locations of sampling (dots) with 100 knots (circles) in the Noshiro example.

shape and location, but the EEV coverage probabilities tend to be closer to their nominal level.

Note that our model, incorporating nonparametric variance estimation, is more general than the models considered in the literature. Nevertheless, its performance for function m1(·, ·), with moderate spatial variability and constant variance, was better than that of all methods considered by Lang and Brezger in their simulation study (see their Figure 5, panel b). Our method also has good performance for function m2(·, ·), being second best when compared to the methods considered by Lang and Brezger (see their Figure 5, panel a). When the variance is not constant, our adaptive method that estimates the variance nonparametrically outperformed our adaptive method that assumes a constant variance.

8. THE NOSHIRO EXAMPLE

Noshiro, Japan, was the site of a major earthquake. Much of the damage from the quake was due to soil movement. Professor Thomas O'Rourke, a geotechnical engineer, was investigating the factors that might help predict soil movement during a future quake. One factor thought to be of importance was the slope of the land. To estimate the slope, Ruppert (1997) used a dataset with 799 observations, where x was longitude and latitude and y was elevation at locations in Noshiro. The object of primary interest is the gradient, at least its magnitude and possibly its direction.
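Given a fitted surface evaluated on a regular grid, the gradient magnitude of interest can be approximated by finite differences. A minimal sketch (our own illustration; the grid spacings are hypothetical inputs):

```python
import numpy as np

def gradient_magnitude(f_grid, dx, dy):
    # finite-difference gradient of a surface sampled on a regular grid;
    # f_grid has shape (ny, nx), dy and dx are the grid spacings
    dfdy, dfdx = np.gradient(f_grid, dy, dx)
    return np.hypot(dfdx, dfdy)  # ||grad f|| at every grid point
```

In a fully Bayesian analysis, applying this to each posterior draw of the mean surface yields a posterior distribution for the slope at every location.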

We used the thin-plate spline model (6.2) for the mean of y, with K_m = 100 knots selected by the maximum separation criterion implemented in the function cover.design() of the R package fields (Nychka, Haaland, O'Connell, and Ellner 1998). The sampling locations and the knots used for fitting are displayed in Figure 5. The log-thin plate spline described in (6.3), with the same K_ε = 100 knots, was used to model the variance of the error process, σ²_ε(x_i). The log-thin plate spline described in (6.4), with K_b = 16 subknots, was used to model the variance σ²_b(κ_k), which controls the amount of shrinkage of the b parameters. The subknots were obtained by applying the cover.design() function to the 100 knots.

Figure 6. Posterior mean of the mean regression function for the Noshiro example. The model used was a thin plate spline with K = 100 knots for the mean function, a log-thin plate spline with K = 100 knots for the variance function, and a log-thin plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

Figure 6 displays the posterior mean of the mean response function. There is a rather sharp peak around (0.42, 0.4), with a slow decay from the peak in the general south direction and a much faster decay in all other directions. The function becomes smoother toward the boundary. This is exactly the type of function for which adaptive spatial smoothing can substantially improve the fit.

Figure 7 shows that in large areas of the map σ_ε(·) is smooth, with small values. However, in a neighborhood of the peak of the mean function, σ_ε(·) displays two relatively sharp maxima, located NNW and SSE of the peak. We also performed a two-stage frequentist analysis. In the first stage we used a low rank thin plate spline for the mean function, and the results from this regression were used to obtain residuals. We then fitted another low rank thin plate spline to the absolute values of these residuals. While the resulting map was not identical to the one in Figure 7, it exhibited the same patterns of heteroscedasticity.

The process log σ²_b(κ_k) controlling the shrinkage of the b_k is displayed in Figure 8.

Figure 7. Posterior mean of the standard deviation σ_ε(x_i) of the error process for the Noshiro example. The model used was a thin plate spline with K = 100 knots for the mean function, a log-thin plate spline with K = 100 knots for the variance function, and a log-thin plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

Figure 8. Posterior mean of the shrinkage process −log σ²_b(κ_k). The model used was a thin plate spline with K = 100 knots for the mean function, a log-thin plate spline with K = 100 knots for the variance function, and a log-thin plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.


Smaller values of this function correspond to less shrinkage of the b_k toward zero and more local behavior of the smoother. Figure 8 indicates that in a neighborhood of the peak of the mean function the shrinkage is smaller, allowing the mean function to change rapidly. Away from the peak, the shrinkage is larger, corresponding to a smoother mean.

9. IMPLEMENTATION USING MCMC

We will now provide some details for MCMC simulation of model (2.1), where the variances σ²_ε(x_i) and σ²_b(κ_k) are modeled by Equations (2.3) and (2.4). The implementation for other models, for example for the multivariate smoothing in Section 6, is similar. Consider independent normal priors for the coefficients of the monomials, β_i ~ N(0, σ²_{0β}), γ_i ~ N(0, σ²_{0γ}), i = 0, …, p, and δ_i ~ N(0, σ²_{0δ}), i = 0, …, q, and independent inverse Gamma priors for the variance components, σ²_c ~ IGamma(a_c, b_c) and σ²_d ~ IGamma(a_d, b_d). Using these priors, many full conditionals of the posterior distribution are easy to derive, while a few have complex multivariate forms. Our implementation of the MCMC using multivariate Metropolis-Hastings steps proved to be unstable, with poor mixing properties. A simple and reliable solution was to change the model by adding error terms to the log-spline models, that is,

$$
\begin{aligned}
\log\sigma^2_\varepsilon(x_i) &= \gamma_0 + \cdots + \gamma_p x_i^p + \sum_{s=1}^{K_\varepsilon} c_s (x_i - \kappa^\varepsilon_s)_+^p + u_i,\\
\log\sigma^2_b(\kappa^m_k) &= \delta_0 + \cdots + \delta_q (\kappa^m_k)^q + \sum_{s=1}^{K_b} d_s (\kappa^m_k - \kappa^b_s)_+^q + v_k,
\end{aligned}
\tag{9.1}
$$

where u_i ~ N(0, σ²_u) and v_k ~ N(0, σ²_v). This idea was also used for σ²_b by Baladandayuthapani, Mallick, and Carroll (2005). We fixed the values σ²_u = σ²_v = 0.01, because these variances appear nonidentifiable and a standard deviation of 0.1 is small on a log scale. This device makes computations feasible by replacing sampling from complex full conditionals with simple univariate MH steps.

Define ε, b, and θ_X as before, and define θ_ε = (γ^T, c^T)^T, θ_b = (δ^T, d^T)^T, C_X = [X, Z], C_ε = [X_ε, Z_ε], and C_b = [X_b, Z_b], where X, X_ε, X_b contain the monomials and Z, Z_ε, Z_b contain the truncated polynomials of the spline models for m, log σ²_ε, and log σ²_b, respectively. Also denote

$$
\Sigma_{0X} = \begin{bmatrix} \sigma^2_{0\beta}\,\mathbf{I}_{p+1} & \mathbf{0} \\ \mathbf{0} & \Sigma_b \end{bmatrix},\qquad
\Sigma_{0\varepsilon} = \begin{bmatrix} \sigma^2_{0\gamma}\,\mathbf{I}_{p+1} & \mathbf{0} \\ \mathbf{0} & \sigma^2_c\,\mathbf{I}_{K_\varepsilon} \end{bmatrix},\qquad
\Sigma_{0b} = \begin{bmatrix} \sigma^2_{0\delta}\,\mathbf{I}_{q+1} & \mathbf{0} \\ \mathbf{0} & \sigma^2_d\,\mathbf{I}_{K_b} \end{bmatrix},
$$

where Σ_b = diag{σ²_b(κ_1), …, σ²_b(κ_{K_m})}. The full conditionals of the posterior are detailed below.

1. [θ_X] ~ N(M_X C_X^T Σ_ε^{-1} Y, M_X), where M_X = (C_X^T Σ_ε^{-1} C_X + Σ_{0X}^{-1})^{-1} and Σ_ε = diag{σ²_ε(x_1), …, σ²_ε(x_n)}.

2. [θ_ε] ~ N(M_ε C_ε^T Y_ε / σ²_u, M_ε), where Y_ε = [log(σ²_{ε1}), …, log(σ²_{εn})]^T and M_ε = (C_ε^T C_ε / σ²_u + Σ_{0ε}^{-1})^{-1}.


Figure 9. Computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with K_m knots, the log of the variance of the shrinkage process is modeled as another penalized spline with K_b = 4 subknots, and the log of the error variance is modeled as another penalized spline with K_ε = K_m knots. Different symbols correspond to increasing numbers of knots: dots (K_m = 10), asterisks (K_m = 20), circles (K_m = 30), x's (K_m = 40).

3. [θ_b] ~ N(M_b C_b^T Y_b / σ²_v, M_b), where Y_b = [log(σ²_1), …, log(σ²_{K_m})]^T and M_b = (C_b^T C_b / σ²_v + Σ_{0b}^{-1})^{-1}.

4. [σ²_c] ~ IGamma(a_c + K_ε/2, b_c + ‖c‖²/2).

5. [σ²_d] ~ IGamma(a_d + K_b/2, b_d + ‖d‖²/2).

6. [σ²_{εi}] ∝ σ_{εi}^{-3} exp[ −(y_i − μ_i)²/(2σ²_{εi}) − {log(σ²_{εi}) − η_i}²/(2σ²_u) ], where μ_i and η_i are the ith components of C_X θ_X and C_ε θ_ε, respectively.

7. [σ²_k] ∝ σ_k^{-3} exp[ −b_k²/(2σ²_k) − {log(σ²_k) − ζ_k}²/(2σ²_v) ], where ζ_k is the kth component of C_b θ_b.

All the above conditionals have an explicit form with the exception of the n + Km

one-dimensional conditionals from 6 and 7 For these distributions we use the Metropolis-Hastings algorithm with a normal proposal distribution centered at the current value andsmall variance

10 IMPLEMENTATION AND COMPUTATIONALPERFORMANCE

Our MCMC simulation algorithm provided reliable results and is reasonably fast To il-lustrate this Figure 9 displays computation time (in minutes) versus sample size for 10000

Spatially Adaptive Bayesian Penalized Splines 285

Figure 10 MCMC histories for eight parameters of interest including the log of standard deviation of the errorprocess and log of standard deviation of the shrinkage process as well as some parameters modeling the differentfunctions The model uses 40 equally spaced knots for the mean and error variance functions and 4 subknotsfor the adaptive shrinkage process Chains correspond to the Doppler function example using quadratic truncatedpolynomials to fit the mean and log variance function and linear truncated polynomial basis for fitting the shrinkageprocess For clarity of presentation only every 20th simulated observation is displayed

simulations from the joint posterior distribution given the data The mean function is mod-eled as a penalized spline with Km knots with the log of the variance of the shrinkageprocess modeled as another penalized spline with Kb = 4 subknots The log of the errorvariance is modeled as another penalized spline with Kε = Km knots Different symbolscorrespond to increasing number of knots dots (Km = 10) asterisks (Km = 20) circles(Km = 30) xrsquos (Km = 40) Computation time increases roughly linearly with the numberof observations with the same intercept 0 and larger slopes for larger number of knotsNote that even with n = 1000 observations and Km = Kε = 40 simulation time is abouttwo minutes

For the models and datasets presented in this article 10000 simulations proved suffi-cient for accurate and reliable inference Some parameters such as the mean and variancefunctions tended to exhibit much better mixing properties than the variance componentsσ 2c and σ 2

d controlling the shrinkage processes This is probably due to the reduced amountof information in the data about these parameters The MCMC convergence and mixingproperties were assessed by visual inspection of the chain histories of many parameters ofinterest Since we started with good starting values of the parameters convergence is fastFigure 10 displays the histories of eight parameters of interest for the model fitted to the

286 C M Crainiceanu et al

Doppler function described in Section 5 For clarity we display only every 100th sampledvalued of the chain but the same patterns would be observed in the full chain history Similargood mixing properties have been noted in all the other examples presented in this article

In the search for a good software platform we implemented our algorithms in MATLABC and FORTRAN Programs can be downloaded from wwwbiostatjhsphedu simccrainicFor our application and implementation there were no serious differences between C andFORTRAN which have performed better than MATLAB by reducing computation timetwo to three times The reported times in Figure 9 are for our implementation in C

Given the complexity of the model we were interested in improving the mixing prop-erties of the Markov Chains The computational trick described in Section 9 was the sinequa non method that allowed us to replace an intractable multivariate MH step with severaltractable univariate MH steps We used normal random walk proposal distributions centeredat the current value and variance tuned to provide acceptance rates between 30 and 50

11 COMMENTS

Our joint models for heterogeneous mean and nonconstant variance are designed to ex-tract the maximum amount of information from complex data and suggest simpler paramet-ric models Bayesian inference based on MCMC offers a natural and powerful frameworkfor model fitting However MCMC in this context is not straightforward and implemen-tation depends essentially on the ability of transforming intractable simulations from highdimensional distributions into tractable one dimensional updates as described in Section9 Our implementation provides accurate fast and reproducible results while software isavailable Actual simulation time increases only linearly with the number of observationsand mixing properties are adequate provided that proposal distribution variances are tunedfor optimal acceptance rates

ACKNOWLEDGMENTS

Carroll's research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

[Received September 2004. Revised October 2006.]

REFERENCES

Baladandayuthapani, V., Mallick, B. K., and Carroll, R. J. (2005), "Spatially Adaptive Bayesian Regression Splines," Journal of Computational and Graphical Statistics, 14, 378–394.

Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computation and Stochastic Systems" (with discussion), Statistical Science, 10, 3–66.

Carroll, R. J. (2003), "Variances Are Not Always Nuisance Parameters: The 2002 R. A. Fisher Lecture," Biometrics, 59, 211–220.

Carroll, R. J., and Ruppert, D. (1988), Transformation and Weighting in Regression, New York: Chapman and Hall.

Cleveland, W. S., and Grosse, E. (1991), "Computational Methods for Local Regression," Statistics and Computing, 1, 47–62.

Crainiceanu, C., and Ruppert, D. (2004), "Likelihood Ratio Tests in Linear Mixed Models With One Variance Component," Journal of the Royal Statistical Society, Series B, 66, 165–185.

Crainiceanu, C. M., Ruppert, D., and Wand, M. P. (2005), "Bayesian Analysis for Penalized Spline Regression Using WinBUGS," Journal of Statistical Software, 14. Available online at http://www.jstatsoft.org/v14/i14.

Donoho, D. L., and Johnstone, I. M. (1994), "Ideal Spatial Adaptation by Wavelet Shrinkage," Biometrika, 81, 425–455.

Eilers, P. H. C., and Marx, B. D. (1996), "Flexible Smoothing With B-Splines and Penalties" (with discussion), Statistical Science, 11, 89–121.

——— (2002), "Generalized Linear Additive Smooth Structures," Journal of Computational and Graphical Statistics, 11, 758–783.

——— (2004), "Splines, Knots, and Penalties," unpublished manuscript.

Fields Development Team (2006), fields: Tools for Spatial Data, Boulder, CO: National Center for Atmospheric Research. Available online at http://www.cgd.ucar.edu/Software/Fields.

Friedman, J. H. (1991), "Multivariate Adaptive Regression Splines" (with discussion), The Annals of Statistics, 19, 1–141.

Gelman, A. (2006), "Prior Distributions for Variance Parameters in Hierarchical Models," Bayesian Analysis, 1, 515–533.

Gray, R. J. (1994), "Spline-Based Tests in Survival Analysis," Biometrics, 50, 640–652.

Holst, U., Hössjer, O., Björklund, C., Ragnarson, P., and Edner, H. (1996), "Locally Weighted Least Squares Kernel Regression and Statistical Evaluation of LIDAR Measurements," Environmetrics, 7, 401–416.

Jullion, A., and Lambert, P. (2006), Discussion Paper DP0534, UCL.

Kammann, E. E., and Wand, M. P. (2003), "Geoadditive Models," Applied Statistics, 52, 1–18.

Kauermann, G. (2002), "A Note on Bandwidth Selection for Penalised Spline Smoothing," Technical Report 02-13, Department of Statistics, University of Glasgow.

Krivobokova, T., Crainiceanu, C. M., and Kauermann, G. (2007), "Fast Adaptive Penalized Splines," Journal of Computational and Graphical Statistics, to appear.

Lang, S., and Brezger, A. (2004), "Bayesian P-Splines," Journal of Computational and Graphical Statistics, 13, 183–212.

Lang, S., Fronk, E. M., and Fahrmeir, L. (2002), "Function Estimation with Locally Adaptive Dynamic Models," Computational Statistics, 17, 479–499.

Marx, B. D., and Eilers, P. H. C. (1998), "Direct Generalized Additive Modeling With Penalized Likelihood," Computational Statistics & Data Analysis, 28, 193–209.

Natarajan, R., and Kass, R. E. (2000), "Reference Bayesian Methods for Generalized Linear Mixed Models," Journal of the American Statistical Association, 95, 227–237.

Ngo, L., and Wand, M. P. (2004), "Smoothing With Mixed Model Software," Journal of Statistical Software, 9. Available online at http://www.jstatsoft.org/v09/i01.

Nychka, D., Haaland, P., O'Connell, M., and Ellner, S. (1998), "FUNFITS: Data Analysis and Statistical Tools for Estimating Functions," in Case Studies in Environmental Statistics, Lecture Notes in Statistics, vol. 132, eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 159–179. Available online at http://www.cgd.ucar.edu/stats/Software/Funfits.

Nychka, D., and Saltzman, N. (1998), "Design of Air Quality Monitoring Networks," in Case Studies in Environmental Statistics, Lecture Notes in Statistics, vol. 132, eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 51–76.

Pintore, A., Speckman, P., and Holmes, C. C. (2006), "Spatially Adaptive Smoothing Splines," Biometrika, 93, 113–125.

Ruppert, D. (1997), "Local Polynomial Regression and its Applications in Environmental Statistics," in Statistics for the Environment (vol. 3), eds. V. Barnett and F. Turkman, Chichester: Wiley, pp. 155–173.

——— (2002), "Selecting the Number of Knots for Penalized Splines," Journal of Computational and Graphical Statistics, 11, 735–757.

Ruppert, D., and Carroll, R. J. (2000), "Spatially-Adaptive Penalties for Spline Fitting," Australian and New Zealand Journal of Statistics, 42, 205–223.

Ruppert, D., Wand, M. P., Holst, U., and Hössjer, O. (1997), "Local Polynomial Variance-Function Estimation," Technometrics, 39, 262–273.

Ruppert, D., Wand, M. P., and Carroll, R. J. (2003), Semiparametric Regression, Cambridge: Cambridge University Press.

Smith, M., and Kohn, R. (1997), "A Bayesian Approach to Nonparametric Bivariate Regression," Journal of the American Statistical Association, 92, 1522–1535.

Wood, S. (2006), Generalized Additive Models: An Introduction with R, New York: Chapman & Hall/CRC.


coverage probabilities.

In the interval [0.55, 0.65] the homoscedastic method slightly outperforms the heteroscedastic method. This seems to be the effect of a "lucky" combination of two factors. As we discussed, the size of the credible intervals for the homoscedastic method in a neighborhood of 0.5 is much larger than nominal. However, at the same point the mean function changes from a constant to a rapidly oscillating function, and the coverage probabilities drop at roughly the same rate. In the interval [0.8, 1] one can notice a phenomenon very similar to the one described for the interval [0, 0.5]. While the function oscillates more rapidly in this region, the heteroscedastic adaptive method produces credible intervals with coverage probabilities close to the nominal 95%. However, the homoscedastic method does not take into account the increased variability and produces credible intervals that are too short.

For these examples, both methods performed roughly similarly on simulated datasets that did not require variance estimation. However, when the error variance is not constant, the method using log-penalized splines to estimate the error variance considerably outperforms the method that does not.

6 LOW RANK MULTIVARIATE SMOOTHING

In this section we generalize the ideas in Section 2 to multivariate smoothing, while preserving the appealing geometric interpretation of the smoother. Consider the regression y_i = m(x_i) + ε_i, where m(·) is a smooth function of L covariates. We will use radial basis functions, which have the advantage of being more numerically stable than the tensor product of truncated polynomials in more than one dimension. Suppose that x_i ∈ R^L, 1 ≤ i ≤ n, are n vectors of covariates and κ_k ∈ R^L, 1 ≤ k ≤ K_m, are K_m knots. Consider the following distance function:

  C(r) = ‖r‖^{2M−L} for L odd,
  C(r) = ‖r‖^{2M−L} log ‖r‖ for L even,

where ‖·‖ denotes the Euclidean norm in R^L and the integer M controls the smoothness of C(·). Let X be the matrix with ith row X_i = [1, x_i^T], let Z_{K_m} = [C(‖x_i − κ_k‖)]_{1≤i≤n, 1≤k≤K_m} and Ω_{K_m} = [C(‖κ_k − κ_{k′}‖)]_{1≤k≤K_m, 1≤k′≤K_m}, and define Z = Z_{K_m} Ω_{K_m}^{−1/2}, where Ω_{K_m}^{−1/2} is the principal square root of Ω_{K_m}. With these notations, the low-rank approximation of thin-plate spline regression can be obtained as the BLUP in the LMM (Kammann and Wand 2003; Ruppert, Wand, and Carroll 2003)

  Y = Xβ + Zb + ε,  E{(b^T, ε^T)^T} = 0,  cov{(b^T, ε^T)^T} = blockdiag(σ_b² I_{K_m}, σ_ε² I_n).   (6.1)
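The radial basis construction above can be sketched numerically. The following is an illustrative Python sketch (function names are ours, and we assume L = 2 and M = 2, so C(r) = ‖r‖² log ‖r‖; because this kernel matrix can be indefinite, we use a pragmatic square root built from absolute eigenvalues, an assumption of this sketch):

```python
import numpy as np

def tps_kernel(r, M=2):
    """C(r) = ||r||^(2M-2) * log||r|| for L = 2 covariates (L even case)."""
    out = np.zeros_like(r)
    nz = r > 0
    out[nz] = r[nz] ** (2 * M - 2) * np.log(r[nz])
    return out

def radial_basis(x, knots, M=2):
    """Return (X, Z) with Z = Z_Km @ Omega_Km^(-1/2)."""
    n = x.shape[0]
    X = np.column_stack([np.ones(n), x])          # ith row is [1, x_i^T]
    ZK = tps_kernel(np.linalg.norm(x[:, None, :] - knots[None, :, :], axis=2), M)
    Omega = tps_kernel(np.linalg.norm(knots[:, None, :] - knots[None, :, :], axis=2), M)
    # "principal square root": Omega may be indefinite for this kernel, so we
    # fall back to |eigenvalues| as a stand-in for the matrix square root
    w, V = np.linalg.eigh(Omega)
    Omega_half = V @ np.diag(np.sqrt(np.abs(w))) @ V.T
    return X, ZK @ np.linalg.inv(Omega_half)

rng = np.random.default_rng(0)
x = rng.uniform(size=(50, 2))       # n = 50 bivariate covariates
knots = rng.uniform(size=(10, 2))   # Km = 10 knots
X, Z = radial_basis(x, knots)
print(X.shape, Z.shape)             # (50, 3) (50, 10)
```

The transformation by Omega_Km^(-1/2) makes the simple penalty ‖b‖² correspond to the thin-plate spline penalty, so the model can be fit as the LMM in (6.1).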

This model contains only one variance component, σ_b², for controlling the shrinkage of b, which is equivalent to one global smoothing parameter λ = σ_b²/σ_ε², and it implicitly assumes homoscedastic errors. To relax these assumptions we consider a new set of knots κ*_1, …, κ*_{K_b} and define X* and Z* similarly, with the corresponding definitions of the matrices X and Z where the x-covariates are replaced by the knots κ_k and the knots are replaced by the subknots κ*_k. Consider the following model for the response:

  y_i = β_0 + Σ_{j=1}^{L} β_j x_{ij} + Σ_{k=1}^{K_m} b_k z_{ik} + ε_i,
  ε_i ∼ N{0, σ_ε²(x_i)},  i = 1, …, n,
  b_k ∼ N{0, σ_b²(κ_k)},  k = 1, …, K_m,   (6.2)

where the error variance σ_ε²(x_i) is modeled as a low-rank log-thin-plate spline,

  log σ_ε²(x_i) = γ_0 + Σ_{j=1}^{L} γ_j x_{ij} + Σ_{k=1}^{K_ε} c_k z_{ik},  c_k ∼ N(0, σ_c²),  k = 1, …, K_ε,   (6.3)

and the random-coefficient variance σ_b²(κ_k) is modeled as another low-rank log-thin-plate spline,

  log σ_b²(κ_k) = δ_0 + Σ_{j=1}^{L} δ_j x*_{kj} + Σ_{j=1}^{K_b} d_j z*_{kj},  d_j ∼ N(0, σ_d²),  j = 1, …, K_b,   (6.4)

where x*_{kj} and z*_{kj} are the entries of the X* and Z* matrices, respectively, and the c_k and d_j are assumed mutually independent.

In the case of low-rank smoothers, the set of knots for the covariates and the subknots for modeling the shrinkage process have to be chosen. One possibility is to use equally spaced knots and subknots. We use this approach in our simulation study in Section 7 to make our method comparable, from this point of view, with previously published methods. Another possibility is to select the knots and subknots using a space-filling design (Nychka and Saltzman 1998), which is based on the maximal separation principle. This avoids wasting knots and is likely to lead to better approximations in sparse regions of the data. The cover.design() function from the R package fields (Fields Development Team 2006) provides software for space-filling knot selection. We use this approach in Section 8, in the Noshiro elevation example.
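Since cover.design() is an R function, a rough stand-in for the maximal-separation idea is greedy farthest-point selection. The sketch below is our own illustration, not the algorithm used by the fields package:

```python
import numpy as np

def farthest_point_knots(x, K, seed=0):
    """Greedily pick K knots from the rows of x, each new knot maximizing its
    distance to the knots already chosen (a crude space-filling design)."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(x)))]            # arbitrary starting point
    d = np.linalg.norm(x - x[idx[0]], axis=1)    # distance to nearest chosen knot
    for _ in range(K - 1):
        j = int(np.argmax(d))                    # farthest remaining point
        idx.append(j)
        d = np.minimum(d, np.linalg.norm(x - x[j], axis=1))
    return x[idx]

rng = np.random.default_rng(1)
x = rng.uniform(size=(799, 2))        # e.g., 799 locations, as in the Noshiro data
knots = farthest_point_knots(x, 100)
print(knots.shape)                    # (100, 2)
```

Selected knots are spread out over the data cloud, which avoids wasting knots in dense regions.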

7 COMPARISONS WITH OTHER ADAPTIVE SURFACE FITTING METHODS

Lang and Brezger (2004) compared their adaptive surface smoothing method with several methods used in a simulation study by Smith and Kohn (1997): MARS (Friedman 1991), "locfit" (Cleveland and Grosse 1991), "tps" (bivariate cubic thin-plate splines with a single smoothing parameter), tensor-product cubic smoothing splines with five smoothing parameters, and a parametric linear interaction model. We will use the following regression functions, also used by Lang and Brezger:

• m1(x1, x2) = x1 sin(4πx2), where x1 and x2 are distributed independently uniform on [0, 1];

• m2(x1, x2) = 1.5 exp(−8x1²) + 3.5 exp(−8x2²), where x1 and x2 are distributed independently normal with mean 0.5 and variance 0.1.


Function m1(·) contains complex interactions and moderate spatial variability, while m2(·) has only main effects and is much smoother. We used a sample size n = 300, σ = range(f)/4, and 250 simulations from each model.
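As a concrete setup, the two test functions and the noise level σ = range(f)/4 can be generated as follows (a sketch; variable names are ours, and range(f) is taken over the sampled design points):

```python
import numpy as np

def m1(x1, x2):
    """Complex interaction, moderate spatial variability."""
    return x1 * np.sin(4 * np.pi * x2)

def m2(x1, x2):
    """Main effects only, much smoother."""
    return 1.5 * np.exp(-8 * x1**2) + 3.5 * np.exp(-8 * x2**2)

rng = np.random.default_rng(2)
n = 300
x1, x2 = rng.uniform(size=n), rng.uniform(size=n)   # covariates for m1
f = m1(x1, x2)
sigma = (f.max() - f.min()) / 4                     # sigma = range(f)/4
y = f + rng.normal(scale=sigma, size=n)             # noisy responses
print(y.shape)                                      # (300,)
```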

We will compare the performance of our method with that of Lang and Brezger's, which performed at least as well as, or better than, other adaptive methods in a variety of examples. To avoid dependence of the results on knot choice, we used the same choice as the one described by Lang and Brezger: 12 × 12 equally spaced knots for the mean function. However, we used thin-plate splines instead of tensor products of B-splines. We modeled the smoothing parameter using a low-rank log-thin-plate spline with an equally spaced 5 × 5 knot grid. We considered two estimators: one with constant error variance (CEV) and one with estimated error variance using a log-thin-plate spline (EEV) with the same 12 × 12 knot grid as for the mean function. The CEV estimator is the one that can be directly compared with the estimators considered by Lang and Brezger (2004). Because Lang and Brezger presented only comparative results using boxplots, we will use their results as read from their Figure 5. The performance of all estimators was measured by the log of the mean squared error, log(MSE), where MSE(f̂) = (1/n) Σ_{i=1}^{n} {f̂(x_i) − f(x_i)}².
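The performance metric can be computed directly; a minimal sketch (the function name is ours):

```python
import numpy as np

def log_mse(fhat, f):
    """log MSE(fhat) = log( (1/n) * sum_i {fhat(x_i) - f(x_i)}^2 )."""
    fhat, f = np.asarray(fhat), np.asarray(f)
    return np.log(np.mean((fhat - f) ** 2))

# sanity check: a constant error of 0.05 gives log(0.05^2) = log(0.0025)
f = np.linspace(0, 1, 100)
print(round(log_mse(f + 0.05, f), 2))   # -5.99
```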

For the CEV estimator and function f1(·), corresponding to moderate spatial heterogeneity, we obtained a median log(MSE) of −3.67 with an interquartile range of [−3.80, −3.53] and a range of [−4.21, −3.13]. These values are better than the best values reported for Lang and Brezger's methods in their Figure 5(b). Lang and Brezger reported that their method, denoted "P-spline," outperforms all other methods in this case and achieved a median log(MSE) of −3.50. Remarkably, the EEV estimator performed almost as well as the CEV estimator in this case and also outperformed the other methods. The median log(MSE) for EEV was −3.59, with an interquartile range of [−3.74, −3.43] and a range of [−4.14, −3.02]. These results are consistent with univariate results reported by Ruppert and Carroll (2000) and Baladandayuthapani, Mallick, and Carroll (2005), who found that allowing the smoothing parameter to vary smoothly outperforms other adaptive techniques.

For function f1(·), Figure 4 displays the coverage probabilities for the 95% credible intervals calculated over a 20 × 20 equally spaced grid in [0, 1]². For each grid point, we calculated the frequency with which the 95% pointwise credible interval covers the true value of the function at that point. As expected, these coverage probabilities show strong spatial correlation, with lower coverage probabilities along the ridges of the sinus function. Coverage probability is lowest when x1 is in the [0.2, 0.5] range. The signal-to-noise ratio in this region is about half the signal-to-noise ratio in the region corresponding to x1 close to 1. This explains why the coverage probability is smaller in this region. Interestingly, the zeros of the true function, appearing at x2 = 0.25, 0.50, 0.75, are covered with at least 95% probability. Another interesting feature is the lower coverage probabilities near the north, south, and east boundaries. These features of the pointwise coverage probability map are consistent with the frequentist analysis of Krivobokova et al. (2007), but are not consistent with the coverage probabilities reported by Lang and Brezger for their Bayesian P-spline method.

For our CEV estimator and function f2(·), corresponding to very low spatial heterogeneity, we obtained a median log(MSE) of −5.95 with an interquartile range of [−6.12, −5.66] and a range of [−6.51, −5.41]. The results of Lang and Brezger in their Figure 5(a) indicate that their "P-spline" method achieved a median log(MSE) of −6.00. Our method performs roughly similarly to the Bayesian P-spline method of Lang and Brezger and to the cubic thin-plate spline (tps), being outperformed only by Lang and Brezger's adaptive P-spline method with two main effects, "P-spline main." We could also add two main effects to our bivariate adaptive smoother to improve MSE for this example, but this is not our concern in this article. Again, the EEV estimator performed very similarly to our CEV estimator.

Figure 4. Coverage probability of the pointwise 95% credible intervals for the mean function f1(x1, x2) = x1 sin(4πx2) with a constant error standard deviation σ = range(f)/4. Probabilities are computed on a 20 × 20 equally spaced grid of points in [0, 1]². The model used was a thin-plate spline with K = 100 knots for the mean function, a log-thin-plate spline with K = 100 knots for the variance function, and a log-thin-plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

The last simulation study was done using the function f1(·, ·), where x1, x2 are independent uniformly distributed in [0, 1], with an error standard deviation function

  σ_ε(x1, x2) = r/32 + (3r/32) x1²,

where r = range(f1). We compared our CEV and EEV estimators described at the beginning of this section and, as expected, the EEV estimator outperformed the CEV both in terms of log(MSE) and coverage probabilities. More precisely, for log(MSE) we obtained a median of −5.26 for CEV and −5.35 for EEV, an interquartile range of [−5.37, −5.15] for CEV and [−5.48, −5.24] for EEV, and a range of [−5.72, −4.65] for CEV and [−5.84, −4.84] for EEV. For both estimators, the general patterns for coverage probabilities were the ones displayed in Figure 4. EEV outperformed CEV in terms of coverage probabilities. For example, the regions where the nominal level is exceeded for both estimators have a roughly similar shape and location, but the EEV coverage probabilities tend to be closer to their nominal level.

Figure 5. Locations of sampling (dots) with 100 knots (circles) in the Noshiro example.

Note that our model incorporating nonparametric variance estimation is more general than the models considered in the literature. However, its performance for function m1(·, ·), with moderate spatial variability and constant variance, was better than that of all methods considered by Lang and Brezger in their simulation study (see their Figure 5, panel b). Our method has good performance for the function m2(·, ·), being the second best when compared to the methods considered by Lang and Brezger (see their Figure 5, panel a). For the case when the variance is not constant, our adaptive method that estimates the variance nonparametrically outperformed our adaptive method that estimates a constant variance.

8 THE NOSHIRO EXAMPLE

Noshiro, Japan, was the site of a major earthquake. Much of the damage from the quake was due to soil movement. Professor Thomas O'Rourke, a geotechnical engineer, was investigating the factors that might help predict soil movement during a future quake. One factor thought to be of importance was the slope of the land. To estimate the slope, Ruppert (1997) used a dataset with 799 observations, where x was longitude and latitude and y was elevation at locations in Noshiro. The object of primary interest is the gradient, at least its magnitude and possibly its direction.

We used the thin-plate spline model (6.2) for the mean of y, with Km = 100 knots selected by the maximum separation criterion implemented in the function cover.design() of the R package fields (Nychka, Haaland, O'Connell, and Ellner 1998). The sampling locations and the knots used for fitting are displayed in Figure 5. The log-thin-plate spline described in (6.3), with the same Kε = 100 knots, was used to model the variance of the error process, σ_ε²(x_i). The log-thin-plate spline described in (6.4), with Kb = 16 subknots, was used to model the variance σ_b²(κ_k), which controls the amount of shrinkage of the b parameters. The subknots were obtained using the cover.design() function applied to the 100 knots.

Figure 6. Posterior mean of the mean regression function for the Noshiro example. The model used was a thin-plate spline with K = 100 knots for the mean function, a log-thin-plate spline with K = 100 knots for the variance function, and a log-thin-plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

Figure 6 displays the posterior mean of the mean response function. There is a rather sharp peak around (0.42, 0.4), with a slow decay from the peak in the general south direction and a much faster decay in all other directions. The function becomes smoother towards the boundary. This is exactly the type of function for which adaptive spatial smoothing can substantially improve the fit.

Figure 7 shows that in large areas of the map σ_ε(·) is smooth and takes small values. However, in a neighborhood of the peak of the mean function, σ_ε(·) displays two relatively sharp maxima, located NNW and SSE of the peak. We also performed a two-stage frequentist analysis. In the first stage, we used a low-rank thin-plate spline for the mean function, and the results from this regression were used to obtain residuals. We then fitted another low-rank thin-plate spline to the absolute values of these residuals. While this map was not identical to the one in Figure 7, it did exhibit the same patterns of heteroscedasticity.

The process log σ_b²(κ_k), controlling the shrinkage of the b_k, is displayed in Figure 8.

Figure 7. Posterior mean of the standard deviation σ_ε(x_i) of the error process for the Noshiro example. The model used was a thin-plate spline with K = 100 knots for the mean function, a log-thin-plate spline with K = 100 knots for the variance function, and a log-thin-plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

Figure 8. Posterior mean of the shrinkage process −log σ_b²(κ_k). The model used was a thin-plate spline with K = 100 knots for the mean function, a log-thin-plate spline with K = 100 knots for the variance function, and a log-thin-plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

Smaller values of this function correspond to less shrinkage of the b_k towards zero and more local behavior of the smoother. Figure 8 indicates that in a neighborhood of the peak of the mean function the shrinkage is smaller, to allow the mean function to change rapidly. Away from the peak the shrinkage is larger, corresponding to a smoother mean.

9 IMPLEMENTATION USING MCMC

We will now provide some details for MCMC simulations of model (2.1), where the variances σ_ε²(x_i) and σ_b²(κ_k) are modeled by Equations (2.3) and (2.4). The implementation for other models, for example for the multivariate smoothing in Section 6, is similar. Consider independent normal priors for the coefficients of the monomials, β_i ∼ N(0, σ²_{0β}), γ_i ∼ N(0, σ²_{0γ}), i = 0, …, p, and δ_i ∼ N(0, σ²_{0δ}), i = 0, …, q, and independent inverse Gamma priors for the variance components, σ_c² ∼ IGamma(a_c, b_c) and σ_d² ∼ IGamma(a_d, b_d). Using these priors, many full conditionals of the posterior distribution are easy to derive, while a few have complex multivariate forms. Our implementation of the MCMC using multivariate Metropolis–Hastings steps proved to be unstable, with poor mixing properties. A simple and reliable solution was to change the model by adding error terms to the log-spline models, that is,

  log σ_ε²(x_i) = γ_0 + ··· + γ_p x_i^p + Σ_{s=1}^{K_ε} c_s (x_i − κ^ε_s)_+^p + u_i,
  log σ_b²(κ^m_k) = δ_0 + ··· + δ_q (κ^m_k)^q + Σ_{s=1}^{K_b} d_s (κ^m_k − κ^b_s)_+^q + v_k,   (9.1)

where u_i ∼ N(0, σ_u²) and v_k ∼ N(0, σ_v²). This idea was also used for σ_b² by Baladandayuthapani, Mallick, and Carroll (2005). We fixed the values σ_u² = σ_v² = 0.01, as these variances appear nonidentifiable, and a standard deviation of 0.1 is small on a log-scale. This device makes computations feasible by replacing sampling from complex full conditionals by simple univariate MH steps.
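The design matrices implied by the truncated-polynomial models in (9.1) can be assembled as follows (an illustrative sketch; the function name and the knot choices are ours):

```python
import numpy as np

def truncated_power_basis(x, knots, degree):
    """Return (X, Z): X has columns 1, x, ..., x^degree; Z has one column
    (x - kappa_s)_+^degree per knot, as in the log-spline models (9.1)."""
    x = np.asarray(x, dtype=float)
    X = np.vander(x, degree + 1, increasing=True)
    Z = np.maximum(x[:, None] - knots[None, :], 0.0) ** degree
    return X, Z

x = np.linspace(0, 1, 101)
knots = np.linspace(0.1, 0.9, 9)                     # K = 9 interior knots
Xe, Ze = truncated_power_basis(x, knots, degree=2)   # quadratic, as for log sigma_eps^2
C_eps = np.hstack([Xe, Ze])                          # C_eps = [X_eps, Z_eps]
print(C_eps.shape)                                   # (101, 12)
```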

Define, as before, Σ_ε, Σ_b, and θ_X, and define θ_ε = (γ^T, c^T)^T, θ_b = (δ^T, d^T)^T, C_X = [X, Z], C_ε = [X_ε, Z_ε], and C_b = [X_b, Z_b], where X, X_ε, X_b contain the monomials and Z, Z_ε, Z_b contain the truncated polynomials of the spline models for m, log σ_ε², and log σ_b², respectively. Also denote the block-diagonal prior covariance matrices

  Σ_{0X} = blockdiag(σ²_{0β} I_{p+1}, Σ_b),  Σ_{0ε} = blockdiag(σ²_{0γ} I_{p+1}, σ_c² I_{K_ε}),  Σ_{0b} = blockdiag(σ²_{0δ} I_{q+1}, σ_d² I_{K_m}).

The full conditionals of the posterior are detailed below.

1. [θ_X] ∼ N(M_X C_X^T Σ_ε^{−1} Y, M_X), where M_X = (C_X^T Σ_ε^{−1} C_X + Σ_{0X}^{−1})^{−1}.

2. [θ_ε] ∼ N(M_ε C_ε^T Y_ε / σ_u², M_ε), where Y_ε = [log(σ²_{ε1}), …, log(σ²_{εn})]^T and M_ε = (C_ε^T C_ε / σ_u² + Σ_{0ε}^{−1})^{−1}.

Figure 9. Computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with Km knots, with the log of the variance of the shrinkage process modeled as another penalized spline with Kb = 4 subknots. The log of the error variance is modeled as another penalized spline with Kε = Km knots. Different symbols correspond to increasing numbers of knots: dots (Km = 10), asterisks (Km = 20), circles (Km = 30), x's (Km = 40).

3. [θ_b] ∼ N(M_b C_b^T Y_b / σ_v², M_b), where Y_b = [log(σ_1²), …, log(σ²_{K_m})]^T and M_b = (C_b^T C_b / σ_v² + Σ_{0b}^{−1})^{−1}.

4. [σ_c²] ∼ IGamma(a_c + K_ε/2, b_c + ‖c‖²/2).

5. [σ_d²] ∼ IGamma(a_d + K_b/2, b_d + ‖d‖²/2).

6. [σ²_{εi}] ∝ σ_{εi}^{−3} exp[−(y_i − μ_i)²/(2σ²_{εi}) − {log(σ²_{εi}) − η_i}²/(2σ_u²)], where μ_i and η_i are the ith components of C_X θ_X and C_ε θ_ε, respectively.

7. [σ_k²] ∝ σ_k^{−3} exp[−b_k²/(2σ_k²) − {log(σ_k²) − ζ_k}²/(2σ_v²)], where ζ_k is the kth component of C_b θ_b.

All the above conditionals have an explicit form, with the exception of the n + K_m one-dimensional conditionals from 6 and 7. For these distributions we use the Metropolis–Hastings algorithm with a normal proposal distribution centered at the current value and with small variance.
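A minimal sketch of the univariate Metropolis–Hastings update in conditional 6: here we propose on z = log(σ²_{εi}) rather than on the variance itself, which folds the Jacobian into the target density; all names, the fixed inputs, and the proposal standard deviation are our assumptions:

```python
import numpy as np

def log_target(z, y, mu, eta, sig2_u):
    """Log full conditional of z = log(sigma_eps_i^2) from conditional 6, after the
    change of variables sigma^2 = exp(z): the sigma^(-3) factor and the Jacobian
    exp(z) combine into the leading -z/2 term."""
    return -0.5 * z - (y - mu) ** 2 / (2.0 * np.exp(z)) - (z - eta) ** 2 / (2.0 * sig2_u)

def mh_update(z, y, mu, eta, sig2_u, prop_sd, rng):
    """One random-walk MH step with a normal proposal centered at the current value."""
    z_new = z + rng.normal(scale=prop_sd)
    log_ratio = log_target(z_new, y, mu, eta, sig2_u) - log_target(z, y, mu, eta, sig2_u)
    return z_new if np.log(rng.uniform()) < log_ratio else z

rng = np.random.default_rng(3)
y, mu, eta, sig2_u = 1.0, 0.4, np.log(0.25), 0.01   # toy values for one observation
z = eta                                  # start at the log-spline fit
draws = []
for _ in range(2000):
    z = mh_update(z, y, mu, eta, sig2_u, prop_sd=0.15, rng=rng)
    draws.append(np.exp(z))              # back to the variance scale
print(len(draws))                        # 2000
```

One such univariate step per observation (and per knot, for conditional 7) replaces the unstable multivariate MH update.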

10 IMPLEMENTATION AND COMPUTATIONAL PERFORMANCE

Our MCMC simulation algorithm provided reliable results and is reasonably fast. To illustrate this, Figure 9 displays computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with Km knots, with the log of the variance of the shrinkage process modeled as another penalized spline with Kb = 4 subknots. The log of the error variance is modeled as another penalized spline with Kε = Km knots. Different symbols correspond to increasing numbers of knots: dots (Km = 10), asterisks (Km = 20), circles (Km = 30), x's (Km = 40). Computation time increases roughly linearly with the number of observations, with the same intercept 0 and larger slopes for larger numbers of knots. Note that even with n = 1000 observations and Km = Kε = 40, simulation time is about two minutes.

Figure 10. MCMC histories for eight parameters of interest, including the log of the standard deviation of the error process and the log of the standard deviation of the shrinkage process, as well as some parameters modeling the different functions. The model uses 40 equally spaced knots for the mean and error variance functions and 4 subknots for the adaptive shrinkage process. Chains correspond to the Doppler function example, using quadratic truncated polynomials to fit the mean and log-variance functions and a linear truncated polynomial basis for fitting the shrinkage process. For clarity of presentation, only every 20th simulated observation is displayed.

For the models and datasets presented in this article, 10,000 simulations proved sufficient for accurate and reliable inference. Some parameters, such as the mean and variance functions, tended to exhibit much better mixing properties than the variance components σ_c² and σ_d² controlling the shrinkage processes. This is probably due to the reduced amount of information in the data about these parameters. The MCMC convergence and mixing properties were assessed by visual inspection of the chain histories of many parameters of interest. Since we started with good starting values of the parameters, convergence is fast. Figure 10 displays the histories of eight parameters of interest for the model fitted to the Doppler function described in Section 5. For clarity we display only every 100th sampled value of the chain, but the same patterns would be observed in the full chain history. Similar good mixing properties have been noted in all the other examples presented in this article.

In the search for a good software platform we implemented our algorithms in MATLABC and FORTRAN Programs can be downloaded from wwwbiostatjhsphedu simccrainicFor our application and implementation there were no serious differences between C andFORTRAN which have performed better than MATLAB by reducing computation timetwo to three times The reported times in Figure 9 are for our implementation in C

Given the complexity of the model we were interested in improving the mixing prop-erties of the Markov Chains The computational trick described in Section 9 was the sinequa non method that allowed us to replace an intractable multivariate MH step with severaltractable univariate MH steps We used normal random walk proposal distributions centeredat the current value and variance tuned to provide acceptance rates between 30 and 50

11 COMMENTS

Our joint models for heterogeneous mean and nonconstant variance are designed to ex-tract the maximum amount of information from complex data and suggest simpler paramet-ric models Bayesian inference based on MCMC offers a natural and powerful frameworkfor model fitting However MCMC in this context is not straightforward and implemen-tation depends essentially on the ability of transforming intractable simulations from highdimensional distributions into tractable one dimensional updates as described in Section9 Our implementation provides accurate fast and reproducible results while software isavailable Actual simulation time increases only linearly with the number of observationsand mixing properties are adequate provided that proposal distribution variances are tunedfor optimal acceptance rates

ACKNOWLEDGMENTS

Carrollrsquos research was supported by a grant from the National Cancer Institute (CA-57030) and by the TexasAampM Center for Environmental and Rural Health via a grant from the National Institute of Environmental HealthSciences (P30-ES09106)

[Received September 2004 Revised October 2006]

REFERENCES

Baladandayuthapani V Mallick B K and Carroll R J (2005) ldquoSpatially Adaptive Bayesian RegressionSplinesrdquo Journal of Computational and Graphical Statistics 14 378ndash394

Besag J Green P Higdon D and Mengersen K (1995) ldquoBayesian Computation and Stochastic Systemsrdquo(with discussions) Statistical Science 10 3ndash66

Carroll R J (2003) ldquoVariances Are Not Always Nuisance Parameters The 2002 R A Fisher Lecturerdquo Biometrics59 211ndash220

Carroll R J and Ruppert D (1988) Transformation and Weighting in Regression New York Chapman andHall

Spatially Adaptive Bayesian Penalized Splines 287

Cleveland W S and Grosse E (1991) ldquoComputational Methods for Local Regressionrdquo Statistics and Computing1 47ndash62

Crainiceanu C and Ruppert D (2004) ldquoLikelihood Ratio Tests in Linear Mixed Models With One VarianceComponentrdquo Journal of the Royal Statistical Society Series B 66 165ndash185

Crainiceanu C M Ruppert D and Wand M P (2005) ldquoBayesian Analysis for Penalized Spline Regressionusing WinBUGSrdquo Journal of Statistical Software 14 Available online at http wwwjstatsoftorg v14 i14

Donoho D L and Johnstone I M (1994) ldquoIdeal Spatial Adaptation by Wavelet Shrinkagerdquo Biometrika 81425ndash455

Eilers P H C and Marx B D (1996) ldquoFlexible Smoothing With B-Splines and Penaltiesrdquo (with discussion)Statistical Science 11 89ndash121

(2002) ldquoGeneralized Linear Additive Smooth Structuresrdquo Journal of Computational and Graphical Statis-tics 11 758ndash783

(2004) ldquoSplines Knots and Penaltiesrdquo unpublished manuscript

Fields Development Team (2006) fields Tools for Spatial Data National Center for Atmospheric ResearchBoulder CO Available online at http wwwcgducaredu Software Fields

Friedman J H (1991) ldquoMultivariate Adaptive Regression Splinesrdquo (with discussion) The Annals of Statistics19 1ndash141

Gelman A (2006) ldquoPrior Distributions for Variance Parameters in Hierarchical Modelsrdquo Bayesian Analysis 1515ndash533

Gray R J (1994) ldquoSpline-Based Tests in Survival Analysisrdquo Biometrics 50 640ndash652

Holst U Hossjer O Bjorklund C Ragnarson P and Edner H (1996) ldquoLocally Weighted Least SquaresKernel Regression and Statistical Evaluation of LIDAR Measurementsrdquo Environmetrics 7 401ndash416

Jullion A and Lambert P (2006) Discussion paper DP0534 UCL

Kammann E E and Wand M P (2003) ldquoGeoadditive Modelsrdquo Applied Statistics 52 1ndash18

Kauermann G (2002) ldquoA Note on Bandwidth Selection for Penalised Spline Smoothingrdquo Technical Report 02-13Department of Statistics University of Glasgow

Krivobokova T Crainiceanu C M and Kauermann G (2007) ldquoFast Adaptive Penalized Splinesrdquo Journal ofComputational and Graphical Statistics to appear

Lang S and Bretzger A (2004) ldquoBayesian P-Splinesrdquo Journal of Compuational and Graphical Statistics 13183ndash212

Lang S Fronk E M and Fahrmeir L (2002) ldquoFunction Estimation with Locally Adaptive Dynamic ModelsrdquoComputational Statistics 17 479ndash499

Marx B D and Eilers PHC (1998) ldquoDirect Generalized Additive Modeling With Penalized LikelihoodrdquoComputational Statistics amp Data Analysis 28 193ndash209

Natarajan R and Kass R E (2000) ldquoReference Bayesian Methods for Generalized Linear Mixed ModelsrdquoJournal of the American Statistical Association 95 227ndash237

Ngo L and Wand M P (2004) ldquoSmoothing With Mixed Model Softwarerdquo Journal of Statistical Software 9Available online at http wwwjstatsoftorg v09 i01

Nychka D Haaland P OrsquoConnell M and Ellner S (1998) ldquoFUNFITS Data Analysis and Statistical Toolsfor Estimating Functionsrdquo in Case Studies in Environmental Statistics Lecture Notes in Statistics vol 132eds D Nychka W W Piegorsch and L H Cox New York Springer Verlag pp 159ndash179 Online athttp wwwcgducaredu stats Software Funfits

Nychka D and Saltzman N (1998) ldquoDesign of Air Quality Monitoring Networksrdquo in Case studies in Environ-mental Statistics Lecture Notes in Statistics vol 132 eds D Nychka W W Piegorsch and L H Cox NewYork Springer Verlag pp 51ndash76

288 C M Crainiceanu et al

Pintore A Speckman P and Holmes C C (2006) ldquoSpatially Adaptive Smoothing Splinesrdquo Biometrika 93113ndash125

Ruppert D (1997) ldquoLocal Polynomial Regression and its Applications in Environmental Statisticsrdquo in Statisticsfor the Environment (vol 3) eds V Barnett and F Turkman Chicester Wiley pp 155ndash173

(2002) ldquoSelecting the Number of Knots for Penalized Splinesrdquo Journal of Computational and GraphicalStatistics 11 735ndash757

Ruppert D and Carroll R J (2000) ldquoSpatially-Adaptive Penalties for Spline Fittingrdquo Australian and NewZealand Journal of Statistics 42 205ndash223

Ruppert D Wand M P Holst and U Hossier O (1997) Technometrics 39 262ndash273

Ruppert D Wand M P and Carroll R J (2003) Semiparametric Regression Cambridge Cambridge UniversityPress

Smith M and Kohn R (1997) ldquoA Bayesian Approach to Nonparametric Bivariate Regressionrdquo Journal of theAmerican Statistical Association 92 1522ndash1535

Wood S (2006) Generalized Additive Models An Introduction with R New York Chapman amp HallCRC

Page 13: Spatially Adaptive Bayesian Penalized Splines - Department of

Spatially Adaptive Bayesian Penalized Splines 277

$X$ and $Z$, where the $x$-covariates are replaced by the knots $\boldsymbol{\kappa}_k$ and the knots are replaced by the subknots $\boldsymbol{\kappa}^*_k$. Consider the following model for the response:
$$
y_i = \beta_0 + \sum_{j=1}^{L} \beta_j x_{ij} + \sum_{k=1}^{K_m} b_k z_{ik} + \varepsilon_i, \quad \varepsilon_i \sim N\{0, \sigma^2_\varepsilon(\mathbf{x}_i)\}, \ i = 1, \ldots, n,
$$
$$
b_k \sim N\{0, \sigma^2_b(\boldsymbol{\kappa}_k)\}, \ k = 1, \ldots, K_m, \tag{6.2}
$$
where the error variance $\log \sigma^2_\varepsilon(\mathbf{x}_i)$ is modeled as a low-rank log-thin-plate spline,
$$
\log \sigma^2_\varepsilon(\mathbf{x}_i) = \gamma_0 + \sum_{j=1}^{L} \gamma_j x_{ij} + \sum_{k=1}^{K_\varepsilon} c_k z_{ik}, \quad c_k \sim N(0, \sigma^2_c), \ k = 1, \ldots, K_\varepsilon, \tag{6.3}
$$
and the random-coefficient variance $\log \sigma^2_b(\boldsymbol{\kappa}_k)$ is modeled as another low-rank log-thin-plate spline,
$$
\log \sigma^2_b(\boldsymbol{\kappa}_k) = \delta_0 + \sum_{j=1}^{L} \delta_j x^*_{kj} + \sum_{j=1}^{K_b} d_j z^*_{kj}, \quad d_j \sim N(0, \sigma^2_d), \ j = 1, \ldots, K_b, \tag{6.4}
$$
where $x^*_{kj}$ and $z^*_{kj}$ are the entries of the $X^*$ and $Z^*$ matrices, respectively, and the $c_k$'s and $d_j$'s are assumed mutually independent.

In the case of low-rank smoothers, the set of knots for the covariates and the subknots for modeling the shrinkage process have to be chosen. One possibility is to use equally spaced knots and subknots. We use this approach in our simulation study in Section 7 to make our method comparable, from this point of view, with previously published methods. Another possibility is to select the knots and subknots using a space-filling design (Nychka and Saltzman 1998), which is based on the maximal separation principle. This avoids wasting knots and is likely to lead to better approximations in sparse regions of the data. The cover.design() function from the R package fields (Fields Development Team 2006) provides software for space-filling knot selection. We use this approach in Section 8 in the Noshiro elevation example.
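The maximal separation principle behind space-filling knot selection can be illustrated with a simple greedy farthest-point rule: repeatedly add the candidate location farthest from the knots already chosen. This is a Python sketch of the principle only, not the swapping algorithm that cover.design() actually implements; the function name `farthest_point_knots` is our own.

```python
import numpy as np

def farthest_point_knots(x, k, seed=0):
    """Greedy space-filling knot selection: repeatedly pick the candidate
    location farthest from the knots chosen so far (maximal separation)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    chosen = [int(rng.integers(n))]                   # arbitrary starting knot
    d = np.linalg.norm(x - x[chosen[0]], axis=1)      # distance to nearest knot
    for _ in range(k - 1):
        nxt = int(np.argmax(d))                       # farthest remaining point
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(x - x[nxt], axis=1))
    return x[chosen]

# toy usage: 16 subknots from 500 scattered two-dimensional locations
pts = np.random.default_rng(1).uniform(size=(500, 2))
knots = farthest_point_knots(pts, 16)
```

Because each new knot maximizes the distance to the current set, the selected knots spread out over the design region instead of clustering where the data are dense.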

7. COMPARISONS WITH OTHER ADAPTIVE SURFACE FITTING METHODS

Lang and Brezger (2004) compared their adaptive surface smoothing method with several methods used in a simulation study by Smith and Kohn (1997): MARS (Friedman 1991), "locfit" (Cleveland and Grosse 1991), "tps" (bivariate cubic thin-plate splines with a single smoothing parameter), tensor-product cubic smoothing splines with five smoothing parameters, and a parametric linear interaction model. We will use the following regression functions, also used by Lang and Brezger:

- $m_1(x_1, x_2) = x_1 \sin(4\pi x_2)$, where $x_1$ and $x_2$ are distributed independently uniform on $[0, 1]$;
- $m_2(x_1, x_2) = 1.5\exp(-8x_1^2) + 3.5\exp(-8x_2^2)$, where $x_1$ and $x_2$ are distributed independently normal with mean 0.5 and variance 0.1.


Function $m_1(\cdot)$ contains complex interactions and moderate spatial variability, while $m_2(\cdot)$ has only main effects and is much smoother. We used a sample size $n = 300$, $\sigma = \mathrm{range}(f)/4$, and 250 simulations from each model.
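The simulation setup for $m_1$ can be sketched directly; this is an illustrative Python version (the authors' code is MATLAB/C/FORTRAN), with the seed and variable names our own.

```python
import numpy as np

rng = np.random.default_rng(2007)
n = 300

def m1(x1, x2):
    # moderately spatially variable test function from the comparison study
    return x1 * np.sin(4 * np.pi * x2)

x1, x2 = rng.uniform(size=n), rng.uniform(size=n)
f = m1(x1, x2)
sigma = (f.max() - f.min()) / 4        # sigma = range(f)/4, as in the text
y = f + sigma * rng.normal(size=n)     # one simulated dataset (of 250)
```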

We will compare the performance of our method with that of Lang and Brezger's, which performed at least as well as, or better than, other adaptive methods in a variety of examples. To avoid dependence of the results on knot choice, we used the same choice as the one described by Lang and Brezger: 12 × 12 equally spaced knots for the mean function. However, we used thin-plate splines instead of tensor products of B-splines. We modeled the smoothing parameter using a low-rank log-thin-plate spline with an equally spaced 5 × 5 knot grid. We considered two estimators: one with constant error variance (CEV) and one with estimated error variance using a log-thin-plate spline (EEV) with the same 12 × 12 knot grid as for the mean function. The CEV estimator is the one that can be compared directly with the estimators considered by Lang and Brezger (2004). Because Lang and Brezger presented only comparative results using boxplots, we will use their results as read from their Figure 5. The performance of all estimators was measured by the log of the mean squared error, log(MSE), where
$$
\mathrm{MSE}(\hat f) = \frac{1}{n}\sum_{i=1}^{n}\{\hat f(\mathbf{x}_i) - f(\mathbf{x}_i)\}^2.
$$

For the CEV estimator and function $f_1(\cdot)$, corresponding to moderate spatial heterogeneity, we obtained a median log(MSE) of −3.67, with an interquartile range of [−3.80, −3.53] and a range of [−4.21, −3.13]. These values are better than the best values reported for Lang and Brezger's methods in their Figure 5(b). Lang and Brezger reported that their method, denoted "P-spline," outperforms all other methods in this case and achieved a median log(MSE) of −3.50. Remarkably, the EEV estimator performed almost as well as the CEV estimator in this case and also outperformed the other methods. The median log(MSE) for EEV was −3.59, with an interquartile range of [−3.74, −3.43] and a range of [−4.14, −3.02]. These results are consistent with univariate results reported by Ruppert and Carroll (2000) and Baladandayuthapani, Mallick, and Carroll (2005), who found that allowing the smoothing parameter to vary smoothly outperforms other adaptive techniques.

For function $f_1(\cdot)$, Figure 4 displays the coverage probabilities for the 95% credible intervals calculated over a 20 × 20 equally spaced grid in $[0, 1]^2$. For each grid point, we calculated the frequency with which the 95% pointwise credible interval covers the true value of the function at that point. As expected, these coverage probabilities show strong spatial correlation, with lower coverage probabilities along the ridges of the sinus function. Coverage probability is lowest when $x_1$ is in the $[0.2, 0.5]$ range. The signal-to-noise ratio in this region is about half the signal-to-noise ratio in the region corresponding to $x_1$ close to 1, which explains why the coverage probability is smaller there. Interestingly, the zeros of the true function, appearing at $x_2 = 0.25, 0.50, 0.75$, are covered with at least 95% probability. Another interesting feature is the lower coverage probabilities near the north, south, and east boundaries. These features of the pointwise coverage probability map are consistent with the frequentist analysis of Krivobokova et al. (2007), but are not consistent with the coverage probabilities reported by Lang and Brezger for their Bayesian P-spline method.
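A pointwise coverage map of this kind is computed by checking, at each grid point and for each simulated fit, whether the credible interval contains the truth, and then averaging over simulations. A minimal Python sketch, with hypothetical `lower`/`upper` arrays standing in for interval bounds produced by each fit:

```python
import numpy as np

def pointwise_coverage(lower, upper, truth):
    """lower/upper: (n_sim, n_grid) credible-interval bounds from each
    simulated fit; truth: (n_grid,) true function values on the grid.
    Returns the per-grid-point coverage frequency."""
    covered = (lower <= truth) & (truth <= upper)
    return covered.mean(axis=0)

# synthetic check: intervals built to always contain the truth
truth = np.linspace(-1.0, 1.0, 400)            # e.g. a 20 x 20 grid, flattened
lower, upper = truth - 0.1, truth + 0.1
cov = pointwise_coverage(lower[None, :], upper[None, :], truth)
```

In practice `lower` and `upper` would be the 2.5% and 97.5% posterior quantiles of the mean function at each grid point, one row per simulated dataset.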

Figure 4. Coverage probability of the pointwise 95% credible intervals for the mean function, for the function $f_1(x_1, x_2) = x_1 \sin(4\pi x_2)$ with constant error standard deviation $\sigma = \mathrm{range}(f)/4$. Probabilities are computed on a 20 × 20 equally spaced grid of points in $[0, 1]^2$. The model used was a thin-plate spline with $K = 100$ knots for the mean function, a log-thin-plate spline with $K = 100$ knots for the variance function, and a log-thin-plate spline with $K^* = 16$ knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was $m = 2$.

For our CEV estimator and function $f_2(\cdot)$, corresponding to very low spatial heterogeneity, we obtained a median log(MSE) of −5.95, with an interquartile range of [−6.12, −5.66] and a range of [−6.51, −5.41]. The results of Lang and Brezger in their Figure 5(a) indicate that their "P-spline" method achieved a median log(MSE) of −6.00. Our method performs roughly similarly to the Bayesian P-spline method of Lang and Brezger and to the cubic thin-plate spline (tps), being outperformed only by Lang and Brezger's adaptive P-spline method with two main effects, "P-spline main." We could also add two main effects to our bivariate adaptive smoother to improve the MSE for this example, but this is not our concern in this article. Again, the EEV estimator performed very similarly to our CEV estimator.

The last simulation study was done using the function $f_1(\cdot, \cdot)$, where $x_1, x_2$ are independent, uniformly distributed in $[0, 1]$, with the error standard deviation function
$$
\sigma_\varepsilon(x_1, x_2) = \frac{r}{32} + \frac{3r}{32}\,x_1^2,
$$
where $r = \mathrm{range}(f_1)$. We compared our CEV and EEV estimators described at the beginning of this section and, as expected, the EEV estimator outperformed the CEV estimator both in terms of log(MSE) and coverage probabilities. More precisely, for log(MSE) we obtained a median of −5.26 for CEV and −5.35 for EEV, an interquartile range of [−5.37, −5.15] for CEV and [−5.48, −5.24] for EEV, and a range of [−5.72, −4.65] for CEV and [−5.84, −4.84] for EEV. For both estimators, the general patterns for the coverage probabilities were the ones displayed in Figure 4, but EEV outperformed CEV in terms of coverage probabilities. For example, the regions where the nominal level is exceeded for both estimators have a roughly similar shape and location, but the EEV coverage probabilities tend to be closer to their nominal level.

Figure 5. Locations of sampling (dots) with 100 knots (circles) in the Noshiro example.
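Data for this heteroscedastic design, with $\sigma_\varepsilon(x_1, x_2) = r/32 + (3r/32)x_1^2$ and $r = \mathrm{range}(f_1)$, can be generated as follows. This is an illustrative Python sketch (seed and names our own), not the authors' simulation code.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300
x1, x2 = rng.uniform(size=n), rng.uniform(size=n)
f = x1 * np.sin(4 * np.pi * x2)                # f1 from the comparison study
r = f.max() - f.min()                          # r = range(f1)
sigma_eps = r / 32 + (3 * r / 32) * x1**2      # error sd grows with x1
y = f + sigma_eps * rng.normal(size=n)
```

The standard deviation ranges from $r/32$ at $x_1 = 0$ to $r/8$ at $x_1 = 1$, so the signal-to-noise ratio deteriorates by a factor of four across the design region.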

Note that our model, incorporating nonparametric variance estimation, is more general than the models considered in the literature. Nevertheless, its performance for function $m_1(\cdot, \cdot)$, with moderate spatial variability and constant variance, was better than that of all methods considered by Lang and Brezger in their simulation study (see their Figure 5, panel b). Our method also performed well for function $m_2(\cdot, \cdot)$, being second best when compared to the methods considered by Lang and Brezger (see their Figure 5, panel a). For the case when the variance is not constant, our adaptive method that estimates the variance nonparametrically outperformed our adaptive method that assumes a constant variance.

8. THE NOSHIRO EXAMPLE

Noshiro, Japan, was the site of a major earthquake. Much of the damage from the quake was due to soil movement. Professor Thomas O'Rourke, a geotechnical engineer, was investigating the factors that might help predict soil movement during a future quake. One factor thought to be of importance was the slope of the land. To estimate the slope, Ruppert (1997) used a dataset with 799 observations, where x was longitude and latitude and y was elevation at locations in Noshiro. The object of primary interest is the gradient, at least its magnitude and possibly its direction.

We used the thin-plate spline model (6.2) for the mean of y, with $K_m = 100$ knots selected by the maximal separation criterion implemented in the function cover.design() of the R package fields (Nychka, Haaland, O'Connell, and Ellner 1998). The sampling locations and the knots used for fitting are displayed in Figure 5. The log-thin-plate spline described in (6.3), with the same $K_\varepsilon = 100$ knots, was used to model the variance of the error process, $\sigma^2_\varepsilon(\mathbf{x}_i)$. The log-thin-plate spline described in (6.4), with $K_b = 16$ subknots, was used to model the variance $\sigma^2_b(\boldsymbol{\kappa}_k)$, which controls the amount of shrinkage of the b parameters. The subknots were obtained by applying the cover.design() function to the 100 knots.

Figure 6. Posterior mean of the mean regression function for the Noshiro example. The model used was a thin-plate spline with $K = 100$ knots for the mean function, a log-thin-plate spline with $K = 100$ knots for the variance function, and a log-thin-plate spline with $K^* = 16$ knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was $m = 2$.

Figure 6 displays the posterior mean of the mean response function. There is a rather sharp peak around (0.42, 0.4), with a slow decay from the peak in the general south direction and a much faster decay in all other directions. The function becomes smoother towards the boundary. This is exactly the type of function for which adaptive spatial smoothing can substantially improve the fit.

Figure 7 shows that in large areas of the map σε(·) is smooth and takes small values. However, in a neighborhood of the peak of the mean function, σε(·) displays two relatively sharp maxima, located NNW and SSE of the peak. We also performed a two-stage frequentist analysis. In the first stage we used a low-rank thin-plate spline for the mean function, and the results from this regression were used to obtain residuals. We then fitted another low-rank thin-plate spline to the absolute values of these residuals. While this map was not identical to the one in Figure 7, it did exhibit the same patterns of heteroscedasticity.
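The two-stage check (smooth the absolute stage-one residuals to visualize the error standard deviation) can be sketched as follows. For self-containment we substitute a Nadaraya-Watson kernel smoother for the low-rank thin-plate spline; the function names and bandwidth are our own, and the $\sqrt{\pi/2}$ rescaling uses $E|Z| = \sqrt{2/\pi}\,\sigma$ for normal errors.

```python
import numpy as np

def nw_smooth(x, y, x0, h):
    """Nadaraya-Watson kernel smoother, a simple stand-in here for the
    low-rank thin-plate spline used in the two-stage check."""
    d2 = ((x0[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    w = np.exp(-0.5 * d2 / h**2)
    return (w * y).sum(1) / w.sum(1)

rng = np.random.default_rng(0)
x = rng.uniform(size=(400, 2))
sd_true = 0.2 + 0.8 * x[:, 0]                  # heteroscedastic error sd
resid = sd_true * rng.normal(size=400)         # stage-1 residuals (mean removed)
# smooth |residuals|, then rescale: E|Z| = sqrt(2/pi) * sd
sd_hat = nw_smooth(x, np.abs(resid), x, h=0.15) * np.sqrt(np.pi / 2)
```

The resulting `sd_hat` surface tracks the spatial pattern of the error standard deviation, which is exactly the diagnostic comparison made against Figure 7.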

The process $\log \sigma^2_b(\boldsymbol{\kappa}_k)$ controlling the shrinkage of the $b_k$ is displayed in Figure 8.


Figure 7. Posterior mean of the standard deviation $\sigma_\varepsilon(\mathbf{x}_i)$ of the error process for the Noshiro example. The model used was a thin-plate spline with $K = 100$ knots for the mean function, a log-thin-plate spline with $K = 100$ knots for the variance function, and a log-thin-plate spline with $K^* = 16$ knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was $m = 2$.

Figure 8. Posterior mean of the shrinkage process $-\log \sigma^2_b(\boldsymbol{\kappa}_k)$. The model used was a thin-plate spline with $K = 100$ knots for the mean function, a log-thin-plate spline with $K = 100$ knots for the variance function, and a log-thin-plate spline with $K^* = 16$ knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was $m = 2$.


Smaller values of this function correspond to less shrinkage of the $b_k$ towards zero and more local behavior of the smoother. Figure 8 indicates that in a neighborhood of the peak of the mean function the shrinkage is smaller, allowing the mean function to change rapidly. Away from the peak the shrinkage is larger, corresponding to a smoother mean.

9. IMPLEMENTATION USING MCMC

We will now provide some details for MCMC simulations of model (2.1), where the variances $\sigma^2_\varepsilon(x_i)$ and $\sigma^2_b(\kappa_k)$ are modeled by Equations (2.3) and (2.4). The implementation for other models, for example for the multivariate smoothing in Section 6, is similar. Consider independent normal priors for the coefficients of the monomials, $\beta_i \sim N(0, \sigma^2_{0\beta})$, $\gamma_i \sim N(0, \sigma^2_{0\gamma})$, $i = 0, \ldots, p$, $\delta_i \sim N(0, \sigma^2_{0\delta})$, $i = 0, \ldots, q$, and independent inverse-Gamma priors for the variance components, $\sigma^2_c \sim \mathrm{IGamma}(a_c, b_c)$ and $\sigma^2_d \sim \mathrm{IGamma}(a_d, b_d)$. Using these priors, many full conditionals of the posterior distribution are easy to derive, while a few have complex multivariate forms. Our implementation of the MCMC using multivariate Metropolis-Hastings steps proved to be unstable, with poor mixing properties. A simple and reliable solution was to change the model by adding error terms to the log-spline models, that is,

$$
\log \sigma^2_\varepsilon(x_i) = \gamma_0 + \cdots + \gamma_p x_i^p + \sum_{s=1}^{K_\varepsilon} c_s (x_i - \kappa^\varepsilon_s)_+^p + u_i,
$$
$$
\log \sigma^2_b(\kappa^m_k) = \delta_0 + \cdots + \delta_q (\kappa^m_k)^q + \sum_{s=1}^{K_b} d_s (\kappa^m_k - \kappa^b_s)_+^q + v_k, \tag{9.1}
$$
where $u_i \sim N(0, \sigma^2_u)$ and $v_k \sim N(0, \sigma^2_v)$. This idea was also used for $\sigma^2_b$ by Baladandayuthapani, Mallick, and Carroll (2005). We fixed the values $\sigma^2_u = \sigma^2_v = 0.01$, as these variances appear nonidentifiable and a standard deviation of 0.1 is small on a log-scale. This device makes computations feasible by replacing sampling from complex full conditionals with simple univariate MH steps.
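The truncated polynomial design matrices appearing in (9.1) are straightforward to build; a self-contained Python sketch (the function name is our own, and the knot grid is illustrative):

```python
import numpy as np

def trunc_poly_design(x, knots, degree):
    """Design matrix [1, x, ..., x^p, (x - k1)_+^p, ..., (x - kK)_+^p]
    of a degree-p truncated polynomial spline, as in (9.1)."""
    X = np.vander(x, degree + 1, increasing=True)            # monomial part
    Z = np.clip(x[:, None] - knots[None, :], 0, None) ** degree  # truncated part
    return np.hstack([X, Z])

x = np.linspace(0, 1, 101)
knots = np.linspace(0.1, 0.9, 10)
C_eps = trunc_poly_design(x, knots, degree=2)   # plays the role of [X_eps, Z_eps]
```

Each truncated column is zero below its knot and polynomial above it, which is what lets the spatially varying shrinkage act locally.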

Define, as before, $\varepsilon$, $b$, $\boldsymbol{\theta}_X$, and define $\boldsymbol{\theta}_\varepsilon = (\boldsymbol{\gamma}^T, \mathbf{c}^T)^T$, $\boldsymbol{\theta}_b = (\boldsymbol{\delta}^T, \mathbf{d}^T)^T$, $C_X = [X, Z]$, $C_\varepsilon = [X_\varepsilon, Z_\varepsilon]$, $C_b = [X_b, Z_b]$, where $X$, $X_\varepsilon$, $X_b$ contain the monomials and $Z$, $Z_\varepsilon$, $Z_b$ contain the truncated polynomials of the spline models for $m$, $\log \sigma^2_\varepsilon$, and $\log \sigma^2_b$, respectively. Also denote by
$$
\Sigma_{0X} = \begin{bmatrix} \sigma^2_{0\beta} I_{p+1} & 0 \\ 0 & \Sigma_b \end{bmatrix}, \quad
\Sigma_{0\varepsilon} = \begin{bmatrix} \sigma^2_{0\gamma} I_{p+1} & 0 \\ 0 & \sigma^2_c I_{K_\varepsilon} \end{bmatrix}, \quad
\Sigma_{0b} = \begin{bmatrix} \sigma^2_{0\delta} I_{q+1} & 0 \\ 0 & \sigma^2_d I_{K_b} \end{bmatrix}.
$$

The full conditionals of the posterior are detailed below:

1. $[\boldsymbol{\theta}_X] \sim N\big(M_X C_X^T \Sigma_\varepsilon^{-1} Y,\; M_X\big)$, where $M_X = \big(C_X^T \Sigma_\varepsilon^{-1} C_X + \Sigma_{0X}^{-1}\big)^{-1}$.

2. $[\boldsymbol{\theta}_\varepsilon] \sim N\big(M_\varepsilon C_\varepsilon^T Y_\varepsilon/\sigma^2_u,\; M_\varepsilon\big)$, where $Y_\varepsilon = \{\log(\sigma^2_{\varepsilon 1}), \ldots, \log(\sigma^2_{\varepsilon n})\}^T$ and $M_\varepsilon = \big(C_\varepsilon^T C_\varepsilon/\sigma^2_u + \Sigma_{0\varepsilon}^{-1}\big)^{-1}$.

3. $[\boldsymbol{\theta}_b] \sim N\big(M_b C_b^T Y_b/\sigma^2_v,\; M_b\big)$, where $Y_b = \{\log(\sigma^2_{b1}), \ldots, \log(\sigma^2_{bK_m})\}^T$ and $M_b = \big(C_b^T C_b/\sigma^2_v + \Sigma_{0b}^{-1}\big)^{-1}$.

4. $[\sigma^2_c] \sim \mathrm{IGamma}\big(a_c + K_\varepsilon/2,\; b_c + \|\mathbf{c}\|^2/2\big)$.

5. $[\sigma^2_d] \sim \mathrm{IGamma}\big(a_d + K_b/2,\; b_d + \|\mathbf{d}\|^2/2\big)$.

6. $[\sigma^2_{\varepsilon i}] \propto \sigma^{-3}_{\varepsilon i} \exp\big\{-(y_i - \mu_i)^2/(2\sigma^2_{\varepsilon i}) - [\log(\sigma^2_{\varepsilon i}) - \eta_i]^2/(2\sigma^2_u)\big\}$, where $\mu_i$ and $\eta_i$ are the $i$th components of $C_X\boldsymbol{\theta}_X$ and $C_\varepsilon\boldsymbol{\theta}_\varepsilon$, respectively.

7. $[\sigma^2_{bk}] \propto \sigma^{-3}_{bk} \exp\big\{-b_k^2/(2\sigma^2_{bk}) - [\log(\sigma^2_{bk}) - \zeta_k]^2/(2\sigma^2_v)\big\}$, where $\zeta_k$ is the $k$th component of $C_b\boldsymbol{\theta}_b$.

Figure 9. Computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with $K_m$ knots, with the log of the variance of the shrinkage process modeled as another penalized spline with $K_b = 4$ subknots. The log of the error variance is modeled as another penalized spline with $K_\varepsilon = K_m$ knots. Different symbols correspond to increasing numbers of knots: dots ($K_m = 10$), asterisks ($K_m = 20$), circles ($K_m = 30$), x's ($K_m = 40$).

All the above conditionals have an explicit form, with the exception of the $n + K_m$ one-dimensional conditionals in 6 and 7. For these distributions we use the Metropolis-Hastings algorithm with a normal proposal distribution centered at the current value and with small variance.
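One such univariate update, for the $[\sigma^2_{\varepsilon i}]$ conditional in item 6, can be sketched in Python as follows (the authors' implementations are in MATLAB/C/FORTRAN; the parameter values below are illustrative). Proposals that fall outside the support $\sigma^2 > 0$ are simply rejected.

```python
import numpy as np

def log_target(v, y_i, mu_i, eta_i, sig2_u):
    """Log of the [sigma^2_eps,i] conditional (item 6) up to a constant,
    with v = sigma^2_eps,i; note sigma^{-3} = v^{-3/2}."""
    return (-1.5 * np.log(v)
            - (y_i - mu_i) ** 2 / (2 * v)
            - (np.log(v) - eta_i) ** 2 / (2 * sig2_u))

def mh_step(v, y_i, mu_i, eta_i, sig2_u, prop_sd, rng):
    """One random-walk MH update, normal proposal centered at the current value."""
    v_new = v + prop_sd * rng.normal()
    if v_new <= 0:                                  # outside the support: reject
        return v, False
    log_alpha = (log_target(v_new, y_i, mu_i, eta_i, sig2_u)
                 - log_target(v, y_i, mu_i, eta_i, sig2_u))
    if np.log(rng.uniform()) < log_alpha:
        return v_new, True
    return v, False

# toy run for a single observation with illustrative mu_i, eta_i values
rng = np.random.default_rng(7)
v, acc = 1.0, 0
for _ in range(5000):
    v, a = mh_step(v, y_i=0.5, mu_i=0.0, eta_i=0.0,
                   sig2_u=0.01, prop_sd=0.05, rng=rng)
    acc += a
```

The $[\sigma^2_{bk}]$ update in item 7 is identical in structure, with $b_k^2$ replacing $(y_i - \mu_i)^2$, $\zeta_k$ replacing $\eta_i$, and $\sigma^2_v$ replacing $\sigma^2_u$.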

10. IMPLEMENTATION AND COMPUTATIONAL PERFORMANCE

Our MCMC simulation algorithm provided reliable results and is reasonably fast. To illustrate this, Figure 9 displays computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with $K_m$ knots, with the log of the variance of the shrinkage process modeled as another penalized spline with $K_b = 4$ subknots. The log of the error variance is modeled as another penalized spline with $K_\varepsilon = K_m$ knots. Different symbols correspond to increasing numbers of knots: dots ($K_m = 10$), asterisks ($K_m = 20$), circles ($K_m = 30$), x's ($K_m = 40$). Computation time increases roughly linearly with the number of observations, with the same intercept (zero) and larger slopes for larger numbers of knots. Note that even with $n = 1000$ observations and $K_m = K_\varepsilon = 40$, simulation time is about two minutes.

Figure 10. MCMC histories for eight parameters of interest, including the log of the standard deviation of the error process and the log of the standard deviation of the shrinkage process, as well as some parameters modeling the different functions. The model uses 40 equally spaced knots for the mean and error variance functions and 4 subknots for the adaptive shrinkage process. Chains correspond to the Doppler function example, using quadratic truncated polynomials to fit the mean and log-variance functions and a linear truncated polynomial basis for fitting the shrinkage process. For clarity of presentation, only every 20th simulated observation is displayed.

For the models and datasets presented in this article, 10,000 simulations proved sufficient for accurate and reliable inference. Some parameters, such as the mean and variance functions, tended to exhibit much better mixing properties than the variance components $\sigma^2_c$ and $\sigma^2_d$ controlling the shrinkage processes. This is probably due to the reduced amount of information in the data about these parameters. The MCMC convergence and mixing properties were assessed by visual inspection of the chain histories of many parameters of interest. Since we started with good starting values of the parameters, convergence is fast. Figure 10 displays the histories of eight parameters of interest for the model fitted to the Doppler function described in Section 5. For clarity, we display only every 100th sampled value of the chain, but the same patterns would be observed in the full chain history. Similar good mixing properties have been noted in all the other examples presented in this article.

In the search for a good software platform, we implemented our algorithms in MATLAB, C, and FORTRAN. Programs can be downloaded from www.biostat.jhsph.edu/~ccrainic. For our application and implementation there were no serious differences between C and FORTRAN, both of which performed better than MATLAB, reducing computation time by a factor of two to three. The reported times in Figure 9 are for our implementation in C.

Given the complexity of the model, we were interested in improving the mixing properties of the Markov chains. The computational trick described in Section 9 was the sine qua non device that allowed us to replace an intractable multivariate MH step with several tractable univariate MH steps. We used normal random-walk proposal distributions centered at the current value, with variance tuned to provide acceptance rates between 30% and 50%.
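One simple way to tune such proposal variances during burn-in is to adjust them in batches: widen the proposal when the acceptance rate is above the target band, narrow it when below. This is a generic sketch of that heuristic (the band endpoints match the text; the tuning factor, batch sizes, and the N(0,1) demonstration target are our own choices, not the authors' procedure):

```python
import numpy as np

def tune_prop_sd(prop_sd, acc_rate, lo=0.30, hi=0.50):
    """Crude burn-in tuning: widen the random-walk proposal when acceptance
    is above the 30-50% target band, narrow it when below."""
    if acc_rate > hi:
        return prop_sd * 1.5
    if acc_rate < lo:
        return prop_sd / 1.5
    return prop_sd

# example: drive a random-walk sampler for a N(0,1) target into the band
rng = np.random.default_rng(11)
sd, x = 10.0, 0.0                       # deliberately bad initial proposal sd
for batch in range(50):
    acc = 0
    for _ in range(200):
        xn = x + sd * rng.normal()
        # symmetric proposal: log acceptance ratio for the N(0,1) target
        if np.log(rng.uniform()) < 0.5 * (x**2 - xn**2):
            x, acc = xn, acc + 1
    sd = tune_prop_sd(sd, acc / 200)
```

Tuning is done only during burn-in; the proposal variances are then held fixed so that the chain remains a valid Metropolis-Hastings sampler.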

11. COMMENTS

Our joint models for a heterogeneous mean and nonconstant variance are designed to extract the maximum amount of information from complex data and to suggest simpler parametric models. Bayesian inference based on MCMC offers a natural and powerful framework for model fitting. However, MCMC in this context is not straightforward, and implementation depends essentially on the ability to transform intractable simulations from high-dimensional distributions into tractable one-dimensional updates, as described in Section 9. Our implementation provides accurate, fast, and reproducible results, and software is publicly available. Actual simulation time increases only linearly with the number of observations, and mixing properties are adequate provided that proposal distribution variances are tuned for optimal acceptance rates.

ACKNOWLEDGMENTS

Carroll's research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

[Received September 2004. Revised October 2006.]

REFERENCES

Baladandayuthapani V Mallick B K and Carroll R J (2005) ldquoSpatially Adaptive Bayesian RegressionSplinesrdquo Journal of Computational and Graphical Statistics 14 378ndash394

Besag J Green P Higdon D and Mengersen K (1995) ldquoBayesian Computation and Stochastic Systemsrdquo(with discussions) Statistical Science 10 3ndash66

Carroll R J (2003) ldquoVariances Are Not Always Nuisance Parameters The 2002 R A Fisher Lecturerdquo Biometrics59 211ndash220

Carroll R J and Ruppert D (1988) Transformation and Weighting in Regression New York Chapman andHall

Spatially Adaptive Bayesian Penalized Splines 287

278 C M Crainiceanu et al

Function m1(·, ·) contains complex interactions and moderate spatial variability, while m2(·, ·) has only main effects and is much smoother. We used a sample size n = 300, σ = range(f)/4, and 250 simulations from each model.

We will compare the performance of our method with that of Lang and Brezger's, which performed at least as well as, or better than, other adaptive methods in a variety of examples. To avoid dependence of the results on knot choice, we used the same choice as the one described by Lang and Brezger: 12 × 12 equally spaced knots for the mean function. However, we used thin-plate splines instead of tensor products of B-splines. We modeled the smoothing parameter using a low-rank log-thin plate spline with an equally spaced 5 × 5 knot grid. We considered two estimators: one with constant error variance (CEV) and one with estimated error variance using a log-thin plate spline (EEV) with the same 12 × 12 knot grid as for the mean function. The CEV estimator is the one that can be compared directly with the estimators considered by Lang and Brezger (2004). Because Lang and Brezger presented only comparative results using boxplots, we will use their results as read from their Figure 5. The performance of all estimators was measured by the log of the mean squared error, log(MSE), where MSE(f̂) = (1/n) Σ_{i=1}^n {f̂(x_i) − f(x_i)}².
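As a concrete sketch (names are ours, not the authors' code), the log(MSE) criterion is simply the log of the averaged squared discrepancy between the fitted and true functions at the design points:

```python
import numpy as np

def log_mse(f_hat, f_true, x):
    """log of the mean squared error of f_hat against f_true at design points x."""
    resid = f_hat(x) - f_true(x)
    return np.log(np.mean(resid ** 2))

# toy check: a fit that is off by a constant c everywhere has MSE = c^2
f_true = lambda x: np.sin(4 * np.pi * x)
f_hat = lambda x: f_true(x) + 0.1
x = np.linspace(0.0, 1.0, 300)
val = log_mse(f_hat, f_true, x)   # equals log(0.01)
```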

For the CEV estimator and function f1(·), corresponding to moderate spatial heterogeneity, we obtained a median log(MSE) of −3.67, with an interquartile range of [−3.80, −3.53] and a range of [−4.21, −3.13]. These values are better than the best values reported for Lang and Brezger's methods in their Figure 5(b). Lang and Brezger reported that their method, denoted "P-spline," outperforms all other methods in this case and achieved a median log(MSE) of −3.50. Remarkably, the EEV estimator performed almost as well as the CEV estimator in this case and also outperformed the other methods. The median log(MSE) for EEV was −3.59, with an interquartile range of [−3.74, −3.43] and a range of [−4.14, −3.02]. These results are consistent with univariate results reported by Ruppert and Carroll (2000) and Baladandayuthapani, Mallick, and Carroll (2005), who found that allowing the smoothing parameter to vary smoothly outperforms other adaptive techniques.

For function f1(·), Figure 4 displays the coverage probabilities for the 95% credible intervals calculated over a 20 × 20 equally spaced grid in [0, 1]². For each grid point we calculated the frequency with which the 95% pointwise credible interval covers the true value of the function at that point. As expected, these coverage probabilities show strong spatial correlation, with lower coverage probabilities along the ridges of the sinus function. Coverage probability is lowest when x1 is in the [0.2, 0.5] range. The signal-to-noise ratio in this region is about half the signal-to-noise ratio in the region corresponding to x1 close to 1, which explains why the coverage probability is smaller there. Interestingly, the zeros of the true function, appearing at x2 = 0.25, 0.50, 0.75, are covered with at least 95% probability. Another interesting feature is the lower coverage probabilities near the north, south, and east boundaries. These features of the pointwise coverage probability map are consistent with the frequentist analysis of Krivobokova et al. (2007), but are not consistent with the coverage probabilities reported by Lang and Brezger for their Bayesian P-spline method.
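A coverage map of this kind is just a per-grid-point frequency over the simulations. A minimal sketch (our names, not the authors' code):

```python
import numpy as np

def coverage_map(lower, upper, truth):
    """Pointwise coverage: lower/upper are (n_sim, n_grid) credible bounds,
    truth is (n_grid,); returns the fraction of simulations covering truth."""
    covered = (lower <= truth) & (truth <= upper)
    return covered.mean(axis=0)

# toy example with 2 simulations on a 3-point grid
lower = np.array([[0.0, 0.0, 0.5], [0.0, 0.2, 0.0]])
upper = np.array([[1.0, 1.0, 1.0], [1.0, 0.4, 1.0]])
truth = np.array([0.5, 0.1, 0.3])
cov = coverage_map(lower, upper, truth)  # [1.0, 0.5, 0.5]
```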

For our CEV estimator and function f2(·), corresponding to very low spatial heterogeneity, we obtained a median log(MSE) of −5.95, with an interquartile range of [−6.12, −5.66]


Figure 4. Coverage probability of the pointwise 95% credible intervals for the mean function, for the function f1(x1, x2) = x1 sin(4πx2) with a constant error standard deviation σ = range(f)/4. Probabilities are computed on a 20 × 20 equally spaced grid of points in [0, 1]². The model used was a thin plate spline with K = 100 knots for the mean function, a log-thin plate spline with K = 100 knots for the variance function, and a log-thin plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

and a range of [−6.51, −5.41]. The results of Lang and Brezger in their Figure 5(a) indicate that their "P-spline" method achieved a median log(MSE) of −6.00. Our method performs roughly similarly to the Bayesian P-spline method of Lang and Brezger and to the cubic thin plate spline (tps), being outperformed only by Lang and Brezger's adaptive P-spline method with two main effects, "P-spline main." We could also add two main effects to our bivariate adaptive smoother to improve MSE for this example, but this is not our concern in this article. Again, the EEV estimator performed very similarly to our CEV estimator.

The last simulation study was done using the function f1(·, ·), where x1, x2 are independent and uniformly distributed in [0, 1], with error standard deviation function

σε(x1, x2) = r/32 + (3r/32) x1²,

where r = range(f1). We compared the CEV and EEV estimators described at the beginning of this section and, as expected, the EEV estimator outperformed the CEV estimator both in terms of log(MSE) and coverage probabilities. More precisely, for log(MSE) we obtained a median of −5.26 for CEV and −5.35 for EEV, an interquartile range of [−5.37, −5.15] for CEV and [−5.48, −5.24] for EEV, and a range of [−5.72, −4.65] for CEV and [−5.84, −4.84] for EEV. For both estimators, the general patterns for coverage probabilities were the ones displayed in Figure 4. EEV outperformed CEV in terms of coverage probabilities. For example, the regions where the nominal level is exceeded for both estimators have a roughly similar


Figure 5 Locations of sampling (dots) with 100 knots (circles) in the Noshiro example

shape and location, but the EEV coverage probabilities tend to be closer to their nominal level.
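For reference, data from this heteroscedastic design — f1(x1, x2) = x1 sin(4πx2) with standard deviation r/32 + (3r/32)x1² — can be simulated as follows. This is our own sketch (r is taken as the empirical range of f1 on the sample), not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

def f1(x1, x2):
    # test function from the simulation study (Figure 4 caption)
    return x1 * np.sin(4 * np.pi * x2)

n = 300
x1 = rng.uniform(0.0, 1.0, n)
x2 = rng.uniform(0.0, 1.0, n)
m = f1(x1, x2)
r = m.max() - m.min()                          # empirical range(f1) on the sample
sigma_eps = r / 32 + (3 * r / 32) * x1 ** 2    # heteroscedastic sd surface
y = m + sigma_eps * rng.standard_normal(n)
```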

Note that our model, incorporating nonparametric variance estimation, is more general than the models considered in the literature. Nevertheless, its performance for function m1(·, ·), with moderate spatial variability and constant variance, was better than that of all methods considered by Lang and Brezger in their simulation study (see their Figure 5, panel b). Our method also performs well for the function m2(·, ·), being second best when compared with the methods considered by Lang and Brezger (see their Figure 5, panel a). For the case when the variance is not constant, our adaptive method that estimates the variance nonparametrically outperformed our adaptive method that assumes a constant variance.

8. THE NOSHIRO EXAMPLE

Noshiro, Japan, was the site of a major earthquake. Much of the damage from the quake was due to soil movement. Professor Thomas O'Rourke, a geotechnical engineer, was investigating the factors that might help predict soil movement during a future quake. One factor thought to be of importance was the slope of the land. To estimate the slope, Ruppert (1997) used a dataset with 799 observations, where x was longitude and latitude and y was elevation at locations in Noshiro. The object of primary interest is the gradient, at least its magnitude and possibly its direction.

We used the thin-plate spline model (6.2) for the mean of y, with Km = 100 knots selected by the maximum separation criterion implemented in the function cover.design() of the R package fields (Nychka, Haaland, O'Connell, and Ellner 1998). The sampling


Figure 6. Posterior mean of the mean regression function for the Noshiro example. The model used was a thin plate spline with K = 100 knots for the mean function, a log-thin plate spline with K = 100 knots for the variance function, and a log-thin plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

locations and the knots used for fitting are displayed in Figure 5. The log-thin plate spline described in (6.3), with the same Kε = 100 knots, was used to model the variance of the error process, σ²ε(xi). The log-thin plate spline described in (6.4), with Kb = 16 subknots, was used to model the variance σ²b(κk), which controls the amount of shrinkage of the b parameters. The subknots were obtained using the cover.design() function applied to the 100 knots.
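The fields function cover.design() optimizes a space-filling coverage criterion. A rough stand-in for that behavior (not the fields algorithm) is a greedy farthest-point rule, sketched here with hypothetical names:

```python
import numpy as np

def farthest_point_knots(x, k, seed_index=0):
    """Greedy space-filling design: start from one point, then repeatedly add
    the candidate farthest from the current knot set (a crude stand-in for
    the coverage criterion optimized by fields::cover.design)."""
    x = np.asarray(x, dtype=float)
    chosen = [seed_index]
    # distance from every point to its nearest chosen knot so far
    d = np.linalg.norm(x - x[seed_index], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(d))
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(x - x[nxt], axis=1))
    return np.array(chosen)

rng = np.random.default_rng(1)
pts = rng.uniform(0, 1, size=(799, 2))   # same sample size as the Noshiro data
knots = farthest_point_knots(pts, 100)   # indices of 100 well-separated points
```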

Figure 6 displays the posterior mean of the mean response function. There is a rather sharp peak around (0.42, 0.4), with a slow decay from the peak in the general south direction and a much faster decay in all other directions. The function becomes smoother towards the boundary. This is exactly the type of function for which adaptive spatial smoothing can substantially improve the fit.

Figure 7 shows that in large areas of the map σε(·) is smooth, with small values. However, in a neighborhood of the peak of the mean function, σε(·) displays two relatively sharp maxima located NNW and SSE of the peak. We also performed a two-stage frequentist analysis. In the first stage we used a low rank thin plate spline for the mean function, and the results from this regression were used to obtain residuals. We then fitted another low rank thin plate spline to the absolute values of these residuals. While this map was not identical to the one in Figure 7, it did exhibit the same patterns of heteroscedasticity.
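The two-stage check can be sketched in one dimension, with a simple polynomial smoother standing in for the low-rank thin plate splines (an illustration under our own toy setup, not the authors' analysis):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = np.sort(rng.uniform(-1, 1, n))
sigma = 0.2 + 0.5 * x ** 2                      # true heteroscedastic sd
y = np.sin(3 * x) + sigma * rng.standard_normal(n)

# stage 1: smooth the mean and form residuals
mean_fit = np.polynomial.Polynomial.fit(x, y, deg=7)
resid = y - mean_fit(x)

# stage 2: smooth |residuals|; for normal errors E|e| = sqrt(2/pi) * sd,
# so rescale the fitted curve to estimate the sd surface
abs_fit = np.polynomial.Polynomial.fit(x, np.abs(resid), deg=4)
sd_hat = abs_fit(x) / np.sqrt(2 / np.pi)
```

The recovered sd_hat tracks the true sigma pattern, which is the qualitative agreement the paragraph above reports for the Noshiro map.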

The process log σ²b(κk), controlling the shrinkage of the bk, is displayed in Figure 8.


Figure 7. Posterior mean of the standard deviation σε(xi) of the error process for the Noshiro example. The model used was a thin plate spline with K = 100 knots for the mean function, a log-thin plate spline with K = 100 knots for the variance function, and a log-thin plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

Figure 8. Posterior mean of the shrinkage process −log σ²b(κk). The model used was a thin plate spline with K = 100 knots for the mean function, a log-thin plate spline with K = 100 knots for the variance function, and a log-thin plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.


Smaller values of this function correspond to less shrinkage of bk towards zero and more local behavior of the smoother. Figure 8 indicates that in a neighborhood of the peak of the mean function the shrinkage is smaller, to allow the mean function to change rapidly. Away from the peak the shrinkage is larger, corresponding to a smoother mean.

9. IMPLEMENTATION USING MCMC

We will now provide some details for MCMC simulations of model (2.1), where the variances σ²ε(xi) and σ²b(κk) are modeled by Equations (2.3) and (2.4). The implementation for other models, for example for the multivariate smoothing in Section 6, is similar. Consider independent normal priors for the coefficients of the monomials, βi ∼ N(0, σ²_0β), γi ∼ N(0, σ²_0γ), i = 0, …, p, δi ∼ N(0, σ²_0δ), i = 0, …, q, and independent inverse Gamma priors for the variance components, σ²c ∼ IGamma(ac, bc) and σ²d ∼ IGamma(ad, bd). Using these priors, many full conditionals of the posterior distribution are easy to derive, while a few have complex multivariate forms. Our implementation of the MCMC using multivariate Metropolis-Hastings steps proved to be unstable, with poor mixing properties. A simple and reliable solution was to change the model by adding error terms to the log-spline models, that is,

log σ²ε(xi) = γ_0 + ··· + γ_p xi^p + Σ_{s=1}^{Kε} c_s (xi − κ^ε_s)_+^p + u_i,
log σ²b(κ^m_k) = δ_0 + ··· + δ_q (κ^m_k)^q + Σ_{s=1}^{Kb} d_s (κ^m_k − κ^b_s)_+^q + v_k,    (9.1)

where u_i ∼ N(0, σ²u) and v_k ∼ N(0, σ²v). This idea was also used for σ²b by Baladandayuthapani, Mallick, and Carroll (2005). We fixed the values σ²u = σ²v = 0.01, as these variances appear nonidentifiable and a standard deviation of 0.1 is small on a log-scale. This device makes computations feasible by replacing sampling from complex full conditionals with simple univariate MH steps.
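For concreteness, a truncated-polynomial design matrix of the form C = [X, Z] used by these spline models can be assembled as follows (a sketch with our own names):

```python
import numpy as np

def spline_design(x, knots, degree):
    """Columns: monomials 1, x, ..., x^degree, then one truncated polynomial
    (x - kappa_s)_+^degree per knot -- i.e., C = [X, Z]."""
    x = np.asarray(x, dtype=float)
    X = np.vander(x, degree + 1, increasing=True)
    Z = np.maximum(x[:, None] - np.asarray(knots)[None, :], 0.0) ** degree
    return np.hstack([X, Z])

x = np.linspace(0, 1, 200)
knots = np.linspace(0.05, 0.95, 20)
C = spline_design(x, knots, degree=2)   # shape (200, 3 + 20)
```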

Define, as before, Σε = diag{σ²ε(x1), …, σ²ε(xn)}, Σb = diag{σ²b(κ1), …, σ²b(κKm)}, and θX, and define θε = (γ^T, c^T)^T, θb = (δ^T, d^T)^T, CX = [X, Z], Cε = [Xε, Zε], Cb = [Xb, Zb], where X, Xε, Xb contain the monomials and Z, Zε, Zb contain the truncated polynomials of the spline models for m, log σ²ε, and log σ²b, respectively. Also denote the block-diagonal prior covariance matrices

Σ_0X = diag(σ²_0β I_{p+1}, Σb),  Σ_0ε = diag(σ²_0γ I_{p+1}, σ²c I_{Kε}),  Σ_0b = diag(σ²_0δ I_{q+1}, σ²d I_{Km}).

The full conditionals of the posterior distribution are detailed below.

1. [θX] ∼ N(MX CX^T Σε^{-1} Y, MX), where MX = (CX^T Σε^{-1} CX + Σ_0X^{-1})^{-1}.

2. [θε] ∼ N(Mε Cε^T Yε/σ²u, Mε), where Yε = [log(σ²ε1), …, log(σ²εn)]^T and Mε = (Cε^T Cε/σ²u + Σ_0ε^{-1})^{-1}.


Figure 9. Computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with Km knots, with the log of the variance of the shrinkage process modeled as another penalized spline with Kb = 4 subknots. The log of the error variance is modeled as another penalized spline with Kε = Km knots. Different symbols correspond to increasing numbers of knots: dots (Km = 10), asterisks (Km = 20), circles (Km = 30), x's (Km = 40).

3. [θb] ∼ N(Mb Cb^T Yb/σ²v, Mb), where Yb = [log(σ²b1), …, log(σ²bKm)]^T and Mb = (Cb^T Cb/σ²v + Σ_0b^{-1})^{-1}.

4. [σ²c] ∼ IGamma(ac + Kε/2, bc + ||c||²/2).

5. [σ²d] ∼ IGamma(ad + Kb/2, bd + ||d||²/2).

6. [σ²εi] ∝ σεi^{-3} exp{−(yi − μi)²/(2σ²εi) − [log(σ²εi) − ηi]²/(2σ²u)}, where μi and ηi are the ith components of CX θX and Cε θε, respectively.

7. [σ²bk] ∝ σbk^{-3} exp{−bk²/(2σ²bk) − [log(σ²bk) − ζk]²/(2σ²v)}, where ζk is the kth component of Cb θb.

All of the above conditionals have an explicit form, with the exception of the n + Km one-dimensional conditionals in 6 and 7. For these distributions we use the Metropolis-Hastings algorithm with a normal proposal distribution centered at the current value and with small variance.
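A univariate random-walk MH update for a conditional of this form can be sketched as follows. This is our own toy version: for numerical stability we propose on the log-variance scale with the Jacobian folded into the target, whereas the paper proposes on the variance directly; all numerical values are made up:

```python
import numpy as np

def log_target(log_s2, y_i, mu_i, eta_i, s2_u):
    """Log full conditional of log(sigma^2_eps,i) up to a constant.
    On the log scale the sigma^{-3} factor plus the Jacobian gives -log_s2/2."""
    s2 = np.exp(log_s2)
    return (-0.5 * log_s2
            - (y_i - mu_i) ** 2 / (2.0 * s2)
            - (log_s2 - eta_i) ** 2 / (2.0 * s2_u))

def mh_step(log_s2, step, rng, *args):
    """One random-walk Metropolis-Hastings update; returns (state, accepted)."""
    prop = log_s2 + step * rng.standard_normal()
    if np.log(rng.uniform()) < log_target(prop, *args) - log_target(log_s2, *args):
        return prop, 1
    return log_s2, 0

rng = np.random.default_rng(3)
y_i, mu_i, eta_i, s2_u = 1.3, 1.0, np.log(0.09), 0.01   # toy values, one observation
chain = np.empty(2000)
log_s2, accepted = eta_i, 0
for t in range(2000):
    log_s2, a = mh_step(log_s2, 0.15, rng, y_i, mu_i, eta_i, s2_u)
    accepted += a
    chain[t] = log_s2
```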

10. IMPLEMENTATION AND COMPUTATIONAL PERFORMANCE

Our MCMC simulation algorithm provided reliable results and is reasonably fast. To illustrate this, Figure 9 displays computation time (in minutes) versus sample size for 10,000


Figure 10. MCMC histories for eight parameters of interest, including the log of the standard deviation of the error process and the log of the standard deviation of the shrinkage process, as well as some parameters modeling the different functions. The model uses 40 equally spaced knots for the mean and error variance functions and 4 subknots for the adaptive shrinkage process. Chains correspond to the Doppler function example, using quadratic truncated polynomials to fit the mean and log-variance functions and a linear truncated polynomial basis to fit the shrinkage process. For clarity of presentation, only every 20th simulated observation is displayed.

simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with Km knots, with the log of the variance of the shrinkage process modeled as another penalized spline with Kb = 4 subknots. The log of the error variance is modeled as another penalized spline with Kε = Km knots. Different symbols correspond to increasing numbers of knots: dots (Km = 10), asterisks (Km = 20), circles (Km = 30), x's (Km = 40). Computation time increases roughly linearly with the number of observations, with the same intercept 0 and larger slopes for larger numbers of knots. Note that even with n = 1000 observations and Km = Kε = 40, simulation time is about two minutes.

For the models and datasets presented in this article, 10,000 simulations proved sufficient for accurate and reliable inference. Some parameters, such as the mean and variance functions, tended to exhibit much better mixing properties than the variance components σ²c and σ²d controlling the shrinkage processes. This is probably due to the reduced amount of information in the data about these parameters. The MCMC convergence and mixing properties were assessed by visual inspection of the chain histories of many parameters of interest. Since we started with good starting values of the parameters, convergence is fast. Figure 10 displays the histories of eight parameters of interest for the model fitted to the


Doppler function described in Section 5. For clarity we display only every 100th sampled value of the chain, but the same patterns would be observed in the full chain history. Similar good mixing properties were observed in all the other examples presented in this article.

In the search for a good software platform, we implemented our algorithms in MATLAB, C, and FORTRAN. Programs can be downloaded from www.biostat.jhsph.edu/~ccrainic. For our application and implementation there were no serious differences between C and FORTRAN, both of which performed better than MATLAB, reducing computation time by a factor of two to three. The reported times in Figure 9 are for our implementation in C.

Given the complexity of the model, we were interested in improving the mixing properties of the Markov chains. The computational trick described in Section 9 was the sine qua non device that allowed us to replace an intractable multivariate MH step with several tractable univariate MH steps. We used normal random-walk proposal distributions centered at the current value, with variances tuned to provide acceptance rates between 30% and 50%.
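One common way to reach such an acceptance band during burn-in is a crude scaling rule; the paper does not specify its tuning scheme, so the following is only our sketch:

```python
import numpy as np

def tune_step(step, acc_rate, lo=0.30, hi=0.50):
    """Burn-in tuning rule: widen the proposal when accepting too often,
    narrow it when accepting too rarely, leave it alone inside the band."""
    if acc_rate > hi:
        return step * 1.5
    if acc_rate < lo:
        return step / 1.5
    return step

s = 0.1
s = tune_step(s, 0.8)    # too many acceptances -> larger steps
s = tune_step(s, 0.1)    # too few acceptances  -> smaller steps
s = tune_step(s, 0.4)    # inside the band      -> unchanged
```

Between tuning rounds the acceptance rate is re-estimated from a fresh batch of MH updates; tuning stops before the samples that are kept, so the chain remains a valid MH sampler.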

11. COMMENTS

Our joint models for a heterogeneous mean and nonconstant variance are designed to extract the maximum amount of information from complex data and to suggest simpler parametric models. Bayesian inference based on MCMC offers a natural and powerful framework for model fitting. However, MCMC in this context is not straightforward, and implementation depends essentially on the ability to transform intractable simulations from high-dimensional distributions into tractable one-dimensional updates, as described in Section 9. Our implementation provides accurate, fast, and reproducible results, and software is available. Actual simulation time increases only linearly with the number of observations, and mixing properties are adequate provided that the proposal distribution variances are tuned for optimal acceptance rates.

ACKNOWLEDGMENTS

Carroll's research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

[Received September 2004. Revised October 2006.]

REFERENCES

Baladandayuthapani, V., Mallick, B. K., and Carroll, R. J. (2005), "Spatially Adaptive Bayesian Regression Splines," Journal of Computational and Graphical Statistics, 14, 378–394.

Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computation and Stochastic Systems" (with discussion), Statistical Science, 10, 3–66.

Carroll, R. J. (2003), "Variances Are Not Always Nuisance Parameters: The 2002 R. A. Fisher Lecture," Biometrics, 59, 211–220.

Carroll, R. J., and Ruppert, D. (1988), Transformation and Weighting in Regression, New York: Chapman and Hall.

Cleveland, W. S., and Grosse, E. (1991), "Computational Methods for Local Regression," Statistics and Computing, 1, 47–62.

Crainiceanu, C., and Ruppert, D. (2004), "Likelihood Ratio Tests in Linear Mixed Models With One Variance Component," Journal of the Royal Statistical Society, Series B, 66, 165–185.

Crainiceanu, C. M., Ruppert, D., and Wand, M. P. (2005), "Bayesian Analysis for Penalized Spline Regression Using WinBUGS," Journal of Statistical Software, 14. Available online at http://www.jstatsoft.org/v14/i14.

Donoho, D. L., and Johnstone, I. M. (1994), "Ideal Spatial Adaptation by Wavelet Shrinkage," Biometrika, 81, 425–455.

Eilers, P. H. C., and Marx, B. D. (1996), "Flexible Smoothing With B-Splines and Penalties" (with discussion), Statistical Science, 11, 89–121.

Eilers, P. H. C., and Marx, B. D. (2002), "Generalized Linear Additive Smooth Structures," Journal of Computational and Graphical Statistics, 11, 758–783.

Eilers, P. H. C., and Marx, B. D. (2004), "Splines, Knots, and Penalties," unpublished manuscript.

Fields Development Team (2006), fields: Tools for Spatial Data, National Center for Atmospheric Research, Boulder, CO. Available online at http://www.cgd.ucar.edu/Software/Fields.

Friedman, J. H. (1991), "Multivariate Adaptive Regression Splines" (with discussion), The Annals of Statistics, 19, 1–141.

Gelman, A. (2006), "Prior Distributions for Variance Parameters in Hierarchical Models," Bayesian Analysis, 1, 515–533.

Gray, R. J. (1994), "Spline-Based Tests in Survival Analysis," Biometrics, 50, 640–652.

Holst, U., Hössjer, O., Björklund, C., Ragnarson, P., and Edner, H. (1996), "Locally Weighted Least Squares Kernel Regression and Statistical Evaluation of LIDAR Measurements," Environmetrics, 7, 401–416.

Jullion, A., and Lambert, P. (2006), Discussion Paper DP0534, UCL.

Kammann, E. E., and Wand, M. P. (2003), "Geoadditive Models," Applied Statistics, 52, 1–18.

Kauermann, G. (2002), "A Note on Bandwidth Selection for Penalised Spline Smoothing," Technical Report 02-13, Department of Statistics, University of Glasgow.

Krivobokova, T., Crainiceanu, C. M., and Kauermann, G. (2007), "Fast Adaptive Penalized Splines," Journal of Computational and Graphical Statistics, to appear.

Lang, S., and Brezger, A. (2004), "Bayesian P-Splines," Journal of Computational and Graphical Statistics, 13, 183–212.

Lang, S., Fronk, E. M., and Fahrmeir, L. (2002), "Function Estimation With Locally Adaptive Dynamic Models," Computational Statistics, 17, 479–499.

Marx, B. D., and Eilers, P. H. C. (1998), "Direct Generalized Additive Modeling With Penalized Likelihood," Computational Statistics & Data Analysis, 28, 193–209.

Natarajan, R., and Kass, R. E. (2000), "Reference Bayesian Methods for Generalized Linear Mixed Models," Journal of the American Statistical Association, 95, 227–237.

Ngo, L., and Wand, M. P. (2004), "Smoothing With Mixed Model Software," Journal of Statistical Software, 9. Available online at http://www.jstatsoft.org/v09/i01.

Nychka, D., Haaland, P., O'Connell, M., and Ellner, S. (1998), "FUNFITS: Data Analysis and Statistical Tools for Estimating Functions," in Case Studies in Environmental Statistics, Lecture Notes in Statistics (vol. 132), eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 159–179. Online at http://www.cgd.ucar.edu/stats/Software/Funfits.

Nychka, D., and Saltzman, N. (1998), "Design of Air Quality Monitoring Networks," in Case Studies in Environmental Statistics, Lecture Notes in Statistics (vol. 132), eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 51–76.

Pintore, A., Speckman, P., and Holmes, C. C. (2006), "Spatially Adaptive Smoothing Splines," Biometrika, 93, 113–125.

Ruppert, D. (1997), "Local Polynomial Regression and its Applications in Environmental Statistics," in Statistics for the Environment (vol. 3), eds. V. Barnett and F. Turkman, Chichester: Wiley, pp. 155–173.

Ruppert, D. (2002), "Selecting the Number of Knots for Penalized Splines," Journal of Computational and Graphical Statistics, 11, 735–757.

Ruppert, D., and Carroll, R. J. (2000), "Spatially-Adaptive Penalties for Spline Fitting," Australian and New Zealand Journal of Statistics, 42, 205–223.

Ruppert, D., Wand, M. P., Holst, U., and Hössjer, O. (1997), "Local Polynomial Variance-Function Estimation," Technometrics, 39, 262–273.

Ruppert, D., Wand, M. P., and Carroll, R. J. (2003), Semiparametric Regression, Cambridge: Cambridge University Press.

Smith, M., and Kohn, R. (1997), "A Bayesian Approach to Nonparametric Bivariate Regression," Journal of the American Statistical Association, 92, 1522–1535.

Wood, S. (2006), Generalized Additive Models: An Introduction With R, New York: Chapman & Hall/CRC.

Page 15: Spatially Adaptive Bayesian Penalized Splines - Department of

Spatially Adaptive Bayesian Penalized Splines 279

Figure 4 Coverage probability of the pointwise 95 credible intervals for the mean function for functionf1(x1 x2) = x1 sin(4πx2) with a constant error standard deviation σ = range(f )4 Probabilities are computedon a 20 times 20 equally spaced grid of points in [0 1]2 The model used was a thin plate spline with K = 100 knotsfor the mean function a log-thin plate spline with K = 100 knots for the variance function and a log-thin platespline with Klowast = 16 knots for the spatially adaptive shrinkage parameter The parameter controlling the degreeof smoothness of the thin-plate spline basis was m = 2

and a range [minus651minus541] The results of Lang and Brezger in their Figure 5(a) indicatethat their ldquoP-splinerdquo method achieved a median log(MSE) of minus600 Our method performsroughly similar to the Bayesian P-spline method of Lang and Brezger and to the cubicthin plate spline (tps) being outperformed only by Lang and Brezgerrsquos adaptive P-splinemethod with two main effects ldquoP-spline mainrdquo We could also add two main effects to ourbivariate adaptive smoother to improve MSE for this example but this is not our concernin this article Again the EEV estimator performed very similarly to our CEV estimator

The last simulation study was done using the function f1(middot middot) where x1 x2 are indepen-dent uniformly distributed in [0 1] with an error standard deviation function

σε(x1 x2) = r

32+ 3r

32x2

1

where r = range(f1) We compared our CEV and EEV estimators described at the beginningof this section and as expected the EEV estimator outperformed the CEV both in terms oflog(MSE) and coverage probabilities More precisely for log(MSE) we obtained a medianof minus526 for CEV and minus535 for EEV an interquartile range [minus537minus515] for CEV and[minus548minus524] for EEV and a range [minus465minus572] for CEV and [minus484minus584] for EEVFor both estimators the general patterns for coverage probabilities were the ones displayedin Figure 4 EEV outperformed CEV in terms of coverage probabilities For example theregions where the nominal level is exceeded for both estimators have a roughly similar

280 C M Crainiceanu et al

Figure 5 Locations of sampling (dots) with 100 knots (circles) in the Noshiro example

shape and location but the EEV coverage probabilities tend to be closer to their nominallevel

Note that our model incorporating nonparametric variance estimation is more generalthan the models considered in the literature However its performance for function m1(middot middot)with moderate spatial variability and constant variance was better than all methods consid-ered by Lang and Brezger in their simulation study (see their Figure 5 panel b) Our methodhas good performance for the function m2(middot middot) being the second best when compared tomethods consider by Lang and Brezger (see their Figure 5 panel a) For the case whenvariance is not constant our adaptive method that estimates the variance nonparametricallyoutperformed our adaptive method that estimates a constant variance

8 THE NOSHIRO EXAMPLE

Noshiro Japan was the site of a major earthquake Much of the damage from thequake was due to soil movement Professor Thomas OrsquoRourke a geotechnical engineerwas investigating the factors that might help predict soil movement during a future quakeOne factor thought to be of importance was the slope of the land To estimate the slopeRuppert (1997) used a dataset with 799 observations where x was longitude and latitudeand y was elevation at locations in Noshiro The object of primary interest is the gradientat least its magnitude and possibly its direction

We used the thin-plate spline models (62) for the mean of y with Km = 100 knotsselected by the maximum separation criterion implemented in the function coverdesign()

of the R package Fields (Nychka Haaland OrsquoConnell and Ellner 1998) The sampling

Spatially Adaptive Bayesian Penalized Splines 281

Figure 6 Posterior mean of the mean regression function for the Noshiro example The model used was a thinplate spline with K = 100 knots for the mean function a log-thin plate spline with K = 100 knots for the variancefunction and a log-thin plate spline with Klowast = 16 knots for the spatially adaptive shrinkage parameter Theparameter controlling the degree of smoothness of the thin-plate spline basis was m = 2

locations and the knots used for fitting are displayed in Figure 5 The log-thin plate splinedescribed in (63) with the same Kε = 100 knots was used to model the variance of theerror process σ 2

ε (xi ) The log-thin plate spline described in (64) with Kb = 16 subknotswas used to model the variance σ 2

b (κκκk) which controls the amount of shrinkage of the b

parameters The subknots were obtained using the coverdesign() function applied to the100 knots

Figure 6 displays the posterior mean of the mean response function. There is a rather sharp peak around (0.42, 0.4), with a slow decay from the peak in the general south direction and a much faster decay in all other directions. The function becomes smoother towards the boundary. This is exactly the type of function for which adaptive spatial smoothing can substantially improve the fit.

Figure 7 shows that over large areas of the map $\sigma_\varepsilon(\cdot)$ is smooth and takes small values. However, in a neighborhood of the peak of the mean function, $\sigma_\varepsilon(\cdot)$ displays two relatively sharp maxima, located NNW and SSE of the peak. We also performed a two-stage frequentist analysis. In the first stage we used a low-rank thin-plate spline for the mean function, and the results from this regression were used to obtain residuals. We then fitted another low-rank thin-plate spline to the absolute values of these residuals. While this map was not identical to the one in Figure 7, it did exhibit the same patterns of heteroscedasticity.

The process $\log \sigma_b^2(\boldsymbol{\kappa}_k)$ controlling the shrinkage of the $b_k$ is displayed in Figure 8.

Figure 7. Posterior mean of the standard deviation $\sigma_\varepsilon(\mathbf{x}_i)$ of the error process for the Noshiro example. The model used was a thin-plate spline with $K = 100$ knots for the mean function, a log-thin-plate spline with $K = 100$ knots for the variance function, and a log-thin-plate spline with $K^* = 16$ knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was $m = 2$.

Figure 8. Posterior mean of the shrinkage process $-\log \sigma_b^2(\boldsymbol{\kappa}_k)$. The model used was a thin-plate spline with $K = 100$ knots for the mean function, a log-thin-plate spline with $K = 100$ knots for the variance function, and a log-thin-plate spline with $K^* = 16$ knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was $m = 2$.

Smaller values of this function correspond to less shrinkage of the $b_k$ towards zero and more local behavior of the smoother. Figure 8 indicates that, in a neighborhood of the peak of the mean function, the shrinkage is smaller, allowing the mean function to change rapidly. Away from the peak the shrinkage is larger, corresponding to a smoother mean.

9. IMPLEMENTATION USING MCMC

We will now provide some details for MCMC simulation of model (2.1), where the variances $\sigma_\varepsilon^2(\mathbf{x}_i)$ and $\sigma_b^2(\kappa_k)$ are modeled by Equations (2.3) and (2.4). The implementation for other models, for example for the multivariate smoothing in Section 6, is similar. Consider independent normal priors for the coefficients of the monomials, $\beta_i \sim N(0, \sigma_{0\beta}^2)$, $\gamma_i \sim N(0, \sigma_{0\gamma}^2)$, $i = 0, \ldots, p$, and $\delta_i \sim N(0, \sigma_{0\delta}^2)$, $i = 0, \ldots, q$, and independent inverse Gamma priors for the variance components, $\sigma_c^2 \sim \mathrm{IGamma}(a_c, b_c)$ and $\sigma_d^2 \sim \mathrm{IGamma}(a_d, b_d)$. Using these priors, many full conditionals of the posterior distribution are easy to derive, while a few have complex multivariate forms. Our implementation of the MCMC using multivariate Metropolis-Hastings steps proved to be unstable, with poor mixing properties. A simple and reliable solution was to change the model by adding error terms to the log-spline models, that is,
$$
\begin{aligned}
\log \sigma_\varepsilon^2(x_i) &= \gamma_0 + \cdots + \gamma_p x_i^p + \sum_{s=1}^{K_\varepsilon} c_s \left(x_i - \kappa_s^\varepsilon\right)_+^p + u_i, \\
\log \sigma_b^2(\kappa_k^m) &= \delta_0 + \cdots + \delta_q \left(\kappa_k^m\right)^q + \sum_{s=1}^{K_b} d_s \left(\kappa_k^m - \kappa_s^b\right)_+^q + v_k,
\end{aligned}
\qquad (9.1)
$$
where $u_i \sim N(0, \sigma_u^2)$ and $v_k \sim N(0, \sigma_v^2)$. This idea was also used for $\sigma_b^2$ by Baladandayuthapani, Mallick, and Carroll (2005). We fixed the values $\sigma_u^2 = \sigma_v^2 = 0.01$, as these variances appear nonidentifiable and a standard deviation of 0.1 is small on a log scale. This device makes computations feasible by replacing sampling from complex full conditionals with simple univariate MH steps.
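To make the structure of (9.1) concrete, the NumPy sketch below (our own illustration, not the authors' code) builds the truncated polynomial design matrix and simulates one draw of the working model for $\log \sigma_\varepsilon^2$; the coefficient values and knot placement are arbitrary assumptions.

```python
import numpy as np

def trunc_poly_basis(x, knots, degree):
    """Columns [1, x, ..., x^p] followed by [(x - kappa_s)_+^p] for each knot:
    the (X_eps, Z_eps) design used in the log-variance model (9.1)."""
    X = np.vander(x, degree + 1, increasing=True)               # monomials
    Z = np.maximum(x[:, None] - knots[None, :], 0.0) ** degree  # truncated polys
    return np.hstack([X, Z])

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
kappa = np.linspace(0.05, 0.95, 20)                  # illustrative knots
C_eps = trunc_poly_basis(x, kappa, degree=2)         # C_eps = (X_eps, Z_eps)
theta_eps = rng.normal(0.0, 0.3, size=C_eps.shape[1])    # (gamma, c), arbitrary
# log sigma_eps^2(x_i) = C_eps theta_eps + u_i, with sigma_u = 0.1 as in the text
log_sig2_eps = C_eps @ theta_eps + rng.normal(0.0, 0.1, size=x.size)
```

The same construction with subknots $\kappa^b_s$ and degree q gives the design for $\log \sigma_b^2$.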

Define $\boldsymbol{\Sigma}_\varepsilon$, $\boldsymbol{\Sigma}_b$, and $\boldsymbol{\theta}_X$ as before, and define $\boldsymbol{\theta}_\varepsilon = (\boldsymbol{\gamma}^T, \mathbf{c}^T)^T$, $\boldsymbol{\theta}_b = (\boldsymbol{\delta}^T, \mathbf{d}^T)^T$, $\mathbf{C}_X = (\mathbf{X}, \mathbf{Z})$, $\mathbf{C}_\varepsilon = (\mathbf{X}_\varepsilon, \mathbf{Z}_\varepsilon)$, and $\mathbf{C}_b = (\mathbf{X}_b, \mathbf{Z}_b)$, where $\mathbf{X}$, $\mathbf{X}_\varepsilon$, and $\mathbf{X}_b$ contain the monomials and $\mathbf{Z}$, $\mathbf{Z}_\varepsilon$, and $\mathbf{Z}_b$ contain the truncated polynomials of the spline models for $m$, $\log \sigma_\varepsilon^2$, and $\log \sigma_b^2$, respectively. Also denote
$$
\boldsymbol{\Sigma}_{0X} = \begin{bmatrix} \sigma_{0\beta}^2 \mathbf{I}_{p+1} & \mathbf{0} \\ \mathbf{0} & \boldsymbol{\Sigma}_b \end{bmatrix}, \quad
\boldsymbol{\Sigma}_{0\varepsilon} = \begin{bmatrix} \sigma_{0\gamma}^2 \mathbf{I}_{p+1} & \mathbf{0} \\ \mathbf{0} & \sigma_c^2 \mathbf{I}_{K_\varepsilon} \end{bmatrix}, \quad
\boldsymbol{\Sigma}_{0b} = \begin{bmatrix} \sigma_{0\delta}^2 \mathbf{I}_{q+1} & \mathbf{0} \\ \mathbf{0} & \sigma_d^2 \mathbf{I}_{K_m} \end{bmatrix}.
$$

The full conditionals of the posterior are detailed below.

1. $[\boldsymbol{\theta}_X] \sim N\!\left(\mathbf{M}_X \mathbf{C}_X^T \boldsymbol{\Sigma}_\varepsilon^{-1} \mathbf{Y},\; \mathbf{M}_X\right)$, where $\mathbf{M}_X = \left(\mathbf{C}_X^T \boldsymbol{\Sigma}_\varepsilon^{-1} \mathbf{C}_X + \boldsymbol{\Sigma}_{0X}^{-1}\right)^{-1}$.

2. $[\boldsymbol{\theta}_\varepsilon] \sim N\!\left(\mathbf{M}_\varepsilon \mathbf{C}_\varepsilon^T \mathbf{Y}_\varepsilon / \sigma_u^2,\; \mathbf{M}_\varepsilon\right)$, where $\mathbf{Y}_\varepsilon = [\log(\sigma_{\varepsilon 1}^2), \ldots, \log(\sigma_{\varepsilon n}^2)]^T$ and $\mathbf{M}_\varepsilon = \left(\mathbf{C}_\varepsilon^T \mathbf{C}_\varepsilon / \sigma_u^2 + \boldsymbol{\Sigma}_{0\varepsilon}^{-1}\right)^{-1}$.


Figure 9. Computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with $K_m$ knots, with the log of the variance of the shrinkage process modeled as another penalized spline with $K_b = 4$ subknots. The log of the error variance is modeled as another penalized spline with $K_\varepsilon = K_m$ knots. Different symbols correspond to increasing numbers of knots: dots ($K_m = 10$), asterisks ($K_m = 20$), circles ($K_m = 30$), x's ($K_m = 40$).

3. $[\boldsymbol{\theta}_b] \sim N\!\left(\mathbf{M}_b \mathbf{C}_b^T \mathbf{Y}_b / \sigma_v^2,\; \mathbf{M}_b\right)$, where $\mathbf{Y}_b = [\log(\sigma_1^2), \ldots, \log(\sigma_{K_m}^2)]^T$ and $\mathbf{M}_b = \left(\mathbf{C}_b^T \mathbf{C}_b / \sigma_v^2 + \boldsymbol{\Sigma}_{0b}^{-1}\right)^{-1}$.

4. $[\sigma_c^2] \sim \mathrm{IGamma}\!\left(a_c + K_\varepsilon/2,\; b_c + \|\mathbf{c}\|^2/2\right)$.

5. $[\sigma_d^2] \sim \mathrm{IGamma}\!\left(a_d + K_b/2,\; b_d + \|\mathbf{d}\|^2/2\right)$.

6. $[\sigma_{\varepsilon i}^2] \propto \sigma_{\varepsilon i}^{-3} \exp\!\left\{ -\dfrac{(y_i - \mu_i)^2}{2\sigma_{\varepsilon i}^2} - \dfrac{[\log(\sigma_{\varepsilon i}^2) - \eta_i]^2}{2\sigma_u^2} \right\}$, where $\mu_i$ and $\eta_i$ are the $i$th components of $\mathbf{C}_X \boldsymbol{\theta}_X$ and $\mathbf{C}_\varepsilon \boldsymbol{\theta}_\varepsilon$, respectively.

7. $[\sigma_k^2] \propto \sigma_k^{-3} \exp\!\left\{ -\dfrac{b_k^2}{2\sigma_k^2} - \dfrac{[\log(\sigma_k^2) - \zeta_k]^2}{2\sigma_v^2} \right\}$, where $\zeta_k$ is the $k$th component of $\mathbf{C}_b \boldsymbol{\theta}_b$.

All of the above conditionals have an explicit form, with the exception of the $n + K_m$ one-dimensional conditionals in items 6 and 7. For these distributions we use the Metropolis-Hastings algorithm with a normal proposal distribution centered at the current value and with small variance.
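Such a univariate random-walk step is easy to code. The sketch below (Python/NumPy, our own illustration; the function name and step size are assumptions, not from the paper) updates a single $\sigma_{\varepsilon i}^2$ by evaluating the log of the unnormalized density in item 6.

```python
import numpy as np

def mh_update_sig2(sig2, y_i, mu_i, eta_i, sig2_u, step, rng):
    """One random-walk Metropolis-Hastings update of a single sigma_eps_i^2,
    targeting the unnormalized full conditional in item 6."""
    def log_target(s2):
        if s2 <= 0.0:                                  # variances must stay positive
            return -np.inf
        return (-1.5 * np.log(s2)                      # the sigma^{-3} factor
                - (y_i - mu_i) ** 2 / (2.0 * s2)
                - (np.log(s2) - eta_i) ** 2 / (2.0 * sig2_u))
    prop = sig2 + step * rng.normal()                  # proposal centered at current value
    if np.log(rng.uniform()) < log_target(prop) - log_target(sig2):
        return prop                                    # accept
    return sig2                                        # reject: keep current value

# Illustrative run: 1,000 updates of one variance parameter
rng = np.random.default_rng(2)
s2 = 1.0
for _ in range(1000):
    s2 = mh_update_sig2(s2, y_i=0.5, mu_i=0.3, eta_i=np.log(0.2),
                        sig2_u=0.01, step=0.05, rng=rng)
```

The update for the $\sigma_k^2$ in item 7 is identical with $(y_i - \mu_i)$ replaced by $b_k$, $\eta_i$ by $\zeta_k$, and $\sigma_u^2$ by $\sigma_v^2$.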

10. IMPLEMENTATION AND COMPUTATIONAL PERFORMANCE

Our MCMC simulation algorithm provided reliable results and is reasonably fast. To illustrate this, Figure 9 displays computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with $K_m$ knots, with the log of the variance of the shrinkage process modeled as another penalized spline with $K_b = 4$ subknots. The log of the error variance is modeled as another penalized spline with $K_\varepsilon = K_m$ knots. Different symbols correspond to increasing numbers of knots: dots ($K_m = 10$), asterisks ($K_m = 20$), circles ($K_m = 30$), x's ($K_m = 40$). Computation time increases roughly linearly with the number of observations, with the same intercept 0 and larger slopes for larger numbers of knots. Note that even with $n = 1000$ observations and $K_m = K_\varepsilon = 40$, simulation time is about two minutes.

Figure 10. MCMC histories for eight parameters of interest, including the log of the standard deviation of the error process and the log of the standard deviation of the shrinkage process, as well as some parameters modeling the different functions. The model uses 40 equally spaced knots for the mean and error variance functions and 4 subknots for the adaptive shrinkage process. Chains correspond to the Doppler function example, using quadratic truncated polynomials to fit the mean and log-variance functions and a linear truncated polynomial basis to fit the shrinkage process. For clarity of presentation, only every 20th simulated observation is displayed.

For the models and datasets presented in this article, 10,000 simulations proved sufficient for accurate and reliable inference. Some parameters, such as the mean and variance functions, tended to exhibit much better mixing properties than the variance components $\sigma_c^2$ and $\sigma_d^2$ controlling the shrinkage processes. This is probably due to the reduced amount of information in the data about these parameters. The MCMC convergence and mixing properties were assessed by visual inspection of the chain histories of many parameters of interest. Because we started with good starting values of the parameters, convergence is fast. Figure 10 displays the histories of eight parameters of interest for the model fitted to the Doppler function described in Section 5. For clarity, we display only every 100th sampled value of the chain, but the same patterns would be observed in the full chain history. Similar good mixing properties were noted in all the other examples presented in this article.

In the search for a good software platform, we implemented our algorithms in MATLAB, C, and FORTRAN. Programs can be downloaded from www.biostat.jhsph.edu/~ccrainic. For our application and implementation there were no serious differences between C and FORTRAN, both of which performed better than MATLAB, reducing computation time by a factor of two to three. The times reported in Figure 9 are for our implementation in C.

Given the complexity of the model, we were interested in improving the mixing properties of the Markov chains. The computational trick described in Section 9 was the sine qua non that allowed us to replace an intractable multivariate MH step with several tractable univariate MH steps. We used normal random-walk proposal distributions centered at the current value, with variances tuned to provide acceptance rates between 30% and 50%.
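One simple way to hit such an acceptance band, sketched below under our own assumptions (the multiplicative factors and the function name tune_step are illustrative, not from the paper), is to adjust each proposal standard deviation during burn-in based on the recent empirical acceptance rate.

```python
def tune_step(step, accept_rate, lo=0.30, hi=0.50):
    """Burn-in tuning rule: shrink the random-walk step when the empirical
    acceptance rate falls below the target band, grow it when above, and
    leave it unchanged inside [lo, hi]."""
    if accept_rate < lo:
        return step * 0.8      # too few acceptances: propose smaller moves
    if accept_rate > hi:
        return step * 1.25     # too many acceptances: propose bolder moves
    return step
```

Tuning would typically be frozen after burn-in so that the chain targets the correct stationary distribution.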

11. COMMENTS

Our joint models for a heterogeneous mean and nonconstant variance are designed to extract the maximum amount of information from complex data and to suggest simpler parametric models. Bayesian inference based on MCMC offers a natural and powerful framework for model fitting. However, MCMC in this context is not straightforward, and implementation depends essentially on the ability to transform intractable simulations from high-dimensional distributions into tractable one-dimensional updates, as described in Section 9. Our implementation provides accurate, fast, and reproducible results, and software is available. Simulation time increases only linearly with the number of observations, and mixing properties are adequate provided that the proposal distribution variances are tuned for optimal acceptance rates.

ACKNOWLEDGMENTS

Carroll's research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

[Received September 2004. Revised October 2006.]

REFERENCES

Baladandayuthapani, V., Mallick, B. K., and Carroll, R. J. (2005), "Spatially Adaptive Bayesian Regression Splines," Journal of Computational and Graphical Statistics, 14, 378-394.

Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computation and Stochastic Systems" (with discussion), Statistical Science, 10, 3-66.

Carroll, R. J. (2003), "Variances Are Not Always Nuisance Parameters: The 2002 R. A. Fisher Lecture," Biometrics, 59, 211-220.

Carroll, R. J., and Ruppert, D. (1988), Transformation and Weighting in Regression, New York: Chapman and Hall.

Cleveland, W. S., and Grosse, E. (1991), "Computational Methods for Local Regression," Statistics and Computing, 1, 47-62.

Crainiceanu, C., and Ruppert, D. (2004), "Likelihood Ratio Tests in Linear Mixed Models With One Variance Component," Journal of the Royal Statistical Society, Series B, 66, 165-185.

Crainiceanu, C. M., Ruppert, D., and Wand, M. P. (2005), "Bayesian Analysis for Penalized Spline Regression Using WinBUGS," Journal of Statistical Software, 14. Available online at http://www.jstatsoft.org/v14/i14.

Donoho, D. L., and Johnstone, I. M. (1994), "Ideal Spatial Adaptation by Wavelet Shrinkage," Biometrika, 81, 425-455.

Eilers, P. H. C., and Marx, B. D. (1996), "Flexible Smoothing With B-Splines and Penalties" (with discussion), Statistical Science, 11, 89-121.

--- (2002), "Generalized Linear Additive Smooth Structures," Journal of Computational and Graphical Statistics, 11, 758-783.

--- (2004), "Splines, Knots, and Penalties," unpublished manuscript.

Fields Development Team (2006), fields: Tools for Spatial Data, National Center for Atmospheric Research, Boulder, CO. Available online at http://www.cgd.ucar.edu/Software/Fields.

Friedman, J. H. (1991), "Multivariate Adaptive Regression Splines" (with discussion), The Annals of Statistics, 19, 1-141.

Gelman, A. (2006), "Prior Distributions for Variance Parameters in Hierarchical Models," Bayesian Analysis, 1, 515-533.

Gray, R. J. (1994), "Spline-Based Tests in Survival Analysis," Biometrics, 50, 640-652.

Holst, U., Hössjer, O., Björklund, C., Ragnarson, P., and Edner, H. (1996), "Locally Weighted Least Squares Kernel Regression and Statistical Evaluation of LIDAR Measurements," Environmetrics, 7, 401-416.

Jullion, A., and Lambert, P. (2006), Discussion Paper DP0534, UCL.

Kammann, E. E., and Wand, M. P. (2003), "Geoadditive Models," Applied Statistics, 52, 1-18.

Kauermann, G. (2002), "A Note on Bandwidth Selection for Penalised Spline Smoothing," Technical Report 02-13, Department of Statistics, University of Glasgow.

Krivobokova, T., Crainiceanu, C. M., and Kauermann, G. (2007), "Fast Adaptive Penalized Splines," Journal of Computational and Graphical Statistics, to appear.

Lang, S., and Brezger, A. (2004), "Bayesian P-Splines," Journal of Computational and Graphical Statistics, 13, 183-212.

Lang, S., Fronk, E. M., and Fahrmeir, L. (2002), "Function Estimation With Locally Adaptive Dynamic Models," Computational Statistics, 17, 479-499.

Marx, B. D., and Eilers, P. H. C. (1998), "Direct Generalized Additive Modeling With Penalized Likelihood," Computational Statistics & Data Analysis, 28, 193-209.

Natarajan, R., and Kass, R. E. (2000), "Reference Bayesian Methods for Generalized Linear Mixed Models," Journal of the American Statistical Association, 95, 227-237.

Ngo, L., and Wand, M. P. (2004), "Smoothing With Mixed Model Software," Journal of Statistical Software, 9. Available online at http://www.jstatsoft.org/v09/i01.

Nychka, D., Haaland, P., O'Connell, M., and Ellner, S. (1998), "FUNFITS: Data Analysis and Statistical Tools for Estimating Functions," in Case Studies in Environmental Statistics, Lecture Notes in Statistics, vol. 132, eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 159-179. Available online at http://www.cgd.ucar.edu/stats/Software/Funfits.

Nychka, D., and Saltzman, N. (1998), "Design of Air Quality Monitoring Networks," in Case Studies in Environmental Statistics, Lecture Notes in Statistics, vol. 132, eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 51-76.

Pintore, A., Speckman, P., and Holmes, C. C. (2006), "Spatially Adaptive Smoothing Splines," Biometrika, 93, 113-125.

Ruppert, D. (1997), "Local Polynomial Regression and its Applications in Environmental Statistics," in Statistics for the Environment (vol. 3), eds. V. Barnett and F. Turkman, Chichester: Wiley, pp. 155-173.

--- (2002), "Selecting the Number of Knots for Penalized Splines," Journal of Computational and Graphical Statistics, 11, 735-757.

Ruppert, D., and Carroll, R. J. (2000), "Spatially-Adaptive Penalties for Spline Fitting," Australian and New Zealand Journal of Statistics, 42, 205-223.

Ruppert, D., Wand, M. P., Holst, U., and Hössjer, O. (1997), Technometrics, 39, 262-273.

Ruppert, D., Wand, M. P., and Carroll, R. J. (2003), Semiparametric Regression, Cambridge: Cambridge University Press.

Smith, M., and Kohn, R. (1997), "A Bayesian Approach to Nonparametric Bivariate Regression," Journal of the American Statistical Association, 92, 1522-1535.

Wood, S. (2006), Generalized Additive Models: An Introduction With R, New York: Chapman & Hall/CRC.


Noshiro Japan was the site of a major earthquake Much of the damage from thequake was due to soil movement Professor Thomas OrsquoRourke a geotechnical engineerwas investigating the factors that might help predict soil movement during a future quakeOne factor thought to be of importance was the slope of the land To estimate the slopeRuppert (1997) used a dataset with 799 observations where x was longitude and latitudeand y was elevation at locations in Noshiro The object of primary interest is the gradientat least its magnitude and possibly its direction

We used the thin-plate spline models (62) for the mean of y with Km = 100 knotsselected by the maximum separation criterion implemented in the function coverdesign()

of the R package Fields (Nychka Haaland OrsquoConnell and Ellner 1998) The sampling

Spatially Adaptive Bayesian Penalized Splines 281

Figure 6 Posterior mean of the mean regression function for the Noshiro example The model used was a thinplate spline with K = 100 knots for the mean function a log-thin plate spline with K = 100 knots for the variancefunction and a log-thin plate spline with Klowast = 16 knots for the spatially adaptive shrinkage parameter Theparameter controlling the degree of smoothness of the thin-plate spline basis was m = 2

locations and the knots used for fitting are displayed in Figure 5 The log-thin plate splinedescribed in (63) with the same Kε = 100 knots was used to model the variance of theerror process σ 2

ε (xi ) The log-thin plate spline described in (64) with Kb = 16 subknotswas used to model the variance σ 2

b (κκκk) which controls the amount of shrinkage of the b

parameters The subknots were obtained using the coverdesign() function applied to the100 knots

Figure 6 displays the posterior mean of the mean response function There is a rathersharp peak around (042 04) with a slow decay from the peak in the general south directionand a much faster decay in all other directions The function becomes smoother towardsthe boundary This is exactly the type of function for which adaptive spatial smoothing cansubstantially improve the fit

Figure 7 shows that in large areas of the map σε(middot) is smooth and with small valuesHowever in a neighborhood of the peak of the mean function σε(middot) displays two relativelysharp maxima located NNW and SSE of the peak We also performed a two-stage frequentistanalysis In the first stage we used a low rank thin plate spline for the mean function and theresults from this regression were used to obtain residuals We then fitted another low rankthin plate spline to the absolute values of these residuals While this map was not identicalto the one in Figure 7 it did exhibit the same patterns of heteroscedasticity

The process logσ 2b (κκκk) controlling the shrinkage of the bk is displayed in Figure 8

282 C M Crainiceanu et al

Figure 7 Posterior mean of the standard deviation σε(xi ) of the error process function for the Noshiro exampleThe model used was a thin plate spline with K = 100 knots for the mean function a log-thin plate spline withK = 100 knots for the variance function and a log-thin plate spline with Klowast = 16 knots for the spatially adaptiveshrinkage parameter The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2

Figure 8 Posterior mean of the shrinkage process minus logσ 2b(κκκk) The model used was a thin plate spline with

K = 100 knots for the mean function a log-thin plate spline with K = 100 knots for the variance function and alog-thin plate spline with Klowast = 16 knots for the spatially adaptive shrinkage parameter The parameter controllingthe degree of smoothness of the thin-plate spline basis was m = 2

Spatially Adaptive Bayesian Penalized Splines 283

Smaller values of this function correspond to less shrinkage of bk towards zero and morelocal behavior of the smoother Figure 8 indicates that in a neighborhood of the peak of themean function the shrinkage is smaller to allow the mean function to change rapidly Awayfrom the peak the shrinkage is larger corresponding to a smoother mean

9 IMPLEMENTATION USING MCMC

We will now provide some details for MCMC simulations of model (21) wherethe variances σ 2

ε (xi) and σ 2b (κk) are modeled by Equations (23) and (24) The imple-

mentation for other models for example for the multivariate smoothing in Section 6is similar Consider independent normal priors for the coefficients of the monomialsβi sim N(0 σ 2

0β) γi sim N(0 σ 20γ ) i = 0 p δi sim N(0 σ 2

0δ) i = 0 q and in-

dependent inverse Gamma priors for the variance components σ 2c sim IGamma(ac bc) and

σ 2d sim IGamma(ad bd) Using these priors many full conditionals of the posterior distribu-

tion are easy to derive while a few have complex multivariate forms Our implementationof the MCMC using multivariate Metropolis-Hastings steps proved to be unstable with poormixing properties A simple and reliable solution was to change the model by adding errorterms to the log-spline models that is

logσ 2ε (xi)

= γ0 + middot middot middot + γpxpi +sumKε

s=1 cs(xi minus κεs )

p+ + ui

logσ 2b (κ

mk ) = δ0 + middot middot middot + δq(κ

mk )q +sumKb

s=1 ds(κmk minus κb

s )q+ + vk

(91)

where ui sim N(0 σ 2u ) and vk sim N(0 σ 2

v ) This idea was also used for σ 2b by Baladan-

dayuthapani Mallick and Carroll (2005) We fixed the values of σ 2u = σ 2

v = 001 as thesevariances appear nonidentifiable and a standard deviation of 01 is small on a log-scale Thisdevice makes computations feasible by replacing sampling from complex full conditionalsby simple univariate MH steps

Define as before by ε b θθθX and define θθθε = (γγγ T cT )T θθθb = (δδδT dT )T CX =(X Z) Cε = (Xε Zε) Cb = (Xb Zb) where X Xε Xb contain the monomials and ZZε Zb contain the truncated polynomials of the spline models for m logσ 2

ε and logσ 2b

respectively Also denote by

0X =[

σ 20βIp+1 0

0 b

]

0ε =[

σ 20γ Ip+1 0

0 σ 2c IKε

]

0b =[

σ 20δIq+1 0

0 σ 2d IKm

]

The full conditionals of the posterior are detailed below

1 [θθθX] sim N(MXCT

Xminus1ε YMX

) where MX =

(CT

Xminus1ε CX + minus1

0X

)minus1

2 [θθθε] sim N(MεCT

ε Yεσ2u Mε

) where Yε = [log(σ 2

ε1) log(σ 2εn)]

T and Mε

= (CTε Cεσ

2u +minus1

0ε )minus1

284 C M Crainiceanu et al

Figure 9 Computation time (in minutes) versus sample size for 10000 simulations from the joint posteriordistribution given the data The mean function is modeled as a penalized spline with Km knots with the log of thevariance of the shrinkage process modeled as another penalized spline with Kb = 4 subknots The log of the errorvariance is modeled as another penalized spline with Kε = Km knots Different symbols correspond to increasingnumber of knots dots (Km = 10) asterisks (Km = 20) circles (Km = 30) xrsquos (Km = 40)

3 [θθθb] sim N(MbCT

b Ybσ2v Mb

)where Yb =

[log(σ 2

1 ) log(σ 2Km

)]T

and Mb =(CT

b Cbσ2v +minus1

0b )minus1

4 [σ 2c ] sim IGamma

(ac + Kε2 bc + ||c||22

)

5 [σ 2d ] sim IGamma

(ad + Kb2 bd + ||d||22

)

6 [σ 2εi] prop σminus3

εi exp

minus(yi minus microi)

2(2σ 2εi ) minus

[log(σ 2

εi ) minus ηi)]2

(2σ 2u )

where microi

and ηi are the ith components of CXθθθX and Cεθθθε respectively

7 [σ 2k ] prop σminus3

k expminusb2

k(2σ2k ) minus [

log(σ 2k ) minus ζk)

]2(σ 2

v ) where ζk is the kth com-

ponent of Cbθθθb

All the above conditionals have an explicit form with the exception of the n + Km

one-dimensional conditionals from 6 and 7 For these distributions we use the Metropolis-Hastings algorithm with a normal proposal distribution centered at the current value andsmall variance

10 IMPLEMENTATION AND COMPUTATIONALPERFORMANCE

Our MCMC simulation algorithm provided reliable results and is reasonably fast To il-lustrate this Figure 9 displays computation time (in minutes) versus sample size for 10000

Spatially Adaptive Bayesian Penalized Splines 285

Figure 10 MCMC histories for eight parameters of interest including the log of standard deviation of the errorprocess and log of standard deviation of the shrinkage process as well as some parameters modeling the differentfunctions The model uses 40 equally spaced knots for the mean and error variance functions and 4 subknotsfor the adaptive shrinkage process Chains correspond to the Doppler function example using quadratic truncatedpolynomials to fit the mean and log variance function and linear truncated polynomial basis for fitting the shrinkageprocess For clarity of presentation only every 20th simulated observation is displayed

simulations from the joint posterior distribution given the data The mean function is mod-eled as a penalized spline with Km knots with the log of the variance of the shrinkageprocess modeled as another penalized spline with Kb = 4 subknots The log of the errorvariance is modeled as another penalized spline with Kε = Km knots Different symbolscorrespond to increasing number of knots dots (Km = 10) asterisks (Km = 20) circles(Km = 30) xrsquos (Km = 40) Computation time increases roughly linearly with the numberof observations with the same intercept 0 and larger slopes for larger number of knotsNote that even with n = 1000 observations and Km = Kε = 40 simulation time is abouttwo minutes

For the models and datasets presented in this article 10000 simulations proved suffi-cient for accurate and reliable inference Some parameters such as the mean and variancefunctions tended to exhibit much better mixing properties than the variance componentsσ 2c and σ 2

d controlling the shrinkage processes This is probably due to the reduced amountof information in the data about these parameters The MCMC convergence and mixingproperties were assessed by visual inspection of the chain histories of many parameters ofinterest Since we started with good starting values of the parameters convergence is fastFigure 10 displays the histories of eight parameters of interest for the model fitted to the

286 C M Crainiceanu et al

Doppler function described in Section 5 For clarity we display only every 100th sampledvalued of the chain but the same patterns would be observed in the full chain history Similargood mixing properties have been noted in all the other examples presented in this article

In the search for a good software platform we implemented our algorithms in MATLABC and FORTRAN Programs can be downloaded from wwwbiostatjhsphedu simccrainicFor our application and implementation there were no serious differences between C andFORTRAN which have performed better than MATLAB by reducing computation timetwo to three times The reported times in Figure 9 are for our implementation in C

Given the complexity of the model we were interested in improving the mixing prop-erties of the Markov Chains The computational trick described in Section 9 was the sinequa non method that allowed us to replace an intractable multivariate MH step with severaltractable univariate MH steps We used normal random walk proposal distributions centeredat the current value and variance tuned to provide acceptance rates between 30 and 50

11 COMMENTS

Our joint models for heterogeneous mean and nonconstant variance are designed to ex-tract the maximum amount of information from complex data and suggest simpler paramet-ric models Bayesian inference based on MCMC offers a natural and powerful frameworkfor model fitting However MCMC in this context is not straightforward and implemen-tation depends essentially on the ability of transforming intractable simulations from highdimensional distributions into tractable one dimensional updates as described in Section9 Our implementation provides accurate fast and reproducible results while software isavailable Actual simulation time increases only linearly with the number of observationsand mixing properties are adequate provided that proposal distribution variances are tunedfor optimal acceptance rates

ACKNOWLEDGMENTS

Carrollrsquos research was supported by a grant from the National Cancer Institute (CA-57030) and by the TexasAampM Center for Environmental and Rural Health via a grant from the National Institute of Environmental HealthSciences (P30-ES09106)

[Received September 2004 Revised October 2006]

REFERENCES

Baladandayuthapani V Mallick B K and Carroll R J (2005) ldquoSpatially Adaptive Bayesian RegressionSplinesrdquo Journal of Computational and Graphical Statistics 14 378ndash394

Besag J Green P Higdon D and Mengersen K (1995) ldquoBayesian Computation and Stochastic Systemsrdquo(with discussions) Statistical Science 10 3ndash66

Carroll R J (2003) ldquoVariances Are Not Always Nuisance Parameters The 2002 R A Fisher Lecturerdquo Biometrics59 211ndash220

Carroll R J and Ruppert D (1988) Transformation and Weighting in Regression New York Chapman andHall

Spatially Adaptive Bayesian Penalized Splines 287

Cleveland W S and Grosse E (1991) ldquoComputational Methods for Local Regressionrdquo Statistics and Computing1 47ndash62

Crainiceanu C and Ruppert D (2004) ldquoLikelihood Ratio Tests in Linear Mixed Models With One VarianceComponentrdquo Journal of the Royal Statistical Society Series B 66 165ndash185

Crainiceanu C M Ruppert D and Wand M P (2005) ldquoBayesian Analysis for Penalized Spline Regressionusing WinBUGSrdquo Journal of Statistical Software 14 Available online at http wwwjstatsoftorg v14 i14

Donoho D L and Johnstone I M (1994) ldquoIdeal Spatial Adaptation by Wavelet Shrinkagerdquo Biometrika 81425ndash455

Eilers P H C and Marx B D (1996) ldquoFlexible Smoothing With B-Splines and Penaltiesrdquo (with discussion)Statistical Science 11 89ndash121

(2002) ldquoGeneralized Linear Additive Smooth Structuresrdquo Journal of Computational and Graphical Statis-tics 11 758ndash783


Spatially Adaptive Bayesian Penalized Splines 281

Figure 6. Posterior mean of the mean regression function for the Noshiro example. The model used was a thin-plate spline with K = 100 knots for the mean function, a log-thin-plate spline with K = 100 knots for the variance function, and a log-thin-plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

The locations and the knots used for fitting are displayed in Figure 5. The log-thin-plate spline described in (6.3), with the same $K_\varepsilon = 100$ knots, was used to model the variance of the error process, $\sigma_\varepsilon^2(x_i)$. The log-thin-plate spline described in (6.4), with $K_b = 16$ subknots, was used to model the variance $\sigma_b^2(\kappa_k)$, which controls the amount of shrinkage of the $b$ parameters. The subknots were obtained by applying the coverdesign() function to the 100 knots.

Figure 6 displays the posterior mean of the mean response function. There is a rather sharp peak around (0.42, 0.4), with a slow decay from the peak in the general south direction and a much faster decay in all other directions. The function becomes smoother towards the boundary. This is exactly the type of function for which adaptive spatial smoothing can substantially improve the fit.

Figure 7 shows that in large areas of the map $\sigma_\varepsilon(\cdot)$ is smooth, with small values. However, in a neighborhood of the peak of the mean function, $\sigma_\varepsilon(\cdot)$ displays two relatively sharp maxima located NNW and SSE of the peak. We also performed a two-stage frequentist analysis. In the first stage we used a low-rank thin-plate spline for the mean function, and the results from this regression were used to obtain residuals. We then fitted another low-rank thin-plate spline to the absolute values of these residuals. While this map was not identical to the one in Figure 7, it did exhibit the same patterns of heteroscedasticity.

The process $\log \sigma_b^2(\kappa_k)$ controlling the shrinkage of the $b_k$ is displayed in Figure 8.

282 C M Crainiceanu et al

Figure 7. Posterior mean of the standard deviation $\sigma_\varepsilon(x_i)$ of the error process for the Noshiro example. The model used was a thin-plate spline with K = 100 knots for the mean function, a log-thin-plate spline with K = 100 knots for the variance function, and a log-thin-plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.

Figure 8. Posterior mean of the shrinkage process $-\log \sigma_b^2(\kappa_k)$. The model used was a thin-plate spline with K = 100 knots for the mean function, a log-thin-plate spline with K = 100 knots for the variance function, and a log-thin-plate spline with K* = 16 knots for the spatially adaptive shrinkage parameter. The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2.


Smaller values of this function correspond to less shrinkage of $b_k$ towards zero and more local behavior of the smoother. Figure 8 indicates that in a neighborhood of the peak of the mean function the shrinkage is smaller, allowing the mean function to change rapidly. Away from the peak the shrinkage is larger, corresponding to a smoother mean.

9. IMPLEMENTATION USING MCMC

We will now provide some details for MCMC simulations of model (2.1), where the variances $\sigma_\varepsilon^2(x_i)$ and $\sigma_b^2(\kappa_k)$ are modeled by Equations (2.3) and (2.4). The implementation for other models, for example for the multivariate smoothing in Section 6, is similar. Consider independent normal priors for the coefficients of the monomials, $\beta_i \sim N(0, \sigma_{0\beta}^2)$, $\gamma_i \sim N(0, \sigma_{0\gamma}^2)$, $i = 0, \ldots, p$, $\delta_i \sim N(0, \sigma_{0\delta}^2)$, $i = 0, \ldots, q$, and independent inverse-gamma priors for the variance components, $\sigma_c^2 \sim \mathrm{IGamma}(a_c, b_c)$ and $\sigma_d^2 \sim \mathrm{IGamma}(a_d, b_d)$. Using these priors, many full conditionals of the posterior distribution are easy to derive, while a few have complex multivariate forms. Our implementation of the MCMC using multivariate Metropolis–Hastings steps proved to be unstable with poor mixing properties. A simple and reliable solution was to change the model by adding error terms to the log-spline models, that is,

$$
\begin{aligned}
\log \sigma_\varepsilon^2(x_i) &= \gamma_0 + \cdots + \gamma_p x_i^p + \sum_{s=1}^{K_\varepsilon} c_s (x_i - \kappa_s^\varepsilon)_+^p + u_i, \\
\log \sigma_b^2(\kappa_k^m) &= \delta_0 + \cdots + \delta_q (\kappa_k^m)^q + \sum_{s=1}^{K_b} d_s (\kappa_k^m - \kappa_s^b)_+^q + v_k,
\end{aligned}
\qquad (9.1)
$$

where $u_i \sim N(0, \sigma_u^2)$ and $v_k \sim N(0, \sigma_v^2)$. This idea was also used for $\sigma_b^2$ by Baladandayuthapani, Mallick, and Carroll (2005). We fixed the values $\sigma_u^2 = \sigma_v^2 = 0.01$, as these variances appear nonidentifiable and a standard deviation of 0.1 is small on a log scale. This device makes computations feasible by replacing sampling from complex full conditionals with simple univariate MH steps.
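As a concrete illustration, the truncated polynomial design matrices underlying models such as (9.1) can be assembled as follows. This is a minimal sketch under our own naming conventions (the function name and NumPy usage are not from the paper), valid for degree $p \geq 1$:

```python
import numpy as np

def truncated_power_basis(x, knots, degree):
    """Design matrix C = [X, Z] for a degree-p truncated polynomial spline:
    monomial columns 1, x, ..., x^p followed by one (x - kappa_s)_+^p
    column per knot (assumes degree >= 1)."""
    x = np.asarray(x, dtype=float)
    X = np.vander(x, degree + 1, increasing=True)                 # monomials
    Z = np.maximum(x[:, None] - np.asarray(knots)[None, :], 0.0) ** degree
    return np.hstack([X, Z])

# Evaluating a log-variance curve log sigma^2_eps(x_i) = C_eps theta_eps:
x = np.linspace(0.0, 1.0, 5)
C_eps = truncated_power_basis(x, knots=[0.25, 0.5, 0.75], degree=2)
theta_eps = np.zeros(C_eps.shape[1])   # placeholder coefficients
log_sigma2_eps = C_eps @ theta_eps
```

The same construction, applied to the subknots, gives the design matrix for the shrinkage process in the second line of (9.1).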

Define, as before, $\varepsilon$, $b$, and $\theta_X$, and define $\theta_\varepsilon = (\gamma^T, c^T)^T$, $\theta_b = (\delta^T, d^T)^T$, $C_X = (X, Z)$, $C_\varepsilon = (X_\varepsilon, Z_\varepsilon)$, $C_b = (X_b, Z_b)$, where $X$, $X_\varepsilon$, $X_b$ contain the monomials and $Z$, $Z_\varepsilon$, $Z_b$ contain the truncated polynomials of the spline models for $m$, $\log \sigma_\varepsilon^2$, and $\log \sigma_b^2$, respectively. Also denote

$$
\Sigma_{0X} = \begin{bmatrix} \sigma_{0\beta}^2 I_{p+1} & 0 \\ 0 & \Sigma_b \end{bmatrix}, \qquad
\Sigma_{0\varepsilon} = \begin{bmatrix} \sigma_{0\gamma}^2 I_{p+1} & 0 \\ 0 & \sigma_c^2 I_{K_\varepsilon} \end{bmatrix}, \qquad
\Sigma_{0b} = \begin{bmatrix} \sigma_{0\delta}^2 I_{q+1} & 0 \\ 0 & \sigma_d^2 I_{K_m} \end{bmatrix}.
$$

The full conditionals of the posterior are detailed below.

1. $[\theta_X] \sim N(M_X C_X^T \Sigma_\varepsilon^{-1} Y,\, M_X)$, where $M_X = (C_X^T \Sigma_\varepsilon^{-1} C_X + \Sigma_{0X}^{-1})^{-1}$.

2. $[\theta_\varepsilon] \sim N(M_\varepsilon C_\varepsilon^T Y_\varepsilon / \sigma_u^2,\, M_\varepsilon)$, where $Y_\varepsilon = [\log(\sigma_{\varepsilon 1}^2), \ldots, \log(\sigma_{\varepsilon n}^2)]^T$ and $M_\varepsilon = (C_\varepsilon^T C_\varepsilon / \sigma_u^2 + \Sigma_{0\varepsilon}^{-1})^{-1}$.


Figure 9. Computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with $K_m$ knots, with the log of the variance of the shrinkage process modeled as another penalized spline with $K_b = 4$ subknots. The log of the error variance is modeled as another penalized spline with $K_\varepsilon = K_m$ knots. Different symbols correspond to increasing numbers of knots: dots ($K_m = 10$), asterisks ($K_m = 20$), circles ($K_m = 30$), x's ($K_m = 40$).

3. $[\theta_b] \sim N(M_b C_b^T Y_b / \sigma_v^2,\, M_b)$, where $Y_b = [\log(\sigma_1^2), \ldots, \log(\sigma_{K_m}^2)]^T$ and $M_b = (C_b^T C_b / \sigma_v^2 + \Sigma_{0b}^{-1})^{-1}$.

4. $[\sigma_c^2] \sim \mathrm{IGamma}(a_c + K_\varepsilon/2,\, b_c + \|c\|^2/2)$.

5. $[\sigma_d^2] \sim \mathrm{IGamma}(a_d + K_b/2,\, b_d + \|d\|^2/2)$.

6. $[\sigma_{\varepsilon i}^2] \propto \sigma_{\varepsilon i}^{-3} \exp\{-(y_i - \mu_i)^2/(2\sigma_{\varepsilon i}^2) - [\log(\sigma_{\varepsilon i}^2) - \eta_i]^2/(2\sigma_u^2)\}$, where $\mu_i$ and $\eta_i$ are the $i$th components of $C_X \theta_X$ and $C_\varepsilon \theta_\varepsilon$, respectively.

7. $[\sigma_k^2] \propto \sigma_k^{-3} \exp\{-b_k^2/(2\sigma_k^2) - [\log(\sigma_k^2) - \zeta_k]^2/(2\sigma_v^2)\}$, where $\zeta_k$ is the $k$th component of $C_b \theta_b$.

All the above conditionals have an explicit form, with the exception of the $n + K_m$ one-dimensional conditionals in items 6 and 7. For these distributions we use the Metropolis–Hastings algorithm with a normal proposal distribution centered at the current value and with small variance.
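For instance, the conjugate draw in item 4 and one univariate MH update from item 6 might be coded as below. This is a sketch under our own naming conventions (the paper's released code is in MATLAB, C, and FORTRAN), with the inverse-gamma draw implemented as the reciprocal of a gamma variate:

```python
import numpy as np

def log_cond_sigma2_eps(s2, y_i, mu_i, eta_i, sigma2_u):
    """Log of the unnormalized full conditional in item 6:
    s2^{-3/2} exp{-(y_i - mu_i)^2/(2 s2) - [log(s2) - eta_i]^2/(2 sigma2_u)}."""
    if s2 <= 0.0:
        return -np.inf
    return (-1.5 * np.log(s2)
            - (y_i - mu_i) ** 2 / (2.0 * s2)
            - (np.log(s2) - eta_i) ** 2 / (2.0 * sigma2_u))

def mh_update_sigma2_eps(s2, y_i, mu_i, eta_i, sigma2_u, prop_sd, rng):
    """One Metropolis-Hastings step with a normal proposal centered at the
    current value; negative proposals are rejected via zero density."""
    s2_prop = s2 + rng.normal(0.0, prop_sd)
    log_ratio = (log_cond_sigma2_eps(s2_prop, y_i, mu_i, eta_i, sigma2_u)
                 - log_cond_sigma2_eps(s2, y_i, mu_i, eta_i, sigma2_u))
    return (s2_prop, True) if np.log(rng.uniform()) < log_ratio else (s2, False)

def gibbs_sigma2_c(a_c, b_c, c, rng):
    """Conjugate draw in item 4: IGamma(a_c + K_eps/2, b_c + ||c||^2/2),
    sampled as the reciprocal of a gamma variate."""
    shape = a_c + len(c) / 2.0
    rate = b_c + 0.5 * float(np.dot(c, c))
    return 1.0 / rng.gamma(shape, 1.0 / rate)
```

Within one sweep, the MH step is applied to each of the $n$ conditionals in item 6; the analogous step for item 7 only changes the two terms in the exponent.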

10. IMPLEMENTATION AND COMPUTATIONAL PERFORMANCE

Our MCMC simulation algorithm provided reliable results and is reasonably fast. To illustrate this, Figure 9 displays computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with $K_m$ knots, with the log of the variance of the shrinkage process modeled as another penalized spline with $K_b = 4$ subknots. The log of the error variance is modeled as another penalized spline with $K_\varepsilon = K_m$ knots. Different symbols correspond to increasing numbers of knots: dots ($K_m = 10$), asterisks ($K_m = 20$), circles ($K_m = 30$), x's ($K_m = 40$). Computation time increases roughly linearly with the number of observations, with the same intercept of zero and larger slopes for larger numbers of knots. Note that even with $n = 1{,}000$ observations and $K_m = K_\varepsilon = 40$, simulation time is about two minutes.

Figure 10. MCMC histories for eight parameters of interest, including the log of the standard deviation of the error process and the log of the standard deviation of the shrinkage process, as well as some parameters modeling the different functions. The model uses 40 equally spaced knots for the mean and error variance functions and 4 subknots for the adaptive shrinkage process. Chains correspond to the Doppler function example, using quadratic truncated polynomials to fit the mean and log-variance functions and a linear truncated polynomial basis for fitting the shrinkage process. For clarity of presentation, only every 20th simulated observation is displayed.

For the models and datasets presented in this article, 10,000 simulations proved sufficient for accurate and reliable inference. Some parameters, such as the mean and variance functions, tended to exhibit much better mixing properties than the variance components $\sigma_c^2$ and $\sigma_d^2$ controlling the shrinkage processes. This is probably due to the reduced amount of information in the data about these parameters. The MCMC convergence and mixing properties were assessed by visual inspection of the chain histories of many parameters of interest. Since we started with good starting values of the parameters, convergence is fast. Figure 10 displays the histories of eight parameters of interest for the model fitted to the Doppler function described in Section 5. For clarity we display only every 100th sampled value of the chain, but the same patterns would be observed in the full chain history. Similar good mixing properties were noted in all the other examples presented in this article.

In the search for a good software platform, we implemented our algorithms in MATLAB, C, and FORTRAN. Programs can be downloaded from www.biostat.jhsph.edu/~ccrainic. For our application and implementation there were no serious differences between C and FORTRAN, which performed better than MATLAB by reducing computation time two to three times. The reported times in Figure 9 are for our implementation in C.

Given the complexity of the model, we were interested in improving the mixing properties of the Markov chains. The computational trick described in Section 9 was the sine qua non that allowed us to replace an intractable multivariate MH step with several tractable univariate MH steps. We used normal random-walk proposal distributions centered at the current value, with variance tuned to provide acceptance rates between 30% and 50%.
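A simple burn-in heuristic for this kind of tuning is sketched below; the multiplicative factor and the exact band edges are our choices, since the paper only states the 30–50% target:

```python
import numpy as np

def tune_proposal_sd(sd, accept_rate, low=0.30, high=0.50, factor=1.1):
    """Widen the random-walk proposal when acceptance is above the target
    band (moves too timid), shrink it when below (moves too bold)."""
    if accept_rate > high:
        return sd * factor
    if accept_rate < low:
        return sd / factor
    return sd

# Toy illustration: tune a random-walk chain targeting a standard normal.
rng = np.random.default_rng(1)
x, sd = 0.0, 5.0
for _ in range(50):                      # burn-in blocks of 100 steps each
    accepted = 0
    for _ in range(100):
        prop = x + rng.normal(0.0, sd)
        if np.log(rng.uniform()) < 0.5 * (x * x - prop * prop):
            x, accepted = prop, accepted + 1
    sd = tune_proposal_sd(sd, accepted / 100.0)
```

After burn-in, the proposal standard deviation is frozen so that the chain remains a valid Metropolis–Hastings sampler.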

11. COMMENTS

Our joint models for a heterogeneous mean and nonconstant variance are designed to extract the maximum amount of information from complex data and to suggest simpler parametric models. Bayesian inference based on MCMC offers a natural and powerful framework for model fitting. However, MCMC in this context is not straightforward, and implementation depends essentially on the ability to transform intractable simulations from high-dimensional distributions into tractable one-dimensional updates, as described in Section 9. Our implementation provides accurate, fast, and reproducible results, and software is publicly available. Simulation time increases only linearly with the number of observations, and mixing properties are adequate provided that proposal distribution variances are tuned for optimal acceptance rates.

ACKNOWLEDGMENTS

Carroll's research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

[Received September 2004. Revised October 2006.]

REFERENCES

Baladandayuthapani, V., Mallick, B. K., and Carroll, R. J. (2005), "Spatially Adaptive Bayesian Regression Splines," Journal of Computational and Graphical Statistics, 14, 378–394.

Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computation and Stochastic Systems" (with discussion), Statistical Science, 10, 3–66.

Carroll, R. J. (2003), "Variances Are Not Always Nuisance Parameters: The 2002 R. A. Fisher Lecture," Biometrics, 59, 211–220.

Carroll, R. J., and Ruppert, D. (1988), Transformation and Weighting in Regression, New York: Chapman and Hall.

Cleveland, W. S., and Grosse, E. (1991), "Computational Methods for Local Regression," Statistics and Computing, 1, 47–62.

Crainiceanu, C., and Ruppert, D. (2004), "Likelihood Ratio Tests in Linear Mixed Models With One Variance Component," Journal of the Royal Statistical Society, Series B, 66, 165–185.

Crainiceanu, C. M., Ruppert, D., and Wand, M. P. (2005), "Bayesian Analysis for Penalized Spline Regression Using WinBUGS," Journal of Statistical Software, 14. Available online at http://www.jstatsoft.org/v14/i14.

Donoho, D. L., and Johnstone, I. M. (1994), "Ideal Spatial Adaptation by Wavelet Shrinkage," Biometrika, 81, 425–455.

Eilers, P. H. C., and Marx, B. D. (1996), "Flexible Smoothing With B-Splines and Penalties" (with discussion), Statistical Science, 11, 89–121.

——— (2002), "Generalized Linear Additive Smooth Structures," Journal of Computational and Graphical Statistics, 11, 758–783.

——— (2004), "Splines, Knots, and Penalties," unpublished manuscript.

Fields Development Team (2006), fields: Tools for Spatial Data, National Center for Atmospheric Research, Boulder, CO. Available online at http://www.cgd.ucar.edu/Software/Fields.

Friedman, J. H. (1991), "Multivariate Adaptive Regression Splines" (with discussion), The Annals of Statistics, 19, 1–141.

Gelman, A. (2006), "Prior Distributions for Variance Parameters in Hierarchical Models," Bayesian Analysis, 1, 515–533.

Gray, R. J. (1994), "Spline-Based Tests in Survival Analysis," Biometrics, 50, 640–652.

Holst, U., Hossjer, O., Bjorklund, C., Ragnarson, P., and Edner, H. (1996), "Locally Weighted Least Squares Kernel Regression and Statistical Evaluation of LIDAR Measurements," Environmetrics, 7, 401–416.

Jullion, A., and Lambert, P. (2006), Discussion Paper DP0534, UCL.

Kammann, E. E., and Wand, M. P. (2003), "Geoadditive Models," Applied Statistics, 52, 1–18.

Kauermann, G. (2002), "A Note on Bandwidth Selection for Penalised Spline Smoothing," Technical Report 02-13, Department of Statistics, University of Glasgow.

Krivobokova, T., Crainiceanu, C. M., and Kauermann, G. (2007), "Fast Adaptive Penalized Splines," Journal of Computational and Graphical Statistics, to appear.

Lang, S., and Brezger, A. (2004), "Bayesian P-Splines," Journal of Computational and Graphical Statistics, 13, 183–212.

Lang, S., Fronk, E. M., and Fahrmeir, L. (2002), "Function Estimation with Locally Adaptive Dynamic Models," Computational Statistics, 17, 479–499.

Marx, B. D., and Eilers, P. H. C. (1998), "Direct Generalized Additive Modeling With Penalized Likelihood," Computational Statistics & Data Analysis, 28, 193–209.

Natarajan, R., and Kass, R. E. (2000), "Reference Bayesian Methods for Generalized Linear Mixed Models," Journal of the American Statistical Association, 95, 227–237.

Ngo, L., and Wand, M. P. (2004), "Smoothing With Mixed Model Software," Journal of Statistical Software, 9. Available online at http://www.jstatsoft.org/v09/i01.

Nychka, D., Haaland, P., O'Connell, M., and Ellner, S. (1998), "FUNFITS: Data Analysis and Statistical Tools for Estimating Functions," in Case Studies in Environmental Statistics, Lecture Notes in Statistics, vol. 132, eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 159–179. Online at http://www.cgd.ucar.edu/stats/Software/Funfits.

Nychka, D., and Saltzman, N. (1998), "Design of Air Quality Monitoring Networks," in Case Studies in Environmental Statistics, Lecture Notes in Statistics, vol. 132, eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 51–76.

Pintore, A., Speckman, P., and Holmes, C. C. (2006), "Spatially Adaptive Smoothing Splines," Biometrika, 93, 113–125.

Ruppert, D. (1997), "Local Polynomial Regression and its Applications in Environmental Statistics," in Statistics for the Environment (vol. 3), eds. V. Barnett and F. Turkman, Chichester: Wiley, pp. 155–173.

——— (2002), "Selecting the Number of Knots for Penalized Splines," Journal of Computational and Graphical Statistics, 11, 735–757.

Ruppert, D., and Carroll, R. J. (2000), "Spatially-Adaptive Penalties for Spline Fitting," Australian and New Zealand Journal of Statistics, 42, 205–223.

Ruppert, D., Wand, M. P., Holst, U., and Hossjer, O. (1997), Technometrics, 39, 262–273.

Ruppert, D., Wand, M. P., and Carroll, R. J. (2003), Semiparametric Regression, Cambridge: Cambridge University Press.

Smith, M., and Kohn, R. (1997), "A Bayesian Approach to Nonparametric Bivariate Regression," Journal of the American Statistical Association, 92, 1522–1535.

Wood, S. (2006), Generalized Additive Models: An Introduction with R, New York: Chapman & Hall/CRC.

Page 18: Spatially Adaptive Bayesian Penalized Splines - Department of

282 C M Crainiceanu et al

Figure 7 Posterior mean of the standard deviation σε(xi ) of the error process function for the Noshiro exampleThe model used was a thin plate spline with K = 100 knots for the mean function a log-thin plate spline withK = 100 knots for the variance function and a log-thin plate spline with Klowast = 16 knots for the spatially adaptiveshrinkage parameter The parameter controlling the degree of smoothness of the thin-plate spline basis was m = 2

Figure 8 Posterior mean of the shrinkage process minus logσ 2b(κκκk) The model used was a thin plate spline with

K = 100 knots for the mean function a log-thin plate spline with K = 100 knots for the variance function and alog-thin plate spline with Klowast = 16 knots for the spatially adaptive shrinkage parameter The parameter controllingthe degree of smoothness of the thin-plate spline basis was m = 2

Spatially Adaptive Bayesian Penalized Splines 283

Smaller values of this function correspond to less shrinkage of bk towards zero and morelocal behavior of the smoother Figure 8 indicates that in a neighborhood of the peak of themean function the shrinkage is smaller to allow the mean function to change rapidly Awayfrom the peak the shrinkage is larger corresponding to a smoother mean

9 IMPLEMENTATION USING MCMC

We will now provide some details for MCMC simulations of model (21) wherethe variances σ 2

ε (xi) and σ 2b (κk) are modeled by Equations (23) and (24) The imple-

mentation for other models for example for the multivariate smoothing in Section 6is similar Consider independent normal priors for the coefficients of the monomialsβi sim N(0 σ 2

0β) γi sim N(0 σ 20γ ) i = 0 p δi sim N(0 σ 2

0δ) i = 0 q and in-

dependent inverse Gamma priors for the variance components σ 2c sim IGamma(ac bc) and

σ 2d sim IGamma(ad bd) Using these priors many full conditionals of the posterior distribu-

tion are easy to derive while a few have complex multivariate forms Our implementationof the MCMC using multivariate Metropolis-Hastings steps proved to be unstable with poormixing properties A simple and reliable solution was to change the model by adding errorterms to the log-spline models that is

logσ 2ε (xi)

= γ0 + middot middot middot + γpxpi +sumKε

s=1 cs(xi minus κεs )

p+ + ui

logσ 2b (κ

mk ) = δ0 + middot middot middot + δq(κ

mk )q +sumKb

s=1 ds(κmk minus κb

s )q+ + vk

(91)

where ui sim N(0 σ 2u ) and vk sim N(0 σ 2

v ) This idea was also used for σ 2b by Baladan-

dayuthapani Mallick and Carroll (2005) We fixed the values of σ 2u = σ 2

v = 001 as thesevariances appear nonidentifiable and a standard deviation of 01 is small on a log-scale Thisdevice makes computations feasible by replacing sampling from complex full conditionalsby simple univariate MH steps

Define as before by ε b θθθX and define θθθε = (γγγ T cT )T θθθb = (δδδT dT )T CX =(X Z) Cε = (Xε Zε) Cb = (Xb Zb) where X Xε Xb contain the monomials and ZZε Zb contain the truncated polynomials of the spline models for m logσ 2

ε and logσ 2b

respectively Also denote by

0X =[

σ 20βIp+1 0

0 b

]

0ε =[

σ 20γ Ip+1 0

0 σ 2c IKε

]

0b =[

σ 20δIq+1 0

0 σ 2d IKm

]

The full conditionals of the posterior are detailed below

1 [θθθX] sim N(MXCT

Xminus1ε YMX

) where MX =

(CT

Xminus1ε CX + minus1

0X

)minus1

2 [θθθε] sim N(MεCT

ε Yεσ2u Mε

) where Yε = [log(σ 2

ε1) log(σ 2εn)]

T and Mε

= (CTε Cεσ

2u +minus1

0ε )minus1

284 C M Crainiceanu et al

Figure 9 Computation time (in minutes) versus sample size for 10000 simulations from the joint posteriordistribution given the data The mean function is modeled as a penalized spline with Km knots with the log of thevariance of the shrinkage process modeled as another penalized spline with Kb = 4 subknots The log of the errorvariance is modeled as another penalized spline with Kε = Km knots Different symbols correspond to increasingnumber of knots dots (Km = 10) asterisks (Km = 20) circles (Km = 30) xrsquos (Km = 40)

3 [θθθb] sim N(MbCT

b Ybσ2v Mb

)where Yb =

[log(σ 2

1 ) log(σ 2Km

)]T

and Mb =(CT

b Cbσ2v +minus1

0b )minus1

4 [σ 2c ] sim IGamma

(ac + Kε2 bc + ||c||22

)

5 [σ 2d ] sim IGamma

(ad + Kb2 bd + ||d||22

)

6 [σ 2εi] prop σminus3

εi exp

minus(yi minus microi)

2(2σ 2εi ) minus

[log(σ 2

εi ) minus ηi)]2

(2σ 2u )

where microi

and ηi are the ith components of CXθθθX and Cεθθθε respectively

7 [σ 2k ] prop σminus3

k expminusb2

k(2σ2k ) minus [

log(σ 2k ) minus ζk)

]2(σ 2

v ) where ζk is the kth com-

ponent of Cbθθθb

All the above conditionals have an explicit form with the exception of the n + Km

one-dimensional conditionals from 6 and 7 For these distributions we use the Metropolis-Hastings algorithm with a normal proposal distribution centered at the current value andsmall variance

10 IMPLEMENTATION AND COMPUTATIONALPERFORMANCE

Our MCMC simulation algorithm provided reliable results and is reasonably fast To il-lustrate this Figure 9 displays computation time (in minutes) versus sample size for 10000

Spatially Adaptive Bayesian Penalized Splines 285

Figure 10 MCMC histories for eight parameters of interest including the log of standard deviation of the errorprocess and log of standard deviation of the shrinkage process as well as some parameters modeling the differentfunctions The model uses 40 equally spaced knots for the mean and error variance functions and 4 subknotsfor the adaptive shrinkage process Chains correspond to the Doppler function example using quadratic truncatedpolynomials to fit the mean and log variance function and linear truncated polynomial basis for fitting the shrinkageprocess For clarity of presentation only every 20th simulated observation is displayed

simulations from the joint posterior distribution given the data The mean function is mod-eled as a penalized spline with Km knots with the log of the variance of the shrinkageprocess modeled as another penalized spline with Kb = 4 subknots The log of the errorvariance is modeled as another penalized spline with Kε = Km knots Different symbolscorrespond to increasing number of knots dots (Km = 10) asterisks (Km = 20) circles(Km = 30) xrsquos (Km = 40) Computation time increases roughly linearly with the numberof observations with the same intercept 0 and larger slopes for larger number of knotsNote that even with n = 1000 observations and Km = Kε = 40 simulation time is abouttwo minutes

For the models and datasets presented in this article 10000 simulations proved suffi-cient for accurate and reliable inference Some parameters such as the mean and variancefunctions tended to exhibit much better mixing properties than the variance componentsσ 2c and σ 2

d controlling the shrinkage processes This is probably due to the reduced amountof information in the data about these parameters The MCMC convergence and mixingproperties were assessed by visual inspection of the chain histories of many parameters ofinterest Since we started with good starting values of the parameters convergence is fastFigure 10 displays the histories of eight parameters of interest for the model fitted to the

286 C M Crainiceanu et al

Doppler function described in Section 5 For clarity we display only every 100th sampledvalued of the chain but the same patterns would be observed in the full chain history Similargood mixing properties have been noted in all the other examples presented in this article

In the search for a good software platform we implemented our algorithms in MATLABC and FORTRAN Programs can be downloaded from wwwbiostatjhsphedu simccrainicFor our application and implementation there were no serious differences between C andFORTRAN which have performed better than MATLAB by reducing computation timetwo to three times The reported times in Figure 9 are for our implementation in C

Given the complexity of the model we were interested in improving the mixing prop-erties of the Markov Chains The computational trick described in Section 9 was the sinequa non method that allowed us to replace an intractable multivariate MH step with severaltractable univariate MH steps We used normal random walk proposal distributions centeredat the current value and variance tuned to provide acceptance rates between 30 and 50

11 COMMENTS

Our joint models for heterogeneous mean and nonconstant variance are designed to ex-tract the maximum amount of information from complex data and suggest simpler paramet-ric models Bayesian inference based on MCMC offers a natural and powerful frameworkfor model fitting However MCMC in this context is not straightforward and implemen-tation depends essentially on the ability of transforming intractable simulations from highdimensional distributions into tractable one dimensional updates as described in Section9 Our implementation provides accurate fast and reproducible results while software isavailable Actual simulation time increases only linearly with the number of observationsand mixing properties are adequate provided that proposal distribution variances are tunedfor optimal acceptance rates

ACKNOWLEDGMENTS

Carrollrsquos research was supported by a grant from the National Cancer Institute (CA-57030) and by the TexasAampM Center for Environmental and Rural Health via a grant from the National Institute of Environmental HealthSciences (P30-ES09106)

[Received September 2004 Revised October 2006]

REFERENCES

Baladandayuthapani V Mallick B K and Carroll R J (2005) ldquoSpatially Adaptive Bayesian RegressionSplinesrdquo Journal of Computational and Graphical Statistics 14 378ndash394

Besag J Green P Higdon D and Mengersen K (1995) ldquoBayesian Computation and Stochastic Systemsrdquo(with discussions) Statistical Science 10 3ndash66

Carroll R J (2003) ldquoVariances Are Not Always Nuisance Parameters The 2002 R A Fisher Lecturerdquo Biometrics59 211ndash220

Carroll R J and Ruppert D (1988) Transformation and Weighting in Regression New York Chapman andHall

Spatially Adaptive Bayesian Penalized Splines 287

Cleveland W S and Grosse E (1991) ldquoComputational Methods for Local Regressionrdquo Statistics and Computing1 47ndash62

Crainiceanu C and Ruppert D (2004) ldquoLikelihood Ratio Tests in Linear Mixed Models With One VarianceComponentrdquo Journal of the Royal Statistical Society Series B 66 165ndash185

Crainiceanu C M Ruppert D and Wand M P (2005) ldquoBayesian Analysis for Penalized Spline Regressionusing WinBUGSrdquo Journal of Statistical Software 14 Available online at http wwwjstatsoftorg v14 i14

Donoho D L and Johnstone I M (1994) ldquoIdeal Spatial Adaptation by Wavelet Shrinkagerdquo Biometrika 81425ndash455

Eilers P H C and Marx B D (1996) ldquoFlexible Smoothing With B-Splines and Penaltiesrdquo (with discussion)Statistical Science 11 89ndash121

(2002) ldquoGeneralized Linear Additive Smooth Structuresrdquo Journal of Computational and Graphical Statis-tics 11 758ndash783

(2004) ldquoSplines Knots and Penaltiesrdquo unpublished manuscript

Fields Development Team (2006) fields Tools for Spatial Data National Center for Atmospheric ResearchBoulder CO Available online at http wwwcgducaredu Software Fields

Friedman J H (1991) ldquoMultivariate Adaptive Regression Splinesrdquo (with discussion) The Annals of Statistics19 1ndash141

Gelman A (2006) ldquoPrior Distributions for Variance Parameters in Hierarchical Modelsrdquo Bayesian Analysis 1515ndash533

Gray R J (1994) ldquoSpline-Based Tests in Survival Analysisrdquo Biometrics 50 640ndash652


Smaller values of this function correspond to less shrinkage of $b_k$ towards zero and more local behavior of the smoother. Figure 8 indicates that in a neighborhood of the peak of the mean function the shrinkage is smaller, to allow the mean function to change rapidly. Away from the peak the shrinkage is larger, corresponding to a smoother mean.

9 IMPLEMENTATION USING MCMC

We will now provide some details for MCMC simulations of model (2.1), where the variances $\sigma^2_\varepsilon(x_i)$ and $\sigma^2_b(\kappa_k)$ are modeled by Equations (2.3) and (2.4). The implementation for other models, for example for the multivariate smoothing in Section 6, is similar. Consider independent normal priors for the coefficients of the monomials, $\beta_i \sim N(0, \sigma^2_{0\beta})$, $\gamma_i \sim N(0, \sigma^2_{0\gamma})$, $i = 0, \ldots, p$, $\delta_i \sim N(0, \sigma^2_{0\delta})$, $i = 0, \ldots, q$, and independent inverse Gamma priors for the variance components, $\sigma^2_c \sim \mathrm{IGamma}(a_c, b_c)$ and $\sigma^2_d \sim \mathrm{IGamma}(a_d, b_d)$. Using these priors, many full conditionals of the posterior distribution are easy to derive, while a few have complex multivariate forms. Our implementation of the MCMC using multivariate Metropolis-Hastings steps proved to be unstable, with poor mixing properties. A simple and reliable solution was to change the model by adding error terms to the log-spline models, that is,

$$
\begin{aligned}
\log \sigma^2_\varepsilon(x_i) &= \gamma_0 + \cdots + \gamma_p x_i^p + \sum_{s=1}^{K_\varepsilon} c_s \,(x_i - \kappa^\varepsilon_s)^p_+ + u_i, \\
\log \sigma^2_b(\kappa^m_k) &= \delta_0 + \cdots + \delta_q \,(\kappa^m_k)^q + \sum_{s=1}^{K_b} d_s \,(\kappa^m_k - \kappa^b_s)^q_+ + v_k,
\end{aligned}
\qquad (9.1)
$$

where $u_i \sim N(0, \sigma^2_u)$ and $v_k \sim N(0, \sigma^2_v)$. This idea was also used for $\sigma^2_b$ by Baladandayuthapani, Mallick, and Carroll (2005). We fixed the values $\sigma^2_u = \sigma^2_v = 0.01$, as these variances appear nonidentifiable and a standard deviation of $0.1$ is small on a log-scale. This device makes computations feasible by replacing sampling from complex full conditionals with simple univariate MH steps.
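Each model in (9.1) uses the same two ingredients: monomials up to a given degree and truncated polynomials at a set of knots. As a minimal sketch of how such a design matrix $C = (X, Z)$ can be assembled (the function name and arguments are ours, not taken from the authors' MATLAB/C/FORTRAN code):

```python
import numpy as np

def trunc_poly_design(x, knots, degree):
    """Design matrix C = [X Z]: X holds the monomials 1, x, ..., x^degree,
    and Z holds the truncated polynomials (x - kappa)^degree_+ at each knot."""
    X = np.vander(x, degree + 1, increasing=True)               # n x (degree+1)
    Z = np.maximum(x[:, None] - knots[None, :], 0.0) ** degree  # n x K
    return np.hstack([X, Z])
```

Called with the $K_\varepsilon$ knots and degree $p$ it plays the role of $C_\varepsilon$; with the $K_b$ subknots and degree $q$ it gives $C_b$.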

Define, as before, $\varepsilon$, $b$, $\theta_X$, and define $\theta_\varepsilon = (\gamma^T, c^T)^T$, $\theta_b = (\delta^T, d^T)^T$, $C_X = (X, Z)$, $C_\varepsilon = (X_\varepsilon, Z_\varepsilon)$, $C_b = (X_b, Z_b)$, where $X$, $X_\varepsilon$, $X_b$ contain the monomials and $Z$, $Z_\varepsilon$, $Z_b$ contain the truncated polynomials of the spline models for $m$, $\log \sigma^2_\varepsilon$, and $\log \sigma^2_b$, respectively. Also denote by

$$
\Sigma_{0X} = \begin{bmatrix} \sigma^2_{0\beta} I_{p+1} & 0 \\ 0 & \Sigma_b \end{bmatrix}, \qquad
\Sigma_{0\varepsilon} = \begin{bmatrix} \sigma^2_{0\gamma} I_{p+1} & 0 \\ 0 & \sigma^2_c I_{K_\varepsilon} \end{bmatrix}, \qquad
\Sigma_{0b} = \begin{bmatrix} \sigma^2_{0\delta} I_{q+1} & 0 \\ 0 & \sigma^2_d I_{K_b} \end{bmatrix}.
$$

The full conditionals of the posterior are detailed below.

1. $[\theta_X] \sim N\!\left(M_X C_X^T \Sigma_\varepsilon^{-1} Y,\; M_X\right)$, where $M_X = \left(C_X^T \Sigma_\varepsilon^{-1} C_X + \Sigma_{0X}^{-1}\right)^{-1}$.

2. $[\theta_\varepsilon] \sim N\!\left(M_\varepsilon C_\varepsilon^T Y_\varepsilon / \sigma^2_u,\; M_\varepsilon\right)$, where $Y_\varepsilon = [\log(\sigma^2_{\varepsilon 1}), \ldots, \log(\sigma^2_{\varepsilon n})]^T$ and $M_\varepsilon = \left(C_\varepsilon^T C_\varepsilon / \sigma^2_u + \Sigma_{0\varepsilon}^{-1}\right)^{-1}$.


Figure 9. Computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with $K_m$ knots, with the log of the variance of the shrinkage process modeled as another penalized spline with $K_b = 4$ subknots. The log of the error variance is modeled as another penalized spline with $K_\varepsilon = K_m$ knots. Different symbols correspond to increasing numbers of knots: dots ($K_m = 10$), asterisks ($K_m = 20$), circles ($K_m = 30$), x's ($K_m = 40$).

3. $[\theta_b] \sim N\!\left(M_b C_b^T Y_b / \sigma^2_v,\; M_b\right)$, where $Y_b = [\log(\sigma^2_1), \ldots, \log(\sigma^2_{K_m})]^T$ and $M_b = \left(C_b^T C_b / \sigma^2_v + \Sigma_{0b}^{-1}\right)^{-1}$.

4. $[\sigma^2_c] \sim \mathrm{IGamma}\!\left(a_c + K_\varepsilon/2,\; b_c + \|c\|^2/2\right)$.

5. $[\sigma^2_d] \sim \mathrm{IGamma}\!\left(a_d + K_b/2,\; b_d + \|d\|^2/2\right)$.

6. $[\sigma^2_{\varepsilon i}] \propto \sigma^{-3}_{\varepsilon i} \exp\!\left\{ -\dfrac{(y_i - \mu_i)^2}{2\sigma^2_{\varepsilon i}} - \dfrac{[\log(\sigma^2_{\varepsilon i}) - \eta_i]^2}{2\sigma^2_u} \right\}$, where $\mu_i$ and $\eta_i$ are the $i$th components of $C_X \theta_X$ and $C_\varepsilon \theta_\varepsilon$, respectively.

7. $[\sigma^2_k] \propto \sigma^{-3}_k \exp\!\left\{ -\dfrac{b_k^2}{2\sigma^2_k} - \dfrac{[\log(\sigma^2_k) - \zeta_k]^2}{2\sigma^2_v} \right\}$, where $\zeta_k$ is the $k$th component of $C_b \theta_b$.

All the above conditionals have an explicit form, with the exception of the $n + K_m$ one-dimensional conditionals in items 6 and 7. For these distributions we use the Metropolis-Hastings algorithm with a normal proposal distribution centered at the current value and with small variance.
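To make the structure of steps 1 through 5 concrete, here is a hedged numpy sketch (variable names and the toy dimensions are ours; this is not the authors' released code). Steps 1-3 are draws from a multivariate normal whose precision combines a data term with the prior precision: for step 1, `obs_prec` holds $1/\sigma^2_\varepsilon(x_i)$ and the prior precisions are the reciprocals of the diagonal of $\Sigma_{0X}$; for steps 2-3, `obs_prec` is constant at $1/\sigma^2_u$ or $1/\sigma^2_v$. Steps 4-5 are inverse Gamma draws:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_coef_block(C, y, obs_prec, prior_prec_diag):
    # theta ~ N(M C^T W y, M), with W = diag(obs_prec) and
    # M = (C^T W C + Sigma0^{-1})^{-1}, as in steps 1-3.
    A = C.T @ (obs_prec[:, None] * C) + np.diag(prior_prec_diag)
    mean = np.linalg.solve(A, C.T @ (obs_prec * y))
    L = np.linalg.cholesky(A)               # A = L L^T
    # mean + A^{-1/2} z has covariance A^{-1}
    return mean + np.linalg.solve(L.T, rng.standard_normal(len(mean)))

def draw_smoothing_var(coefs, a, b):
    # sigma^2 ~ IGamma(a + K/2, b + ||coefs||^2 / 2), as in steps 4-5.
    shape = a + coefs.size / 2.0
    rate = b + 0.5 * np.sum(coefs ** 2)
    return 1.0 / rng.gamma(shape, 1.0 / rate)  # reciprocal of a Gamma draw
```

One MCMC sweep then cycles these conjugate draws with the univariate MH updates of steps 6 and 7.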

10 IMPLEMENTATION AND COMPUTATIONAL PERFORMANCE

Our MCMC simulation algorithm provided reliable results and is reasonably fast. To illustrate this, Figure 9 displays computation time (in minutes) versus sample size for 10,000 simulations from the joint posterior distribution given the data. The mean function is modeled as a penalized spline with $K_m$ knots, with the log of the variance of the shrinkage process modeled as another penalized spline with $K_b = 4$ subknots; the log of the error variance is modeled as another penalized spline with $K_\varepsilon = K_m$ knots. Different symbols correspond to increasing numbers of knots: dots ($K_m = 10$), asterisks ($K_m = 20$), circles ($K_m = 30$), x's ($K_m = 40$). Computation time increases roughly linearly with the number of observations, with the same intercept (0) and larger slopes for larger numbers of knots. Note that even with $n = 1000$ observations and $K_m = K_\varepsilon = 40$, simulation time is about two minutes.

Figure 10. MCMC histories for eight parameters of interest, including the log of the standard deviation of the error process and the log of the standard deviation of the shrinkage process, as well as some parameters modeling the different functions. The model uses 40 equally spaced knots for the mean and error variance functions and 4 subknots for the adaptive shrinkage process. Chains correspond to the Doppler function example, using quadratic truncated polynomials to fit the mean and log-variance functions and a linear truncated polynomial basis to fit the shrinkage process. For clarity of presentation, only every 20th simulated observation is displayed.

For the models and datasets presented in this article, 10,000 simulations proved sufficient for accurate and reliable inference. Some parameters, such as the mean and variance functions, tended to exhibit much better mixing properties than the variance components $\sigma^2_c$ and $\sigma^2_d$ controlling the shrinkage processes. This is probably due to the reduced amount of information in the data about these parameters. The MCMC convergence and mixing properties were assessed by visual inspection of the chain histories of many parameters of interest. Since we started with good starting values of the parameters, convergence is fast. Figure 10 displays the histories of eight parameters of interest for the model fitted to the Doppler function described in Section 5. For clarity we display only every 100th sampled value of the chain, but the same patterns would be observed in the full chain history. Similar good mixing properties have been noted in all the other examples presented in this article.

In the search for a good software platform, we implemented our algorithms in MATLAB, C, and FORTRAN. Programs can be downloaded from www.biostat.jhsph.edu/~ccrainic. For our application and implementation there were no serious differences between C and FORTRAN, both of which performed better than MATLAB, reducing computation time by a factor of two to three. The reported times in Figure 9 are for our implementation in C.

Given the complexity of the model, we were interested in improving the mixing properties of the Markov chains. The computational trick described in Section 9 was the sine qua non that allowed us to replace an intractable multivariate MH step with several tractable univariate MH steps. We used normal random-walk proposal distributions centered at the current value, with variances tuned to provide acceptance rates between 30% and 50%.
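A sketch of one such univariate random-walk MH update (names are ours; `log_target` stands for the log of the conditional in step 6 or 7, expressed in whatever parameterization one samples in, so a log-variance parameterization must fold the corresponding Jacobian into `log_target`):

```python
import numpy as np

rng = np.random.default_rng(1)

def mh_update(current, log_target, step_sd):
    """One random-walk Metropolis-Hastings update: propose from
    N(current, step_sd^2) and accept with probability min(1, ratio).
    Returns (new value, accepted flag)."""
    prop = current + step_sd * rng.standard_normal()
    log_alpha = log_target(prop) - log_target(current)
    if np.log(rng.uniform()) < log_alpha:
        return prop, True
    return current, False
```

The `step_sd` is then tuned, for example over pilot runs, until the empirical acceptance fraction falls in the 30-50% range mentioned above.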

11 COMMENTS

Our joint models for heterogeneous mean and nonconstant variance are designed to extract the maximum amount of information from complex data and to suggest simpler parametric models. Bayesian inference based on MCMC offers a natural and powerful framework for model fitting. However, MCMC in this context is not straightforward, and implementation depends essentially on the ability to transform intractable simulations from high-dimensional distributions into tractable one-dimensional updates, as described in Section 9. Our implementation provides accurate, fast, and reproducible results, and software is publicly available. Actual simulation time increases only linearly with the number of observations, and mixing properties are adequate provided that proposal distribution variances are tuned for optimal acceptance rates.

ACKNOWLEDGMENTS

Carroll's research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

[Received September 2004 Revised October 2006]

REFERENCES

Baladandayuthapani, V., Mallick, B. K., and Carroll, R. J. (2005), "Spatially Adaptive Bayesian Regression Splines," Journal of Computational and Graphical Statistics, 14, 378–394.

Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computation and Stochastic Systems" (with discussion), Statistical Science, 10, 3–66.

Carroll, R. J. (2003), "Variances Are Not Always Nuisance Parameters: The 2002 R. A. Fisher Lecture," Biometrics, 59, 211–220.

Carroll, R. J., and Ruppert, D. (1988), Transformation and Weighting in Regression, New York: Chapman and Hall.

Cleveland, W. S., and Grosse, E. (1991), "Computational Methods for Local Regression," Statistics and Computing, 1, 47–62.

Crainiceanu, C., and Ruppert, D. (2004), "Likelihood Ratio Tests in Linear Mixed Models With One Variance Component," Journal of the Royal Statistical Society, Series B, 66, 165–185.

Crainiceanu, C. M., Ruppert, D., and Wand, M. P. (2005), "Bayesian Analysis for Penalized Spline Regression Using WinBUGS," Journal of Statistical Software, 14. Available online at http://www.jstatsoft.org/v14/i14.

Donoho, D. L., and Johnstone, I. M. (1994), "Ideal Spatial Adaptation by Wavelet Shrinkage," Biometrika, 81, 425–455.

Eilers, P. H. C., and Marx, B. D. (1996), "Flexible Smoothing With B-Splines and Penalties" (with discussion), Statistical Science, 11, 89–121.

Eilers, P. H. C., and Marx, B. D. (2002), "Generalized Linear Additive Smooth Structures," Journal of Computational and Graphical Statistics, 11, 758–783.

Eilers, P. H. C., and Marx, B. D. (2004), "Splines, Knots, and Penalties," unpublished manuscript.

Fields Development Team (2006), fields: Tools for Spatial Data, National Center for Atmospheric Research, Boulder, CO. Available online at http://www.cgd.ucar.edu/Software/Fields.

Friedman, J. H. (1991), "Multivariate Adaptive Regression Splines" (with discussion), The Annals of Statistics, 19, 1–141.

Gelman, A. (2006), "Prior Distributions for Variance Parameters in Hierarchical Models," Bayesian Analysis, 1, 515–533.

Gray, R. J. (1994), "Spline-Based Tests in Survival Analysis," Biometrics, 50, 640–652.

Holst, U., Hossjer, O., Bjorklund, C., Ragnarson, P., and Edner, H. (1996), "Locally Weighted Least Squares Kernel Regression and Statistical Evaluation of LIDAR Measurements," Environmetrics, 7, 401–416.

Jullion, A., and Lambert, P. (2006), Discussion paper DP0534, UCL.

Kammann, E. E., and Wand, M. P. (2003), "Geoadditive Models," Applied Statistics, 52, 1–18.

Kauermann, G. (2002), "A Note on Bandwidth Selection for Penalised Spline Smoothing," Technical Report 02-13, Department of Statistics, University of Glasgow.

Krivobokova, T., Crainiceanu, C. M., and Kauermann, G. (2007), "Fast Adaptive Penalized Splines," Journal of Computational and Graphical Statistics, to appear.

Lang, S., and Brezger, A. (2004), "Bayesian P-Splines," Journal of Computational and Graphical Statistics, 13, 183–212.

Lang, S., Fronk, E. M., and Fahrmeir, L. (2002), "Function Estimation with Locally Adaptive Dynamic Models," Computational Statistics, 17, 479–499.

Marx, B. D., and Eilers, P. H. C. (1998), "Direct Generalized Additive Modeling With Penalized Likelihood," Computational Statistics & Data Analysis, 28, 193–209.

Natarajan, R., and Kass, R. E. (2000), "Reference Bayesian Methods for Generalized Linear Mixed Models," Journal of the American Statistical Association, 95, 227–237.

Ngo, L., and Wand, M. P. (2004), "Smoothing With Mixed Model Software," Journal of Statistical Software, 9. Available online at http://www.jstatsoft.org/v09/i01.

Nychka, D., Haaland, P., O'Connell, M., and Ellner, S. (1998), "FUNFITS: Data Analysis and Statistical Tools for Estimating Functions," in Case Studies in Environmental Statistics, Lecture Notes in Statistics, vol. 132, eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer Verlag, pp. 159–179. Online at http://www.cgd.ucar.edu/stats/Software/Funfits.

Nychka, D., and Saltzman, N. (1998), "Design of Air Quality Monitoring Networks," in Case Studies in Environmental Statistics, Lecture Notes in Statistics, vol. 132, eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer Verlag, pp. 51–76.

Pintore, A., Speckman, P., and Holmes, C. C. (2006), "Spatially Adaptive Smoothing Splines," Biometrika, 93, 113–125.

Ruppert, D. (1997), "Local Polynomial Regression and its Applications in Environmental Statistics," in Statistics for the Environment (vol. 3), eds. V. Barnett and F. Turkman, Chichester: Wiley, pp. 155–173.

Ruppert, D. (2002), "Selecting the Number of Knots for Penalized Splines," Journal of Computational and Graphical Statistics, 11, 735–757.

Ruppert, D., and Carroll, R. J. (2000), "Spatially-Adaptive Penalties for Spline Fitting," Australian and New Zealand Journal of Statistics, 42, 205–223.

Ruppert, D., Wand, M. P., Holst, U., and Hossjer, O. (1997), Technometrics, 39, 262–273.

Ruppert, D., Wand, M. P., and Carroll, R. J. (2003), Semiparametric Regression, Cambridge: Cambridge University Press.

Smith, M., and Kohn, R. (1997), "A Bayesian Approach to Nonparametric Bivariate Regression," Journal of the American Statistical Association, 92, 1522–1535.

Wood, S. (2006), Generalized Additive Models: An Introduction with R, New York: Chapman & Hall/CRC.


Nychka D and Saltzman N (1998) ldquoDesign of Air Quality Monitoring Networksrdquo in Case studies in Environ-mental Statistics Lecture Notes in Statistics vol 132 eds D Nychka W W Piegorsch and L H Cox NewYork Springer Verlag pp 51ndash76

288 C M Crainiceanu et al

Pintore A Speckman P and Holmes C C (2006) ldquoSpatially Adaptive Smoothing Splinesrdquo Biometrika 93113ndash125

Ruppert D (1997) ldquoLocal Polynomial Regression and its Applications in Environmental Statisticsrdquo in Statisticsfor the Environment (vol 3) eds V Barnett and F Turkman Chicester Wiley pp 155ndash173

(2002) ldquoSelecting the Number of Knots for Penalized Splinesrdquo Journal of Computational and GraphicalStatistics 11 735ndash757

Ruppert D and Carroll R J (2000) ldquoSpatially-Adaptive Penalties for Spline Fittingrdquo Australian and NewZealand Journal of Statistics 42 205ndash223

Ruppert D Wand M P Holst and U Hossier O (1997) Technometrics 39 262ndash273

Ruppert D Wand M P and Carroll R J (2003) Semiparametric Regression Cambridge Cambridge UniversityPress

Smith M and Kohn R (1997) ldquoA Bayesian Approach to Nonparametric Bivariate Regressionrdquo Journal of theAmerican Statistical Association 92 1522ndash1535

Wood S (2006) Generalized Additive Models An Introduction with R New York Chapman amp HallCRC

Page 22: Spatially Adaptive Bayesian Penalized Splines - Department of

286 C M Crainiceanu et al

Doppler function described in Section 5. For clarity, we display only every 100th sampled value of the chain, but the same patterns would be observed in the full chain history. Similar good mixing properties have been noted in all the other examples presented in this article.

In the search for a good software platform, we implemented our algorithms in MATLAB, C, and FORTRAN. Programs can be downloaded from www.biostat.jhsph.edu/~ccrainic. For our application and implementation there were no serious differences between C and FORTRAN, both of which performed better than MATLAB by reducing computation time two to three times. The reported times in Figure 9 are for our implementation in C.

Given the complexity of the model, we were interested in improving the mixing properties of the Markov chains. The computational trick described in Section 9 was the sine qua non method that allowed us to replace an intractable multivariate MH step with several tractable univariate MH steps. We used normal random-walk proposal distributions centered at the current value, with variances tuned to provide acceptance rates between 30% and 50%.
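The coordinate-at-a-time random-walk Metropolis strategy with tuned proposal variances can be sketched as follows. This is a minimal illustration, not the authors' code: the bivariate-normal `log_target`, the function names, and the multiplicative tuning schedule are all illustrative assumptions standing in for the paper's actual posterior and implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(theta):
    # Hypothetical stand-in posterior: a correlated bivariate normal
    # (unnormalized log-density with correlation 0.8).
    x, y = theta
    return -0.5 * (x ** 2 - 2 * 0.8 * x * y + y ** 2) / (1 - 0.8 ** 2)

def univariate_rw_mh(log_target, theta0, n_iter=6000, n_burn=2000, tune_every=100):
    """One-coordinate-at-a-time random-walk Metropolis-Hastings.

    Each coordinate gets a normal proposal centered at its current value;
    proposal standard deviations are adapted during burn-in toward the
    30%-50% acceptance range mentioned in the text, then held fixed."""
    theta = np.asarray(theta0, dtype=float)
    d = theta.size
    step = np.ones(d)                  # per-coordinate proposal sd
    accepts = np.zeros(d)
    samples = np.empty((n_iter, d))
    lp = log_target(theta)
    for it in range(n_iter):
        for j in range(d):             # several tractable univariate MH steps
            prop = theta.copy()
            prop[j] += step[j] * rng.standard_normal()
            lp_prop = log_target(prop)
            if np.log(rng.uniform()) < lp_prop - lp:
                theta, lp = prop, lp_prop
                accepts[j] += 1
        samples[it] = theta
        if it < n_burn and (it + 1) % tune_every == 0:
            rate = accepts / tune_every
            step[rate < 0.30] *= 0.8   # too few acceptances: shrink steps
            step[rate > 0.50] *= 1.25  # too many acceptances: grow steps
            accepts[:] = 0
    return samples[n_burn:], step

samples, step = univariate_rw_mh(log_target, [0.0, 0.0])
```

Tuning is restricted to the burn-in period so that the retained draws come from a fixed (non-adaptive) transition kernel; the post-burn-in samples should reproduce the mean and correlation structure of the target.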

11. COMMENTS

Our joint models for heterogeneous mean and nonconstant variance are designed to extract the maximum amount of information from complex data and suggest simpler parametric models. Bayesian inference based on MCMC offers a natural and powerful framework for model fitting. However, MCMC in this context is not straightforward, and implementation depends essentially on the ability to transform intractable simulations from high-dimensional distributions into tractable one-dimensional updates, as described in Section 9. Our implementation provides accurate, fast, and reproducible results, and software is available. Actual simulation time increases only linearly with the number of observations, and mixing properties are adequate provided that proposal distribution variances are tuned for optimal acceptance rates.

ACKNOWLEDGMENTS

Carroll's research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

[Received September 2004. Revised October 2006.]

REFERENCES

Baladandayuthapani, V., Mallick, B. K., and Carroll, R. J. (2005), "Spatially Adaptive Bayesian Regression Splines," Journal of Computational and Graphical Statistics, 14, 378–394.

Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computation and Stochastic Systems" (with discussion), Statistical Science, 10, 3–66.

Carroll, R. J. (2003), "Variances Are Not Always Nuisance Parameters: The 2002 R. A. Fisher Lecture," Biometrics, 59, 211–220.

Carroll, R. J., and Ruppert, D. (1988), Transformation and Weighting in Regression, New York: Chapman and Hall.

Cleveland, W. S., and Grosse, E. (1991), "Computational Methods for Local Regression," Statistics and Computing, 1, 47–62.

Crainiceanu, C., and Ruppert, D. (2004), "Likelihood Ratio Tests in Linear Mixed Models With One Variance Component," Journal of the Royal Statistical Society, Series B, 66, 165–185.

Crainiceanu, C. M., Ruppert, D., and Wand, M. P. (2005), "Bayesian Analysis for Penalized Spline Regression Using WinBUGS," Journal of Statistical Software, 14. Available online at http://www.jstatsoft.org/v14/i14.

Donoho, D. L., and Johnstone, I. M. (1994), "Ideal Spatial Adaptation by Wavelet Shrinkage," Biometrika, 81, 425–455.

Eilers, P. H. C., and Marx, B. D. (1996), "Flexible Smoothing With B-Splines and Penalties" (with discussion), Statistical Science, 11, 89–121.

——— (2002), "Generalized Linear Additive Smooth Structures," Journal of Computational and Graphical Statistics, 11, 758–783.

——— (2004), "Splines, Knots, and Penalties," unpublished manuscript.

Fields Development Team (2006), fields: Tools for Spatial Data, National Center for Atmospheric Research, Boulder, CO. Available online at http://www.cgd.ucar.edu/Software/Fields.

Friedman, J. H. (1991), "Multivariate Adaptive Regression Splines" (with discussion), The Annals of Statistics, 19, 1–141.

Gelman, A. (2006), "Prior Distributions for Variance Parameters in Hierarchical Models," Bayesian Analysis, 1, 515–533.

Gray, R. J. (1994), "Spline-Based Tests in Survival Analysis," Biometrics, 50, 640–652.

Holst, U., Hössjer, O., Björklund, C., Ragnarson, P., and Edner, H. (1996), "Locally Weighted Least Squares Kernel Regression and Statistical Evaluation of LIDAR Measurements," Environmetrics, 7, 401–416.

Jullion, A., and Lambert, P. (2006), Discussion Paper DP0534, UCL.

Kammann, E. E., and Wand, M. P. (2003), "Geoadditive Models," Applied Statistics, 52, 1–18.

Kauermann, G. (2002), "A Note on Bandwidth Selection for Penalised Spline Smoothing," Technical Report 02-13, Department of Statistics, University of Glasgow.

Krivobokova, T., Crainiceanu, C. M., and Kauermann, G. (2007), "Fast Adaptive Penalized Splines," Journal of Computational and Graphical Statistics, to appear.

Lang, S., and Brezger, A. (2004), "Bayesian P-Splines," Journal of Computational and Graphical Statistics, 13, 183–212.

Lang, S., Fronk, E. M., and Fahrmeir, L. (2002), "Function Estimation With Locally Adaptive Dynamic Models," Computational Statistics, 17, 479–499.

Marx, B. D., and Eilers, P. H. C. (1998), "Direct Generalized Additive Modeling With Penalized Likelihood," Computational Statistics & Data Analysis, 28, 193–209.

Natarajan, R., and Kass, R. E. (2000), "Reference Bayesian Methods for Generalized Linear Mixed Models," Journal of the American Statistical Association, 95, 227–237.

Ngo, L., and Wand, M. P. (2004), "Smoothing With Mixed Model Software," Journal of Statistical Software, 9. Available online at http://www.jstatsoft.org/v09/i01.

Nychka, D., Haaland, P., O'Connell, M., and Ellner, S. (1998), "FUNFITS: Data Analysis and Statistical Tools for Estimating Functions," in Case Studies in Environmental Statistics, Lecture Notes in Statistics (vol. 132), eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 159–179. Available online at http://www.cgd.ucar.edu/stats/Software/Funfits.

Nychka, D., and Saltzman, N. (1998), "Design of Air Quality Monitoring Networks," in Case Studies in Environmental Statistics, Lecture Notes in Statistics (vol. 132), eds. D. Nychka, W. W. Piegorsch, and L. H. Cox, New York: Springer-Verlag, pp. 51–76.

Pintore, A., Speckman, P., and Holmes, C. C. (2006), "Spatially Adaptive Smoothing Splines," Biometrika, 93, 113–125.

Ruppert, D. (1997), "Local Polynomial Regression and its Applications in Environmental Statistics," in Statistics for the Environment (vol. 3), eds. V. Barnett and F. Turkman, Chichester: Wiley, pp. 155–173.

——— (2002), "Selecting the Number of Knots for Penalized Splines," Journal of Computational and Graphical Statistics, 11, 735–757.

Ruppert, D., and Carroll, R. J. (2000), "Spatially-Adaptive Penalties for Spline Fitting," Australian and New Zealand Journal of Statistics, 42, 205–223.

Ruppert, D., Wand, M. P., Holst, U., and Hössjer, O. (1997), "Local Polynomial Variance-Function Estimation," Technometrics, 39, 262–273.

Ruppert, D., Wand, M. P., and Carroll, R. J. (2003), Semiparametric Regression, Cambridge: Cambridge University Press.

Smith, M., and Kohn, R. (1997), "A Bayesian Approach to Nonparametric Bivariate Regression," Journal of the American Statistical Association, 92, 1522–1535.

Wood, S. (2006), Generalized Additive Models: An Introduction With R, New York: Chapman & Hall/CRC.
