

“Schweder-Book” — 2015/10/21 — 17:46 — page 336 — #356

12 Predictions and confidence

The previous chapters have focussed on confidence distributions and associated inference for parameters of statistical models. Sometimes the goal of an analysis is, however, to make predictions about as yet unobserved or otherwise hidden random variables, such as the next data point in a sequence, or to infer values of missing data, and so forth. This chapter discusses and illustrates how the concept of confidence distributions may be lifted to such settings. Applications are given to predicting the next observation in a sequence, to regression models, to kriging in geostatistics, and to time series models.

12.1 Introduction

In earlier chapters we have developed and discussed concepts and methods for confidence distributions for parameters of statistical models. Sometimes the goal of fitting and analysing a model to data is, however, to predict as yet unobserved random quantities, like the next observation in a sequence, a missing data point in a data matrix, or the distribution of a future Y0 in a regression model as a function of its associated covariates x0, and so on. For such a future or unobserved Y0 we may then wish to construct a predictive distribution, say Cpred(y0), with the property that Cpred(b) − Cpred(a) may be interpreted as the probability that a ≤ Y0 ≤ b. As such, intervals for the unobserved Y0 with given coverage degree may be read off, via [Cpred^{-1}(α), Cpred^{-1}(1 − α)], as for ordinary confidence intervals.

There is a tradition in some statistics literature to use ‘credibility intervals’ rather than ‘confidence intervals’ when the quantity in question is a random variable rather than a parameter of a statistical model. This term is also in frequent use in Bayesian statistics, where there is no clear division between parameters and variables, as model parameters are also considered random. We shall, however, continue to use ‘confidence intervals’ and indeed ‘confidence distributions’ in these prediction settings.

We start our discussion with the case of predicting the next data point in a sequence of i.i.d. observations, in Section 12.2. Our frequentist predictive approach is different from the Bayesian one, where the model density is integrated over the parameters with respect to their posterior distribution. We provide a brief discussion and comparison of the two approaches in Section 12.3, in particular pointing to the reassuring fact that the two tend to agree for large n. We then proceed to regression models in Section 12.4. In such situations the task is to construct a confidence distribution Cpred(y0 | x0) for an observation Y0 associated with position x0 in the covariate space, along with, for example, prediction confidence densities and prediction confidence curves

cpred(y0 | x0) = ∂Cpred(y0 | x0)/∂y0,
ccpred(y0 | x0) = |2 Cpred(y0 | x0) − 1|,

much as in Section 3.3; see, for example, (3.5) and (3.6). We then briefly discuss prediction confidence in time series models and geostatistical kriging in Section 12.5.

In each of these cases the goal is to construct a predictive distribution, which here follows the frequentist framework and apparatus developed for focus parameters in earlier chapters. The definition is that such a confidence distribution for the variable Y0 to be predicted, say Cpred(y0, Y) based on data Y, needs to satisfy the following two conditions. First, Cpred(y0, y) is a distribution function in y0 for each y. Second, we should have

U = Cpred(Y0, Y) ∼ uniform on [0, 1],   (12.1)

when (Y, Y0) has the joint distribution dictated by the true parameters of the model used. We call this a predictive confidence distribution for Y0; that it is predictive will be clear from the context. When a point prediction is called for, one may use the median predictive confidence point, that is, ŷ0 = Cpred^{-1}(1/2).

A useful prototype illustration is that of predicting Yn+1 in a sequence of independent realisations from the normal model N(μ, σ²) with unknown parameters, after having observed Y = (Y1, ..., Yn). With the usual parameter estimators, the mean Ȳ and empirical standard deviation σ̂, the studentised ratio R = (Yn+1 − Ȳ)/σ̂ has a distribution independent of the parameters of the model, say Kn. Hence

Cpred(yn+1, yobs) = Kn((yn+1 − ȳobs)/σ̂obs)   (12.2)

satisfies the two criteria. That the distribution of R happens to be closely related to the t with degrees of freedom n − 1 is less crucial than the construction itself and the fact that R is a pivot. Even if we did not manage to see or prove that in fact R is distributed as (1 + 1/n)^{1/2} t_{n−1}, we could still put up (12.2), and use simulation to produce the curve and the ensuing intervals from Cpred^{-1}(α) = ȳobs + σ̂obs Kn^{-1}(α). Another feature is also clear from (12.2), namely that when the dataset upon which we base our prediction is large, then parameter estimates are close to the underlying true values, making the predictive distribution come close to the ‘oracle’ method, which takes C(yn+1, y) equal to the distribution function of Yn+1, that is, C(yn+1) = Φ((yn+1 − μ)/σ) in the case treated here.
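The pivot-plus-simulation recipe of (12.2) can be sketched in a few lines. The sample size, seed, and Monte Carlo budget below are illustrative choices of ours; the simulated Kn is checked against the exact (1 + 1/n)^{1/2} t_{n−1} law mentioned in the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, B = 5, 200_000

# Simulate the pivot R = (Y_{n+1} - Ybar)/sigma_hat; its law K_n does not
# depend on (mu, sigma), so we may simulate under mu = 0, sigma = 1.
samples = rng.standard_normal((B, n))
R = (rng.standard_normal(B) - samples.mean(axis=1)) / samples.std(axis=1, ddof=1)

# Prediction quantiles via the simulated K_n, as in
# C_pred^{-1}(alpha) = ybar_obs + sigma_hat_obs * K_n^{-1}(alpha).
q_sim = np.quantile(R, [0.05, 0.5, 0.95])

# Exact law for comparison: R ~ (1 + 1/n)^{1/2} t_{n-1}.
q_exact = np.sqrt(1 + 1/n) * stats.t.ppf([0.05, 0.5, 0.95], df=n - 1)
```

Even without knowing the scaled-t form, the simulated quantiles alone would deliver valid prediction intervals.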

12.2 The next data point

Suppose i.i.d. data Y1, ..., Yn have been observed. How can we use the existing data to construct a confidence distribution for the next point, say Y0 = Yn+1? We start out working with this question nonparametrically, aiming at a confidence distribution Cpred(y0) for Y0 that ought to work without further restrictions on the distribution F underlying the observations, before we turn to parametric versions.

One helpful general perspective is that the oracle answer to the question, which assumes the distribution function F to be known, would be C(y0) = F(y0) itself. Hence using the data to generate a nonparametric estimate F̂(y0) gives at least a reasonable approximation, so Cpred(y0) = F̂(y0) is an option. The immediate choice here is the empirical distribution function, for which

Cpred(y0) = n^{-1} ∑_{i=1}^n I{yi ≤ y0} = n^{-1} ∑_{i=1}^n I{y(i) ≤ y0}.   (12.3)

Here y(1) < ··· < y(n) are the ordered data points. In particular, Cpred(y(i)) = i/n. Note that (12.3) emerged without any direct probability calculations per se; we simply inserted F̂ for F. A different fruitful avenue is to work out the probability P{Y0 ≤ Y(i)}. Writing gi(ui) for the density of the ith ordered observation U(i) in a uniform sample U1, ..., Un, we have

P{Y0 ≤ Y(i)} = P{U0 ≤ U(i)} = ∫_0^1 P{U0 ≤ ui} gi(ui) dui = ∫_0^1 ui gi(ui) dui.

So this is E U(i), which is equal to i/(n + 1); see Exercise 12.2. This leads to a prediction confidence distribution C*pred with C*pred(y(i)) = i/(n + 1), where we may use linear interpolation between the data points to define a bona fide distribution function.

We then turn to parametric settings. Suppose i.i.d. data Y1, ..., Yn have been observed

and fitted to some model, say with density f(y, θ) and cumulative F(y, θ). We shall use the existing data to construct a confidence distribution for the next point, say Y0 = Yn+1. A simple construction is the directly estimated

C0,pred(y0) = F(y0, θ̂),   (12.4)

inserting say the maximum likelihood estimate for θ. Although it is clear that this will work reasonably well for large n, in that θ̂ then will be close enough to the true or least false value θ0, the method may fail for moderate or small sample sizes, as it does not reflect the inherent uncertainty in the parameter estimate. In other words, if θ̂ is not close to θ0, then prediction intervals read off from (12.4) will be off target too.
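The small-sample calibration behind the nonparametric rule C*pred(y(i)) = i/(n + 1) derived above can be checked by simulation; a minimal sketch of our own (the exponential choice for F and the seed are arbitrary, since only the ranks matter):

```python
import numpy as np

rng = np.random.default_rng(2)
n, B = 5, 100_000

# P{Y0 <= Y_(i)} should equal i/(n+1) for any continuous F; by the
# probability integral transform the check is distribution-free.
y = np.sort(rng.exponential(size=(B, n)), axis=1)
y0 = rng.exponential(size=B)
p_hat = [np.mean(y0 <= y[:, i - 1]) for i in range(1, n + 1)]
```

Each p_hat entry should sit close to i/(n + 1), not i/n, confirming that the plain empirical distribution function slightly miscalibrates at the order statistics.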

It is instructive to study the case of a location and scale model, with known base distribution. Observations are of the form Yi = μ + σεi, with the εi being independent from a distribution F0 with density f0. For predicting the next observation Y0, the ‘oracle’ mechanism would be F0((y0 − μ)/σ), with the true values for (μ, σ), and the direct parametric plug-in method (12.4) is F0((y0 − μ̂)/σ̂) with the maximum likelihood estimators. This does not take the sampling variability of these estimators into account, however. For this type of model, the maximum likelihood estimators can be represented as μ̂ = μ + σAn and σ̂ = σBn, with (An, Bn) = ((μ̂ − μ)/σ, σ̂/σ) having a distribution independent of the parameters. Hence

T = (Y0 − μ̂)/σ̂ =d (μ + σε − μ − σAn)/(σBn) = (ε − An)/Bn

is a pivot, with a distribution Kn, say. The consequent confidence distribution for the next data point is hence

Cpred(y0) = Kn((y0 − μ̂obs)/σ̂obs) = P{T ≤ (y0 − μ̂obs)/σ̂obs}.

In a few cases we may find the distribution Kn explicitly, as with the normal, but otherwise it may conveniently be computed via simulation from the Kn distribution.


Figure 12.1 For the Cauchy sample of size n = 5, the figure displays the prediction confidence distribution (left) and the prediction confidence curve (right). The solid line is the real predictive distribution whereas the dotted line gives the parametric plug-in approximation.

Example 12.1 Predicting the next Cauchy observation

In Example 4.6 we considered a small set of data from the Cauchy model, with observations −3.842, −3.284, −0.278, 2.240, 3.632, and pointed to certain complexities having to do with multimodal likelihood functions. Here we use the two-parameter Cauchy model, with density σ^{-1} f0(σ^{-1}(y − μ)) and f0(x) = (1/π)/(1 + x²), to illustrate the above method for predicting the sixth data point. The maximum likelihood estimates are (−0.304, 2.467). We simulate B = 10^4 realisations of T to compute

Cpred(y0) = B^{-1} ∑_{j=1}^B I{(ε*_j − A*_{n,j})/B*_{n,j} ≤ (y0 + 0.304)/2.467}.

Figure 12.1 gives the cumulative confidence distribution as well as the predictive confidence curves, along with the parametric plug-in versions, which do not perform well here with such a small dataset. We note that finding maximum likelihood estimates for Cauchy data is computationally demanding, as the log-likelihood functions often are multimodal. We tackle this by performing an initial search for a safe start value before using this to carry out the numerical optimisation.
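A computation along the lines of Example 12.1 can be sketched as follows. This is our own sketch, not the book's implementation: the multi-start optimisation is one pragmatic answer to the multimodality issue mentioned in the text, and the simulation budget is kept small for speed.

```python
import numpy as np
from scipy import stats, optimize

y_obs = np.array([-3.842, -3.284, -0.278, 2.240, 3.632])

def fit_cauchy(y):
    """Maximum likelihood fit of the two-parameter Cauchy model."""
    def nll(p):  # negative log-likelihood; scale kept positive via log-parametrisation
        return -np.sum(stats.cauchy.logpdf(y, loc=p[0], scale=np.exp(p[1])))
    # The likelihood may be multimodal: try several start values, keep the best.
    fits = [optimize.minimize(nll, [m0, 0.0], method="Nelder-Mead")
            for m0 in np.quantile(y, [0.25, 0.5, 0.75])]
    best = min(fits, key=lambda r: r.fun)
    return best.x[0], np.exp(best.x[1])

mu_hat, sigma_hat = fit_cauchy(y_obs)   # close to (-0.304, 2.467) in the text

# Simulate the pivot T = (eps - A_n)/B_n by refitting standard Cauchy samples.
rng = np.random.default_rng(3)
T = np.empty(200)
for j in range(T.size):
    m, s = fit_cauchy(stats.cauchy.rvs(size=y_obs.size, random_state=rng))
    T[j] = (stats.cauchy.rvs(random_state=rng) - m) / s

def C_pred(y0):
    return np.mean(T <= (y0 - mu_hat) / sigma_hat)
```

With a larger budget (the book uses B = 10^4) the resulting curve approximates the solid lines of Figure 12.1.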

Example 12.2 Prediction in the normal model

Returning to our brief illustration in Section 12.1, assume that Y1, ..., Yn and the next Y0 are i.i.d. N(μ, σ²), with standard estimates μ̂ = Ȳ and σ̂ = {Q0/(n − 1)}^{1/2}, where Q0 = ∑_{i=1}^n (Yi − Ȳ)². Then by familiar properties of these estimators

Rn = (Y0 − μ̂)/σ̂ ∼ (1 + 1/n)^{1/2} t_{n−1};

see Exercise 12.1. It follows that the method of (12.4), which in this case amounts to using C0,pred(y0) = Φ((y0 − μ̂)/σ̂), underestimates the correct level of uncertainty; the implied 95% prediction interval μ̂ ± 1.96 σ̂, for example, actually has coverage level

P{|Y0 − μ̂|/σ̂ ≤ 1.96} = P{|t_{n−1}| ≤ 1.96/(1 + 1/n)^{1/2}}.

This probability is, for example, equal to 0.852 for the case of n = 5, further illustrated in Figure 12.2. For this sample size, the correct 95% prediction interval takes the rather wider form μ̂ ± 3.041 σ̂. In this case the simple method of (12.4) may be corrected using the preceding result concerning the exact distribution of the pivot Rn. The correct prediction confidence distribution becomes

Cpred(y0) = F_{n−1}((y0 − μ̂)/{σ̂(1 + 1/n)^{1/2}});

see again Exercise 12.1.

Figure 12.2 For a normal sample of size n = 5 the figure displays two prediction confidence curves (left) and two prediction confidence densities (right). Curves with dashed lines indicate the simple direct method associated with (12.4), which does not reflect uncertainty in the parameter estimates. The curves with full lines are the modified and correct ones, based on the correct distribution of (Y0 − μ̂)/σ̂.
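The two numbers quoted in Example 12.2 (coverage 0.852 of the naive interval, and the correct multiplier 3.041, both for n = 5) follow directly from the distribution of Rn; a minimal check:

```python
import numpy as np
from scipy import stats

n = 5
infl = np.sqrt(1 + 1/n)

# Actual coverage of the naive interval mu_hat +/- 1.96 sigma_hat:
cover = 2 * stats.t.cdf(1.96 / infl, df=n - 1) - 1

# Correct 95% multiplier, i.e. the 0.975 quantile of (1 + 1/n)^{1/2} t_{n-1}:
mult = infl * stats.t.ppf(0.975, df=n - 1)
```

So the nominal 95% plug-in interval covers only about 85% of the time at this sample size.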

Example 12.3 Predicting the next uniform observation

Suppose Y1, Y2, ... are independent data from the uniform distribution over [0, θ], with unknown θ. With Y(n) the largest of the first n observations, and Y0 = Yn+1 the point to follow, we may write Y(n) = θU(n) and Y0 = θU0, where U0 is standard uniform and U(n) is the maximum of n independent standard uniforms. It follows that Y0/Y(n) has the same distribution as W = U0/U(n), say Kn, and is a pivot. Hence the predictive confidence distribution is Cpred(y0) = Kn(y0/y(n)). The Kn may be computed exactly, and is a mixture of a uniform on [0, 1], carrying high probability n/(n + 1), and a second component on w ≥ 1 with density proportional to 1/w^{n+1}, carrying low probability 1/(n + 1). The predictive point estimate is ŷ0 = Cpred^{-1}(1/2) = ½(1 + 1/n) y(n). See Exercise 12.7.
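The mixture form of Kn in Example 12.3 amounts to the closed form Kn(w) = wn/(n + 1) for w ≤ 1 and Kn(w) = 1 − 1/{(n + 1)w^n} for w ≥ 1; a small simulation check of our own (seed and budget arbitrary):

```python
import numpy as np

n = 5
rng = np.random.default_rng(4)
B = 200_000

# W = U0/U_(n), with U0 uniform and U_(n) the maximum of n uniforms.
W = rng.uniform(size=B) / rng.uniform(size=(B, n)).max(axis=1)

def K_n(w):
    # Closed form implied by the uniform-plus-tail mixture described in the text.
    w = np.asarray(w, dtype=float)
    return np.where(w <= 1, w * n / (n + 1), 1 - 1 / ((n + 1) * w**n))

p_hat = {w: np.mean(W <= w) for w in (0.3, 1.0, 2.0)}

# Median predictive point: K_n(w) = 1/2 at w = (1/2)(1 + 1/n),
# giving yhat_0 = (1/2)(1 + 1/n) y_(n) as in the text.
w_med = 0.5 * (1 + 1/n)
```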

There are various ways of modifying or repairing the direct plug-in method of (12.4), as already illustrated for the normal case in Example 12.2. One may also bypass the estimated cumulative distribution (12.4) altogether and start from exact or approximate pivots.

A useful general strategy is as follows, leading to good approximations for predictive confidence. Suppose in general that there is a candidate for such an exact or approximate pivot, say Zn = gn(Y0, θ̂), with gn(y0, θ̂) increasing in y0. Its distribution

Hn(z, θ) = P{gn(Y0, θ̂) ≤ z; θ}

may depend (strongly or weakly) on where θ is in the parameter space; a successful choice of gn is one where this distribution is almost the same for θ in a reasonable neighbourhood around the true value. This may be checked by simulating Zn from different positions in the vicinity of the maximum likelihood estimate θ̂.

We may in particular compute the Hn distribution via simulation from the estimated position θ̂ in the parameter space, say

Ĥn(z) = Hn(z, θ̂) = B^{-1} ∑_{j=1}^B I{Z*_{n,j} ≤ z}.

Here the Z*_{n,j} are a large enough number of simulated versions gn(Y*_0, θ̂*), where both Y*_0 and the dataset Y*_1, ..., Y*_n leading to θ̂* are drawn from f(y, θ̂), that is, as in parametric bootstrapping. The consequent prediction confidence distribution is

Cpred(y0) = P{gn(Y0, θ̂) ≤ gn(y0, θ̂obs)} = Ĥn(gn(y0, θ̂obs)).   (12.5)
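The recipe (12.5) can be sketched for the exponential model f(y, θ) = θe^{−θy} with gn(y0, θ̂) = θ̂y0; the model, data seed, and budget here are illustrative choices of ours. This model is convenient because θ̂Y0 happens to be an exact pivot, so the bootstrap Ĥn can be checked against a closed form.

```python
import numpy as np

rng = np.random.default_rng(5)

# 'Observed' data and ML estimate under the exponential model.
y = rng.exponential(scale=2.0, size=20)
n = y.size
theta_hat = 1 / y.mean()

# Parametric bootstrap: draw both Y0* and a fresh dataset (giving theta*)
# from f(., theta_hat), and record Z* = g(Y0*, theta*) = theta* Y0*.
B = 100_000
ystar = rng.exponential(scale=1/theta_hat, size=(B, n))
theta_star = 1 / ystar.mean(axis=1)
y0_star = rng.exponential(scale=1/theta_hat, size=B)
Z_star = theta_star * y0_star

def C_pred(y0):
    # (12.5): C_pred(y0) = Hhat_n(g_n(y0, theta_hat))
    return np.mean(Z_star <= theta_hat * y0)
```

Here Z = θ̂*Y0* = Y0*/Ȳ* has exact distribution Hn(z) = 1 − (1 + z/n)^{-n}, so C_pred(y0) should agree, up to Monte Carlo error, with 1 − (1 + θ̂y0/n)^{-n}, the cdf of the predictive density found later in Example 12.6.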

Example 12.4 Predicting Egyptian lifetimes

Consider again the lifetimes of ancient Roman Egypt, studied in Examples 1.4, 3.7 and 10.1. For the present illustration we shall focus on the n = 59 observed lifetimes for women and model these using the Weibull distribution, with cumulative distribution F(y, a, b) = 1 − exp{−(y/a)^b}. Maximum likelihood estimates for (a, b) are (28.809, 1.546), with standard errors (estimates of standard deviation) (2.555, 0.153). This also gives rise to the directly estimated confidence curve

cc0(y0) = |2 F(y0, â, b̂) − 1| = |2 exp{−(y0/â)^b̂} − 1|.

This curve does not reflect the uncertainty in the maximum likelihood estimates, however. A more elaborate construction that does take this uncertainty into account is via a suitable approximate pivot, and then following the preceding recipe. The cumulative hazard rate here is (y/a)^b, so (Y/a)^b has a unit exponential distribution under the true parameters. Hence also


its logarithm, with estimated parameters plugged in, is at least asymptotically guaranteed to be a pivot, which motivates working with

Zn = gn(Y0, â, b̂) = b̂(log Y0 − log â).

Using the representation Y0 = aU0^{1/b}, in terms of a unit exponential U0, we also have Zn = b̂{V0/b − log(â/a)}, where V0 = log U0, so it is clear that for large sample size the distribution of Zn is close to that of V0, that is, independent of (a, b). For small or moderate n its distribution

Hn(z, a, b) = P{b̂(log Y0 − log â) ≤ z; a, b}

can be estimated at each position (a, b) in the parameter space, via simulation, and some inspection shows that it indeed is approximately the same for each (a, b) inside a reasonable neighbourhood around the maximum likelihood values. The general recipe given earlier then leads to the estimated predictive distribution

Cpred(y0) = Ĥn(b̂(log y0 − log â))

and the consequent ccn(y0) = |2 Cpred(y0) − 1|. See Figure 12.3.

Figure 12.3 Two prediction confidence curves are shown for the age of a woman from ancient Egypt, both based on the Weibull distribution. The dotted curve corresponds to the directly estimated Weibull distribution, whereas the solid line is the more elaborate construction based on the estimated distribution of the approximate pivot Zn = b̂(log Y0 − log â). The direct plug-in method has median point 22.7 (also equal to the maximum likelihood estimate of the median) and 90% prediction interval [4.2, 58.6], whereas the better method has median point 24.6 and 90% prediction interval [6.5, 53.0].


12.3 Comparison with Bayesian prediction

The Bayesian way of creating a predictive distribution is to integrate over the model parameters, with respect to their posterior distribution. This may be carried out on the cumulative as well as on the density scale, in the framework of (12.4) leading to

Fn(y0) = ∫ F(y0, θ) πn(θ) dθ and fn(y0) = ∫ f(y0, θ) πn(θ) dθ.   (12.6)

Here πn(θ) ∝ π(θ) Ln(θ) is the posterior distribution, having started with some prior π(θ); cf. Section 2.5. The data y upon which the posterior is based are suppressed in the notation here, so when precision requires it we would write πn(θ) = π(θ | y), along with Fn(y0, y) and fn(y0, y). The Bayesian strategy is clearly different in spirit from our frequentist setup, and there is a priori no ambition on the part of (12.6) to produce accurate confidence in the sense of (12.1). In other words, it is not clear to what extent Un = Fn(Y0; Y) has a distribution close to the uniform. It will follow from the arguments below, however, that Un tends to the uniform.

The preceding strategy might invite the parallel confidence distribution implementation, of the form

Fn(y0) = ∫ F(y0, θ) dCn(θ) and fn(y0) = ∫ f(y0, θ) dCn(θ),   (12.7)

in which the parameter is integrated with respect to a confidence distribution Cn, as opposed to a Bayesian posterior. This would work when θ is one-dimensional, but meets with certain trouble in higher dimensions, as discussed generally in Chapter 9; in particular, with several of the multidimensional confidence distribution schemes of that chapter, there is no clear recipe for sampling realisations θCD ∼ Cn. For large n, however, things become normal, as per the general theory of Chapters 2 and 3. For the most relevant confidence distributions, based on or closely related to maximum likelihood estimators or deviances, their densities take the form, or come close to, cn(θ) = φp(θ − θ̂, Ĵ^{-1}/n), corresponding to the underlying result √n(θ̂ − θ) →d Np(0, J^{-1}). Here φp(x, Σ) is the density of a p-dimensional Np(0, Σ), and Ĵ = −n^{-1} ∂²ℓn(θ̂)/∂θ∂θ^t. Thus the (12.7) scheme at least works in the approximate sense, for large n, with θCD ∼ Np(θ̂, Ĵ^{-1}/n).

But this precisely matches the Bernstein–von Mises type theorems, briefly described in Section 2.5, valid for any start prior. To the first order of magnitude, therefore, that is, up to 1/√n errors, the Bayes and our confidence distribution based recipes, that is, (12.6) and (12.7), become equivalent. In Section 2.5 we briefly discussed how the Bernstein–von Mises theorems offer a ‘lazy Bayesian’ strategy, bypassing the troublesome setting up of a concrete prior because it will be washed out by the data for large n. One may similarly contemplate a ‘lazy confidencian’ strategy, using (12.7) with θCD ∼ Np(θ̂, Ĵ^{-1}/n) to produce a predictive confidence distribution. Note that this may easily be carried out in practice by simulating say 10^4 copies of θCD from the relevant multinormal distribution and then computing the average of the f(y, θ) density curves.

It is also worthwhile working out an explicit approximation to the integrals of (12.6) and (12.7), for the practical reason that higher-dimensional integration easily becomes difficult to implement, and because it provides further insights. Write u(y, θ) and i(y, θ) for the first derivative and minus the second derivative of log f(y, θ). Then, for θ in the broad vicinity of the


maximum likelihood estimate θ̂, a second-order Taylor approximation gives

log f(y, θ) ≈ log f(y, θ̂) + u(y, θ̂)^t(θ − θ̂) − ½(θ − θ̂)^t i(y, θ̂)(θ − θ̂).

Now with both of the preceding approaches, we may write θ − θ̂ = Z/√n, where Z is close to and converges to an Np(0, J^{-1}). This yields

f(y, θ) ≈ f(y, θ̂) exp{u(y)^t Z/√n − ½ Z^t i(y) Z/n}

and hence a multiplicative approximation formula of the practical type

fn(y) = f(y, θ̂) rn(y),  rn(y) = E exp{u(y)^t Z/√n − ½ Z^t i(y) Z/n}.   (12.8)

Here u(y) = u(y, θ̂) and i(y) = i(y, θ̂), and the expectation is with respect to Z. The rn(y) term is of size 1 + O(1/n), so for large n we are back to the parametric plug-in method (12.4).

To proceed we need a technical lemma, as follows. Suppose W ∼ Np(0, Σ), and that a is any vector and B a symmetric p × p matrix. Then, as long as Λ^{-1} = Σ^{-1} + B is positive definite, we have

E exp(a^t W − ½ W^t B W) = |I + ΣB|^{-1/2} exp(½ a^t Λ a).   (12.9)

For details of a proof of this, see Exercise 12.5. We thus have fn(y) = f(y, θ̂) rn(y), featuring the correction term

rn(y) ≈ |I + Ĵ^{-1} i(y)/n|^{-1/2} exp{½ u(y)^t Ĵ^{-1} u(y)/n}
      ≈ [1 + Tr{Ĵ^{-1} i(y)}/n]^{-1/2} exp{½ u(y)^t Ĵ^{-1} u(y)/n}
      ≈ exp(½ [u(y)^t Ĵ^{-1} u(y) − Tr{Ĵ^{-1} i(y)}]/n).

A further multiplicative correction may be supplied, to the effect of making f(y, θ̂) rn(y) integrate to 1, which in general leads to better performance; see Glad et al. (2003). Importantly, these approximations are valid for both the Bayesian predictive and our frequentist predictive confidence densities. Also, they may lead to constructions gn(Y0, θ̂) that are closer to genuine pivots, that is, with distributions changing less with the model parameters, and therefore useful for the predictive confidence scheme (12.5).

Example 12.5 The next normal

Consider again the normal variate prediction setup of Example 12.2, but with the simplifying assumption that σ is known, which we then set to 1. The canonical confidence distribution for μ after having observed n data points from the N(μ, 1) is μCD ∼ N(ȳ, 1/n); cf. Section 3.5. Applying (12.7), this leads to the prediction confidence density

fn(y) = ∫ φ(y − θ) φ_{1/√n}(θ − ȳ) dθ = φ_{(1+1/n)^{1/2}}(y − ȳ),

where φκ(x) is used for κ^{-1}φ(κ^{-1}x), that is, the N(0, κ²) density. Hence the predictive density is y0 ∼ N(ȳ, 1 + 1/n). This is fully identical to what the Bayesian scheme (12.6) leads to, if the start prior is the flat (and improper) one. Interestingly, the 1 + O(1/n) type multiplicative correction scheme (12.8) gives a very close approximation to the exact answers.
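The convolution in Example 12.5 is easy to confirm numerically; a brief check, with ȳ and n as arbitrary illustrative values of ours:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

n, ybar = 5, 1.3   # illustrative values

def f_pred(y):
    # (12.7): integrate phi(y - theta) against the N(ybar, 1/n) confidence density.
    val, _ = quad(lambda th: stats.norm.pdf(y - th) * stats.norm.pdf(th, ybar, 1/np.sqrt(n)),
                  -np.inf, np.inf)
    return val

ys = np.array([-1.0, 0.5, 1.3, 3.0])
exact = stats.norm.pdf(ys, ybar, np.sqrt(1 + 1/n))   # N(ybar, 1 + 1/n) density
```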


Example 12.6 The next exponential observation

Suppose positive independent variables Y1, Y2, ... follow the exponential model with density f(y, θ) = θ exp(−θy). After having seen n such data points, the confidence distribution is Cn(θ) = Γ2n(2nθ/θ̂), with maximum likelihood estimate θ̂ = 1/Ȳ, and with Γ2n the distribution function of the χ²_{2n}. Inverting this gives the representation θCD = θ̂obs Vn, where Vn ∼ χ²_{2n}/(2n). The Bayes approach, starting with a Gamma(a, b) prior for θ, gives the posterior Gamma(a + n, b + nȳobs), which with n large compared to a and b means θB ∼ Gamma(n, n/θ̂obs). But this is manifestly the same characterisation as with the confidence distribution. Hence (12.6) and (12.7) lead to the same predictive confidence distribution, with density

fn(y) = θ̂(1 + θ̂y/n)^{-(n+1)}

for the next data point. Using the multiplicative correction (12.8) gives a close approximation to this. Also, this predictive density is equivalent to having started with the pivot θY0. For details and a generalisation, see Exercise 12.6.
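The claim that the Gamma mixture yields fn(y) = θ̂(1 + θ̂y/n)^{-(n+1)} can be verified by integrating θe^{−θy} against the Gamma(n, n/θ̂) distribution; a short check, with n and θ̂ as illustrative values of our choosing:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

n, theta_hat = 10, 0.7   # illustrative values

def f_mix(y):
    # Integrate f(y, theta) = theta*exp(-theta*y) against theta ~ Gamma(n, rate n/theta_hat).
    val, _ = quad(lambda th: th * np.exp(-th * y) * stats.gamma.pdf(th, a=n, scale=theta_hat/n),
                  0, np.inf)
    return val

def f_closed(y):
    return theta_hat * (1 + theta_hat * y / n) ** (-(n + 1))

total, _ = quad(f_closed, 0, np.inf)   # the closed form is a proper density
```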

Example 12.7 The next binomial

Suppose there is an unknown probability θ at work for tomorrow’s binomial experiment, which will give Y0 ∼ Bin(n0, θ). Assume also that there have been binomial experiments in the past, with the same θ, which may be combined to a Y ∼ Bin(n, θ). The task is to construct a confidence distribution for tomorrow’s Y0 after having observed yobs from the past.

Here we are outside the clear realm of exact confidence and (12.1), due to the discreteness of the sample space. One may attempt pivotal constructions starting from Y0 − n0 θ̂, perhaps divided by {θ̂(1 − θ̂)}^{1/2}, where θ̂ = Y/n. This works well for reasonable n and n0, but not quite for smaller sample sizes. With some numerical effort we may, however, compute

f̂(y0) = ∫_0^1 (n0 choose y0) θ^{y0}(1 − θ)^{n0−y0} dCn(θ) for y0 = 0, 1, ..., n0,

where Cn(θ) = Pθ{Y > yobs} + ½ Pθ{Y = yobs} is the optimal confidence distribution; we have done this through simulation of θCD from the Cn distribution. As predicted by the theory developed previously, this is very close to the Bayesian predictive distribution, starting with Bayes’s original prior for θ, that is, the uniform. Figure 12.4 illustrates this, with observed yobs = 20 from yesterday’s Bin(100, θ), with n0 = 25. The direct plug-in method of (12.4), using the binomial (n0, yobs/100), does not work well, as it fails to take the sampling variability of the estimate into account.

The f̂(y0) method works well here, but meets difficulties in the extreme case where y0 is very close to either zero or n. In that case the half-correction method used above for Cn(θ) cannot be trusted, with implications for f̂. In the extreme case of yobs = n one might argue that Cn(θ) = Pθ{Y = n} = θ^n is reasonable. This gives predictive confidence probabilities

∫_0^1 (n0 choose y0) θ^{y0}(1 − θ)^{n0−y0} n θ^{n−1} dθ = (n0 choose y0) n Γ(y0 + n)Γ(n0 − y0 + 1)/Γ(n0 + n + 1)


Figure 12.4 Confidence point probabilities for tomorrow’s binomial (25, θ), after having observed yobs = 20 from a previous binomial (100, θ). The f̂(y0) is close to the Bayes predictive distribution, but the plug-in method is too narrow.

for y0 = 0, 1, ..., n0. In particular, the chance of full success tomorrow too is then n/(n0 + n). In his famous memoir on inverse probability, Laplace (1774) found that the probability that the sun rises tomorrow too, given that it actually has risen n times in a row, should be (n + 1)/(n + 2), which is the Bayesian answer from a uniform prior; see Stigler (1986b) for an accurate translation and Stigler (1986c) for illumination and discussion of the memoir. Our confidence in tomorrow’s sun rising is rather n/(n + 1).
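The yobs = n calculation above reduces to a Beta integral; a short numerical check, with n = 100 and n0 = 25 as in the example:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import comb, gammaln

n, n0 = 100, 25

def pred_prob(y0):
    # (n0 choose y0) * n * Gamma(y0+n) Gamma(n0-y0+1) / Gamma(n0+n+1), via log-gammas.
    return comb(n0, y0) * np.exp(np.log(n) + gammaln(y0 + n)
                                 + gammaln(n0 - y0 + 1) - gammaln(n0 + n + 1))

probs = np.array([pred_prob(y0) for y0 in range(n0 + 1)])

# Direct integral against the confidence density n*theta^{n-1}, for one y0:
val, _ = quad(lambda th: comb(n0, 20) * th**20 * (1 - th)**(n0 - 20) * n * th**(n - 1), 0, 1)
```

In particular pred_prob(n0) recovers the full-success confidence n/(n0 + n) = 0.8 stated above.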

12.4 Prediction in regression models

Consider the normal linear regression model, say

$$Y_i = x_i^{\rm t}\beta + \varepsilon_i \quad\text{for } i = 1,\ldots,n,$$

where $x_1,\ldots,x_n$ are $p$-dimensional covariate vectors whose matrix $\Sigma_n = n^{-1}\sum_{i=1}^n x_i x_i^{\rm t}$ has full rank, and the $\varepsilon_i$ are i.i.d. ${\rm N}(0,\sigma^2)$. How can we predict the value of an independent but as yet unobserved $Y_0$, associated with position $x_0$ in the covariate space, and how can prediction intervals be formed?

Here $Y_0 \sim {\rm N}(x_0^{\rm t}\beta, \sigma^2)$, where the mean can be estimated by $x_0^{\rm t}\hat\beta$, with variance $\sigma^2 x_0^{\rm t}\Sigma_n^{-1}x_0/n$; see Section 3.5. Here $\hat\beta = \Sigma_n^{-1}n^{-1}\sum_{i=1}^n x_i Y_i$ is the ordinary least squares estimator of $\beta$. A natural pivot is therefore

$$t = \frac{Y_0 - x_0^{\rm t}\hat\beta}{\hat\sigma(1 + n^{-1}x_0^{\rm t}\Sigma_n^{-1}x_0)^{1/2}}, \qquad (12.10)$$


where $\hat\sigma^2 = Q_0/(n-p)$ and $Q_0 = \sum_{i=1}^n (Y_i - x_i^{\rm t}\hat\beta)^2$ is the minimum sum of squares. When the error distribution is indeed normal, the $t$ has an exact $t_{n-p}$ distribution. It follows that

$$C_n(y_0) = F_{n-p}\Big(\frac{y_0 - x_0^{\rm t}\hat\beta}{\hat\sigma(1 + n^{-1}x_0^{\rm t}\Sigma_n^{-1}x_0)^{1/2}}\Big) \qquad (12.11)$$

is a confidence distribution for the unobserved $Y_0$, in the natural sense that it generates all prediction intervals. Here $F_{n-p}(\cdot)$ is the distribution function of the $t_{n-p}$ distribution. We may similarly form the confidence prediction density $c_n(y_0)$, which becomes a properly scaled $t_{n-p}$ density centred at $x_0^{\rm t}\hat\beta$, and the confidence prediction curve ${\rm cc}_n(y_0) = |2C_n(y_0) - 1|$. We note that this method is first-order correct even without the error distribution being exactly normal, as long as the $\varepsilon_i$ are independent with the same standard deviation.

When $n$ is large, $\hat\beta$ becomes more precise, but the uncertainty in the $\varepsilon_0$ part of $Y_0 = x_0^{\rm t}\beta + \varepsilon_0$ persists; in other words, there is no consistent predictor of $Y_0$, and the confidence distribution (12.11) does not become narrower than its limit $\Phi((y_0 - x_0^{\rm t}\beta)/\sigma)$.
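For a concrete feel for (12.11), here is a pure-Python sketch for the straight-line special case $x_i^{\rm t}\beta = \beta_0 + \beta_1 x_i$; the helper names, the toy data and the crude numerical $t$ CDF are our own scaffolding, not part of the text's apparatus:

```python
import math

def t_cdf(z, df, steps=4000):
    # numerical CDF of the t_df distribution (trapezoidal rule on [-40, z])
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)
    f = lambda t: c * (1 + t * t / df) ** (-(df + 1) / 2)
    lo = -40.0
    h = (z - lo) / steps
    total = sum(0.5 * h * (f(lo + i * h) + f(lo + (i + 1) * h)) for i in range(steps))
    return max(0.0, min(1.0, total))

def predictive_cd(x, y, x0):
    # C_n(y0) as in (12.11), for simple straight-line regression
    n, p = len(x), 2
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(u * v for u, v in zip(x, y))
    det = n * sxx - sx * sx
    b1 = (n * sxy - sx * sy) / det
    b0 = (sy - b1 * sx) / n
    q0 = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma = math.sqrt(q0 / (n - p))
    # quadratic form (1, x0) (X^t X)^{-1} (1, x0)^t, via the explicit 2x2 inverse;
    # this equals n^{-1} x0^t Sigma_n^{-1} x0 of the text
    quad = (sxx - 2 * x0 * sx + n * x0 * x0) / det
    scale = sigma * math.sqrt(1 + quad)
    return lambda y0: t_cdf((y0 - b0 - b1 * x0) / scale, n - p)

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 16.2]
cd = predictive_cd(x, y, 10.0)   # predict beyond the observed range
```

The returned function is monotone from 0 to 1 in $y_0$, centred near the extrapolated trend value, with the extra width $(1 + \text{quad})^{1/2}$ reflecting the estimation uncertainty in $\hat\beta$.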

Example 12.8 Broken records (Example 10.2 continued)

We return to the data used in Oeppen and Vaupel (2002) to demonstrate the amazingly constant linear growth in the expected lifelengths for women of the 'record-holding country' (from Sweden and Norway 1840–1859 via New Zealand 1880–1915 to Japan 1986–2000), at a rate of very nearly three months per calendar year. Here we use these data to predict the lifelengths of women (in the appropriate record-breaking country) for the years 2020, 2030, 2040, 2050. The aforementioned apparatus is used, specifically taking the derivative of (12.11) to display densities rather than cumulatives; see Figure 12.5. We note that the spread of these distributions is determined via $\hat\sigma(1 + n^{-1}x_0^{\rm t}\Sigma_n^{-1}x_0)^{1/2}$, which grows slowly with the time horizon contemplated; that is, the standard deviation grows and the top point of the densities becomes lower as we pass from 2020 to 2050. These changes are hardly noticeable, however; some details are in Exercise 12.8. Note that these predictions assume the linear trend to continue; uncertainty with respect to this assumption is not accounted for here.

The Oeppen and Vaupel (2002) data utilise records that stop with the year 2000 (and the entries for 1996–2000 for the highest female mean lifetimes are 83.60, 83.80, 84.00, 83.90, 84.62); cf. Figure 10.2. We may check how the consequent predictive distribution for the year 2012 matches reality, for which we may consult CIA: The World Factbook. The predictive distribution for 2012, as seen from 2000, is a $t_{159}$ distribution with centre 87.450 and spread parameter 1.555. The publication just mentioned tells us that for 2012, Japan and Macau are tied for the record female life expectancy among countries with at least half a million inhabitants, both with 87.40 years. This is amazingly close to the centre of the predictive distribution (and in fact equal to its 0.487 quantile).

Example 12.9 Predicting unemployment

Gujarati (1968) analyses data related to unemployment in the USA and certain 'help wanted' indexes. We shall use the unemployment rates for 24 quarters, running from the first quarter of 1962 to the fourth quarter of 1967, to predict the unemployment for quarter no. 25, that is, the first quarter of 1968; see Figure 12.6. We fit a regression model with a third-order


[Figure 12.5 here: expected female lifelength (85 to 100) on the horizontal axis, prediction confidence density (0 to 0.25) on the vertical; curves for 2020, 2030, 2040, 2050.]

Figure 12.5 Confidence prediction densities for the expected lifelength of women in the appropriate record-holding countries for the calendar years 2020, 2030, 2040, 2050. See Example 12.8.

[Figure 12.6 here: year (1962 to 1968) on the horizontal axis, unemployment (4.0 to 5.5) on the vertical.]

Figure 12.6 Predicting the unemployment rate for the first quarter of 1968, based on unemployment observations from the first quarter of 1962 up to and including the fourth quarter of 1967: quantiles 0.1, 0.3, 0.5, 0.7, 0.9 of the predictive confidence distribution. See Example 12.9.


polynomial trend to the data, that is,

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \sigma\varepsilon_i \quad\text{for } i = 1,\ldots,24,$$

with the $\varepsilon_i$ taken as independent standard normals. The model fits well, as revealed by inspection of residuals and by the Akaike information criterion (AIC) score. The recipe from (12.11) gives the prediction confidence distribution

$$C(y_{25}) = F_{21}\Big(\frac{y_{25} - \hat m(x_{25})}{\hat\sigma\{1 + d(x_{25})^{\rm t}\hat\Sigma^{-1}d(x_{25})/25\}^{1/2}}\Big),$$

where $\hat m(x)$ is the estimated trend function $\hat\beta_0 + \hat\beta_1 x + \hat\beta_2 x^2 + \hat\beta_3 x^3$ at position $x$. Also, $d(x) = (1, x, x^2, x^3)^{\rm t}$ and $\hat\Sigma = X^{\rm t}X/n$ is the $4\times 4$ matrix associated with the regression analysis with the cubic trend.

The predictive confidence distribution may clearly be displayed in different ways, depending on the context and purpose. Plotting the density of the confidence distribution, centred at $\hat m(x_{1968/{\rm I}}) = 4.146$, is instructive; see Exercise 12.9. To the right in Figure 12.6 we show the 0.1, 0.3, 0.5, 0.7, 0.9 confidence quantiles for 1968/I, based on data up to 1967/IV and the fitted model. The particular phenomenon of prediction, that the distant future is harder to predict than tomorrow, is visible in Figure 12.7, concerning the four quarters of 1968, as seen from 1967/IV. The prediction estimates are 4.146, 4.398, 4.773, 5.126, but with steadily increasing prediction margins, as indicated by the now horizontally positioned confidence curves. The 90% prediction intervals are also plotted. Taking model uncertainty into account would reveal even greater variation around the point estimates. That this phenomenon is less visible for the case of Example 12.8 has to do with

[Figure 12.7 here: year (1962 to 1970) on the horizontal axis, unemployment with prediction (4.0 to 5.5) on the vertical.]

Figure 12.7 Predicting the unemployment rate for the four quarters of 1968, based on data up to the fourth quarter of 1967: predictive confidence curves, along with 90% confidence intervals. See Example 12.9.


the regression model used there being simpler and fitting better. Also, it is clear that the analysis leading to the predictions displayed in Figure 12.7 relies more on simple curve fitting than on any sound econometric model. This might be fruitful and give accurate short-term predictions, but not when the time horizon is longer. Model uncertainty is also greater for the future, and this is not accounted for in Figure 12.7.

The preceding methodology may be suitably generalised to more complex regression models, sometimes with separate transformation and resampling tricks to compensate for the lack of clear and informative pivots. In such situations a clean formula like that of (12.11) cannot be reached.

To illustrate this line of construction, consider again the exponential Weibull hazard rate model used in Example 4.10 to analyse survival time after operation for patients with carcinoma of the oropharynx. The hazard rate for an individual with covariates $x_1, x_2, x_3, x_4$ takes the form

$$\alpha(s) = \gamma s^{\gamma-1}\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4).$$

If $T_0$ is drawn from this survival distribution, then $A(T_0)$ is a unit exponential, where $A(t) = t^\gamma\exp(x^{\rm t}\beta)$ is the cumulative hazard function. The estimated $\hat A(T_0)$, and hence its more convenient logarithm, should be close to a pivot, so we consider

$$Z_n = g_n(T_0, \hat\theta) = \log A(T_0, \hat\beta, \hat\gamma) = \hat\gamma\log T_0 + x^{\rm t}\hat\beta.$$

The $T_0$ in question may be represented as $T_0 = \{V\exp(-x^{\rm t}\beta)\}^{1/\gamma}$, with $V$ a unit exponential. Hence

$$Z_n = (\hat\gamma/\gamma)(\log V - x^{\rm t}\beta) + x^{\rm t}\hat\beta.$$

Its distribution is complicated, but may be simulated at any position in the $(\beta,\gamma)$ parameter space, and will at least for large $n$ approach that of $\log V$. Using the approach outlined in connection with (12.5) we reach the following predictive confidence distribution for $T_0$:

$$C_{\rm pred}(t_0) = P\{g_n(T_0,\hat\theta) \le g_n(t_0,\hat\theta_{\rm obs})\}
= P_{\rm ML}\{(\hat\gamma^*/\hat\gamma_{\rm obs})(\log V^* - x^{\rm t}\hat\beta_{\rm obs}) + x^{\rm t}\hat\beta^* \le \hat\gamma_{\rm obs}\log t_0 + x^{\rm t}\hat\beta_{\rm obs}\}.$$

Here the probability is computed via simulation of $(\hat\beta^*, \hat\gamma^*)$ based on datasets of the same size $n = 193$ drawn from the estimated model, and with the same censoring pattern. Also, the $V^*$ are simply unit exponentials, drawn independently. Note that this predictive distribution is constructed for one individual at a time, so to speak, that is, an individual with specified covariates equal to $x_1, x_2, x_3, x_4$.
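The Monte Carlo evaluation of this $C_{\rm pred}(t_0)$ can be sketched as follows; note that the numerical values, and the Gaussian jitter standing in for genuine parametric-bootstrap refits $(\hat\beta^*, \hat\gamma^*)$ with censoring, are purely illustrative assumptions, not the Example 4.10 fit:

```python
import math, random

def cpred_weibull(t0, gamma_obs, xtbeta_obs, draws, rng):
    # Monte Carlo fraction of draws with
    #   (gamma*/gamma_obs)(log V* - x^t beta_obs) + x^t beta* <= gamma_obs log t0 + x^t beta_obs
    rhs = gamma_obs * math.log(t0) + xtbeta_obs
    hits = 0
    for gamma_star, xtbeta_star in draws:
        v = rng.expovariate(1.0)                       # unit exponential V*
        lhs = (gamma_star / gamma_obs) * (math.log(v) - xtbeta_obs) + xtbeta_star
        hits += lhs <= rhs
    return hits / len(draws)

rng = random.Random(7)
gamma_obs, xtbeta_obs = 1.2, -4.0                      # hypothetical estimates
# mocked bootstrap draws; in the real recipe these come from refitting
# simulated datasets of size n = 193 with the same censoring pattern
draws = [(abs(rng.gauss(gamma_obs, 0.1)), rng.gauss(xtbeta_obs, 0.3))
         for _ in range(4000)]
cd = [cpred_weibull(t0, gamma_obs, xtbeta_obs, draws, rng)
      for t0 in (5.0, 20.0, 80.0)]
```

The resulting values trace out a nondecreasing predictive confidence distribution in $t_0$, as they should.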

12.5 Time series and kriging

Prediction is often required in situations with statistical dependencies among the observations, as when predicting the next data point of an ongoing time series. The methodology developed above applies, but it might, for example, be harder to construct relevant pivots. In this section we briefly indicate how methods may be developed for such contexts.


To illustrate a certain general point we start out considering a very simple setup, with two correlated normals, where one attempts to predict one from the other. Let in fact

$$\begin{pmatrix} Y_0 \\ Y \end{pmatrix} \sim {\rm N}_2\Big(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\ \sigma^2\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\Big),$$

so far with correlation $\rho$ and standard deviation $\sigma$ known, and where $Y_0$ is to be predicted based on $Y = y_{\rm obs}$. One scheme is to start out with $Y_0 - \rho Y$ and divide by its standard deviation (estimated, if necessary). This leads to

$$T = \frac{Y_0 - \rho Y}{\sigma(1-\rho^2)^{1/2}}.$$

It is seen to be a pivot (its distribution is standard normal), and the confidence distribution for $Y_0$ is

$$C_{\rm pred}(y_0) = \Phi\Big(\frac{y_0 - \rho y_{\rm obs}}{\sigma(1-\rho^2)^{1/2}}\Big).$$

A different approach is to work with the conditional distribution of $Y_0$ given $Y = y_{\rm obs}$. Using standard properties of the multinormal, this is ${\rm N}(\rho y_{\rm obs}, \sigma^2(1-\rho^2))$, which on standardisation leads to $(Y_0 - \rho y_{\rm obs})/\{\sigma(1-\rho^2)^{1/2}\}$ and the observation that its distribution is standard normal, for each $y_{\rm obs}$, and hence also unconditionally. We learn that the two ideas agree and end up with the same $C_{\rm pred}(y_0)$.

In more complicated situations these two approaches will not necessarily match each other, however, depending also on how one builds one's exact or approximate pivot. In typical applications one would also need to estimate both correlation and variances from the data, with complications for both approaches.

We record one generally useful scheme, as an extension of the second type of approach, where the aim is to work out the appropriate conditional distribution of the unknown given what has been observed, and then estimate this distribution. Assume

$$\begin{pmatrix} Y_0 \\ Y \end{pmatrix} \sim {\rm N}_{n+1}\Big(\begin{pmatrix} \xi_0 \\ \xi \end{pmatrix},\ \sigma^2\begin{pmatrix} 1 & k^{\rm t} \\ k & A \end{pmatrix}\Big),$$

with $A$ being the correlation matrix for $Y$ (with value 1 along its diagonal) and $k_j$ the correlation between $Y_0$ and component $Y_j$. Then

$$Y_0 \,|\, y \sim {\rm N}(\xi_0 + k^{\rm t}A^{-1}(y - \xi),\ \sigma^2(1 - k^{\rm t}A^{-1}k)).$$

The ideal pivot would therefore be

$$T = \frac{Y_0 - \xi_0 - k^{\rm t}A^{-1}(y-\xi)}{\sigma(1 - k^{\rm t}A^{-1}k)^{1/2}},$$

since its distribution is standard normal regardless of $y$, and hence also unconditionally. In particular, $U = \Phi(T)$ is uniform, so

$$C_0(y_0) = \Phi\Big(\frac{y_0 - \xi_0 - k^{\rm t}A^{-1}(y-\xi)}{\sigma(1 - k^{\rm t}A^{-1}k)^{1/2}}\Big)$$

describes the ideal predictive confidence distribution. In most relevant applications there would, however, be unknown parameters present for the mean function, the variance level and the correlations.
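With all parameters known, the ideal $C_0(y_0)$ is a direct computation; a small pure-Python sketch, using a hypothetical exponential correlation structure over one-dimensional positions (all helper names and numbers are ours):

```python
import math

def solve(A, b):
    # Gaussian elimination with partial pivoting, for small systems A x = b
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def ideal_cd(y, xi, xi0, sigma, A, k):
    # C_0(y0) = Phi((y0 - xi0 - k^t A^{-1}(y - xi)) / (sigma (1 - k^t A^{-1} k)^{1/2}))
    w = solve(A, k)                                    # w = A^{-1} k
    mean = xi0 + sum(wi * (yi - xii) for wi, yi, xii in zip(w, y, xi))
    sd = sigma * math.sqrt(1 - sum(wi * ki for wi, ki in zip(w, k)))
    return (lambda y0: 0.5 * (1 + math.erf((y0 - mean) / (sd * math.sqrt(2))))), mean, sd

# three observations at positions 1, 2, 3; predict at 2.5, with corr = exp(-d)
pos, x_new, lam = [1.0, 2.0, 3.0], 2.5, 1.0
A = [[math.exp(-lam * abs(a - b)) for b in pos] for a in pos]
k = [math.exp(-lam * abs(x_new - a)) for a in pos]
cd, mean, sd = ideal_cd([0.9, 1.4, 1.1], [1.0, 1.0, 1.0], 1.0, 1.0, A, k)
```

The conditional standard deviation comes out strictly below $\sigma$, quantifying what the correlated neighbours contribute.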


As long as the correlation structure is known, in particular also $A$ and $k$, the situation is not too complicated, as one may also transform the variables back to independence. In particular, if the mean ${\rm E}\,Y = \xi$ is linear in say $p$ parameters, then these may be estimated using maximum likelihood (which here takes the form of matrix-weighted least squares), along with an estimate of $\sigma$ of the form $\hat\sigma^2 = Q_0/(n-p) \sim \sigma^2\chi^2_{n-p}/(n-p)$, where in fact $Q_0 = (Y - \hat\xi)^{\rm t}A^{-1}(Y - \hat\xi)$. In such cases the estimated version of the pivot given earlier has a $t$-distribution with $n-p$ degrees of freedom, and predictive analysis for $Y_0$ may proceed. Specifically,

$$C_{\rm pred}(y_0) = F_\nu(T^*) = F_\nu\Big(\frac{y_0 - \hat\xi_0 - k^{\rm t}A^{-1}(y - \hat\xi)}{\hat\sigma(1 - k^{\rm t}A^{-1}k)^{1/2}}\Big), \qquad (12.12)$$

where $F_\nu$ is the appropriate $t$ distribution function with $\nu = n - p$ degrees of freedom.

The problems grow harder when the correlation structure needs to be estimated too. First of all one needs a method for estimating the $A$ and the $k$ above, under suitable conditions. Second, the modified version of $T$ or $T^*$ above will rarely be a genuine pivot. This calls for further specialised methods, depending on the model and its ingredients, either in the direction of finessing $T^*$ to make its distribution independent of parameters, or via bootstrapping and corrections to account for the sampling variability in the relevant $\hat A$ and $\hat k$.

To illustrate some of these issues, consider a situation where $Y$ and $Y_0$ have the same mean $\xi$, say as part of a stationary time series, and where there is a parametric correlation function, say ${\rm corr}(Y_i,Y_j) = c(d_{i,j},\lambda)$, featuring the interdistances $d_{i,j}$ between data points. Then the log-likelihood function is

$$\ell(\xi,\sigma,\lambda) = -n\log\sigma - \tfrac12\log|A| - \tfrac12(y - \xi 1)^{\rm t}A^{-1}(y - \xi 1)/\sigma^2.$$

Here $1$ is the vector $(1,\ldots,1)^{\rm t}$, and the parameter $\lambda$ is involved in $A = A(\lambda)$. First, the above may be maximised over $\xi$ and $\sigma$ for fixed $\lambda$, leading to

$$\hat\xi(\lambda) = 1^{\rm t}A^{-1}y/(1^{\rm t}A^{-1}1) \quad\text{and}\quad \hat\sigma^2(\lambda) = (y - \hat\xi 1)^{\rm t}A^{-1}(y - \hat\xi 1)/n.$$

This gives the profiled log-likelihood

$$\ell_{\rm prof}(\lambda) = -n\log\hat\sigma(\lambda) - \tfrac12\log|A(\lambda)| - \tfrac12 n,$$

defining next the maximum likelihood estimate $\hat\lambda$ and the consequent $\hat\xi(\hat\lambda)$ and $\hat\sigma(\hat\lambda)$.

In various geostatistical applications the aforementioned procedure is at least partly followed, fitting spatial data to a certain covariance structure, say with covariances ${\rm cov}(Y_i,Y_j) = \sigma^2\exp(-\lambda d_{i,j})$. Inserting the estimate of $\lambda$, one has a machinery for spatial interpolation, via the estimated conditional distribution of $Y_0$ at a new geographical position given the observations $Y_j$ at their positions. This mechanism for predicting $Y_0$ at a set of new positions is called kriging; see, for example, Ripley (1981) and Cressie (1993). It is also not uncommon to conveniently gloss over the sampling variability inherent in finding $\hat\lambda$, leading then to certain prediction intervals, say displayed pointwise for a region of interest, which might be artificially narrow. The reason is that the implicit or explicit construction

$$T^* = T(\hat\lambda) = \frac{Y_0 - \hat\xi - \hat k^{\rm t}\hat A^{-1}(y - \hat\xi 1)}{\hat\sigma(1 - \hat k^{\rm t}\hat A^{-1}\hat k)^{1/2}}$$

has a more complicated distribution than the $t$-distribution, and is no longer a pivot. In particular, using (12.12) would not be appropriate.
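The profiling recipe above, maximising over $\xi$ and $\sigma$ for fixed $\lambda$ and then inspecting $\ell_{\rm prof}(\lambda)$, can be sketched in a few lines; this assumes the exponential correlation model just mentioned, with toy data and function names of our own:

```python
import math

def profile_loglik(y, dist, lam):
    # l_prof(lam) = -n log sigma^(lam) - 0.5 log|A(lam)| - 0.5 n,
    # with A_ij = exp(-lam * |dist_i - dist_j|)
    n = len(y)
    A = [[math.exp(-lam * abs(dist[i] - dist[j])) for j in range(n)] for i in range(n)]
    # Gaussian elimination: log-determinant plus solutions A^{-1}1 and A^{-1}y
    M = [A[i][:] + [1.0, y[i]] for i in range(n)]
    logdet = 0.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        logdet += math.log(abs(M[c][c]))
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 2):
                M[r][j] -= f * M[c][j]
    sol = [[0.0] * n, [0.0] * n]
    for t in (0, 1):
        for r in range(n - 1, -1, -1):
            sol[t][r] = (M[r][n + t] - sum(M[r][j] * sol[t][j]
                                           for j in range(r + 1, n))) / M[r][r]
    Ainv1, Ainvy = sol
    xi = sum(Ainvy) / sum(Ainv1)            # xi^(lam) = 1^t A^{-1} y / 1^t A^{-1} 1
    # sigma^2(lam) = (y - xi 1)^t A^{-1} (y - xi 1) / n
    sigma2 = sum((y[i] - xi) * (Ainvy[i] - xi * Ainv1[i]) for i in range(n)) / n
    return -0.5 * n * math.log(sigma2) - 0.5 * logdet - 0.5 * n, xi, sigma2
```

Evaluating this over a grid of $\lambda$ values and taking the maximiser gives $\hat\lambda$; for very large $\lambda$ the matrix $A$ approaches the identity and $\hat\xi$, $\hat\sigma^2$ reduce to the plain sample mean and variance, a handy sanity check.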


How to proceed in such situations depends on details of the model, the sample size, the spread of the data in relation to their correlation, and the parameter values, in particular the $\lambda$ involved in the correlation function. It is difficult to give a unified and always successful strategy for fine-tuning ratios such as $T^*$ into pivots and hence accurate predictive confidences.

Let us nevertheless end this brief review of contextual prediction by describing a special case with a simple structure, that of the autoregressive time series of order one. Suppose $Y_i = \xi + \sigma\varepsilon_i$, where the $\varepsilon_i$ follow such an AR(1) model, with ${\rm cov}(\varepsilon_i,\varepsilon_j) = \rho^{|i-j|}$. In that case there are simplifying formulae for the inverse and determinant of the $A$ above, leading in particular to the simple fact that

$$Y_{n+1} \,|\, \text{past} \sim {\rm N}(\xi + \rho(y_n - \xi),\ \sigma^2(1-\rho^2)).$$

This incidentally says that the $Y_i$ form a Markovian process. Estimates may be found for $\xi, \sigma, \rho$, as per the aforementioned likelihood recipe, and the starting point for forming prediction intervals would be

$$T^* = \frac{Y_{n+1} - \hat\xi - \hat\rho(Y_n - \hat\xi)}{\hat\sigma(1 - \hat\rho^2)^{1/2}}.$$

Its distribution depends on $\rho$, sometimes with a variance changing rapidly with that value (again, depending on the sample size and the size of $\rho$). A strategy is then to exploit simulations to estimate this underlying variance function, say $v_n(\rho)$, for example in the spirit of pivot tuning; see Section 7.3. This leads to say $T^{**} = T^*/v_n(\hat\rho)^{1/2}$, which might still have a variance changing with $\rho$, but often considerably less so. Such efforts, with a second layer of iteration if required, finally give a bona fide predictive confidence distribution for the next data point in such a time series, with guaranteed confidence limits.
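A minimal sketch of the variance-estimation step for pivot tuning, with simple moment estimators standing in (as an assumption of ours) for the maximum likelihood recipe of the text:

```python
import random

def vn_hat(rho, n=50, nsim=2000, seed=1):
    # Monte Carlo estimate of v_n(rho), the variance of the approximate pivot
    # T* = (Y_{n+1} - xi^ - rho^ (Y_n - xi^)) / {sigma^ (1 - rho^^2)^(1/2)},
    # under a stationary Gaussian AR(1) with xi = 0 and sigma = 1.
    rng = random.Random(seed)
    tvals = []
    for _ in range(nsim):
        y = [rng.gauss(0.0, 1.0)]                       # stationary start
        for _ in range(n):
            y.append(rho * y[-1] + rng.gauss(0.0, (1 - rho ** 2) ** 0.5))
        past, nxt = y[:-1], y[-1]
        m = sum(past) / n
        s2 = sum((v - m) ** 2 for v in past) / n
        r = sum((past[i] - m) * (past[i + 1] - m) for i in range(n - 1)) / (n * s2)
        r = max(-0.99, min(0.99, r))                    # guard against |r| >= 1
        tvals.append((nxt - m - r * (past[-1] - m)) / (s2 * (1 - r ** 2)) ** 0.5)
    mean = sum(tvals) / nsim
    return sum((t - mean) ** 2 for t in tvals) / nsim
```

Tabulating `vn_hat` over a grid of $\rho$ values gives the correction curve used for $T^{**} = T^*/v_n(\hat\rho)^{1/2}$; for moderate $n$ it sits somewhat above 1, reflecting the estimation noise in $\hat\xi$, $\hat\sigma$ and $\hat\rho$.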

12.6 Spatial regression and prediction

Consider a spatial regression model with correlated observations, of the type $Y_i = x_i^{\rm t}\beta + \sigma\varepsilon_i$ for $i = 1,\ldots,n$, with a suitable correlation function ${\rm corr}(\varepsilon_i,\varepsilon_j) = c(d_{i,j},\lambda)$. Here $d_{i,j}$ indicates the distance between observations $Y_i$ and $Y_j$, in a clear geographical sense or via some other relevant measure, and $\lambda$ is the parameter related to this correlation function. Also, the $x_i$ are $p$-dimensional covariate vectors, with the ensuing $n\times p$ matrix $X$ of row vectors $x_i^{\rm t}$ having full rank. We shall give attention to an unobserved $Y_0 = x_0^{\rm t}\beta + \sigma\varepsilon_0$, for which we wish to construct a predictive confidence distribution. Assuming a Gaussian structure, we have

$$\begin{pmatrix} Y_0 \\ Y \end{pmatrix} \sim {\rm N}_{n+1}\Big(\begin{pmatrix} x_0^{\rm t}\beta \\ X\beta \end{pmatrix},\ \sigma^2\begin{pmatrix} 1 & k^{\rm t} \\ k & A \end{pmatrix}\Big),$$

with $A$ the matrix of intercorrelations for the $n$ observations and with $k$ the vector with components $k_i = {\rm corr}(\varepsilon_0,\varepsilon_i)$. The conditional distribution of the unobserved $Y_0$ given the observations $Y = y$ is

$$Y_0 \,|\, y \sim {\rm N}(x_0^{\rm t}\beta + k^{\rm t}A^{-1}(y - X\beta),\ \sigma^2(1 - k^{\rm t}A^{-1}k)).$$

Thus the general strategy outlined above suggests working with an estimated version of $V_0 = Y_0 - x_0^{\rm t}\beta - k^{\rm t}A^{-1}(y - X\beta)$, and then dividing by a scale estimate to form an approximate pivot.


Assume first that the correlation function is actually known, hence also the matrix $A$ and vector $k$. We then use

$$\hat\beta = (X^{\rm t}A^{-1}X)^{-1}X^{\rm t}A^{-1}y = HX^{\rm t}A^{-1}y, \quad\text{with } H = (X^{\rm t}A^{-1}X)^{-1}.$$

We may express $\hat\beta$ as $\beta + \sigma HX^{\rm t}A^{-1}\varepsilon$, and learn that $\hat\beta \sim {\rm N}_p(\beta, \sigma^2 H)$. The estimated version of $V_0$ is

$$V = Y_0 - x_0^{\rm t}\hat\beta - k^{\rm t}A^{-1}(Y - X\hat\beta)
= \sigma\{\varepsilon_0 - x_0^{\rm t}HX^{\rm t}A^{-1}\varepsilon - k^{\rm t}A^{-1}(\varepsilon - XHX^{\rm t}A^{-1}\varepsilon)\}.$$

This is a normal variable with zero mean. Putting in decent algebraic efforts one finds that its variance is ${\rm Var}\,V = \sigma^2\tau(x_0)^2$, with

$$\tau(x_0)^2 = 1 - k^{\rm t}A^{-1}k + (x_0 - X^{\rm t}A^{-1}k)^{\rm t}H(x_0 - X^{\rm t}A^{-1}k). \qquad (12.13)$$

With $\hat\sigma^2 = Q_0/(n-p)$, where $Q_0 = (y - X\hat\beta)^{\rm t}A^{-1}(y - X\hat\beta)$, one finds the predictive confidence distribution

$$C_{\rm pred}(y_0) = F_{n-p}\Big(\frac{y_0 - x_0^{\rm t}\hat\beta - k^{\rm t}A^{-1}(y - X\hat\beta)}{\hat\sigma\,\tau(x_0)}\Big),$$

with $F_\nu$ the distribution function of a $t_\nu$. This is the spatial generalisation of the previous result (12.11), which corresponds to statistical independence between observations and hence to $k = 0$ and simplified expressions above. Note in particular how the variance factor associated with predicting $Y_0$ at $x_0$ changes from the previous $1 + x_0^{\rm t}(X^{\rm t}X)^{-1}x_0$ to the present $1 - k^{\rm t}A^{-1}k + (x_0 - X^{\rm t}A^{-1}k)^{\rm t}H(x_0 - X^{\rm t}A^{-1}k)$.

The preceding formulae are generally valid for any spatial set of observations, as long as there is a correlation function leading to $A$ and $k$; in particular, these formulae may be used whether or not simplified versions are available. It is however enlightening to consider the special case of a time series model with trend and autocorrelation, as with Example 4.13. That particular model takes $y_i = a + bi + \sigma\varepsilon_i$, with time running equidistantly from 1 to $n$ and with autocorrelation ${\rm corr}(y_i,y_j) = \rho^{|i-j|}$. Then $A^{-1}$ is a tridiagonal band matrix. When predicting the next observation $Y_{n+1}$, we have $k = (\rho^n,\ldots,\rho)^{\rm t}$, and $k^{\rm t}A^{-1} = (0,\ldots,0,\rho)$. Matters then simplify to

$$C_{\rm pred,1}(y_{n+1}) = F_{n-2}\Big(\frac{y_{n+1} - \hat a - \hat b(n+1) - \rho(y_n - \hat a - \hat b n)}{\hat\sigma\,\tau(n+1)}\Big),$$

with a formula for $\tau(n+1)$ following from (12.13). Similarly, when predicting two time steps ahead with this model, the confidence distribution takes the form

$$C_{\rm pred,2}(y_{n+2}) = F_{n-2}\Big(\frac{y_{n+2} - \hat a - \hat b(n+2) - \rho^2(y_n - \hat a - \hat b n)}{\hat\sigma\,\tau(n+2)}\Big),$$

and so on. These may be turned into prediction bands for the future, say

$$y_{n+j} \in \hat a + \hat b(n+j) + \rho^j(y_n - \hat a - \hat b n) \pm \hat\sigma\,\tau(n+j)\,z_\alpha,$$

where $z_\alpha$ is chosen to have $P\{|t_{n-2}| \le z_\alpha\} = \alpha$. We note that after a few steps, depending on the size of $\rho$, $\rho^j$ will be close to zero, and the band becomes the simpler

$$y_{n+j} \in \hat a + \hat b(n+j) \pm \hat\sigma(1 + x_0^{\rm t}Hx_0)^{1/2}z_\alpha,$$

as with (12.11).
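The simplification $k^{\rm t}A^{-1} = (0,\ldots,0,\rho)$ for the AR(1) correlation structure is easy to verify numerically; a small sketch (the helper is ours):

```python
def ar1_weights(rho, n=6):
    # For A_ij = rho^{|i-j|} and k = (rho^n, ..., rho), solve A x = k;
    # since A is symmetric, x = A^{-1} k equals (k^t A^{-1})^t,
    # which should come out as (0, ..., 0, rho).
    A = [[rho ** abs(i - j) for j in range(n)] for i in range(n)]
    k = [rho ** (n - i) for i in range(n)]
    # plain Gaussian elimination (A is positive definite for |rho| < 1)
    M = [A[i][:] + [k[i]] for i in range(n)]
    for c in range(n):
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x
```

The check is immediate from $(Ax)_i = \rho\,\rho^{|i-n|} = \rho^{n+1-i} = k_i$ for $x = (0,\ldots,0,\rho)$: only the most recent observation matters, which is the Markov property again.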


Example 12.10 Skiing prediction (Example 4.13 continued)

In Example 4.13 we considered the number of skiing days at Bjørnholt, near Oslo, and worried over what the future holds for the inhabitants of Norway's capital. The time series stretches from 1897 to the present, but with a hole from 1938 to 1954; cf. Figure 4.19. To form confidence statements about the future we concentrate on the data from 1955 to 2012, and have found that the four-parameter model described and analysed in Example 4.13, with a linear trend, a variance and an autocorrelation parameter, fits the data well. The task is hence to predict $y_{2012+j}$ for $j = 1, 2, \ldots$. Following the general procedure above we need to work with

$$W(x_0) = \frac{V}{\hat\sigma\,\hat\tau(x_0)} = \frac{Y_0 - x_0^{\rm t}\hat\beta - \hat k^{\rm t}\hat A^{-1}(y - X\hat\beta)}{\hat\sigma\,\hat\tau(x_0)},$$

where

$$\hat\tau(x_0)^2 = 1 - \hat k^{\rm t}\hat A^{-1}\hat k + (x_0 - X^{\rm t}\hat A^{-1}\hat k)^{\rm t}\hat H(x_0 - X^{\rm t}\hat A^{-1}\hat k)$$

and $\hat H = (X^{\rm t}\hat A^{-1}X)^{-1}$. This $W(x_0)$ is found to have a distribution very close to being the same across parameters, as long as $\rho$ is not too extreme (and we saw the confidence distribution for $\rho$ in Figure 4.20, essentially supported on $(0, 0.5)$). Simulation from the $W(x_0)$ distribution then leads to Figure 12.8, displaying the predictive confidence distribution for the number of skiing days at Bjørnholt for the years 2013 to 2022.

[Figure 12.8 here: year (1960 to 2020) on the horizontal axis, skiing days (50 to 150) on the vertical.]

Figure 12.8 Predictive confidence for the number of skiing days at Bjørnholt, for years 2013 to 2022, based on data from 1955 to 2012. Displayed confidence quantiles are 0.05, 0.25, 0.50, 0.75, 0.95.


12.7 Notes on the literature

Statisticians have been predicting forever, but more often than not without full attention to the accuracy of these predictions. Fisher, too, saw his fiducial apparatus as applicable to prediction settings. Bayesian prediction tends to be an easier task, given the assumptions and framework involved, as it is equivalent to integrating out parameters given all data to find the predictive distributions. The famous memoir of Laplace (1774) is arguably of this type; see Stigler (1986b). Smith (1999) gives an overview of the Bayesian ways of prediction, also making certain links to the frequentist side. Barndorff-Nielsen and Cox (1996) gave a careful overview of concepts and large-sample techniques for predictions, along with measures of their inherent uncertainty. Results from that work may be turned into confidence intervals and full confidence distributions for unobserved variables.

People working in time series are of course often occupied with prediction, from meteorology to finance, but methods are often tinkered with and finessed for the prediction part but not for assessing their uncertainty. There is a wide literature related to forecasting of econometric parameters; see, for example, Hansen (2008) and Cheng and Hansen (2014), also exploiting model averaging ideas for prediction purposes. The confidence distribution view should have some supplementary relevance for such methods. Similarly, a subset of scientific publications concerning climate change provides not merely an estimated curve for, say, the future mean temperature of the globe, but also tentative confidence intervals for such estimates, based on certain modelling assumptions. In a few cases researchers supplement such studies with a full predictive distribution; see, for example, Skeie et al. (2014).

Relatively little appears to have been done in the literature regarding accurate confidence in prediction contexts. Fisher (1935) used the prediction pivot in the normal case. Lawless and Fredette (2005) use methods partly similar to our own, looking for useful pivots that may then be inverted to predictive confidence. Wang et al. (2012) take this approach one step further, via fiducial analysis, extending the catalogue of situations where exact prediction confidence distributions are available. Most often such exact solutions are not available, calling for approximation techniques, perhaps from large-sample methodology; cf. again Barndorff-Nielsen and Cox (1996). Methods for automatic fine-tuning include prepivoting and perhaps iterations thereof; cf. Beran (1988b, 1990), and also methods surveyed and developed in our Chapter 7.

Exercises

12.1 Normal prediction: Let $Y_1,\ldots,Y_n$ and $Y_0$ be i.i.d. ${\rm N}(\mu,\sigma^2)$, with standard estimates $\hat\mu$ and $\hat\sigma$ as defined in Example 12.2.

(a) Show that

$$(Y_0 - \hat\mu)/\hat\sigma \sim (1 + 1/n)^{1/2}\,t_{n-1}.$$

(b) The simple plug-in prediction interval method corresponds to using $Y_0 \in \hat\mu \pm \hat\sigma z_\alpha$ as the level $\alpha$ interval, where $P\{|{\rm N}(0,1)| \le z_\alpha\} = \alpha$. Determine the exact coverage of this interval, and compare it to what is implied by the exact method associated with (a), for small and moderate $n$.
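A quick simulation check of (b) (the function name and defaults are ours):

```python
import random

def plugin_coverage(n=10, z=1.96, nsim=20000, seed=2):
    # Monte Carlo coverage of the naive plug-in interval mu^ +/- sigma^ z
    # for a new observation Y0, with Y_1..Y_n, Y0 i.i.d. N(0,1)
    rng = random.Random(seed)
    hits = 0
    for _ in range(nsim):
        ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
        m = sum(ys) / n
        s = (sum((v - m) ** 2 for v in ys) / (n - 1)) ** 0.5
        y0 = rng.gauss(0.0, 1.0)
        hits += abs(y0 - m) <= z * s
    return hits / nsim
```

For $n = 10$ the coverage comes out near 0.90 rather than the nominal 0.95, matching the exact answer $P\{|t_{n-1}| \le z_\alpha/(1+1/n)^{1/2}\}$ from (a); the exact interval $\hat\mu \pm \hat\sigma(1+1/n)^{1/2}t_{0.975, n-1}$ restores the nominal level.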


12.2 The next data point: For an ordered sample $U_{(1)} < \cdots < U_{(n)}$ from the uniform distribution, show that $U_{(i)}$ has density

$$g_i(u) = \frac{n!}{(i-1)!\,(n-i)!}\,u^{i-1}(1-u)^{n-i} \quad\text{for } u \in (0,1),$$

which is also the Beta distribution with parameters $(i, n-i+1)$. Deduce that its mean is $i/(n+1)$, which was needed in connection with the nonparametric prediction confidence distribution of Section 12.2.

12.3 Risk function for predicting the next data point: Here we look at the notion of a risk function for prediction distributions.

(a) The risk with quadratic loss of a confidence distribution $C(y_0; Y)$ for $Y_0$, based on data $Y = (Y_1,\ldots,Y_n)$, may be defined as

$$R = {\rm E}\int (y - Y_0)^2\,{\rm d}C(y; Y).$$

Let $C_t(y_0) = F_{n-1}((y_0 - \hat\mu)/\hat\sigma)$ be the $t$-confidence distribution, with $F_{n-1}$ being the cumulative $t$-distribution at ${\rm df} = n-1$ and with $\hat\mu$ and $\hat\sigma^2$ the sample mean and variance. Show that the risk is $R_t = \sigma^2(2n-3)(n-1)/\{n(n-3)\}$, regardless of the data being normally distributed or not.

(b) For the ordered sample $Y_{(1)},\ldots,Y_{(n)}$, $P\{Y_{(k-1)} < Y_0 \le Y_{(k)}\} = 1/(n+1)$. The nonparametric confidence distribution might thus be taken as a discrete distribution over the sample with weights $1/(n+1)$, with the remaining $1/(n+1)$ being distributed over the tails in some way. To simplify matters let $r_n$ be the risk contribution from the tails. Show that this makes the risk equal to

$$R_{\rm np} = {\rm E}\int_{Y_{(1)}}^{Y_{(n)}} (y - Y_0)^2\,{\rm d}C(y;Y) + r_n = 2\sigma^2\frac{n}{n+1} + r_n.$$

Compare $R_t$ to $2\sigma^2 n/(n+1)$. What is the take-home message?

(c) No confidence distribution for the next observation can be better than the distribution of $Y_0$ itself, that is, $F$. Show that its quadratic risk is $R_F = 2\sigma^2$. The empirical distribution $\hat F$ is an attractive candidate as confidence distribution for $Y_0$. Show that $R_{\hat F} = 2\sigma^2$ under quadratic loss. Is this a paradox?

12.4 How tall is Nils? Assume that the heights of Norwegian men above the age of twenty follow the normal distribution ${\rm N}(\xi, \sigma^2)$ with $\xi = 180$ cm and $\sigma = 9$ cm.

(a) If you have not yet seen or bothered to notice this particular aspect of Nils's appearance, what is your point estimate of his height, and what is your 95% prediction interval? Write these down and give your full predictive confidence distribution in a plot.

(b) Assume now that you learn that his four brothers are actually 195 cm, 207 cm, 196 cm, 200 cm tall, and furthermore that the correlation between brothers' heights in the population of Norwegian men is equal to $\rho = 0.80$. Use this information about his four brothers to revise your initial point estimate of his height, and provide the full predictive confidence distribution inside the same plot as above. Is Nils a statistical outlier in his family?

(c) Suppose that Nils has $n$ brothers and that you learn their heights. Give formulae for the updated confidence distribution in terms of normal parameters $\xi_n$ and $\sigma_n$. Use this to clarify the following statistical point: even if you get to know all facts concerning 99 brothers, there should be a limit to your confidence in what you may infer about Nils.
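For (b) and (c), the conditional multinormal gives closed-form updates; the formulas below are our own derivation under the equicorrelation assumption (correlation $\rho$ between every pair of brothers, including Nils and each brother), using $A = (1-\rho)I + \rho J$ and $k = \rho 1$, for which $A^{-1}1 = 1/(1+(n-1)\rho)\cdot 1$:

```python
def nils_update(heights, xi=180.0, sigma=9.0, rho=0.80):
    # Conditional distribution of Nils's height given his n brothers':
    #   mean  xi + n*rho*(ybar - xi) / (1 + (n-1)*rho)
    #   var   sigma^2 * (1 - n*rho^2 / (1 + (n-1)*rho))
    n = len(heights)
    ybar = sum(heights) / n
    mean = xi + n * rho * (ybar - xi) / (1 + (n - 1) * rho)
    var = sigma ** 2 * (1 - n * rho ** 2 / (1 + (n - 1) * rho))
    return mean, var

mean, var = nils_update([195, 207, 196, 200])
```

With the four brothers this centres the prediction near 198 cm with standard deviation about 4.5 cm; as $n\to\infty$ the variance decreases only to $\sigma^2(1-\rho) = 16.2$, which is the "limit to your confidence" of part (c).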


12.5 A technical lemma: For $W \sim {\rm N}_p(0,\Sigma)$ and $B$ a symmetric invertible matrix with $\Lambda^{-1} = \Sigma^{-1} + B^{-1}$ positive definite, prove identity (12.9). Start from ${\rm E}\exp(a^{\rm t}W - \tfrac12 W^{\rm t}B^{-1}W)$ being equal to

$$(2\pi)^{-p/2}|\Sigma|^{-1/2}\int \exp(a^{\rm t}w - \tfrac12 w^{\rm t}B^{-1}w - \tfrac12 w^{\rm t}\Sigma^{-1}w)\,{\rm d}w$$

and rearrange in terms of $\Lambda$, to reach $|I + \Sigma B^{-1}|^{-1/2}\exp(\tfrac12 a^{\rm t}\Lambda a)$.

12.6 Predicting exponentials: Let Y1, . . . , Yn be independent observations from the exponential density θ exp(−θy). The aim is to construct a confidence distribution for the next observation, say Y0.

(a) Show that the maximum likelihood estimator θ̂ is equal to the inverse sample mean 1/Ȳ and that it may be represented as θ/Vn, where Vn ∼ χ²_{2n}/(2n).

(b) Show next that θ̂Y0 ∼ F_{2,2n}, the Fisher distribution with degrees of freedom (2, 2n). Conclude that Cpred(y0) = F_{2,2n}(θ̂obs y0) is the natural confidence distribution for Y0.

(c) Assume that θ is given the prior distribution Gamma(a, b). Find the posterior distribution, and use (12.6) to find the Bayesian predictive density. Show that when a and b tend to zero, then the Bayesian predictive becomes identical to the Cpred(y0) found previously.

(d) Generalise the aforementioned setting to the case of observations coming from the Gamma(a, θ) distribution, with a known but θ unknown. Find a predictive confidence distribution for Y0 based on ∑_{i=1}^n Yi. What if a is unknown too?
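A small numerical sketch of part (b), assuming SciPy is available (variable names are ours): with θ̂ = 1/Ȳ, the predictive confidence distribution is Cpred(y0) = F_{2,2n}(θ̂ y0), and prediction intervals follow by inverting the F quantiles.

```python
import numpy as np
from scipy.stats import f as fisher_f

rng = np.random.default_rng(7)
theta, n = 2.0, 25
y = rng.exponential(scale=1.0/theta, size=n)   # observed sample
theta_hat = 1.0 / y.mean()                     # MLE: inverse sample mean

def c_pred(y0):
    """Predictive confidence distribution for the next observation Y0."""
    return fisher_f.cdf(theta_hat * y0, 2, 2*n)

# A 90% prediction interval for Y0, read off via the F quantiles:
lo = fisher_f.ppf(0.05, 2, 2*n) / theta_hat
hi = fisher_f.ppf(0.95, 2, 2*n) / theta_hat
print(lo, hi, c_pred(hi) - c_pred(lo))  # the last number is 0.90 by construction
```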

12.7 The next uniform: Suppose a sequence of independent observations Y1, Y2, . . . come from the uniform distribution on [0, θ], with θ unknown.

(a) Show that W = Yn+1/Y(n) is a pivot, where Y(n) is the largest of the first n observations.

(b) Deduce that Cpred(yn+1) = Kn(yn+1/y(n)) is a predictive confidence distribution for the next point, with Kn the cumulative distribution for W. Show that

Kn(w) = {n/(n + 1)} w for w ≤ 1, and Kn(w) = 1 − {1/(n + 1)} (1/w)^n for w ≥ 1.

Give an interpretation of Kn in terms of a mixture, and find its density kn(w).

(c) Display the confidence curve for the next point, when the largest of so far n = 19 observations is 4.567. Compute and display also f̂(y) = ∫ f(y, θ) dCn(θ), where Cn is the confidence distribution for θ based on Y(n), and compare to the similar construction from the Bayesian point of view.
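A short numerical sketch of parts (b) and (c) (ours; cc = |2C − 1| is the usual confidence-curve convention):

```python
import numpy as np

def K(w, n):
    """Cumulative distribution of W = Y_{n+1}/Y_(n) from Exercise 12.7(b)."""
    w = np.asarray(w, dtype=float)
    return np.where(w <= 1.0, n/(n + 1.0)*w, 1.0 - w**(-float(n))/(n + 1.0))

n, y_max = 19, 4.567           # n observations so far; the largest is 4.567

def c_pred(y_next):
    """Predictive confidence distribution C_pred(y) = K_n(y/y_(n))."""
    return K(np.asarray(y_next)/y_max, n)

def conf_curve(y_next):
    """Confidence curve cc(y) = |2 C_pred(y) - 1|."""
    return np.abs(2.0*c_pred(y_next) - 1.0)

median = y_max*(n + 1)/(2*n)   # solves K_n(w) = 1/2 on the w <= 1 branch
print(median, float(c_pred(median)))
```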

12.8 Long lives ahead: Consider Example 12.8, concerned with the Oeppen and Vaupel (2002) data. Construct the predictive density for the expected lifelength of women, in the country with the highest such number, in any given year ahead of us. Access data via the Human Mortality Database at www.mortality.org to construct predictive densities also for the expected lifelength in the record-holding country for men.

12.9 Predicting unemployment: For Example 12.9 we used a third-order polynomial as a trend function for predicting unemployment in the years ahead. In our analysis we modelled the residuals as independent and normal. Extend this model to include an autocorrelation parameter ρ, and find a confidence distribution for this parameter (it is estimated at 0.139 and is not significant). Then try out one or two other parametric trend models, for example, a fourth-order polynomial, and compare results, for the predictive density g(y25 | x25).


12.10 Predicting a Weibull: Suppose Y1, . . . , Yn are independent observations from the Weibull distribution with cumulative distribution function 1 − exp(−y^θ) for y positive. Develop an approximate confidence distribution for θ, and simulate data for small and moderate n to display this confidence distribution. Then use this to compute and compare the results of (12.7) and (12.8). Generalise your results to the case of two unknown parameters, with Weibull distribution 1 − exp{−(y/a)^b}.
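For the one-parameter case, one concrete route (our sketch, one of several possibilities, not necessarily the construction the exercise intends) exploits that the shape estimator is power-equivariant, so that V = θ̂/θ is a pivot whose distribution depends on n only and can be simulated once:

```python
import numpy as np
from scipy.optimize import brentq

def shape_mle(y):
    """MLE of theta in F(y) = 1 - exp(-y^theta); root of the score equation."""
    logs = np.log(y)
    score = lambda th: len(y)/th + logs.sum() - (y**th * logs).sum()
    return brentq(score, 1e-3, 50.0)

rng = np.random.default_rng(3)
n, B = 30, 2000
# Under theta = 1 the data are standard exponentials; simulate V = theta_hat.
piv = np.array([shape_mle(rng.exponential(size=n)) for _ in range(B)])

y_obs = rng.weibull(2.5, size=n)       # illustrative data, true theta = 2.5
th_obs = shape_mle(y_obs)

def cd(theta):
    """Approximate confidence distribution C(theta) = P(V >= th_obs/theta)."""
    return np.mean(piv >= th_obs/theta)

print(th_obs, cd(2.5))
```

The simulated pivot table `piv` is reusable for any data set of the same size n, which is the practical attraction of this construction.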


13

Meta-analysis and combination of information

The impact of a scientific report is a function both of the presentation and the originality and quality of the results. To achieve maximal impact, new empirical results should be reported in a form that enables readers to combine them effectively with other relevant data, without burdening readers with having to redo all the analyses behind the new results. Meta-analysis is a broad term used for methods analysing a set of similar or related experiments jointly, for purposes of general comparison, exhibiting grander structure, spotting outliers, examining relevant factors for such, and so on. As such the vital concept is that of combining different sources of information, often enough based not on the full sets of raw data but on suitable summary statistics for each experiment. This chapter examines some natural and effective approaches to such problems, involving construction of confidence distributions for the more relevant parameters. Illustrations include meta-analysis of certain death-after-surgery rates for British hospitals. Yet other applications involving the meta-analysis methods developed here are offered in Chapter 14.

13.1 Introduction

Meta-analysis, the art and science of combining results from a set of independent studies, is big business in medicine, and is also an important tool in psychology, the social sciences, ecology, physics and other natural sciences. The yearly number of medical meta-analyses has increased exponentially in recent years, and passed 2000 by 2005 (Sutton and Higgins, 2008).

The first issue to resolve in a meta-analysis is to define the parameter, possibly a vector, to study. This might not be as easy as it sounds, because related studies might vary with respect to definitions, and they will also usually vary in their experimental or observational setup and statistical methods. Along with the definition, the population of related potential studies is delineated. Only in rather rare cases is it clear that the value of the parameter is exactly the same in related studies. Then a fixed effects model may be appropriate. When the parameter varies across related studies, or when there are unobserved heterogeneities across studies, the statistical model to be used in the meta-analysis should include random effects.

The next problem is to determine the criterion and algorithm for selecting studies for the meta-analysis. Ideally the collection should be a random sample from the population of related potential studies. This ideal is usually not achievable, and one must settle for a collection that can be argued to be representative of the population. Potential bias in the selection should then be discussed. One potential selection bias is that 'significant' results are more likely to be published than studies leading to insignificant results.


There are very many meta-analyses of randomised controlled clinical trials. The leading medical journals have formed a consortium guiding authors on what to report from their individual studies (Moher et al., 2010). When these guidelines are followed it is rather clear what population a study belongs to. Researchers trying out a new treatment might, however, obtain inconclusive or even unwanted results, and be tempted not to publish. This would lead to publication bias, and a sample of published studies will be subject to selection bias. Clinical trials constitute a rather well organised field, with a guide to best practice spearheaded by the World Health Organization and a registry for clinical trials covering most of the world. Clinical trials to be carried out must be registered ahead of time, and details with respect to content, quality and validity, accessibility, technical capacity and administration/governance must be given (see www.who.int/ictrp/network/bpg/en/index.html). Meta-analysis of clinical trials should thus not be seriously affected by selection bias.

The problem of selection bias in meta-analyses might be more severe in other fields of science. In traffic safety studies, for example, there has been reluctance to publish nonsignificant results. Elvik (2011) discusses publication selection bias and also other potential biases in meta-analyses of bicycle helmet efficacy. The funnel plot is a diagnostic plot to look for selection bias associated with the 'file drawer problem' of less significant results being less likely to be published and remaining in the researcher's file drawer. The larger the study is, the more likely it is to be published. The funnel plot is simply a plot of estimated effect size against study size (via inverse standard error, or similar) for the collection of studies. When there is no selection bias related to study size the plot would then look like a funnel turned around, with the wide end down. If, however, the pyramid-like scatter leans towards the y-axis there is cause for concern.

There is a sizable methodological literature on meta-analysis; see Borenstein et al. (2009), Rothstein et al. (2005) and Sutton and Higgins (2008). Nelson and Kennedy (2009) provide a guide with ten main items for meta-analyses in economics. One important issue here is that of correlation in effect size estimates both within and between studies. We have not much to add with respect to the basic problems of definition, population delineation, study selection and potential selection biases. Our aim here is mainly to demonstrate the usefulness of confidence inference in basic meta-analysis of a selected set of studies.

The archetype meta-analysis is as follows. There are k independent studies of the same phenomenon, say the cure rate θ of a treatment. In each study an effect size estimate θ̂ is obtained, together with information on estimation precision, often termed study size. When the true effect size varies across studies, it is regarded as a stochastic variable drawn from, perhaps, the normal distribution. The aim is now to integrate the k estimates with accompanying confidence intervals or other measures of uncertainty to obtain a combined estimate of θ, or, if θ is not the same across studies, estimates of its distribution. When the random effect is normally distributed its mean over the population, θ0, and its standard deviation, say τ, are the parameters of interest. In the fixed effects model τ = 0 and the common true effect size θ0 is the single parameter of interest.

In Example 13.1 below the treatment is a certain type of open heart surgery for children younger than one year of age. We consider the results from k = 12 British hospitals from 1991 to 1995 discussed in Marshall and Spiegelhalter (2007). The effect size is here the probability of dying as a result of the surgery. The estimated effect size is y/m, where y is


the number of deaths out of the m Bernoulli trials. In addition to the estimated effect size, the sample size of the study, m, is also available.

For the British hospital study the first question might be whether open heart surgery of infants is to be carried out according to a specific protocol, and whether the population should be British hospitals using this protocol, or perhaps hospitals in developed countries operating this way. The population might also be British hospitals with best practice in open heart surgery of infants. The study was actually required by an inquiry under the British National Health Service to compare the performance of the Bristol hospital to other specialist centres for this type of operation (Spiegelhalter et al., 2002). Whether the k studies form a random sample from the defined population, called 'specialist centres', or perhaps a representative or complete sample, should then be argued. Despite the fact that the data were collected and analysed to compare the Bristol hospital with other specialist centres, we use the data to illustrate basic meta-analysis. For statistical modelling we thus view the 12 hospitals as a random sample from a larger population of specialist centres, and the purpose is to estimate the mean and also the variance in death rate after operation in this population, but with outlying hospitals excluded. To be outlying is here understood as being extreme relative to the statistical model assumed.

Meta-analyses are sometimes carried out for a broader set of purposes than to estimate the central tendency in potential studies of a particular treatment effect. The British hospital study could, for example, have been carried out as a meta-analysis primarily to identify outliers, as it in a sense was. Outliers or heterogeneity must be judged in the context of a model. In the fixed effects model, outliers are studies in which the true effect size differs from the median value in the population. In the random effects model, outliers are determined relative to the size of the study and the variation in the random effect.

The purpose of a meta-analysis might also be to estimate a multivariate effect of the treatment, the treatment effect as a function of covariates specific to study or individuals, or perhaps the structural form of a graded treatment effect. These studies are often called meta-regressions, which are predominant in economics (Nelson and Kennedy, 2009). One might also analyse a collection of published results in more complex models involving random effects, possibly in a hierarchy. Statistical modelling is actually required in meta-analyses as it is in any statistical analysis, and the results from meta-analyses depend on the model and method, as well as on the data.

In addition to these aspects of meta-analysis there are many instances in which statistical information from published sources is combined with new data, or other published results, to yield an integrated analysis. For example, when Brandon and Wade (2006) assessed the status of the bowhead whale population off of Alaska, they integrated published results on mortality, fecundity, historic catches and abundance estimates. This integrative analysis was done in a Bayesian way, and would not be regarded as a typical case of meta-analysis, but as in meta-analysis, the more informative those published estimates are, the better the analysis.

We shall not look at these broader issues here, but concentrate on estimating central tendency and variability in a univariate treatment effect, and on the investigation of homogeneity and identification of outliers in simple random or fixed effects models.


13.2 Aspects of scientific reporting

Scientific papers are a genre of their own, to some extent depending on the field. Jerzy Neyman used to say: "First, say what you are going to say, then say it – and finally say what you just said." There are numerous guides for authors in specific journals. Guides are also issued by universities and posted on the Web. One of them starts as follows: "Write with precision, clarity and economy. Every sentence should convey the exact truth as simply as possible" (Instructions to Authors, Ecology, 1964). In this section we do not add to this general literature, but only make some remarks regarding papers that might be included in future meta-analyses or otherwise provide statistical information that might be integrated with other data.

To allow efficient integration with other data a likelihood function is needed. Some journals ask to have the full set of data made available to readers. Together with the model, the likelihood function is then available, perhaps by some effort. Should the full data always be reported? In some contexts, yes, but in other contexts we think no. For one thing, the model and method might be influenced by judgement and intuition based on an understanding of the data and their generation that is hard to convey in the paper. A reader equipped with the data but not with this intimate understanding might be led to less appropriate analyses. Many authors are reluctant to give away their data, not only for the wrong reasons. Whether the full data are made available or not, sufficient information should be supplied to allow a likelihood, reduced to the dimension of the interest parameter, to be recovered. This confidence likelihood should correspond to the statistical inference drawn in the study, say in the format of a confidence distribution. To prepare for this, sufficient information must be supplied.

The CONSORT 2010 statement on improving the reporting of randomised controlled clinical trials (Moher et al., 2010) is introduced as follows: "To assess a trial accurately, readers of a published report need complete, clear, and transparent information on its methodology and findings. Unfortunately, attempted assessments frequently fail because authors of many trial reports neglect to provide lucid and complete descriptions of that critical information." Table 1 of the statement is a checklist of information to include when reporting a randomised trial. The key words are Introduction: Background and objectives; Methods: Trial design, Participants, Interventions, Outcomes, Sample size, Randomisation, Blinding, Statistical methods; Results: Participant flow (a diagram is strongly recommended), Recruitment, Baseline data, Numbers analysed, Outcomes and estimation, Ancillary analyses, Harms; Discussion: Limitations, Generalisability, Interpretation; Other information: Registration, Protocol, Funding. With minor modification this checklist should be useful for most papers reporting statistical information.

The statistical information published will be further associated with information on background, data and methods. Under 'Outcomes and estimation' the statement recommends: "For each primary and secondary outcome, results for each group, and the estimated effect size and its precision (such as 95% confidence interval) [ought to be given]." For clinical trials with binary outcome this, together with sample sizes, will usually be sufficient to re-create the likelihood function for the effect size. Parmar et al. (1998) discuss how to extract the necessary information in case of survival endpoints. But such information should perhaps be included in the papers in the first place. This is really the point we would like to make.


Estimates of important parameters ought to be reported in the form of confidence distributions or confidence curves. Often the confidence distribution will be approximately normal, and its location and scale are sufficient. If also the confidence likelihood by normal conversion has a χ² null distribution (cf. Section 10.2), this should be said in the paper. It does not take long. In more complicated situations the transformation from confidence distribution to confidence likelihood would be needed, perhaps as a parametric approximation, and also the null distribution of the confidence likelihood. There might be nuisance parameters in the model. When these influence the null distribution, say of the profile deviance on which the confidence curve is based, their influence should be kept low, say by use of the magic formula or deviance correction methods of Sections 7.4–7.7. This should be reported together with the details of the choices made; cf. the discussion of the relationship between likelihood and confidence in Chapter 10.

As an example take Schweder et al. (2010), summarised in Section 14.3, who estimated the abundance of bowhead whales off of Alaska from photos of surfacing whales obtained in 10 surveys, in combination with other relevant data. Their model and simulation method are fairly involved, because whales are only weakly marked by colouring or scars and the identification algorithm misses many true matchings. The key results were confidence curves for abundance, population growth rate and individual mortality. These are, however, not approximated by parametric distributions, nor are their accompanying confidence likelihoods with null distributions given. The plotted confidence curves are good for understanding the estimates and their associated uncertainty, but they are not directly usable for full integration with other data, whether in a meta-analysis or other integrative work such as population assessment (Brandon and Wade, 2006), at least not by likelihood methods. With a little more work and only an extra paragraph in the paper, the potential for full use of the results in future studies would have been substantially improved.

13.3 Confidence distributions in basic meta-analysis

In the basic meta-analysis model there is a focus parameter θ0 that has been estimated in a number of studies. The various studies might differ in methodology and there might even be a degree of variability in the focus parameter across studies. The meta-analysis is based on k studies assumed to be a random sample of potential studies. The aim is to estimate θ0, understood as the mean parameter in the population. Study j in the collection yields the estimate yj, which at least approximately may be taken as an observation from

Yj = θ̂j ∼ N(θj, σj²)  for j = 1, . . . , k,   (13.1)

for the value of θ specific to the study, θj, and where the σj is well estimated and may be taken as known. In various applications the σj have emerged as, say, κ̂j/√mj, with the mj being the sample sizes.

The fully homogeneous fixed effects case now corresponds to θ1, . . . , θk being equal, and this is indeed often a relevant hypothesis to test. By the optimality theory developed in Chapter 5, see, for example, Theorem 5.10, the uniformly optimal confidence distribution based on the collected data can be seen to be

C(θ) = Φ(A^{1/2}(θ − θ̂0)),   (13.2)


Table 13.1. Numbers of open-heart operations and deaths for children younger than one year of age, as recorded by Hospital Episode Statistics, for 12 British hospitals between 1991 and 1995; cf. Example 13.1

Number  Hospital           m    z   Raw estimate
 1      Bristol           143   41  0.287
 2      Leicester         187   25  0.134
 3      Leeds             323   24  0.074
 4      Oxford            122   23  0.189
 5      Guys              164   25  0.152
 6      Liverpool         405   42  0.104
 7      Southampton       239   24  0.100
 8      Great Ormond St   482   53  0.110
 9      Newcastle         195   26  0.133
10      Harefield         177   25  0.141
11      Birmingham        581   58  0.100
12      Brompton          301   31  0.103

where A = ∑_{j=1}^k 1/σj², involving the optimally weighted average

θ̂0 = {∑_{j=1}^k yj/σj²} / {∑_{j=1}^k 1/σj²}.   (13.3)

In fact, it is easily seen that θ̂0 ∼ N(θ0, 1/A).

In many cases, however, there is room for heterogeneity, with some unobserved differences between the k underlying focus parameters, and in the prototype setup they are taken as i.i.d. N(θ0, τ²). The interests and aims of meta-analysis in such a setting might typically include but not necessarily be limited to

1. Estimating each θj, typically with a confidence interval or a standard error, and displaying this graphically, possibly by a forest plot such as Figure 13.1
2. Comparison of the k parameters, for example, by testing their possible equality or assessing their spread
3. Estimation and inference for a common background value θ0
4. Estimating the distribution of the focus parameter over studies, and identifying possible outlying studies in the collection relative to the distribution of the parameter
5. Setting up relevant predictive distributions for outcomes of forthcoming experiments
6. 'Summarising the summaries' for the focus parameter and its hyperparameters, so that the result of the meta-analysis will be directly useful in future meta-analyses
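The fixed effects combination (13.2)–(13.3) is a one-liner to implement. A sketch in Python (NumPy/SciPy assumed; variable names are ours):

```python
import numpy as np
from scipy.stats import norm

def fixed_effects_cd(y, sigma):
    """Optimal combination under the fixed effects model:
    returns theta0_hat of (13.3), A, and the CD C(theta) of (13.2)."""
    y, sigma = np.asarray(y, float), np.asarray(sigma, float)
    w = 1.0 / sigma**2          # inverse-variance weights
    A = w.sum()
    theta0_hat = (w * y).sum() / A
    cd = lambda theta: norm.cdf(np.sqrt(A) * (theta - theta0_hat))
    return theta0_hat, A, cd

# Two toy studies with equal precision: the combined estimate is the mean.
th, A, cd = fixed_effects_cd([0.10, 0.14], [0.02, 0.02])
print(th, cd(th))   # 0.12 and 0.5
```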

For an illustration, consider Table 13.1, concerning death-after-operation rates for a certain type of complex open-heart operations on children younger than one year of age, from 12 different British hospitals between 1991 and 1995; cf. our earlier comments in Section 13.1 and Example 13.1. Marshall and Spiegelhalter (2007) considered these data, primarily to investigate whether the Bristol hospital showed substandard performance relative to other British specialist centres. We will thus provisionally assume that the k = 12 hospitals may be regarded as a random sample of British specialist centres.


Figure 13.1 Forest plot for death-after-operation rates θj for the 12 British hospitals listed in Table 13.1. A point estimate and a 95% confidence interval are shown for each hospital. The sizes of the central filled squares are proportional to sample size mj.

To fit the preceding framework, the observed rates are in the form of yj = θ̂j = zj/mj, with the zj stemming from mechanisms that we perhaps somewhat simplistically model as Bin(mj, θj) situations. This fits the preceding with σj = {θ̂j(1 − θ̂j)/mj}^{1/2}, as the sample sizes are large enough to guarantee both a good approximation to normality as well as accurately estimated σj.

With the preceding model formulation for the θj, the Yj are independent with distributions N(θ0, τ² + σj²), giving rise to the log-likelihood function

ℓk(θ0, τ) = ∑_{j=1}^k { −½ log(τ² + σj²) − ½ (yj − θ0)²/(τ² + σj²) }.

For fixed τ, this is maximised for

θ̂0(τ) = {∑_{j=1}^k yj/(τ² + σj²)} / {∑_{j=1}^k 1/(τ² + σj²)},   (13.4)

leading to the profile log-likelihood function

ℓk,prof(τ) = −½ ∑_{j=1}^k [ log(τ² + σj²) + {yj − θ̂0(τ)}²/(τ² + σj²) ].   (13.5)
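Formulas (13.4)–(13.5) translate directly into code. A sketch (ours, with toy effect sizes; a grid search over τ stands in for a proper optimiser):

```python
import numpy as np

def theta0_hat(tau, y, s2):
    """Weighted estimate (13.4) for a given tau."""
    w = 1.0 / (tau**2 + s2)
    return (w * y).sum() / w.sum()

def ell_prof(tau, y, s2):
    """Profile log-likelihood (13.5) in tau."""
    v = tau**2 + s2
    r = y - theta0_hat(tau, y, s2)
    return -0.5 * (np.log(v) + r**2 / v).sum()

y = np.array([0.134, 0.074, 0.189, 0.152, 0.104])    # toy effect sizes
s2 = np.array([0.027, 0.015, 0.036, 0.028, 0.015])**2  # toy variances

taus = np.linspace(0.0, 0.2, 2001)
ml = max(taus, key=lambda t: ell_prof(t, y, s2))     # grid maximiser tau_hat
print(ml, theta0_hat(ml, y, s2))
```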

When the observed yj have a tight spread, the profile log-likelihood is maximised for τ = 0 in (13.5), with the consequent estimate θ̂0(0) = θ̂0 of (13.3) for θ0. The precise condition


for this to happen is that

½ ∑_{j=1}^k (1/σj²) {(yj − θ̂0)²/σj² − 1} < 0.   (13.6)

In the opposite case, the maximum likelihood estimate τ̂ is positive. The expression on the left-hand side of (13.6) is the derivative of the profile log-likelihood function, now evaluated in terms of τ² rather than τ, at zero; see Exercise 13.1. The resulting maximum likelihood estimator of θ0 is θ̂0 = θ̂0(τ̂). The profile log-likelihood for θ0 is

ℓk,prof(θ0) = ∑_{j=1}^k [ −½ log{τ̂(θ0)² + σj²} − ½ (yj − θ0)²/{τ̂(θ0)² + σj²} ],   (13.7)

with τ̂(θ0) = argmaxτ ℓk(θ0, τ).

Before obtaining confidence distributions for the key parameters θ0 and τ, the collection of studies might need pruning for outliers. We suggest identifying possible outliers relative to the normal distribution of θ by the following stepwise algorithm.

1. Start with the index set I = {1, . . . , k}, and choose an appropriate level ε.
2. For each j ∈ I, estimate τ by maximum likelihood or by the median estimator (13.10) discussed in the text that follows, and also θ̂0 of (13.4) for τ̂, from the indexed studies excluding study j. Calculate the residual

ej = (yj − θ̂0)/(σj² + τ̂²)^{1/2}.

3. Let k be the number of points in the current index set. If maxI |ej| > zk, for zk the 1 − ε quantile in the distribution of the maximum of k absolute values drawn from the normal distribution, {Φ(zk) − Φ(−zk)}^k = 1 − ε, remove the study with the maximal absolute residual |ej| from the index set.
4. Repeat steps 2 and 3 until no residual exceeds the critical level zk. The obtained index set is regarded as the set of homogeneous studies under the normal-normal model, and the excluded studies are regarded as outlying.
5. Compute a confidence distribution both for θ0 and for τ from the retained studies.

The residuals are at each step approximately independent and standard normally distributed under homogeneity. The probability of erroneously excluding one or more studies from a homogeneous collection is thus approximately ε.
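The stepwise screening above can be sketched in a few lines (our implementation choices: τ estimated by grid-maximised profile likelihood rather than the median estimator, and ε = 0.05), here run on the Table 13.1 counts:

```python
import numpy as np
from scipy.stats import norm

def tau_ml(y, s2, grid=np.linspace(0.0, 1.0, 1001)):
    """Grid maximiser of the profile log-likelihood (13.5)."""
    def ell(t):
        v = t**2 + s2
        th = (y/v).sum() / (1/v).sum()
        return -0.5*(np.log(v) + (y - th)**2/v).sum()
    return max(grid, key=ell)

def screen(y, s2, eps=0.05):
    """Stepwise removal of studies with extreme leave-one-out residuals."""
    idx = list(range(len(y)))
    while len(idx) > 2:
        e = np.empty(len(idx))
        for pos, j in enumerate(idx):
            rest = [i for i in idx if i != j]
            t = tau_ml(y[rest], s2[rest])
            v = t**2 + s2[rest]
            th = (y[rest]/v).sum() / (1/v).sum()
            e[pos] = (y[j] - th) / np.sqrt(s2[j] + t**2)
        k = len(idx)
        z_k = norm.ppf(0.5*(1 + (1 - eps)**(1/k)))  # {2 Phi(z)-1}^k = 1-eps
        if np.abs(e).max() > z_k:
            idx.pop(int(np.abs(e).argmax()))
        else:
            break
    return idx

# Table 13.1 data: deaths z out of m operations; y = z/m.
m = np.array([143, 187, 323, 122, 164, 405, 239, 482, 195, 177, 581, 301])
z = np.array([ 41,  25,  24,  23,  25,  42,  24,  53,  26,  25,  58,  31])
y = z/m
s2 = y*(1 - y)/m
print(screen(y, s2))   # Bristol (index 0) should be flagged as outlying
```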

There are various methods for obtaining approximate confidence distributions for θ0. For τ, an exact confidence distribution is available. We consider this first.

The profile log-likelihood (13.5) is based on the construct

Q(τ, y) = ∑_{j=1}^k {yj − θ̂0(τ)}²/(τ² + σj²),   (13.8)

which in the present normal-normal model is χ²_{k−1} distributed and hence a pivot. In addition, Q is decreasing in τ for all possible data; see Exercise 13.2. Hence

C(τ) = 1 − Γ_{k−1}(Q(τ, y))   (13.9)


is an exactly valid confidence distribution for τ. Its median

τ̂ = Cτ⁻¹(½)   (13.10)

is a median-unbiased estimator. We note that other confidence constructions are available for τ. As long as R = R(y) is a function of the data with a distribution stochastically increasing with τ, such as the variance R1 = ∑_{j=1}^k (yj − ȳ)² or the range R2 = max_{j≤k} yj − min_{j≤k} yj, there is a confidence distribution CR(τ) = Pτ{R(Y) ≥ R(yobs)}, which may be computed by simulating R(y) for each value of τ. Whereas we do not believe the method based on Q(τ, y) here is power optimal, in the strict sense of Section 5.3, we have found it to perform well.
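The construction (13.8)–(13.9) is easy to program; a sketch (ours, with toy numbers loosely mimicking study-level rates):

```python
import numpy as np
from scipy.stats import chi2

def cd_tau(tau, y, s2):
    """Exact confidence distribution (13.9): C(tau) = 1 - Gamma_{k-1}(Q)."""
    v = tau**2 + s2
    th = (y/v).sum() / (1/v).sum()      # theta0_hat(tau) of (13.4)
    Q = ((y - th)**2 / v).sum()         # the pivot (13.8)
    return 1.0 - chi2.cdf(Q, df=len(y) - 1)

y = np.array([0.134, 0.074, 0.189, 0.152, 0.104, 0.100])   # toy estimates
s2 = (np.array([25, 22, 28, 26, 20, 21]) * 1e-3)**2        # toy variances

taus = np.linspace(0.0, 0.3, 3001)
c = np.array([cd_tau(t, y, s2) for t in taus])
tau_hat = taus[np.searchsorted(c, 0.5)]   # the median estimator (13.10), on the grid
print(cd_tau(0.0, y, s2), tau_hat)
```

Since Q is decreasing in τ, C is increasing, and the grid-based median inversion is safe.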

The method of Viechtbauer (2007) consists essentially in inverting the Q(τ, y) for obtaining confidence intervals for the dispersion parameter. He compared his confidence intervals with their main competitors, primarily with respect to coverage and interval length, and found them to perform favourably.

A confidence curve and confidence distribution for θ0 are a bit harder to obtain. There are a couple of essentially similar alternatives regarding forming a confidence distribution for θ0. Writing for clarity τ0 for the underlying true parameter value, we have

θ̂0(τ) = {∑_{j=1}^k Yj/(σj² + τ²)} / {∑_{j=1}^k 1/(σj² + τ²)} ∼ N(θ0, ω(τ)²)

with

ω(τ)² = ∑_{j=1}^k (σj² + τ0²)/(σj² + τ²)² / {∑_{j=1}^k 1/(σj² + τ²)}².

This expression is minimised for τ = τ0, where the value is

ω(τ0) = 1/{∑_{j=1}^k 1/(σj² + τ0²)}^{1/2}.

Note that if the σj are close to being equal to a common value σ0, then the weights in θ̂0(τ) are close to being equal (to 1/k), and with ω(τ0) close to {(σ0² + τ0²)/k}^{1/2}. The point is that these two approximations, of equal weights and the preceding value for ω(τ0), are robustly acceptable with σ0 = {k⁻¹ ∑_{j=1}^k σj²}^{1/2}, and that the ω(τ) curve is fairly constant in a reasonable neighbourhood around τ0. Hence the difference between the exact distribution of θ̂0(τ̂) and that of N(θ0, ω(τ0)²) is merely of second order, and we may be allowed to use

Cθ0(θ) = Φ((θ − θ̂0)/ω(τ̂))   (13.11)

as an approximation to the exact confidence distribution of θ0 based on θ̂0. For some more discussion of relevance, consult Singh et al. (2005, 2007), Xie et al. (2011) and Xie and Singh (2013).
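Combining the pieces, the approximate confidence distribution (13.11) takes half a dozen lines (a sketch, ours; τ̂ is here a grid maximiser of the profile likelihood, and ω(τ̂) is the plug-in version of ω(τ0)):

```python
import numpy as np
from scipy.stats import norm

def cd_theta0(theta, y, s2, tau_grid=np.linspace(0.0, 1.0, 1001)):
    """Approximate CD (13.11): Phi((theta - theta0_hat)/omega(tau_hat))."""
    def ell(t):                           # profile log-likelihood (13.5)
        v = t**2 + s2
        th = (y/v).sum() / (1/v).sum()
        return -0.5*(np.log(v) + (y - th)**2/v).sum()
    t = max(tau_grid, key=ell)            # tau_hat on the grid
    v = t**2 + s2
    th = (y/v).sum() / (1/v).sum()        # theta0_hat = theta0_hat(tau_hat)
    omega = 1.0 / np.sqrt((1.0/v).sum())  # plug-in omega(tau_hat)
    return norm.cdf((theta - th) / omega), th, omega

y = np.array([0.134, 0.074, 0.189, 0.152, 0.104, 0.100])   # toy estimates
s2 = (np.array([25, 22, 28, 26, 20, 21]) * 1e-3)**2        # toy variances
c, th, om = cd_theta0(0.15, y, s2)
print(th, om, c)
```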

A more elaborate confidence distribution construction for θ0 is to use t-bootstrapping based on

t = {θ0 − θ̂0(τ̂)}/ω(τ̂)  and  t* = {θ̂0 − θ̂0*(τ̂*)}/ω(τ̂*),


with bootstrapped versions of τ̂* and hence θ̂0*(τ̂*) and ω(τ̂*) stemming from simulated versions of Yj* ∼ N(θ̂0, σj² + τ̂²), and then following the general recipe of Section 7.6.

A third approach is to probability transform the profile deviance of θ0. When τ is close to zero the null distribution of the deviance might differ a bit from the \(\chi^2_1\), and also depend somewhat on τ; for larger values of the spread parameter the \(\chi^2_1\) distribution fits better.

Yet another approach is to use the fiducial distribution for θ0 obtained by Fisher's step-by-step method. When τ is the true value of the spread parameter, \(\widehat\theta(\tau)\) is normally distributed with known variance. The conditional fiducial distribution given τ is thus \(F(\theta_0 \,|\, \tau) = \Phi((\theta_0 - \widehat\theta(\tau))/\omega(\tau))\). With \(G(\tau) = 1 - \Gamma_{k-1}(Q(\tau, y))\), the marginal fiducial distribution for θ0 is
\[
H(\theta_0) = \int_0^\infty F(\theta_0 \,|\, \tau)\,\mathrm{d}G(\tau). \tag{13.12}
\]
As we know from the fiducial debate (Chapter 6), this marginal fiducial distribution need not be a valid confidence distribution; this must be checked in each case. We have simulated a few cases and found that the distribution of H(θ0), under true values (θ0, τ0), is very close to the uniform; cf. Figure 13.3. To be an exact confidence distribution the distribution would have to be exactly uniform.
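The integral (13.12) is easy to evaluate numerically. The sketch below is our own illustration, not the authors' code: it approximates H(θ0) by a Riemann–Stieltjes sum over a τ grid, including the possible point mass of G at τ = 0, and for simplicity writes the chi-squared cdf in closed form, which is valid when k − 1 is even (as for the k = 11 hospitals of Example 13.1).

```python
from math import exp, factorial
from statistics import NormalDist

def chi2_cdf(x, df):
    """Chi-squared cdf for even df (closed-form Erlang sum)."""
    assert df % 2 == 0 and x >= 0
    return 1.0 - exp(-x / 2.0) * sum((x / 2.0) ** i / factorial(i)
                                     for i in range(df // 2))

def H(theta0, y, sigma2, tau_max=1.0, n=2000):
    """Marginal fiducial cdf H(theta0) = int F(theta0 | tau) dG(tau), cf. (13.12)."""
    k = len(y)

    def weighted(tau):  # theta0_hat(tau) and its sd omega(tau) for this tau
        w = [1.0 / (s2 + tau * tau) for s2 in sigma2]
        th = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
        return th, (1.0 / sum(w)) ** 0.5, w

    def G(tau):  # G(tau) = 1 - Gamma_{k-1}(Q(tau, y))
        th, _, w = weighted(tau)
        Q = sum(wi * (yi - th) ** 2 for wi, yi in zip(w, y))
        return 1.0 - chi2_cdf(Q, k - 1)

    def F(tau):  # conditional fiducial cdf given tau
        th, om, _ = weighted(tau)
        return NormalDist().cdf((theta0 - th) / om)

    taus = [tau_max * i / n for i in range(n + 1)]
    h = G(taus[0]) * F(0.0)  # point mass of G at tau = 0, if any
    for a, b in zip(taus, taus[1:]):
        h += (G(b) - G(a)) * F(0.5 * (a + b))
    return h
```

The grid upper limit `tau_max` must be chosen large enough that G has essentially all its mass below it; in a serious implementation one would check this.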

Example 13.1 British hospitals

The forest plot of Figure 13.1 indicates that Bristol is an outlying hospital relative to a normal distribution of the θj. This is indeed confirmed by running the preceding stepwise algorithm to identify outliers. In the first round we find Bristol to have residual e = 4.006 with p-value \(1 - \{\Phi(|e|) - \Phi(-|e|)\}^{12} = 0.0007\). In the second round, with Bristol excluded, Oxford has the residual 2.117, which is extreme among the 11 remaining hospitals. The p-value is now only 0.32, and the procedure stops. The remaining 11 hospitals are regarded as a homogeneous sample relative to the normal-normal model.
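The p-value formula for the most extreme of k standardised residuals is a one-liner; the helper below is our own sketch, reproducing the formula used above under the assumption of independent standard normal residuals.

```python
from statistics import NormalDist

def extreme_residual_pvalue(e, k):
    """P-value that the largest of k absolute standardised residuals exceeds |e|:
    1 - {Phi(|e|) - Phi(-|e|)}^k, assuming independent N(0, 1) residuals."""
    phi = NormalDist().cdf
    return 1.0 - (phi(abs(e)) - phi(-abs(e))) ** k
```

With e = 4.006 and k = 12 studies this reproduces the 0.0007 reported for Bristol, and with e = 2.117 and k = 11 the 0.32 reported for Oxford.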

In this context one might argue for testing to one side only. Hospitals try to reduce mortality, and outliers might be hospitals lagging behind in the race for best practice. Bristol would then be an even clearer outlier, but Oxford would still be regarded as part of a homogeneous collection of 11 hospitals.

Figure 13.2 shows the confidence distribution Cτ and the confidence curve for the spread parameter τ, that is, the standard deviation in the normal distribution of mortality rate after open heart surgery on infants in British specialist centres. There is confidence 0.063 for τ = 0. The point estimate is \(\widehat\tau = C_\tau^{-1}(\tfrac12) = 0.021\).

The confidence distribution C for the mean mortality rate over British specialist centres, θ0, is shown in Figure 13.3. The point estimate is \(\widehat\theta_0 = C^{-1}(\tfrac12) = 0.113\) and the 95% confidence interval is (0.096, 0.136). To validate that the marginal fiducial distribution really is a confidence distribution, a brief simulation study was carried out. The histogram of H(θ0, Y) for simulated data for the 11 hospitals, assuming the normal-normal model, in the upper left panel of the figure, shows a nearly perfect uniform distribution, and H(θ0, y) is accordingly taken as the confidence distribution for θ0.

The third method was also used to obtain a confidence curve for θ0 from its profile deviance in the normal-normal model. To investigate the null distribution of the profile deviance calculated at the true value, 10⁴ binomially distributed count vectors were



Figure 13.2 With Bristol pushed out of the dataset of Table 13.1, the figure displays the confidence distribution obtained by the Q pivot (left) and its confidence curve (right) for the 11 remaining hospitals, for the standard deviation parameter τ of the normal distribution of death rate after operation in British specialist centres. The confidence distribution has point mass 0.063 at zero, which means that intervals for τ of level cc(0) = 0.874 and above will have zero as their left endpoint.

simulated by first drawing the binomial probabilities from \(\mathrm{N}(\widehat\theta_0, \tau^2)\) for each of \(\tau = \tfrac12\widehat\tau,\ \widehat\tau,\ 2\widehat\tau\), where the estimates are the maximum likelihood estimates in the normal-normal model. The profile deviance for θ0 was then calculated at the assumed true value θ0 for each dataset. The three distributions are all close to the \(\chi^2_1\) distribution expected from Wilks' theorem. The lower left panel of Figure 13.3 shows that the confidence distribution has longer tails than the normal distribution. See also Figure 13.4 for relevant quantile-quantile plots.

Confidence curves for θ0 obtained by four methods are shown in Figure 13.6. The fiducial method, probability transformation of the profile deviance of the Beta-binomial model, discussed in the text that follows, and also the profile deviance of the approximating normal-normal model are virtually identical. The confidence distribution found by directly summing the weighted normal scores as suggested by Singh et al. (2005), see later, is slightly shifted to the right, and is also narrower than the other three confidence distributions. That it is narrower is because the uncertainty in the parameter representing the spread among British hospitals is not accounted for in the normal score method.



Figure 13.3 With Bristol excluded, the confidence distribution and the confidence curve for θ0 based on the fiducial distribution (13.12) are shown in the right panels. In the upper left panel a histogram is shown for H(θ0, Y) based on 10⁴ simulated datasets for the 11 hospitals, with τ = 0.2, θ0 = 0.11, and with σj as estimated from the observed data. The lower left panel shows a normal probability plot for the confidence distribution.

13.4 Meta-analysis for an ensemble of parameter estimates

It is important to note that the aforementioned prototype setup is wide enough to encompass a long list of situations in which the parameters being assessed and compared are not merely binomial probabilities or mean parameters in Gaussian populations. As long as (13.1) holds, at least approximately, for certain parameters θ1, ..., θk that can meaningfully be compared, the apparatus developed earlier may be applied, both for reaching confidence statements about the overall mean parameter and for assessing the degree of heterogeneity via a confidence distribution for the spread parameter τ. For an Olympic illustration of such an analysis, see Section 14.5.

Example 13.2 Ancient Egyptian skulls

We return to Example 3.10, involving the biometrical parameters for collections of ancient Egyptian skulls, dating from five different time epochs. The researchers measured four quantities for each of 30 skulls from each of these epochs, respectively MB (maximal breadth), BH (basibregmatic height), BL (basialveolar length), and NH (nasal height). We



Figure 13.4 q-q plots against the \(\chi^2_1\) distribution for null profile deviances of the normal-normal model for 10⁴ sets of data simulated from the normal-binomial model (see Example 13.1). The parameters of the simulation model were \(\theta_0 = \widehat\theta_0 = 0.111\) and respectively \(\tau = \widehat\tau/2 = 0.0057\) to the left, \(\tau = \widehat\tau = 0.0115\) in the middle and \(\tau = 2\widehat\tau = 0.0223\) to the right.

view these measurement vectors as stemming from underlying distributions \(\mathrm{N}_4(\xi_j, \Sigma_j)\), for time epochs j = 1, 2, 3, 4, 5; see Claeskens and Hjort (2008) for further discussion, also involving certain identifiable trends for the mean vectors ξj over time.

Here we use these data to provide a more complex illustration of the use of the prototype setup associated with (13.1), focussing on the variance matrices involved rather than the mean vectors. Generally speaking, when \(Y = (Y_1, Y_2, Y_3, Y_4)^{\rm t}\) stems from a distribution with variance matrix Σ, then the linear combination \(Z = w^{\rm t}Y = w_1Y_1 + \cdots + w_4Y_4\) has variance \(w^{\rm t}\Sigma w\). The largest value of this variance, with w constrained to have length ‖w‖ = 1, is equal to λmax, and similarly the smallest value of this variance, with the same constraint, is equal to λmin, where λmax and λmin are the largest and smallest eigenvalues of Σ; see Exercise 13.5. A statistically meaningful parameter is therefore the ratio \(\theta = (\lambda_{\max}/\lambda_{\min})^{1/2}\), a measure of the degree of spread in the underlying four-dimensional distribution. Estimates and their standard errors are found to be as follows.


Epoch    Ratio    Standard error
−4000    2.652    0.564
−3300    2.117    0.434
−1850    1.564    0.326
 −200    2.914    0.615
  150    1.764    0.378

Initial analysis, involving also a forest plot and the use of the stepwise method described in Section 13.3 for identifying potential outliers, shows that the normal-normal model of (13.1) is fully plausible, namely that the θj are independently normally distributed over the epochs studied and that the sample estimates are normally distributed with standard errors as estimated. The algorithm for identifying outliers relative to this model does not identify any of the five epochs as outliers; the p-value of the extreme residual in the first round is actually 0.52.

Figure 13.5 provides the five estimates \(\widehat\theta_j\) in question (left panel), along with 90% confidence intervals, plotted against time epochs. To compute these intervals we have for this illustration appealed to approximate normality of each smooth function of the empirical variance matrix, with standard errors computed via parametric bootstrapping; see again Exercise 13.5 for details as well as for some more elaborate alternatives. The right-hand panel then provides the confidence distribution for the spread parameter τ, as per the recipe given previously. It does have a point mass of size 0.228 at zero, but otherwise indicates that there really is such a positive spread, with the five θj not being equal. This provides a


Figure 13.5 For the ancient Egyptian skulls data described in Example 13.2 we have estimated the parameter \(\theta = (\lambda_{\max}/\lambda_{\min})^{1/2}\) for each of the five time epochs in question (stretching from 4000 B.C. to 150 A.D.), where λ is the vector of eigenvalues for the underlying variance matrix. The left panel shows these estimates, along with 90% confidence intervals, and the right panel displays the confidence distribution for the spread parameter τ.


plausible indication for claiming that the statistical distributions of the human skulls really have changed over these five epochs, not necessarily regarding their average shapes but in terms of the biological variation.

13.5 Binomial count data

The British hospital data are binomial count data for each hospital. Many meta-analyses, particularly in the field of clinical trials, involve the same type of binomial data. Instead of going straight to an approximate normal model, as was done previously, it might be advantageous to work with the binomial likelihood for each study. This may be particularly pertinent in situations where the sample sizes are small, or where the θj probabilities being evaluated and compared are close to zero or one, as the normal approximation then works less well.

With θj the binomial ('success') probability for study j, heterogeneity might be modelled by a Beta distribution with parameters a = γθ0 and b = γ(1 − θ0). If Y ∼ Bin(m, θ) and θ has this Beta distribution, then the distribution of the count y takes the form
\[
g(y) = \int_0^1 \binom{m}{y}\theta^y(1-\theta)^{m-y}\,
\frac{\Gamma(\gamma)}{\Gamma(\gamma\theta_0)\Gamma(\gamma(1-\theta_0))}\,
\theta^{\gamma\theta_0-1}(1-\theta)^{\gamma(1-\theta_0)-1}\,\mathrm{d}\theta
= \binom{m}{y}\frac{\Gamma(\gamma)}{\Gamma(\gamma\theta_0)\Gamma(\gamma(1-\theta_0))}\,
\frac{\Gamma(\gamma\theta_0+y)\,\Gamma(\gamma(1-\theta_0)+m-y)}{\Gamma(\gamma+m)}
\]
for y = 0, 1, ..., m; cf. Exercise 4.4.19. With binomial count data (mj, yj), as with the British hospital data, this then leads to a log-likelihood function of the type
\[
\ell(\theta_0, \gamma) = \sum_{j=1}^k \log g(y_j; m_j, \theta_0, \gamma),
\]
provided the collection is homogeneous relative to the Beta distribution. With this parameterisation θ0 is the mean success probability across studies and γ measures how tight this distribution is; the variance of θ is actually θ0(1 − θ0)/(γ + 1). Full homogeneity corresponds to γ being infinite. Methods similar to those developed above for the normal model may now be derived, drawing on the general theory of Chapters 3, 4 and 7, to provide confidence distributions for the overall mean probability θ0 and the spread parameter γ. The latter will have a point mass at infinity, and we may prefer to present this in terms of the transformed parameter κ = 1/(γ + 1), with κ = 0 corresponding to the θj probabilities being identical.

In the Beta-binomial model there is also a need to check that the collection of studies represents a homogeneous group. This might be done by adapting the recursive method for identifying outlying residuals in the normal-normal model of Section 13.3, following (13.7). With the index set in step (2) of that recipe being I, calculate the p-value, \(p_j^k\), of study j ∈ I being homogeneous with the studies I − {j}. This is done by testing H0: θj = θ0, say by the likelihood ratio method, where γ is estimated from the studies I − {j} and regarded as known, and where θ0 is the assumed common parameter for these studies. Remove study j in step (3) if it yields the smallest of the k p-values, and if \(p_j^k < 1 - (1-\varepsilon)^{1/k}\).


A different approach to the analysis of binomial counts is via the arcsine transformation. If Y ∼ Bin(n, p), then famously \(\sqrt{n}(\widehat p - p) \to_d \mathrm{N}(0, p(1-p))\), for \(\widehat p = Y/n\). Letting \(A(u) = 2\arcsin(\sqrt{u})\), the delta method of Section A.3 gives
\[
\sqrt{n}\{A(\widehat p) - A(p)\} \to_d A'(p)\,\mathrm{N}(0, p(1-p)) = \mathrm{N}(0, 1).
\]
We may hence transform binomial estimates \(\widehat\theta_j = y_j/m_j\) to say \(\widehat\gamma_j = A(\widehat\theta_j)\), and utilise the approximation \(\widehat\gamma_j \sim \mathrm{N}(\gamma_j, 1/m_j)\). Analysis then proceeds in the domain of the γj, which we model as \(\gamma_j \sim \mathrm{N}(\gamma_0, \kappa^2)\), say. One reaches confidence distributions for their overall mean γ0 and κ, and may transform back to the scale of θ via \(\theta_j = A^{-1}(\gamma_j) = \sin^2(\tfrac12\gamma_j)\). It turns out that the arcsine transformation, in addition to stabilising the variance, has the added feature of often coming closer to the normal distribution.
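A small sketch of the transformation and back-transformation; the interval routine and its names are our own illustration of the recipe, not code from the book.

```python
from math import asin, sin, sqrt

def A(u):
    """Variance-stabilising transform A(u) = 2 * arcsin(sqrt(u))."""
    return 2.0 * asin(sqrt(u))

def A_inv(g):
    """Back-transform A^{-1}(g) = sin(g/2)^2."""
    return sin(0.5 * g) ** 2

def approx_interval(y, m, z=1.96):
    """Approximate 95% interval for theta from y successes in m trials:
    A(y/m) +/- z/sqrt(m) on the transformed scale, then mapped back,
    clipped to the admissible range [0, A(1)] = [0, pi]."""
    g = A(y / m)
    half = z / sqrt(m)
    return A_inv(max(g - half, 0.0)), A_inv(min(g + half, A(1.0)))
```

Because the variance 1/m on the transformed scale does not involve the unknown θ, the intervals need no plug-in estimate of the variance.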

When yj is 0 or mj for some studies, or nearly so, transformations like the arcsine or the logistic, and also the identity (no transformation, as earlier), do not work well. Sweeting et al. (2004) discussed “what to add to nothing” in the context of meta-analysis, whereas Rucker et al. (2008) asked, “Why add anything to nothing?” Instead of arbitrarily adding or subtracting something, a more sensible approach might be to use the above Beta-binomial model, or some other model based on the likelihoods of the individual studies. This is what we do for a particular application in Section 14.7.

13.6 Direct combination of confidence distributions

Combining independent p-values by log-transformation dates back to Fisher (1925). Let Pj be the p-value when testing the jth hypothesis. When the hypothesis in question is true, Pj is uniformly distributed, which implies \(-2\log P_j \sim \chi^2_2\). If all hypotheses are correct, therefore, and the underlying test statistics are independent, \(-2\sum_{j=1}^k \log P_j \sim \chi^2_{2k}\). Fisher's method for testing all the k hypotheses simultaneously is to look at the combined p-value \(P = 1 - \Gamma_{2k}(-2\sum_{j=1}^k \log P_j)\). With cumulative distribution function \(F(x) = \exp(\tfrac12 x)\) for negative x, Fisher's method amounts to combining the p-values by their F-scores \(\sum_{j=1}^k F^{-1}(P_j)\) and then probability-transforming to \(P = F_k(\sum_{j=1}^k F^{-1}(P_j))\), where \(F_k\) is the distribution of \(-\chi^2_{2k}\), which is the distribution of \(\sum_{j=1}^k F^{-1}(U_j)\).

This method works for any F. Stouffer et al. (1949) used the normal distribution; their method is \(P = \Phi(k^{-1/2}\sum_{j=1}^k \Phi^{-1}(P_j))\).

Singh et al. (2005) picked up on this idea and suggested combining independent confidence distributions for a common parameter θ, say C1(θ), ..., Ck(θ), by \(C(\theta) = F_k(\sum_{j=1}^k F^{-1}(C_j(\theta)))\). They found that asymptotic power, in the sense of Bahadur slope, is optimised locally when F is chosen as the double exponential distribution. In Xie et al. (2011) a weighted version of this method is studied,
\[
C_w(\theta) = F_w\Bigl(\sum_{j=1}^k w_j F^{-1}(C_j(\theta))\Bigr), \tag{13.13}
\]
where \(F_w(x) = P\{\sum_{j=1}^k w_j F^{-1}(U_j) \le x\}\), with U1, ..., Uk being independent and uniformly distributed, and the wj are nonnegative weights. They use the normal distribution for scoring, which yields \(F_w(x) = \Phi\bigl((\sum_{j=1}^k w_j^2)^{-1/2}x\bigr)\).


Xie et al. (2011) are primarily interested in robustness with respect to heterogeneity in their meta-analysis method in the fixed effects model. They devise an algorithm for choosing the weights from the observed data. They show their method to be consistent, in the sense that the confidence median converges in probability to the true value θ0, which is the common true effect size for the homogeneous group of studies in the collection sharing this value. This happens both when the precision of each study in a fixed set of studies increases without bound, and when the number of studies increases. In the latter case they find the asymptotic efficiency to be \(\kappa = (3/\pi)^{1/2} \approx 0.977\), that is, \(C_w(\theta) = C_0(\kappa(\theta - \theta_0))\). The fraction of outliers must be limited.

In the homogeneous fixed effects model of basic meta-analysis with normal data, the optimal combination is (13.2) given earlier. This results from (13.13) by choosing \(w_j = \sigma_j^{-1}\). In the random effects model Xie et al. (2011) suggest estimating τ² by the so-called DerSimonian and Laird method, and treating this estimate as if unaffected by uncertainty. They thus replace σj by \((\sigma_j^2 + \widehat\tau^2)^{1/2}\) and proceed as if in the fixed effects model. The resulting combined confidence distribution (13.11), studied in the preceding, will thus be slightly narrower than correct, that is, than the confidence distribution found by the fiducial method. This is not a big problem when the studies are of nearly equal size, or when k is large; neither is it problematic to use the normal distribution rather than the appropriate t-distribution when the number of degrees of freedom is large.

13.7 Combining confidence likelihoods

Instead of combining the confidence distributions directly, they might first be converted to confidence likelihoods, which are then combined by multiplication in the usual way. Assume that normal conversion is acceptable, that is, the confidence log-likelihood of study j can be taken as \(\ell_j(\theta) = -\tfrac12\{\Phi^{-1}(C_j(\theta))\}^2\). The combined confidence deviance is then
\[
D(\theta) = \sum_{j=1}^k \bigl[\{\Phi^{-1}(C_j(\theta))\}^2 - \{\Phi^{-1}(C_j(\widehat\theta))\}^2\bigr],
\]
with \(\widehat\theta\) the minimiser of \(\sum_{j=1}^k \{\Phi^{-1}(C_j(\theta))\}^2\).

In the homogeneous fixed effects model with normal data, \(C_j(\theta) = \Phi((\theta - \widehat\theta_j)/\sigma_j)\). The combined confidence log-likelihood is thus simply \(-\tfrac12\sum_{j=1}^k (\theta - \widehat\theta_j)^2/\sigma_j^2\), with deviance function \(D(\theta) = (\theta - \widehat\theta_0)^2\sum_{j=1}^k 1/\sigma_j^2\). The confidence curve obtained from the confidence deviance is consequently that of the optimal combination (13.2).

In the corresponding random effects model the confidence likelihood of study j is \(L_j(\theta_0, \tau) = \mathrm{E}_\tau \exp\{-\tfrac12(\theta - \widehat\theta_j)^2/\sigma_j^2\}\), where the expectation is with respect to θ ∼ N(θ0, τ²). The combined confidence log-likelihood in this normal-normal model is thus the normal log-likelihood \(\ell_k\) considered earlier, with profiles (13.5) and (13.7). Confidence distributions for τ and θ0 are then found by probability-transforming their respective deviance functions.

In regression meta-analyses, and also in more involved random effects models, the confidence likelihoods of the studies in the collection are combined according to the meta-analysis model. If, say, the size of the British hospital is regarded as a covariate in


the mixed model logistic meta-regression, with the intercept being normally distributed, results are found from the respective profile deviances of the combined confidence likelihood. Spiegelhalter (2001) actually found mortality to decrease significantly with the volume of this type of cases across specialist hospitals in the period considered.

Results have hitherto rarely been published in the format of confidence distributions or confidence curves. Often they come rather as confidence intervals together with point estimates. The methods of Section 10.5 might then be used to obtain approximate confidence likelihoods.

Example 13.3 British hospitals (continued)

Confidence curves for θ0 obtained by four methods are shown in Figure 13.6. The maximum likelihood estimate is relatively small in the normal-normal model, and relatively large in the Beta-binomial model. The two confidence curves based on deviance profiles are consequently located to the left and the right among the four. The Beta-binomial confidence curve was obtained by transforming the profile deviance to the probability scale by the \(\chi^2_1\) distribution; it was in fact not straightforward to simulate the null distribution in this model. The confidence distribution found by directly summing the weighted normal scores as suggested by Singh et al. (2005) is slightly shifted to the right of the confidence distribution found by the fiducial method, and is also narrower, due to the uncertainty in the parameter representing the spread among British hospitals not being accounted for in the normal score method.

The confidence distribution for θ0 obtained by applying the t-bootstrap method of Section 7.6 to the arcsine transformed relative frequencies is nearly identical to that obtained by t-bootstrapping the untransformed frequencies, and both are quite similar to that obtained by the fiducial method; see Figure 13.6.

Remark 13.1 Context matters

We have developed a framework and machinery for combining information from different sources, for example via confidence likelihoods. The recipes need to be applied conscientiously, however, with proper regard to the context in which the data have been collected, and so forth. As in other statistical situations one may be led to erroneous conclusions by overlooking aspects of connections between data, by ‘lurking covariates’, and so on. This is illustrated in the following example.

Example 13.4 Spurious correlations for babies

Figure 13.7 displays the confidence curve (solid line) for the correlation coefficient ρ between certain variables x and y, as measured on 23 small children at the Faculty of Medicine at the University of Hong Kong. The confidence curve in question is the optimal one, properly computed via the methods of Section 7.8; cf. Example 7.8. It appears to show that the ρ in question is significantly positive, which would amount to a so-called interesting medical finding. The problem is that there are three groups of children here, with 7, 8 and 8 of them at ages 3, 12 and 24 months, respectively, and that both x and y change with age, y most clearly so. Looked at separately, the correlations ρ1, ρ2, ρ3 at the three age levels are innocuous, as seen in the figure. The fact that x and y change with age, without any reason supplied for assuming that the correlation between them ought to stay constant (if



Figure 13.6 Four confidence curves for the mean rate parameter θ0 in the British hospitals study: (1) by the fiducial method; (2) by probability transforming the profile deviance of the normal-normal model; (3) by using the normal score method of Singh et al. (2005); and (4) by probability transforming the profile deviance of the Beta-binomial model.


Figure 13.7 Confidence curves for the correlation coefficient between x and y, with estimates 0.181, 0.202, 0.524, at ages 3, 12 and 24 months, for respectively 7, 8 and 8 children. Also displayed is the narrower confidence curve for the correlation when lumping together the three datasets. That ‘overall correlation’ lacks a good statistical interpretation, however.


at all present), makes the ‘overall correlation’ a confusing parameter, lacking a meaningful interpretation.

13.8 Notes on the literature

The first statistical meta-analysis on record was that of Karl Pearson (Simpson and Pearson, 1904), where he combined studies of typhoid vaccine effectiveness. In the following 75 years there were sporadic meta-analyses, primarily in medicine. Clinical trials and more general medicine are still the arena for most meta-analyses, but the methodology and practice have spread much more widely; see Borenstein et al. (2009), who review the history of meta-analysis and provide a broad introduction to the field. Publication bias is problematic for meta-analysis, not least in the social sciences, as reported by Peplow (2014).

Borenstein et al. (2009) start their book by recalling Dr Benjamin Spock's recommendation in his best-selling Baby and Child Care to have babies sleep on their stomach (Spock, 1946), and noting that some 100,000 children suffered crib death from the 1950s into the 1990s. Gilbert et al. (2005, p. 874) reviewed the associations between infant sleeping positions and sudden infant death syndrome (SIDS) and conclude: “Advice to put infants to sleep on the front for nearly a half century was contrary to evidence available from 1970 that this was likely to be harmful. Systematic review of preventable risk factors for SIDS from 1970 would have led to earlier recognition of the risks of sleeping on the front and might have prevented over 10,000 infant deaths in the UK and at least 50,000 in Europe, the USA, and Australasia.”

There is an abundance of books on meta-analysis; Borenstein et al. (2009) review many of them. There are also a dozen methodology papers on meta-analysis each year in Statistics in Medicine; see Sutton and Higgins (2008), who review recent developments in meta-analysis, primarily in medicine. But meta-analysis has spread to most empirical fields, and there are methods papers in many subject matter journals.

Nelson and Kennedy (2009), for example, review the state of the art in natural resource economics, and also some 140 meta-analyses conducted in this branch of economics. Another recent example is Kavvoura and Ioannidis (2008), who describe issues that arise in the retrospective or prospective collection of relevant data for genetic association studies through various sources, common traps to consider in the appraisal of evidence, and potential biases. They suggest creating networks of research teams working in the same field and performing collaborative meta-analyses on a prospective basis. This recommendation is in line with the practice of prospective registration of clinical trials administrated by the World Health Organization, and is advisable in many fields.

The broad majority of meta-analyses are concerned with combining results from independent studies for a common focus parameter. In Chapter 10 we also combined results for a parameter derived from the parameters of the individual studies, for example the growth coefficient based on abundances. This is also meta-analysis. In Section 14.7 we develop optimal methods of combining discrete data when the model is inside the exponential family. This problem is also studied by Liu et al. (2014a).

Traditionally, estimates of a common vector parameter are combined by a weighted average. This works under homogeneity and normality. Liu et al. (2014b) developed a method of multivariate meta-analysis of heterogeneous studies yielding normal results.


They effectively combine the Gaussian confidence likelihoods of the individual studies. The combined confidence likelihood might then yield confidence distributions for a parameter of interest, even when that parameter is not estimable in any single one of the studies. This might be the case in network meta-analysis, considered by Yang et al. (2014).

The epidemiologist Stolley (1991) reminds us of the dangers of being selective in what we choose as our data. He refers to Fisher's gross error in the controversy over lung cancer and smoking. Fisher (1958) argued, “Is it possible, then, that lung cancer – that is to say, the pre-cancerous condition which must exist and is known to exist for years in those who are going to show overt lung cancer – is one of the causes of smoking cigarettes? [. . . ] anyone suffering from a chronic inflammation in part of the body (something that does not give rise to conscious pain) is not unlikely to be associated with smoking more frequently [. . . ] It is the kind of comfort that might be a real solace to anyone in the fifteen years of approaching lung cancer.” The statistical evidence that Fisher brought forward to hold the door open for the possibility that lung cancer could contribute to causing smoking, rather than the causal arrow going the other way, was indeed selective; “he was so delighted at the finding of counterevidence that he exaggerated its importance” (Stolley, 1991, p. 419).

Exercises

13.1 Maximum likelihood for random effects: Consider the situation of Section 13.3 where Yj ∼N(θj,σ 2

j ) independently for j = 1, . . . ,k, for given θj, but where the θj are taken i.i.d. fromN(θ0,τ 2).

(a) Show that Y_j marginally is distributed as N(θ_0, σ_j^2 + τ^2) and that the log-likelihood of the observed data is

ℓ_k = −(1/2) Σ_{j=1}^k (y_j − θ_0)^2/(σ_j^2 + τ^2) − (1/2) Σ_{j=1}^k log(σ_j^2 + τ^2).

(b) Show next that the value of θ_0 maximising the log-likelihood for fixed τ is

θ̂_0(τ) = Σ_{j=1}^k y_j/(σ_j^2 + τ^2) / Σ_{j=1}^k 1/(σ_j^2 + τ^2).

This leads to the profile log-likelihood ℓ_{k,prof}(τ) of (13.5).

(c) Work out the derivative of ℓ_{k,prof}(τ), as a function of τ^2 rather than of τ, at zero. Show that it is equal to the formula of (13.6), and translate this into a condition for the maximum likelihood estimate τ̂ being positive.
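As a concrete illustration of parts (a) and (b), the following sketch evaluates the marginal log-likelihood, the weighted mean θ̂_0(τ), and the profile log-likelihood over a grid of τ values. The study estimates y_j and standard deviations σ_j here are hypothetical, chosen only for illustration.

```python
import math

# Hypothetical study estimates y_j with known standard deviations sigma_j
y = [1.2, 0.4, 2.1, 1.5, 0.7]
sigma = [0.5, 0.6, 0.4, 0.7, 0.5]

def ell(theta0, tau):
    """Marginal log-likelihood of part (a), up to an additive constant."""
    return -0.5 * sum((yj - theta0) ** 2 / (s * s + tau * tau)
                      + math.log(s * s + tau * tau)
                      for yj, s in zip(y, sigma))

def theta0_hat(tau):
    """The weighted mean of part (b), maximising ell for fixed tau."""
    w = [1.0 / (s * s + tau * tau) for s in sigma]
    return sum(wj * yj for wj, yj in zip(w, y)) / sum(w)

def ell_prof(tau):
    """Profile log-likelihood ell_{k,prof}(tau) of (13.5)."""
    return ell(theta0_hat(tau), tau)

# Grid-maximise the profile log-likelihood over tau in [0, 3]
tau_hat = max((i * 0.01 for i in range(301)), key=ell_prof)
```

The grid resolution and range are ad hoc choices; in practice one would use a proper one-dimensional optimiser.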

13.2 A chi-squared construction: Following the assumptions and notation of Exercise 13.1, consider

Q = Q(τ, y) = Σ_{j=1}^k {Y_j − θ̂_0(τ)}^2/(σ_j^2 + τ^2),

where θ̂_0(τ) = A^{−1} Σ_{j=1}^k Y_j/(σ_j^2 + τ^2), with A = Σ_{j=1}^k 1/(σ_j^2 + τ^2). The following calculations take place under the true value of τ, and aim at setting up the chi-squared based confidence distribution for τ used in Section 13.3.

(a) Show that θ̂_0(τ) ∼ N(θ_0, 1/A).


(b) Show that the distribution of Q does not depend on θ_0, which we for the following calculations hence may set equal to zero. Introduce V_j = Y_j/(σ_j^2 + τ^2)^{1/2}, which become independent and standard normal. Show that the vector p with elements A^{−1/2}(σ_j^2 + τ^2)^{−1/2} has length 1, and hence that there is an orthonormal matrix P having p as its first row. Transform to W = PV, so that W_1, . . . , W_k are also independent and standard normal.

(c) Show that W_1 = A^{1/2} θ̂_0(τ) and that Q(τ) can be expressed as W_2^2 + ··· + W_k^2, proving its χ^2_{k−1} distribution, and also that it is independent of θ̂_0(τ).

(d) Conclude that C(τ) = 1 − Γ_{k−1}(Q(τ, y)), with Γ_{k−1} the χ^2_{k−1} cumulative distribution function, is a confidence distribution for τ. Comment on its point mass at zero.

(e) Show that Q(τ, y) is a decreasing function of τ regardless of the data y. Try first the case with k = 2.
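A quick Monte Carlo check of the χ^2_{k−1} result of part (c) can be sketched as follows, with hypothetical σ_j and true values (θ_0, τ): simulating the Y_j from the marginal model at the true τ, the average of Q(τ, Y) should be close to k − 1.

```python
import random

sigma = [0.5, 0.8, 0.4, 1.0, 0.6]     # hypothetical within-study sd's sigma_j
tau, theta0 = 0.7, 2.0                # hypothetical true values
k = len(sigma)

def Q(tau, ys):
    """Q(tau, y) of Exercise 13.2, with theta0_hat(tau) plugged in."""
    w = [1.0 / (s * s + tau * tau) for s in sigma]
    th = sum(wj * yj for wj, yj in zip(w, ys)) / sum(w)
    return sum(wj * (yj - th) ** 2 for wj, yj in zip(w, ys))

random.seed(1)
B = 20000
draws = [Q(tau, [random.gauss(theta0, (s * s + tau * tau) ** 0.5) for s in sigma])
         for _ in range(B)]
mean_q = sum(draws) / B               # should be close to k - 1
```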

13.3 Exact distribution of quadratic forms: In Section 13.3 we considered statistics of the type W_k = Σ_{j=1}^k (Y_j − Ȳ)^2, where the Y_j are independent and N(0, ω_j^2), say. The following is relevant for the construction of the confidence distribution C_R(τ) for τ pointed to after (13.9).

(a) Show that W_k may be presented as a linear combination of independent χ^2_1 variables.

(b) Consider W = Σ_{j=1}^k λ_j V_j, where the V_j are independent χ^2_1 variables. Find its mean, variance and skewness, and also an expression for its characteristic function, φ(t) = E exp(itW).

(c) Find an inversion formula in the probability literature on characteristic functions and attempt to implement it for the present purpose of computing and displaying the exact density of W. It might, however, be easier to use stochastic simulation, for example, in connection with C_R(τ) pointed to earlier.

13.4 Meta-analysis for Poisson counts: In Exercise 8.14 we consider the simple data set 3, 1, 1, 4, 0, 4, 0, 4, 8, 1, 5, 3, 4, 3 and 7, investigating there the degree to which the data might be overdispersed with respect to the Poisson model. That analysis can be complemented here, as follows.

(a) Suppose Y | θ ∼ Pois(θ) but that θ ∼ Gamma(a, b). Show that the distribution of Y is then

f_0(y, a, b) = {b^a/Γ(a)} (1/y!) Γ(a + y)/(b + 1)^{a+y}   for y = 0, 1, 2, . . . .

With the parametrisation (θ_0/τ, 1/τ) for (a, b), show that Y_i has mean θ_0 and variance θ_0(1 + τ), hence with τ reflecting potential overdispersion; see also Exercise 4.17.

(b) Fit the preceding model, using f(y, θ_0, τ) = f_0(y, θ_0/τ, 1/τ), using maximum likelihood (the estimates become 3.200 and 0.834), finding also the approximate standard errors. Find and display approximate confidence distributions for the overall mean parameter θ_0 and overdispersion parameter τ. Comment on the latter’s point mass at zero.
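A sketch of the maximum likelihood fit, using only the standard library. It exploits the reparametrisation to mean μ = θ_0 and shape r = θ_0/τ, under which the score in μ vanishes at ȳ for each fixed r, so the joint maximisation reduces to a one-dimensional search over r. The grid resolution and range are ad hoc choices made here, not part of the exercise.

```python
import math

y = [3, 1, 1, 4, 0, 4, 0, 4, 8, 1, 5, 3, 4, 3, 7]
ybar = sum(y) / len(y)                # 3.2, which is also the MLE of theta_0

def loglik(mu, r):
    """Negative binomial log-likelihood; f0 with mean mu = a/b, shape r = a."""
    return sum(math.lgamma(r + yi) - math.lgamma(r) - math.lgamma(yi + 1)
               + r * math.log(r / (r + mu)) + yi * math.log(mu / (r + mu))
               for yi in y)

# One-dimensional grid search over the shape r; tau_hat = ybar / r_hat.
rs = [10 ** (j / 100.0) for j in range(-100, 301)]   # r from 0.1 to 1000
r_hat = max(rs, key=lambda r: loglik(ybar, r))
tau_hat = ybar / r_hat                # the text reports (3.200, 0.834)
```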

13.5 Multivariate statistics for skulls: The following calculations are related to the analysis of the Egyptian skulls measurements in Example 13.2, where we in particular need standard errors for θ̂, the ratio of the largest to the smallest root-eigenvalue. Assume S_n is the variance matrix formed from a sample of independent X_1, . . . , X_n from a N_p(ξ, Σ), assuming for ease of presentation that the eigenvalues λ_1 > ··· > λ_p of Σ are distinct. Letting λ̂_1 > ··· > λ̂_p be the observed eigenvalues for S_n, one may prove that √n(λ̂_j − λ_j) →_d A_j ∼ N(0, 2λ_j^2), and that these are independent in the limit; see, for example, Mardia et al. (1979, appendix).

(a) With θ = (λ_1/λ_p)^{1/2} and θ̂ = (λ̂_1/λ̂_p)^{1/2}, show that

√n(θ̂^2 − θ^2) = √n(λ̂_1/λ̂_p − λ_1/λ_p) →_d N(0, 4θ^4).


Deduce from this again that √n(θ̂ − θ) →_d N(0, θ^2). Hence an easy-to-use approximation to sd(θ̂) is θ̂/√n.

(b) A more careful method for estimating σ = sd(θ̂) than σ_quick = θ̂/√n is that of parametric bootstrapping, where one simulates a large number of θ̂* from the estimated multinormal model. For the case of the skulls, where n = 30 for each of the five groups, this gives estimates that are around 10% larger than what is found from the easy approximations described above.
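The flavour of this comparison can be reproduced in a small p = 2 Monte Carlo experiment, using the closed-form eigenvalues of a 2 × 2 sample covariance matrix; the population eigenvalues λ_1 = 4, λ_2 = 1 are hypothetical, with n = 30 as for the skull groups.

```python
import math, random

random.seed(3)
lam1, lam2, n, B = 4.0, 1.0, 30, 2000    # hypothetical eigenvalues; n = 30
theta = math.sqrt(lam1 / lam2)           # true root-eigenvalue ratio

def theta_hat():
    """Draw a sample, return theta_hat from the 2x2 sample covariance."""
    xs = [(random.gauss(0.0, math.sqrt(lam1)), random.gauss(0.0, math.sqrt(lam2)))
          for _ in range(n)]
    mx = sum(u for u, _ in xs) / n
    my = sum(v for _, v in xs) / n
    sxx = sum((u - mx) ** 2 for u, _ in xs) / (n - 1)
    syy = sum((v - my) ** 2 for _, v in xs) / (n - 1)
    sxy = sum((u - mx) * (v - my) for u, v in xs) / (n - 1)
    half = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy * sxy)
    l1 = (sxx + syy) / 2.0 + half        # larger eigenvalue of S_n
    l2 = (sxx + syy) / 2.0 - half        # smaller eigenvalue of S_n
    return math.sqrt(l1 / l2)

draws = [theta_hat() for _ in range(B)]
mbar = sum(draws) / B
sd_sim = math.sqrt(sum((t - mbar) ** 2 for t in draws) / (B - 1))
sd_quick = theta / math.sqrt(n)          # the easy approximation of part (a)
```

In a parametric bootstrap one would simulate from the estimated covariance matrix rather than the true one; the mechanics are the same.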

(c) Redo the analysis of Example 13.2. Supplement this analysis with similar calculations, including confidence distributions for the population mean θ_0 and the associated spread τ, for a few other parameter functions of the variance matrices, such as, for example,

ψ_1 = (λ_1 ··· λ_p)^{1/(2p)}   and   ψ_2 = max λ_j/(λ_1 + ··· + λ_p).

Give interpretations to these parameters.


14

Applications

The present chapter reports on different application stories, where the concepts and methodology of previous chapters are put to use. The stories include medical statistics and range from golf putting to Olympic unfairness, from Norwegian income distributions to prewar American government spending and analysis of bibliographic citation patterns. We shall see how the apparatus of confidence distributions, profiled log-likelihoods and meta-analysis helps us reach inference conclusions along with natural ways of presenting these.

14.1 Introduction

The application stories selected for this chapter are intended to demonstrate the broad usefulness of our methodology, across a range of models and situations. The confidence distribution machinery allows the practitioner to compute and exhibit a full (epistemic) probability distribution for the focus parameter in question (without having to start with a prior).

We start with a reasonably simple story about golf putting, with data from top tournaments. A model built for the purpose does a better job than logistic regression, allowing inference for various relevant parameters. We then turn to an analysis of parameters related to a population of bowhead whales, from abundance to growth and mortality rates. Next we offer a different analysis of data used by C. Sims in his acceptance speech when receiving his 2011 ‘Nobel Prize of Economics’. Sims used a Bayesian setup with noninformative priors, whereas we reach rather different conclusions for the main parameter of interest via confidence analysis (using the same data and the same model).

Jumping from prewar US economics to sports, we demonstrate next that the Olympic 1000-metre speedskating event is unfair and needs a different design. The Olympic unfairness parameter is of the order of 0.15 second, more than enough in that sport for medals to change necks. Turning to the Norwegian income distribution we investigate aspects of the Gini index relevant for understanding how the so-called Nordic model works in Scandinavian societies. Then we provide an optimal inference analysis of the odds ratio across a range of 2×2 tables, for a certain application that has involved not merely losses of lives and billions of dollars and legal mass action, but also statistical controversy, as it has not been clear how to give a proper meta-analysis of such tables when there are many zeroes. Our final application story pertains to understanding patterns of Google Scholar citations, of interest in the growing industry of bibliometry, citographics and culturomics.


Table 14.1. Golf data: Number of feet from the hole, the number of attempts, number of successes

Feet  Attempts  Successes
  2     1443      1346
  3      694       577
  4      455       337
  5      353       208
  6      272       149
  7      256       136
  8      240       111
  9      217        69
 10      200        67
 11      237        75
 12      202        52
 13      192        46
 14      174        54
 15      167        28
 16      201        27
 17      195        31
 18      191        33
 19      147        20
 20      152        24

14.2 Golf putting

Table 14.1 relates data from Gelman et al. (2004, table 20.1), concerning the ability or not of professional golfers to successfully ‘putt’ the ball into the hole, collected over several seasons. The columns contain say x_j, m_j, y_j, where y_j is the number of successful putts out of a total of m_j attempts, at distance x_j feet from the hole. We wish to model p(x), the probability of success from distance x, along with the associated variability. We might, for example, take an interest in the distance x_0 at which the success chance is at least 0.80, and so forth.

The traditional off-the-shelf model for such data is logistic regression, with say

p(x)= H(a + bx), where H(u)= exp(u)/{1 + exp(u)}, (14.1)

and where the other assumption, perhaps a bit too quickly made (knowingly or not, when logistic regression software is used), is that the y_j are really independent and binomial (m_j, p(x_j)). The log-likelihood for both this and other parametric binomial based models takes the form

ℓ_n = Σ_{j=1}^n [y_j log p(x_j) + (m_j − y_j) log{1 − p(x_j)}],   (14.2)

with n = 19 denoting the number of such binomial trials. Maximum likelihood estimates are found to be (â, b̂) = (2.2312, −0.2557), leading to the dashed curve in Figure 14.1.
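The logistic fit is easy to reproduce with a small Newton–Raphson iteration on (14.2), using the data of Table 14.1; the final line also computes the distance x_0 with H(a + bx_0) = 0.80, a quantity returned to later in the section. This is a sketch, not the book's own code.

```python
import math

feet = list(range(2, 21))
m = [1443, 694, 455, 353, 272, 256, 240, 217, 200,
     237, 202, 192, 174, 167, 201, 195, 191, 147, 152]
y = [1346, 577, 337, 208, 149, 136, 111, 69, 67,
     75, 52, 46, 54, 28, 27, 31, 33, 20, 24]

a, b = 0.0, 0.0
for _ in range(30):                     # Newton-Raphson for the MLE of (14.2)
    p = [1.0 / (1.0 + math.exp(-(a + b * x))) for x in feet]
    g0 = sum(yj - mj * pj for yj, mj, pj in zip(y, m, p))
    g1 = sum(x * (yj - mj * pj) for x, yj, mj, pj in zip(feet, y, m, p))
    w = [mj * pj * (1.0 - pj) for mj, pj in zip(m, p)]
    h00 = sum(w)
    h01 = sum(x * wj for x, wj in zip(feet, w))
    h11 = sum(x * x * wj for x, wj in zip(feet, w))
    det = h00 * h11 - h01 * h01
    a += (h11 * g0 - h01 * g1) / det    # solve the 2x2 Newton system
    b += (h00 * g1 - h01 * g0) / det

x0_logistic = (math.log(0.8 / 0.2) - a) / b   # distance with fitted p = 0.80
```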



Figure 14.1 The raw estimates p̂_j = y_j/m_j of golf putting at distance x_j feet from the hole, along with two fitted curves, based respectively on the geometric model (solid line) and ordinary logistic regression (dashed line).

A competing model, inspired by a brief geometrical comment in an exercise of Gelman et al. (2004, chapter 20), holds that

p(x) = p(x, σ, ν) = P{|σ t_ν| ≤ arcsin((R − r)/x)},   (14.3)

where the unknown parameters are the scale σ and the degrees of freedom ν for the t distribution, and where R and r are the radii of the hole and the ball, respectively equal to 4.250/2 inch (0.177 foot) and 1.680/2 inch (0.070 foot). For details, see Exercise 14.1. The same type of log-likelihood (14.2) may be maximised for this competing two-parameter model, and the result is (σ̂, ν̂) = (0.0232, 5.6871), yielding in its turn the solid-line curve of Figure 14.1. The value of the Akaike information criterion (AIC), that is, 2ℓ_{n,max} − 2p, where p is the number of estimated parameters in the model (see Section 2.6), climbs dramatically from −365.92 to −148.79 when we pass from logistic regression to this particular geometrically inspired model. It appears very clear, then, that model (14.3) fits these particular data much better than the logistic regression.
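A sketch of how p(x) of (14.3) may be evaluated without special libraries, using Simpson integration for the Student-t cumulative distribution function; solving p(x_0) = 0.80 by bisection should land close to the point estimate 3.179 given below. The step count and bracketing interval are ad hoc choices.

```python
import math

R, r = 0.177, 0.070                 # radii of hole and ball, in feet
sigma, nu = 0.0232, 5.6871          # the ML estimates reported in the text

def t_cdf(t, df, steps=2000):
    """Student-t cdf at t >= 0, via Simpson integration of the density."""
    c = math.exp(math.lgamma((df + 1) / 2.0) - math.lgamma(df / 2.0)) \
        / math.sqrt(df * math.pi)
    f = lambda u: c * (1.0 + u * u / df) ** (-(df + 1) / 2.0)
    h = t / steps
    s = f(0.0) + f(t) + sum((4 if j % 2 else 2) * f(j * h) for j in range(1, steps))
    return 0.5 + s * h / 3.0

def p_geo(x):
    """p(x) of (14.3): P{|sigma t_nu| <= arcsin((R - r)/x)}."""
    return 2.0 * t_cdf(math.asin((R - r) / x) / sigma, nu) - 1.0

lo, hi = 2.0, 6.0                   # bracket; p_geo is decreasing in x
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if p_geo(mid) > 0.8 else (lo, mid)
x0_geo = 0.5 * (lo + hi)            # should be close to 3.18 feet
```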

To illustrate how this model might be used, with our machinery of confidence curves and so forth, consider the distance x_0 at which the putting chance is p(x_0) = p_0 = 0.80. Using the model (14.3) this focus parameter may be expressed as

x_0 = x_0(σ, ν) = (R − r)/sin(σ F_ν^{−1}(1/2 + (1/2)p_0)),

with maximum likelihood point estimate x̂_0 = x_0(σ̂, ν̂) = 3.179. To supplement this with a confidence curve, study the profiled log-likelihood function

ℓ_{n,prof}(x_0) = max{ℓ_n(σ, ν) : (σ, ν) fitting p(x_0, σ, ν) = p_0}.


This maximisation over (σ, ν) for each given x_0 may be carried out in different ways, for example, by solving for σ first and finding

σ(ν) = arcsin((R − r)/x_0)/F_ν^{−1}(1/2 + (1/2)p_0).

The task is then reduced to maximising ℓ_n(σ(ν), ν) over ν alone, under the dictated constraint p(x_0, σ(ν), ν) = p_0. By the general theory of Chapters 2 and 3, the deviance

D_n(x_0) = 2{ℓ_{n,prof}(x̂_0) − ℓ_{n,prof}(x_0)}

is approximately a χ^2_1, at the true value of x_0, and cc_n(x_0) = Γ_1(D_n(x_0)) forms the resulting confidence curve.

This is illustrated in Figure 14.2 (solid line, associated with x̂_0 = 3.179), along with the corresponding result based on the more traditional logistic regression model (dashed line, associated with point estimate 3.304). For the (14.1) model the setup and formulae are simpler, involving

ℓ_{n,prof}(x_0) = max{ℓ_n(a, b) : H(a + bx_0) = p_0} = max{ℓ_n(d_0 − bx_0, b) : all b},

where d_0 = H^{−1}(p_0) = 1.386. The two confidence distributions are rather different (though the two point estimates of x_0 are in not unreasonable agreement). The curve based on the geometric model is, however, vastly preferable, not because it happens to provide a tighter


Figure 14.2 Confidence curves for the distance x_0 at which the success probability is 0.80, based on the geometric model (solid line, with point estimate 3.179) and logistic regression (dashed line, with point estimate 3.304).


confidence distribution and narrower confidence intervals, per se, but because the simpler logistic regression model does not really fit the data.

That the logistic regression model serves these data rather less well than the geometric one is perhaps already clear from Figure 14.1, and also from Figure 14.2, as the confidence curve is much wider for the logistic case. It is solidly supported by the AIC values noted earlier. Another way to appreciate the shortcoming of logistic regression for these data is to compute and display the Pearson residuals

z_j = (y_j − m_j p̂_j)/{m_j p̂_j(1 − p̂_j)}^{1/2}

for different fitted models, writing for simplicity p̂_j for p̂(x_j). Carrying this out, it is clear that logistic regression leads to far bigger values of {y_j − m_j p̂(x_j)}^2 than m_j p̂(x_j){1 − p̂(x_j)}, indicating a clear presence of binomial overdispersion; see Exercise 14.1. This invites yet another view of the golf data, via the model that considers Y_j given p_j to be binomial (m_j, p_j) but that admits uncertainty about the p_j by taking these independent and Beta(γ p(x_j, a, b), γ(1 − p(x_j, a, b))). Here γ has an interpretation close to ‘prior sample size’, with

E p_j = p(x_j, a, b)   and   Var p_j = p(x_j, a, b){1 − p(x_j, a, b)}/(γ + 1),

and with classical logistic regression corresponding to γ being very large. The likelihood now becomes

L = ∏_{j=1}^n (m_j choose y_j) {Γ(γ p_{0,j} + y_j)/Γ(γ p_{0,j})} {Γ(γ(1 − p_{0,j}) + m_j − y_j)/Γ(γ(1 − p_{0,j}))} {Γ(γ)/Γ(γ + m_j)},   (14.4)

writing now p_{0,j} for p(x_j, a, b) = H(a + bx_j); see also Section 13.5 and Exercise 4.19. Carrying out these calculations and maximisations one finds maximum likelihood estimates (1.729, −0.209, 33.753) for (a, b, γ). Displaying, for example, the 0.05, 0.50, 0.95 quantile curves of p(x) under this model formulation is useful (cf. Exercise 14.1), and demonstrates in particular that there is a considerable amount of looseness around the logistic regression curve.
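The likelihood (14.4) is easy to evaluate with log-gamma functions; the following sketch computes its logarithm at the maximum likelihood estimates quoted above and at a couple of other (arbitrarily perturbed) parameter values for comparison.

```python
import math

feet = list(range(2, 21))
m = [1443, 694, 455, 353, 272, 256, 240, 217, 200,
     237, 202, 192, 174, 167, 201, 195, 191, 147, 152]
y = [1346, 577, 337, 208, 149, 136, 111, 69, 67,
     75, 52, 46, 54, 28, 27, 31, 33, 20, 24]

def loglik(a, b, gamma):
    """Log of the beta-binomial likelihood (14.4), with p0j = H(a + b x_j)."""
    ll = 0.0
    for x, mj, yj in zip(feet, m, y):
        p = 1.0 / (1.0 + math.exp(-(a + b * x)))
        ll += (math.lgamma(mj + 1) - math.lgamma(yj + 1) - math.lgamma(mj - yj + 1)
               + math.lgamma(gamma * p + yj) - math.lgamma(gamma * p)
               + math.lgamma(gamma * (1 - p) + mj - yj) - math.lgamma(gamma * (1 - p))
               + math.lgamma(gamma) - math.lgamma(gamma + mj))
    return ll

ll_mle = loglik(1.729, -0.209, 33.753)   # at the reported ML estimates
```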

This model extension, from pure binomials to Beta-binomials to allow for overdispersion, can also be attempted starting instead with the geometric model (14.3). It turns out that this model is already good enough and there is no gain in allowing a third overdispersion parameter.

14.3 Bowheads

Airborne surveys of bowhead whales have been carried out during the spring when the bowhead migrates eastwards past Barrow, Alaska. Schweder et al. (2010) analysed data from 10 years, and found the confidence curves shown in Figures 14.3 and 14.4 for several parameters of interest. The main difficulty they tackled is that most whales are only weakly marked, and false negatives are commonplace when images are matched. That is, some true recaptures are not recognised. The matching protocol is, however, stringent, and when a match is found it is assumed true. False-positive matches are thus assumed not to occur.

The model is relatively complex. The captures of an individual are assumed to follow a Poisson process with intensity depending stochastically on the individual whale and on the



Figure 14.3 Confidence curves for abundance in 1984 (solid line), 2001 (dashed line), and 2007 (broken line).


Figure 14.4 Confidence curves for yearly growth rate of the population (top left); yearly mortality intensity (top right); mean of hourly capture intensity (bottom left); and coefficient of variation of hourly capture intensity (bottom right). The horizontal lines represent confidence 0.95. The vertical lines indicate 95% confidence intervals. The confidence curves deviate slightly from those of Schweder et al. (2010) because they are based on another bootstrap run.

year. The probability of successfully matching a capture to a previous capture is estimated by logistic regression on the degree of marking and image quality. Individuals are recruited by the so-called Pella–Tomlinson population model, and their mortality rate is assumed to be constant.


The model was fitted by maximising a pseudo-likelihood where the matching is assumed correct, and where the abundance is assumed to follow the deterministic version of the Pella–Tomlinson model. The maximising estimators for the 18 free parameters of the model are biased. Data from the model without the limitations of the pseudo-likelihood are, however, easy to simulate. This was done in 1000 replicates, and each replicated dataset was fed to the pseudo-likelihood, yielding a bootstrap estimate. The abc method of Section 7.6 was then applied for each parameter separately, yielding confidence curves like those in Figures 14.3 and 14.4. Variance was found to vary little with the value of the individual parameters. Acceleration was therefore neglected by setting a = 0 in (7.17). The (bias corrected) estimates were in broad agreement with other results.

This application illustrates how inference by confidence distributions might be carried out for several parameters of interest in complex models by parametric bootstrapping.
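The general recipe — simulate parametrically, then bias-correct with acceleration a = 0 — can be sketched on a toy model. Here a normal standard deviation stands in for a whale-model parameter; all numbers are hypothetical, and the bias-corrected construction shown is the standard one, not the book's own code.

```python
import math, random
from statistics import NormalDist

nd = NormalDist()
random.seed(4)

# Toy stand-in for the complex model: parameter = normal sd, estimated by
# the root mean square of the sample (all numbers hypothetical).
n, true_sd = 40, 2.0
data = [random.gauss(0.0, true_sd) for _ in range(n)]
est = lambda xs: math.sqrt(sum(x * x for x in xs) / len(xs))
theta_hat = est(data)

B = 2000                              # parametric bootstrap replicates
boot = sorted(est([random.gauss(0.0, theta_hat) for _ in range(n)])
              for _ in range(B))
z0 = nd.inv_cdf(sum(t <= theta_hat for t in boot) / B)  # median-bias correction

def conf_dist(theta):
    """abc-type confidence distribution with acceleration a = 0."""
    g = sum(t <= theta for t in boot) / B
    g = min(max(g, 0.5 / B), 1.0 - 0.5 / B)   # keep inv_cdf finite
    return nd.cdf(nd.inv_cdf(g) - 2.0 * z0)
```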

14.4 Sims and economic prewar development in the United States

Sims (2012) argues that statistical analysis of macroeconomic data should be carried out via Bayesian methods. These models are often systems of equations expressing causal relationships between the various variables of the multivariate time series that constitutes the data. Haavelmo (1943) is a seminal paper for such studies. T. Haavelmo was awarded the Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel in 1989 (the Nobel Prize in Economics) for this and the later paper Haavelmo (1944), which constituted econometrics as a modern branch of statistics. C. Sims was awarded the Nobel Prize in Economics for 2011 together with T. J. Sargent, “for their empirical research on cause and effect in the macroeconomy”. In his Nobel Prize lecture (Sims, 2012) he referred to Haavelmo and illustrated the Bayesian approach by analysing a simple model for the economic development in the United States in the years before World War II.

We will revisit the data Sims used and develop a confidence distribution for a parameter of particular interest. Our confidence distribution differs substantially from the Bayesian posterior distribution Sims found.

For year t = 1929, . . . , 1940 let C_t be consumption, I_t investment, Y_t total income and G_t government spending in the United States. The data are the annual chain-indexed, real GDP component data for the United States; they are given in Table 14.2. The model Sims uses is

C_t = β + αY_t + σ_C Z_{1,t},
I_t = θ_0 + θ_1(C_t − C_{t−1}) + σ_I Z_{2,t},
Y_t = C_t + I_t + G_t,
G_t = γ_0 + γ_1 G_{t−1} + σ_G Z_{3,t},

where the Z_{i,t} are i.i.d. and standard normal. The data for year 1929 are used to initiate the likelihood, which then uses data for the remaining years.

The multiplier θ_1 is of special interest. It cannot be negative by economic theory, according to Sims. He assumes a flat prior for all the coefficients, but restricted to the positive for θ_1, and a prior that is flat in 1/σ^2 for the three variance terms. The priors for the individual parameters are taken independent. He obtains a posterior distribution for the


Table 14.2. Macroeconomic data for the United States, courtesy of C. Sims.

Year     I      G      C
1929   101.4  146.5  736.3
1930    67.6  161.4  696.8
1931    42.5  168.2  674.9
1932    12.8  162.6  614.4
1933    18.9  157.2  600.8
1934    34.1  177.3  643.7
1935    63.1  182.2  683.0
1936    80.9  212.6  752.5
1937   101.1  203.6  780.4
1938    66.8  219.3  767.8
1939    85.9  238.6  810.7
1940   119.7  245.3  852.7


Figure 14.5 The cumulative confidence distribution for θ_1 (solid line, with vertical start) together with an approximation to the cumulative posterior distribution obtained by Sims (2012).

interest parameter that is nearly triangular on (0, 0.15) with mode at θ_1 = 0; see Figure 14.5, where an approximation to its cumulative distribution is shown. Our confidence distribution based on the same data and model is also shown.

The unrestricted maximum likelihood estimate of θ_1 is negative. The more negative it is, the more reason to think that investment was insensitive to changes in consumption in the United States in the 1930s. The left plot of Figure 14.6 indicates that the unrestricted maximum likelihood estimator θ̂_1 is free of any bias in the median. The q-q plot to the right in the figure is similar to the other q-q plots we looked at for various values of the parameters, and indicates that the profile deviance evaluated at the assumed value θ_1^0,



Figure 14.6 A scatter of simulated θ̂_1 against θ_1, with binned medians as black squares nearly following the diagonal (shown); and a q-q plot of the profile deviance at the true value θ_1 = 0 against the χ^2_1 distribution.

D(θ_1^0), is nearly χ^2_1 distributed. Assuming this distribution, the unrestricted confidence curve cc(θ_1) = Γ_1(D(θ_1)) is obtained (not shown here). This is assumed equitailed, and is converted to a confidence distribution. Setting C(θ_1) = 0 for θ_1 < 0, the confidence distribution shown by a solid line in Figure 14.5 is obtained. We should be 90.03% confident that θ_1 = 0, that is, that investment was insensitive to changes in consumption in the prewar period! Sims’s Bayesian analysis gave a different picture.

14.5 Olympic unfairness

The Olympic 500-m and 1000-m sprint competitions are the Formula One events of speedskating, watched by millions on television along with the fans in the stands. The athletes skate in pairs, in inner and outer lanes, switching lanes on the exchange part of the track; see Figure 14.7. They reach phenomenal speeds of up to 60 km/h and naturally experience difficulties negotiating the curves, which have radii of ca. 25 m and ca. 29 m.

For the 500-m, which takes a top male athlete 34.5 seconds and the best female skaters less than 37.0 seconds, the last inner turn is typically the most difficult part of the race, as this book excerpt (from Bjørnsen, 1963, p. 115) testifies:

He drew lane with anxious attentiveness – and could not conceal his disappointment: First outer lane! He threw his blue cap on the ice with a resigned movement, but quickly contained himself and picked it up again. With a start in inner lane he could have set a world record, perhaps be the first man in the world under 40 seconds. Now the record was hanging by a thin thread – possibly the gold medal too. At any rate he couldn’t tolerate any further mishaps.



Figure 14.7 The speedskating rink: The skater starting in inner lane also ends his 1000-m race in that lane, and skates three inners and two outers; the other skater skates two inners and three outers.

The problem, for Evgenij Grishin and the other sprinters, is that of the centrifugal forces (cf. the centripetal acceleration formula a = v^2/r), and meeting such challenges precisely when fatigue starts to kick in, making skaters unable to follow the designated curve and in effect skating a few metres extra. The aforementioned excerpt also points our attention to the fact that luck can have something to do with one’s result; for the 500 m many skaters prefer the last outer lane (that is, starting in the inner) to the last inner lane (starting in the outer). Thus the Olympic 500-m event carried a certain random element of potential unfairness to it, as a random draw determined whether a skater should have last inner or last outer lane. Valiant statistical efforts of Hjort (1994c) demonstrated that the difference between last inner and last outer is statistically and Olympically significant, to the tune of ca. 0.06 second (enough of a difference to make medals change necks). Owing in part to these efforts and detailed reporting, along with careful deliberation and negotiations within the International Skating Union and the International Olympic Committee, these Olympic all-on-one-card rules were finally changed. From 1998 Nagano onwards, the skaters need to race the 500-m twice, with one start in inner and one in outer lane, and with the sum of their times determining medals and the final ranking.

The Olympic 1000-m event is, however, still arranged in an all-on-one-card fashion, and we shall learn here that this is actually unfair. To examine and assess the implied level of Olympic unfairness we shall use data from the annual World Sprint Speed Skating Championships, where the athletes skate 500 m and 1000 m on both Saturday and Sunday, and with one start in inner and one in outer lane. Table 14.3 indicates the type of data we may collect and analyse for these purposes. We model these as follows, for the pair of results for skater no. i:

Y_{i,1} = a_1 + bu_{i,1} + cv_{i,1} + (1/2)d z_{i,1} + δ_i + ε_{i,1},
Y_{i,2} = a_2 + bu_{i,2} + cv_{i,2} + (1/2)d z_{i,2} + δ_i + ε_{i,2}.   (14.5)

Here u_{i,1} and v_{i,1} are the 200-m and 600-m passing times for the Saturday race, and similarly with u_{i,2} and v_{i,2} for Sunday, and δ_i is a component reflecting the individual ability, centred on zero; the best skaters have negative and the less splendid skaters positive values of δ_i. We model these as stemming from a N(0, κ^2) distribution, whereas the ε_{i,1} and ε_{i,2} are i.i.d. random fluctuation terms from a N(0, σ^2). We take a_1 different from a_2 in (14.5) to reflect the fact that skating conditions are often different from one day to the next, regarding


Table 14.3. Saturday and Sunday races, with 200-m and 600-m passing times and with i and o indicating start in inner and outer lanes, respectively, for the six best men of the World Sprint Championship in Calgary, January 2012

hero               200    600    1000        200    600    1000
1 S. Groothuis   i 16.61  41.48  1:07.50   o 16.50  41.10  1:06.96
2 Kyou-Hyuk Lee  i 16.19  41.12  1:08.01   o 16.31  40.94  1:07.99
3 T.-B. Mo       o 16.57  41.67  1:07.99   i 16.27  41.54  1:07.99
4 M. Poutala     i 16.48  41.50  1:08.20   o 16.47  41.55  1:08.34
5 S. Davis       o 16.80  41.52  1:07.25   i 17.02  41.72  1:07.11
6 D. Lobkov      i 16.31  41.29  1:08.10   o 16.35  41.26  1:08.40

temperature, wind and ice quality. Finally,

z_{i,1} = −1 if no. i starts in inner track on day 1, and z_{i,1} = 1 if he starts in outer track on day 1,

with a similar z_{i,2} for day 2 (and, due to the International Skating Union rules, z_{i,2} = −z_{i,1}). The crucial focus parameter here is d; a skater needs the extra amount (1/2)d if he starts in the outer lane compared to the extra amount −(1/2)d if he starts in the inner lane (we consider the men here). In a fair world, d would be zero, or at least close enough to zero that it would not matter for the podium and ranking. We shall estimate the d parameter and assess its precision for each of 14 different World Championships and then exploit meta-analysis methods of Chapter 13 to provide a suitable overall analysis.

Model (14.5) has both fixed and random effects, and may be written

(Y_{i,1}, Y_{i,2})^t ∼ N_2((x_{i,1}^t β, x_{i,2}^t β)^t, Σ), with Σ = [σ^2 + κ^2, κ^2; κ^2, σ^2 + κ^2],

with

x_{i,1} = (1, 0, u_{i,1}, v_{i,1}, (1/2)z_{i,1})^t,   x_{i,2} = (0, 1, u_{i,2}, v_{i,2}, (1/2)z_{i,2})^t,

and β = (a_1, a_2, b, c, d)^t. The interskater correlation parameter ρ = κ^2/(σ^2 + κ^2) plays a crucial role. There are seven parameters, though for the present purposes focus lies solely on the unfairness parameter d. A simpler estimation method would be based on the differences Y_{i,2} − Y_{i,1} only, but this delivers a less precise estimate, as discussed in Hjort (1994c); thus we are willing to fit and analyse a collection of seven-parameter models for each event, even though what we shall keep from these analyses are essentially only the confidence distributions for the d parameter, one for each World Championship.

The model density for the two results of skater i may be written

fi = (2π)−1|�|−1/2 exp

{− 1

2

(yi,1 − xt

i,1β

yi,2 − xi,2tβ

)t

�−1

(yi,1 − xt

i,1β

yi,2 − xi,2tβ

)}and leads to a log-likelihood of the form

\[
\ell_n(\beta,\sigma,\rho) = -2n\log\sigma - \tfrac12 n \log\frac{1+\rho}{1-\rho}
- \frac{1}{2\sigma^2}\,\frac{Q_n(\beta)}{1+\rho},
\]



Table 14.4. Estimates of the unfairness parameter d along with its estimated standard deviation, for fourteen consecutive World Sprint Championships for men. The table also provides the Wald ratios t = d̂/se.

        d̂      se      t             d̂      se      t
2014  0.172  0.061  2.802    2007  0.096  0.063  1.513
2013  0.288  0.067  4.316    2006  0.062  0.064  0.974
2012  0.254  0.077  3.307    2005  0.196  0.071  2.767
2011  0.127  0.078  1.626    2004  0.164  0.086  1.905
2010  0.136  0.066  2.039    2003  0.168  0.057  2.929
2009  0.124  0.059  2.091    2002  0.111  0.063  1.761
2008  0.042  0.052  0.801    2001  0.214  0.068  3.150

with

\[
Q_n(\beta) = \sum_{i=1}^n
\begin{pmatrix} y_{i,1} - x_{i,1}^{\mathrm t}\beta \\ y_{i,2} - x_{i,2}^{\mathrm t}\beta \end{pmatrix}^{\mathrm t}
\begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}
\begin{pmatrix} y_{i,1} - x_{i,1}^{\mathrm t}\beta \\ y_{i,2} - x_{i,2}^{\mathrm t}\beta \end{pmatrix}.
\]

Some analysis reveals that the maximum likelihood estimate of β must be β̂(ρ), where

\[
\hat\beta(\rho) = \{M_{11} + M_{22} - \rho(M_{12}+M_{21})\}^{-1}\{S_{11} + S_{22} - \rho(S_{12}+S_{21})\}
\]

is found from minimisation of Q_n(β) for fixed ρ, in terms of M_{uv} = n^{-1}\sum_{i=1}^n x_{u,i} x_{v,i}^{\mathrm t} and S_{uv} = n^{-1}\sum_{i=1}^n x_{u,i} Y_{v,i}. The maximum likelihood estimates σ̂, ρ̂ of σ, ρ may then be found by maximising ℓ_n(β̂(ρ), σ, ρ). We furthermore record that various details and arguments parallelling those in Hjort (1994c) yield

\[
\hat\beta \approx_d \mathrm{N}_5\bigl(\beta,\, n^{-1}\sigma^2(1+\rho)\,M_\rho^{-1}\bigr)
\quad\text{where}\quad
M_\rho = M_{11} + M_{22} - \rho(M_{12}+M_{21}).
\]

In particular, confidence intervals can be read off from this, as can the full confidence distribution, which for these purposes is sufficiently accurately given by C_j(d_j) = Φ((d_j − d̂_j)/se_j) for the different championships, with se_j the estimated standard deviation of d̂_j. See Table 14.4. Our analyses have handled outliers carefully (parallelling the methods used in Hjort (1994c) for the 500-m case), as the aim is to provide accurate assessment for all 'normal' top races of the best skaters in the world.
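These normal-approximation confidence distributions are cheap to evaluate. A minimal sketch, using the 2014 entry of Table 14.4 (d̂ = 0.172, se = 0.061) as the worked example; the function names are ours:

```python
from math import erf, sqrt

def C(d, dhat, se):
    """Normal-approximation confidence distribution Phi((d - dhat)/se)."""
    return 0.5 * (1.0 + erf((d - dhat) / (se * sqrt(2.0))))

def interval(dhat, se):
    """Equi-tailed 90% confidence interval dhat -/+ z * se."""
    z = 1.6449  # Phi^{-1}(0.95)
    return dhat - z * se, dhat + z * se

# 2014 row of Table 14.4
lo, hi = interval(0.172, 0.061)
```

The interval endpoints are exactly the points where C crosses 0.05 and 0.95.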

Figure 14.8 (left panel) gives a graphical display of these 14 estimates and their precision, via 90% confidence intervals. It appears clear that the 1000-m is indeed unfair; the d̂ estimates are consistently positive, some of them significantly so, with an overall weighted average of as much as 0.146 second. This corresponds to about a 2 m difference, with advantage for the inner lane, and is more than enough for medals to change hands (in the 1992 Olympics, for example, the three medallists were within 0.07 second); see also Hjort and Rosa (1998).

Obviously the 0.146 second estimated advantage is an overall figure across both events and skaters, and different skaters handle the challenges differently. Among those having already commented on the potential difference between inner and outer conditions are four Olympic gold medallists, in their autobiographies (Holum, 1984, Jansen, 1994, Le May Doan, 2002, Friesinger, 2004). Dianne Holum (1984, p. 225) rather clearly blames



Figure 14.8 Estimates and 90% intervals are shown for the unfairness parameter d, for 14 consecutive World Sprint Championship tournaments, 2001 to 2014 (left), along with the overall unfairness estimate 0.146 second; the confidence distribution for the standard deviation parameter τ of the heterogeneity model of Section 13.3 (right).

starting in the outer on her Sunday 1000-m for losing the World Championship to Monika Pflug. The first notable aspect is that the 1000-m is the only nonsymmetric race, in the sense that the two skaters do not meet the same number of inner and outer turns; the inner lane person has three inners and two outers, while the outer lane person has two inners and three outers; see Figure 14.7. The second is that the inner lane starter has an easier start, with a longer straight-ahead distance before meeting the curve. Finally, many sprinters are at the very end of their strength after completing two of their two-and-a-half laps, making the last outer curve a daunting one.

It remains to be seen whether the already anticipated inner advantage is truly significant, in the statistical and hence Olympic sense. For this it is tempting to form a final confidence interval based on the estimated standard error of the overall weighted estimate, via inverse variances; this overall estimate is 0.1417 with standard error 0.0166, computed in the traditional fashion; see Exercise 14.2. This method does, however, trust a certain crucial underlying assumption, namely that the overall population parameter d0 is precisely the same across all World Championships. The more cautious meta-analysis setup here amounts to taking the underlying dj from the different events as stemming from a N(d0, τ²), and to finding the confidence distribution for τ from the variation inherent in Table 14.4. This has been carried out in Figure 14.8 (right panel), using the meta-analysis method (13.9) of Section 13.3. There is some evidence that τ = 0, that is, full homogeneity and equality of the dj, but enough doubt about it to warrant a more careful confidence assessment of d0 via the appropriate t-bootstrap method of Section 13.3. This is done for Figure 14.9, which displays all individual confidence curves ccj(dj) along with the overall confidence



Figure 14.9 Confidence curves ccj(d) for the 14 consecutive World Sprint Championship tournaments 2001 to 2014 (dotted and dashed lines), along with the overall confidence curve ccgrand(d), which combines information across events (solid line). Confidence intervals of level 99% may be read off from the horizontal 0.99 line, for both the individual and the overall experiment. In particular, the overall 0.99 interval for the unfairness parameter is [0.102, 0.192].

cc∗(d0) for d0. The 99% interval is [0.102, 0.192], for example, which is very convincing: The Olympic 1000-m race is unfair. The consequent demand of fairness is to let skaters run the 1000-m twice, with one start in the inner lane and one in the outer. This will be communicated to the International Skating Union.
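The fixed-effects overall estimate referred to above is plain inverse-variance weighting of the per-event estimates. A sketch using the rounded entries of Table 14.4 — with rounded inputs the result lands near, but not exactly on, the 0.1417 and 0.0166 quoted from the unrounded analysis:

```python
# Entries of Table 14.4, years 2014 down to 2001 (rounded, as printed)
d  = [0.172, 0.288, 0.254, 0.127, 0.136, 0.124, 0.042,
      0.096, 0.062, 0.196, 0.164, 0.168, 0.111, 0.214]
se = [0.061, 0.067, 0.077, 0.078, 0.066, 0.059, 0.052,
      0.063, 0.064, 0.071, 0.086, 0.057, 0.063, 0.068]

w = [1.0 / s ** 2 for s in se]                      # inverse-variance weights
d0 = sum(wi * di for wi, di in zip(w, d)) / sum(w)  # pooled fixed-effects estimate
se0 = (1.0 / sum(w)) ** 0.5                         # its standard error
```

This is the homogeneity-trusting computation; the t-bootstrap assessment of the text widens it to account for possible between-event variation τ.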

Above we have used the seven-parameter model (14.5) to analyse several aspects of the speedskating results data, across fourteen different World Sprint Championships, and also other quantities than the main focus parameter d are of interest, for example, for prediction purposes. Exercise 14.2 looks into the interskater correlation parameter, for example. We mention that simpler models may be entertained when focus is only on d, for example, one-dimensional regression models based on the differences Y_{i,2} − Y_{i,1}. This leads, however, to less precision in the final estimates of d, as witnessed also by appropriate model selection analysis using the focussed information criterion (FIC) of Claeskens and Hjort (2003, 2008). Finally we note that various goodness-of-fit tests have been carried out, both graphical and formal, all supporting the view that model (14.5) is fully adequate.

14.6 Norwegian income

Of the 34 Organisation for Economic Co-operation and Development (OECD) countries, Norway had the most compressed wage distribution in 2007, as measured by the ratio of the ninth to the first decile in the distribution of gross wage. Together with the other Scandinavian countries, Norway also has exceptionally high employment (Barth and Moene, 2012). Wilkinson and Pickett (2009) argue that income inequality has negative effects on societies: eroding trust, increasing anxiety and illness and encouraging excessive consumption. How income is distributed in society is indeed an important issue. We will have a look at Norway, particularly with respect to how the income distribution has developed over recent years.

Income inequality is measured in different ways. The OECD lists countries by the Gini coefficient both before and after tax and transfers. In their list for the late 2000s Slovenia has the lowest Gini coefficient of 0.24 (after tax and transfers). Then come Denmark and Norway at 0.25. The Gini coefficient is based on the Lorenz curve L(p), which is defined as

\[
L(p) = \int_0^p F^{-1}(v)\,\mathrm dv \Big/ \int_0^1 F^{-1}(v)\,\mathrm dv,
\]

where F is the cumulative income distribution function; that is, L(p) is the relative share of the total income earned by the fraction p with the lowest income. A completely even income distribution makes L(p) = p. The more uneven the distribution, the more convex is the Lorenz curve from L(0) = 0 to L(1) = 1. The Gini index is defined as twice the area between the diagonal and the Lorenz curve.

We have access to tax data for the Norwegian population and shall see whether income became less evenly distributed in recent years, as it has in many countries. The present study illustrates the use of product confidence curves on large data. The illustration is part of the project 'Inequality, Social background and Institutions' at the Frisch Centre and the Centre of Equality, Social Organization, and Performance (ESOP), University of Oslo. Data were made available by Statistics Norway.

More recent data than for 2010 were not available. The question we ask is whether the income distribution became more uneven over the decade 2000 to 2010. The global financial crisis might have affected income in Norway in 2010, but perhaps not very much, because unemployment remained low during the turbulence. Otherwise the two years do not stand out as more special than most other years. We consider income before tax for individuals at least 18 years of age. Total income is income on capital assets plus other income, comprising mainly wage, social benefits and pension. Tax rates have been fairly stable over the decade. Sample size, which is the number of adult residents in Norway, is respectively n2000 = 3,442,475 and n2010 = 3,805,928.

The income distribution evolves rather slowly in Norway, at least for the large proportion with middle income. Changes are expected in the tails of the distribution. The lower tail might be influenced by the increasing number of immigrants. There has also been some growth in the number of students and retirees. The upper tail of the income distribution is influenced by the increasing private wealth and also the performance of the financial market. Top wages have also increased in recent years. The Lorenz curve and the Gini index are not particularly good at picking up minor changes in the tails of the income distribution. We will thus instead look at how the quantiles evolve over the decade.

Let q_t(p) be the p-quantile in the observed income distribution in year t. The curve

\[
\psi(p) = q_{2010}(p)/q_{2000}(p) \quad\text{for } p = 0.001, 0.002, \ldots, 0.999
\]

reflects the relative changes in the income distribution over the decade. It is shown for everyone by the bold line in Figure 14.10, and also separately for males and females. Consumer prices have grown by a factor of 1.22 over the decade. The majority has thus had a substantial increase in purchasing power. Males have had a higher growth in income than females, and they are more sensitive



Figure 14.10 Growth in total income, ψ, from 2000 to 2010. All Norwegians 18 years or older (bold line), males (dashed line), females (hatched line). Relative quantiles for p < 0.007 are not shown, since below this point total income was negative or zero.

to the tax system, which has led to the kinks in the trajectory between 10% and 20% for males but not for females. At the lower end, p < 0.1, the quantiles have grown by less than 50%, and below p = 0.03 the income quantiles have declined over the decade. This is not surprising in view of the increasing immigration to Norway. What really is surprising is that the income quantiles above p = 0.99 have grown less than those for the middle 89% of the income quantiles. A similar graph for other income shows the opposite pattern at the upper end, with progressively higher other income (i.e., wages) for the very well paid. It is the decline in capital income that makes for this difference. Over the decade the income has grown the most for the lower medium incomes, and at the higher medium, but the lower tail has stretched out while the far upper tail has declined somewhat in relative terms. To investigate whether the dip in growth in total income at the top of the income distribution is accidental, a product confidence curve for the trajectory ψ is useful. For this purpose a statistical model is needed.

What could the statistical model be for our registry data on income, where the whole Norwegian population is observed? If emphasis is on what really has happened in the country, then the data tell the whole story, and there is no model with unknown parameters to estimate. But with a more general economic view of labour market and income formation in economies of the Nordic model, a stochastic model is needed. The rather abstract model is thus that Norway, with its economic history, is a realisation of a random process generating economies of the Nordic type. We will assume this process to be such that the joint distribution of the income quantiles in a year is normal, located at the true quantile



Figure 14.11 Normal probability plots for q2000(p)∗ for p = 0.099, 0.100, 0.101. There is a kink in the data for p = 0.10 due to a multiple tie in the distribution. The lower right panel is the cumulative distribution F of the maximal marginal confidence curve over p = 0.100, 0.101, ..., 0.999, obtained from bootstrapping, with the curve P(F ≤ f) = f^27 added.

trajectory, and with variances and covariances as free parameters. Observed quantiles for 2000 and 2010 are independent. Our data are extensive, and the asymptotic normality of quantiles lends support to the model. It is also supported by bootstrapping.

For each of the two years, the registry has been bootstrapped nonparametrically in B = 200 replicates. The income quantiles q_t(p)∗ are calculated for each year and each p for each replicate. For most values of p the distribution of q_t(p)∗ is close to normal. Owing to some local extreme densities, and even ties in the income distribution, there are some departures from normality, for example, at position p = 0.10; see Figure 14.11. The many normal plots we looked at are, however, similar to the two other plots in the figure. The statistical model is thus that q̂_t(p) = q_t(p) + σ_t(p)Z_t(p), where Z_t(p) for p = 0.001, 0.002, ..., 0.999 is a zero-mean unit-variance Gaussian process with correlation structure as seen in the bootstrap replicates.
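The nonparametric bootstrap behind the q_t(p)∗ replicates can be sketched as follows, on toy data standing in for the registry (B = 200 as in the text; the function names and the crude order-statistic quantile definition are ours):

```python
import random

def quantile(sorted_xs, p):
    """p-quantile of a sorted sample (simple order-statistic definition)."""
    return sorted_xs[min(int(p * len(sorted_xs)), len(sorted_xs) - 1)]

def boot_ratio(sample2000, sample2010, p, B=200, rng=random):
    """Bootstrap replicates of psi(p) = q2010(p) / q2000(p)."""
    out = []
    n0, n1 = len(sample2000), len(sample2010)
    for _ in range(B):
        b0 = sorted(rng.choices(sample2000, k=n0))  # resample with replacement
        b1 = sorted(rng.choices(sample2010, k=n1))
        out.append(quantile(b1, p) / quantile(b0, p))
    return out
```

As a sanity check, doubling every income between the two years gives bootstrap ratios concentrated around ψ(p) = 2 at every p.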

The product confidence curve might now be constructed for the vector of ψ values for p > 0.007. We use the method of Section 9.9.5. The marginal confidence curve for each value of p is found by the ratio-of-means method discussed in Section 4.6, assuming the observed quantiles to be marginally normally distributed. The marginal variances σ_t(p)² are estimated from the bootstrapped data, and are then assumed known. The profile deviance of


the quotient ψ(p) is

\[
D(\psi(p)) = \frac{\{\hat q_{2010}(p) - \psi(p)\,\hat q_{2000}(p)\}^2}{\sigma_{2010}(p)^2 + \psi(p)^2\,\sigma_{2000}(p)^2},
\]

which is distributed as χ²₁. This deviance accords with that in (4.4) when the two variances are equal. For each bootstrap sample the confidence curve cc∗_p(ψ) = Γ₁(D(ψ(p))) is calculated for p ≥ 0.01 and evaluated at ψ̂(p) = Êq2010(p)∗/Êq2000(p)∗, as estimated from the bootstrapped data. The adjustment curve

\[
K_\psi(x) = P\bigl\{ \max_p \mathrm{cc}^*_p(\psi(p)) \le x \bigr\}
\]

is finally estimated from the bootstrapped trajectories. The cumulative distribution K_ψ is shown in the lower right panel of Figure 14.11. The curve P(F_ψ ≤ f) = K_ψ(f) = f^27 is added to the plot. It fits well, and indicates that there are effectively about 27 independent latent components in the 990-dimensional normal distribution of ψ̂. The product confidence curve is K_ψ(Γ₁(D(ψ(p)))) ≈ Γ₁(D(ψ(p)))^27.

Owing to the large size of the data, the product confidence curve is very narrow. Figure 14.12 shows ψ̂(p), as in Figure 14.10, together with the 99.9% confidence band


Figure 14.12 Left: Income growth from 2000 to 2010 with a simultaneous 0.999 confidence band. Right: Width of simultaneous confidence bands for income growth at confidence 0.999 (upper line) and 0.9. The irregularities are due to irregularities in the variance of the income quantiles.


of the product confidence curve in the left panel. The widths of the 99.9% confidence band and the 90% confidence band are shown in the right panel. The dip at the far right end of ψ̂(p) is evidently not due to random variability in the data. The top 0.1% quantile in the distribution of total income certainly had a weaker growth from 2000 to 2010 than the 90% of the distribution below.

14.7 Meta-analysis of two-by-two tables from clinical trials

Rosiglitazone is an antidiabetic drug used to treat type 2 diabetes mellitus, working as an insulin sensitiser by making certain fat cells more responsive to insulin. Annual sales have reached up to US $2.5 billion. The drug is controversial, however, and as of 2012 it is the subject of more than 13,000 lawsuits against the manufacturing pharmaceutical company in question. Some reports have indicated an association with increased risk of heart attacks. Nissen and Wolski (2007) carried out a meta-analysis of a total of 48 trials examining these issues, in particular generating 48 tables of paired trinomial experiments, where the outcomes tabled for each are myocardial infarction (MI), cardiovascular disease (CVD)-related death, or survival; see Table 14.5.

Previous analyses of these data have created their own challenges for, and even controversies among, statisticians, owing to the particular prominence of zero counts, for which standard methods, in particular those based on large-sample approximations, do not apply or may not work well; also, remedies that involve replacing zeros with a small positive number have their own difficulties and remain ad hoc. For some discussion of these issues, see Nissen and Wolski (2007) and Tian et al. (2009). Our object here is to give a proper meta-analysis of these 48 tables, focussing on the log-odds differences between Rosiglitazone and control groups. We shall do so drawing on the confidence tools developed in Chapter 13, and, in particular, our methods are not troubled by the zeros as such. In fact we shall give two different but related analyses; first in the framework of paired binomial experiments, as with Nissen and Wolski (2007) and Tian et al. (2009), and then using meta-analysis of Poisson tables.

Paired binomials

Pooling together cases of MI and CVD-related deaths, we are faced with n = 48 2×2 tables, of the type

\[
Y_{i,0} \sim \mathrm{Bin}(m_{i,0}, p_{i,0}) \quad\text{and}\quad Y_{i,1} \sim \mathrm{Bin}(m_{i,1}, p_{i,1}). \qquad (14.6)
\]

Here p_{i,0} and p_{i,1} are the probabilities of MI-or-CVD-death for respectively the control group and the Rosiglitazone group, and interest is focussed on the log-odds

\[
\theta_i = \log\frac{p_{i,0}}{1-p_{i,0}} \quad\text{and}\quad
\theta_i + \psi_i = \log\frac{p_{i,1}}{1-p_{i,1}}. \qquad (14.7)
\]

A natural meta-analysis model here is to take the log-odds difference parameter ψ constant across these similar studies, and reaching confidence inference for this parameter is indeed what we are aiming for in the following. Practitioners might prefer the odds ratio


Table 14.5. The 48 tables discussed and analysed in Nissen and Wolski (2007) and Tian et al. (2009), giving sample sizes mi,1 and mi,0 for respectively the Rosiglitazone study groups and the control groups, along with the number of cases ending in MI or CVD-related deaths

ID    m1   MI  death    m0   MI  death
 1   357    2    1     176    0    0
 2   391    2    0     207    1    0
 3   774    1    0     185    1    0
 4   213    0    0     109    1    0
 5   232    1    1     116    0    0
 6    43    0    0      47    1    0
 7   121    1    0     124    0    0
 8   110    5    3     114    2    2
 9   382    1    0     384    0    0
10   284    1    0     135    0    0
11   294    0    2     302    1    1
12   563    2    0     142    0    0
13   278    2    0     279    1    1
14   418    2    0     212    0    0
15   395    2    2     198    1    0
16   203    1    1     106    1    1
17   104    1    0      99    2    0
18   212    2    1     107    0    0
19   138    3    1     139    1    0
20   196    0    1      96    0    0
21   122    0    0     120    1    0
22   175    0    0     173    1    0
23    56    1    0      58    0    0
24    39    1    0      38    0    0
25   561    0    1     276    2    0
26   116    2    2     111    3    1
27   148    1    2     143    0    0
28   231    1    1     242    0    0
29    89    1    0      88    0    0
30   168    1    1     172    0    0
31   116    0    0      61    0    0
32  1172    1    1     377    0    0
33   706    0    1     325    0    0
34   204    1    0     185    2    1
35   288    1    1     280    0    0
36   254    1    0     272    0    0
37   314    1    0     154    0    0
38   162    0    0     160    0    0
39   442    1    1     112    0    0
40   394    1    1     124    0    0
41  2635   15   12    2634    9   10
42  1456   27    2    2895   41    5
43   101    0    0      51    0    0
44   232    0    0     115    0    0
45    70    0    0      75    0    0
46    25    0    0      24    0    0
47   196    0    0     195    0    0
48   676    0    0     225    0    0


itself, that is,

\[
\mathrm{OR} = \rho = \frac{p_{i,1}/(1-p_{i,1})}{p_{i,0}/(1-p_{i,0})}
= \frac{\exp(\theta_i+\psi)}{\exp(\theta_i)} = \exp(\psi),
\]

rather than ψ. We find it practical to work out the mathematics in terms of ψ before reverting to OR when presenting graphs and confidence distributions, and so forth.

We first point out that when the probabilities involved are not close to zero or one, the usual large-sample techniques and results apply, with well-working approximations to normality, and so forth; see Exercise 8.8.4 for some details. Such methods are implemented in various software packages and are in frequent use. These approximations do not work for probabilities close to zero, however, with small y_{i,0} and y_{i,1}, for which we need to develop a different type of machinery.

The case of a single two-by-two table was analysed in Example 8.1. Using formulae developed there, we see that the log-likelihood contribution from table i is

\[
y_{i,1}\psi + z_i\theta_i - m_{i,0}\log\{1+\exp(\theta_i)\} - m_{i,1}\log\{1+\exp(\theta_i+\psi)\},
\]

with z_i = y_{i,0} + y_{i,1}. As demonstrated in that example, there is also an optimal confidence distribution for ψ, and hence for OR = exp(ψ), for that table alone, taking the form

\[
C_i(\rho) = P_\rho\{Y_{i,1} > y_{i,1,\mathrm{obs}} \mid z_{i,\mathrm{obs}}\}
+ \tfrac12 P_\rho\{Y_{i,1} = y_{i,1,\mathrm{obs}} \mid z_{i,\mathrm{obs}}\}.
\]

It involves the conditional distribution of Y_{i,1} given Z_i = z_i, found after some algebra to be the noncentral hypergeometric distribution

\[
g(y_1 \mid z_i) = \frac{\binom{m_{i,0}}{z_i-y_1}\binom{m_{i,1}}{y_1}\exp(\psi y_1)}
{\sum_{y_1'=0}^{z_i} \binom{m_{i,0}}{z_i-y_1'}\binom{m_{i,1}}{y_1'}\exp(\psi y_1')} \qquad (14.8)
\]

for y_1 = 0, 1, ..., min(z_i, m_{i,1}). Combining these individual log-likelihood contributions, then, one finds the full log-likelihood ℓ_n(ψ, θ_1, ..., θ_n), which can be expressed as

\[
\psi B_n + \sum_{i=1}^n \theta_i z_i
- \sum_{i=1}^n \bigl[ m_{i,0}\log\{1+\exp(\theta_i)\} + m_{i,1}\log\{1+\exp(\theta_i+\psi)\} \bigr],
\]

with B_n = \sum_{i=1}^n y_{i,1}. By the optimality theory of Chapters 5 and 8, therefore, there is an optimal overall confidence distribution for ψ, and hence for ρ = exp(ψ), utilising information from all tables, taking the form

\[
C_n^*(\rho) = P_\rho\{B_n > B_{n,\mathrm{obs}} \mid z_{1,\mathrm{obs}}, \ldots, z_{n,\mathrm{obs}}\}
+ \tfrac12 P_\rho\{B_n = B_{n,\mathrm{obs}} \mid z_{1,\mathrm{obs}}, \ldots, z_{n,\mathrm{obs}}\}. \qquad (14.9)
\]

This curve can be computed via simulation of the distribution of B_n given the observed values of z_1, ..., z_n, which can be accomplished via the result (14.8) for each single Y_{i,1} | z_{i,obs}. The resulting power optimal confidence distribution is displayed as the thicker curve in Figure 14.13. The associated optimal confidence curve cc∗_n(OR) = |2C∗_n(OR) − 1| is displayed (with solid line) in Figure 14.14.



Figure 14.13 Confidence distributions for each individual odds ratio ORi, see (14.7), based on Table 14.5. The thicker line is the optimal overall confidence distribution for the common odds ratio, computed via meta-analysis and (14.9).


Figure 14.14 Two confidence curves for the common odds ratio ρ = exp(ψ) are shown. The solid line is that of the guaranteed most powerful procedure cc∗_n(ρ) associated with C∗_n(ρ) of (14.9), and the dashed line has been reached by the χ²₁ transform of the overall profile deviance, involving maximisation over all θ1, ..., θn.


Remark 14.1 What to do with null-tables?

"What to add to nothing?" ask Sweeting et al. (2004), when discussing how to handle count data with zeroes in the meta-analysis context. There has actually been a wide discussion in the statistical literature regarding how to handle cases where both of the paired groups have yielded zero outcomes, that is, where y0 = 0 and y1 = 0 in the notation given earlier; see again Nissen and Wolski (2007) and Tian et al. (2009). This discussion touches both operational and conceptual issues; clearly, having observed y = 0 in a binomial experiment is quite informative, but it is perhaps not entirely clear whether one can learn anything about the difference in log-odds when both y0 = 0 and y1 = 0, or whether such a case helps the combined information across a collection of tables. A principled answer may be given now, in that we are applying a general theorem about a certain inference strategy being optimal with respect to confidence power, and we may simply inspect this optimal method and see what it actually does in the present case. This optimal strategy rests on the conditional distribution of Bn = ∑i Yi,1 given z1,obs, ..., zn,obs, and for those tables where zi,obs = 0 we simply must have Yi,1 = 0. It follows that the relevant distribution in (14.9) is identical to that obtained when the null tables are ignored. Thus in Figure 14.14 we have worked with the conditional distribution of ∑i Yi,1 for those 40 among the 48 two-by-two tables for which zi,obs ≥ 1. Note that the fixed effect model (14.6) is assumed. In other models, say a random effects model of the beta-binomial type, the null tables might contribute some information.

Another approach to reaching confidence inference for ψ is to profile out each of the θi, that is, to maximise ℓ(ψ, θ1, ..., θn) over, in the present case, all 40 extra parameters θi. This can be accomplished via numerical optimisation, yielding the curve

\[
\ell^*_{n,\mathrm{prof}}(\psi) = \max\{\ell_n(\psi,\theta_1,\ldots,\theta_n) : \text{all } \theta_1,\ldots,\theta_n\}. \qquad (14.10)
\]

The large-sample theory worked with and developed in Chapters 2–4 primarily aims at situations in which the number of unknown parameters is low compared to sample size, and it is not clear a priori whether, e.g., the χ²₁ result of Theorem 2.4 is applicable here, with 40 parameters being maximised over. Examining that theorem and its proof one learns, however, that chances are good that the χ²₁ approximation is valid here, in that there is enough information about most of these parameters in the data, leading to a large 41-dimensional log-likelihood function that is after all approximately quadratic, and it is this fact that lies at the heart of the proof of the relevant result. For the present case the confidence curve obtained from the profile deviance agrees very well with the optimal one; see Figure 14.14. For a situation characterised by having very small mi,1 values and where the exact confidence curve is not close to the approximation based on the chi-squared transform of the deviance, see Exercise 14.5.

We also point out that in this and similar situations there is a difference between the maximum likelihood method and what we may term the maximum conditional likelihood strategy. The latter is based on multiplying together the different g(yi,1 | zi) of (14.8), viewing the yi,1 as conditional on the zi. This yields the estimator ÔRmcl = exp(ψ̂mcl), behaving optimally inside this one-parameter family. One may indeed show that this estimator is consistent, whereas the full maximum likelihood estimator ÔRml = exp(ψ̂ml) is in trouble, having a certain asymptotic bias (Breslow, 1981). The bias is small with reasonably sized mi,0 and mi,1. We note that because the maximum conditional likelihood function must be a monotone function of Bn = ∑i Yi,1, given z1, ..., zn, from (14.8), the optimal confidence distribution (14.9) is identical to that based on ÔRmcl. This estimator has an approximately normal distribution, by general theory, which is here related to the fact that it is in one-to-one correspondence with Bn, which is a sum of independent components from different noncentral hypergeometric distributions. Under reasonable Lindeberg type conditions, see Section A.3, a convenient approximation emerges,

\[
C_n^*(\rho) \doteq \Phi\Bigl( \frac{\xi_n(\rho) - \sum_{i=1}^n y_{i,1,\mathrm{obs}}}{\tau_n(\rho)} \Bigr),
\]

where ξn(ρ) = ∑i ξ(ρ, zi) and τn(ρ)² = ∑i τ(ρ, zi)² are the mean and variance of Bn computed under ρ, involving the mean and variance of (14.8), computed as finite sums.

Meta-analysis of Poisson counts

The methods we developed earlier for meta-analysis of paired binomial experiments are by construction valid regardless of where in (0,1) the underlying probabilities pi,0, pi,1 are situated. In the case under examination here it is, however, clear that these probabilities for heart attacks and CVD-related deaths are, luckily, small; cf. Table 14.5. This means that we are in the 'Poisson domain' of the binomial distributions, with large m and small p; cf. Exercise 14.6.

A natural modelling alternative to (14.6) and (14.7) is therefore

Yi,0 ∼ Pois(ei,0λi,0) and Yi,1 ∼ Pois(ei,1λi,1), (14.11)

with known exposure weights ei,0 and ei,1 reflecting the sample sizes, and where the natural parallel assumption to that of a common log-odds difference in (14.7) is λi,1 = λi,0 γ across tables. This involves a common proportionality parameter determining the overall difference in risk between group zero (the control groups) and group one (those taking the insulin drug in question). The exposure numbers ei,0 and ei,1 are a matter of scale and interpretation of risk. We may, for example, write mi,0 pi,0 as ei,0 λi,0 with ei,0 = mi,0/100, and correspondingly mi,1 pi,1 as ei,1 λi,1; this lets exposure be measured in terms of '100 patients' and risk in terms of 'MI or CVD-death rate per 100 patients'. Conceivably also other relevant aspects may be factored into the exposure numbers ei,0 and ei,1, such as covariates that may reflect or explain why some of the studies listed in Table 14.5 appear to have higher risk than others.

Consider first one such pair,

Y0 ∼ Pois(e0λ0) and Y1 ∼ Pois(e1λ1), with λ1 = λ0γ .

The likelihood may be expressed as

$$L = \exp(-e_0\lambda_0 - e_1\lambda_0\gamma)\,(e_0\lambda_0)^{y_0}(e_1\lambda_0\gamma)^{y_1}/(y_0!\,y_1!) = \lambda_0^{y_0+y_1}\gamma^{y_1}\exp\{-(e_0+e_1\gamma)\lambda_0\}\,e_0^{y_0}e_1^{y_1}/(y_0!\,y_1!).$$

For the same reason as for the paired binomial model, optimal confidence inference for the proportionality parameter γ hence proceeds via the conditional distribution of Y1 given



Figure 14.15 Confidence distributions for the proportionality parameter γ are shown for each of the 40 subtables of Table 14.5, along with the overall confidence distribution (solid line) found by the optimal method.

Z = Y0 + Y1 = zobs. Some calculations show that

$$Y_1 \,|\, (Z = z) \sim \mathrm{Bin}\Big(z, \frac{e_1\gamma}{e_0 + e_1\gamma}\Big),$$

see Exercise 14.6. Figure 14.15 displays these optimal confidence distribution functions

$$C_i(\gamma) = P_\gamma\{Y_{i,1} > y_{i,1,\mathrm{obs}} \,|\, z_{i,\mathrm{obs}}\} + \tfrac12 P_\gamma\{Y_{i,1} = y_{i,1,\mathrm{obs}} \,|\, z_{i,\mathrm{obs}}\},$$

one for each of the 40 tables which has at least one of yi,0, yi,1 nonzero. The log-likelihood function for the full dataset becomes

$$\ell_n(\gamma, \lambda_{1,0}, \ldots, \lambda_{n,0}) = \sum_{i=1}^n y_{i,1}\log\gamma + \sum_{i=1}^n \{z_i \log\lambda_{i,0} - (e_{i,0} + e_{i,1}\gamma)\lambda_{i,0}\}.$$

Here we include also the eight tables with null counts (yi,0, yi,1) = (0,0). It follows firstly from Theorem 5.11 that there is a most powerful confidence distribution for γ, namely

$$C_n^*(\gamma) = P\{B_n > B_{n,\mathrm{obs}} \,|\, z_{1,\mathrm{obs}}, \ldots, z_{n,\mathrm{obs}}\} + \tfrac12 P\{B_n = B_{n,\mathrm{obs}} \,|\, z_{1,\mathrm{obs}}, \ldots, z_{n,\mathrm{obs}}\},$$

where Bn,obs is the observed value of Bn = ∑i=1..n Yi,1. In Figure 14.15 this overall confidence is presented with a fatter full line than the others. Second, we may attempt profiling the 41-dimensional log-likelihood function, maximising over the 40 λi,0 for each fixed γ, just as with (14.10). This leads to the explicit formula

$$\ell_n^*(\gamma) = \sum_{i=1}^n \{y_{i,1}\log\gamma - z_i\log(e_{i,0} + e_{i,1}\gamma)\}.$$



Figure 14.16 The figure displays three confidence densities c*n(γ) associated with the optimal confidence distributions C*n(γ), for the risk proportionality parameter of the Poisson models when used for meta-analysis of the data of Table 14.5. These are for MI only (solid line, ML point estimate 1.421); for CVD-related death only (solid line, ML point estimate 1.659); and for the combined group MI + Death (dashed line, ML estimate 1.482). It is apparent that the rosiglitazone drug increases the risk for MI and CVD-related death by more than 40%. We used 3 · 10^5 simulations for each value of γ to compute C*n(γ) with good enough precision for their derivatives to look smooth.

Just as with Figure 14.14 we learn that this second confidence distribution, constructed via the χ²₁ transform on the resulting deviance function D*(γ) = 2{ℓ*n(γ̂) − ℓ*n(γ)}, is amazingly close to the optimal confidence curve using C*n(γ). Here γ̂ = 1.482 is the overall maximum likelihood estimate of γ. We do not show these cc(γ) curves here, but for variation we display in Figure 14.16 the confidence density c*n(γ) associated with the optimal confidence distribution C*n(γ). The figure also displays the corresponding confidence densities for the two subgroups of MI alone (which has point estimate 1.421) and CVD-related death alone (which has point estimate 1.659). For a fuller discussion, along with further development of the Poisson model based methods, see Cunen and Hjort (2015).

Our analysis rather clearly indicates that rosiglitazone has significantly worrisome side effects, increasing the risk for MI and CVD-related death by around 50%.

Remark 14.2 Multinomial models

The analyses of these and similar data in the literature have essentially reduced the problem to 'one type of outcome at a time', which then leads to the 2 × 2 tables studied above. These data are really 2 × 3 tables, however, because one considers each of (MI, death, neither) for each of the studies. A fuller and joint analysis would hence start with paired trinomial

Page 74: 12...−3.842,−3.284,−0.278,2.240,3.632, and pointed to certain complexities having to do with multimodal likelihood functions. Here we use the two-parameter Cauchy model, with

“Schweder-Book” — 2015/10/21 — 17:46 — page 409 — #429�

14.8 Publish (and get cited) or perish 409

tables, say

$$(U_{i,0}, V_{i,0}, m_{i,0} - U_{i,0} - V_{i,0}) \sim \mathrm{trinom}(m_{i,0}, p_{i,0}, q_{i,0}, 1 - p_{i,0} - q_{i,0}),$$

$$(U_{i,1}, V_{i,1}, m_{i,1} - U_{i,1} - V_{i,1}) \sim \mathrm{trinom}(m_{i,1}, p_{i,1}, q_{i,1}, 1 - p_{i,1} - q_{i,1}).$$

We note that the Poisson analyses above in a sense have taken care of the n × 2 × 3 issue already, as generally speaking trinomial variables U, V of the preceding type tend to independent Poisson variables with rates λ1, λ2, when sample sizes become large and probabilities small, such that mp1 and mp2 tend to λ1 and λ2; see Exercise 14.6.

Our model is now

$$p_{i,0} = \frac{\exp(a_i)}{1 + \exp(a_i) + \exp(b_i)}, \qquad q_{i,0} = \frac{\exp(b_i)}{1 + \exp(a_i) + \exp(b_i)},$$

$$p_{i,1} = \frac{\exp(a_i + \psi_1)}{1 + \exp(a_i + \psi_1) + \exp(b_i + \psi_2)}, \qquad q_{i,1} = \frac{\exp(b_i + \psi_2)}{1 + \exp(a_i + \psi_1) + \exp(b_i + \psi_2)},$$

with parameters ψ1 and ψ2 dictating the essential and systematic difference between regime 0 (the control group) and regime 1 (the study group taking the particular medicine in question). The methodological and also practical point here is that we again can work out and implement the optimal confidence distributions for ψ1 and ψ2, utilising the power theory of Section 5.5. See also the discussion in Schweder and Hjort (2013a).
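In code, these cell probabilities amount to a three-category logistic transform; a small Python sketch (function name and parameter values are ours, chosen for illustration), where ψ1 = ψ2 = 0 recovers the control regime:

```python
from math import exp

def trinom_cells(a, b, psi1=0.0, psi2=0.0):
    """Cell probabilities (p, q) in the paired trinomial model; the third
    cell is 1 - p - q. psi1 and psi2 shift to the treatment regime."""
    den = 1.0 + exp(a + psi1) + exp(b + psi2)
    return exp(a + psi1) / den, exp(b + psi2) / den

p0, q0 = trinom_cells(-2.0, -1.5)             # control regime
p1, q1 = trinom_cells(-2.0, -1.5, 0.4, 0.3)   # treatment regime, positive shifts
```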

14.8 Publish (and get cited) or perish

The machines of automated information gathering do not overlook those in the academic professions. In particular, the various publications of the world's scholars are found and continuously added to databases, across a wide range of fields, from a long list of international academic journals as well as from technical reports from institutions and departments, the big and the small, the famous and the obscure. Among these systems and websites are the ISI Web of Science, CiteSeerX, Scirus, GetCited, Google Scholar, and ResearchGate. Their use includes that of quickly finding publications on various issues, of discovering that something one thought was a new idea has been dealt with ten years ago, and so forth, but also that of accessing and assessing the cumulative research output of individual scholars. This is statistical information that might be utilised to measure influence, identifying different clusters of scholars, measuring in which ways the business of medical research acts differently from that of, for example, pure mathematics, and so forth. In a society influenced by and partly obsessed by counting and recognition, where both departments and individuals may be rewarded for certain types of publications and their imagined impact, bibliometric, citographic and culturomic analyses will become more visible and more sophisticated, for good and for worse.


In this section we shall briefly illustrate our confidence distribution methodology for analysing patterns of such profiles, for convenience and practical ease choosing such profiles from Google Scholar for our data. We shall use a certain three-parameter model that appears to have some advantages over models suggested in Breuer and Bowen (2014). This application has the feature that our confidence distributions emerge via estimation methods different from the likelihood machinery we have used for the other cases of this chapter. Though our illustration concerns publication profiles for scholars, variations of the methods used ought to find application for the study of other phenomena of a similar structure, from counting 'Likes' on social media to insurance claims for firms.

Our abc model for citation profiles can be summarised as follows. We take our basic data to be log-counts y of positive publications, that is, the log of citation numbers when the publication in question has been cited at least once. The model then postulates that Y ∼ a(1 − U^b)^c, with U a standard uniform, where a, b, c are positive parameters. One finds the following formulae, for the distribution function and its inverse,

F(y)= 1 −{1 − (y/a)1/c}1/b and F−1(p)= a{1 − (1 − p)b}c. (14.12)
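A direct transcription of the model and of (14.12) (a sketch; the parameter values below are arbitrary):

```python
import random

def F(y, a, b, c):
    """Distribution function F(y) = 1 - {1 - (y/a)^(1/c)}^(1/b) of (14.12)."""
    return 1.0 - (1.0 - (y / a) ** (1.0 / c)) ** (1.0 / b)

def F_inv(p, a, b, c):
    """Quantile function F^{-1}(p) = a {1 - (1 - p)^b}^c."""
    return a * (1.0 - (1.0 - p) ** b) ** c

def draw(n, a, b, c, rng=random.Random(1)):
    """Simulate Y = a (1 - U^b)^c with U standard uniform."""
    return [a * (1.0 - rng.random() ** b) ** c for _ in range(n)]
```

By construction F(F_inv(p)) = p, and all draws fall in (0, a).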

When estimating the a, b, c parameters we avoid using plain maximum likelihood, as that method places equal weight on all data points. Here we are rather more interested in the upper quantiles, both because these higher citations arguably represent the most important work of the scholars studied and because data associated with lower citation counts cannot be fully trusted. Finally the role of the model here is specifically not to attempt to fit the full data distribution, but to be more accurate for these higher quantiles.

An estimation procedure taking these desiderata into account is to minimise a suitable distance function between the empirical and theoretical quantiles, say

$$\int_0^1 \{F_n^{-1}(p) - F^{-1}(p, a, b, c)\}^2\, dw(p).$$

Here $F_n^{-1}(p)$ is the empirical quantile and w(p) an appropriate weight measure, which should give more weight to higher than to lower p. A practical variant is

$$Q_n(a, b, c) = \sum_{j=1}^k \{F_n^{-1}(p_j) - F^{-1}(p_j, a, b, c)\}^2 w(p_j),$$

with weights w(pj) at preselected quantile points pj. For our illustration we have used a simple version of this, with the pj set to 0.5, 0.6, 0.7, 0.8, 0.9, 0.95 and with weights equal to $F_n^{-1}(p_j)$, available in R as quantile(y, quantilevalues); see Table 14.6. Estimates are then found via a nonlinear function minimiser such as nlm. This leads to Figure 14.17, which gives q-q plots

$$(F^{-1}(i/(n+1), \hat a, \hat b, \hat c),\, Y_{(i)}) \quad \text{for } i = 1, \ldots, n$$

for the two scholars in question. As per design the model does not aspire to fit the data well at lower quantiles (the y = 0 data in the figure are the mildly embarrassing papers of the authors that so far have generated precisely 1 citation), but succeeds reasonably well for the higher quantiles.
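The fitting recipe above can be sketched in a few lines. The following Python version simulates data from the model and replaces R's nlm with a deliberately crude random search, so all numbers are purely illustrative:

```python
import random

def F_inv(p, a, b, c):
    """Quantile function of the abc model (14.12)."""
    return a * (1.0 - (1.0 - p) ** b) ** c

def emp_quantile(ys, p):
    """Simple empirical quantile."""
    s = sorted(ys)
    return s[min(len(s) - 1, int(p * len(s)))]

def Q_n(ys, a, b, c, ps):
    """Weighted quantile-distance criterion, weights w(p_j) = F_n^{-1}(p_j)."""
    return sum((emp_quantile(ys, p) - F_inv(p, a, b, c)) ** 2 * emp_quantile(ys, p)
               for p in ps)

def crude_fit(ys, ps, n_iter=3000, rng=random.Random(2)):
    """Random search over a box; a crude stand-in for a proper minimiser."""
    best, best_q = (1.0, 1.0, 1.0), Q_n(ys, 1.0, 1.0, 1.0, ps)
    for _ in range(n_iter):
        abc = (rng.uniform(1, 10), rng.uniform(0.1, 2), rng.uniform(0.5, 3))
        q = Q_n(ys, *abc, ps)
        if q < best_q:
            best, best_q = abc, q
    return best, best_q

rng = random.Random(1)
ys = [5.0 * (1.0 - rng.random() ** 0.6) ** 1.5 for _ in range(400)]  # true (5, 0.6, 1.5)
ps = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
(est_a, est_b, est_c), q_best = crude_fit(ys, ps)
```

In practice one would of course use a proper optimiser (nlm in R, or a Nelder–Mead routine) rather than this random search.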

The a parameter of (14.12) is primarily related to the scholar's production volume, so to speak, whereas different scholars, with different fame and hence different a parameters, might exhibit similar types of shapes, that is, might have similar (b, c) parameters. A


Table 14.6. Quantiles of the log-positive-count Google Scholar citation distributions for Schweder and Hjort, accessed in mid June 2015. The actual empirical values are given along with those fitted via the abc model (14.12)

Quantile (%)  Schweder  Fitted  Hjort  Fitted
50            2.013     2.087   2.398  2.458
60            2.639     2.474   2.890  2.862
70            2.859     2.909   3.401  3.318
80            3.296     3.436   3.807  3.879
90            4.290     4.175   4.700  4.693
95            4.734     4.770   5.389  5.387


Figure 14.17 Log-positive-count distributions for Schweder and Hjort, with citation profiles as of June 2015, plotted as (ŷ(i), y(i)) with ŷ(i) = F^{-1}(i/(n+1), â, b̂, ĉ), using the abc model (14.12).

production volume independent measure of 'stretch' in the upper quantiles of a scholar's profile is

$$\gamma = \frac{F^{-1}(p_2)}{F^{-1}(p_1)} = \frac{\{1 - (1 - p_2)^b\}^c}{\{1 - (1 - p_1)^b\}^c},$$

for suitable p1 < p2. For our brief and somewhat ad hoc investigations, for the profiles of various colleagues, we have chosen to use this stretch measure, with (p1, p2) = (0.50, 0.95). A high γ might indicate the presence of a few top hits in a scholar's output, for example. Plugging in parameter estimates for the scholars investigated above, we find γ̂ equal to 2.285 and 2.191. We supplement these estimates with confidence distributions for Schweder and Hjort in Figure 14.18. Here we have used parametric bootstrapping methods of Section 7.6. For some of the details required, along with some variations, check Exercise 14.7.



Figure 14.18 Confidence distributions for Schweder and Hjort, with full and dotted lines, for the stretch factor parameter γ = F^{-1}(p2)/F^{-1}(p1), with (p1, p2) = (0.50, 0.95). The 0.05, 0.50, 0.95 confidence points are marked, and are 1.935, 2.311, 2.582 for Schweder and 1.841, 2.218, 2.488 for Hjort.

14.9 Notes on the literature

The application stories of this chapter are largely based on new work of the authors, aiming at showing our methodology at work in a variety of contexts, with different types of models and focus parameters, and so forth. Most of these applications have not been discussed earlier in the literature. Here we provide a few brief pointers to where the data come from and how the problems we address arose.

The golf putting story relates to data from Gelman and Nolan (2002) and Gelman et al. (2004). We generalise a certain geometric argument put forward by these authors and find a two-parameter model that vastly improves on the traditional logistic regression model. Our apparatus provides accurate estimation of how distance matches success rate for top golfers, along with precise assessment of uncertainty. There ought to be other types of applications where a similar machinery would work. We also note the generalisation from 'straight regression' to 'uncertain regression' mentioned at the end of that story, having to do with binomial overdispersion and the usefulness of adding another parameter to account for that. The analysis of abundance and population dynamics for bowhead whales stems from earlier work by Schweder et al. (2010), which is here modified and extended to account for information gathered over the past few years.

Our reanalysis of the Sims (2012) data, pertaining to the prewar US economy, partly came out of an interest to see how the 'flat priors' the Nobel Prize winner used actually mattered for his Bayesian-based conclusions. As we have accounted for, the analysis pans out rather differently in the prior-free frequentist framework, though we are using exactly the same data and the same model. The Olympic unfairness story stems from the higher than medium


interest of one of the authors in the sport and sociology of speedskating. The Olympic 500-m sprint event changed its rules, from those valid from Chamonix 1924 to Lillehammer 1994, to those in use from Nagano 1998 onwards, after the initiative and statistical efforts of Hjort (1994c), due to a certain parameter estimate being 0.06 second. The statistical evidence is even stronger that the 1000-m event is imbalanced, in the statistical overall sense across the population of the world's fastest women and men, favouring the person starting in the inner lane.

There is active international interest in the so-called Nordic model, with some apparent paradoxes of an economic system working well there even when some of its aspects might appear to clash with other theories for economic growth, and so forth. This is partly the background for being interested in the Norwegian income study, the associated Gini indexes, and how the outer strata of 'very rich' and 'rather poor' evolve over time, even in these supposedly egalitarian societies. Our quantile regression analyses shed light on these issues, partly involving multiconfidence distributions of the product type.

We were aware of the methodological controversies surrounding meta-analysis for paired binomial data with many zeroes, via the work and interest in these issues of colleagues. It appeared to us that our optimality theory for confidence distributions should clear up at least some of the 'what to add to nothing' debate, and this led to the Rosiglitazone story. Here we incidentally find the Poisson rate models to be nearly equivalent but somewhat easier to use; see also Cunen and Hjort (2015). Finally we chose to round off our collection of stories with our take on an ongoing discussion related to the patterns of citations in popular academic databases, like Google Scholar, thus ending these stories with Schweder and Hjort confidence curves for Schweder and Hjort.

Exercises

14.1 Golf: For the golf putting data analysed in Section 14.2, fit the logistic regression models H(a + bxj) and H(a + bxj + cxj²), with H(u) the logistic transform, as well as the alternative (14.3) model. Compute and compare AIC scores for the three models, and also Pearson residuals zj = (yj − mj p̂j)/{mj p̂j(1 − p̂j)}^{1/2}. Also fit the extended three-parameter Beta-binomial model (14.4), similarly computing AIC score and Pearson residuals. Find also an approximate confidence distribution for the tightness parameter γ of that model.

14.2 Olympic unfairness: Access the 2014 World Sprint Championship results, cf. Table 14.3, via the CLP website www.mn.uio.no/math/english/services/knowledge/CLP/.

(a) Fit the bivariate regression model (14.5), estimating all seven parameters (a1, a2, b, c, d for the mean structure and σ, κ for the variance structure).

(b) In this (14.5) model, show that the parameter estimators of the variance structure, (ρ̂, σ̂), are approximately independent of those of the mean structure, β̂ = (â1, â2, b̂, ĉ, d̂). Show also, via methods of Chapter 2, that

$$\begin{pmatrix} \hat\sigma \\ \hat\rho \end{pmatrix} \approx_d \mathrm{N}_2\left( \begin{pmatrix} \sigma \\ \rho \end{pmatrix}, \frac{1}{n} \begin{pmatrix} \tfrac12\sigma^2 & -\tfrac12\sigma(1-\rho^2) \\ -\tfrac12\sigma(1-\rho^2) & (1-\rho^2)^2 \end{pmatrix} \right).$$

(c) Provide confidence distributions for each of the seven parameters, for this 2014 World Championship event. Do this also for the inter-skater correlation coefficient ρ = κ²/(σ² + κ²).


Table 14.7. Estimates of the interskater correlation parameter ρ for the World Sprint Championships 2001–2014, along with estimates of their standard deviations, via the model (14.5)

Year  ρ̂     se     Year  ρ̂     se
2014  0.736  0.098  2007  0.753  0.077
2013  0.801  0.076  2006  0.710  0.072
2012  0.584  0.134  2005  0.680  0.101
2011  0.737  0.093  2004  0.558  0.124
2010  0.902  0.039  2003  0.673  0.097
2009  0.864  0.054  2002  0.693  0.088
2008  0.871  0.039  2001  0.631  0.123

(d) Carry out a proper meta-analysis for the interskater correlation parameter ρ = κ²/(κ² + σ²), based on estimates and standard errors from the 14 World Sprint Championships 2001–2014 via the model (14.5), given in Table 14.7. In particular, assuming the model ρ̂j ∼ N(ρ0, τ²), give confidence distributions for τ and for ρ0. Give an interpretation of your results.

14.3 Meta-analysis for Lidocaine data: The following data table is from Normand (1999), and pertains to prophylactic use of lidocaine after a heart attack. The notation is as with the story of Section 14.7.

m1   m0   y1  y0
39   43   2   1
44   44   4   4
107  110  6   4
103  100  7   5
110  106  7   3
154  146  11  4

(a) With the model y1 ∼ Bin(m1, p1) and y0 ∼ Bin(m0, p0), find the optimal confidence distribution for the odds ratio (OR), defined as in Exercise 8.4, for each of the six studies.

(b) Assuming that there is such a common odds ratio parameter across the six 2 × 2 tables, compute and display its optimal confidence distribution, along with the six individual ones.

14.4 Meta-analysis for gastrointestinal bleeding: In a broad meta-analysis pertaining to corticosteroids and risk of gastrointestinal bleeding, Narum et al. (2014) analyse data from several types of study. In particular they analyse the group of 'ambulants', patients not treated in hospitals. There are a total of 56 two-by-two-table studies in their list related to ambulants, but of these only n = 5 binomial pairs have at least one case of the bleeding ulcer event under study. In the notation of Section 14.7, these are

m1   m0   y1  y0  z
49   50   5   0   5
101  99   0   1   1
41   40   1   2   3
63   63   1   0   1
198  202  1   0   1


(a) Find the optimal confidence distribution for the odds ratio for each of these five pairs of binomials, as in Example 8.1.

(b) Assuming a common odds ratio (OR) = ρ for these 56 two-by-two tables, and in particular for these five, compute and display the optimal confidence distribution, which by the theory of Section 14.7 takes the form

$$C(\rho) = P_\rho\{B > 8 \,|\, z_1 = 5, z_2 = 1, z_3 = 3, z_4 = 1, z_5 = 1\} + \tfrac12 P_\rho\{B = 8 \,|\, z_1 = 5, z_2 = 1, z_3 = 3, z_4 = 1, z_5 = 1\}.$$

Here B = Y1,1 + ··· + Y1,5, with observed value 5 + 0 + 1 + 1 + 1 = 8. This involves simulation from the noncentral hypergeometric distribution for each of the five components.

(c) Find also the maximum likelihood and median confidence estimates, along with the 95% optimal confidence interval. The results are 2.76 and 2.66 for estimating ρ, and the interval is the quite wide [0.74, 12.98].

(d) This is markedly different from the published point estimate 1.63 and 95% interval [0.42, 6.34] given in Narum et al. (2014, table 3), based on the same data and partly the same model. The primary reason is that they operate with a random component, where table i is associated with a log-odds difference parameter ψi and these are taken as i.i.d. from a N(ψ0, τ²). Their estimate and interval are then interpreted as valid for ψ0. The second reason is that their method relies on a normal approximation for the implied ψ̂i estimators, after which methods of Chapter 13 apply. Now attempt to emulate their calculations, providing also a confidence distribution for τ. Compute in particular the confidence point mass at τ = 0, corresponding to the simpler model with a common and fixed ψ0.

14.5 Crying babies: Cox and Snell (1981, p. 4) provide data from a certain setup to check whether newborn infants are less likely to cry if tended to in a particular way. On each of 18 days babies not crying at a specified time served as subjects. On each day one baby, chosen at random, was given the special treatment; the others were controls. Thus in the notation used elsewhere in this chapter, each m1 = 1.

Day  m0  y0  m1  y1    Day  m0  y0  m1  y1
1    8   3   1   1     10   9   8   1   0
2    6   2   1   1     11   5   4   1   1
3    5   1   1   1     12   9   8   1   1
4    6   1   1   0     13   8   5   1   1
5    5   4   1   1     14   5   4   1   1
6    9   4   1   1     15   6   4   1   1
7    8   5   1   1     16   8   7   1   1
8    8   4   1   1     17   6   4   1   0
9    5   3   1   1     18   8   5   1   1

(a) Let the probabilities of not crying be pi,1 and pi,0 on day i, for the treatment baby and for the others, and assume the odds ratio (OR) is the same across the 18 days. Compute and display the optimal confidence distribution C*(OR). In particular, C*(1) = 0.026, reflecting qua p-value low confidence in OR ≤ 1; that is, the alternative hypothesis OR > 1, that the treatment has an effect, is very likely.

(b) Compute also the approximate confidence curve Γ1(D(OR)) based on the deviance and the profiled log-likelihood, maximising over the 18 extra parameters. See Figure 14.19.



Figure 14.19 Confidence curves for the odds ratio of crying babies considered in Exercise 14.5; the optimal curve to the left, with point estimate 3.39, the approximate curve via deviance to the right, with maximum likelihood estimate 4.21.

14.6 Meta-analysis for Poisson counts: Analyses of two-by-two tables with respect to odds and odds ratios are most often carried out using the formalism of binomial distributions, as in Example 8.1 and Exercise 14.3. One might also use Poisson models, however, when the probabilities involved are small.

(a) Prove Poisson's famous result: If Ym ∼ Bin(m, p) with m growing large and p small, in a manner that secures mp → λ, then Ym →d Pois(λ). This is 'the law of small numbers', from Ladislaus von Bortkiewicz's 'Das Gesetz der kleinen Zahlen', 1898.

(b) Then prove the following more general result. Suppose X1, ..., Xm are independent Bernoulli variables with probabilities p1, ..., pm, with Ym = ∑i=1..m Xi counting the number of events among the m attempts. Show that if ∑i=1..m pi → λ and maxi≤m pi → 0, then Ym →d Pois(λ).

(c) Assume (Ym, Zm) has a trinomial distribution with parameters (m, p, q), with m large and p, q small, formalised to mean mp → λ1 and mq → λ2. Show that (Ym, Zm) →d (Y, Z), where Y and Z are independent Poisson with parameters λ1 and λ2.

(d) Suppose Y0 and Y1 are independent binomials with parameters (m0, p0) and (m1, p1), with p0 and p1 small. When p0 and p1 are small, we might use Pois(λ0) and Pois(λ1) instead. Show that the odds ratio becomes close to γ = rλ1/λ0, with r = m0/m1 the ratio of the sample sizes. Construct a confidence distribution for γ.

(e) Suppose m0 = m1 = 100 and that one observes (Y0, Y1) = (1, 3). Compute and compare the confidence distributions for the odds ratio (OR), using the exact binomial method of Section 14.7, and the method based on the Poisson approximations. Experiment with other outcomes and comment on your findings.
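As a numerical companion to (a) and (e), the quality of the Poisson approximation can be checked directly, for instance via the total-variation distance between Bin(m, λ/m) and Pois(λ). A Python sketch (the truncation point and the values compared are our own choices):

```python
from math import comb, exp, factorial

def binom_pmf(m, p, y):
    """Point probability of Bin(m, p); zero for y > m since comb(m, y) = 0."""
    return comb(m, y) * p**y * (1.0 - p)**(m - y)

def pois_pmf(lam, y):
    """Point probability of Pois(lam)."""
    return exp(-lam) * lam**y / factorial(y)

def tv_dist(m, lam, y_max=80):
    """Total-variation distance between Bin(m, lam/m) and Pois(lam),
    truncated at y_max (negligible mass beyond, for moderate lam)."""
    return 0.5 * sum(abs(binom_pmf(m, lam / m, y) - pois_pmf(lam, y))
                     for y in range(y_max + 1))
```

By Le Cam's inequality the distance is bounded by λ²/m, so it shrinks as m grows with λ fixed.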

14.7 Publish or perish: Find the Google Scholar profile of a person you might know, perhaps yourself.


(a) Fit the abc model f(y, a, b, c) of (14.12), using suitable weights w(pj) in the estimation method given there. Produce a confidence distribution for that scholar's rich-to-poor ratio γ = F^{-1}(p2)/F^{-1}(p1), say with (p1, p2) = (0.50, 0.95). Compare with Figure 14.18.

(b) As an alternative model fitting strategy, use the weighted likelihood method associated with the weighted Kullback–Leibler divergence, briefly described in (2.29) and (2.30). It consists in maximising

$$\ell_{n,w}(a, b, c) = \sum_{i=1}^n \{w(y_i)\log f(y_i, a, b, c) - \xi(a, b, c)\},$$

where $\xi(a, b, c) = \int w(y)\, f(y, a, b, c)\, dy$. Implement this here with f(y, a, b, c) stemming from (14.12), and with w(y) = (y/a)^q. Discuss its merits vis-à-vis minimising the quantile fit function.

(c) Suppose we model the log-positive-counts as lognormal, with distribution function Φ((log y − ξ)/σ). Show that γ = exp[σ{Φ^{-1}(p2) − Φ^{-1}(p1)}]. How can we estimate σ here, with upper quantiles more important than lower ones? Give recipes for cc(γ) under this model, and also nonparametrically.
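For part (c), the stretch factor follows from the lognormal quantile function exp(ξ + σΦ^{-1}(p)), from which ξ cancels; a sketch using the Python standard library:

```python
from math import exp
from statistics import NormalDist

def lognormal_stretch(sigma, p1=0.50, p2=0.95):
    """gamma = exp[sigma * (Phi^{-1}(p2) - Phi^{-1}(p1))]; xi cancels."""
    z = NormalDist()
    return exp(sigma * (z.inv_cdf(p2) - z.inv_cdf(p1)))

def lognormal_quantile(p, xi, sigma):
    """F^{-1}(p) = exp(xi + sigma * Phi^{-1}(p)) for the lognormal model."""
    return exp(xi + sigma * NormalDist().inv_cdf(p))
```

Since γ is a monotone function of σ alone, a confidence distribution for σ translates directly into one for γ.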


15

Finale: Summary, and a look into the future

In this final chapter we first highlight some main points from previous chapters, with emphasis on interpreting confidence as epistemic probability. Various other theories of epistemic probability and evidential reasoning are then briefly reviewed, and we discuss how Bayesian methods relate to confidence distribution methods in statistical theory and practice. The final section identifies some issues in confidence theory that remain unresolved, of which the need for an axiomatic theory of epistemic probability, of the confidence distribution type, is by far the most important and also the most challenging. It is clear that more statistical theory is needed! We echo Efron (1998) that confidence distributions might be the big hit in the years to come, both in statistical theory and practice. But the proof of the pudding is in the eating, and by clarifying the concepts and methods of statistical inference based on confidence distributions, statistical practice in empirical sciences will hopefully improve.

15.1 A brief summary of the book

The table of contents (pages vii–xi) gives a condensed summary of the book. Instead of commenting in chronological order we here highlight the most important messages of the book. Our comments are selective and indeed incomplete.

Epistemic probability: Important but neglected

The 95% confidence interval for the Newtonian gravitational constant G based on the CODATA 2010 is (6.6723, 6.6754) in appropriate units (Milyukov and Fan, 2012). Thus, P(6.6723 < G < 6.6754) = 0.95. Statements like the latter have been condemned by non-Bayesian statisticians since Neyman (1941): G is not a random variable, but a natural constant. It is thus either true or false, according to the hard-core frequentists, that G ∈ (6.6723, 6.6754), and the probability for such an event must therefore be either zero or one. This condemning message has been hard to convey to students and researchers, as it is contrary to common sense. We should, in our view, stop this practice. By acknowledging that there are two distinct types of probability, and that the statement P(6.6723 < G < 6.6754) = 0.95 is one of epistemic probability, and not about the outcome of a real or imagined experiment, it makes perfect sense. It does also make sense to say that the null hypothesis is false with probability 0.98 when the p-value is 0.02. The p-value is actually an epistemic probability for the null hypothesis based on the data as seen in the perspective of the model and also the alternative hypothesis.


The tradition in statistics for nearly a century has been that there is only one kind of probability, and this probability can be manipulated and calculated according to Kolmogorov's axioms. The Bayesians of the Lindley breed understand probability as epistemic, while others have an aleatory understanding of the concept. We think, as suggested by Ian Hacking, Fred Hampel and others, that probability has a Janus face. One side of probability is aleatory, concerning random events in nature or society. The other face is epistemic, concerning the uncertainty surrounding our knowledge about society or nature. The near universal neglect in classical statistics of there being two distinct kinds of probability that should be handled differently and that have different meaning has caused much confusion.

The Bayesian who cares about frequentist properties of her methods, such as bias, is also well served by acknowledging the distinction between aleatory and epistemic probability.

This book is about keeping epistemic and aleatory probability apart, and about providing concepts and methods for inferring epistemic probability distributions from data, in view of the statistical model that is cast in terms of aleatory probabilities. We think ‘confidence’ is a good term for epistemic probability in the statistical context of empirical science.

Confidence distributions are fiducial distributions when the dimension of the parameter is one. Fiducial probability was invented by Fisher. His aim was to establish an alternative to the then prevailing Bayesian methodology with flat prior distributions representing ignorance or lack of information. Fisher squared the circle and obtained a posterior distribution without invoking a prior. He thought of his fiducial probability as ordinary probability subject to ordinary calculus. To what degree Fisher made the philosophical distinction between aleatory and epistemic probabilities is a bit hard to say from his writing, but he certainly thought that his probabilities were subject to the rules of probability calculation that were axiomatised by Kolmogorov. Fisher's failure to distinguish between the rules for epistemic and for aleatory probabilities is at the root of his “biggest blunder”. As was discussed in Chapter 6, it actually turned out that fiducial probability does not in general obey the laws of ordinary probability calculus. Fisher's erroneous claims unfortunately caused fiducial probability to fall into disrespect and neglect.

Confidence and fiducial probability

In the simple case of a one-dimensional sufficient statistic X of the data having continuous cumulative distribution F(x, ψ0), the fiducial distribution based on an observation x is C(ψ, x) = 1 − F(x, ψ) when X is stochastically increasing in ψ. Thus C(ψ0, X) = 1 − F(X, ψ0) ∼ U, that is, uniformly distributed on (0, 1). Any fiducial distribution function for a one-dimensional parameter is actually uniformly distributed, in the continuous case, when evaluated at the true value of the parameter. We have taken this as the defining property of a confidence distribution C, as it is equivalent to the α-quantile C−1(α, X) being the right endpoint of a one-sided confidence interval (−∞, C−1(α, X)) of confidence degree α. Also, (C−1(α, X), C−1(β, X)) is a confidence interval of degree β − α. This equivalence between confidence intervals and fiducial distributions, that is, confidence distributions, was noted by Neyman (1934), and is in our view the basis for interpreting the confidence distribution as a distribution of epistemic probabilities.
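
The construction can be checked numerically. The following sketch (our own illustration, not code from the book) uses the normal location model X ∼ N(ψ, 1), for which C(ψ, x) = 1 − F(x, ψ) = Φ(ψ − x); at the true value ψ0, the realised C(ψ0, X) should be uniform on (0, 1):

```python
# Sketch: the pivot-based confidence distribution C(psi, x) = 1 - F(x, psi)
# for the N(psi, 1) model, where X is stochastically increasing in psi.
import numpy as np
from scipy.stats import norm

def confidence_distribution(psi, x):
    # C(psi, x) = 1 - F(x, psi) = Phi(psi - x)
    return 1.0 - norm.cdf(x, loc=psi)

rng = np.random.default_rng(1)
psi0 = 2.0
X = rng.normal(psi0, 1.0, size=100_000)   # repeated draws under the true psi0
U = confidence_distribution(psi0, X)      # should look uniform on (0, 1)
print(U.mean(), U.var())                  # close to 1/2 and 1/12
```

The uniformity of C(ψ0, X) is exactly the defining property discussed above; any monotone continuous pivot gives the same picture.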


Fisher wanted his fiducial probabilities to be exact, and this is possible only in continuous models. Approximate confidence distributions are, however, available also for discrete data, provided the statistic is on an ordinal scale. With X the one-dimensional sufficient statistic with probability function f(x, ψ0) and cumulative distribution F(x, ψ0), the so-called half-corrected confidence distribution is C(ψ, x) = 1 − F(xobs, ψ) + (1/2) f(xobs, ψ).

The confidence distribution for the parameter, based on the observed data, is obtained from the aleatory probability distribution of the data given the parameter. This is a big jump of thought. But we think this jump is the best we can do in the scientific context to obtain an epistemic probability distribution from the data in view of the model. To the scientist, her data are in focus. They are often obtained by considerable effort, and the challenge is to figure out what evidence they hold relative to the parameter of interest. It is the data and their evidence, and not the statistical properties of the method, that interest her. The physicist is, for example, not much interested in the coverage frequency of her confidence interval method when used in psychology or biology or whatever. She is interested in how the uncertainty surrounds her estimate of the gravitational constant from her data. She is, we think, willing to make the jump from coverage probability to confidence.
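
The half-corrected construction for discrete data is easy to compute. Here is a sketch (our own illustration, assuming a Poisson model with mean ψ and an observed count of 5, not an example from the book):

```python
# Sketch: half-corrected confidence distribution
#   C(psi, x_obs) = 1 - F(x_obs, psi) + (1/2) f(x_obs, psi)
# for a Poisson count with mean psi.
import numpy as np
from scipy.stats import poisson

def half_corrected_cd(psi, x_obs):
    return 1.0 - poisson.cdf(x_obs, psi) + 0.5 * poisson.pmf(x_obs, psi)

x_obs = 5
grid = np.linspace(0.01, 20, 2000)
C = half_corrected_cd(grid, x_obs)                 # increasing in psi
median_estimate = grid[np.searchsorted(C, 0.5)]    # confidence median for psi
print(median_estimate)
```

The function is monotone in ψ and so can be read off exactly like a continuous confidence distribution, for example to extract the confidence median or tail-symmetric intervals.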

The confidence curve cc(ψ) = |2C(ψ) − 1| gives an excellent picture of the confidence distribution as a carrier of confidence intervals. It is a funnel curve pointing at the confidence median, which is a median unbiased point estimate, and its level sets {ψ : cc(ψ) ≤ α} are tail-symmetric confidence intervals. Confidence curves might be obtained from deviance functions. Sometimes the likelihood is multimodal, as the profile likelihood is for the ratio of two normal means considered in Section 4.6. Perhaps the notion of confidence distributions should have been cast in terms of confidence curves and the family of confidence regions they represent as their level sets, rather than in terms of cumulative distribution functions?
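
As a small numerical sketch (again the N(ψ, 1) model with an illustrative observation, not an example from the book), the level set of cc at 0.95 recovers the familiar tail-symmetric interval x_obs ± 1.96:

```python
# Sketch: the confidence curve cc(psi) = |2 C(psi) - 1| for the normal
# location model, with C(psi, x_obs) = 1 - F(x_obs, psi) = Phi(psi - x_obs).
import numpy as np
from scipy.stats import norm

x_obs = 1.3

def cc(psi):
    C = norm.cdf(psi - x_obs)
    return np.abs(2.0 * C - 1.0)

# the level set {psi : cc(psi) <= 0.95} is the tail-symmetric 95% interval
grid = np.linspace(-4, 7, 200_001)
inside = grid[cc(grid) <= 0.95]
print(inside.min(), inside.max())   # approximately x_obs -/+ 1.96
```

The funnel bottoms out at the confidence median (here x_obs itself), and sweeping α from 0 to 1 traces out the whole nested family of intervals.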

Some hypothesis tests are uniformly optimal by the Neyman–Pearson lemma. Since C(ψ1) is the p-value when testing H0 : ψ ≤ ψ1 versus H1 : ψ > ψ1, Neyman–Pearson-like results are available for confidence distributions. The provision is that the likelihood ratio of a sufficient statistic is monotone, and the result is that any natural measure of spread of the confidence distribution based on this statistic is smaller than the spread of any other confidence distribution.

Conditional tests might be uniformly optimal in exponential class models. This carries over to conditional confidence distributions given the ancillary statistics. This is discussed in Chapters 5 and 8.

A problem that has been much debated in recent years is how to handle tables with no deaths among treated and controls in a collection of independent 2 × 2 tables summarising survival studies. The optimal confidence distribution for the excess mortality is based on the conditional distribution of the total number of deaths among the treated given the total number of deaths. Our answer to the question “What to add to nothing” (Sweeting et al., 2004) is thus not to add anything, but to ignore the empty tables, at least in fixed effects models. In models like the Strauss model and the Ising model, the maximum likelihood estimate is hard to come by. The conditional, that is, the optimal confidence distribution for the focus parameter is, however, within reach by simulating from the conditional distribution; cf. Section 8.7.


The likelihood function is a central tool in parametric statistical inference, so also in obtaining confidence distributions. The concepts and results most useful for confidence inference are collected in Chapter 2, with some material postponed to Chapter 7 and the appendix to make for smoother reading. The fact that the profile deviance function is approximately χ2 distributed when evaluated at the true value of the parameter (the Wilks theorem) is highly useful. The approximation is actually quite good, even for small sets of data, and there are tools to improve it. The r∗ and p∗ methods of Barndorff-Nielsen (1983, 1986), along with methods associated with Bartlett correction, are particularly potent tools. This is utilised in Chapter 10 to obtain a confidence likelihood from a confidence curve. The confidence likelihood is what Efron (1993) called the implied likelihood. Approximate confidence likelihoods are also obtained from a small set of confidence quantiles, say a confidence interval and a point estimate.

When the model parameter θ is of dimension p > 1, it is suggested in Chapter 9 that confidence distributions for one-dimensional interest parameters ψ = ψ(θ) are obtained either from the profile deviance or by bootstrapping when pivots in the sufficient statistics are unavailable. Sometimes inference is required for a q-dimensional parameter ψ, 1 < q ≤ p. The q-dimensional confidence curve, say cc(ψ) = Γq(Dprof(ψ)), where Γq is the cumulative χ2 distribution with q degrees of freedom and Dprof(ψ) is the profile deviance, will then provide a funnel plot pointing at the maximum likelihood estimate. The level sets or contours of the confidence curve are confidence regions with that level of confidence. When q is large, it might be difficult to visualise confidence curves based on profile deviances. Then the product confidence curve might be handy. Its level sets are rectangles in q-space, which might be depicted as confidence bands. Product confidence curves might be obtained by combining univariate confidence curves for each component of the parameter, and then adjusting the confidence scale to make the level set at α a simultaneous confidence band at that level.
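
A minimal sketch of the deviance-based construction for q = 1 (our own illustration: a normal mean with known unit variance, where the profile deviance is n(ψ − x̄)², not an example from the book):

```python
# Sketch: deviance-based confidence curve cc(psi) = Gamma_1(D_prof(psi)),
# where Gamma_1 is the chi-squared(1) cdf, for a normal mean with known
# unit variance (illustrative numbers n = 25, xbar = 0.8).
import numpy as np
from scipy.stats import chi2

n, xbar = 25, 0.8

def cc(psi):
    deviance = n * (psi - xbar) ** 2     # profile deviance D_prof(psi)
    return chi2.cdf(deviance, df=1)

# the 0.95 level set is the usual interval xbar +/- 1.96/sqrt(n)
grid = np.linspace(-1, 3, 400_001)
inside = grid[cc(grid) <= 0.95]
print(inside.min(), inside.max())
```

For q > 1 the only changes are df=q in the χ2 cdf and a region rather than an interval as level set.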

Combining independent confidence distributions for the same parameter might be accomplished in different ways. In analogy with Fisher's method of combining independent p-values, Singh et al. (2005) suggested adding the normal scores Φ−1(Ci(ψ)) and basing the integrated confidence distribution on this sum, which has a normal distribution at the true value of ψ. Variants of this method are also discussed. Another approach is to calculate the confidence log-likelihood −(1/2) Γ1−1(cci(ψ)) for each independent confidence distribution. Their sum is the confidence log-likelihood for the collection, with a deviance function that is approximately χ2 distributed with one degree of freedom. The combined confidence curve and confidence distribution are thus obtained. Meta-analysis is all about combining results from different independent studies. When the study results are reported as confidence distributions, we suggest in Chapter 13 converting to confidence log-likelihoods and adding up. There is often interstudy variability in reported results. This is usually modelled by a latent random component. In the simple case of the individual confidence distributions being normal, and the interstudy variation also being normal, an exact confidence distribution is found for the interstudy variance. It will often have a point mass at zero.
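
The normal-score combination can be sketched in a few lines (illustrative numbers, not from the book; for two normal confidence distributions with equal scales the result agrees with the exact combined confidence distribution, centred at the mean of the two estimates):

```python
# Sketch: combining k independent confidence distributions by adding
# normal scores, in the manner of Singh et al. (2005).  At the true psi
# each Phi^{-1}(C_i(psi)) is standard normal, so the sum is N(0, k).
import numpy as np
from scipy.stats import norm

x = np.array([1.0, 2.0])        # two study estimates (illustrative)
sigma = np.array([1.0, 1.0])    # their standard errors

def combined_cd(psi):
    scores = norm.ppf(norm.cdf((psi - x) / sigma))   # normal scores
    return norm.cdf(scores.sum() / np.sqrt(len(x)))  # integrated CD

print(combined_cd(1.5))   # median of the combined CD sits at the mean 1.5
```

With equal sigmas the combined distribution is normal with centre (x1 + x2)/2 and standard deviation σ/√2, which is also what adding the two confidence log-likelihoods gives.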

Focussed research

The spirit of the book is in line with that of R. A. Fisher. Not only did he invent the likelihood function and fiducial probability, and develop the modern statistical concepts and methods, particularly the likelihood theory. But he also, in his textbooks and statistical practice, promoted the use of parametric models with one or a few parameters of special interest. His scientific question to be elucidated by the data is formulated in terms of the parameter of special interest.

Fisher’s models were often small and parametric, not only because calculations weremuch more demanding then than they are now, but also for the sake of clarity andtransparency. He usually put forward the model he wanted to work with, perhaps withsupporting arguments with respect to realism and appropriateness, but seldom as one ofseveral plausible models. He almost never chose his model through any kind of modelselection competition. Fisher was interested in the logic of scientific reasoning, and thoughtthat in the end it had to be conditional on the chosen model. This is actually the case evenwhen inference is carried out with modern and widespread model selection tools, in fact alsowhen when model selection uncertainty is accounted for.

The Fisherian paradigm for empirical science is widely respected, but more so in theory than in practice. It consists, in our view, of

1. Asking the scientific question to address and formulating it in terms of a particular parameter

2. Then gathering data and formulating a model giving home to the interest parameter, while respecting the data generating process and thus fitting the data, and allowing inference on the interest parameter

3. Carrying out the inference, often by likelihood methods, and reporting and interpreting the results concerning the parameter of interest in the format of an epistemic probability distribution that is not more subjectively loaded than necessary

Fisher favoured his fiducial distribution in step 3. When the parameter is scalar, we agree. But when the parameter of interest is of dimension q > 1, we would rather use a joint confidence curve or give a scalar confidence distribution for each of its components.

Often the scientific process is reversed: first a certain body of data is gathered or accessed, and then a question is asked and a model formulated. Several models might actually come into question, and one of these is usually chosen on theoretical grounds and/or by its fit to the data relative to its parsimony, say by the Akaike information criterion (AIC) or some other criterion discussed in Section 2.6.

There is a widespread practice in empirical science of relying mostly on the fit/parsimony property when selecting the model. The candidate models are then chosen on more or less explicit theoretical grounds, theory here understood as ecology, physics or whatever the subject matter is. Many scientific papers report a table of estimates obtained in several models, often of regression type. Each model is also evaluated with respect to fit/parsimony by AIC, BIC, the focussed information criterion (FIC) or some other criterion. This table will often be the main result of the paper, and the discussion is an attempt to place the result in context and judge its relevance. Often the discussion concerns the results of the model that wins the model selection competition. This practice is of course appropriate in some circumstances. But good papers are focussed, asking a particular question of importance and gathering and interpreting relevant evidence.

Another advantage of keeping to one model is that confidence distributions for the focus parameters are internally valid, that is, had the same type of data been gathered in repeated experiments and analysed by the same method, the realised coverage frequencies would equal the nominal ones in the long run.

It is often the case that we are uncertain about which variables to include in the model. With too many, the evidence for the focus parameter will be washed out to some degree. With too few variables, the inferred effect might be spurious. When a certain parameter has been identified as focal, the FIC criterion of Claeskens and Hjort (2008, chapters 6–7) might be employed. The aim there is to find the model among the candidates that minimises the squared error loss for the focus parameter, at least asymptotically. In smooth models, the inferred confidence distributions will be Gaussian. When the standard error, obtained from the Hessian of the chosen model, is modified according to the uncertainty involved in the model selection, the confidence distribution is asymptotically internally valid.

Our illustrative examples throughout, and the applications in Chapter 14, do follow the Fisherian paradigm.

15.2 Theories of epistemic probability and evidential reasoning

The following brief overview of theories of epistemic probability and evidential reasoning is far from complete. The purpose here is to give a broad introduction, and to mention points of contact of these rival theories with confidence inference.

Epistemic probability and degree of belief

The main theory of epistemic probability for use in statistical work is of course the Bayesian theory. When the prior represents the belief of the scientist, the posterior represents her belief after having seen the data, given the statistical model and hence the likelihood. This is a mathematical truth within standard probability theory. The prior might, however, not represent a consensus view. Then the posterior is loaded with the subjective prior that was chosen. The use of flat priors is widespread, and is an attempt at choosing a noninformative prior. Flatness is, however, not preserved under reparametrisation. There is actually no such thing as a noninformative continuous distribution.

Jeffreys (1931) found another way out of this problem than Fisher (1930). He required invariance of the posterior to transformations of the parameter, and found, for example, the flat prior for a location parameter, and 1/σ as the prior for the scale parameter σ. Jeffreys argued that his invariant priors are reasonable representations of noninformativity. One problem with Jeffreys' solution is that the Bayes formula requires priors to be proper probability distributions. With an improper prior one might argue that the posterior is the limit of posteriors based on the prior being truncated to increasing finite intervals. Improper priors might sometimes lead to useless improper posterior distributions.

Another and perhaps graver problem for the Bayesian is the potential for bias (in the frequentist sense), particularly for nonlinear derived parameters. It must be unsettling to have epistemic probability distributions that frequently are located, say, to the right of the parameter in a certain class of problems, such as the length problem. We discussed the length problem in Section 9.4, and also in Example 6.3.


Berger and Bernardo (1992) found a way to construct a prior for the basic parameter θ that makes the posterior for a parameter of interest ψ = ψ(θ) (approximately) unbiased in the frequentist sense. This prior is in a sense focussed on the particular parameter in focus, ψ, and with more than one reduced parameter of interest a new Bayesian analysis must be done for each of them, since the prior for θ must be tailored to each one. This is discussed a bit further in Section 15.3, where the Bayesian methodology is contrasted with confidence distribution methods.

Belief functions

In court and in other walks of life, the evidence for a proposition A is weighed against the evidence against A. When evidence comes in probabilistic terms, the probabilities p(A) for A and q(A) = p(Ac) need not add to one. The remainder r(A) = 1 − p(A) − q(A) measures the lack of evidence. In court and to most people these probabilities are epistemic.

Keynes (1921) emphasised the strict logical relationship between evidence and probability. He spoke of legal atoms as logical building blocks of the evidence, and he saw the evidence as the premise for the probability. When the legal atoms are independent, each exercises a separate effect on the concluding probability, while they could also combine in organic formations with interactions between them. Keynes' treatise was highly regarded when it was published, but it has not had any sustained impact on the theory or philosophy of probability.

Dempster (1967) initiated a theory of epistemic probability allowing for lack of evidence for some propositions. He had in mind a universe of propositions with related probabilities. These probability statements need not be precise, but could be in terms of upper and lower probabilities bounding the probability of the statement. The universe need also not be algebraically complete under the rules of negation and conjunction. When the eligible statements are represented as a family R of subsets of the sure event Ω, R is not an algebra with respect to negation and intersection. To each member A ∈ R there is associated a mass m(A) ≥ 0, the masses summing to 1 over R. Dempster's belief function is defined for each A ∈ R as

bel(A) = Σ{B∈R : B⊂A} m(B).

There is also an upper probability

pl(A) = Σ{B∈R : A∩B ≠ ∅} m(B).

This latter probability is interpreted as the plausibility of A.

As an example, consider the following universe of eligible statements and their masses: m(alive) = 0.20, m(dead) = 0.5, m(alive or dead) = 0.2, m(resurrect) = 0.01, m(alive or resurrect) = 0.09. Then bel(alive) = m(alive) = 0.20, pl(alive) = 0.49, and so on.
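
The example can be checked mechanically; here is a sketch with the statements coded as frozensets (our own encoding, not from the book):

```python
# Sketch: Dempster's belief and plausibility for the alive/dead/resurrect
# example; each statement is a frozenset of elementary outcomes.
m = {
    frozenset({"alive"}): 0.20,
    frozenset({"dead"}): 0.50,
    frozenset({"alive", "dead"}): 0.20,
    frozenset({"resurrect"}): 0.01,
    frozenset({"alive", "resurrect"}): 0.09,
}

def bel(A):
    # sum of masses of statements that imply A (subsets of A)
    return sum(w for B, w in m.items() if B <= A)

def pl(A):
    # sum of masses of statements compatible with A (intersecting A)
    return sum(w for B, w in m.items() if A & B)

A = frozenset({"alive"})
print(bel(A), round(pl(A), 2))   # 0.2 0.49
```

The gap pl(A) − bel(A) is the mass that neither confirms nor contradicts A, which is exactly the "lack of evidence" the theory is designed to represent.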

Dempster was particularly interested in combining evidence from different sources. Assume there are two sources of evidence, 1 and 2. The universe for source i is Ri, a family of subsets of Ω, equipped with the mass function mi. The mass function of the combined evidence is defined for all A ⊆ Ω as

m1&2(A) = m1∧2(A) / {1 − m1∧2(∅)},

where m1∧2(B) = Σ{D∩E=B} m1(D)m2(E). From the combined mass function the combined belief function is obtained.

The two sources of evidence to combine cannot be completely conflicting, that is, there must be at least one pair of statements B1 ∈ R1, B2 ∈ R2 such that B1 ∩ B2 ≠ ∅. Dempster's rule of combining evidence is associative and commutative. The normalisation in Dempster's rule redistributes m1∧2 over the nonconflicting statements.

Dempster’s rule has been studied by many authors, see, for example, Jøssang and Pope(2012). It is not free of contraintuitive results. Zadeh (1984) mentioned the followingexample with three suspects, = {a,b,c}. Witness 1 has the masses 0.99,0.01,0 for theserespectively, while witness 2 has masses 0,0.01,0.99. Dempster’s rule combines these twosources of evidence to full certainty that b is the guilty one despite both witness havevery low belief in this statement. Incidentally, the Bayesian would come to the samecontraintuitive conclusion if she combined the two distribution by Bayes’ formula. So wouldthe statisticians using the likelihood in the way Edwards or Royall suggested; see later. ForDempster, Bayes and the likelihoodist, replacing the zero by only a small amount of belieffor each witness would remove the difficulty.

Dempster’s theory was extended by Shafer (1976), and is now called theDempster–Shafer theory. When reviewing Shafer’s book, Fine (1977, p. 671) praised itas being a lucid introduction to the “unfortunately neglected study of epistemic probabilityand evidential reasoning”. Epistemic probability is still rather neglected, and its theory hasnot been much developed since Shafer (1976).

Independent confidence distributions for the same parameter might be combined with the Dempster–Shafer recombination rule. Hannig and Xie (2012) show how this works. Their investigation of this idea is in its early phase, and the method will hopefully turn out to be widely useful. They find that when the sample size increases in each study, this method is asymptotically equivalent to the method of adding the normal scores of the individual confidence distributions (Singh et al., 2005). It is then also equivalent to adding the confidence log-likelihoods, as discussed in Chapter 10. Confidence distributions are actually asymptotically Gaussian in smooth models, and the normal score method and the confidence likelihood method yield the same result in cases where the samples are large compared to the parameter dimension.

Likelihood as carrier of evidence

The likelihood function carries the information contained in the data, as seen through the lens of the model. It is the central ingredient in all theory of parametric statistical inference. The problem is to convert the likelihood to probability to make it interpretable.

In Bayesian methodology, this is simply done by Bayes' formula. Many scientists find, however, Bayesian analysis unacceptable, since it involves prior distributions that are usually unfounded.


The likelihood function is essential for both confidence inference and Bayesian inference, both in theory and in practice. Both methodologies transform the likelihood into epistemic probability. Edwards (1992) and also Royall (1997) base their theory of statistical evidence on the likelihood function directly, however, without any conversion to probability.

Edwards (1992) refers to Hacking (1965) and Barnard, for example, Barnard (1967), when he argues that an epistemic probability for a proposition is too ambitious a goal, since it would represent an absolute degree of belief in the proposition, while only relative degrees of belief are warranted. Like Edwards, both Hacking and Barnard propose to use the likelihood as the carrier of relative belief. Any quantitative measure of relative belief, called ‘support’ by Edwards, should satisfy the following requirements.

1. Transitivity If H1 is better supported than H2 and H2 is better supported than H3, then H1 is better supported than H3.

2. Additivity Support for H1 relative to H2 based on two independent sets of data should add to the support for H1 relative to H2 based on the combined data.

3. Invariance The support should be invariant to a one-to-one transformation of the data, and also to a one-to-one transformation of the parameter.

4. Relevance and consistency The measure of support must prove itself intuitively acceptable in application; in the absence of any information with which to compare two hypotheses, their support should be equal, and in case a hypothesis is true this hypothesis will asymptotically have better support than any false hypothesis; the relative support must be unaffected by including an independent set of totally irrelevant data.

5. Compatibility The measure of support should be an integral ingredient of the Bayesian posterior probability.

The log-likelihood function satisfies the above requirements, and is Edwards' support function.

Consistency in (4) is equivalent to the maximum likelihood estimator being consistent in the usual sense. This is, as we have seen, true under rather wide conditions on the model.

Being a relative measure of belief, the log-likelihood is unaffected by an additive constant. Thus the deviance function serves as a measure of disbelief in Edwards' sense. The log-likelihood function being constant, that is, the deviance being constantly zero, means that there is no information in the data to discriminate between different values of the parameter. The zero function is thus the noninformative deviance. The nonexistence of a noninformative distribution is a problem in Bayesian theory.

Edwards (1992, p. 31) states his ‘Likelihood axiom’ as follows: “Within the framework of a statistical model, all the information which the data provide concerning the relative merits of two hypotheses is contained in the ratio of the respective likelihoods of those hypotheses on the data, and the likelihood ratio is to be interpreted as the degree to which the data support the one hypothesis over the other.” Although perhaps not self-evident, most accepted features of well-established statistical inferential practice are in agreement with the likelihood axiom, and it is thus solidly empirically based.

Edwards’ definition of the log likelihood as the support function is thus based on thelikelihood axiom. Its interpretation is the natural one that a value of the parameter θ1 is morebelieved to be the true value than another value θ2 if L(θ1)≥ L(θ2), i.e. (θ1)− (θ2)≥ 0.


Royall (1997) agrees with Edwards that the likelihood function is all that is needed to discriminate between rival hypotheses. He subscribes to the Law of likelihood first formulated by Hacking (1965): “If hypothesis A implies that the probability that a random variable X takes the value x is pA(x), while hypothesis B implies that the probability is pB(x), then the observation X = x is evidence supporting A over B if and only if pA(x) > pB(x), and then the likelihood ratio pA(x)/pB(x) measures the strength of that evidence” (Royall, 1997, p. 3). Edwards (1992) reformulated this into his Likelihood axiom, and the two authors agree that the concept of evidence must be relative, and should be measured by the likelihood ratio or its logarithm.

Royall suggests that we should interpret the likelihood ratio directly, without any conversion to probability. Bernoulli trials are easy to understand intuitively, and he argues that we should understand a likelihood ratio L1/L2 in terms of Bernoulli trials. A coin could be fair or false, with probability of heads 1/2 and 1, respectively. The likelihood ratio of false to fair in n trials, all resulting in heads, is 2^n. Assume that the likelihood ratio of H1 versus H2 comes out as L1/L2 = 16 in an experiment. Then the evidence for H1 over H2 is the same as that of a coin being false relative to fair in n = 4 tosses of the coin, all resulting in heads. The evidence is actually that of false to fair when all of the n = log(L1/L2)/log 2 tosses give heads, however complex the model or data behind the likelihood ratio.
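
Royall's calibration is a one-liner; a sketch (illustrative, with the hypothetical helper name `equivalent_tosses`):

```python
# Sketch: Royall's coin-toss calibration of a likelihood ratio.  A ratio
# L1/L2 is equated with n = log2(L1/L2) tosses of a two-headed coin
# (heads probability 1 versus 1/2 for a fair coin) all landing heads.
import math

def equivalent_tosses(likelihood_ratio):
    return math.log2(likelihood_ratio)

print(equivalent_tosses(16))   # 4.0
```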

Evidence is something different from uncertainty, according to Royall. He illustrates the distinction by the following example. Let (X, Y) have a bivariate distribution with density f(x, y), for example, the bivariate normal distribution with all parameters known. When X = x is observed, what is then the evidence for y, and what is the uncertainty distribution for y? Treating y as an unknown parameter, the likelihood function is the conditional density L(y) = f(x | y). This is Royall's evidence function for y. He gives the reverse conditional density p(y) = f(y | x) as the natural uncertainty density. The Bayesian has in this model a natural prior in the marginal density for y, yielding the uncertainty density p(y). The two functions p(y) and L(y) are in principle different, and Royall has made his point. Had, however, a flat prior been used, the posterior density would be proportional to L(y). In this model F(x | y) is a pivot, and the confidence distribution is C(y) = 1 − F(x | y) when X is stochastically increasing in y. This is the case when the distribution is the bivariate normal, which Royall considered. The confidence distribution could also be obtained by converting the deviance function related to L(y) into a confidence curve.

Like many others, Royall is critical of Neyman–Pearson rejection/acceptance testing in the scientific context, where the question is to measure the evidence for a hypothesis against an alternative. He also criticises Fisher's significance testing by p-values. The p-value addresses the wrong question, Royall argues, as evidence is relative. The p-value is based on a statistic, usually chosen in view of the alternative to the null hypothesis, and is therefore not as absolute as Royall thinks. In simple models the p-value is, however, just a conversion of the likelihood ratio.

The law of likelihood is of course fundamental, but it does not prevent the likelihood orits deviance function to be converted to a confidence distribution. Our view is in line withLehmann (1993) that when suitably interpreted, there is really not such a big differencebetween the Neyman–Pearson paradigm and the Fisherian paradigm when it comes toassessing the evidence for an hypothesis. By rephrasing the problem as one of finding


the confidence distribution for the parameter in question, the two paradigms come nicely together.

15.3 Why the world need not be Bayesian after all

“In the first place, the best way to convey to the experimenter what the data tell him about θ is to show him a picture of the posterior distribution.” This sensible observation, that users of statistics wish to see a probability distribution associated with the parameter of interest, expressed here by Box and Tiao (1973, chapter 1), has traditionally contributed to pushing statisticians and users in the Bayesian direction. As we have explained and advocated in our book, however, there is an alternative route to deriving and displaying such probability distributions, but outside the Bayesian framework. The confidence distributions and confidence curves we have developed and investigated are free from the philosophical, mathematical and practical difficulties of putting up probability distributions for unknown parameters. Our own view might not be quite as terse as that formulated by Fraser (2011, p. 329), in his rejoinder to his Statistical Science discussants: “And any serious mathematician would surely ask how you could use a lemma with one premise missing by making up an ingredient and thinking that the conclusions of the lemma were still available.” The point remains, however, that the Bayesians own no copyright to producing distributions for parameters of interest given the data, and that our frequentist-Fisherian approach offers precisely such distributions, also, we would argue, with fewer philosophical and interpretational obstacles.

Confidence distributions for a focus parameter are the Fisherian cousins of the Bayesian posterior distributions (or the other way around; cf. Fraser (2011) and his discussants). As such they start out from different interpretations and aims, but in a variety of cases end up producing similar or identical inference; see, for example, Schweder and Hjort (2003). Also the Bernstein–von Mises theorems of Section 2.5 support this view, that at least for data of reasonable volume compared to the complexity of the model, the two approaches will pan out similarly. Although this is comforting, it is of course not an excuse for too quickly adopting one or the other view. Some might be tempted to argue that the path with the fewer philosophical and practical obstacles should be chosen, in cases where parallel strategies lead to similar results.

There are also situations where the confidence distribution view leads to clearer and less biased results than what standard Bayes leads to, as with the length problem and the Neyman–Scott situation. Here the likelihood function peaks in the wrong place, caused by too many parameters with too little information, and this messes up both maximum likelihood and standard Bayes solutions. Of course such problems might be worked with and saved also within the Bayesian machinery, but that would involve adding further information to the start model, like an overall prior seeing to it that the ensemble of parameters does not have too much intervariability, or building an ‘objective prior’ that specifically takes the focus parameter into account. Relevant in this connection is the broad field of ‘objective Bayes’; see, for example, Berger and Bernardo (1992) and Berger and Sun (2008), and attempts at setting up priors for one parameter out of many, as with Tibshirani (1989). We are not averse to such methods, of course, but neutrally point out that in several of these mildly troubling cases, the confidence view might do a better job with a minimum of further fixes.


As stressed generally and in various examples of Chapters 3 and 4, confidence distributions for derived scalar parameters are found by way of pivots, possibly approximate ones, for the parameter of interest, and cannot in general be found by integrating a joint confidence distribution. Perhaps the difference in practice between modern Bayesian analysis of the Berger and Bernardo type and confidence inference is less than often believed. Both methods aim at an epistemic probability distribution for each parameter of interest. In confidence inference a pivot or approximate pivot must be found for each focus parameter, while the modern Bayesian must for each such focus parameter find an appropriate prior for the full parameter vector.

One may ask when the c(ψ) curve is identical to a Bayesian’s posterior. This is clearly answered in the presence of a pivot; the confidence density agrees exactly with the Bayesian updating when the Bayesian’s prior is

π0(ψ) = |∂piv(T;ψ)/∂ψ| / |∂piv(T;ψ)/∂T|.  (15.1)

In the pure location case the pivot is ψ − T, and π0 is constant. When ψ is a scale parameter and the pivot is ψ/T, the prior becomes proportional to ψ−1. These priors are precisely those found to be the canonical ‘noninformative’ ones in Bayesian statistics. Method (15.1) may be used also in more complicated situations, for example via abc or t-bootstrapping approximations in cases where a pivot is not easily found; cf. Section 7.6. See also Schweder and Hjort (2002, 2003). Confidence distributions are invariant to monotonic transformations. So are Bayesian posteriors when based on Jeffreys priors. Only Jeffreys priors can thus yield posterior distributions that are confidence distributions.
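Formula (15.1) is easy to probe numerically. The sketch below (the helper name `prior_from_pivot` is our own, not from the book) differentiates a pivot by central differences and recovers the flat prior for the location pivot ψ − T and the 1/ψ prior for the scale pivot ψ/T.

```python
# A minimal numerical check of (15.1); 'prior_from_pivot' is a hypothetical helper.
def prior_from_pivot(piv, T, psi, h=1e-6):
    """|d piv/d psi| / |d piv/d T| by central differences."""
    dpsi = (piv(T, psi + h) - piv(T, psi - h)) / (2 * h)
    dT = (piv(T + h, psi) - piv(T - h, psi)) / (2 * h)
    return abs(dpsi) / abs(dT)

loc = lambda T, psi: psi - T      # location pivot: prior is flat
sc = lambda T, psi: psi / T       # scale pivot: prior proportional to 1/psi

print(prior_from_pivot(loc, 2.0, 1.0), prior_from_pivot(loc, 2.0, 5.0))  # 1.0, 1.0
for psi in (1.0, 2.0, 4.0):
    print(psi, prior_from_pivot(sc, 3.0, psi) * psi)  # psi * prior stays constant
```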

It is possible for the frequentist to start from scratch, without any (unfounded) subjective prior distribution. In complex models, there might be distributional information available for some of the parameters, but not for all. The Bayesian is then stuck, or she has to construct priors. The frequentist will, however, not have problems of principle in such situations. The concept of noninformativity is, in fact, simple for likelihoods: the noninformative likelihoods are simply flat. Noninformative Bayesian priors are, on the other hand, a thorny matter. In general, the frequentist approach is less dependent on subjective input to the analysis than the Bayesian approach. But if subjective input is needed, it can readily be incorporated (as a penalising term in the likelihood).

In the bowhead assessment model (Schweder and Ianelli, 2001), there were more prior distributions than there were free parameters. Without modifications of the Bayesian synthesis approach like the melding of Poole and Raftery (2000), the Bayesian gets into trouble. Due to the Borel paradox (Schweder and Hjort, 1996), the original Bayesian synthesis was, in fact, completely determined by the particular parametrisation. With more prior distributions than there are free parameters, Poole and Raftery (2000) propose to meld the priors to a joint prior distribution of the same dimensionality as the free parameter. This melding is essentially a (geometric) averaging operation. If there are independent pieces of prior distributional information on a parameter, however, it seems wasteful to average the priors. If, say, all the prior distributions happen to be identical, their Bayesian melding will give the same distribution. The Bayesian will thus not gain anything from k independent pieces of information, while the frequentist will end up with a less dispersed distribution; the standard deviation will, in fact, be of the familiar size O(1/k^{1/2}).
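This last contrast can be exhibited in a few lines. As a simplifying assumption (ours, not the bowhead model), let each of the k pieces of information be a standard normal density: geometric-mean melding of k identical densities returns the same N(0,1), while multiplying them as k independent likelihood pieces shrinks the standard deviation by the factor 1/k^{1/2}.

```python
from math import exp, sqrt

k = 9                                           # number of identical information pieces
xs = [i / 100 for i in range(-600, 601)]        # grid covering a N(0,1) density
logphi = [-0.5 * x * x for x in xs]             # unnormalised log density, common to all k

def sd_of(logdens):
    top = max(logdens)
    w = [exp(v - top) for v in logdens]
    tot = sum(w)
    m = sum(x * wi for x, wi in zip(xs, w)) / tot
    return sqrt(sum((x - m) ** 2 * wi for x, wi in zip(xs, w)) / tot)

meld_sd = sd_of(logphi)                   # geometric mean of k identical copies: unchanged
prod_sd = sd_of([k * v for v in logphi])  # product of the k pieces: likelihood combination
print(meld_sd, prod_sd)                   # ~1.0 versus ~1/3 = 1/sqrt(k)
```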


Nonlinearity, nonnormality and nuisance parameters can produce bias in results, even when the model is correct. This is well known, and has been emphasised repeatedly in the frequentist literature. Such bias should, as far as possible, be corrected in the reported results. The confidence distribution aims at being unbiased: when it is exact, the related confidence intervals have exactly the nominal coverage probabilities. Bias correction has traditionally not been a concern to Bayesians. There has, however, been some recent interest in the matter. To obtain frequentist unbiasedness, the Bayesian will have to choose her prior with unbiasedness in mind. Is she then a Bayesian? Her prior distribution will then not represent prior knowledge of the parameter in question, but an understanding of the model. In the Fisher–Neyman paradigm, this problem is in principle solved. It takes as input (unbiased) prior confidence distributions converted to confidence likelihoods and delivers (unbiased) posterior confidence distributions.

There are several other issues that could have been discussed, such as practicality and ease of communicating results to the user. Let us only note that exact pivots, and thus exact confidence distributions and confidence likelihoods, are essentially available only in regular exponential models for continuous data. In other models one must resort to approximate solutions in the Fisher–Neyman tradition. Whether based on asymptotic considerations or simulations, often guided by invariance properties, an ad hoc element remains in frequentist inference. The Bayesian machinery will, however, always in principle deliver an exact posterior when the prior and the likelihood are given. This posterior might, unfortunately, be wrong from a frequentist point of view, in the sense that in repeated use in the same situation it will tend to miss the target. Is it then best to be approximately right, though slightly arbitrary in the result, as in the Fisher–Neyman case, or to arrive at exact and unique but perhaps misleading Bayesian results?

15.4 Unresolved issues

In the following we point to certain difficult and partly unresolved issues. The list is merely a partial one, as the statistical field of confidence inference is in rapid expansion, with plenty of challenges!

Axiomatic theory for epistemic probability

The main unresolved issue is that of an axiomatic theory for epistemic probability understood as confidence distributions. The issue is the following. A confidence distribution C(θ) exists for a parameter θ. For what derived parameters ψ = h(θ) can a confidence distribution C(ψ) be obtained for ψ, and by what rules can C(ψ) be calculated from C(θ)? The first question is one of measurability, and the second is one of epistemic probability calculus. The fiducial debate (Chapter 6) showed that Fisher was wrong to assume that Kolmogorov’s axioms apply to his fiducial probability.

In the length problem considered in Example 1.5 and later on, we have seen that integrating the multivariate confidence distribution to get a distribution for a nonlinear parameter typically yields something that is not a confidence distribution. There are several examples of the marginal distribution of a multivariate fiducial distribution not being fiducial. Pedersen (1978) found, for example, that the only marginals of the bivariate


fiducial distribution of (μ,σ), the parameters of a normal distribution, that are fiducial are those for parameters linear in the pair; see Section 6.2. Standard probability calculus might also go wrong when θ has dimension 1; see Example 6.1. However, in many cases the marginals of a joint fiducial distribution are quite close to being confidence distributions. Under what conditions is marginalisation legitimate, and how good are the approximations?

Multivariate confidence distributions and confidence curves

In the length problem the joint normal is taken as the joint confidence distribution for θ = (μ1, . . . ,μn). In this case, and for fiducial distributions obtained by Fisher’s step-by-step method (Section 6.2), it is fairly clear what nested families of confidence regions they yield. But is it a good definition of a multivariate confidence distribution that it yields such nested families of confidence regions?

Multivariate confidence distributions were discussed in Chapter 9, but a general definition was not given. Emphasis was on families of confidence regions obtained from the deviance function, and on product confidence regions. The former confidence distribution is essentially one-dimensional, because it is represented by only one confidence curve determined by the deviance. The product confidence distribution is, on the other hand, restricted in shape to multivariate intervals (boxes). These nested families of confidence regions are useful in practice, but is it entirely appropriate to say that they represent joint confidence distributions?

Perhaps multivariate confidence distributions should be defined by their nested families of confidence regions. If so, should the dimensionality of the confidence distribution be the number of independent confidence curves, that is, families of confidence regions, that it holds? What should here be meant by independence?

These issues are central to any axiomatic theory of epistemic probability based on confidence distributions, and rules for how to manipulate multivariate confidence distributions to obtain distributions for derived parameters are what is sought.

Deviance functions are not always unimodal, and the confidence curve obtained by exact or normal conversion will then not be a simple funnel. Also, in some cases the confidence curve does not reach 1 on the confidence axis, not even asymptotically.

For example, in the Fieller problem considered in Section 4.6, of estimating the ratio of two normal means, the confidence curve based on the profile deviance has a mode not reaching confidence 1 (see Figure 4.10). Thus, at confidence levels above the maximum of the confidence curve the confidence region will be the whole parameter space. At lower levels the confidence regions are unions of two half-open intervals, while at even lower levels they are finite intervals. This family of confidence regions cannot be represented by a proper cumulative confidence distribution.
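A stripped-down version of this phenomenon can be computed directly. Assume (our simplification, not the book's full Fieller setup) observations x ~ N(a,1) and y ~ N(b,1) with known unit variances and focus ρ = a/b; profiling out the means gives the deviance D(ρ) = (x − ρy)²/(1 + ρ²), whose supremum over ρ is x² + y², so the confidence curve tops out strictly below 1.

```python
from math import erf, sqrt

x, y = 3.0, 0.8                           # illustrative estimates of numerator, denominator

def cc(rho):
    D = (x - rho * y) ** 2 / (1 + rho ** 2)   # profile deviance for the ratio rho = a/b
    return erf(sqrt(D / 2))                   # chi-squared_1 cdf of D

top = max(cc(r / 100) for r in range(-5000, 5001))
print(top, erf(sqrt((x * x + y * y) / 2)))   # both about 0.998: never reaches 1
```

At confidence levels above `top`, the inverted region is the whole real line, exactly as described above.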

We think that the concept of confidence distributions should be extended to cases where the confidence curve is not a simple funnel. To each confidence curve there is a family of nested confidence regions. By regarding the confidence curve, or its family of confidence regions, as the basic concept, the theory would be more in the spirit of Neyman, and further from Fisher’s fiducial distributions. The confidence is, however, interpreted as epistemic probability.


There might be more than one confidence curve in a given model. The distribution of confidence is then defined for more than one nested family of regions in the parameter space.

If the theory is established with the confidence curve as the basic notion, the question is what general conditions must be satisfied for a function of the parameter and the data, cc(θ,Y), to be a confidence curve. Clearly, the confidence curve must have values in the unit interval. In addition we think that a confidence curve should satisfy the following:

1. When θ0 is the value of the parameter behind Y, the confidence curve is a uniform pivot, cc(θ0,Y) ∼ U;

2. infθ cc(θ,y) = 0 for all possible observed data y.

Ball et al. (2002) found that some outcomes of a model for an epidemic lead to empty confidence intervals at intermediate levels of confidence. They constructed their confidence region by inverting tests of H0 : θ = θ0. There are also other examples of this phenomenon. Does this mean that infθ cc(θ,y) = 0 cannot be a necessary condition? Or is there something with this method of constructing confidence regions which makes the example not a counterexample?

What about continuity? We have not listed continuity, in some form or other, as a necessary condition. But clearly, the confidence curve must be measurable.

Approximations, discreteness

A confidence distribution is approximate if its cumulative distribution function C(θ0,Y), or its confidence curve, is approximately uniformly distributed. This is not a particularly precise notion. Bounds on the difference between the distribution of C(θ0,Y), or cc(θ0,Y), and the uniform distribution would be helpful.

A distribution of θ computed from the data is an asymptotic confidence distribution if C(θ0,Yn) tends in distribution to the uniform on (0,1), or similarly for the confidence curve. In some cases the rate of convergence is of order two or even of order three; see Chapter 7 and the Appendix. More results of this type would however be helpful.
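The defining uniformity is easy to probe by simulation. As a toy model of our own choosing (not tied to any particular example here), take n exponential observations with rate θ: then θS is an exact Gamma(n,1) pivot for the sum S, so C(θ0; S) = Γn(θ0 S) should be exactly uniform, and a Kolmogorov-type distance to the uniform cdf should be small.

```python
import random
from math import exp, factorial

random.seed(1)
n, theta0 = 5, 2.0

def gamma_cdf(z, shape):
    # cdf of a Gamma(shape, 1) variable, for integer shape
    return 1 - exp(-z) * sum(z ** j / factorial(j) for j in range(shape))

u = []
for _ in range(20000):
    s = sum(random.expovariate(theta0) for _ in range(n))   # S ~ Gamma(n, rate theta0)
    u.append(gamma_cdf(theta0 * s, n))                      # C(theta0; S) = Gamma_n(theta0 * S)

u.sort()
ks = max(abs(ui - (i + 1) / len(u)) for i, ui in enumerate(u))
print(ks)   # Kolmogorov distance to the uniform cdf: small, as it should be
```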

In a location model with density f(y − μ), the same f(y − μ), seen as a function of μ, is the Bayesian posterior based on a flat prior, and also the confidence density. Bayesian posteriors are actually approximate confidence distributions. Fraser (2011) suggests that Bayes posterior distributions are just quick and dirty confidence distributions. He looks into nonlinearity and model curvature to investigate the goodness of the approximation. Further results in this direction are useful. Also, further results regarding the choice of prior to obtain good approximate confidence distributions would be welcome.

For discrete data where the sufficient statistic is on an ordinal scale, half-correction is suggested to yield approximate confidence distributions. Hannig (2009) found that half-correction yields a superior approximate confidence distribution in the binomial context. Is this always the case? An optimality result supplementing Theorem 5.11 for discrete exponential families should be within reach, but is still not available.
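For concreteness, here is a sketch of the half-corrected confidence distribution in the binomial case, C(p) = Pp(X > xobs) + ½Pp(X = xobs), with its median located by bisection; the data values are illustrative, not taken from the book.

```python
from math import comb

def half_corrected_C(p, x, n):
    # C(p) = P_p(X > x) + 0.5 * P_p(X = x) for X ~ binomial(n, p)
    pmf = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    return sum(pmf[x + 1:]) + 0.5 * pmf[x]

n, x = 20, 7                      # illustrative data: 7 successes in 20 trials
lo, hi = 0.0, 1.0
for _ in range(60):               # bisect for the confidence median, C(p) = 0.5
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if half_corrected_C(mid, x, n) < 0.5 else (lo, mid)
print(round(lo, 3))               # median of the half-corrected cd, near x/n = 0.35
```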

Is the confidence distribution a useful concept when the parameter is on a nominal scale?


Bootstrap, Bayes and confidence

Bootstrap distributions are closely related to confidence distributions (Xie and Singh, 2013). In Section 7.6 we discussed the bias and acceleration correction method to obtain an approximate confidence distribution, accurate to the second order, for scalar parameters. Efron (2013) relates Bayesian inference to the parametric bootstrap, and shows how the posterior distribution for a scalar parameter can be corrected by a reweighting obtained from the bias and acceleration method to obtain a second-order accurate asymptotic confidence distribution. Can these methods be extended to parameters of higher dimensions? Does the method then solve the basic problem in epistemic probability, in that confidence distributions for derived parameters can be found from that for the full parameter, at least to second-order accuracy?

The adjustments of the likelihood to make the deviance at the true value more closely χ² distributed, and at the same time less dependent on any nuisance parameter, discussed in Sections 7.4 and 7.5, extend to p-dimensional likelihoods. Do they also apply to profile likelihoods? If epistemic probability in higher dimensions is defined in terms of confidence curves, this methodology of obtaining confidence curves also for derived parameters from adjusted likelihoods provides a second-order asymptotic solution to the basic problem of epistemic probability.

Combining confidence distributions

Independent confidence distributions might be combined by the method of Singh et al. (2005) by their normal scores, or by another scoring. This works fine when the confidence distributions are one-dimensional and proper, that is, are represented by cumulative distribution functions. But what if they are represented by confidence curves that are not funnel plots? If the parameter is of dimension larger than one, independent confidence curves might simply be combined by normal scores.
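A minimal sketch of normal-scores combination in the proper one-dimensional case (illustrative numbers, unit standard errors assumed): Ccomb(θ) = Φ(Σi Φ^{-1}(Ci(θ))/√k). For normal confidence distributions this reproduces the pooled-data answer exactly.

```python
from math import sqrt
from statistics import NormalDist

Phi, Phi_inv = NormalDist().cdf, NormalDist().inv_cdf

# Two independent normal confidence distributions C_i(theta) = Phi(theta - x_i),
# combined by their normal scores.
x1, x2 = 1.0, 2.0

def C_comb(theta):
    z = Phi_inv(Phi(theta - x1)) + Phi_inv(Phi(theta - x2))
    return Phi(z / sqrt(2))

# Agrees with the confidence distribution Phi(sqrt(2)(theta - xbar)) from pooled data:
theta = 1.7
print(C_comb(theta), Phi(sqrt(2) * (theta - 1.5)))
```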

The combination method based on confidence likelihoods works also in this case, as the confidence likelihood simply is a transform of the confidence curve. How should then the sum of the confidence deviances be converted to a combined confidence curve?

If, say, the normal conversion is correct for all the k confidence distributions in question, the sum of the deviances is Dc(θ) = ∑_{i=1}^k Γp^{−1}(cci(θ)). Then Dc(θ0,Y) ∼ χ²_{pk}, where dim(θ) = p, so Γ_{pk}(Dc(θ0,Y)) ∼ U, and the first requirement for this to be a confidence curve is met. But with high probability infθ Dc(θ,Y) > 0, and the second requirement is violated. In this case Dc,prof(θ) = Dc(θ) − min_t Dc(t) is a profile deviance, which, however, is not χ²_{pk} distributed. We have invoked the Wilks theorem to suggest the χ²_p distribution for the profile confidence deviance, at least for large samples. How good is this approximation? The distribution of Dc,prof(θ0) might of course be obtained by bootstrapping. The distribution might also be found to second-order accuracy if the adjustment methods discussed earlier apply to the combined confidence likelihood.

Rao–Blackwell

Fisher required his fiducial distributions to be based on sufficient statistics S. Because the conditional distribution of the data given S is free of the parameter, it follows logically that


any epistemic probability distribution should be based on S. The Rao–Blackwell theorem for estimation states that for any estimator, its conditional mean given the sufficient statistic S is at least as good as the original estimator. We proved an analogue for confidence distributions in Theorem 5.6. Our result is, however, probably too restricted, in that it is assumed that the pivot has the same functional form in both the original statistic and S. It is intuitively obvious that confidence distributions based on sufficient statistics dominate other confidence distributions in performance. We have been unable to prove this so far, and leave the challenge to the reader.

Model selection and postselection inference

Our methods for constructing confidence distributions for parameters of interest have at the outset been developed for the conceptually and practically clearest case of working within a given parametric model. In such a framework there are clear results concerning the distribution of maximum likelihood estimators and deviances, and so forth, even with a repertoire of further modifications for making the implied normality and chi-squared type approximations work better (Chapter 7). There is also clear methodology for both understanding and repairing the behaviour of these constructions when the presumed model is not fully accurate (Section 2.6).

Problems become harder when the selection of the model being used is part of the same analysis, however. If, say, a confidence distribution for a focus parameter ψ has been constructed at the second stage of an explicit or implicit two-stage procedure, where the first stage is that of letting the same data dictate which of several models to use, then hiding or glossing over the first stage may hamper the interpretation of this second-stage output. This relates to what is sometimes referred to as “the quiet scandal of statistics” (Breiman, 1992; Hjort and Claeskens, 2003a,b; Hjort, 2014); one’s presumed 95% confidence interval might be optimistically narrow if one does not take into account that the model used to produce the interval is part of the sampling variability too. It is certainly possible to address these issues within a sufficiently flexible context, but solving the consequent problems in a fully satisfactory manner is hard; cf. the references mentioned, along with further ideas discussed in Berk et al. (2013), Efron (2014) and Hjort (2014).

To see what these issues involve, and what problems they lead to, suppose we have data and a focus parameter ψ, defined and with the same interpretation across candidate models M1, . . . , Mk. We then know that the deviance statistic Dj(ψ) is approximately a χ²_1 under the home turf conditions of model Mj, leading to a confidence distribution, and so forth. Suppose now that the models considered are of the type indexed by densities f(y, θ, γ0 + δ/√n), with θ a protected parameter of dimension p and γ an open parameter of dimension q, and where the candidate models correspond to inclusion or exclusion of γ1, . . . , γq. Thus there are at the outset a total of 2^q different submodels S to choose from, where S is the appropriate subset of indexes {1, . . . , q}, each of these leading to an estimate ψS, and so forth. Results of Claeskens and Hjort (2008, chapters 6–7) might then be used to show that

DS(ψ) →d TS = {(Λ0 + ω^t(δ − GS D)) / (τ0² + ω^t GS Q GS^t ω)^{1/2}}²,


involving an orthogonal decomposition into independent limit variables Λ0 ∼ N(0, τ0²) and D ∼ Nq(δ, Q), further involving a vector ω stemming from the focus parameter and certain q × q matrices GS. The first point is that this demonstrates that the deviance statistics become noncentral rather than ordinary chi-squared; only when δ = 0, that is, under model conditions, does the ordinary result hold. The second point is that when the DS(ψ0) actually used is the result from a first-stage model selection, say using the AIC or the FIC, then its distribution is in reality a complicated mixture of limiting noncentral χ²_1(λS) variables. Finally, the third general point is that the above type of analyses tend to be mathematically enlightening, but not easy to exploit when it comes to setting up correction procedures for the confidence distributions. Bootstrapping from the biggest of the candidate models is an option, but even such a conservative strategy does not always work; cf. the aforementioned references.
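The undercoverage caused by ignoring the selection step is easy to exhibit by simulation. In the following toy setup (our own illustration, not from the references above), AIC chooses between μ = 0 and μ free for N(μ,1) data, and the nominal 95% interval is then reported from the selected model; when the narrow model wins, the reported value μ = 0 simply misses the true μ = 0.2.

```python
import random
from math import sqrt

random.seed(7)
n, mu, reps = 50, 0.2, 20000
cover = 0
for _ in range(reps):
    xbar = random.gauss(mu, 1 / sqrt(n))   # mean of n N(mu, 1) observations
    # AIC between M0 (mu = 0) and M1 (mu free) selects M1 iff n * xbar^2 > 2.
    if n * xbar ** 2 > 2:
        cover += abs(xbar - mu) <= 1.96 / sqrt(n)   # nominal 95% interval from M1
    # else: M0 pins mu at 0, and the degenerate "interval" {0} misses mu = 0.2
print(cover / reps)   # far below the nominal 0.95
```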

Software

The confidence distributions based on real or simulated data discussed or displayed in the various chapters are all calculated by rather small R programs. An R library, or software developed in another environment, for confidence inference would clearly boost the use of these methods. Such software remains to be developed, particularly if it is flexible with respect to model choice, and provides exact or good approximate confidence distributions for any parameter of interest.

Functions for analysing data with generalised linear models, mixed effects and analysis of variance models, and various other frequently used methodology are available in R and other statistical software systems. With options for improved calculation of the profile deviance, yielding higher-order accurate confidence curves for selected parameters, the power of these functions would be increased. Brazzale et al. (2007) provide such functionality in R in the form of separate functions. We would, however, like to see such functionality built into the standard functions, and rely on the R community to start this work.

15.5 Finale

As noted already in the preface, fiducial probability has been said to be Fisher’s biggest blunder. But Efron (1998), among other authors, expresses hopes for a revival of the method, and speculates that Fisher’s biggest blunder might be a big hit in our new century. We share this hope, but take note of the various traps that Fisher fell into when assuming that fiducial probability is governed by the ordinary laws of probability as axiomatised by Kolmogorov. Instead of ‘fiducial probability’ we use the term ‘confidence’.

This term was first used by Cox (1958). He sees confidence distributions as “simple and interpretable summaries of what can reasonably be learned from the data (and an assumed model)” (Cox, 2013, p. 41). With reference to the problem of making inference for the ratio of two normal means (see Section 4.6), where the confidence curve will be bounded above by some p < 1 when the estimate of the numerator is close to zero, and thus confidence intervals of degree α > p are the whole real line, Cox notes that the confidence distribution might not be a proper distribution.


Efron (2013) is also still enthusiastic about confidence distributions. He says (p. 41), “[a]n important, perhaps the most important, unresolved problem in statistical inference is the use of Bayes theorem in the absence of prior information. Analyses based on uninformative priors, of the type advocated by Laplace and Jeffreys, are now ubiquitous in our journals. Though Bayesian in form, these do not enjoy the scientific justification of genuine Bayesian applications. Confidence distributions can be thought of as a way to ground ‘objective Bayes’ practice in frequentist theory.”

To his death Fisher regarded fiducial inference as the jewel in his crown of the “ideas and nomenclature” for which he was responsible (Fisher, 1956, p. 77), this despite the serious criticism he faced. The fiducial distribution was, as Neyman (1934) remarked, a revolutionary idea. Hald (1998, p. 1) holds that “[t]here are three revolutions in parametric statistical inference due to Laplace (1774), Gauss and Laplace in 1809–1812, and Fisher (1922). It took about 20 years and many papers for each of these authors to work out the basic idea in detail, and it took about half a century for the rest of the statistical community to understand and develop the new methods and their applications.” We think that Fisher (1930) and Neyman (1934) staged the fourth revolution in parametric statistical inference. More than 80 years have now passed. The statistical community is still far from fully understanding fiducial probability or confidence distributions. Neyman insisted on keeping to aleatory probabilities only, and resisted any epistemic probability interpretation of his degree of confidence (despite its very epistemic term). And Fisher himself was apparently never quite on top of fiducial probability. Fisher died in 1962. Late in his life he is on record as having remarked to L. J. Savage (cf. Edwards, 1992, p. 237), “I don’t understand yet what fiducial probability does. We shall have to live with it a long time before we know what it’s doing for us. But it should not be ignored just because we don’t yet have a clear interpretation.” We agree. There is far more to explore concerning fiducial distributions and confidence distributions. And if the confidence distribution becomes the sturdy backbone in a synthesis of Bayesian and frequentist thinking, we owe even more to Sir Ronald Aylmer Fisher.


Overview of examples and data

A generous number of real datasets are used in this book to illustrate aspects of the methodology being developed. Here we provide brief descriptions of each of these real data examples, along with key points to indicate which substantive questions they relate to. Some of these datasets are small, partly meant for simpler demonstrations of certain methods, whereas others are bigger, allowing also more ambitious modelling for reaching inference conclusions. Key words are included to indicate the data sources, the types of model we apply and for what inferential goal, along with pointers to where in our book the datasets are analysed.

Lifelength in Roman era Egypt

In Spiegelberg (1901) the age at death has been recorded for 141 Egyptian mummies, 82 male and 59 female, dating from the Roman period of ancient Egypt from around year 100 B.C. These lifelengths vary from 1 to 96 years, and Pearson (1902) argued that these can be considered a random sample from one of the better-living classes in that society, at a time when a fairly stable and civil government was in existence. These data are analysed by Claeskens and Hjort (2008, pp. 33–35), in which nine different parametric models for hazard rates are compared and where the Gompertz type models are found to be best.

In Example 1.4 we fit a simple exponential model to the lifelengths of the men to motivate the concepts of deviance functions and confidence curves. In Example 3.7 we find the confidence distribution for the ratio of the hazard rate for the women to that for the men (in spite of Karl Pearson’s comment, “in dealing with [these data] I have not ventured to separate the men and women mortality, the numbers are far too insignificant”). A certain gamma process threshold crossing model is used in Exercise 4.13, providing according to the Akaike information criterion (AIC) model selection method a better fit than the Gompertz. For Example 9.7 we compute and display confidence bands for the survival curves, for the age interval 15 to 40 years. Then in Example 10.1 the data are used to illustrate the transformation from confidence distribution to confidence likelihood, for the simple exponential model, following Example 1.4, whereas Example 11.1 gives the confidence density for the cumulative probability F(y0), that is, the probability of having died before y0, again comparing male with female. Finally in Example 12.4 the data are used for constructing a full prediction confidence distribution for the lifelength of another person from the same population, using a Weibull model.

Small babies

A dataset stemming from Hosmer and Lemeshow (1999) provides information about 189 babies and their mothers, pertaining to birthweight, with certain informative covariates for the mothers, including smoking habits, age, weight and race. The data are further discussed and analysed in Claeskens and Hjort (2008). In Example 2.4 we carry out logistic regressions pertaining to how the covariates influence the probability of a newborn child having birthweight lower than 2500 g. This is followed up by comparing confidence curves for such a probability for a smoking and a nonsmoking white mother in Exercise 3.12. In Example 4.3 we find the confidence distribution for the degrees of freedom in a fitted three-parameter t-type model for the age of the mothers when giving birth, where it is noteworthy that this confidence distribution has a point mass at infinity (corresponding to the case where the t-model reduces to normality). In Exercise 8.8 we fit gamma regression models to birthweights of children of respectively smoking and nonsmoking mothers.

Olympic body mass index

From correspondence with eager people collecting data on Olympic athletes, in particular J. Heijmans, we have compiled information on the weight and height of Olympic speedskaters, from Oslo 1952 and up to Vancouver 2010. It is of interest to see how, for example, the body mass index (BMI; weight divided by squared height, in kg/m²) is distributed, as a function of time, and compared to other populations. In Example 3.8 we study the 95 male and 85 female skaters taking part in Vancouver 2010 and find confidence distributions for the mean, spread and quantile parameters. In Example 9.5 we construct a simultaneous confidence band for the evolution of BMI over time. The BMIs for the full group of 741 female skaters are then studied in Example 11.5 to give a Wilcoxon type confidence distribution for the Hodges–Lehmann estimate.

Assortative mating according to temper

In his classic Natural Inheritance, Galton (1889) gives a fascinating and entertaining analysis of ‘temper’, coupled with the general theme of inheritance, and courageously classifies husbands and wives as ‘bad-tempered’ or ‘good-tempered’. We use these data in Example 3.9 not merely to test for independence of these character traits, but also to provide confidence distributions and curves for relevant parameters.

Egyptian skulls across five epochs

Thomson and Randall-Maciver (1905) examined 30 skulls from each of five different Egyptian time epochs, corresponding to −4000, −3300, −1850, −200, 150 on our A.D. scale. Measurements x1, x2, x3, x4 have been taken on each of these 30 times five skulls; cf. Figures 1.1 and 9.1 in Claeskens and Hjort (2008), where these data are analysed. For Example 3.10 we investigate the extent to which the five mean vectors can be argued to be different, a question formulated in terms of a certain discrepancy parameter, for which we then find a confidence distribution; this is a more informative analysis than simply giving the p-value of a test. We use the data in Example 13.2 to provide confidence analysis of a certain parameter associated with how the covariance matrices have evolved over time.

Sons of Odin

Odin had six sons (though sources are not entirely clear on the matter): Thor, Balder, Vitharr, Vali (cf. the Eddic poems and the Snorri Edda), Heimdallr, Bragi (cf. Snorri’s kennings). In Example 3.11 we construct a confidence distribution for the number of all of Odin’s children. In some of the Snorri kennings there are also references to Tyr and Hod as sons of Odin (and yet other names are mentioned in the somewhat apocryphal Skaldskaparmal). Exercise 3.17 is about updating the aforementioned confidence distribution taking these other sons into account.

Abelian envelopes

How many Abel commemorative envelopes were issued in 1902? Before 2011, five such envelopes were known in the international community of stamp collectors, with so-called R numbers 280, 304, 308, 310, 328. We derive in Example 3.13 a confidence distribution for the number N of envelopes originally issued. Exercise 3.15 is about updating this confidence distribution in the light of three more 1902 envelopes discovered in 2012, carrying R numbers 314, 334, 389. The data (with associated questions) are from N.V. Johansen and Y. Reichelt, colleagues of the authors.

False discoveries

Schweder and Spjøtvoll (1982) considered the set of pairwise comparisons of the means in a certain one-way analysis of variance layout. Their methods led to an estimate of the number of true null hypotheses, and this analysis is extended in Section 3.8 to provide a full confidence distribution.

New Zealand rugby

Certain changes took place after new rules were introduced for the game of rugby football in 1992, with potential consequences for the so-called passage times. Data on such and other parameters were collected by Hollings and Triggs (1993), for five matches before and five matches after the new rules took effect, for the All Blacks (the New Zealand national team). In Exercise 3.13 we fit Gamma distributions and construct confidence distributions for parameters related to the changes in terms of means and standard deviations of these distributions.

Mixed effects for doughnuts

Data concerning the amount of fat absorbed for different types of fat over consecutive days when producing doughnuts are given in Scheffe (1959, p. 137). We use these data in Example 4.5 to illustrate mixed and random effects models, and find confidence distributions for several central parameters.

Central England temperatures

A unique database of meteorological data, the Central England Temperatures (CET), is being maintained partly thanks to the efforts of Manley (1974). It contains monthly mean temperatures from 1659 to the present. In Example 4.8 we study the April mean temperatures from 1946 to the present, where the fitted regression line a + b(year − 1946) climbs from 8.04 °C to 8.66 °C and has b = 0.01, corresponding to a 1 degree increase over 100 years. We use the data to find the confidence distribution for the year x0 where the expected mean temperature crosses 9.5 °C. The 90% interval stretches from 2028 to infinity. Exercise 4.11 asks for further analyses.

The lords of the fly

Drosophila fruit flies and their mating frequency and behaviour may be studied in so-called pornoscopes, and data of this type are reported on and discussed in Andersen et al. (1993, p. 38) and Aalen et al. (2008, p. 82). We use these data in Example 4.9 to fit certain parametric counting process models, and give confidence distributions for the peak mating time; see also Exercise 4.5.

Carcinoma of the oropharynx

A dataset concerning survival time after operation for carcinoma of the oropharynx is provided and discussed in Kalbfleisch and Prentice (2002, p. 378), along with covariates for the patients in question. These authors also give a Cox regression type analysis. An alternative analysis is given in Claeskens and Hjort (2008, section 3.4); see also Aalen et al. (2008, section 4.2). In Example 4.10 we give a confidence distribution for the median survival time for given types of patients who have already survived one year after operation, and in Section 12.4 a prediction confidence distribution is constructed for the time of a future event, for a patient with given covariates.

The asynchronous distance between DNA sequences

Certain DNA evolutionary data, involving the number of transitions for homologous sequences hypothesised to stem from a common ancestor, are discussed in Felsenstein (2004) and Hobolth and Jensen (2005). We employ a so-called Kimura Markov model in Example 4.11, and compute the confidence distribution for a certain natural asynchronous distance parameter; see also Exercise 4.14.

Markov and Pushkin

How Markov invented Markov chains in Markov (1906) is entertainingly discussed in Basharin et al. (2004). Markov’s convincing empirical analysis of Markov chains in Markov (1913) concerns counting transition frequencies between vowels and consonants in Pushkin’s Yevgeniı Onegin, which we match in Example 4.12 by reading the 1977 English translation by C. H. Johnston. We use these data to provide confidence distributions for two relevant parameters.

Skiing days at Bjørnholt

We have access to meteorological time series data pertaining to the amount of snow on the ground at various locations in Norway, over extended periods, via the work of Dyrrdal and Vikhamar-Scholer (2009) and others. One of these series may be converted to the number of skiing days per year, defined as there being at least 25 cm of snow on the ground, at Bjørnholt, one hour of cross-country skiing north of Oslo. In Example 4.13 we provide confidence distributions for the autocorrelation and negative trend regression parameters, and in Example 12.10 we form prediction confidence intervals for the number of skiing days over the coming years.

Liver quality of fish and Kola temperatures

Hjort (1914) addresses the key issues underlying the fluctuations and behaviour of the great fisheries of northern Europe. One of the many data sources considered is a time series pertaining to the liver quality of the northeast Atlantic cod (skrei, Gadus morhua), for the years 1880–1912, making it one of the first published time series concerning marine fish. In Kjesbu et al. (2014), the authors have managed to extend this Hjort series both backwards and forwards in time, producing measurements of the so-called hepatosomatic index (HSI) from 1859 to 2013. In Example 4.14 we analyse the HSI series and investigate to what extent it may be influenced by the annual winter temperatures at Kola, with data stemming from Boitsov et al. (2012). More details and a fuller analysis are provided in Hermansen et al. (2015).

Healing of rats

Pairwise differences of tensile strength of tape-closed and sutured wounds are recorded in Lehmann (1975, p. 126). This gives rise to natural rank-based tests for whether the two populations have the same distribution. We follow up such tests with a more informative confidence distribution for their difference in Example 5.5.

Birds on paramos

Table 4.5 gives the number of different bird species y living on the paramos on certain islands outside Ecuador, along with distance x1 from Ecuador (in km) and area x2 (in thousands of km²). The data stem from Hand et al. (1994, Case #52). The data are analysed using Poisson regression with overdispersion in Exercise 4.18, with an overdispersion confidence point mass at zero. A different and flexible model handling both over- and underdispersion is used for the same dataset in Exercise 8.18.

Twins and triplets

We have extracted information from Statistics Norway to find the number of twins and triplets born in Norway in 2004 and 2005. In Example 5.8 we give an optimal confidence curve for the Poisson proportionality parameter in question.

Bolt from heaven

Bolt (2013) is an interesting book, but the author restricts most of his attention to himself and his own achievements, so to find relevant data for our measure of surprise analyses for track and field events in Section 7.4 we have needed to track down lists of results and organise our own files. Our analyses in particular utilise all sub-10.00 second 100-metre races from 2000 to 2008, discarding results associated with athletes later caught for drug use as well as those reached with a tailwind exceeding the prescribed limit of 2.0 metres per second. On the technical side, our methods involve Bartletting the deviance function derived from the extreme value distribution.

Balancing act

Data from a certain balance experiment, carried out for young and elderly people and reported on in Teasdale et al. (1993), are used in Example 7.7 to find the optimal confidence distribution for the mean-to-variance parameter in a situation with low sample size, along with approximations.

Spurious correlations

Pairs of certain measurements (x, y) for small children of age 3, 12 and 24 months were recorded at the Faculty of Medicine at the University of Hong Kong. We use these to illustrate both the exact confidence distribution and some of its approximants for small sample size in Example 7.8, and also to exemplify a potentially incorrect or irrelevant way of assessing correlations from different groups, in Example 13.4.

Modified health assessment questionnaires

Certain health assessment questionnaires (HAQ, with and without modification, MHAQ), leading to so-called HAQ and MHAQ data, are discussed in Claeskens and Hjort (2008, example 3.7) (with data from P. Mowinckel, personal communication). We fit such a dataset to the Beta distribution in Example 8.2, computing power optimal confidence distributions for the two parameters via certain simulation strategies. A three-parameter extension of the Beta, using exponential tilting, is used in Example 8.7 and is seen to give a better fit.

Bowhead whales off Alaska

Data on the population of bowhead whales off Alaska were collected in 1985 and 1986, including the use of aerial photography; see da Silva et al. (2000) and Schweder (2003). In Example 8.3 we develop a confidence distribution for the number of whales, via a multinomial recapture model. This is seen in Example 10.3 to agree well with constructions stemming from other log-likelihoods. In Section 14.3 it is explained how parametric bootstrapping from quite complex models may be utilised for constructing confidence distributions for abundance and other relevant parameters.

Bivariate handball and football

We have followed women’s handball tournaments, and specifically those at the Athens Olympics 2004 and the European championships in Hungary 2004. In Section 8.3 we demonstrate that though the marginal distribution of the number of goals scored is close to a Poisson, the results (X, Y) from a match exhibit a kind of internal dynamics that makes X and Y closer to each other than if they had been independent. We propose a bivariate Poisson distribution and compute the optimal confidence distribution for its dependency parameter. Exercise 8.10 concerns analogous computations for the world of football, based on a total of 64 + 31 + 64 + 31 + 64 = 254 matches from five grand European and World tournaments, with data collected and then organised from various websites by one of the authors. An alternative bivariate Poisson model is used for Exercise 8.11, associated in particular with the real time real excitement plots of Figure 8.18. The match analysed there involves the precise time points at which Norway and Spain scored their 28 + 25 goals, which we found at handball.sportresult.com/hbem14w/PDF/21122014/W47/W47.pdf.

Blubber

Konishi et al. (2008) and Konishi and Walløe (2015) analyse data on baleen whales, in particular recording a certain blubber thickness parameter, of interest regarding how whales cope with changing factors related to krill abundance, climate, and so forth. We have had access to a detailed version of these data via L. Walløe (head of the Norwegian delegation to the Scientific Committee of the International Whaling Commission, personal communication). In Example 8.5 we model how blubber thickness varies with both year and other covariates, and find the confidence distribution for the important regression parameter β associated with year. The methodological challenge here is partly how to cope with the active use of model selection involved in setting up a good regression model in the first place.

Leukaemia and white blood cells

A certain dataset for 17 leukaemia patients, with survival time and white blood cell counts at the time of diagnosis, has been discussed and analysed in Feigl and Zelen (1965), Cox and Snell (1989) and Brazzale et al. (2007) for illustrating different models and analyses. In Example 8.6 we fit a gamma regression model, which via model selection criteria is seen to work better than previous attempts. An optimal confidence distribution is constructed for the basic influence parameter, with further details in Exercise 8.12.

Cross-country ski-sprint

Table 8.2 gives the results for the six male and female skiers having reached the finals of the Olympics in Sochi 2014. For our unfairness analysis in Example 8.8 we needed such information from a sufficiently long list of similar events. Such are easily found on, for example, www-fis-ski.com when it comes to Olympics and World Championships and also for World Cup events for seasons 2014 and 2013. We needed help from the Federation Internationale de Ski (FIS) secretariat in Lausanne to get hold of such results from seasons 2012, 2011, 2010. These files have then been edited, organised and kept by one of the authors.

Adelskalenderen of speedskating

The Adelskalenderen of speedskating is the list of the best ever skaters, based on their personal bests over the four classic distances 500 m, 1500 m, 5000 m and 10,000 m, via their combined pointsum x1 + x2/3 + x3/10 + x4/20, where x1, . . . , x4 are the personal bests, in seconds. When a top skater sets a new personal best, the list changes. For the illustration of Example 8.11, involving heteroscedastic regression, we have used the Adelskalenderen for the 250 best ever skaters, as per October 2014, to learn how the 10k time may be predicted from the 5k time. The data stem from files collected and organised by one of the authors.
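For concreteness, the combined pointsum converts each personal best to a 500 m pace and adds them up; a minimal sketch (the example times below are invented for illustration, not from the Adelskalenderen):

```python
def pointsum(x1, x2, x3, x4):
    """Adelskalenderen combined pointsum: personal bests (in seconds) over
    500 m, 1500 m, 5000 m and 10,000 m, each scaled to per-500 m pace."""
    return x1 + x2 / 3 + x3 / 10 + x4 / 20

# Invented example times, roughly in the range of strong skaters:
total = pointsum(35.0, 105.0, 370.0, 765.0)
print(total)  # 35.0 + 35.0 + 37.0 + 38.25 = 145.25
```

A lower pointsum means a better combined ranking, which is why the list changes whenever a top skater improves any of the four personal bests.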

Survival with primary biliary cirrhosis

Data are from a randomised study where patients suffering from the autoimmune disease of the liver called primary biliary cirrhosis receive either drug or placebo, and where covariates are recorded to judge their influence on survival. The data stem from Fleming and Harrington (1991), with follow-up information in Murtaugh et al. (1994), also discussed in Claeskens and Hjort (2008, chapters 1 and 9). For our illustration in Example 8.12 we concentrate on three of the available covariates for a total of 312 patients, namely drug (1 if the D-penicillamine drug is in use, 2 if placebo), age at registration (in days, and ranging from 26 to 79 years), and serum bilirubin in mg/dl (ranging from 0.3 to 28.0). We use linear-linear Gamma distribution modelling, with log-linear parametrisation of both gamma parameters, to analyse the time from registration to either death or transplantation. Only 40% of the patients reach this event inside the time of study, that is, there is 60% censoring.

Carlsen versus Anand

Magnus Carlsen and Viswanathan Anand have played a total of 39 games, from 2005 up to and including the 10 games of the world championship match in Chennai, 2013. Of these, six games were won by each, with 27 games ending in a draw. In Example 9.6 we give confidence regions for the associated trinomial probabilities (p1, p2) of winning for respectively Carlsen and Anand. In Exercise 9.4 the task is to update the confidence regions in view of the results from Sochi, 2014. With their draw in Zurich February 2014 followed by the 10 games of Sochi, the score is 9 wins for Carlsen, 7 wins for Anand and 34 draws.

Growing older

The oldest people on our planet are growing older. We use data from Oeppen and Vaupel (2002) to illustrate one of the consequences of this development, from 1840 to 2012. In Example 10.2 we compute the confidence distribution for the implied trend regression parameter (the exact and an approximation), with a follow-up in Exercise 12.8. For Example 12.8 we form the prediction confidence distribution for the average age for women in the country of the most long-lived women, up to the year 2050. A good source for modern as well as much historic data is the Human Mortality Database, www.mortality.org.

Humpback whales

Paxton et al. (2006) estimated the abundance of a certain North Atlantic population of humpback whales, in 1995 and 2001. In Exercise 10.10 the task is to build a confidence distribution for the yearly growth rate for this population, from published confidence intervals and estimates, via a joint likelihood for abundance for the two years.

Oslo babies

In the larger context of finding and exploring factors that influence the chances of babies being born too large, Voldner et al. (2008) have examined a certain cohort of 1028 mothers and children, all born at Rikshospitalet, Oslo, in the 2001–2008 period. We have had access to the birthweights in question (via N. Voldner and K.F. Frøslie, personal communication) and use these to illustrate confidence distributions for quantiles in Example 11.3 (direct methods) and Example 11.6 (via empirical likelihood).

US unemployment

Gujarati (1968) analyses US data related to quarterly unemployment and certain Help Wanted indexes, from the time period 1962 to 1966. In Example 12.9 we fit trend models with autoregressive structure for the residuals, and use this to compute prediction confidence distributions for the following quarters.

British hospitals

Our main dataset used for illustrating various meta-analysis methods in Chapter 13 stems from a study of 12 British hospitals reported on and discussed in Spiegelhalter (2001), Spiegelhalter et al. (2002) and Marshall and Spiegelhalter (2007), which for our statistical analysis purposes yields binomial counts for a certain event for a list of studies. One of the questions addressed is whether the Bristol hospital is a statistical outlier. See Section 13.1 and Example 13.1 for assessing the outlier question, and Example 13.3 for meta-analysis for both the overall mean parameter and the associated spread parameter.

Golf putting

A table of golf putting successes, seen as binomial experiments at different distances from the hole, is given in Gelman et al. (2004, Table 20.1). In Section 14.2 we use these to build a two-parameter model for successfully hitting the hole, as a function of the distance. It fits the data much better than, for example, logistic regression. We use this to find confidence curves for the distance x0 at which the hitting probability is some given p0.

Prewar American economy

Sims (2012) analyses a dataset involving consumption, investment, total income and government spending for the years 1929 to 1940, first studied by Haavelmo (1943); using Bayesian methods and flat priors he arrives at a posterior distribution for a certain parameter θ1, related to how changes in investment govern consumption. In Section 14.4 we analyse the same data, using the very same model, but reach a very different confidence distribution for θ1.

1000 m sprint speedskating

Data for our Olympic speedskating story in Section 14.5 stem from files collected over the years by one of the authors, and relate specifically to the results from a long enough list of annual World Sprint Championships, along with information pertaining to the inner-or-outer lane start positions for the skaters. We use the data to pinpoint the precise degree of drastic unfairness inherent in the current Olympic setup for the 1000 m sprint event.

Norwegian income

In Section 14.6 we analyse Norwegian income data, stemming from research carried out with the ‘Inequality, Social background and Institutions’ project at the Frisch Centre and the Centre of Equality, Social Organization, and Performance (ESOP), University of Oslo. Data for the whole Norwegian population are made available by Statistics Norway. We pay attention, for example, to the growth from 2000 to 2010, in terms of measures related to income quantiles.

Google Scholar profiles

The Google Scholar profiles for T. S. and N. L. H. have been accessed and recorded for mid-June 2015, and then fitted to a certain three-parameter abc model for such types of data, in Section 14.8. This leads to confidence distributions for relevant parameters, such as the rich-to-poor quantile ratios F⁻¹(p2)/F⁻¹(p1) for each scholar, for the ratio of such ratios when comparing the profiles of two scholars, and so forth.

Appendix

Large-sample theory with applications

This appendix presents general background material on first-order asymptotics for the distribution of estimators, test statistics, and so on, which in its turn gives the core theory for necessary parts of the methodology developed in the book. Thus by backwards induction it might prove fruitful to read through the following mini-course on the basics of large-sample theory, even for readers who have been through courses in these areas of theoretical and applied probability theory. We also include some material on more advanced topics, on robust parametric inference and model selection strategies, where the basics of large-sample theory are being actively used.

The aim of this appendix is to provide a brief overview of definitions, tools and results of the large-sample theory that are used throughout the book. The material covered includes some bits of matrices and linear algebra, the multivariate normal distribution, convergence in probability, convergence in distribution, convergence with probability one, the uni- and multidimensional central limit theorem, the Lindeberg theorem, the delta method and asymptotics for minimisers of random convex functions. Finally exercises are included.

A.1 Convergence in probability

Let T1, T2, . . . be a sequence of random variables. We say that Tn converges in probability to a if

lim_{n→∞} P{|Tn − a| ≥ ε} = 0 for each ε > 0.

We write Tn →pr a to indicate such convergence. In various contexts one needs to discuss also convergence in probability to a (nonconstant) random variable, say Tn →pr T, but in situations encountered in this book the probability limits will be constant (nonrandom) quantities. One also uses ‘Tn is consistent for a’ as an equivalent statement to Tn →pr a, particularly in contexts where Tn is an estimator for the quantity a. Often, Tn would then be a function of a dataset with n observations.

The convergence in probability concept is used also for vectors. If Tn and a are p-dimensional, thenthe requirement for Tn →pr a is convergence to zero of P{‖Tn − a‖ ≥ ε} for each positive ε, where‖x−y‖ is ordinary Euclidean distance in R

p. One may prove that this is equivalent to coordinate-wiseconvergence, that is, Tn →pr a in R

p if and only if Tn, j →pr aj for each j = 1, . . . ,p.Slutsky’s continuity theorem is important and often used: if Tn →pr a, and g is a function defined

in at least a neighbourhood around a, and continuous at that point, then g(Tn)→pr g(a). Here Tn

could be multidimensional, as could the image g(Tn) of g. Proving the Slutsky theorem is in essenceusing the continuity definition of g at the point a. The theorem holds also when the probability limitis a random variable, that is, Tn →pr T implies g(Tn)→pr g(T ), if g is continuous on a set in which

447

Page 113: 12...−3.842,−3.284,−0.278,2.240,3.632, and pointed to certain complexities having to do with multimodal likelihood functions. Here we use the two-parameter Cauchy model, with

“Schweder-Book” — 2015/10/21 — 17:46 — page 448 — #468

448 Large-sample theory with applications

the limit variable $T$ falls with probability one. Proving this version is somewhat more difficult than proving the $T = a$ version, though.

Since functions $g_1(x,y) = x + y$, $g_2(x,y) = xy$ and so on are continuous, it follows from the Slutsky theorem that $X_n \to_{pr} a$ and $Y_n \to_{pr} b$ imply $X_n + Y_n \to_{pr} a + b$, $X_nY_n \to_{pr} ab$, and so on.

Suppose $X_1, X_2, \ldots$ are independent and identically distributed (i.i.d.), with mean $\xi$. Then the average $\bar X_n = n^{-1}\sum_{i=1}^n X_i$ has mean $\xi$ and variance $\sigma^2/n$, provided also the variance is finite. By the Chebyshev inequality, $P\{|\bar X_n - \xi| \ge \varepsilon\} \le \sigma^2/(n\varepsilon^2)$, proving that $\bar X_n = n^{-1}\sum_{i=1}^n X_i \to_{pr} \xi = \mathrm{E}X$. This is the law of large numbers, in weak form, under a finite variance condition. The statement holds with finiteness of $\mathrm{E}|X|$ as the only requirement on the $X$ distribution, that is, $\bar X_n \to_{pr} \mathrm{E}X$ even if $\mathrm{E}|X|^{1.001} = \infty$; this is even true in the stronger sense of almost sure convergence.

If $\hat\theta_n = g(\bar Z_{n,1}, \ldots, \bar Z_{n,p})$ is a smooth function of averages, say with $\bar Z_{n,j} = n^{-1}\sum_{i=1}^n Z_{i,j}$ for $j = 1, \ldots, p$, then $\hat\theta_n$ is consistent for $\theta = g(\xi_1, \ldots, \xi_p)$, where $\xi_j = \mathrm{E}Z_{i,j}$, using a combination of the law of large numbers and the continuity theorem. Thus the empirical standard deviation $\hat\sigma_n$ is consistent for the true standard deviation $\sigma$, since
$$\hat\sigma_n^2 = \frac{n}{n-1}\Bigl(n^{-1}\sum_{i=1}^n X_i^2 - \bar X_n^2\Bigr) \to_{pr} \mathrm{E}X^2 - (\mathrm{E}X)^2.$$
Similarly, the empirical correlation coefficient $\hat\rho_n$ is consistent for the true correlation $\rho$, in that $\hat\rho_n$ may be written as a smooth function of the five averages $\bar X_n$, $\bar Y_n$, $n^{-1}\sum_{i=1}^n X_i^2$, $n^{-1}\sum_{i=1}^n Y_i^2$ and $n^{-1}\sum_{i=1}^n X_iY_i$.
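These consistency facts are easy to check by simulation; a minimal numpy sketch (ours, not from the book), with the true standard deviation and correlation both known by construction:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_sd(x):
    # sigma_n^2 = n/(n-1) * {mean of squares minus squared mean}
    n = len(x)
    return np.sqrt(n / (n - 1) * (np.mean(x**2) - np.mean(x)**2))

def empirical_corr(x, y):
    # rho_n as a smooth function of the five averages mentioned in the text
    mx, my = np.mean(x), np.mean(y)
    return (np.mean(x * y) - mx * my) / np.sqrt(
        (np.mean(x**2) - mx**2) * (np.mean(y**2) - my**2))

# Y = 0.6 X + 0.8 Z with X, Z independent standard normals,
# so sd(Y) = 1 and corr(X, Y) = 0.6 exactly.
n = 200_000
x = rng.standard_normal(n)
y = 0.6 * x + 0.8 * rng.standard_normal(n)

sd_hat = float(empirical_sd(y))
rho_hat = float(empirical_corr(x, y))
```

With $n$ this large both estimates land within a couple of hundredths of their targets, as the law of large numbers plus the continuity theorem predicts.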

A.2 Convergence in distribution

Let again $T_1, T_2, \ldots$ be a sequence of random variables on the real line. We say that $T_n$ converges in distribution to a variable $T$ if
$$F_n(t) = P\{T_n \le t\} \to F(t) = P\{T \le t\} \quad\text{for each } t \in C(F), \tag{A.1}$$
where $C(F)$ is the set of points at which $F$ is continuous. We write $T_n \to_d T$ to indicate such convergence, or sometimes $F_n \to_d F$, and allow variations like say $T_n \to_d \mathrm{N}(0,1)$ in the case of the limit variable having a $\mathrm{N}(0,1)$ distribution. Note that the definition requires probabilities to converge, not the variables themselves; this makes $T_n \to_d T$ and $T_n \to_{pr} T$ fundamentally different statements. Note that when $F$ is continuous, (A.1) simply says that $F_n(t) \to F(t)$ for all $t$. One may prove that such pointwise convergence actually implies uniform convergence, that is,

$$\max_{t\in\mathbb{R}} |F_n(t) - F(t)| \to 0 \quad\text{as } n \to \infty.$$

Usually pointwise convergence is not enough to ensure uniform convergence, of course, but the boundedness and monotonicity of the functions involved cause uniformity here.

If the limit variable $T$ is degenerate, that is, equal to a constant $a$ with probability one, then one may prove that $T_n \to_d a$ if and only if $T_n \to_{pr} a$.

There are important variations on the (A.1) statement. It is equivalent to the requirement that
$$P\{T_n \in A\} \to P\{T \in A\} \quad\text{for all } T\text{-continuous sets } A, \tag{A.2}$$
where a set $A$ is said to be $T$-continuous if $P\{T \in \partial A\} = 0$, where $\partial A$ is the boundary of $A$, that is, $\bar A - A^0$, featuring respectively $\bar A$, the smallest closed set containing $A$, and $A^0$, the largest open set contained in $A$. With $A = (-\infty, t]$, this generalises (A.1). Also, $P\{T_n \in (a,b)\} \to P\{T \in (a,b)\}$ for all intervals $(a,b)$ with endpoints at which the distribution function of $T$ is continuous.


Statements (A.1)–(A.2) are also equivalent to the somewhat more abstract statement that
$$\mathrm{E}h(T_n) = \int h(t)\,dF_n(t) \to \mathrm{E}h(T) = \int h(t)\,dF(t) \tag{A.3}$$
for each bounded and continuous function $h\colon \mathbb{R}\to\mathbb{R}$. This makes it easy to demonstrate the important continuity theorem for convergence in distribution, which is that $T_n \to_d T$ implies $g(T_n) \to_d g(T)$ when $g$ is continuous on a set inside which $T$ falls with probability one. Thus, if $T_n \to_d \mathrm{N}(0,1)$, then $T_n^2 \to_d \chi_1^2$, for example.

Statements (A.2) and (A.3) lend themselves a bit more easily to generalisation to the multidimensional case than does (A.1), so we define
$$T_n = (T_{n,1}, \ldots, T_{n,p})^t \to_d T = (T_1, \ldots, T_p)^t$$
for vectors to mean that (A.2) or (A.3) holds (then they both hold). The continuity theorem generalises without difficulties, via (A.3). Thus, if $T_n \to_d T$, then $\|T_n\| \to_d \|T\|$, and $T_{n,j} \to_d T_j$ for each component $j = 1, \ldots, p$, since the projection functions are continuous; and if $(X_n, Y_n)^t \to_d (X, Y)$, where $X$ and $Y$ are independent standard normals, then $X_n/Y_n$ has a limiting Cauchy distribution, and $X_n^2 + Y_n^2 \to_d \chi_2^2$.
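The continuity-theorem consequences just listed can be illustrated numerically; a small simulation sketch (ours, not from the book), where $X_n$ and $Y_n$ are normalised means of two independent uniform samples:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sims = 250, 10_000

# Normalised means of two independent Uniform(-1, 1) samples: each ->_d N(0, 1).
sd = np.sqrt(1.0 / 3.0)                       # sd of Uniform(-1, 1)
xn = np.sqrt(n) * rng.uniform(-1, 1, (sims, n)).mean(axis=1) / sd
yn = np.sqrt(n) * rng.uniform(-1, 1, (sims, n)).mean(axis=1) / sd

ratio = xn / yn                               # approximately standard Cauchy
q25, q50, q75 = np.quantile(ratio, [0.25, 0.50, 0.75])
chi2_mean = float(np.mean(xn**2 + yn**2))     # approximately E chi^2_2 = 2
```

The empirical quartiles of the ratio sit near $-1$, $0$ and $+1$, the quartiles of the standard Cauchy, and the mean of $X_n^2 + Y_n^2$ is close to $2$, the mean of the $\chi^2_2$.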

There is a further statement equivalent to (A.2) and (A.3) of importance in theoretical and applied probability theory, namely
$$\psi_n(s) = \mathrm{E}\exp(is^tT_n) \to \psi(s) = \mathrm{E}\exp(is^tT) \quad\text{for all } s \in \mathbb{R}^p, \tag{A.4}$$
in terms of the characteristic functions $\psi_n$ for $T_n$ and $\psi$ for $T$, involving the complex number $i = \sqrt{-1}$. In fact, $T_n \to_d T$ if only (A.4) holds for $s$ in a neighbourhood around zero. The (A.4) way of seeing convergence in distribution leads to the Cramér–Wold theorem: $T_n \to_d T$ if and only if there is corresponding convergence in distribution for all linear combinations, that is, $a^tT_n \to_d a^tT$ for each $a \in \mathbb{R}^p$. This makes it sometimes possible to prove multidimensional convergence via one-dimensional work.

A case in point is when the limit variable $T$ has a multinormal distribution. We say that $T = (T_1, \ldots, T_p)^t$ has a multinormal distribution when each linear combination $a^tT = \sum_{j=1}^p a_jT_j$ has a normal distribution. It is characterised by its mean vector $\xi$ and variance matrix $\Sigma$, and we write $T \sim \mathrm{N}_p(\xi,\Sigma)$. This is equivalent to the statement that for each $a$, $a^tT$ should have the normal distribution $\mathrm{N}(a^t\xi, a^t\Sigma a)$ (where a normal variable with zero variance is defined as being equal to its mean with probability one). A formula for the multivariate normal density, in the case of $\Sigma$ being of full rank, is
$$f(y) = (2\pi)^{-p/2}|\Sigma|^{-1/2}\exp\{-\tfrac12(y-\xi)^t\Sigma^{-1}(y-\xi)\},$$
but we are typically more interested in other properties of the multinormal than its density formula. In particular, from the Cramér–Wold theorem, $T_n \to_d \mathrm{N}_p(0,\Sigma)$ is equivalent to $a^tT_n \to_d \mathrm{N}(0, a^t\Sigma a)$ holding for each $a$.
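As a quick check of the density formula, one may verify numerically that it integrates to one; a minimal sketch (ours, not from the book), for a particular choice of $\xi$ and $\Sigma$:

```python
import numpy as np

def mvn_density(y, xi, sigma):
    # f(y) = (2 pi)^(-p/2) |Sigma|^(-1/2) exp{-(1/2)(y - xi)^t Sigma^(-1) (y - xi)}
    p = len(xi)
    diff = np.asarray(y, dtype=float) - xi
    quad = diff @ np.linalg.solve(sigma, diff)
    return float((2 * np.pi) ** (-p / 2) / np.sqrt(np.linalg.det(sigma)) * np.exp(-0.5 * quad))

xi = np.array([1.0, -2.0])
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])

# Riemann-sum check that the density integrates to about one
h = 0.05
xs = np.arange(-8.0, 10.0, h)
ys = np.arange(-11.0, 7.0, h)
inv = np.linalg.inv(sigma)
total = 0.0
for x in xs:
    diffs = np.stack([np.full_like(ys, x) - xi[0], ys - xi[1]], axis=1)
    quads = np.einsum('ij,jk,ik->i', diffs, inv, diffs)
    total += float(np.exp(-0.5 * quads).sum())
total *= h * h / (2 * np.pi * np.sqrt(np.linalg.det(sigma)))
```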

A.3 Central limit theorems and the delta method

The central limit theorem says that statistical averages, suitably normalised, will tend to be normally distributed when the number of random variables involved increases. The one-dimensional precise version of this is that if $X_1, X_2, \ldots$ are independent with the same distribution, and finite mean $\xi$ and


standard deviation $\sigma$, then the normalised average tends to the standard normal:
$$T_n = \frac{\bar X_n - \xi}{\sigma/\sqrt n} = \sqrt n(\bar X_n - \xi)/\sigma \to_d \mathrm{N}(0,1). \tag{A.5}$$
Even though most complete proofs involve mathematical details at a somewhat laborious level, one may write down an essentially correct heuristic proof consisting in showing, via Taylor expansion, that the characteristic function $\psi_n(s)$ of $T_n$ must converge to $\exp(-\tfrac12 s^2)$, the characteristic function of the standard normal.
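A simulation makes (A.5) concrete; the following sketch (ours, not from the book) uses exponential observations, for which $\xi = \sigma = 1$, and compares the distribution of $T_n$ with the standard normal:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)
n, sims = 400, 20_000

# Exponential(1) observations, for which xi = sigma = 1.
x = rng.exponential(1.0, size=(sims, n))
tn = np.sqrt(n) * (x.mean(axis=1) - 1.0)       # the Tn of (A.5)

# Compare P{Tn <= t} with the standard normal cdf at a few points.
Phi = lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0)))
max_err = max(abs(float(np.mean(tn <= t)) - Phi(t)) for t in (-1.5, 0.0, 1.5))
```

Even for a heavily skewed starting distribution the approximation error at $n = 400$ is already down to a couple of percent.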

The multidimensional version of the central limit theorem starts with independent and identically distributed vectors $X_1, X_2, \ldots$ in dimension $p$, with finite mean vector $\xi$ and variance matrix $\Sigma$. Then
$$\sqrt n(\bar X_n - \xi) \to_d \mathrm{N}_p(0,\Sigma). \tag{A.6}$$
This actually follows without much trouble from the one-dimensional version via the Cramér–Wold method.

The central limit theorems, with suitable variations, are powerful tools in theoretical and applied statistics, at several levels. On one side they may be invoked to infer the structure of certain statistical models; for example, if observations $Y_i$ may be seen as products across many independent positive factors, a log-normal model for the data is a logical consequence under suitable conditions. On the other hand, the limit theorems may give good and effective approximations to quite complicated distributions, of estimators, test statistics and other random constructions.

The toolbox is significantly widened with a list of further lemmas, sometimes referred to as the Slutsky–Cramér rules. These are as follows. Assume $T_n \to_d T$ and $U_n \to_{pr} u$, a constant. Then $T_n + U_n \to_d T + u$ and $T_nU_n \to_d Tu$; also, $T_n/U_n \to_d T/u$ if $u$ is not zero. More generally, if $T_n \to_d T$ in $\mathbb{R}^p$ and $U_n \to_{pr} u$ in $\mathbb{R}^q$, then $(T_n, U_n) \to_d (T, u)$ in $\mathbb{R}^{p+q}$, which by the continuity theorem in Section A.2 gives $g(T_n, U_n) \to_d g(T, u)$ for each $g$ that is continuous in $t$ where $T$ lands and continuous at $u$.

One typical application of this is that if $\sqrt n(\hat\theta_n - \theta) \to_d \mathrm{N}(0,\tau^2)$, for a certain limit standard deviation $\tau$, and $\hat\tau_n$ is a consistent estimator of this quantity, then
$$Z_n = \sqrt n(\hat\theta_n - \theta)/\hat\tau_n = \frac{\sqrt n(\hat\theta_n - \theta)}{\tau}\,\frac{\tau}{\hat\tau_n} \to_d \frac{\mathrm{N}(0,\tau^2)}{\tau} =_d \mathrm{N}(0,1).$$

This implies, for example, that
$$P\{\hat\theta_n - 1.96\,\hat\tau_n/\sqrt n \le \theta \le \hat\theta_n + 1.96\,\hat\tau_n/\sqrt n\} = P\{-1.96 \le Z_n \le 1.96\} \to 0.95,$$
that is, $\hat\theta_n \pm 1.96\,\hat\tau_n/\sqrt n$ is an asymptotic 95% confidence interval for $\theta$. For another application, suppose $\hat\Sigma_n$ is a consistent estimator of the limit variance matrix $\Sigma$ in (A.6). Then
$$W_n(\xi) = n(\bar X_n - \xi)^t\hat\Sigma_n^{-1}(\bar X_n - \xi) \to_d U^t\Sigma^{-1}U \sim \chi^2_p,$$
where $U \sim \mathrm{N}_p(0,\Sigma)$ and we have used the classic $\chi^2$ property of the exponent of the multinormal. Thus the random ellipsoid $R_n = \{\xi\colon W_n(\xi) \le \gamma_{p,0.90}\}$, using the upper 0.10 point of the $\chi^2_p$, has confidence level converging to 0.90. The above is used, for example, in Chapters 2, 3 and 9.
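The coverage claim for the random ellipsoid $R_n$ can be checked by simulation; a brief sketch (ours, not from the book), with bivariate data of known mean:

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, sims = 2, 300, 4000
chi2_2_090 = 4.6052                          # upper 0.10 point of chi^2_2: 2 log 10

xi = np.array([1.0, 2.0])
A = np.array([[1.0, 0.0], [0.7, 0.5]])       # X = xi + A Z, so Sigma = A A^t

hits = 0
for _ in range(sims):
    x = xi + rng.standard_normal((n, p)) @ A.T
    xbar = x.mean(axis=0)
    sig_hat = np.cov(x, rowvar=False)        # consistent for Sigma
    dvec = xbar - xi
    wn = n * dvec @ np.linalg.solve(sig_hat, dvec)   # Wn(xi) of the text
    hits += wn <= chi2_2_090
coverage = hits / sims
```

The fraction of simulations with $W_n(\xi) \le \gamma_{2,0.90}$ settles near 0.90, as claimed.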

A further application of the Slutsky–Cramér rules is an effective technique for showing convergence in distribution that we may term the limit distribution by comparison method: if $X_n \to_d X$ and $X_n - Y_n \to_{pr} 0$, then also $Y_n \to_d X$. We sometimes write $X_n \doteq_d Y_n$ to indicate this condition for large-sample equivalence between the variables $X_n$ and $Y_n$ and their distributions.

For regression situations and in other contexts where variables are independent, but not identically distributed, one needs appropriate generalisations of (A.5) and (A.6). These take many forms, suitable


for different occasions. The Lindeberg theorem is among the more general one-dimensional generalisations of the central limit theorem. Let $X_1, X_2, \ldots$ be independent observations from distributions with means zero and standard deviations $\sigma_1, \sigma_2, \ldots$. Then
$$Z_n = \sum_{i=1}^n X_i/B_n = \frac{\sum_{i=1}^n X_i}{(\sum_{i=1}^n \sigma_i^2)^{1/2}} \to_d \mathrm{N}(0,1)$$
provided the so-called Lindeberg condition holds,
$$L_n(\varepsilon) = \sum_{i=1}^n \mathrm{E}\Bigl|\frac{X_i}{B_n}\Bigr|^2 I\Bigl\{\Bigl|\frac{X_i}{B_n}\Bigr| \ge \varepsilon\Bigr\} \to 0 \quad\text{for each } \varepsilon > 0. \tag{A.7}$$

This condition ensures that grave imbalance does not occur, in the sense of a few terms being allowed to dominate over the rest. Requirements of this sort are sometimes referred to as asymptotic stability conditions. A sometimes simpler statement to check, which implies the Lindeberg condition (A.7), is the so-called Lyapunov condition $\sum_{i=1}^n \mathrm{E}|X_i/B_n|^3 \to 0$. Proving that this implies (A.7) is taken care of in Exercise A.6.

There are also multidimensional Lindeberg theorems, in different variants, essentially obtained by combining the unidimensional version with the Cramér–Wold method. One variant, found useful in Chapters 2, 3 and 11, is as follows. Let $X_1, X_2, \ldots$ be independent vectors in $\mathbb{R}^p$ coming from distributions with means $\xi_1, \xi_2, \ldots$ and variance matrices $\Sigma_1, \Sigma_2, \ldots$, and assume that $n^{-1}(\Sigma_1 + \cdots + \Sigma_n) \to \Sigma$. If $\sum_{i=1}^n \mathrm{E}\|X_i - \xi_i\|^2 I\{\|X_i - \xi_i\| \ge \sqrt n\,\varepsilon\} \to 0$ for each positive $\varepsilon$, then
$$n^{-1/2}\sum_{i=1}^n (X_i - \xi_i) \to_d \mathrm{N}_p(0,\Sigma). \tag{A.8}$$

We note that results like (A.8) may be formulated in a 'triangular' fashion, with variables $X_{n,1}, \ldots, X_{n,n}$ with means $\xi_{n,1}, \ldots, \xi_{n,n}$ and so on at stage $n$.

Our final entry in this section on convergence in distribution is the delta method. In its simpler and typical form, it says that
$$\sqrt n(T_n - a) \to_d Z \quad\text{implies}\quad \sqrt n\{g(T_n) - g(a)\} \to_d g'(a)Z,$$

provided the function $g$, defined on an interval in which $T_n$ falls with probability tending to one, has a derivative in an interval around $a$ that is continuous at that point. The proof is instructive: $g(T_n) - g(a) = g'(U_n)(T_n - a)$, for some $U_n$ sandwiched between $a$ and $T_n$, and $T_n \to_{pr} a$. Hence
$$\sqrt n\{g(T_n) - g(a)\} = g'(U_n)\sqrt n(T_n - a) \to_d g'(a)Z,$$
by the Slutsky–Cramér rules above. In various applications $Z$ is a normal, say $\mathrm{N}(\xi,\tau^2)$, in which case $\sqrt n\{g(T_n) - g(a)\}$ converges in distribution to $\mathrm{N}(g'(a)\xi, g'(a)^2\tau^2)$.
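The delta method is easily illustrated by simulation; a short sketch (ours, not from the book), with $g(x) = x^2$, $T_n$ the average of $n$ unit exponentials and hence $a = 1$, so the limit standard deviation should be $|g'(1)| \cdot 1 = 2$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, sims = 400, 10_000

x = rng.exponential(1.0, size=(sims, n))
tn = x.mean(axis=1)                       # Tn = average of Exp(1), so a = 1
lhs = np.sqrt(n) * (tn**2 - 1.0)          # sqrt(n){g(Tn) - g(a)}, g(x) = x^2

sd_emp = float(lhs.std())                 # should be close to |g'(1)| = 2
```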

The linearisation argument may be lifted to higher dimensions, where it gives the following. If $\sqrt n(T_n - a) \to_d Z$ in $\mathbb{R}^p$, and $g$ is a real-valued function defined for $T_n$, and smooth around $a$, then
$$\sqrt n\{g(T_n) - g(a)\} \to_d \nabla g(a)^tZ = \sum_{j=1}^p \frac{\partial g(a)}{\partial a_j}\,Z_j. \tag{A.9}$$

The limit distribution is normal if $Z$ is multivariate normal.

Sometimes certain variations on the delta method are called for, having to do with the contiguous circumstances where the underlying structure is not quite standing still, but changing with $n$. A one-dimensional version is as follows, achieved by slight variations of the preceding arguments:
$$\sqrt n(T_n - a) \to_d Z \quad\text{implies}\quad \sqrt n\{g(T_n) - g(a + \delta/\sqrt n)\} \to_d g'(a)(Z - \delta).$$


Similarly, (A.9) may be extended as follows: if $\sqrt n(T_n - a) \to_d Z$ in $\mathbb{R}^p$, then
$$\sqrt n\{g(T_n) - g(a + \delta/\sqrt n)\} \to_d \nabla g(a)^t(Z - \delta),$$
where again $\nabla g(a)$ is the gradient of $g$ evaluated at the point $a$.

A.4 Minimisers of random convex functions

Various estimators and test statistics can be seen as minimisers of suitable criterion functions. The maximum likelihood estimator minimises the negative log-likelihood function, say $-\ell_n(\theta) = -\sum_{i=1}^n \log f(Y_i,\theta)$ in a situation with independent observations. When the criterion function in question is convex, the following couple of lemmas often suffice for proving consistency and asymptotic normality of the minimiser. This can sometimes mean less effort and fewer regularity conditions than other methods of proof that typically rely on Taylor expansions and elaborate control of remainder terms. Methods and results reviewed in this section are essentially from Hjort and Pollard (1993), where a generous list of applications may be found.

Lemma A.1 (From pointwise to uniform) Suppose $A_n(s)$ is a sequence of convex random functions defined on an open convex set $S$ in $\mathbb{R}^p$, which converges in probability to some $A(s)$, for each $s$. Then $\sup_{s\in K}|A_n(s) - A(s)|$ goes to zero in probability, for each compact subset $K$ of $S$.

This is proved in Andersen and Gill (1982, appendix), crediting T. Brown, via diagonal subsequencing and an appeal to a corresponding nonstochastic result (see Rockafellar (1970, Theorem 10.8)). For a direct proof, see Pollard (1991, section 6).

A convex function is continuous and attains its minimum on compact sets, but it can be flat at its bottom and have several minima. For simplicity let us speak about 'the argmin' when referring to any of the possible minimisers. The argmin can always be selected in a measurable way, as explained in Niemiro (1992, p. 1531), for example.

Lemma A.2 Let the situation be as in Lemma A.1 and assume in addition that the limit function $A$ has a unique argmin $s_0$. Then the argmin of $A_n$ converges in probability to $s_0$.

This may be proved from the first lemma and is quite useful in establishing consistency in various situations. If the log-likelihood function $\ell_n(\theta)$ for a certain model is concave, and $n^{-1}\ell_n(\theta)$ converges pointwise in probability to some limit function $\lambda(\theta)$ with a unique argmax $\theta_0$, then this effectively defines 'the least false parameter', and the maximum likelihood estimator is consistent for this quantity. Note that this result did not require any laborious assumptions about differentiability of densities, and so forth.

Lemma A.3 (Nearness of argmins) Suppose $A_n(s)$ is convex as in Lemma A.1 and is approximated by $B_n(s)$. Let $\alpha_n$ be the argmin of $A_n$ and assume that $B_n$ has a unique argmin $\beta_n$. Then there is a probabilistic bound on how far $\alpha_n$ can be from $\beta_n$: for each $\delta > 0$, $P\{|\alpha_n - \beta_n| \ge \delta\} \le P\{\Delta_n(\delta) \ge \tfrac12 h_n(\delta)\}$, where
$$\Delta_n(\delta) = \sup_{|s-\beta_n|\le\delta} |A_n(s) - B_n(s)| \quad\text{and}\quad h_n(\delta) = \inf_{|s-\beta_n|=\delta} B_n(s) - B_n(\beta_n).$$

The lemma as stated has nothing to do with convergence, or indeed with the '$n$' subscript at all, but is stated in a form useful for typical purposes. It is proven in Hjort and Pollard (1993); see also Exercise A.7. The two lemmas sometimes deliver more than mere consistency when applied to suitably rescaled and recentred versions of convex processes.


We record a couple of useful implications of Lemma A.3. If $A_n - B_n$ goes to zero uniformly on bounded sets in probability, and $\beta_n$ is stochastically bounded, then $\Delta_n(\delta) \to_{pr} 0$ by a simple argument. It follows that $\alpha_n - \beta_n \to_{pr} 0$ provided only that $1/h_n(\delta)$ is stochastically bounded for each fixed $\delta$. This last requirement says that $B_n$ should not flatten out around its minimum as $n$ increases.

Lemma A.4 Suppose $A_n(s)$ is convex and can be represented as $\tfrac12 s^tVs + s^tU_n + C_n + r_n(s)$, where $V$ is symmetric and positive definite, $U_n$ is stochastically bounded, $C_n$ is arbitrary, and $r_n(s)$ goes to zero in probability for each $s$. Then $\alpha_n$, the argmin of $A_n$, is only $o_{pr}(1)$ away from $\beta_n = -V^{-1}U_n$, the argmin of $\tfrac12 s^tVs + s^tU_n + C_n$. If also $U_n \to_d U$ then $\alpha_n \to_d -V^{-1}U$.
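A small numerical illustration of Lemma A.4 (ours, not from the book): with a convex quartic remainder $r_n(s) = \|s\|^4/n$ that vanishes pointwise, the argmin of $A_n$, found here by Newton's method, should lie very close to $\beta_n = -V^{-1}U_n$:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
V = np.array([[2.0, 0.5], [0.5, 1.0]])      # symmetric, positive definite
U_n = rng.standard_normal(2)

def grad(s):
    # gradient of A_n(s) = (1/2) s^t V s + s^t U_n + r_n(s), r_n(s) = ||s||^4 / n
    return V @ s + U_n + 4.0 * np.dot(s, s) * s / n

def hess(s):
    return V + (8.0 * np.outer(s, s) + 4.0 * np.dot(s, s) * np.eye(2)) / n

s = np.zeros(2)                              # Newton's method for the argmin alpha_n
for _ in range(50):
    s = s - np.linalg.solve(hess(s), grad(s))

alpha_n = s
beta_n = -np.linalg.solve(V, U_n)            # argmin of the quadratic part alone
gap = float(np.linalg.norm(alpha_n - beta_n))
```

The gap between the two argmins is of order $1/n$ here, in line with the lemma's $o_{pr}(1)$ conclusion.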

One may use this lemma in various applications, often to demonstrate limiting normality of certain variables, and with few and clean regularity conditions. One such application is to prove the 'master lemma' 2.1 for the case of parametric models with densities log-concave in the parameter. This also leads to easier proofs for some of the other chief results in Chapter 2, which again are behind some of the key applications in Chapters 3 and 4. Hjort and Pollard (1993) use these techniques to establish limiting normality of Cox's maximum partial likelihood estimator in proportional hazards models, under minimal conditions.

For concreteness and completeness let us illustrate the general method and its details for the case of the Poisson regression model, where $Y_1, \ldots, Y_n$ are independent Poisson variates with means $\mu_i = \exp(x_i^t\beta)$, in terms of covariate vectors $x_i$ of dimension say $p$. These details, along with Lemma A.4 here, give a full proof of the master Lemma 2.1 and hence of various consequences derived and discussed in Chapters 2 and 3. Other regression models may be handled similarly; see Exercise A.8.

With $\ell_n(\beta) = \sum_{i=1}^n(-\mu_i + y_i\log\mu_i) = \sum_{i=1}^n\{y_ix_i^t\beta - \exp(x_i^t\beta)\}$ the log-likelihood function (dropping the $-\log y_i!$ terms, which do not involve $\beta$), one finds after a little algebra that
$$A_n(s) = \ell_n(\beta_0 + s/\sqrt n) - \ell_n(\beta_0) = \sum_{i=1}^n\bigl[s^tx_iy_i/\sqrt n - \{\exp(x_i^t(\beta_0 + s/\sqrt n)) - \exp(x_i^t\beta_0)\}\bigr].$$

The function $A_n(\cdot)$ is concave since $\ell_n(\cdot)$ is. Here $\beta_0$ is the true parameter, and we write $\mu_{i,0} = \exp(x_i^t\beta_0)$ for the true means. Writing $\exp(u) = 1 + u + \tfrac12 u^2 + \delta(u)$, with $|\delta(u)| \le |u|^3$ for $|u| \le 1$, we have
$$A_n(s) = s^tU_n - \tfrac12 s^tJ_ns + r_n(s),$$
where $U_n = n^{-1/2}\sum_{i=1}^n x_i(y_i - \mu_{i,0})$, $J_n = n^{-1}\sum_{i=1}^n \mu_{i,0}x_ix_i^t$ and $r_n(s) = -\sum_{i=1}^n \mu_{i,0}\,\delta(x_i^ts/\sqrt n)$.

Suppose
$$J_n \to J \quad\text{and}\quad \Delta_n/\sqrt n = (1/\sqrt n)\max_{i\le n}\|x_i\| \to 0,$$
conditions which are met, for example, when the $x_i$ are drawn from a covariate distribution with finite variance matrix; see Exercise A.8. To apply Lemma A.4 we need to prove (1) that $U_n \to_d U \sim \mathrm{N}_p(0,J)$ and (2) that $r_n(s) \to_{pr} 0$. For the latter,

$$|r_n(s)| \le \tfrac16\sum_{i=1}^n \exp(x_i^t\beta_0)\,|x_i^ts/\sqrt n|^3 \le \tfrac16\,n^{-1/2}\,s^tJ_ns\,\Delta_n\|s\|\exp(\Delta_n\|\beta_0\|),$$
valid for all large $n$, and the upper bound here tends to zero. A bit of work shows finally that precisely the same conditions secure the multidimensional version of the Lindeberg property (A.7) for $U_n$, and hence (1).


A.5 Likelihood inference outside model conditions

Various statistical estimation and inference methods are constructed under suitable 'ideal conditions', for example involving model specifications. It is then important to understand how methods behave outside these home turf conditions. An initial look into these issues is provided in Section 2.6. Here we offer a fuller discussion, with relevant details and illustrations.

We start with the i.i.d. framework, with observations $Y_1, Y_2, \ldots$ stemming from some underlying but unknown density $g(y)$, which we attempt to approximate using the parametric family $f_\theta(y) = f(y,\theta)$, with a $p$-dimensional $\theta$ belonging to a parameter space $\Theta$. By the law of large numbers,
$$\lambda_n(\theta) = n^{-1}\ell_n(\theta) = n^{-1}\sum_{i=1}^n \log f(Y_i,\theta) \to_{\mathrm{a.s.}} \lambda(\theta) = \int g(y)\log f(y,\theta)\,dy = \int g\log f_\theta\,dy$$

for each $\theta$, provided only that this integral is finite. In most situations there will be a unique $\theta_0$ maximising this limit function, also solving the equation
$$\partial\lambda/\partial\theta = \int g(y)\,u(y,\theta_0)\,dy = \mathrm{E}_g\,u(Y,\theta_0) = 0,$$
in terms of the score function $u(y,\theta) = \partial\log f(y,\theta)/\partial\theta$.

This parameter value is also the minimiser of the Kullback–Leibler distance
$$\mathrm{KL}(g, f_\theta) = \int g\log(g/f_\theta)\,dy. \tag{A.10}$$

With mild further regularity conditions the argmax of $\lambda_n$, which is the maximum likelihood estimator $\hat\theta_n$, will converge in probability to the argmax of $\lambda$, that is, $\theta_0$. Thus the maximum likelihood estimator aims at this least false parameter value, identified by making the distance from the true data generating mechanism $g$ to the parametric family $f_\theta$ as small as possible. For some further remarks and illustrations, see Exercise A.10.

There is also an appropriate extension of Theorem 2.2 to the present agnostic state of affairs. Following the line of argument used in the proof of that theorem, we now need two matrices rather than merely one to describe what takes place:
$$J = -\mathrm{E}_g\,\frac{\partial^2\log f(Y,\theta_0)}{\partial\theta\,\partial\theta^t} = -\int g(y)\,\frac{\partial^2\log f(y,\theta_0)}{\partial\theta\,\partial\theta^t}\,dy, \qquad K = \mathrm{Var}_g\,u(Y,\theta_0) = \int g(y)\,u(y,\theta_0)u(y,\theta_0)^t\,dy,$$

the subscript '$g$' indicating that the mean and variance operations in question are taken under the true mechanism $g$. We see that $U_n = n^{-1/2}\sum_{i=1}^n u(Y_i,\theta_0)$ now tends in distribution to a $\mathrm{N}_p(0,K)$, and find
$$A_n(s) = \ell_n(\theta_0 + s/\sqrt n) - \ell_n(\theta_0) = s^tU_n - \tfrac12 s^tJ_ns + r_n(s) \to_d A(s) = s^tU - \tfrac12 s^tJs$$
under mild conditions; see, for example, Hjort and Pollard (1993) and Claeskens and Hjort (2008). We infer by the argmax argument that $\mathrm{argmax}(A_n) \to_d \mathrm{argmax}(A)$, yielding
$$\sqrt n(\hat\theta_n - \theta_0) \to_d J^{-1}U \sim \mathrm{N}_p(0, J^{-1}KJ^{-1}) \tag{A.11}$$


with the so-called sandwich matrix for its variance matrix. When the model is actually correct, the Kullback–Leibler distance from $g$ to $f(\cdot,\theta_0)$ is zero and the two matrices $J$ and $K$ are identical, pointing back to Theorems 2.2 and 2.4.

We take time here to highlight a useful implicit consequence of these arguments concerning the maximum likelihood estimator, namely that
$$\hat\theta = \theta_0 + n^{-1/2}J^{-1}U_n + o_{pr}(n^{-1/2}) \tag{A.12}$$
with $U_n = n^{-1/2}\ell_n'(\theta_0) \to_d U \sim \mathrm{N}_p(0,K)$. This is at the heart of (A.11) and will be found useful later, for example in Section A.7.

It is important to realise that even though $g$ is and will remain unknown, lying typically also outside the reach of the parametric model, we may still estimate $J$ and $K$ consistently from data, making inference about the least false parameter $\theta_0$ feasible. We may use
$$\hat J = -n^{-1}\sum_{i=1}^n \frac{\partial^2\log f(Y_i,\hat\theta)}{\partial\theta\,\partial\theta^t} \quad\text{and}\quad \hat K = n^{-1}\sum_{i=1}^n u(Y_i,\hat\theta)\,u(Y_i,\hat\theta)^t, \tag{A.13}$$
and consequently $\hat J^{-1}\hat K\hat J^{-1}$ for the sandwich matrix of the limit distribution.

It is worthwhile recording that $J$ and $K$ take simpler forms when working with models of the exponential family type, as in Chapter 8. Suppose the density is of the form $\exp\{\theta^tT(y) - k(\theta) + m(y)\}$, for appropriate $T(y) = (T_1(y), \ldots, T_p(y))$ and normalising factor $k(\theta)$. Then the general recipes above lead to $J = J(\theta) = k''(\theta)$ and $K = \mathrm{Var}_g\,T(Y)$, with estimates $\hat J = J(\hat\theta)$ and $\hat K = n^{-1}\sum_{i=1}^n\{T(Y_i) - \bar T\}\{T(Y_i) - \bar T\}^t$, the empirical variance matrix of the $n$ vectors $T(Y_i)$.
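A numerical sketch of the (A.13) recipe (ours, not from the book): fit the one-parameter exponential model $f(y,\theta) = \theta e^{-\theta y}$, with score $u(y,\theta) = 1/\theta - y$, to Gamma$(2,1)$ data. The least false value is $\theta_0 = 1/\mathrm{E}Y = 1/2$, with $J = 1/\theta_0^2 = 4$, $K = \mathrm{Var}\,Y = 2$ and sandwich variance $J^{-1}KJ^{-1} = 0.125$, clearly different from the model-based $J^{-1} = 0.25$:

```python
import numpy as np

rng = np.random.default_rng(6)
n, sims = 2000, 4000

# Model f(y, theta) = theta exp(-theta y), score u(y, theta) = 1/theta - y.
# Truth: Gamma(2, 1), so theta_0 = 1/2, J = 4, K = 2, J^-1 K J^-1 = 0.125.
est = np.empty(sims)
for s in range(sims):
    est[s] = 1.0 / rng.gamma(2.0, 1.0, n).mean()   # the ML estimator 1/ybar

var_emp = float(n * est.var())                     # variance of sqrt(n)(theta_hat - theta_0)

# plug-in version of (A.13), from a single fresh sample
y = rng.gamma(2.0, 1.0, n)
th = 1.0 / y.mean()
J_hat = 1.0 / th**2                                # minus mean second derivative of log f
K_hat = float(np.mean((1.0 / th - y) ** 2))        # mean squared score
sandwich = K_hat / J_hat**2                        # J^-1 K J^-1 in this scalar case
```

Both the Monte Carlo variance and the plug-in sandwich land near 0.125, while the naive model-based value 0.25 would be badly wrong here.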

Example A.1 Agnostic use of the normal model

Consider the normal $\mathrm{N}(\mu,\sigma^2)$ model for i.i.d. data, now viewed as stemming from some underlying and unknown density $g$ rather than necessarily from a normal itself. Assume that $g$ has finite mean $\mu_0$ and standard deviation $\sigma_0$. The model's log-density is $-\log\sigma - \tfrac12(1/\sigma^2)(y-\mu)^2 - \tfrac12\log(2\pi)$, leading to
$$\int g\log f_\theta\,dy = -\log\sigma - \tfrac12(1/\sigma^2)\{\sigma_0^2 + (\mu-\mu_0)^2\} - \tfrac12\log(2\pi).$$

This is maximised for $(\mu,\sigma)$ equal to $(\mu_0,\sigma_0)$, so these are the least false parameter values in this situation. It is noteworthy that we may infer that the maximum likelihood estimators $\hat\mu$ and $\hat\sigma$ converge in probability to the true mean $\mu_0$ and true standard deviation $\sigma_0$ from the general results above, without even knowing any formulae for these estimators (cf. (2.15) of Example 2.3).

To employ result (A.11) we assume that $g$ also has finite fourth moment, so that the skewness and kurtosis parameters
$$\gamma_3 = \mathrm{E}_g\{(Y-\mu_0)/\sigma_0\}^3 \quad\text{and}\quad \gamma_4 = \mathrm{E}_g\{(Y-\mu_0)/\sigma_0\}^4 - 3$$

exist; these are both zero under normality. Taking derivatives of the log-density leads to
$$\partial\log f/\partial\mu = (y-\mu)/\sigma^2 = \varepsilon/\sigma, \qquad \partial\log f/\partial\sigma = -1/\sigma + (1/\sigma^3)(y-\mu)^2 = (\varepsilon^2-1)/\sigma,$$

in terms of $\varepsilon = (y-\mu)/\sigma$. With a bit of further effort one finds
$$J = \frac{1}{\sigma_0^2}\begin{pmatrix} 1 & 0\\ 0 & 2 \end{pmatrix} \quad\text{and}\quad K = \frac{1}{\sigma_0^2}\begin{pmatrix} 1 & \gamma_3\\ \gamma_3 & 2+\gamma_4 \end{pmatrix}.$$


Hence
$$\begin{pmatrix} \sqrt n(\hat\mu - \mu_0)\\ \sqrt n(\hat\sigma - \sigma_0) \end{pmatrix} \to_d \mathrm{N}_2(0,\Sigma),$$
say, where the sandwich matrix on this occasion is
$$\Sigma = J^{-1}KJ^{-1} = \sigma_0^2\begin{pmatrix} 1 & \tfrac12\gamma_3\\ \tfrac12\gamma_3 & \tfrac12 + \tfrac14\gamma_4 \end{pmatrix}.$$

This essentially tells us that inference for the mean parameter $\mu$ is not really affected by the normality assumption being wrong (confidence intervals and so forth are nevertheless first-order correct), but that normality-based inference for the spread parameter $\sigma$ is rather more vulnerable. Specifically, confidence intervals are not affected by skewness but may need a kurtosis correction if approximate normality of the observations cannot be trusted.
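The kurtosis correction can be seen in simulation; a sketch (ours, not from the book), using Gamma$(4,1)$ data, for which $\sigma_0 = 2$ and $\gamma_4 = 6/4 = 1.5$, so the sandwich variance for $\sqrt n(\hat\sigma - \sigma_0)$ is $\sigma_0^2(\tfrac12 + \tfrac14\gamma_4) = 3.5$, while the normality-based value $\sigma_0^2/2 = 2$ is too small:

```python
import numpy as np

rng = np.random.default_rng(7)
sigma0, gamma4 = 2.0, 1.5                  # Gamma(4, 1): sigma0 = 2, gamma4 = 6/4
n, sims = 2000, 4000

sig_hat = np.empty(sims)
for s in range(sims):
    y = rng.gamma(4.0, 1.0, n)
    sig_hat[s] = np.sqrt(np.mean((y - y.mean()) ** 2))   # ML estimator of sigma

var_emp = float(n * sig_hat.var())
var_sandwich = sigma0**2 * (0.5 + gamma4 / 4.0)          # = 3.5
var_normal_theory = sigma0**2 / 2.0                      # = 2.0, too optimistic here
```

The Monte Carlo variance tracks the sandwich value 3.5, so normality-based intervals for $\sigma$ would be markedly too short for such data.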

The preceding results generalise with moderate effort to regression models, with appropriate modifications of definitions and arguments. Suppose the model employs a density of the form $f(y_i\,|\,x_i,\theta)$ for $Y_i\,|\,x_i$, but that the real state of affairs is some $g(y\,|\,x_i)$, for observations $Y_1, \ldots, Y_n$ that are independent given the covariate vectors $x_1, \ldots, x_n$. There is then a Kullback–Leibler distance
$$\mathrm{KL}(\theta,x) = \mathrm{KL}(g(\cdot\,|\,x), f(\cdot\,|\,x,\theta)) = \int g(y\,|\,x)\log\frac{g(y\,|\,x)}{f(y\,|\,x,\theta)}\,dy$$
from the $x$-conditional truth to the $x$-conditional model density. The maximum likelihood estimator $\hat\theta_n$ maximising the log-likelihood $\ell_n(\theta) = \sum_{i=1}^n \log f(Y_i\,|\,x_i,\theta)$ converges under natural conditions to the least false parameter value $\theta_0$ that minimises the $x$-weighted Kullback–Leibler distance
$$\mathrm{KL}(g(\cdot\,|\,\cdot), f(\cdot\,|\,\cdot,\theta)) = \int \mathrm{KL}(\theta,x)\,dQ(x) = \int\!\!\int g(y\,|\,x)\log\frac{g(y\,|\,x)}{f(y\,|\,x,\theta)}\,dy\,dQ(x), \tag{A.14}$$
in terms of a hypothesised covariate distribution $Q$, seen as the limit of the empirical covariate distribution $Q_n$ that places probability mass $1/n$ at each $x_i$.

There is also an appropriate generalisation of (A.11) to regression models, namely that
$$\sqrt n(\hat\theta_n - \theta_{0,n}) \to_d \mathrm{N}_p(0, J^{-1}KJ^{-1}),$$
in terms of the least false parameter $\theta_{0,n}$ that minimises (A.14), but with the empirical $Q_n$ rather than its limit $Q$, and where $J$ and $K$ are defined as the limits of respectively $J_n$ and $K_n$, where
$$J_n = -n^{-1}\sum_{i=1}^n \mathrm{E}_g\,\frac{\partial^2\log f(Y_i\,|\,x_i,\theta_{0,n})}{\partial\theta\,\partial\theta^t}, \qquad K_n = n^{-1}\sum_{i=1}^n \mathrm{Var}_g\,u(Y_i\,|\,x_i,\theta_{0,n}).$$

Estimates of these are produced by inserting the maximum likelihood estimator $\hat\theta_n$ for $\theta_{0,n}$. For further discussion and illustration regarding these issues, see Hjort and Claeskens (2003a) and Claeskens and Hjort (2008).


Example A.2 Poisson regression with overdispersion

The most frequently used regression model for count data holds that $Y_i\,|\,x_i$ is Poisson with mean parameter $\mu_i = \exp(x_i^t\beta)$. The log-likelihood function then becomes
$$\ell_n(\beta) = \sum_{i=1}^n\{-\mu_i + Y_i\log\mu_i - \log(Y_i!)\} = \sum_{i=1}^n\{-\exp(x_i^t\beta) + Y_ix_i^t\beta - \log(Y_i!)\},$$
leading also to $J_n = n^{-1}\sum_{i=1}^n \mu_ix_ix_i^t$ in the aforementioned notation. Ordinary analysis, as implemented in most statistical software packages, proceeds via the approximation $\hat\beta \approx \mathrm{N}_p(\beta, \hat J_n^{-1}/n)$, where $\hat J_n$ is as $J_n$ but with $\hat\mu_i = \exp(x_i^t\hat\beta)$ inserted for $\mu_i$.

Assume now that there actually is overdispersion, with $\mathrm{Var}(Y_i\,|\,x_i) = (1+d)\mu_i$ for some positive $d$. Ordinary Poisson regression calculus as summarised earlier will then be in error, leading, for example, to too narrow confidence intervals (by a factor of $(1+d)^{1/2}$, actually) and too small p-values (for an illustration, see Exercise 4.18). The model robust viewpoint leads however to a correct analysis, as we find $K_n = (1+d)J_n$, with consequent sandwich matrix $J_n^{-1}K_nJ_n^{-1} = (1+d)J_n^{-1}$. The general matrix estimation recipe above corresponds here to using
$$\hat J_n = n^{-1}\sum_{i=1}^n \hat\mu_ix_ix_i^t \quad\text{and}\quad \hat K_n = n^{-1}\sum_{i=1}^n (Y_i - \hat\mu_i)^2x_ix_i^t,$$
and indeed $\hat J_n^{-1}\hat K_n\hat J_n^{-1}$ will give the correct amount of uncertainty for $\hat\beta$.
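A simulation sketch of this example (ours, not from the book): negative binomial counts, generated as a gamma–Poisson mixture so that $\mathrm{Var}(Y_i\,|\,x_i) = (1+d)\mu_i$, fitted by ordinary Poisson regression via Newton's method; the empirical variance of $\hat\beta_1$ should match the sandwich assessment and exceed the naive one by the factor $1+d$:

```python
import numpy as np

rng = np.random.default_rng(8)
d = 1.0                                     # overdispersion: Var(Y | x) = (1 + d) mu
n, sims = 1500, 600
beta0 = np.array([0.5, 0.3])

def fit_poisson(X, y, iters=25):
    # Newton-Raphson for the Poisson regression maximum likelihood estimator
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = np.exp(X @ b)
        b = b + np.linalg.solve(X.T @ (X * mu[:, None]), X.T @ (y - mu))
    return b

X = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, n)])
mu_true = np.exp(X @ beta0)
r = mu_true / d                             # gamma-Poisson mixture: mean mu, var (1 + d) mu

betas = np.empty((sims, 2))
for s in range(sims):
    y = rng.poisson(rng.gamma(r, d))        # negative binomial counts
    betas[s] = fit_poisson(X, y)

emp_var = float(n * betas[:, 1].var())      # empirical variance of sqrt(n) beta_1_hat

mu_hat = np.exp(X @ betas.mean(axis=0))
Jn = X.T @ (X * mu_hat[:, None]) / n
naive = np.linalg.inv(Jn)[1, 1]             # ordinary Poisson assessment, too small
sandwich = (1 + d) * naive                  # Kn = (1 + d) Jn in this example
```

With $d = 1$ the naive Poisson variance is off by a factor of about two, while the sandwich assessment matches the Monte Carlo truth.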

Finally in this section we investigate the consequences for the profile log-likelihood statistic $D_n(\psi_0)$ of (2.17) of the outside-the-model view, where $\psi = a(\theta)$ is a focus parameter. Following the chain of arguments of the proof of Theorem 2.4, also involving the vector $w = \partial a(\theta_0)/\partial\theta$, one finds that
$$D_n(\psi_0) = 2\{\ell_{n,\mathrm{prof}}(\hat\psi) - \ell_{n,\mathrm{prof}}(\psi_0)\} = \frac{n(\hat\psi - \psi_0)^2}{w^tJ_n^{-1}w} + o_{pr}(1) \to_d \frac{(w^tJ^{-1}U)^2}{w^tJ^{-1}w}.$$

This is now a scaled squared standard normal rather than merely a squared standard normal. In fact,
$$D_n(\psi_0) \to_d k\,\chi_1^2 \quad\text{with}\quad k = \frac{w^tJ^{-1}KJ^{-1}w}{w^tJ^{-1}w}. \tag{A.15}$$
The deviance must therefore be used with caution if there is more than a small doubt about the adequacy of the model used. One may construct an appropriate model robust version, which is guaranteed always to have the right $\chi_1^2$ limit, namely
$$D_n^*(\psi_0) = (1/\hat k)\,D_n(\psi_0) = \frac{w^t\hat J^{-1}w}{w^t\hat J^{-1}\hat K\hat J^{-1}w}\,2\{\ell_{n,\mathrm{prof}}(\hat\psi) - \ell_{n,\mathrm{prof}}(\psi_0)\},$$
with $\hat k$ the plug-in estimate of $k$ based on (A.13).

In the Poisson regression case with overdispersion, for example, one would have $k = 1 + d$. This model robust perspective invites the modified confidence curve
$$\mathrm{cc}^*(\psi) = \Gamma_1\Bigl(\frac{w^t\hat J^{-1}w}{w^t\hat J^{-1}\hat K\hat J^{-1}w}\,D_n(\psi)\Bigr), \tag{A.16}$$
with $\Gamma_1$ the distribution function of the $\chi_1^2$.


A.6 Robust parametric inference

In Section 2.6 we took up the general theme of parametric robustness. One of the aims there is to avoid abandoning a parametric model even when it is considered a reasonable approximation, as opposed to a precise description of how data are generated. For various applications involving the most frequently used models, for example those with Gaussian assumptions, it becomes important to safeguard against outliers and overly influential data points. Rather than discarding such influential observations, the intention when using minimum divergence methodology, as with (2.25) and (2.29), is to modify the maximum likelihood scheme itself.

Assume the parametric family {f_θ : θ ∈ Θ} is used to approximate a density g potentially positioned outside the family, and consider

d_a(g, f_θ) = ∫ { f_θ^{1+a} − (1 + 1/a) g f_θ^a + (1/a) g^{1+a} } dy.  (A.17)

Here a is a positive parameter. One may show that the integrand here is always nonnegative, and equal to zero only in points or intervals where f_θ = g (see Exercise A.13). Thus (A.17) may be seen as a (directed) distance function, from a fixed g to the approximation f_θ, with the distance being equal to zero only if g = f_θ for almost all y.

The case a = 1 corresponds to the L₂ distance ∫(g − f_θ)² dy, as is readily seen. Also, importantly, as a → 0 the d_a(g, f_θ) distance tends to the Kullback–Leibler distance KL(g, f_θ) of (A.10); see Exercise A.13. The d_a distance was introduced in Basu et al. (1998), along with the estimation procedure to be described next, with further analysis and comparisons with other methods offered in Jones et al. (2001). Essentially, using d_a with a small a affords significant improvements in robustness and resistance to the otherwise too strong influence of outliers, while losing very little in efficiency.

To develop an estimation procedure associated with d_a(g, f_θ) of (A.17), note that the last term (1/a)∫g^{1+a} dy does not depend on θ, so if we can estimate the two first terms and then minimise over θ we have a sensible method of the minimum distance kind. The canonical criterion function is

H_n(θ) = ∫ f_θ^{1+a} dy − (1 + 1/a) n^{-1} Σ_{i=1}^n f(y_i, θ)^a + 1/a,  (A.18)

and the d_a estimator is the parameter value θ̂ = θ̂(a) that minimises this function. The last term 1/a does not really matter and may be omitted; its presence is merely cosmetic, in that H_n relates more smoothly to its log-likelihood limit as a grows small. Specifically, H_n(θ) then tends to −n^{-1}ℓ_n(θ), and the d_a estimate θ̂(a) tends to the maximum likelihood estimate θ̂_ML, as a → 0.

Example A.3 Robust analysis of the normal model

Consider the normal model for i.i.d. data, as in Example 2.3, but now with the intention to provide more robust estimators for its parameters than the maximum likelihood ones (given in (2.15)). We calculate

A_a(σ) = ∫ {φ((y − μ)/σ)(1/σ)}^{1+a} dy = σ^{-a} (2π)^{-a/2} (1 + a)^{-1/2}

(it does not depend on μ), and the criterion function to minimise to find estimates of (μ, σ) is

H_n(μ, σ) = A_a(σ) − (1 + 1/a) n^{-1} Σ_{i=1}^n {φ((y_i − μ)/σ)(1/σ)}^a + 1/a.

Page 124: 12...−3.842,−3.284,−0.278,2.240,3.632, and pointed to certain complexities having to do with multimodal likelihood functions. Here we use the two-parameter Cauchy model, with

“Schweder-Book” — 2015/10/21 — 17:46 — page 459 — #479�

Robust parametric inference 459

We point out that the estimates (μ̂, σ̂) solve the 'robustified score equations' that correspond to ∂H_n(θ)/∂θ = 0, which in this situation simplify to

n^{-1} Σ_{i=1}^n ε̂_i φ(ε̂_i)^a = 0,
n^{-1} Σ_{i=1}^n (ε̂_i² − 1) φ(ε̂_i)^a = −(2π)^{-a/2} a/(a + 1)^{3/2},  (A.19)

in terms of ε̂_i = (y_i − μ̂)/σ̂. The usual maximum likelihood estimates are arrived at when a → 0. Importantly, we do not really need to solve equations like (A.19) in any hard sense of the word, as solutions are easily arrived at in practice using a suitable general-purpose minimisation programme like nlm in R. This is what we used for Example A.4 below. See Example A.5 for an extension to robust analysis of linear regression.
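To show how little machinery this takes, the following Python sketch minimises the normal-model criterion H_n(μ, σ) of the example above with scipy's Nelder–Mead; the simulated contaminated data, the starting values and a = 0.05 are illustrative choices, not from the book (which uses nlm in R).

```python
import numpy as np
from scipy.optimize import minimize

def H_n(par, y, a):
    # Criterion (A.18) specialised to the normal model:
    # A_a(sigma) - (1 + 1/a) * n^{-1} sum dens_i^a + 1/a.
    mu, log_sigma = par
    sigma = np.exp(log_sigma)            # parametrise by log sigma to keep sigma > 0
    A_a = sigma**(-a) * (2*np.pi)**(-a/2) / np.sqrt(1 + a)
    dens = np.exp(-0.5*((y - mu)/sigma)**2) / (np.sqrt(2*np.pi)*sigma)
    return A_a - (1 + 1/a)*np.mean(dens**a) + 1/a

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.0, 1.0, 95), np.full(5, 12.0)])  # five gross outliers

res = minimize(H_n, x0=[np.median(y), 0.0], args=(y, 0.05), method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
```

The maximum likelihood estimates (the sample mean and standard deviation) are dragged far from the clean-data values (0, 1) by the five outliers, while (μ̂, σ̂) stays close to them, since the outlier terms φ(ε̂_i)^a are essentially zero.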

Example A.4 Robust assessment of Newcomb’s speed of light measurements

In an impressive and important series of experiments carried out by Simon Newcomb around 1880, the speed of light was measured in a manner more precise than in any earlier experiments. One of these experiments resulted in 66 measurements that on the transformed scale he used ranged from −44 to 40, where today's agreed-on true value of the speed of light corresponds to the value 33.02; see Newcomb (1891) and Stigler (1977). A histogram of these is displayed in Figure A.1, along with two fitted normal densities. The point is that the usual fit, with maximum likelihood estimates (26.212, 10.664) for (μ, σ), is not good, being overly influenced by outliers; the d_a method, on the

Figure A.1 Histogram of Newcomb's 66 measurements pertaining to the speed of light, along with two fitted normals – one based on ordinary maximum likelihood estimates (dashed line), the other using the robust d_a method, with a = 0.05 (solid line). The value corresponding to the true speed of light is 33.02.

Page 125: 12...−3.842,−3.284,−0.278,2.240,3.632, and pointed to certain complexities having to do with multimodal likelihood functions. Here we use the two-parameter Cauchy model, with

“Schweder-Book” — 2015/10/21 — 17:46 — page 460 — #480�

460 Large-sample theory with applications

other hand, used in this illustration with the small value a = 0.05, gives a much better fit, with robustified estimates (27.438, 5.991). For some further details, see Exercise A.12.

The behaviour of the d_a estimator θ̂ may be investigated using analysis and arguments fairly similar to those used to understand the maximum likelihood estimator, that is, the machinery of Sections 2.2–2.4. In the following we keep a fixed at some value, typically a small one, and write θ̂ and so forth instead of the more elaborate θ̂(a).

First, under mild conditions, θ̂ converges in probability to a well-defined least false parameter θ0 that minimises d_a(g, f_θ) over all possible θ; thus f(·, θ0) provides the best parametric approximation to the real g inside the given family. If the model happens to be correct then this minimal distance is equal to zero and θ̂ is consistent for this true value θ0, just like, for example, the maximum likelihood estimator.

Second, a random process much like the A_n(·) process used in Lemma 2.1 and both of Theorems 2.2 and 2.4 may be put up and investigated:

A_{n,a}(s) = n{H_n(θ0 + s/√n) − H_n(θ0)} = sᵗU_n + ½ sᵗJ_n s + r_n(s),

say, where it will be seen that U_n = √n H*_n(θ0) is asymptotically zero-mean normal with a certain variance matrix K_a; that J_n = H**_n(θ0) converges in probability to a certain matrix J_a; and that r_n(s) tends to zero in probability. Here we let H*_n (of size p × 1) and H**_n (of size p × p) denote first- and second-order derivatives of H_n; as earlier, p is the length of the parameter vector. We note that

H*_n(θ) = (1 + a){ ∫ f(y, θ)^{1+a} u(y, θ) dy − n^{-1} Σ_{i=1}^n f(Y_i, θ)^a u(Y_i, θ) },

and the d_a estimator may also be characterised as the solution to H*_n(θ) = 0; when a → 0 this corresponds to the familiar likelihood estimation equations Σ_{i=1}^n u(Y_i, θ) = 0. From


A_{n,a}(s) →d A(s) = sᵗU + ½ sᵗJ_a s,

where U ∼ N_p(0, K_a) and J_a is the limit in probability of J_n, one finds

√n(θ̂ − θ0) →d argmin(A) = −J_a^{-1}U ∼ N_p(0, J_a^{-1}K_a J_a^{-1}),  (A.20)

much as with result (A.11) for the maximum likelihood estimator outside model conditions. Full expressions for J_a and K_a may be derived from the sketch above, and are given in Basu et al. (1998) and Jones et al. (2001), along with consistent estimators of these. This enables full model-robust inference for the least false parameter θ0 = θ0,a. There are actually several options regarding such consistent estimators, but the perhaps simplest recipe is as follows. For J_a, use Ĵ_a = H**_n(θ̂), the Hessian matrix of the criterion function at the minimum, which is typically a side product of the minimisation algorithm used. For K_a, use

K̂_a = (1 + a)² n^{-1} Σ_{i=1}^n (f̂_i^a û_i − ξ̂)(f̂_i^a û_i − ξ̂)ᵗ with ξ̂ = n^{-1} Σ_{i=1}^n f̂_i^a û_i,

also readily computed, in terms of f̂_i = f(y_i, θ̂) and û_i = u(y_i, θ̂). We also record that formulae for the population matrices J_a and K_a simplify under conditions of

the parametric model:

J_a = (1 + a) ∫ f(y, θ0)^{1+a} u(y, θ0)u(y, θ0)ᵗ dy,
K_a = (1 + a)² { ∫ f(y, θ0)^{1+2a} u(y, θ0)u(y, θ0)ᵗ dy − ξ0ξ0ᵗ },


where ξ0 = ∫ f(y, θ0)^{1+a} u(y, θ0) dy. The case a → 0 corresponds to J_a and K_a both being equal to the Fisher information matrix, and hence (A.20) to the classic result (2.6) of Theorem 2.2. These formulae may be used to assess the degree of loss of efficiency under model conditions, compared to the optimal maximum likelihood methods. This involves computing J_a^{-1}K_a J_a^{-1} from the preceding and comparing to J_0^{-1}. Carrying out such exercises in various models shows that one typically loses very little in efficiency with a small value of a, say a = 0.05, but earns remarkably much in terms of robustness and outlier resistance.
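As a quick check of this efficiency claim, consider the simplest case of the N(θ, 1) model with score u(y, θ) = y − θ; the short Python computation below (an illustrative sketch, not from the book) evaluates the sandwich variance J_a^{-1}K_a J_a^{-1} by numerical integration and compares it to the Fisher bound of 1.

```python
import numpy as np
from scipy.integrate import quad

def avar_da(a):
    # Sandwich variance J_a^{-1} K_a J_a^{-1} for the N(theta, 1) model.
    # Here u(y, theta) = y - theta, and xi_0 = 0 by symmetry, so
    # J_a = (1+a) int f^{1+a} u^2 dy and K_a = (1+a)^2 int f^{1+2a} u^2 dy.
    f = lambda z: np.exp(-0.5*z**2) / np.sqrt(2*np.pi)
    J_a = (1 + a) * quad(lambda z: f(z)**(1 + a) * z**2, -np.inf, np.inf)[0]
    K_a = (1 + a)**2 * quad(lambda z: f(z)**(1 + 2*a) * z**2, -np.inf, np.inf)[0]
    return K_a / J_a**2

print(avar_da(0.05))   # about 1.0034: circa 0.3 percent efficiency loss at a = 0.05
```

A closed-form check for this model: the ratio equals (1 + a)³/(1 + 2a)^{3/2}, which is 1.0034 at a = 0.05 but about 1.54 at a = 1 (the L₂ case), illustrating why small values of a are preferred.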

Just as the maximum likelihood apparatus extends nicely from i.i.d. situations to regression setups, including precise notions of least false parameters and a limit result (A.14) outside model conditions, so also the Basu–Harris–Hjort–Jones (BHHJ) methods can be generalised from i.i.d. to regression models. With notation f(y | x_i, θ) for the model density of Y_i given x_i, the criterion function to minimise is now

H_n(θ) = n^{-1} Σ_{i=1}^n { ∫ f(y | x_i, θ)^{1+a} dy − (1 + 1/a) f(y_i | x_i, θ)^a + 1/a }.

Example A.5 Robust linear regression

Consider the traditional linear regression model worked with in, for example, Example 2.1, involving y_i = x_iᵗβ + ε_i and so forth, but now with the intention of devising a robust estimation scheme. Such robust estimates of β and σ are defined as the minimisers of

H_n(β, σ) = σ^{-a} (2π)^{-a/2}/(1 + a)^{1/2} − (1 + 1/a) n^{-1} Σ_{i=1}^n {φ((y_i − x_iᵗβ)/σ)(1/σ)}^a + 1/a.

These estimates β̂, σ̂ also solve the robustified score equations

n^{-1} Σ_{i=1}^n ε̂_i x_i φ(ε̂_i)^a = 0,
n^{-1} Σ_{i=1}^n (ε̂_i² − 1) φ(ε̂_i)^a = −(2π)^{-a/2} a/(a + 1)^{3/2},

in terms of estimated and normalised residuals ε̂_i = (y_i − x_iᵗβ̂)/σ̂. Sending a to zero leads to ordinary least squares regression.
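A minimal implementation of this scheme, assuming normal errors and using scipy rather than the book's R code, might look as follows; the simulated data, the MAD-based starting value and a = 0.05 are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

def H_n(par, X, y, a):
    # BHHJ criterion for linear regression with normal errors (Example A.5).
    beta, sigma = par[:-1], np.exp(par[-1])
    eps = (y - X @ beta) / sigma
    dens = np.exp(-0.5*eps**2) / (np.sqrt(2*np.pi)*sigma)
    A_a = sigma**(-a) * (2*np.pi)**(-a/2) / np.sqrt(1 + a)
    return A_a - (1 + 1/a)*np.mean(dens**a) + 1/a

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(0, 2, n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, n)
y[:10] += 15.0                                        # ten gross outliers

ols = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ ols
sigma0 = 1.4826 * np.median(np.abs(resid - np.median(resid)))  # robust MAD start
res = minimize(H_n, x0=[*ols, np.log(sigma0)], args=(X, y, 0.05),
               method="Nelder-Mead")
beta_hat, sigma_hat = res.x[:2], np.exp(res.x[2])
```

Least squares is pulled off target by the contaminated responses, while (β̂, σ̂) recovers essentially the clean-data fit; the outlier residuals of roughly 30σ̂ receive weight φ(ε̂_i)^a ≈ 0.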

Example A.6 Robust Poisson regression

Ordinary Poisson regression for count data y_i in terms of covariate vectors x_i takes y_i to have resulted from a Poisson mechanism with mean parameter μ_i = exp(x_iᵗβ); see Example A.2. A robust estimation scheme is to minimise the criterion function

H_n(β) = n^{-1} Σ_{i=1}^n [ A_a(exp(x_iᵗβ)) − (1 + 1/a){exp(−μ_i) μ_i^{y_i}/y_i!}^a ],

where A_a(λ) = Σ_{y=0}^∞ {exp(−λ)λ^y/y!}^a. Again, a is a user-specified algorithmic parameter, typically chosen to be a small value like 0.05, which affords robustness at a very low loss in efficiency.

There are other routes to robustifying maximum likelihood. One such is the weighted likelihood method briefly described in Section 2.6; see (2.29) and (2.30). The starting point is that

d_w(g, f_θ) = ∫ w{g log(g/f_θ) − (g − f_θ)} dy


defines a divergence, thought of here as the distance from the true g to the parametrically modelled f_θ, which is always nonnegative and zero only if f_θ = g; see Exercise A.11. The natural empirical function here is the weighted log-likelihood

ℓ*_n(θ) = Σ_{i=1}^n {w(y_i) log f(y_i, θ) − ξ(θ)},

where ξ(θ) = ∫ w(y)f(y, θ) dy. There is again a well-defined least false parameter θ0 = θ0,w to which the maximum weighted likelihood estimator θ̂* = argmax(ℓ*_n) converges, and a sandwich matrix J_w^{-1}K_w J_w^{-1} involved in the normal limit distribution for √n(θ̂* − θ0); see Exercise A.11 for details.

The ordinary maximum likelihood apparatus corresponds to the weight function w(y) = 1. The weighted version allows the possibility of fitting a parametric density with more attention to some areas of the sample space than others, as when, for example, fitting a Weibull density to data where what happens in the upper tail is considered more important than the lower and middle range. The weight function might also be chosen to avoid too much influence of extreme data values, by down-tuning such data points.
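As a concrete sketch (illustrative, not from the book): take the normal model with a Gaussian weight function w(y) = exp{−y²/(2c²)}, for which ξ(θ) has a closed form, and fit contaminated data by maximising the weighted log-likelihood. The weight scale c = 4 and the data are assumed choices.

```python
import numpy as np
from scipy.optimize import minimize

c = 4.0
w = lambda y: np.exp(-y**2 / (2*c**2))       # down-tunes extreme observations

def neg_wll(par, y):
    mu, sigma = par[0], np.exp(par[1])
    logf = -0.5*((y - mu)/sigma)**2 - np.log(sigma) - 0.5*np.log(2*np.pi)
    # xi(theta) = int w(y) f(y, theta) dy; a Gaussian w times a Gaussian f gives
    xi = np.sqrt(c**2 / (c**2 + sigma**2)) * np.exp(-mu**2 / (2*(c**2 + sigma**2)))
    return -np.sum(w(y)*logf - xi)           # minus ell*_n(theta)

rng = np.random.default_rng(4)
y = np.concatenate([rng.normal(0.0, 1.0, 95), np.full(5, 12.0)])
res = minimize(neg_wll, x0=[np.median(y), 0.0], args=(y,), method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
```

The −ξ(θ) term keeps the estimator consistent under the model (w(y) = 1 recovers plain maximum likelihood), while the outliers at 12 enter with weight w(12) ≈ 0.01 and so barely move the fit.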

A.7 Model selection

Here we provide a fuller discussion and description of the model selection criteria we briefly reviewed in Section 2.6.

For concreteness we start with the i.i.d. situation and consider independent observations Y_1, ..., Y_n from some common but unknown density g. Suppose there are various parametric candidate models, say from f_1(y, θ_1) to f_k(y, θ_k). Using any of these results in a Kullback–Leibler distance

KL(g(·), f_j(·, θ̂_j)) = ∫ g log g dy − R_{n,j}, with R_{n,j} = ∫ g(y) log f_j(y, θ̂_j) dy.

Below we shall provide natural estimators of the terms R_{n,j}, say R̂_{n,j}; this invites selecting the candidate model with the highest attained value of R̂_{n,j}. Note that R_{n,j} may be interpreted as E_g{log f_j(Y_{n+1}, θ̂_j) | data}, an expectation over a new data point Y_{n+1} but conditional on Y_1, ..., Y_n, which are used in θ̂_j. Also, whether we see the R̂_{n,j} developed below as estimating the random quantity R_{n,j} or its mean value E_g R_{n,j} = E_g log f_j(Y_{n+1}, θ̂_j) is not important for the development of the model selector per se.

Focus therefore on any of these candidate models, say simply f(y, θ) to avoid burdening the notation with its index, with maximum likelihood estimator θ̂. In Section A.5 we examined

λ_n(θ) = n^{-1}ℓ_n(θ) → λ(θ) = E_g log f(Y, θ) = ∫ g(y) log f(y, θ) dy,

and defined the associated θ0 maximising the limit function as the least false parameter. The present task, in view of the preceding, is to estimate R_n = λ(θ̂). We shall see that the perhaps natural start estimator λ_n(θ̂) = n^{-1}ℓ_{n,max} overshoots its target by a certain amount that we therefore shall assess and then subtract. To this end one may now derive two results, respectively

λ(θ̂) ≐ λ(θ0) − ½ (θ̂ − θ0)ᵗJ(θ̂ − θ0) ≐ λ(θ0) − ½ n^{-1} U_nᵗ J^{-1} U_n,
λ_n(θ̂) ≐ λ_n(θ0) + ½ n^{-1} U_nᵗ J^{-1} U_n,  (A.21)

with '≐' signifying equality to the relevant leading order of magnitude. As in Section A.5, U_n = n^{-1/2} ℓ′_n(θ0) →d U ∼ N_p(0, K); see (A.11) and (A.12). It follows that

n^{-1}ℓ_{n,max} − R_n = λ_n(θ0) − λ(θ0) + n^{-1} U_nᵗ J^{-1} U_n + o_pr(n^{-1}),


where the difference λ_n(θ0) − λ(θ0) is an average of zero-mean variables. Consequently, we are learning that n^{-1}ℓ_{n,max} overshoots the target R_n by a random amount approximately the size of n^{-1}UᵗJ^{-1}U. In yet other words,

n^{-1}(ℓ_{n,max} − p̂*) estimates R_n = ∫ g(y) log f(y, θ̂) dy,

in which p̂* is any natural estimator of p* = E UᵗJ^{-1}U = Tr(J^{-1}K). This gives rise to various natural model selection strategies, where the score ℓ_{n,max} − p̂* is computed for the relevant candidate models and then compared; the higher the score, the better the model, in the sense related to the Kullback–Leibler distance as spelled out above. Three special cases are

AIC = 2ℓ_{n,max} − 2p,
AIC* = 2ℓ_{n,max} − 2p̂*,
AIC_boot = 2ℓ_{n,max} − 2p̂*_boot,  (A.22)

corresponding to the classic AIC (Akaike's information criterion, where the trace of J^{-1}K somewhat crudely is set equal to the dimension p, an equality valid under model conditions, where J = K); the model robust version of the AIC (where p̂* = Tr(Ĵ^{-1}K̂), inserting estimates of the two matrices); and the bootstrapped version of the AIC (where a bootstrap regime is used to arrive at a particular estimate p̂*_boot, in various situations a more precise penalty factor than the other two). The factor 2 is not important but stems from tradition, and is related to certain log-likelihood differences being approximately chi-squared when multiplied by this factor.
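The model-robust penalty p̂* = Tr(Ĵ^{-1}K̂) is easy to compute once the score vectors are in hand. The following Python sketch (illustrative, not from the book) does this for the normal model fitted to heavier-tailed data; for this model p̂* reduces to 1 + (m₄ − 1)/2 with m₄ the fourth sample moment of the standardised residuals.

```python
import numpy as np

def aic_variants(y):
    # Fit the normal model by ML and compute AIC and the model-robust AIC*
    # of (A.22), with penalty p* = Tr(J^{-1} K) estimated from the data.
    n = len(y)
    mu, sigma = np.mean(y), np.std(y)
    eps = (y - mu) / sigma                                  # standardised residuals
    u = np.column_stack([eps/sigma, (eps**2 - 1)/sigma])    # score vectors
    K = u.T @ u / n
    # Minus the mean Hessian at the MLE; off-diagonals vanish since the fitted
    # residuals have mean 0 and variance 1 exactly.
    J = np.diag([1.0, 2.0]) / sigma**2
    p_star = np.trace(np.linalg.solve(J, K))
    ll_max = np.sum(-0.5*eps**2 - np.log(sigma) - 0.5*np.log(2*np.pi))
    return 2*ll_max - 2*2, 2*ll_max - 2*p_star, p_star

rng = np.random.default_rng(5)
y = rng.standard_t(df=10, size=2000)        # heavier-tailed than the fitted normal
aic, aic_star, p_star = aic_variants(y)
```

For exactly normal data p̂* ≈ 2 and the two criteria agree; here the heavier tails push p̂* above 2 (its population value is 2.5 for the t with 10 degrees of freedom), so AIC* penalises harder than the classic AIC.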

It turns out that the development of these criteria, sketched in the preceding for the i.i.d. setting, extends naturally without too many complications to regression models and even to situations with moderate dependence. In particular, the formulae of (A.22) are still valid. For details, discussion and applications, see Claeskens and Hjort (2008, chapter 3).

The so-called Bayesian information criterion (BIC) has the form

BIC = 2ℓ_{n,max} − p log n,

again with n the sample size and p the number of unknown parameters being estimated via maximum likelihood. It stems from a certain Laplace approximation to the exact posterior model probability in a Bayesian formulation of the model selection problem; see Claeskens and Hjort (2008, chapter 4) for details and applications. We note that the AIC favours more complex models than the BIC, since p log n penalises harder than 2p as long as n ≥ 8.

The AIC, BIC and various other model selectors work in 'overall modus' and provide a ranking list of candidate models without taking on board any particular intended use of the final model. This is different for the focussed information criterion (FIC) of Claeskens and Hjort (2003), Hjort and Claeskens (2003a,b) and Claeskens and Hjort (2008), where the crucial point is to reach a ranking list for each given focus parameter. Consider such a focal estimand, say ψ, that needs to have a clear interpretation across models, say ψ = a_j(θ_j) in candidate model j. Estimation via model j leads to the estimator ψ̂_j = a_j(θ̂_j). It has a certain mean squared error, say

r_j = E_true (ψ̂_j − ψ_true)² = v_j + b_j²,

where v_j = Var_true ψ̂_j and b_j = E_true ψ̂_j − ψ_true. The subscript indicates that the expectation operation in question is with respect to the real data-generating mechanism, which is not necessarily captured by any of the models. The FIC works by estimating these variances and squared biases, which then lead to

FIC_j = v̂_j + \widehat{b_j²}.


The latter is meant not to be equal to b̂_j², where b̂_j estimates the bias; in fact the estimate of the squared bias typically starts with b̂_j² and then subtracts a modification term for overshooting.

Clearly the general FIC method as outlined here takes different forms in different setups and depends also on what is assumed regarding the real model, which may not be captured by any of the candidate models. Claeskens and Hjort (2003) and Hjort and Claeskens (2003a,b) provide general FIC formulae for the case of sets of parametric models where the largest of these is assumed to contain the true model. Various extensions to other setups are surveyed in Claeskens and Hjort (2008, chapters 6 and 7). The ideas are general enough to work also in less standard situations and for comparing parametric with nonparametric models; see Jullum and Hjort (2015).

A.8 Notes on the literature

The law of large numbers goes back to Jacob Bernoulli's famous Ars Conjectandi from 1713, proved there for (indeed) Bernoulli variables, that is, independent variables taking values zero and one. The term itself apparently stems from Poisson in 1837 ('la loi des grands nombres'). A succession of sharper and more general versions have then been developed by different scholars, including Chebyshov, Markov and Borel, leading up to the strong law of large numbers with minimal assumptions in Kolmogorov (1933); see also the reprint Kolmogorov (1998), with more material than in the German original, explaining also that Kolmogorov had his master result a few years before 1933. Interestingly, irritation with a colleague's supposition that laws of large numbers required independence apparently caused A. A. Markov to invent Markov chains, in 1906; see Basharin et al. (2004) for a historical account. Regarding the somewhat natural question "how large is the n in the strong law of large numbers?", Hjort and Fenstad (1992) found the precise limit distribution of ε² N_ε, where N_ε is the last n for which the distance to the target is at least ε.

The central limit theorem has its origins in work by de Moivre from 1733, concerning normal approximations to the binomial distribution, later extended and made mathematically clearer in Laplace's 1812 Théorie Analytique des Probabilités. As with the laws of large numbers, results of the central limit theorem type have different forms and scope, regarding both the sets of conditions used and the precise mathematical notion of convergence in distribution. Scholars associated with these lines of probability work include Galton, Liapunov, Chebyshov, Markov, von Mises, Pólya, Lindeberg and Cramér; see Fischer (2011) for an account of the early history. The important 'Lindeberg condition' is from his 1922 paper. An interesting footnote here is that Alan Turing in his 1934 fellowship dissertation for King's College at Cambridge essentially rediscovered a version of the Lindeberg theorem; see Zabell (1995) for an engaging account. Modern and clear treatises on central limit type theorems, in various settings, include Serfling (1980), Ferguson (1996) and Lehmann (1983, 1999), with more advanced material in Billingsley (1968), Barndorff-Nielsen and Cox (1989, 1994) and van der Vaart (1998). For special techniques and sharper results under convexity conditions, as for models leading to log-concave likelihoods, see Andersen and Gill (1982, appendix) and Hjort and Pollard (1993). For generalisations of central limit theorems to martingales and so forth, with applications to counting process models (of nonparametric, semiparametric and parametric forms), see Rebolledo (1980), Helland (1982) and Andersen et al. (1993).

The very useful techniques associated with the delta method, in various guises, were first clarified in Cramér (1946). The Cramér–Wold device stems from their paper Cramér and Wold (1936). The methods for robustifying likelihood methods discussed in Section A.6 are partly from Basu et al. (1998) and Jones et al. (2001), and are also discussed in Basu et al. (2011).

Exercises

A.1 Limit distribution of mean and standard deviation: Suppose Y_1, ..., Y_n are i.i.d. from some distribution with mean ξ and standard deviation σ, and assume the fourth moment is finite.


Define skewness and kurtosis as

γ₃ = E{(Y_i − ξ)/σ}³ and γ₄ = E{(Y_i − ξ)/σ}⁴ − 3.

These are zero for the normal distribution. Consider ξ̂ = n^{-1} Σ_{i=1}^n Y_i and σ̂² = n^{-1} Σ_{i=1}^n (Y_i − ξ̂)².

(a) Show that ξ̂ →pr ξ and σ̂ →pr σ. (For this part only a finite second moment is required.)

(b) Show that σ̂² is close enough to σ̂0² = n^{-1} Σ_{i=1}^n (Y_i − ξ)² to make √n(σ̂² − σ̂0²) →pr 0. Use this to show that √n(σ̂² − σ²) →d N(0, σ⁴(2 + γ₄)).

(c) Show more generally the joint limit distribution of mean and standard deviation,

( √n(ξ̂ − ξ), √n(σ̂ − σ) )ᵗ →d (A, B)ᵗ ∼ N₂( 0, σ² [ 1, ½γ₃ ; ½γ₃, ½ + ¼γ₄ ] ).

Use this to construct an approximate confidence distribution for σ.

(d) Use the delta method to work out the limit distributions of √n(σ̂/ξ̂ − σ/ξ) and √n(σ̂²/ξ̂ − σ²/ξ), assuming that ξ ≠ 0. Show in particular that if the Y_i are from a Poisson distribution, then √n(W_n − 1) →d N(0, 2), where W_n = σ̂²/ξ̂. This may be used to test the Poisson distribution against overdispersion.
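The overdispersion statistic in (d) is easy to examine by simulation; the following sketch (with illustrative parameters) checks that √n(W_n − 1) has variance close to the limit value 2 under a Poisson model.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 2000, 2000
stats = np.empty(reps)
for r in range(reps):
    y = rng.poisson(4.0, size=n)          # under the Poisson, variance = mean
    W = np.var(y) / np.mean(y)            # W_n = sigma-hat^2 / xi-hat
    stats[r] = np.sqrt(n) * (W - 1.0)

print(round(np.mean(stats), 2), round(np.var(stats), 2))
```

The empirical variance of the replicates should be close to 2, and an observed √n(W_n − 1) much larger than √2 times the normal quantile would indicate overdispersion.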

A.2 Correlation coefficient: Assume (X_i, Y_i) are independent observations from a bivariate distribution with finite fourth moments, with population correlation coefficient ρ = cov(X, Y)/(σ₁σ₂), where σ₁ and σ₂ are the standard deviation parameters for X_i and Y_i. Consider the correlation coefficient

r_n = Σ_{i=1}^n (X_i − X̄_n)(Y_i − Ȳ_n) / [ {Σ_{i=1}^n (X_i − X̄_n)²}^{1/2} {Σ_{i=1}^n (Y_i − Ȳ_n)²}^{1/2} ].

(a) Show that r_n →pr ρ, for example by representing r_n as a smooth function of the sample means of X_i, Y_i, X_i², Y_i², X_iY_i.

(b) With C_n = n^{-1} Σ_{i=1}^n (X_i − ξ₁)(Y_i − ξ₂), show that

√n ( C_n − ρσ₁σ₂, σ̂₁² − σ₁², σ̂₂² − σ₂² )ᵗ →d (C, B₁, B₂)ᵗ ∼ N₃(0, Σ),

for an appropriate covariance matrix Σ. Then use the delta method to find the limit distribution of √n(r_n − ρ). Show in particular that when X_i and Y_i are independent, then √n r_n →d N(0, 1). This is valid without further assumptions on the underlying distributions of X_i and Y_i, and leads to a simple test for independence.

(c) Assume now that the distribution of (X_i, Y_i) is binormal. Show that √n(r_n − ρ) →d N(0, (1 − ρ²)²). Use the delta method to show that √n(Z_n − ζ) →d N(0, 1), where ζ = h(ρ) = ½ log{(1 + ρ)/(1 − ρ)} is Fisher's zeta transformation and Z_n = h(r_n).
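In practice the zeta transformation is used to build confidence intervals for ρ: the interval is symmetric on the ζ scale and mapped back with tanh. A small sketch (using the common finite-sample scale 1/√(n − 3) in place of the limiting 1/√n; the numbers are illustrative):

```python
import numpy as np
from scipy.stats import norm

def fisher_ci(r, n, level=0.95):
    # CI for rho via zeta = artanh(rho); sqrt(n)(Z_n - zeta) ->_d N(0, 1),
    # with 1/sqrt(n - 3) the usual finite-sample refinement of the scale.
    z = np.arctanh(r)
    half = norm.ppf(0.5 + level/2) / np.sqrt(n - 3)
    return np.tanh(z - half), np.tanh(z + half)

lo, hi = fisher_ci(0.6, 50)
print(round(lo, 3), round(hi, 3))   # roughly (0.386, 0.753): asymmetric around 0.6
```

The back-transformed interval is asymmetric around r, reflecting that the sampling distribution of r_n is skewed for ρ away from zero, while Z_n is close to normal.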

A.3 Spectral decomposition:

(a) Generate say n = 100 vectors of dimension p = 6, where each vector has independent standard normal entries. Compute the empirical variance matrix S from these data. Use eigen(S,symmetric = T)$vectors in R, or something equivalent in other packages, to compute the unitary matrix Pᵗ of the spectral decomposition, and check that PSPᵗ equals the diagonal matrix with the eigenvalues eigen(S,symmetric = T)$values down this diagonal. (The symmetric = T part here is not strictly necessary, but helps R to achieve better numerical precision.)
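The same computation for readers without R at hand, as a Python sketch: numpy's eigh plays the role of eigen with symmetric = T.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((100, 6))            # n = 100 vectors of dimension p = 6
S = np.cov(X, rowvar=False)                  # empirical variance matrix

vals, V = np.linalg.eigh(S)                  # eigh exploits symmetry, like symmetric = T
# Spectral decomposition: V^t S V is diagonal with the eigenvalues on the
# diagonal, and S is recovered as V diag(vals) V^t.
assert np.allclose(V.T @ S @ V, np.diag(vals))
assert np.allclose(V @ np.diag(vals) @ V.T, S)
```

Here V is orthonormal (V Vᵗ = I), so Vᵗ corresponds to the unitary matrix of the exercise.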


(b) Writing S_n for the aforementioned variance matrix, based on n independent data vectors of the type described, show that S_n converges in probability to the p × p identity matrix I_p as n → ∞; you may also show that the convergence actually takes place with probability one.

(c) Show in the setup of the previous point that all eigenvalues tend to 1, almost surely, as n increases.

(d) Show that all eigenvalues of S_n will continue to go to 1, even when the dimension p of the vectors is increasing with n, as long as p/n → 0. See Bai (1999).

A.4 Estimation of variance matrices: One often needs to estimate quantities of the form κ = E h(X, θ), based on independent realisations X_1, ..., X_n of the variable X, with h a given function that depends on some unknown parameter θ.

(a) Suppose θ̂_n is a consistent estimator of θ. Show that if h(x, t) is such that |h(x, t′) − h(x, t)| ≤ g(x)|t′ − t| for all (x, t) and (x, t′), where g(X) has finite mean, then n^{-1} Σ_{i=1}^n h(X_i, θ̂_n) is consistent for κ.

(b) Use this result to give precise conditions under which the estimators Ĵ and K̂ given in Section A.5 are secured consistency for the maximum likelihood theory related matrices J and K. Verify your conditions for a couple of standard parametric models.

(c) Show from this that the sandwich estimator Ĵ^{-1}K̂Ĵ^{-1} flowing from (A.13), and indirectly used in Section 2.6, is consistent for the variance matrix of the limit distribution in (A.11).

A.5 The excentric part of the noncentral chi-squared: Let Z ∼ χ²_p(λ), defined as the distribution of X₁² + ··· + X_p², where the X_i are independent with X_i ∼ N(a_i, 1) and λ = Σ_{i=1}^p a_i².

(a) Show that the distribution of Σ_{i=1}^p X_i² indeed depends on the mean vector a = (a₁, ..., a_p)ᵗ only through λ = ‖a‖².

(b) Show that Z has Laplace transform

E exp(−tZ) = (1 + 2t)^{-p/2} exp( −λt/(1 + 2t) ).

Use this (or another method) to show that Z has mean p + λ and variance 2p + 4λ.
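Both moments are easy to confirm by simulation; a quick sketch with an illustrative mean vector:

```python
import numpy as np

rng = np.random.default_rng(8)
p = 4
a = np.array([1.0, -2.0, 0.5, 1.5])
lam = np.sum(a**2)                           # noncentrality, here 7.5

# Z = X_1^2 + ... + X_p^2 with X_i ~ N(a_i, 1), replicated many times
Z = np.sum(rng.normal(a, 1.0, size=(200_000, p))**2, axis=1)
print(Z.mean(), Z.var())   # close to p + lam = 11.5 and 2p + 4lam = 38
```

The simulated mean and variance match p + λ and 2p + 4λ up to Monte Carlo error.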

(c) Show that a sum of independent noncentral chi-squared variables is also a noncentral chi-squared.

(d) Show that the noncentral Z may be represented as a central χ²_{p+2J}, with a random degrees-of-freedom number p + 2J, where J ∼ Pois(½λ).

(e) Show that Z in fact can be represented as Z₀ + E_λ, where Z₀ ∼ χ²_p is independent of E_λ, which may be seen as the purely noncentral part of the noncentral chi-squared. Find the explicit distribution of E_λ (which has a point mass at zero). See Hjort (1988a).

(f) Suppose Y₁, ..., Y_n are independent with Y_i ∼ N(μ_i, 1). Show that Z = Σ_{i=1}^n (Y_i − Ȳ)² ∼ χ²_{n−1}(λ), with λ = Σ_{i=1}^n (μ_i − μ̄)², where μ̄ = n^{-1} Σ_{i=1}^n μ_i. Construct a confidence distribution for λ.


A.6 Lyapunov condition: The Lindeberg condition (A.7) might not be easy to work with directly. Here we develop sufficient conditions that sometimes are easier to use.

(a) For a nonnegative random variable Y, show that EY ≥ EY I{Y ≥ ε} ≥ εP{Y ≥ ε}, which gives Markov's inequality P{Y ≥ ε} ≤ ε^{-1}EY.

(b) With a similar trick, show that EY³ ≥ ε EY² I{Y ≥ ε}. Conclude from this that

L_n(ε) = Σ_{i=1}^n E |X_i/B_n|² I{|X_i/B_n| ≥ ε} ≤ (1/ε) Σ_{i=1}^n E |X_i/B_n|³,

with assumptions and notation as for the Lindeberg condition (A.7). Hence if Σ_{i=1}^n E |X_i/B_n|³ → 0, or more generally Σ_{i=1}^n E |X_i/B_n|^{2+δ} → 0 for some positive δ, then the Lindeberg condition is in force, and Σ_{i=1}^n (X_i/B_n) →d N(0, 1).

A.7 Convexity lemma: The following helps to prove the second convexity lemma of Section A.4. For the situation laid out in Lemma A.3, let s be an arbitrary point outside the ball around β_n with radius δ, say s = β_n + lu for a unit vector u, where l > δ. Show that convexity of A_n implies

(1 − δ/l) A_n(β_n) + (δ/l) A_n(s) ≥ A_n(β_n + δu).

Writing then A_n(s) = B_n(s) + r_n(s), show that

(δ/l) {A_n(s) − A_n(β_n)} ≥ A_n(β_n + δu) − A_n(β_n) ≥ h_n(δ) − 2Δ_n(δ).

Use this to prove the lemma. It is worth pointing out that any norm on R^p can be used here, and that no assumptions need to be placed on the B_n function besides the existence of the minimiser β_n.

A.8 Lindeberg stability: This exercise examines a couple of situations where conditions are worked out to secure asymptotic stability and hence limiting normality of natural statistics.

(a) Let Y₁, Y₂, ... be independent Bernoulli variables with probabilities p₁, p₂, .... The variance of S_n = Σ_{i=1}^n Y_i is B_n² = Σ_{i=1}^n p_i(1 − p_i). When will S_n be approximately normal? Show that (S_n − Σ_{i=1}^n p_i)/B_n →d N(0, 1) if and only if B_n → ∞. So the cases p_i = 1/i and p_i = 1/i^{1.01} are drastically different, from the normal approximation point of view.

(b) Consider a linear regression setup with Y_i = x_iᵗβ + ε_i, with p-dimensional covariate vectors x_i for i = 1, ..., n, and where the noise terms ε₁, ..., ε_n are i.i.d. with finite standard deviation σ, but not necessarily normal. The least squares estimator is β̂ = Σ_n^{-1} n^{-1} Σ_{i=1}^n x_iY_i, with Σ_n = n^{-1} Σ_{i=1}^n x_ix_iᵗ = n^{-1}XᵗX. Assume that the covariate vectors are well behaved in the precise sense that

Σ_n → Σ and m_n/√n = (1/√n) max_{i≤n} ‖x_i‖ → 0  (EA.1)

as n → ∞, where Σ is an appropriate p × p matrix of full rank. Show that this implies √n(β̂ − β) →d N_p(0, σ²Σ^{-1}).

(c) Consider next the logistic regression model where Y₁, ..., Y_n are independent 0–1 variables with p_i = P{Y_i = 1} = H(x_iᵗβ), with H(u) = exp(u)/{1 + exp(u)} the logistic transform. Let Σ_n = n^{-1} Σ_{i=1}^n p_i(1 − p_i)x_ix_iᵗ. Show that the same condition (EA.1) as in the previous point secures √n(β̂ − β) →d N_p(0, Σ^{-1}).


(d) Assume that the covariate vectors x₁, ..., x_n may be considered as i.i.d., sampled from a certain covariate distribution C with finite second moments. Show that this implies the asymptotic stability condition (EA.1).

A.9 Medians and spatial medians:

(a) Consider a random variable X from a distribution that we for simplicity assume has a continuous and increasing distribution function. Show that the median m can be characterised as the minimiser of the function g(a) = E|X − a|.

(b) Show similarly that the empirical median of a dataset X₁, ..., X_n can be characterised as the minimiser of the function Σ_{i=1}^n |X_i − a|. Use this to prove that the empirical median M_n is a consistent estimator of the true median.

(c) Find the limit distribution of √n(M_n − m) by working with the random function A_n(s) = Σ_{i=1}^n {|X_i − m − s/√n| − |X_i − m|}.

(d) We may define a 'spatial median' of a two-dimensional distribution as the point m = (m_1, m_2) minimising the function

E‖(X, Y) − (a, b)‖ = ∫∫ {(x − a)² + (y − b)²}^{1/2} dF(x, y),

where (X, Y) denotes a random pair from this distribution. Try to find the spatial median of the distribution with density f(x, y) = 1 + θ(2x − 1)(2y − 1) on [0, 1]², for example, where θ is a parameter in (−1, 1).

(e) For random pairs (X_1, Y_1), ..., (X_n, Y_n), define the empirical spatial median as the minimiser of ∑_{i=1}^n ‖(X_i, Y_i) − (a, b)‖. Show that the empirical spatial median is consistent for the true spatial median.
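The empirical spatial median has no closed form, but the minimisation can be carried out with the classical Weiszfeld iteration (iteratively reweighted averaging); this is a standard algorithm, sketched here as an aside rather than as the book's own method. For the uniform density on [0, 1]² (the θ = 0 case of the density in (d)), symmetry makes (1/2, 1/2) the natural target.

```python
import numpy as np

def spatial_median(points, tol=1e-8, max_iter=1000):
    """Empirical spatial median, minimising sum_i ||z_i - m||,
    via the Weiszfeld iteration."""
    m = points.mean(axis=0)
    for _ in range(max_iter):
        d = np.linalg.norm(points - m, axis=1)
        d = np.maximum(d, 1e-12)                 # guard: iterate landing on a data point
        m_new = (points / d[:, None]).sum(axis=0) / (1.0 / d).sum()
        if np.linalg.norm(m_new - m) < tol:
            break
        m = m_new
    return m

rng = np.random.default_rng(11)
pts = rng.uniform(0.0, 1.0, size=(4000, 2))      # theta = 0: spatial median (1/2, 1/2)
print(spatial_median(pts))
```

With n = 4000 the estimate should sit close to (1/2, 1/2), in line with the consistency claim of (e).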

A.10 Least false parameters and the Kullback–Leibler distance: Let g be a fixed density, to be approximated with f_θ = f(·, θ) belonging to a suitable parametric family.

(a) For any fixed y with g(y) positive, show that

h(u) = g(y) log{g(y)/u} − {g(y) − u}

is always nonnegative, and equal to zero only when u = g(y). Conclude that with any nonnegative weight function w(y),

d_w(g, f_θ) = ∫ w{g log(g/f_θ) − (g − f_θ)} dy ≥ 0,

with the divergence being zero only if f_θ(y) = g(y) almost everywhere. Show that the special case of w(y) = 1 corresponds to the Kullback–Leibler divergence KL(g, f_θ) of (A.10).

(b) Take f_θ to be the normal N(μ, σ²). Show that the least false parameters μ_0, σ_0 minimising the Kullback–Leibler distance KL(g, f_θ) are the mean and standard deviation of g. Note that this could be deduced without looking at or knowing the formulae for the maximum likelihood estimators of these two parameters under the normal model.

(c) Consider the exponential model with density θ exp(−θy). Find the least false parameter θ_0 minimising KL(g, f_θ), in terms of the true density g. Verify that the maximum likelihood estimator converges to this θ_0.
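Point (c) can be checked numerically before working it out analytically. Below, the true g is taken to be a Gamma(2.5, 1) density (a hypothetical choice): the KL minimiser, found by numerical integration and one-dimensional optimisation, is compared with the large-sample limit of the exponential MLE, which is 1/ȳ.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import gamma

g = gamma(2.5).pdf                               # true density, mean 2.5

# Minimising KL(g, f_theta) is the same as minimising -E_g log f_theta(Y),
# i.e. -int g(y) {log(theta) - theta*y} dy.
def neg_mean_loglik(theta):
    return -quad(lambda y: g(y) * (np.log(theta) - theta * y), 0.0, 60.0)[0]

theta0 = minimize_scalar(neg_mean_loglik, bounds=(0.05, 5.0), method="bounded").x

rng = np.random.default_rng(5)
y = rng.gamma(2.5, 1.0, 200000)
print(theta0, 1.0 / y.mean())                    # MLE limit agrees with theta0
```

The two numbers agree up to Monte Carlo and optimisation error, consistent with the claim that the MLE converges to the least false θ_0.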

(d) Then consider the Beta density with parameters (a, b). Characterise the least false parameters (a_0, b_0) in terms of the true means of log Y and log(1 − Y).

(e) As for the normal case in Example A.1, find the two matrices J and K involved in (A.11), along with the sandwich matrix J^{-1}KJ^{-1}, for the one-dimensional exponential model, and then for the two-dimensional Gamma density model.


Exercises 469

A.11 Weighted likelihood: With a given nonnegative weight function w(y) we saw that d_w(g, f_θ) of Exercise A.10 forms a proper divergence measure, meant in these contexts to be the distance from a true g to an approximation f_θ. We now fill in a few details from what was touched on in (2.29) and (2.30).

(a) Consider the weighted log-likelihood function

ℓ*_n(θ) = ∑_{i=1}^n {w(y_i) log f(y_i, θ) − ξ(θ)},

with ξ(θ) = ∫ w(y) f(y, θ) dy. We may call its maximiser θ̂* the maximum weighted likelihood estimator. Show that n^{-1} ℓ*_n(θ) →pr A(θ), a function that is maximised precisely for the least false parameter θ_0 = θ_{0,w} involved in d_w. Give conditions under which θ̂* →pr θ_0. Verify that with a constant weight function w(y) we are back in maximum likelihood territory.
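A small sketch of the estimator in action, under entirely hypothetical choices: the normal model N(μ, σ²) and the hard-window weight w(y) = 1{|y| ≤ 2}, which simply discards gross outliers; the correction term ξ(θ) is computed by numerical integration. With the model correct inside the window, θ̂* should stay near the parameters of the clean part of the data even after contamination.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize
from scipy.stats import norm

def weighted_loglik(par, y):
    mu, log_sigma = par
    sigma = np.exp(log_sigma)                    # log-parametrised for positivity
    w = (np.abs(y) <= 2.0).astype(float)
    xi = quad(lambda u: norm.pdf(u, mu, sigma), -2.0, 2.0)[0]   # xi(theta) = int w f
    return np.sum(w * norm.logpdf(y, mu, sigma) - xi)

rng = np.random.default_rng(2)
y = np.concatenate([rng.normal(0.0, 1.0, 500), np.full(10, 8.0)])  # 10 gross outliers

res = minimize(lambda p: -weighted_loglik(p, y), x0=[0.0, 0.0], method="Nelder-Mead")
mu_star, sigma_star = res.x[0], np.exp(res.x[1])
print(mu_star, sigma_star)                       # compare with y.mean(), y.std()
```

Unlike the plain sample mean and standard deviation, which are dragged towards 8, the weighted estimates stay near (0, 1): the outliers get zero weight, while the ξ(θ) term keeps the criterion properly centred.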

(b) Using variations of arguments used to prove asymptotic normality of the maximum likelihood estimator outside model conditions, involving the first- and second-order derivatives of ℓ*_n(θ), show that √n(θ̂* − θ_0) →d N(0, J_w^{-1} K_w J_w^{-1}), for suitable matrices J_w and K_w.

(c) Generalise the methods and results of (a) and (b) to the setup of regression models.

A.12 With the speed of light: Consider the speed of light data discussed in Example A.4.

(a) Compute and display curves of estimates μ̂(a) and σ̂(a), computed via the BHHJ robustification method, as a function of the tuning parameter a, and with a = 0 giving the maximum likelihood estimates. For which values of a would you say the method has managed to rid itself of the outlying observations?

(b) Compute also robustified deviance curves D*_{n,a}(μ) and D*_{n,a}(σ), from (2.28), for a = 0.01, 0.05, 0.10.

A.13 Robust estimation in parametric models: Consider the BHHJ divergence measure d_a(g, f_θ) of (A.17), from Basu et al. (1998).

(a) Show that indeed d_a(g, f_θ) is always nonnegative, and equal to zero only if g = f_θ almost everywhere. Show also that d_a(g, f_θ) converges to the Kullback–Leibler divergence (A.10) as a → 0.

(b) Spell out the details of the BHHJ method of (A.18) for the case of the Gamma distribution (a, b), where it is helpful to first find a formula for ∫_0^∞ f(y, a, b)^{1+a} dy (though one might also revert to numerical integration). Try it out on a simulated dataset of size say n = 50 from a Gamma distribution, where you intentionally push say two of the data points towards extreme values (too small or too big), to monitor how the method copes with such outliers.
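A sketch along the lines of (b), using the numerical-integration route rather than the closed-form integral, and minimising the empirical BHHJ criterion ∫ f_θ^{1+a} dy − (1 + 1/a) n^{-1} ∑_i f_θ(y_i)^a from Basu et al. (1998). Names and concrete numbers (shape 2, rate 1, outliers at 40 and 55) are illustration choices; note the notational clash between the Gamma shape parameter and the BHHJ tuning parameter a, resolved in the code by calling the former `shape`.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize
from scipy.stats import gamma

def bhhj_criterion(par, y, a):
    """Empirical BHHJ criterion for the Gamma(shape, rate) model, tuning parameter a."""
    shape, rate = np.exp(par)                    # log-parametrised for positivity
    f = lambda u: gamma.pdf(u, shape, scale=1.0 / rate)
    integral = quad(lambda u: f(u) ** (1.0 + a), 0.0, np.inf)[0]
    return integral - (1.0 + 1.0 / a) * np.mean(f(y) ** a)

rng = np.random.default_rng(1)
y = rng.gamma(2.0, 1.0, 50)
y[:2] = [40.0, 55.0]                             # push two points to extreme values

est = np.exp(minimize(bhhj_criterion, x0=[0.0, 0.0], args=(y, 0.5),
                      method="Nelder-Mead").x)
print(est)
```

With a = 0.5 the two extreme points contribute essentially nothing to the mean of f(y_i)^a, so the fit is driven by the 48 clean observations and stays near the true (2, 1), where maximum likelihood would be badly distorted.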

(c) Apply the BHHJ machinery for robust estimation of the mean and covariance matrix of the multinormal distribution. Show that with f(y, μ, Σ) the density of the N_p(μ, Σ), one has

∫ f(y, μ, Σ)^{1+a} dy = (2π)^{−pa/2} |Σ|^{−a/2} / (1 + a)^{p/2}.

Implement the BHHJ method for the bivariate case. Use it specifically to estimate the correlation coefficient robustly, and to provide a robust confidence distribution for this parameter. Simulate a dataset with a suitably small fraction of outliers to monitor how the method copes with this.
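Before deriving the integral formula analytically, it can be sanity-checked by Monte Carlo in the simplest case μ = 0, Σ = I with p = 2 (a numerical aside): the integral equals E_f[f(Y)^a], an expectation under the model itself.

```python
import numpy as np

p, a = 2, 0.5
rng = np.random.default_rng(4)
y = rng.standard_normal((200000, p))
f_vals = (2.0 * np.pi) ** (-p / 2) * np.exp(-0.5 * (y ** 2).sum(axis=1))

mc = np.mean(f_vals ** a)                                    # E_f[f(Y)^a]
closed = (2.0 * np.pi) ** (-p * a / 2) / (1.0 + a) ** (p / 2)  # |Sigma| = 1 here
print(mc, closed)
```

The Monte Carlo average and the closed-form value agree to within a fraction of a percent at this sample size.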


References

Aalen, O. O. (1978). Nonparametric inference for a family of counting processes. Annals of Statistics, 6:701–726.
Aalen, O. O., Borgan, Ø. and Gjessing, H. K. (2008). Survival and Event History Analysis: A Process Point of View. Springer-Verlag, Berlin.
Aldrich, J. (1997). R. A. Fisher and the making of maximum likelihood 1912–1922. Statistical Science, 12:162–176.
Aldrich, J. (2000). Fisher's "inverse probability" of 1930. International Statistical Review, 68:155–172.
Andersen, P. K., Borgan, Ø., Gill, R. D. and Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer-Verlag, Berlin.
Andersen, P. K. and Gill, R. D. (1982). Cox's regression model for counting processes: A large sample study. Annals of Statistics, 10:1100–1120.
Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. John Wiley & Sons, New York.
Anderson, T. W. and Goodman, L. A. (1957). Statistical inference about Markov chains. Annals of Mathematical Statistics, 28:89–110.
Baddeley, A. J., Rubak, E. and Turner, R. (2015). Analyzing Spatial Point Patterns with R. Chapman & Hall/CRC, London.
Baddeley, A. J. and Turner, R. (2005). Spatstat: An R package for analyzing spatial point patterns. Journal of Statistical Software, 12:1–42.
Bai, Z. D. (1999). Methodologies in spectral analysis of large dimensional random matrices, a review [with discussion and a rejoinder]. Statistica Sinica, 9:611–677.
Ball, F. K., Britton, T. and O'Neill, P. C. (2002). Empty confidence sets for epidemics, branching processes and Brownian motion. Biometrika, 89:211–224.
Banerjee, M. and McKeague, I. W. (2007). Confidence sets for split points in decision trees. Annals of Statistics, 35:543–574.
Barlow, R. E., Bartholomew, D. J., Bremner, J. M. and Brunk, H. D. (1972). Statistical Inference Under Order Restrictions: The Theory and Application of Isotonic Regression. John Wiley & Sons, New York.
Barnard, G. A. (1967). The use of the likelihood function in statistical practice. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, pp. 27–40. University of California Press, Berkeley.
Barndorff-Nielsen, O. E. (1983). On a formula for the distribution of the maximum likelihood estimator. Biometrika, 70:343–365.
Barndorff-Nielsen, O. E. (1986). Inference on full or partial parameters based on the standardized signed log-likelihood ratio. Biometrika, 73:307–322.
Barndorff-Nielsen, O. E. (2014). Information and Exponential Families in Statistical Theory. John Wiley & Sons, New York. A re-issue of the 1978 edition, with a new preface.
Barndorff-Nielsen, O. E. and Cox, D. R. (1979). Edgeworth and saddle-point approximations with statistical applications [with discussion and a rejoinder]. Journal of the Royal Statistical Society, Series B, 41:279–312.
Barndorff-Nielsen, O. E. and Cox, D. R. (1989). Asymptotic Techniques for Use in Statistics. Chapman & Hall, London.


Barndorff-Nielsen, O. E. and Cox, D. R. (1994). Inference and Asymptotics. Chapman & Hall, London.
Barndorff-Nielsen, O. E. and Cox, D. R. (1996). Prediction and asymptotics. Bernoulli, 2:319–340.
Barndorff-Nielsen, O. E. and Wood, T. A. (1998). On large deviations and choice of ancillary for p∗ and r∗. Bernoulli, 4:35–63.
Barry, D. and Hartigan, J. A. (1987). Asynchronous distance between homologous DNA sequences. Biometrics, 43:261–276.
Barth, E. and Moene, K. O. (2012). Employment as a price or a prize of equality. Nordic Journal of Working Life Studies, 2:5–33.
Bartlett, M. S. (1936). The information available in small samples. Proceedings of the Cambridge Philosophical Society, 32:560–566.
Bartlett, M. S. (1937). Properties of sufficiency and statistical tests. Proceedings of the Royal Society of London, Series A, 160:268–282.
Bartlett, M. S. (1939). Complete simultaneous fiducial distributions. Annals of Mathematical Statistics, 10:129–138.
Bartlett, M. S. (1965). R. A. Fisher and the last fifty years of statistical methodology. Journal of the American Statistical Association, 60:395–409.
Bartlett, M. S. (1966). Review of Hacking's 'Logic of Statistical Inference'. Biometrika, 53:631–633.
Basharin, G. P., Langville, A. N. and Naumov, V. A. (2004). The life and work of A. A. Markov. Linear Algebra and Its Applications, 386:3–26.
Basu, A., Harris, I. R., Hjort, N. L. and Jones, M. C. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika, 85:549–559.
Basu, A., Shioya, H. and Park, C. (2011). Statistical Inference: The Minimum Distance Approach. Chapman & Hall/CRC, London.
Bayarri, M. J. and Berger, J. O. (2004). The interplay of Bayesian and frequentist analysis. Statistical Science, 19:58–80.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57:290–300.
Beran, R. J. (1977). Minimum Hellinger distance estimates for parametric models. Annals of Statistics, 5:445–463.
Beran, R. J. (1987). Prepivoting to reduce level error of confidence sets. Biometrika, 74:457–468.
Beran, R. J. (1988a). Balanced simultaneous confidence sets. Journal of the American Statistical Association, 83:679–686.
Beran, R. J. (1988b). Prepivoting test statistics: A bootstrap view of asymptotic refinements. Journal of the American Statistical Association, 83:687–697.
Beran, R. J. (1990). Calibrating prediction regions. Journal of the American Statistical Association, 85:715–723.
Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, Berlin.
Berger, J. O. and Bernardo, J. M. (1992). On the development of reference priors [with discussion and a rejoinder]. In Bernardo, J. M., Berger, J. O., Dawid, A. P. and Smith, A. F. M., editors, Bayesian Statistics 4, pp. 35–60. Oxford University Press, Oxford.
Berger, J. O., Liseo, B. and Wolpert, R. L. (1999). Integrated likelihood methods for eliminating nuisance parameters [with discussion and a rejoinder]. Statistical Science, 14:1–28.
Berger, J. O. and Sun, D. (2008). Objective priors for the bivariate normal model. Annals of Statistics, 36:963–982.
Berger, J. O. and Wolpert, R. (1984). The Likelihood Principle. Institute of Mathematical Statistics, Hayward, CA.
Berk, R., Brown, L., Buja, A., Zhang, K. and Zhao, L. (2013). Valid post-selection inference. Annals of Statistics, 41:802–837.
Bernstein, P. L. (1996). Against the Gods. John Wiley & Sons, New York.
Berry, G. and Armitage, P. (1995). Mid-p confidence intervals: A brief review. The Statistician, 44:417–423.
Bickel, P. J. and Doksum, K. A. (2001). Mathematical Statistics: Basic Ideas and Selected Topics, Vol. I [2nd ed.]. Prentice-Hall, London.


Bie, O., Borgan, Ø. and Liestøl, K. (1987). Confidence intervals and confidence bands for the cumulative hazard rate function and their small sample properties. Scandinavian Journal of Statistics, 14:221–233.
Billingsley, P. (1961). Statistical Inference for Markov Processes. University of Chicago Press, Chicago.
Billingsley, P. (1968). Convergence of Probability Measures. John Wiley & Sons, New York.
Birnbaum, A. (1961). Confidence curves: An omnibus technique for estimation and testing statistical hypotheses. Journal of the American Statistical Association, 56:246–249.
Birnbaum, A. (1962). On the foundations of statistical inference. Journal of the American Statistical Association, 57:269–306.
Bjørnsen, K. (1963). 13 år med Kuppern & Co. Nasjonalforlaget, Oslo.
Blackwell, D. (1947). Conditional expectation and unbiased sequential estimation. Annals of Mathematical Statistics, 18:105–110.
Blaisdell, B. E. (1985). A method for estimating from two aligned present day DNA sequences their ancestral composition and subsequent rates of substitution, possibly different in the two lineages, corrected for multiple and parallel substitutions at the same site. Journal of Molecular Evolution, 22:69–81.
Bogstad, B., Dingsør, G. E., Ingvaldsen, R. B. and Gjøsæter, H. (2013). Changes in the relationship between sea temperature and recruitment of cod, haddock and herring in the Barents Sea. Marine Biology Research, 9:895–907.
Boitsov, V. D., Karsakov, A. L. and Trofimov, A. G. (2012). Atlantic water temperature and climate in the Barents Sea, 2000–2009. ICES Journal of Marine Science, 69:833–840.
Bolt, U. (2013). Faster Than Lightning: My Autobiography. HarperSport, London.
Boole, G. (1854). The Laws of Thought [reprinted by Dover, New York, 1958]. Macmillan, London.
Borenstein, M., Hedges, L. V., Higgins, J. and Rothstein, H. (2009). Introduction to Meta-Analysis. John Wiley & Sons, New York.
Borgan, Ø. (1984). Maximum likelihood estimation in parametric counting process models, with applications to censored failure time data. Scandinavian Journal of Statistics, 11:1–16.
Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations [with discussion and a rejoinder]. Journal of the Royal Statistical Society, Series B, 26:211–252.
Box, G. E. P. and Draper, N. R. (1987). Empirical Model-Building and Response Surfaces. John Wiley & Sons, New York.
Box, G. E. P. and Tiao, G. C. (1973). Bayesian Inference in Statistical Models. John Wiley & Sons, New York.
Brandon, J. R. and Wade, P. R. (2006). Assessment of the Bering-Chukchi-Beaufort Seas stock of bowhead whales using Bayesian model averaging. Journal of Cetacean Research and Management, 8:225–239.
Brazzale, A. R. and Davison, A. C. (2008). Accurate parametric inference for small samples. Statistical Science, 23:465–484.
Brazzale, A. R., Davison, A. C. and Reid, N. (2007). Applied Asymptotics: Case Studies in Small-Sample Statistics. Cambridge University Press, Cambridge.
Breiman, L. (1992). The little bootstrap and other methods for dimensionality reduction in regression: X-fixed prediction error. Journal of the American Statistical Association, 87:738–754.
Breiman, L. (2001). Statistical modeling: The two cultures [with discussion and a rejoinder]. Statistical Science, 16:199–231.
Breslow, N. E. (1981). Odds ratio estimators when the data are sparse. Biometrika, 68:73–84.
Breslow, N. E. and Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88:9–25.
Breuer, P. T. and Bowen, J. P. (2014). Empirical patterns in Google Scholar citation counts. arxiv.org.
Brillinger, D. R. (1962). Examples bearing on the definition of fiducial probability with a bibliography. Annals of Mathematical Statistics, 33:1349–1355.
Brillinger, D. R. (2001). Time Series: Data Analysis and Theory. SIAM, London.
Browman, H. I. (2014). Commemorating 100 years since Hjort's 1914 treatise on fluctuations in the great fisheries of northern Europe: Where we have been, where we are, where we are going. ICES Journal of Marine Science, 71:1989–1992.


Brown, L. D. (1986). Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory. Institute of Mathematical Statistics, Hayward, CA.
Carlin, B. P. and Louis, T. A. (2009). Bayesian Methods for Data Analysis. Chapman & Hall/CRC, Boca Raton, FL.
Cheng, X. and Hansen, B. E. (2015). Forecasting with factor-augmented regression: A frequentist model averaging approach. Journal of Econometrics, 186:280–293.
Claeskens, G. and Hjort, N. L. (2003). The focused information criterion [with discussion and a rejoinder]. Journal of the American Statistical Association, 98:900–916.
Claeskens, G. and Hjort, N. L. (2008). Model Selection and Model Averaging. Cambridge University Press, Cambridge.
Claeskens, G. and Van Keilegom, I. (2003). Bootstrap confidence bands for regression curves and their derivatives. Annals of Statistics, 31:1852–1884.
Collett, D. (2003). Modelling Survival Data in Medical Research [2nd ed.]. Chapman & Hall/CRC, Boca Raton, FL.
Cook, T. D. and Campbell, D. T. (1979). Quasi-experimentation. Houghton Mifflin, Boston.
Cornish, E. A. and Fisher, R. A. (1938). Moments and cumulants in the specification of distributions. Review of the International Statistical Institute, 5:307–320.
Cox, D. R. (1958). Some problems connected with statistical inference. Annals of Mathematical Statistics, 29:357–372.
Cox, D. R. (1977). The role of significance tests [with discussion and a rejoinder]. Scandinavian Journal of Statistics, 4:49–70.
Cox, D. R. (2006). Principles of Statistical Inference. Cambridge University Press, Cambridge.
Cox, D. R. (2013). Discussion of M. Xie and K. Singh's paper. International Statistical Review, 81:40–41.
Cox, D. R. and Reid, N. (1987). Parameter orthogonality and approximate conditional inference [with discussion and a rejoinder]. Journal of the Royal Statistical Society, Series B, 49:1–39.
Cox, D. R. and Snell, E. J. (1981). Applied Statistics: Principles and Examples. Chapman & Hall, London.
Cox, D. R. and Snell, E. J. (1989). Analysis of Binary Data. Chapman & Hall, London.
Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press, Princeton, NJ.
Cramér, H. and Wold, H. (1936). Some theorems on distribution functions. Journal of the London Mathematical Society, 1:290–294.
Creasy, M. A. (1954). Limits for the ratio of normal means. Journal of the Royal Statistical Society, Series B, 16:186–194.
Cressie, N. (1993). Statistics for Spatial Data [revised ed.]. John Wiley & Sons, New York.
Cunen, C. M. L. and Hjort, N. L. (2015). Optimal inference via confidence distributions for two-by-two tables modelled as Poisson pairs: Fixed and random effects. In Proceedings 60th World Statistics Congress, 26–31 July 2015, Rio de Janeiro, Vol. I, pp. xx–xx. International Statistical Institute, Amsterdam.
Darmois, G. (1935). Sur les lois de probabilité à estimation exhaustive. Comptes Rendus de l'Académie des Sciences Paris 2, 200:1265–1266.
Darroch, J. N. (1958). The multiple-recapture census. I: Estimation of a closed population. Biometrika, 45:343–359.
da Silva, C. Q., Zeh, J. E., Madigan, D., Lake, J., Rugh, D., Baraff, L., Koski, W. and Miller, G. (2000). Capture-recapture estimation of bowhead whale population size using photo-identification data. Journal of Cetacean Research and Management, 2:45–61.
David, H. A. and Nagaraja, H. N. (2003). Order Statistics [3rd ed.]. John Wiley & Sons, New York.
Davies, P. L. (2008). Approximating data [with discussion and a rejoinder]. Journal of the Korean Statistical Society, 37:191–211.
Davison, A. C. (2001). Biometrika centenary: Theory and general methodology. Biometrika, 13–52.
Davison, A. C. (2003). Statistical Models. Cambridge University Press, Cambridge.
Davison, A. C. and Hinkley, D. V. (1997). Bootstrap Methods and Their Application. Cambridge University Press, Cambridge.
De Blasi, P. and Hjort, N. L. (2007). Bayesian survival analysis in proportional hazard models with logistic relative risk. Scandinavian Journal of Statistics, 34:229–257.


De Blasi, P. and Schweder, T. (2015). Tail symmetry of confidence curves based on the log-likelihood ratio. Submitted, xx:xx–xx.
De Leeuw, J., Hornik, K. and Mair, P. (2009). Isotone optimization in R: Pool-adjacent-violators algorithm (PAVA) and active set methods. Journal of Statistical Software, 21:1–23.
Dempster, A. P. (1963). Further examples of inconsistencies in the fiducial argument. Annals of Mathematical Statistics, 34:884–891.
Dempster, A. P. (1967). Upper and lower probabilities induced by a multivalued mapping. Annals of Mathematical Statistics, 38:325–339.
Dennis, J. E. and Schnabel, R. B. (1983). Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, Englewood Cliffs, NJ.
DiCiccio, T. J. and Efron, B. (1996). Bootstrap confidence intervals [with discussion and a rejoinder]. Statistical Science, 11:189–228.
Diggle, P. (2013). Statistical Analysis of Spatial and Spatio-Temporal Point Patterns [3rd ed.]. Chapman & Hall/CRC, London.
Dufour, J. M. (1997). Some impossibility theorems in econometrics with applications to structural and dynamic models. Econometrica, 65:1365–1387.
Dyrrdal, A. V. and Vikhamar-Scholer, D. V. (2009). Analysis of long-term snow series at selected stations in Norway. Technical report, Norwegian Meteorological Institute, Oslo.
Eddington, A. S. (1914). Stellar Movements and the Structure of the Universe. Macmillan, New York.
Edgeworth, F. Y. (1909). Addendum on 'Probable errors of frequency constants'. Journal of the Royal Statistical Society, 72:81–90.
Edwards, A. W. F. (1992). Likelihood [expanded edition]. Johns Hopkins University Press, Baltimore.
Efron, B. (1982). Maximum likelihood theory and decision theory. Annals of Statistics, 10:340–356.
Efron, B. (1987). Better bootstrap confidence intervals [with discussion and a rejoinder]. Journal of the American Statistical Association, 82:171–200.
Efron, B. (1993). Bayes and likelihood calculations from confidence intervals. Biometrika, 80:3–26.
Efron, B. (1996). Empirical Bayes methods for combining likelihoods. Journal of the American Statistical Association, 91:538–550.
Efron, B. (1998). R. A. Fisher in the 21st century [with discussion and a rejoinder]. Statistical Science, 13:95–122.
Efron, B. (2013). Discussion of M. Xie and K. Singh's paper. International Statistical Review, 81:41–42.
Efron, B. (2014). Estimation and accuracy after model selection [with discussion and a rejoinder]. Journal of the American Statistical Association, 109:991–1007.
Efron, B. and Hinkley, D. V. (1978). Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information [with discussion and a rejoinder]. Biometrika, 65:457–487.
Efron, B. and Morris, C. (1973). Stein's estimation rule and its competitors – an empirical Bayes approach. Journal of the American Statistical Association, 68:117–130.
Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall, London.
Einmahl, J. H. J. and Magnus, J. R. (2008). Records in athletics through extreme-value theory. Journal of the American Statistical Association, 103:1382–1391.
Einmahl, J. H. J. and Smeets, S. G. W. R. (2011). Ultimate 100 m world records through extreme-value theory. Statistica Neerlandica, 65:32–42.
Einstein, A. (1934). On the method of theoretical physics. The Philosophy of Science, 1:163–169.
Elstad, M., Whitelaw, A. and Thoresen, M. (2011). Cerebral Resistance Index is less predictive in hypothermic encephalopathic newborns. Acta Paediatrica, 100:1344–1349.
Elvik, R. (2011). Publication bias and time-trend bias in meta-analysis of bicycle helmet efficacy: A re-analysis of Attewell, Glase and McFadden. Accident Analysis and Prevention, 43:1245–1251.
Embrechts, P., Klüppelberg, C. and Mikosch, T. (1997). Modelling Extremal Events for Insurance and Finance. Springer-Verlag, London.
Ericsson, N. R., Jansen, E. S., Kerbesian, N. A. and Nymoen, R. (1998). Interpreting a monetary condition index in economic policy. Technical report, Department of Economics, University of Oslo.
Ezekiel, M. (1930). Methods of Correlation Analysis. John Wiley & Sons, New York.


Fahrmeier, L. and Tutz, G. (1994). Multivariate Statistical Modelling Based on Generalized Linear Models. Springer-Verlag, Berlin.
Feigl, P. and Zelen, M. (1965). Estimation of exponential survival probabilities with concomitant information. Biometrics, 21:826–838.
Feller, W. (1950). An Introduction to Probability Theory and Its Applications. John Wiley & Sons, New York.
Felsenstein, J. (2004). Inferring Phylogenies. Sinauer Associates, Sunderland, MA.
Ferguson, T. S. (1967). Mathematical Statistics: A Decision Theoretic Approach. Academic Press, New York.
Ferguson, T. S. (1996). A Course in Large Sample Theory. Chapman & Hall, London.
Fieller, E. C. (1940). The biological standardization of insulin. Journal of the Royal Statistical Society Supplement, 7:1–64.
Fieller, E. C. (1954). Some problems in interval estimation. Journal of the Royal Statistical Society, Series B, 16:175–185.
Fine, T. L. (1977). Book review of Shafer: A mathematical theory of evidence. Bulletin of the American Statistical Society, 83:667–672.
Fischer, H. (2011). A History of the Central Limit Theorem: From Classical to Modern Probability Theory. Springer, Science & Business Media, New York.
Fisher, R. A. (1912). On an absolute criterion for fitting frequency curves. Messenger of Mathematics, 41:155–160.
Fisher, R. A. (1915). Frequency distribution of the values of the correlation coefficient in small samples. Biometrika, 10:507–521.
Fisher, R. A. (1918). The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society of Edinburgh, 52:399–433.
Fisher, R. A. (1920). A mathematical examination of the methods of determining the accuracy of an observation by the mean error, and by the mean square error. Monthly Notices of the Royal Astronomical Society, 80:758–770.
Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A, 222:309–368.
Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver & Boyd, Edinburgh.
Fisher, R. A. (1930). Inverse probability. Proceedings of the Cambridge Philosophical Society, 26:528–535.
Fisher, R. A. (1933). The concepts of inverse probability and fiducial probability referring to unknown parameters. Proceedings of the Royal Society, Series A, 139:343–348.
Fisher, R. A. (1934). Two new properties of mathematical likelihood. Proceedings of the Royal Society of London, Series A, 144:285–307.
Fisher, R. A. (1935). The fiducial argument in statistical inference. Annals of Eugenics, 6:391–398.
Fisher, R. A. (1941). The asymptotic approach to Behrens's integral, with further tables for the d test of significance. Annals of Eugenics, 11:141–172.
Fisher, R. A. (1954). Contribution to a discussion of a paper on interval estimation by M. A. Creasy. Journal of the Royal Statistical Society, Series B, 16:212–213.
Fisher, R. A. (1956). Statistical Methods and Scientific Inference. Hafner Press, New York.
Fisher, R. A. (1958). Cigarettes, cancer and statistics. Centennial Review, 2:151–166.
Fisher, R. A. (1973). Statistical Methods and Scientific Inference [3rd ed.]. Hafner Press, New York. Extended version of the 1956 edition.
Fleming, T. R. and Harrington, D. P. (1991). Counting Processes and Survival Analysis. John Wiley & Sons, New York.
Fraser, D. A. S. (1961a). The fiducial method and invariance. Biometrika, 48:261–280.
Fraser, D. A. S. (1961b). On fiducial inference. Annals of Mathematical Statistics, 32:661–676.
Fraser, D. A. S. (1966). Some remarks on pivotal models and the fiducial argument in relation to structural models. International Statistical Review, 64:231–236.
Fraser, D. A. S. (1968). The Structure of Inference. John Wiley & Sons, New York.
Fraser, D. A. S. (1998). Contribution to the discussion of Efron's paper. Statistical Science, 13:120–122.


Fraser, D. A. S. (2011). Is Bayes posterior just quick and dirty confidence? [with discussion and a rejoinder]. Statistical Science, 26:249–316.

Friesinger, A. (2004). Mein Leben, mein Sport, meine besten Fitness-Tipps. Goldmann, Berlin.

Frigessi, A. and Hjort, N. L. (2002). Statistical methods for discontinuous phenomena. Journal of Nonparametric Statistics, 14:1–5.

Galton, F. (1889). Natural Inheritance. Macmillan, London.

Gauss, C. F. (1816). Bestimmung der Genauigkeit der Beobachtungen. Zeitschrift fur Astronomie und verwandte Wissenschaften, 1:185–196.

Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (2004). Bayesian Data Analysis [2nd ed.]. Chapman & Hall/CRC, London.

Gelman, A. and Nolan, D. (2002). Teaching Statistics: A Bag of Tricks. Oxford University Press, Oxford.

Gilbert, R., Salanti, G., Harden, M. and See, S. (2005). Infant sleeping position and the sudden infant death syndrome: Systematic review of observational studies and historical review of recommendations from 1940 to 2002. International Journal of Epidemiology, 34:874–887.

Gilks, W. R., Richardson, S. and Spiegelhalter, D. J. (1996). Markov Chain Monte Carlo in Practice. Chapman & Hall, London.

Giron, J., Ginebra, J. and Riba, A. (2005). Bayesian analysis of a multinomial sequence and homogeneity of literary style. The American Statistician, 59:19–30.

Givens, G. H., Huebinger, R. M., Patton, J. C., Postma, L. D., Lindsay, M., Suydam, R. S., George, J. C., Matson, C. W. and Bickham, J. W. (2010). Population genetics of Bowhead whales (Balaena mysticetus) in the Western Arctic. Arctic, 63:1–12.

Glad, I. K., Hjort, N. L. and Ushakov, N. G. (2003). Correction of density estimators that are not densities. Scandinavian Journal of Statistics, 30:415–427.

Goldstein, H. (2011). Multilevel Statistical Models [4th ed.]. John Wiley & Sons, London.

Good, I. J. (1983). Good Thinking: The Foundations of Probability and Its Applications. University of Minnesota Press, Minneapolis.

Goodman, L. A. (1954). Some practical techniques in serial number analysis. Journal of the American Statistical Association, 49:97–112.

Gould, S. J. (1995). The median isn’t the message. In Adam’s Navel and Other Essays, pp. 15–21. Penguin Classics, London. First published in Discover Magazine, June 1985.

Gould, S. J. (2003). The Hedgehog, the Fox, and the Magister’s Pox. Harmony Books, New York.

Green, P. J., Hjort, N. L. and Richardson, S. (2003). Highly Structured Stochastic Systems. Oxford University Press, Oxford.

Gujarati, X. (1968). The relation between help-wanted index and the unemployment rate: A statistical analysis, 1962–1967. The Quarterly Review of Economics and Business, 8:67–73.

Guttman, L. (1985). The illogic of statistical inference for cumulative science. Applied Stochastic Models and Data Analysis, 1:3–9.

Haavelmo, T. (1943). The statistical implications of a system of simultaneous equations. Econometrica, 11:1–12.

Haavelmo, T. (1944). The probability approach in econometrics. Econometrica, 12:iii–vi+1–115.

Hacking, I. (1975). The Emergence of Probability: A Philosophical Study of Early Ideas About Probability, Induction and Statistical Inference. Cambridge University Press, Cambridge.

Hacking, I. (2006). The Emergence of Probability: A Philosophical Study of Early Ideas About Probability, Induction and Statistical Inference. Cambridge University Press, Cambridge. This is the third edition of the book, with an extended preface.

Hacking, I. M. (1965). Logic of Statistical Inference. Cambridge University Press, Cambridge.

Hald, A. (1990). A History of Probability and Statistics and Their Applications Before 1750. John Wiley & Sons, New York.

Hald, A. (1998). A History of Mathematical Statistics from 1750 to 1930. John Wiley & Sons, New York.

Hald, A. (1999). On the history of maximum likelihood in relation to inverse probability and least squares. Statistical Science, 14:214–222.

Hall, P. (1988). Theoretical comparison of bootstrap confidence intervals. Annals of Statistics, 16:927–953.

Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer-Verlag, New York.


Hampel, F. (2001). An outline of a unifying statistical theory. Technical Report 95, Seminar fur Statistik, ETH Zurich.

Hampel, F. (2006). The proper fiducial argument. In Ahlswede, R. (ed.), General Theory of Information Transfer and Combinatorics, Lecture Notes in Computer Science, No. 4123, pp. 512–526. Springer-Verlag, Heidelberg.

Hampel, F. R., Ronchetti, E., Rousseeuw, P. J. and Stahel, W. A. (1986). Robust Statistics: The Approach Based on Influence Functions. John Wiley, New York.

Hand, D. J., Daly, F., Lunn, A., McConway, K. J. and Ostrowski, E. (1994). A Handbook of Small Data Sets. Chapman & Hall, London.

Hannig, J. (2009). On generalized fiducial inference. Statistica Sinica, 19:491–544.

Hannig, J., Iyer, H. and Patterson, P. (2006). Fiducial generalized confidence intervals. Journal of the American Statistical Association, 101:254–269.

Hannig, J. and Lee, T. C. M. (2009). Generalized fiducial inference for wavelet regression. Biometrika, 96:847–860.

Hannig, J. and Xie, M. (2012). A note on Dempster–Shafer recombination of confidence distributions. Electronic Journal of Statistics, 6:1943–1966.

Hansen, B. E. (2008). Least squares forecast averaging. Journal of Econometrics, 146:342–350.

Hardle, W. K. and Simar, L. (2012). Applied Multivariate Statistical Analysis [3rd ed.]. Springer-Verlag, Berlin.

Harris, R. R. and Harding, E. F. (1984). The fiducial argument and Hacking’s principle of irrelevance. Journal of Applied Statistics, 11:170–181.

Hary, A. (1960). 10,0. Copress, Munchen.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109.

Heger, A. (2011). Jeg og jordkloden. Dagsavisen, December 16.

Helland, I. S. (1982). Central limit theorems for martingales with discrete or continuous time. Scandinavian Journal of Statistics, 9:79–94.

Helland, I. S. (2015). Epistemic Processes: A Basis for Statistics and for Quantum Mechanics. Springer Science & Business Media, New York.

Hermansen, G. H. and Hjort, N. L. (2015). Focused information criteria for time series. Submitted for publication, xx:xx–xx.

Hermansen, G. H., Hjort, N. L. and Kjesbu, O. S. (2015). Modern statistical methods applied on extensive historic data: Hjort liver quality time series 1859–2012 and associated influential factors. Canadian Journal of Fisheries and Aquatic Sciences, 72.

Hjort, J. (1914). Fluctuations in the Great Fisheries of Northern Europe, Viewed in the Light of Biological Research. Conseil Permanent International Pour l’Exploration de la Mer, Copenhagen.

Hjort, J. (1933). Whales and whaling. Hvalradets skrifter: Scientific Results of Marine Biological Research.

Hjort, J. (1937). The story of whaling: A parable of sociology. The Scientific Monthly, 45:19–34.

Hjort, N. L. (1985). Discussion contribution to P. K. Andersen and Ø. Borgan’s article ‘Counting process models for life history data: A review’. Scandinavian Journal of Statistics, 12:97–158.

Hjort, N. L. (1986a). Bayes estimators and asymptotic efficiency in parametric counting process models. Scandinavian Journal of Statistics, 13:63–85.

Hjort, N. L. (1986b). Statistical Symbol Recognition [Research Monograph]. The Norwegian Computing Centre, Oslo.

Hjort, N. L. (1988a). The eccentric part of the noncentral chi square. The American Statistician, 42:130–132.

Hjort, N. L. (1988b). On large-sample multiple comparison methods. Scandinavian Journal of Statistics, 15:259–271.

Hjort, N. L. (1990a). Goodness of fit tests in models for life history data based on cumulative hazard rates. Annals of Statistics, 18:1221–1258.

Hjort, N. L. (1990b). Nonparametric Bayes estimators based on Beta processes in models for life history data. Annals of Statistics, 18:1259–1294.


Hjort, N. L. (1992). On inference in parametric survival data models. International Statistical Review, 60:355–387.

Hjort, N. L. (1994a). The exact amount of t-ness that the normal model can tolerate. Journal of the American Statistical Association, 89:665–675.

Hjort, N. L. (1994b). Minimum L2 and robust Kullback–Leibler estimation. In Lachout, P. and Visek, J. A. (eds.), Proceedings of the 12th Prague Conference on Information Theory, Statistical Decision Functions and Random Processes, pp. 102–106. Academy of Sciences of the Czech Republic, Prague.

Hjort, N. L. (1994c). Should the Olympic sprint skaters run the 500 m twice? Technical report, Department of Mathematics, University of Oslo.

Hjort, N. L. (2003). Topics in nonparametric Bayesian statistics. In Green, P. J., Hjort, N. L. and Richardson, S. (eds.), Highly Structured Stochastic Systems, pp. 455–478. Oxford University Press, Oxford.

Hjort, N. L. (2007). And quiet does not flow the Don: Statistical analysis of a quarrel between Nobel laureates. In Østreng, W. (ed.), Consilience, pp. 134–140. Centre for Advanced Research, Oslo.

Hjort, N. L. (2008). Discussion of P. L. Davies’ article ‘Approximating data’. Journal of the Korean Statistical Society, 37:221–225.

Hjort, N. L. (2014). Discussion of Efron’s ‘Estimation and accuracy after model selection’. Journal of the American Statistical Association, 109:1017–1020.

Hjort, N. L. and Claeskens, G. (2003a). Frequentist model average estimators [with discussion and a rejoinder]. Journal of the American Statistical Association, 98:879–899.

Hjort, N. L. and Claeskens, G. (2003b). Rejoinder to the discussion of ‘Frequentist model average estimators’ and ‘The focused information criterion’. Journal of the American Statistical Association, 98:938–945.

Hjort, N. L. and Claeskens, G. (2006). Focussed information criteria and model averaging for Cox’s hazard regression model. Journal of the American Statistical Association, 101:1449–1464.

Hjort, N. L. and Fenstad, G. (1992). On the last time and the number of times an estimator is more than ε from its target value. Annals of Statistics, 20:469–489.

Hjort, N. L. and Glad, I. K. (1995). Nonparametric density estimation with a parametric start. Annals of Statistics, 23:882–904.

Hjort, N. L., Holmes, C., Muller, P. and Walker, S. (2010). Bayesian Nonparametrics. Cambridge University Press, Cambridge.

Hjort, N. L. and Jones, M. C. (1996). Locally parametric nonparametric density estimation. Annals of Statistics, 24:1619–1647.

Hjort, N. L. and Koning, A. J. (2002). Tests for constancy of model parameters over time. Journal of Nonparametric Statistics, 14:113–132.

Hjort, N. L., McKeague, I. W. and Van Keilegom, I. (2009). Extending the scope of empirical likelihood. Annals of Statistics, 37:1079–1111.

Hjort, N. L. and Omre, H. (1994). Topics in spatial statistics [with discussion and a rejoinder]. Scandinavian Journal of Statistics, 21:289–357.

Hjort, N. L. and Petrone, S. (2007). Nonparametric quantile inference using Dirichlet processes. In Nair, V. (ed.), Advances in Statistical Modeling and Inference: Essays in Honor of Kjell Doksum, pp. 463–492. World Scientific, Hackensack, NJ.

Hjort, N. L. and Pollard, D. B. (1993). Asymptotics for minimisers of convex processes. Technical report, Department of Mathematics, University of Oslo.

Hjort, N. L. and Rosa, D. (1998). Who won? Speedskating World, 4:15–18.

Hjort, N. L. and Varin, C. (2008). ML, PL, QL in Markov chain models. Scandinavian Journal of Statistics, 35:64–82.

Hjort, N. L. and Walker, S. (2009). Quantile pyramids for Bayesian nonparametrics. Annals of Statistics, 37:105–131.

Hobolth, A. and Jensen, J. L. (2005). Statistical inference in evolutionary models of DNA sequences via the EM algorithm. Technical report, Department of Theoretical Statistics, University of Aarhus.

Hoeting, J. A., Madigan, D., Raftery, A. E. and Volinsky, C. T. (1999). Bayesian model averaging: A tutorial [with discussion and a rejoinder]. Statistical Science, 14:382–417.


Hollings, X. and Triggs, X. (1993). Influence of the new rules in international rugby football: Implications for conditioning. Technical report, xx.

Holum, D. (1984). The Complete Handbook of Speed Skating. High Peaks Cyclery, Lake Placid.

Hosmer, D. W. and Lemeshow, S. (1999). Applied Logistic Regression. John Wiley & Sons, New York.

Hotelling, H. (1931). The generalization of Student’s ratio. Annals of Mathematical Statistics, 2:360–378.

Houde, E. D. (2008). Emerging from Hjort’s shadow. Journal of Northwest Atlantic Fishery Science, 41:53–70.

Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. In Le Cam, L. and Neyman, J. (eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, pp. 221–233. University of California Press, Berkeley.

Huber, P. J. (1981). Robust Statistics. John Wiley & Sons, New York.

IPCC (2007). Climate Change 2007: Fourth Assessment Report of the Intergovernmental Panel on Climate Change. United Nations, New York. Published by Cambridge University Press, New York.

IPCC (2013). Climate Change 2013: Fifth Assessment Report of the Intergovernmental Panel on Climate Change. United Nations, New York. Published by Cambridge University Press, New York.

Jansen, D. (1994). Full Circle. Villard Books, New York.

Jeffreys, H. (1931). Scientific Inference. Cambridge University Press, Cambridge.

Jeffreys, H. (1961). Theory of Probability. Oxford University Press, Oxford.

Jensen, J. L. (1993). A historical sketch and some new results on the improved likelihood ratio statistic. Scandinavian Journal of Statistics, 20:1–15.

Johansen, S. (1979). Introduction to the Theory of Regular Exponential Families. Institute of Mathematical Statistics, University of København, København.

Jones, M. C. (1992). Estimating densities, quantiles, quantile densities and density quantiles. Annals of the Institute of Statistical Mathematics, 44:721–727.

Jones, M. C., Hjort, N. L., Harris, I. R. and Basu, A. (2001). A comparison of related density-based minimum divergence estimators. Biometrika, 88:865–873.

Jordan, S. M. and Krishnamoorthy, K. (1996). Exact confidence intervals for the common mean of several normal populations. Biometrics, 52:77–86.

Jorde, P. E., Schweder, T., Bickham, J. W., Givens, G. H., Suydam, R., Hunter, D. and Stenseth, N. C. (2007). Detecting genetic structure in migrating bowhead whales off the coast of Barrow, Alaska. Molecular Ecology, 16:1993–2004.

Joshi, V. M. (1967). Inadmissibility of the usual confidence sets for the mean of a multivariate normal population. Annals of Mathematical Statistics, 38:1868–1875.

Jøssang, A. and Pope, S. (2012). Dempster’s rule as seen by little colored balls. Computational Intelligence, 4:453–474.

Joyce, P. and Marjoram, P. (2008). Approximately sufficient statistics and Bayesian computation. Statistical Applications in Genetics and Molecular Biology, 7:1–18.

Jullum, M. and Hjort, N. L. (2015). Parametric or nonparametric? A focussed information criterion approach. Submitted, xx:xx–xx.

Kagan, J. (2009). The Three Cultures: Natural Sciences, Social Sciences, and the Humanities in the 21st Century. Cambridge University Press, Cambridge.

Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux, New York.

Kahneman, D. and Tversky, A. (1979). Prospect theory: An analysis of decisions under risk. Econometrica, 47:263–291.

Kahneman, D. and Tversky, A. (1984). Choices, values and frames. American Psychologist, 39:341–350.

Kalbfleisch, J. D. and Prentice, R. L. (2002). The Statistical Analysis of Failure Time Data [2nd ed.]. John Wiley & Sons, New York.

Kalbfleisch, J. G. and Sprott, D. A. (2006). Fiducial probability. In Xx, X. (ed.), General Theory of Information Transfer and Combinatorics, Lecture Notes in Computer Science, pp. 99–109. Springer-Verlag, Heidelberg.

Kardaun, O. J. W. F., Salome, D., Schaafsma, E., Steerneman, A. G. M., Willems, J. C. and Cox, D. R. (2003). Reflections on fourteen cryptic issues concerning the nature of statistical inference [with discussion and a rejoinder]. International Statistical Review, 71:277–318.


Karlin, S. and Matessi, C. (1983). The eleventh R. A. Fisher memorial lecture: Kin selection and altruism. Proceedings of the Royal Society of London, Series B, 219:327–353.

Karlin, S. and Taylor, H. M. (1975). A First Course in Stochastic Processes. Academic Press, New York.

Kass, R. (2011). Statistical inference: The big picture. Statistical Science, 26:1–9.

Kavvoura, F. K. and Ioannidis, J. P. A. (2008). Methods for meta-analysis in genetic association studies: A review of their potential and pitfalls. Human Genetics, 123:1–14.

Kelly, F. P. and Ripley, B. D. (1976). On Strauss’s model for clustering. Biometrika, 63:357–360.

Keynes, J. M. (1921). A Treatise on Probability. Macmillan & Co., London.

Kim, J. and Pollard, D. (1990). Cube root asymptotics. Annals of Statistics, 18:191–219.

Kimura, M. (1981). Estimation of evolutionary distances between homologous nucleotide sequences. Proceedings of the National Academy of Sciences USA, 78:454–458.

Kjesbu, O. S., Opdal, A. F., Korsbrekke, K., Devine, J. A. and Skjæraasen, J. E. (2014). Making use of Johan Hjort’s ‘unknown’ legacy: Reconstruction of a 150-year coastal time-series on northeast Arctic cod (Gadus morhua) liver data reveals long-term trends in energy allocation patterns. ICES Journal of Marine Science, 71:2053–2063.

Knutsen, H., Olsen, E. M., Jorde, P. E., Espeland, S. H., Andre, C. and Stenseth, N. C. (2011). Are low but statistically significant levels of genetic differentiation in marine fishes ‘biologically meaningful’? A case study of coastal Atlantic cod. Molecular Ecology, 20:768–783.

Kohler, R. E. (1994). Lords of the Fly: ‘Drosophila’ Genetics and the Experimental Life. University of Chicago Press, Chicago.

Kolmogorov, A. N. (1933). Grundbegriffe der Wahrscheinlichkeitsrechnung. Springer, Berlin. Translation of Osnovnye ponyatiya teorii veroyatnostei, Nauka, Moskva.

Kolmogorov, A. N. (1998). Osnovnye ponyatiya teorii veroyatnostei. Fazis, Moskva. 3rd edition of the Russian 1936 original, containing more material than the 1933 German edition.

Konishi, K., Tamura, T., Zenitani, R., Bano, T., Kato, H. and Walløe, L. (2008). Decline in energy storage in the Antarctic minke whale (Balaenoptera bonaerensis) in the Southern Ocean. Polar Biology, 31:1509–1520.

Konishi, K. and Walløe, L. (2015). Substantial decline in energy storage and stomach fullness in Antarctic minke whales during the 1990s. Submitted for publication, xx:xx–xx.

Koopman, B. (1936). On distributions admitting a sufficient statistic. Transactions of the American Mathematical Society, 39:399–409.

Koschat, M. A. (1987). A characterisation of the Fieller solution. Annals of Statistics, 15:462–468.

Lancaster, T. (2000). The incidental parameter problem since 1948. Journal of Econometrics, 95:391–413.

Langaas, M., Lindqvist, B. H. and Ferkingstad, E. (2005). Estimating the proportion of true null hypotheses, with application to DNA microarray data. Journal of the Royal Statistical Society, Series B, 67:555–572.

Laplace, P. S. (1774). Memoire sur la probabilite des causes par les evenemens. Memoires de Mathematique et de Physique, Tome Sixieme. xx, Paris.

Lawless, J. F. and Fredette, M. (2005). Frequentist prediction intervals and predictive distributions. Biometrika, 92:529–542.

Laws, R. M. (1977). Seals and whales of the Southern Ocean. Philosophical Transactions of the Royal Society, Series B, 279:81–96.

Le Cam, L. (1964). Sufficiency and approximate sufficiency. Annals of Mathematical Statistics, 35:1419–1455.

Le Cam, L. and Yang, G. L. (2000). Asymptotics in Statistics: Some Basic Concepts. Springer-Verlag, Berlin.

Lee, Y., Nelder, J. and Pawitan, Y. (2006). Generalized Linear Models with Random Effects: Unified Analysis via H-likelihood. Chapman & Hall/CRC, Boca Raton, FL.

Lee, Y. and Nelder, J. A. (1996). Hierarchical generalized linear models [with discussion and a rejoinder]. Journal of the Royal Statistical Society, Series B, 58:619–678.

Lehmann, E. L. (1959). Testing Statistical Hypotheses. John Wiley & Sons, New York.

Lehmann, E. L. (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, San Francisco.

Lehmann, E. L. (1983). Theory of Point Estimation. John Wiley & Sons, New York.


Lehmann, E. L. (1993). The Fisher, Neyman–Pearson theories of testing hypotheses: One theory or two? Journal of the American Statistical Association, 88:1242–1249.

Lehmann, E. L. (1999). Elements of Large-Sample Theory. Springer-Verlag, Berlin.

Lehmann, E. L. (2011). Fisher, Neyman, and the Creation of Classical Statistics. Springer-Verlag, New York.

Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation [2nd ed.]. Springer, Berlin.

Lehmann, E. L. and Romano, J. P. (2005). Testing Statistical Hypotheses [3rd ed.]. John Wiley & Sons, New York.

Le May Doan, C. (2002). Going for Gold. McClelland & Stewart Publisher, Toronto.

Lerudjordet, M. (2012). Statistical analysis of track and field data [master’s thesis]. Technical report, Department of Mathematics, University of Oslo.

Liang, H., Zou, G., Wan, A. T. K. and Zhang, X. (2011). Optimal weight choice for frequentist model average estimators. Journal of the American Statistical Association, 106:1053–1066.

Lindley, D. V. (1958). Fiducial distributions and Bayes’ theorem. Journal of the Royal Statistical Society, Series B, 20:102–107.

Lindqvist, B. H. and Taraldsen, G. (2005). Monte Carlo conditioning on a sufficient statistic. Biometrika, 92:451–464.

Lindqvist, B. H. and Taraldsen, G. (2007). Conditional Monte Carlo based on sufficient statistics with applications. In Nair, V. (ed.), Advances in Statistical Modeling and Inference: Essays in Honor of Kjell Doksum, pp. 545–562. World Scientific, Hackensack, NJ.

Linnik, Y. V. (1963). On the Behrens–Fisher problem. Bulletin of the International Statistical Institute, 40:833–841.

Liu, C.-A. (2015). Distribution theory of the least squares averaging estimator. Journal of Econometrics, 186:142–159.

Liu, D., Liu, R. Y. and Xie, M. (2014a). Exact meta-analysis approach for discrete data and its application to 2 × 2 tables with rare events. Journal of the American Statistical Association, 109:1450–1465.

Liu, D., Liu, R. Y. and Xie, M. (2014b). Multivariate meta-analysis of heterogeneous studies using only summary statistics: Efficiency and robustness. Journal of the American Statistical Association, xx:xx–xx.

Mandelkern, M. (2002). Setting confidence intervals for bounded parameters [with discussion and a rejoinder]. Statistical Science, 17:149–159.

Manley, G. (1974). Central England temperatures: Monthly means 1659 to 1973. Quarterly Journal of the Royal Meteorological Society, 100:389–405.

Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979). Multivariate Analysis. Academic Press, New York.

Marin, J.-M., Pudlo, P., Robert, C. P. and Ryder, R. J. (2012). Approximate Bayesian computational methods. Statistics and Computing, 22:1167–1180.

Markov, A. A. (1906). Rasprostranenie zakona bol’shikh chisel na velichiny, zavisyashchie drug ot druga [Extending the law of large numbers for variables that are dependent of each other]. Izvestiya Fiziko-matematicheskogo obshchestva pri Kazanskom universitete (2-ya seriya), 15:124–156.

Markov, A. A. (1913). Primer statisticheskogo issledovaniya nad tekstom ‘Evgeniya Onegina’, illyustriruyushchii svyaz’ ispytanii v tsep’ [Example of a statistical investigation illustrating the transitions in the chain for the ‘Evgenii Onegin’ text]. Izvestiya Akademii Nauk, Sankt-Peterburg (6-ya seriya), 7:153–162.

Marshall, E. C. and Spiegelhalter, D. J. (2007). Identifying outliers in Bayesian hierarchical models: A simulation-based approach. Bayesian Analysis, 2:409–444.

Mayo, D. G. (2010). An error in the argument from conditionality and sufficiency to the likelihood principle. In Mayo, D. G. and Spanos, A. (eds.), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science, pp. 305–314. Cambridge University Press, Cambridge.

Mayo, D. G. (2014). On the Birnbaum argument for the strong likelihood principle [with discussion and a rejoinder]. Statistical Science, 29:227–239.


McCloskey, R. (1943). Homer Price. Scholastic, New York.

McCullagh, P. (2002). What is a statistical model? [with discussion and a rejoinder]. Annals of Statistics, 30:1225–1308.

McCullagh, P. and Nelder, J. (1989). Generalized Linear Models [2nd ed.]. Chapman & Hall/CRC, Boca Raton, FL.

Melville, H. (1857). The Confidence-Man. Dix, Edwards & Co., New York.

Milyukov, V. and Fan, S.-H. (2012). The Newtonian gravitational constant: Modern status of measurement and the new CODATA value. Gravitation and Cosmology, 18:216–224.

Mogstad, E. K. (2013). Mode hunting and density estimation with the focused information criterion [master’s thesis]. Technical report, Department of Mathematics, University of Oslo.

Moher, D., Schulz, K. F. and Altman, D. G. (2010). CONSORT 2010 Statement: Updated guidelines for reporting parallel group randomised trials. BMC Medicine, 8:18. doi:10.1186/1741-7015-8-18.

Møller, J. and Waagepetersen, R. (2003). Statistical Inference and Simulation for Spatial Point Processes. Chapman & Hall/CRC, London.

Moyeed, R. A. and Baddeley, A. J. (1991). Stochastic approximation of the MLE for a spatial point pattern. Scandinavian Journal of Statistics, 18:39–50.

Murtaugh, P. A., Dickson, E. R., Van Dam, G. M., Malinchoc, M., Grambsch, P. M., Langworthy, A. L. and Gips, C. H. (1994). Primary biliary cirrhosis: Prediction of short-term survival based on repeated patient visits. Hepatology, 20:126–134.

Nadarajah, S., Bityukov, S. and Krasnikov, N. (2015). Confidence distributions: A review. Statistical Methodology, 22:23–46.

Nair, V. N. (1984). Confidence bands for survival functions with censored data: A comparative study. Technometrics, 26:265–275.

Narum, S., Westergren, T. and Klemp, M. (2014). Corticosteroids and risk of gastrointestinal bleeding: A systematic review and meta-analysis. BMJ Open, 4:1–10.

Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society, Series A, 135:370–384.

Nelson, J. P. and Kennedy, P. E. (2009). The use (and abuse) of meta-analysis in environmental and natural resource economics: An assessment. Environmental and Resource Economics, 42:345–377.

Newcomb, S. (1891). Measures of the velocity of light made under the direction of the Secretary of the Navy during the years 1880–1882. Astronomical Papers, 2:107–230.

Neyman, J. (1934). On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, Series A, 97:558–625.

Neyman, J. (1941). Fiducial argument and the theory of confidence intervals. Biometrika, 32:128–150.

Neyman, J. and Pearson, E. (1933). On the problem of the most efficient tests of statistical hypotheses [with discussion and a rejoinder]. Philosophical Transactions of the Royal Society of London, Series A, 231:289–337.

Neyman, J. and Scott, E. L. (1948). Consistent estimates based on partially consistent observations. Econometrica, 16:1–32.

Niemiro, W. (1992). Asymptotics for M-estimators defined by convex minimization. Annals of Statistics, 20:1514–1533.

Nissen, S. E. and Wolski, K. (2007). Effect of rosiglitazone on the risk of myocardial infarction and death from cardiovascular causes. The New England Journal of Medicine, 356:2457–2471.

Norberg, R. (1988). Discussion of Schweder’s paper ‘A significance version of the basic Neyman–Pearson test theory for cumulative science’. Scandinavian Journal of Statistics, 15:235–241.

Normand, S.-L. T. (1999). Tutorial in biostatistics: Meta-analysis: Formulating, evaluating, combining, and reporting. Statistics in Medicine, 18:321–359.

Oeppen, J. and Vaupel, J. W. (2002). Broken limits to life expectancy. Science, 296:1029–1031.

Oja, H. (2010). Multivariate Nonparametric Methods with R: An Approach Based on Spatial Signs and Ranks. Springer-Verlag, Berlin.

Ottersen, G., Hjermann, D. Ø. and Stenseth, N. C. (2006). Changes in spawning stock structure strengthen the link between climate and recruitment in a heavily fished cod (Gadus morhua) stock. Fisheries Oceanography, 15:230–243.


Owen, A. (1990). Empirical likelihood ratio confidence regions. Annals of Statistics, 18:90–120.

Owen, A. (1991). Empirical likelihood for linear models. Annals of Statistics, 19:1725–1747.

Owen, A. (1995). Nonparametric likelihood confidence bands for a distribution function. Journal of the American Statistical Association, 90:516–521.

Owen, A. (2001). Empirical Likelihood. Chapman & Hall/CRC, London.

Paccioli, L. (1494). Summa de arithmetica, geometria et proportionalita. xx, Venezia.

Parmar, M. K. B., Torri, V. and L., S. (1998). Extracting summary statistics to perform meta-analysis of the published literature for survival endpoints. Statistics in Medicine, 17:2815–2834.

Pawitan, Y. (2000). Computing empirical likelihood from the bootstrap. Statistics and Probability Letters, 47:337–345.

Pawitan, Y. (2001). In All Likelihood: Statistical Modelling and Inference Using Likelihood. Oxford University Press, Oxford.

Paxton, C. G. M., Burt, M. L., Hedley, S. L., Vikingsson, G., Gunnlaugsson, T. and Desportes, G. (2006). Density surface fitting to estimate the abundance of humpback whales based on the NASS-95 and NASS-2001 aerial and shipboard surveys. NAMMCO Scientific Publications, 7:143–159.

Pearson, E. S. (1966). The Neyman–Pearson story: 1926–34. In Research Papers in Statistics: Festschrift for J. Neyman. John Wiley & Sons, New York.

Pearson, K. (1902). On the change in expectation of life in man during a period of circa 2000 years. Biometrika, 1:261–264.

Pedersen, J. G. (1978). Fiducial inference. International Statistical Review, 46:147–170.

Peplow, M. (2014). Social sciences suffer from severe publication bias. Nature, xx:xx.

Pitman, E. J. G. (1936). Sufficient statistics and intrinsic accuracy. Mathematical Proceedings of the Cambridge Philosophical Society, 32:567–579.

Pitman, E. J. G. (1939). The estimation of location and scale parameters of a continuous population of any given form. Biometrika, 30:391–421.

Pitman, E. J. G. (1957). Statistics and science. Journal of the American Statistical Association, 52:322–330.

Pollard, D. B. (1991). Asymptotics for least absolute deviation regression estimators. Econometric Theory, 7:295–314.

Poole, D. and Raftery, A. E. (2000). Inference in deterministic simulation models: The Bayesian melding approach. Journal of the American Statistical Association, 95:1244–1255.

Qin, J. and Lawless, J. (1994). Empirical likelihood and general estimating equations. Annals of Statistics, 22:300–325.

Quenouille, M. H. (1958). The Fundamentals of Statistical Reasoning. Charles Griffin, London.

Raftery, A. E., Givens, G. H. and Zeh, J. E. (1995). Inference from a deterministic population dynamics model for bowhead whales [with discussion and a rejoinder]. Journal of the American Statistical Association, 90:402–430.

Raftery, A. E. and Schweder, T. (1993). Inference about the ratio of two parameters, with application to whale censusing. The American Statistician, 47:259–264.

Rao, C. R. (1945). Information and accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37:81–91.

Rausand, M. and Høyland, A. (2004). System Reliability Theory: Models, Statistical Methods, and Applications. John Wiley & Sons, Hoboken, NJ.

Rebolledo, R. (1980). Central limit theorems for local martingales. Zeitschrift fur Wahrscheinlichkeitstheorie und verwandte Gebiete, 51:269–286.

Reeds, J. A. (1985). Asymptotic number of roots of Cauchy likelihood equations. Annals of Statistics, 13:775–784.

Reid, C. (1982). Neyman: From Life. Springer-Verlag, New York.

Reiss, R.-D. (1989). Approximate Distributions of Order Statistics. Springer-Verlag, Heidelberg.

Ripley, B. D. (1977). Modelling spatial patterns [with discussion and a rejoinder]. Journal of the Royal Statistical Society, Series B, 39:172–212.

Ripley, B. D. (1981). Spatial Statistics. John Wiley & Sons, New York.

Ripley, B. D. (1988). Statistical Inference for Spatial Processes. Cambridge University Press, Cambridge.

Page 150: 12...−3.842,−3.284,−0.278,2.240,3.632, and pointed to certain complexities having to do with multimodal likelihood functions. Here we use the two-parameter Cauchy model, with

“Schweder-Book” — 2015/10/21 — 17:46 — page 485 — #505

References 485

Robinson, M. E. and Tawn, J. A. (1995). Statistics for exceptional athletics records. Journal of the Royal Statistical Society, Series C, 44:499–511.
Rockafellar, R. T. (1970). Convex Analysis. Princeton University Press, Princeton.
Rodgers, J. L. and Doughty, D. (2001). Does having boys or girls run in the family? Chance, 8–13.
Romano, J. P. and Wolf, M. (2007). Control of generalized error rates in multiple testing. Annals of Statistics, 35:1378–1408.
Rothstein, H., Sutton, A. J. and Borenstein, M. (2005). Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments. John Wiley & Sons, Chichester.
Royall, R. M. (1997). Statistical Evidence: A Likelihood Paradigm. Chapman & Hall, London.
Rücker, G., Schwarzer, G., Carpenter, J. and Olkin, I. (2008). Why add anything to nothing? The arcsine difference as a measure of treatment effect in meta-analysis with zero cells. Statistics in Medicine, 28:721–738.
Salome, D. (1998). Statistical Inference via Fiducial Methods [PhD dissertation]. Technical report, University of Groningen.
Savage, L. J. (1976). On rereading R. A. Fisher. Annals of Statistics, 4:441–500.
Scheffé, H. (1959). The Analysis of Variance. John Wiley & Sons, New York.
Scheffé, H. (1970). Practical solutions to the Behrens–Fisher problem. Journal of the American Statistical Association, 65:1501–1508.
Schweder, T. (1975). Window estimation of the asymptotic variance of rank estimators of location. Scandinavian Journal of Statistics, 2:113–126.
Schweder, T. (1988). A significance version of the basic Neyman–Pearson theory for scientific hypothesis testing [with discussion and a rejoinder]. Scandinavian Journal of Statistics, 15:225–242.
Schweder, T. (1995). Discussion contribution to ‘Inference from a deterministic population dynamics model for bowhead whales’ by Raftery, Givens, Zeh. Journal of the American Statistical Association, 90:420–423.
Schweder, T. (2003). Abundance estimation from multiple photo surveys: Confidence distributions and reduced likelihood for bowhead whales off Alaska. Biometrics, 59:974–983.
Schweder, T. (2007). Confidence nets for curves. In Nair, V. (ed.), Advances in Statistical Modeling and Inference: Essays in Honor of Kjell Doksum, pp. 593–609. World Scientific, Hackensack, NJ.
Schweder, T. and Hjort, N. L. (1996). Bayesian synthesis or likelihood synthesis – what does Borel’s paradox say? Reports of the International Whaling Commission, 46:475–479.
Schweder, T. and Hjort, N. L. (2002). Confidence and likelihood. Scandinavian Journal of Statistics, 29:309–332.
Schweder, T. and Hjort, N. L. (2003). Frequentist analogues of priors and posteriors. In Stigum, B. (ed.), Econometrics and the Philosophy of Economics: Theory-Data Confrontation in Economics, pp. 285–317. Princeton University Press, Princeton, NJ.
Schweder, T. and Hjort, N. L. (2013a). Discussion of M. Xie and K. Singh’s ‘Confidence distributions, the frequentist estimator of a parameter: A review’. International Statistical Review, 81:56–68.
Schweder, T. and Hjort, N. L. (2013b). Integrating confidence intervals, likelihoods and confidence distributions. In Proceedings 59th World Statistics Congress, 25–30 August 2013, Hong Kong, volume I, pp. 277–282. International Statistical Institute, Amsterdam.
Schweder, T. and Ianelli, J. N. (1998). Bowhead assessment by likelihood synthesis: methods and difficulties. Technical Report 50/AS2, the Scientific Committee of the International Whaling Commission, 16pp.
Schweder, T. and Ianelli, J. N. (2001). Assessing the Bering-Chukchi-Beaufort Seas stock of bowhead whales from survey data, age-readings and photo-identifications using frequentist methods. Technical Report 52/AS13, the Scientific Committee of the International Whaling Commission, 16pp.
Schweder, T., Sadykova, D., Rugh, D. and Koski, W. (2010). Population estimates from aerial photographic surveys of naturally and variably marked bowhead whales. Journal of Agricultural, Biological and Environmental Statistics, 15:1–19.
Schweder, T. and Spjøtvoll, E. (1982). Plots of P-values to evaluate many tests simultaneously. Biometrika, 69:492–502.
Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. John Wiley & Sons, New York.


Shafer, G. (1976). A Mathematical Theory of Evidence. Princeton University Press, Princeton, NJ.
Sharma, S. (1980). On Hacking’s fiducial theory of inference. The Canadian Journal of Statistics, 8:227–233.
Sheather, S. J. and Marron, J. S. (1990). Kernel quantile estimation. Journal of the American Statistical Association, 85:410–416.
Shmueli, G. (2010). To explain or to predict? Statistical Science, 25:289–310.
Shumway, R. H. (1988). Applied Statistical Time Series Analysis. Prentice-Hall, Englewood Cliffs, NJ.
Simpson, R. J. S. and Pearson, K. (1904). Report on certain enteric fever inoculation statistics. The British Medical Journal, 2:1243–1246.
Sims, C. A. (2012). Statistical modeling of monetary policy and its effects [Nobel Prize Lecture in Economics]. American Economic Review, 102:1187–1205.
Singh, K., Xie, M. and Strawderman, W. E. (2005). Combining information from independent sources through confidence distributions. Annals of Statistics, 33:159–183.
Singh, K., Xie, M. and Strawderman, W. E. (2007). Confidence distribution (CD) – distribution estimator of a parameter. In Complex Datasets and Inverse Problems: Tomography, Networks and Beyond, Vol. 33 of IMS Lecture Notes – Monograph Series, pp. 132–150.
Skeie, R. B., Berntsen, T., Aldrin, M., Holden, M. and Myhre, G. (2014). A lower and more constrained estimate of climate sensitivity using updated observations and detailed radiative forcing time series. Earth System Dynamics, 5:139–175.
Skrondal, A. and Rabe-Hesketh, S. (2004). Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models. Chapman & Hall/CRC, London.
Smith, R. L. (1999). Bayesian and frequentist approaches to parametric predictive inference. In Bernardo, J. M., Berger, J. O., Dawid, A. P. and Smith, A. F. M. (eds.), Bayesian Statistics 6, pp. 589–612. Oxford University Press, Oxford.
Smith, T. (1994). Scaling Fisheries: The Science of Measuring the Effects of Fishing, 1855–1955. Cambridge University Press, Cambridge.
Snow, C. P. (1959). The Two Cultures and the Scientific Revolution. Cambridge University Press, Cambridge.
Snow, C. P. (1963). The Two Cultures: A Second Look. Cambridge University Press, Cambridge.
Spiegelberg, W. (1901). Aegyptische und Griechische Eigennamen aus Mumienetiketten der Römischen Kaiserzeit. Greek Inscriptions, Cairo.
Spiegelhalter, D. J. (2001). Mortality and volume of cases in paediatric cardiac surgery: Retrospective study based on routinely collected data. British Medical Journal, 326:261.
Spiegelhalter, D. J. (2008). Understanding uncertainty. Annals of Family Medicine, 3:196–197.
Spiegelhalter, D. J., Aylin, P., Best, N. G., Evans, S. J. W. and Murray, G. D. (2002). Commissioned analysis of surgical performance using routine data: Lessons from the Bristol inquiry. Journal of the Royal Statistical Society, Series A, 165:191–221.
Spiegelhalter, D. J., Pearson, M. and Short, I. (2011). Visualizing uncertainty about the future. Science, 333:1393–1400.
Spock, B. (1946). The Common Sense Book of Baby and Child Care. Duell, Sloan and Pearce, New York City.
Stein, C. (1959). An example of wide discrepancy between fiducial and confidence intervals. Annals of Mathematical Statistics, 30:877–880.
Stigler, S. M. (1973). Studies in the history of probability and statistics, XXXII: Laplace, Fisher and the discovery of the concept of sufficiency. Biometrika, 60:439–445.
Stigler, S. M. (1974). Linear functions of order statistics with smooth weight functions. Annals of Statistics, 2:676–693.
Stigler, S. M. (1977). Do robust estimators work with real data? [with discussion and a rejoinder]. Annals of Statistics, 5:1055–1098.
Stigler, S. M. (1986a). The History of Statistics: The Measurement of Uncertainty Before 1900. Harvard University Press, Cambridge, MA.
Stigler, S. M. (1986b). Laplace’s 1774 memoir on inverse probability. Statistical Science, 1:359–363.


Stigler, S. M. (1986c). Memoir on the probability of the causes of events [translation of Laplace’s 1774 memoir]. Statistical Science, 1:364–378.
Stock, J. and Watson, M. (2012). Introduction to Econometrics: Global Edition. Pearson Education, Upper Saddle River, NJ.
Stolley, P. D. (1991). When genius errs: R. A. Fisher and the lung cancer controversy. American Journal of Epidemiology, 133:416–425.
Stone, M. (1969). The role of significance testing: Some data with a message. Biometrika, 56:485–493.
Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B, 64:479–498.
Stouffer, S. A., Suchman, E. A., DeVinney, L. C., Star, S. A. and Williams, R. M. Jr. (1949). Adjustment During Army Life. Princeton University Press, Princeton, NJ.
Strauss, D. J. (1975). A model for clustering. Biometrika, 63:467–475.
Student (1908). The probable error of a mean. Biometrika, 6:1–25.
Sundberg, R. (2010). Flat and multimodal likelihoods and model lack of fit in curved exponential families. Scandinavian Journal of Statistics, 37:632–643.
Sutton, A. J. and Higgins, J. P. T. (2008). Recent developments in meta-analysis. Statistics in Medicine, 27:625–650.
Sweeting, M. J., Sutton, A. J. and Lambert, P. C. (2004). What to add to nothing? Use and avoidance of continuity corrections in meta-analysis of sparse data. Statistics in Medicine, 23:1351–1375.
Taraldsen, G. and Lindqvist, B. H. (2013). Fiducial theory and optimal inference. Annals of Statistics, 41:323–341.
Teasdale, N., Bard, C., La Rue, J. and Fleury, M. (1993). On the cognitive penetrability of posture control. Experimental Aging Research, 19:1–13.
Thomson, A. and Randall-Maciver, R. (1905). Ancient Races of the Thebaid. Oxford University Press, Oxford.
Tian, L., Cai, T., Pfeffer, M. A., Piankov, N., Cremieux, P.-Y. and Wei, L. J. (2009). Exact and efficient inference procedure for meta-analysis and its application to the analysis of independent 2×2 tables with all available data but without artificial correction. Biostatistics, 10:275–281.
Tibshirani, R. (1989). Noninformative priors for one parameter of many. Biometrika, 76:604–608.
Tocquet, A. S. (2001). Likelihood based inference in non-linear regression models using the p∗ and r∗ approach. Scandinavian Journal of Statistics, 28:429–443.
Tukey, J. W. (1986). Sunset salvo. The American Statistician, 40:72–76.
van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.
Veronese, P. and Melilli, E. (2015). Fiducial and confidence distributions for real exponential families. Scandinavian Journal of Statistics, 42:471–484.
Viechtbauer, W. (2007). Confidence intervals for the amount of heterogeneity in meta-analysis. Statistics in Medicine, 26:37–52.
Voldner, N., Frøslie, K. F., Haakstad, L., Hoff, C. and Godang, K. (2008). Modifiable determinants of fetal macrosomia: Role of lifestyle-related factors. Acta Obstetricia et Gynecologica Scandinavica, 87:423–429.
Volz, A. G. (2008). A Soviet estimate of German tank production. The Journal of Slavic Military Studies, 21:588–590.
Wandler, D. V. and Hannig, J. (2012). A fiducial approach to multiple comparison. Journal of Statistical Planning and Inference, 142:878–895.
Wang, C. M., Hannig, J. and Iyer, H. K. (2012). Fiducial prediction intervals. Journal of Statistical Planning and Inference, 142:1980–1990.
Wellner, J. A. and van der Vaart, A. W. (1996). Weak Convergence of Empirical Processes. Springer-Verlag, Berlin.
White, H. (1994). Estimation, Inference and Specification Analysis. Cambridge University Press, Cambridge.
Wilkinson, R. G. and Pickett, K. (2009). The Spirit Level: Why More Equal Societies Almost Always Do Better. Allen Lane, London.


Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing composite hypotheses. Annals of Mathematical Statistics, 9:60–62.
Windham, M. P. (1995). Robustifying model fitting. Journal of the Royal Statistical Society, Series B, 57:599–609.
Working, H. and Hotelling, H. (1929). Application of the theory of error to the interpretation of trends. Journal of the American Statistical Association, 24:73–85.
Xie, M. and Singh, K. (2013). Confidence distribution, the frequentist distribution estimator of a parameter: A review [with discussion and a rejoinder]. International Statistical Review, 81:3–39.
Xie, M., Singh, K. and Strawderman, W. E. (2011). Confidence distributions and a unifying framework for meta-analysis. Journal of the American Statistical Association, 106:320–333.
Yang, G., Liu, D., Liu, R. Y., Xie, M. and Hoaglin, D. C. (2014). Efficient network meta-analysis: A confidence distribution approach. Statistical Methodology, 20:105–125.
Young, G. A. and Smith, R. L. (2005). Essentials of Statistical Inference. Cambridge University Press, Cambridge.
Yule, G. U. (1900). On the association of attributes in statistics: With illustrations from the material of the Childhood Society, &c. Philosophical Transactions of the Royal Society, Series A, 194:257–319.
Zabell, S. L. (1992). R. A. Fisher and the fiducial argument. Statistical Science, 7:369–387.
Zabell, S. L. (1995). Alan Turing and the central limit theorem. The American Mathematical Monthly, 102:483–494.
Zech, G. (1989). Upper limits in experiments with background or measurement errors. Nuclear Instruments and Methods in Physics Research, Series A, 277:608–610.


Name Index

Aalen, O. O., 123, 124, 126, 128, 143, 289, 294, 440
Abel, N. H., 84–86, 96, 97, 439
Aldrich, J., 20, 48, 156, 179, 186, 201
Aldrin, M., 356
Altman, D. G., 361
Anand, W., 286, 287, 293, 443, 444
Andersen, P. K., 123, 124, 128, 143, 292, 333, 439, 452, 464
Anderson, T. W., 129, 143, 225
Andre, C., 307
Armitage, P., 64
Aylin, P., 362
Baddeley, A. J., 143, 257–259, 265, 271
Bai, Z. D., 466
Ball, F. K., 192, 203
Banerjee, M., 27, 28
Bano, T., 249, 442
Baraff, L., 240, 442
Bard, C., 224, 441
Barlow, R. E., 237
Barnard, G. A., 426
Barndorff-Nielsen, O. E., 31, 35, 48, 49, 62, 154, 173, 193, 211, 220–222, 228, 229, 235, 264, 298, 356, 421, 464
Barry, D., 130, 149
Barth, E., 397
Bartholomew, D. J., 237
Bartlett, M. S., 188, 189, 196, 202, 229, 277
Basharin, G. P., 131, 143, 440, 464
Basu, A., 44–46, 49, 458, 460, 464, 469
Bayarri, M. J., 101, 142, 145
Bayes, T., 185
Behrens, W. U., 188, 202
Benjamini, Y., 87, 92
Beran, R. J., 49, 229, 280, 284, 292, 356
Berger, J. O., 21, 48, 49, 101, 142, 145, 167, 428
Berk, R., 434
Bernardo, J. M., 428
Bernstein, P. L., 4, 19
Berntsen, T., 356
Berry, G., 64
Best, N. G., 362
Bibby, J. M., 381
Bickel, P. J., 27, 30, 35, 235
Bickham, J. W., 145
Bie, O., 290
Billingsley, P., 27, 131, 143, 288, 289, 292, 333, 464
Birnbaum, A., xiv, 7, 31, 66
Bityukov, S., 92
Bjørnsen, K., 391
Blackwell, D., 154, 165–170, 180, 183, 434
Blaisdell, B. E., 129
Boitsov, V. D., 136
Bolt, U., 211–214, 231, 441
Boole, G., 201
Borel, F. E. J. E., 200, 203, 312, 429
Borenstein, M., 361, 379
Borgan, Ø., xx, 123, 124, 126, 128, 143, 290, 292, 333, 439, 440, 464
Bowen, J. P., 410
Box, G. E. P., 43, 50, 312, 428
Brandon, J. R., 50, 303, 362, 364
Brazzale, A. R., 31, 48, 211, 220, 222, 229, 231, 250, 251, 271, 435, 443
Breiman, L., 21, 434
Bremner, J. M., 237
Breslow, N. E., 264, 405
Breuer, P. T., 410
Brillinger, D. R., xx, 143, 190, 191, 194, 197, 199, 203
Britton, T., 192, 203, 432
Browman, H. I., 136
Brown, L. D., 264, 434
Brown, T., 452
Brunk, H. D., 237
Buja, A., 434
Burt, M. L., 315, 316, 444
Cai, T., 401, 402, 405
Campbell, D. T., 60
Carlin, J. B., 49, 384, 385, 412, 445
Carlsen, M., 286, 287, 293, 443, 444
Carpenter, J., 375
Casella, G., 166, 179
Chebyshov, P. L., 464
Cheng, X., 356
Claeskens, G., 12, 24, 28, 38, 47–50, 71, 81, 103, 104, 126, 128, 142, 149, 237, 248, 265, 284, 372, 396, 423, 434, 437, 438, 440, 442, 443, 454, 456, 463, 464

Clayton, D. G., 264
Collett, D., 292
Cook, T. D., 60
Cornish, E. A., 228
Cox, D. R., xiii, xx, 20, 21, 31, 35, 47–49, 56, 59, 62, 91, 92, 154, 173, 193, 211, 220–222, 228, 229, 235, 250, 251, 298, 312, 356, 415, 435, 443, 464
Cramér, H., 25, 28, 50, 51, 464
Creasy, M. A., 117, 143
Cremieux, P.-Y., 401, 402, 405
Cressie, N., 143, 265, 352
Cunen, C. M. L., xx, 408
da Silva, C. Q., 240, 442
Daly, F., 441
Darmois, G., 180, 264
Darroch, J. N., 240
Darwin, C. R., 32
David, H. A., 320, 333
Davies, L., 49
Davis, S., 262, 393, 394
Davison, A. C., 31, 47, 48, 202, 211, 218, 220, 222, 228, 229, 231, 250, 251, 271, 435, 443
De Blasi, P., 215, 216, 229
De Leeuw, J., 237
DeBlasi, P., xx, 128
Dempster, A. P., 190, 197, 199, 424
Dennis, J. E., 265
Deportes, G., 315, 316, 444
Devine, J. A., 136, 440
DeVinney, L. C., 375
DiCiccio, T. J., 228
Dickson, E. R., 443
Diggle, P., 256
Doksum, K. A., 27, 30, 35, 235
Doog, K. J., 3
Doughty, D., 153
Draper, N. R., 43, 50
Dufour, J. M., 117
Dyrrdal, A. V., 135, 440
Eddington, A. S., 156, 181
Edgeworth, F. Y., 48, 228
Edwards, A. W. F., 62, 426, 427, 436
Efron, B., xiii, xiv, xx, 4, 20–22, 29, 48, 56, 61, 92, 145, 180, 201, 202, 217, 218, 228, 248, 283, 292, 298, 312, 418, 421, 433–436
Einmahl, J. H. J., 229
Einstein, A., 50
Elstad, M., 313, 314
Elvik, R., 361
Embrechts, P., 212
Ericsson, N. R., 117
Espeland, S. H., 307
Evans, S. J. W., 362
Ezekiel, M., 201
Fahrmeir, L., 264
Fan, S.-H., 418
Faust, J., xix
Feigl, P., 250, 443
Feller, W., 5, 21
Felsenstein, J., 129, 440
Fenstad, G. U., 464
Ferguson, T. S., 167, 464
Ferkingstad, E., 90, 92, 99
Fermat, P. de, 4, 19
Fieller, E. C., 117, 143
Fine, T. L., 425
Fischer, H., 464
Fisher, R. A., xiii–xv, xviii, xix, 1–8, 14–17, 20–22, 25, 29, 31, 33, 48, 50, 51, 55, 56, 61, 64, 92, 93, 108, 109, 143, 145, 146, 150, 151, 154, 156, 164–167, 170, 174, 179, 181, 182, 185–191, 193–202, 228, 274, 277, 282, 292, 295, 297, 301, 307, 312, 315, 356, 375, 380, 420, 421, 430, 436
Fisher, R. A. F., 423
Fleming, T. R., 292, 443
Fleury, M., 224, 441
Fraser, D. A. S., xiii, xiv, 4, 18, 20, 21, 56, 92, 157, 159, 179, 195, 202, 428, 432
Fredette, M., 356
Friesinger, A., 394
Frigessi, A., xx, 87, 92
Frøslie, K. F., xx, 321, 444
Galton, F., 20, 78–80, 96, 438, 464
Gauss, C. F., 2, 7, 17, 20, 48, 436
Gay, T., 214
Gelfand, A., xx
Gelman, A., 49, 384, 385, 412, 445
George, J. C., 145
Gibbs, J. W., 256
Gilbert, R., 379
Gilks, W. R., 49, 245, 258
Gill, R. D., xx, 123, 124, 128, 143, 292, 333, 439, 452, 464
Ginebra, J., 99
Gips, C. H., 443
Giron, J., 99
Givens, G. H., xiii, 145, 312
Gjessing, H. K., 123, 124, 126, 128, 292, 333, 440
Glad, I. K., xx, 265, 344
Godang, K., 321, 444
Goethe, J. W. von, xix
Goldstein, H., 142
Goldstein, H. E., xx
Good, I. J., 3, 21
Goodman, L. A., 92, 129, 143
Gould, S. J., 49, 127
Grambsch, P. M., 443
Graunt, J., 19


Green, P. J., 49
Grishin, E., 392
Groothuis, S., 392–395
Grønneberg, S., xx
Gujarati, X., 347, 444
Gunnlaugsson, T., 315, 316, 444
Guttman, L., 168
Haakstad, L., 321, 444
Haavelmo, T., 389
Hacking, I., 4, 19, 61, 196, 197, 426, 427
Hald, A., xviii, 7, 20, 48, 295, 436
Hall, P., 228
Hampel, F. R., 3, 16, 19, 22, 49, 61, 194
Hand, D. J., 441
Hannig, J., 3, 20, 62, 186, 197–199, 202, 356, 425, 432
Hansen, B. E., 356
Harden, M., 379
Harding, E. F., 197
Härdle, W. K., 292
Harrington, D. P., 292, 443
Harris, I. R., 44–46, 49, 458, 460, 464, 469
Harris, R. R., 197
Hartigan, J. A., 130, 149
Hary, A., 211, 212
Hastings, W. K., 258
Hattestad, O. V., 254–256, 272, 273
Hedges, L. V., 361, 379
Hedley, S. L., 315, 316, 444
Heger, A., 135
Heijmans, J., xx
Helland, I. S., 48, 464
Hellton, K. H., xx
Hermansen, G. H., xx, 137, 143, 441
Higgins, J. P. T., 360, 361, 379
Hinkley, D. V., 29, 145, 218, 228
Hjermann, D. Ø., 136
Hjort, J., 99, 136–138, 143, 440
Hjort, N. L., xiii, 12, 21, 24, 27, 28, 38, 42, 44–50, 71, 81, 87, 92, 99, 103–106, 117, 124–126, 128–131, 137, 138, 142, 143, 149, 150, 180, 200, 202, 228, 237, 248, 256, 265, 280, 290, 292, 293, 308, 312, 323, 326, 328, 330, 333, 344, 372, 392–394, 396, 408, 409, 411–413, 423, 428, 429, 434, 437, 438, 440–443, 452–454, 456, 458, 460, 463, 464, 466, 467, 469
Hoaglin, D. C., 380
Hobolth, A., 129, 440
Hochberg, Y., 87, 92
Hodges, J., 157, 161, 325, 326, 335
Hoeting, J. A., 50
Hoff, C., 321, 444
Holden, M., 356
Hollings, X., 439
Holmes, C., 42, 47, 49
Holum, D., 394
Hornik, K., 237
Hosmer, D. W., 38, 437
Hotelling, H., 277
Houde, E. D., 136
Høyland, A., 292
Huber, P. J., 49
Huebinger, R. M., 145
Hunter, D., 145
Huygens, C., 5, 19
Ianelli, J. N., 312, 429
Ioannidis, J. P. A., 379
Ising, E., 256
Iyer, H. K., 202, 356
Jansen, D., 394
Jansen, E. S., 117
Jeffreys, H., 17, 423
Jensen, J. L., 129, 229, 440
Johansen, N. V., xx, 439
Johansen, S., 264
Johnston, C. H., 132, 440
Jones, M. C., 44–46, 49, 265, 458, 460, 464, 469
Jordan, S. M., 192
Jorde, P. E., 145, 307
Joseph II, B. A. J., 50
Joshi, V. M., 180
Joyce, P., 180
Jullum, M., xx, 48, 49, 464
Jøssang, A., 425
Kagan, J., 21
Kahneman, D., 19
Kalbfleisch, J. D., 126, 192, 200, 440
Kaplan, E. L., 289
Kardaun, O. J. W. F., 21
Karlin, S., 49, 128
Karsakov, A. L., 136, 441
Kass, R., 18
Kato, H., 249, 442
Kavvoura, F. K., 379
Keiding, N., 123, 124, 128, 143, 292, 333, 439, 464
Kelly, F. P., 257
Kennedy, P. E., 361, 362, 379
Kent, J. T., 381
Kerbesian, N. A., 117
Keynes, J. M., 424
Kim, J., 333
Kimura, M., 129, 440
Kjesbu, O. S., 136, 137, 440, 441
Klüppelberg, C., 212
Klemp, M., 414, 415
Knutsen, H., 307
Kohler, R. E., 124
Kolmogorov, A. N., 5, 21, 430, 464
Koning, A. J., 99, 138
Konishi, K., 249, 442


Koopman, B., 180, 264
Korsbrekke, K., 136, 440
Koschat, M. A., 117, 143
Koski, W., 83, 240, 312, 364, 387, 388, 412, 442
Kramer, S., 262
Krasnikov, N., 92
Krishnamoorthy, K., 192
La Rue, J., 224, 441
Lake, J., 240, 442
Lambert, P. C., 375, 405, 420
Lancaster, T., 142
Langaas, M., 90, 92, 99
Langville, A. N., 131, 143, 440, 464
Langworthy, A. L., 443
Laplace, P. S., 2, 7, 14, 17, 20, 295, 346, 356, 436
Lauritzen, S., xx
Lawless, J. F., 330, 356
Laws, R. M., 248
LeCam, L., 29, 180
Lee, T. C. M., 186, 198
Lee, Y., 265
Lehmann, E. L., xvii, 24, 25, 27, 29, 61, 154, 157, 158, 160, 161, 166, 173, 174, 179, 201, 325, 326, 335, 427, 441, 464
LeMay Doan, C., 394
Lemeshow, S., 38, 437
Lerudjordet, M., 229
Liang, H., 50
Liapunov, A. M., 464
Liestøl, K., 290
Lindeberg, J. W., 464
Lindley, D. V., 18, 193, 196
Lindqvist, B. H., xx, 90, 92, 99, 202, 239, 250
Lindsay, M., 145
Linnik, Y. V., 188
Liu, D., 379, 380
Liu, R. Y., 379, 380
Louis, T., 49
Lunn, A. D., 441
Madigan, D., 50, 240, 442
Magnus, J. R., 229
Mair, P., 237
Malinchoc, M., 443
Mandelkern, M., 102, 103, 142
Manley, G., 121, 439
Mardia, K. V., 381
Marin, J.-M., 180
Marjoram, P., 180
Markov, A. A., 131, 132, 143, 440, 464
Marron, J. S., 323
Marshall, E. C., 361, 365, 444
Maskell, E. J., 186
Matessi, C., 49
Matson, C. W., 145
Mayo, D., 48, 92
McCloskey, R., 110, 439
McConway, K. J., 441
McCullagh, P., 24, 246, 264
McKeague, I. W., xx, 27, 28, 326, 328, 330, 333
Meier, P., 289
Melilli, E., 202
Melville, H., 20
Mikosch, T., 212
Miller, G., 240, 442
Milyukov, V., 418
Moene, K. O., 397
Møller, J., 256, 265
Mogstad, E., 265, 271
Moher, D., 361
Morris, C. M., 180
Mowinckel, P., xx, 442
Moyeed, R. A., 257–259
Mozart, W. A., 50
Murray, G. D., 362
Murtaugh, P. A., 443
Myhre, G., 356
Müller, P., 42, 47, 49
Nadarajah, S., 92
Nagaraja, H. N., 320, 333
Nair, V., 280, 290
Narum, S., 414, 415
Naumov, V. A., 131, 143, 440, 464
Nelder, J. A., 246, 264, 265
Nelson, J. P., 361, 362, 379
Nelson, W., 289, 294
Newcomb, S., 459
Neyman, J., xiii, xiv, 3, 5, 7, 10, 16, 20, 22, 61, 64, 92, 112, 142, 154, 165–168, 170–172, 174, 180, 185–187, 192, 201, 202, 274, 276, 363, 418, 419, 436
Niemiro, W., 452
Nissen, S. E., 401, 402, 405
Nolan, D., 412
Norberg, R., 171
Nordmand, S.-L. T., 414
Northug, P. Jr., 256, 273
Nymoen, R., 117
O’Neill, P. C., 192, 203, 432
Oeppen, J., 300, 347, 358, 444
Oja, H., 325
Olkin, I., 375
Olsen, E. M., 307
Omre, H., 143, 256
Onegin, E., 131–133, 143, 440
Opdal, A. F., 136, 440
Ostrowski, E., 441
Ottersen, G., 136
Otterspeer, H., 393
Owen, A., 292, 325, 326, 328, 333


Paccioli, L., 4
Park, C., 49, 464
Parmar, M. K. B., 363
Parzen, E., xiv, 92
Pascal, B., 4, 5, 19
Patterson, P., 202
Patton, J. C., 145
Pawitan, Y., 31, 48, 166, 218, 265
Paxton, C. G. M., 315, 316, 444
Pearson, E., 154, 165, 168, 170–172, 174, 180
Pearson, K., 12, 71, 294, 379, 437
Pearson, M., 19
Pedersen, J. G., 189, 202, 430
Pedersen, S. L., 74, 284
Peplow, M., 379
Petrone, S., 323, 333
Pfeffer, M. A., 401, 402, 405
Piankov, N., 401, 402, 405
Pickett, K., 397
Pitman, E. J. G., 21, 180, 193–195, 201, 202, 264, 274
Poisson, S. D., 416, 464
Pollard, D. B., 27, 28, 333, 452–454, 464, 467
Poole, D., 302, 312, 429
Pope, S., 425
Postma, L. D., 145
Pólya, G., 464
Potts, R., 256
Powell, A., 211
Prentice, R. L., 126, 440
Pudlo, P., 180
Pushkin, A. S., 131–133, 143, 440
Qin, J., 330
Quenouille, M. H., 190
Quetelet, A., 73
Rücker, G., 375
Rabe-Hesketh, S., 142, 264
Raftery, A. E., xiii, 50, 143, 302, 312, 429
Randall-Maciver, R., 81, 438
Rao, C. R., 25, 28, 50, 51, 154, 165–170, 180, 183, 434
Rausand, M., 292
Rebolledo, R., 464
Reeds, J. A., 145
Reichelt, Y., xx, 439
Reid, C., 180
Reid, N., 31, 48, 211, 220, 222, 229, 231, 250, 251, 271, 435, 443
Reiss, R.-D., 180
Riba, A., 99
Richardson, S., 49, 245, 258
Ripley, B. D., 143, 256, 257, 265, 352
Robert, C. P., xiv, 92, 180
Robinson, M. E., 229
Rockafellar, R. T., 452
Rodgers, J. L., 153
Romano, J. P., 27, 157, 158, 166, 179, 292, 294
Ronchetti, E., 49
Rosa, D., 394
Rothstein, H., 361, 379
Rousseeuw, P. J., 49
Royall, R. M., 62, 425, 427
Rubak, E., 143, 265, 271
Rubin, D. B., 49, 384, 385, 412, 445
Rugh, D., 83, 240, 312, 364, 387, 388, 412, 442
Ryder, R. J., 180
Sadykova, D., 83, 312, 364, 387, 388, 412
Salanti, G., 379
Salome, D., 21
Sandbakk, Ø., xx
Savage, L. J., 188, 190, 201, 436
Schaafsma, W., 21
Scheffé, H., 109, 110, 142, 202, 277, 278, 439
Schnabel, R. B., 265
Schulz, F., 361
Schwarzer, G., 375
Schweder, T., xiii, 21, 83, 87, 89, 92, 117, 143, 145, 171, 180, 200, 202, 215, 216, 228, 229, 240, 241, 280, 292, 308, 312, 325, 364, 387, 388, 409, 411–413, 428, 429, 439, 442
Scott, E. L., 112, 142
See, S., 379
Serfling, R. J., 464
Shafer, G., 64, 425
Shakespeare, W., 304
Sharma, S., 197
Sheather, S. J., 323
Shioya, H., 49, 464
Shmueli, G., 21
Sholokhov, M. A., 143
Short, I., 19
Shumway, R. H., 143
Simar, L., 292
Simpson, R. J. S., 379
Sims, C. A., xx, 16, 102, 142, 383, 389, 412, 445
Singh, K., xiii, 3, 21, 92, 202, 276, 292, 368, 370, 375–378, 421, 425, 433
Skeie, R. B., 356
Skjæraasen, J. E., xx, 136, 440
Skrondal, A., 142, 264
Smith, R. L., 27, 35, 48, 173, 235, 356
Smith, T., 136
Snell, E. J., 250, 251, 415, 443
Snow, C. P., 21
Solberg, S., 270
Solzhenitsyn, A. I., 143
Spiegelberg, W., 12, 437
Spiegelhalter, D. J., xx, 19, 49, 245, 258, 361, 362, 365, 377, 444
Spjøtvoll, E., 87, 89, 92, 439
Spock, B., 379
Sprott, D. A., 192, 200
Star, S. A., 375


Steerneman, A. G. M., 21
Stein, C., 18, 191, 196
Stenseth, N. C., 136, 145, 307
Stern, H. S., 49, 384, 385, 412, 445
Stewart, L., 363
Stigler, S. M., xx, 20, 156, 165, 179, 333, 346, 356, 459
Stock, J., 247
Stolley, P. D., 380
Stone, M., 62, 64, 92
Storey, J. D., 92
Storvik, G., xx
Stouffer, S. A., 375
Strauss, D. J., 256–259, 265, 271
Strawderman, W. E., 21, 276, 292, 368, 370, 375–378, 421, 433
Student (W. Gosset), 15, 32, 92, 186
Suchman, E. A., 375
Sun, D., 428
Sundberg, R., 142
Sutton, A. J., 360, 361, 375, 379, 405, 420
Suydam, R. S., 145
Sweeting, M. J., 375, 405, 420
Sáblíková, M., 74, 284
Tamura, T., 249, 442
Taraldsen, G., xx, 202, 239, 250
Tawn, J. A., 229
Taylor, H. M., 128
Teasdale, N., 224, 441
Thomson, A., 81, 438
Thoresen, M., 313, 314
Tian, L., 401, 402, 405
Tiao, G. C., 428
Tibshirani, R. J., 218, 228, 428
Tocquet, A. S., 228
Torri, V., 363
Triggs, X., 439
Trofimov, A. G., 136, 441
Tukey, J. W., xviii
Turing, A., 464
Turner, R., 143, 257, 265, 271
Tutz, G., 264
Tversky, A., 19
Ulvang, V., 256, 443
Ushakov, N. G., 344
Van Dam, G. M., 443
van der Vaart, A., 27, 464
Van Keilegom, I., xx, 284, 326, 328, 330, 333
Varin, C., 129–131, 143, 150
Vaupel, J. W., 300, 347, 358, 444
Veronese, P., 202
Viechtbauer, W., 368
Vikhamar-Scholer, D. V., 135, 440
Vikingsson, G., 315, 316, 444
Voldner, N., xx, 321, 324, 444
Volinsky, C. T., 50
Volz, A. G., 92
von Bortkiewicz, L., 416
von Mises, R., 464
Waagepetersen, R., 256, 265
Wade, P. R., 50, 303, 362, 364
Wald, A., xiv, 5, 154, 167, 183, 394
Walker, S. G., 42, 47, 49
Walker, S.-E., xx
Walløe, L., xx, 249, 442
Wan, A. T. K., 50
Wandler, D. V., 198
Wang, C. M., 356
Watson, M., 247
Wedderburn, W. M., 246, 264
Wei, L. J., 401, 402, 405
Wellner, J., 27
Westergren, T., 414, 415
White, H., 49
Whitelaw, A., 313, 314
Wilkinson, R. G., 397
Wilks, S. S., 48, 292
Willems, J. C., 21
Williams, R. M. Jr., 375
Windham, M. P., 49
Wold, H., 464
Wolf, M., 292, 294
Wolpert, R., xiii, 48, 49
Wolski, K., 401, 402, 405
Wood, T. A., 228
Working, W., 277
Wotherspoon, J., 394
Xie, M.-g., xiii, xx, 3, 20, 21, 92, 202, 276, 292, 368, 370, 375–380, 421, 425, 433
Yang, G., 380
Yang, G. L., 29
Young, G. A., 27, 35, 48, 173, 235
Yule, G. U., 79, 96
Zabell, S. L., 16, 20, 21, 201, 464
Zadeh, L., 425
Zech, G., 144
Zeh, J. E., xiii, 240, 312, 442
Zelen, M., 250, 443
Zenitani, R., 249, 442
Zhang, K., 434
Zhang, X., 50
Zhao, L., 434
Zou, G., 50


Subject Index

Aalen model, 124Abel envelopes, 84–86, 96, 97, 439Adelskalenderen, 262, 443AIC, Akaike’s information criterion, 47, 132, 137, 149,

248, 250, 251, 253, 254, 261, 263, 270–272, 349,385, 387, 413, 422, 435, 437, 463

All Black, 95, 96, 439
analysis of variance, 89, 107, 435, 439
Anand, Wiswanathan, 286, 287, 293, 443, 444
ancillary statistic, 8, 31, 191, 301, 420
average confidence density, 138, 139

babies
  big, 321–324, 329, 333, 444
  correlation, 227, 231, 377, 378, 441, 442
  crying, 405, 415, 416
  overdispersed, 151–153
  oxygen supply for, 313, 314
  sleeping, 379
  small, 38–40, 52, 94, 95, 105, 267, 268, 321, 437, 438

balancing act, 48, 224, 304, 441
Bartlett
  correction, 42, 211, 213, 224, 225, 227, 229, 231, 232, 421
  identity, 25, 27, 51, 247
Behrens–Fisher problem, 188, 202
Bernoulli model, 31, 198
Bernshteĭn–von Mises theorem, 41, 42, 128, 343, 428
beta model, 238, 239, 254, 468
Beta-binomial model, 152, 374, 375, 377, 378, 387, 413
BHHJ method (Basu, Harris, Hjort, Jones), 44–46, 49, 458–461, 464, 469
BIC, Bayesian information criterion, 47, 248, 261, 422, 463
binomial model, 26, 31, 51, 54, 63–65, 83, 89, 107, 144, 148, 174, 234, 235, 247, 248, 265, 266, 307, 313, 314, 318, 345, 346, 375, 384, 401, 405, 406, 413–416, 444, 445

birds, 150–152, 272, 441
Bjørnholt, skiing days at, 135, 136, 150, 355, 440
blunder, Fisher’s, xiii, xv, xix, 1, 3, 4, 16, 20–22, 92, 185, 190, 201, 380, 419
body mass index (BMI), 73–76, 93, 284, 285, 324–326, 333, 335, 438

Bolt, 211–214, 231, 441
Bonferroni adjustment, 281, 291, 293
bootstrapping
  abc bootstrapping, 204, 217, 218, 227, 228, 429
  AIC, 463
  confidence, 283, 305, 433
  likelihood, 305
  nonparametric, 94, 249, 283, 335, 399
  parametric, 60, 150, 219, 275, 341, 373, 382, 389, 411
  prepivoting, 229
  t-bootstrapping, 77, 204, 217, 224, 225, 227, 228, 230, 248, 334, 368, 377, 395, 429
Borel paradox, xiii, 200, 203, 312, 429
British hospitals, 360–362, 365, 366, 369–372, 374, 376–378, 444
brothers, 83, 98, 151, 152, 357, 438

carcinoma, 126, 350, 440
Carlsen, Magnus, 286, 287, 293, 443, 444
Cauchy model, 116, 117, 145, 339, 449
cod, 136, 143, 307, 310, 440
combining information, see meta-analysis
completeness, 174, 240, 267
confidence bands, 284–290, 292–294, 354, 400, 421, 437
confidence conversion, xv, 17, 55, 218, 292, 295–299, 301, 305–308, 310–313, 364, 376, 430
confidence curve, see also confidence distribution
  concept, 12, 14, 115–117
  construction, 13, 14, 67, 78, 103, 116, 156, 210, 267, 269, 279, 297, 298, 386
  definition, 66, 70, 115, 276
  for meta-analysis, 364, 368, 371, 377, 378, 396
  for prediction, 337, 339, 340, 342
  for quantiles, 321, 322, 324
  graphical summary, xiv, 121, 125, 175, 222, 364, 420
  illustration, 13, 67, 68, 72, 96, 112, 114, 116, 126, 127, 132, 136, 138, 151, 155, 156, 161, 213, 214, 237, 244, 252, 256, 259, 261, 284, 285, 298, 309, 378, 386, 388, 438
  multivariate, 275–277, 279, 280, 292, 431
  of a prior, 304, 305


  product confidence curve, 275, 278, 280–282, 286, 292, 293, 399, 400

confidence density, 55, 59, 60, 62–67, 69, 73, 75–77, 92, 93, 97, 98, 146, 147, 169, 205, 210, 217–219, 227, 241, 267, 274–277, 292, 298, 301, 302, 318, 319, 337, 340, 344, 408, 429, 432, 437

confidence distribution
  approximate, xiv, 18, 19, 62, 69, 70, 89, 93, 94, 139, 146, 150, 185, 190, 198, 202, 204, 205, 209, 217, 267, 269, 272, 318, 343, 420
  compelling summaries, xix, 17, 55, 59, 65, 78, 143, 168, 263, 387, 389, 390, 395, 404, 407, 418, 420, 428
  concept, xiv, 4, 20, 21, 23, 32, 55, 56, 61, 62, 91, 100, 115, 138, 142, 165, 170
  construction, xv, 4, 21, 59–61, 63, 69, 72, 73, 75, 76, 80, 93, 94, 96, 97, 100–103, 105, 107, 108, 111, 118, 124, 128, 133, 140, 142–146, 150, 151, 155, 158, 160, 163, 165, 204, 212, 233, 260, 262, 272, 273, 276, 317
  discrete, 83–85, 87–89, 91, 92, 97, 98
  epistemic, xiv, xv, 17, 91, 419, 430
  fiducial, xiv, 3, 16, 17, 55, 61, 160, 185, 187, 197, 199–201, 419
  for meta-analysis, xix, 20, 295, 296, 367, 368, 370, 374–377, 381, 382, 421, 433
  from empirical likelihood, 325, 332, 334, 335
  from intervals, xiii, 9, 14, 22, 57, 70
  from likelihoods, xv, 14, 17, 46, 55, 64, 109, 126, 128, 131, 206, 228, 261, 421, 427
  from pivots, 55, 59, 72, 94, 117, 157, 267, 301, 429
  gold standard, xvi, 4, 223, 225
  illustration, 9, 18, 19, 66, 67, 74–76, 79–83, 87–89, 91, 98, 101, 102, 105, 106, 137, 139, 140, 145, 147, 153, 160, 161, 203, 239, 240, 252, 261, 263, 267, 269, 320, 323, 370
  in multiple dimension, xv, 69, 200, 274–277, 280, 292, 431
  loss and risk, xvii, 56, 69, 138, 140, 155, 157, 161–166, 168, 170–173, 180, 181, 183, 357
  more precise, xv, 43, 204, 205, 209, 213, 217, 219, 227–230
  optimality, 4, 21, 56, 60–62, 69, 103, 112, 115, 139, 154, 155, 157–159, 162, 165, 167–179, 204, 215, 219, 231, 232, 235–239, 243, 244, 250, 251, 253, 254, 259, 260, 265, 267–269, 271, 364, 403, 407, 408, 420
  posteriors without priors, xiv, xvi, 3, 4, 17, 41, 418, 428
  predictive, 117, 336–341, 343, 345, 347, 349–351, 353, 355, 357, 358
  property, 10, 18, 58, 60, 61, 63, 68, 92, 93, 124, 157, 170, 419
  robust, 49
  statistical toolbox, xvi, xvii, xix, 4, 23, 383, 411, 435

CONSORT, CONsolidated Standards of Reporting Trials, 361, 363

correlation coefficient, 209, 224, 227, 230, 231, 282

correlation, intraclass, 107, 108, 111, 112, 146, 393, 396, 413, 414

Cramér–Rao bound, 25, 28, 50, 51
cross-country skiing, 135, 136, 150, 254–256, 272, 273, 440

definitions
  ancillary, 31
  BHHJ estimator, 45, 458
  BHHJ estimator, regression, 461
  characteristic function, 449
  confidence curve, 66, 70, 115
  confidence density, 65
  confidence distribution, 58
  confidence loss, 162
  confidence risk, 162
  convergence in distribution, 448, 449
  convergence in probability, 447
  delta method, 33, 451
  deviance, 35, 278
  empirical likelihood, 326, 331
  exponential family, 173, 264
  Fisher’s information matrix, 25
  generalised linear model (GLM), 246
  hazard rate function, 123
  i.i.d. (independent and identically distributed), 25
  invariant confidence, 158
  likelihood function, 24
  log-likelihood function, 24
  maximum likelihood estimator, 24
  median confidence estimator, 66
  normal conversion, 298
  odds ratio, 236, 401
  optimal confidence, 172
  partial ordering of confidence distributions, 168
  pivot, 33
  posterior distribution, 40
  predictive confidence distribution, 337
  product confidence curve, 280
  profile log-likelihood, 34
  sandwich matrix, 43
  score function, 25
  skiing day, 135
  sufficiency, 30, 166

delta method, 33, 42, 120–122, 147–150, 181, 204, 205, 208, 209, 217, 227, 228, 375, 447, 449, 451, 464, 465

DNA (deoxyribonucleic acid), 129, 131, 440
doughnuts, 109–112, 146, 439
Drosophila flies, 124, 143, 145, 439

effective population size, 307–309
empirical likelihood, 317, 325–335
estimand minus estimator, 70, 217
exponential family, xv, xvii, 56, 173, 174, 204, 208, 215, 216, 233, 235, 236, 246, 247, 252, 253, 264–266, 271, 298, 379, 420, 430, 455


exponential model, 12, 52, 53, 57, 208, 210, 211, 229, 299, 307, 314, 345, 468

exponential tilting, 238, 254, 265, 271
extreme value model, 211–214, 231

family-wise error rate, 292, 294
FIC, focussed information criterion, 47–49, 137, 248, 261, 263, 271, 396, 422, 423, 435, 463, 464
fiducial, see also confidence distribution, fiducial
  argument, xiii, xv, 1, 3, 15, 20, 56, 185, 186
  blunder, xiii, xvii, 16, 20, 92, 160, 186–188, 190, 195, 419
  epistemic probability, 2, 6, 187
  generalised, 62, 63, 159, 179, 186, 195, 197–199
  inference, 4
  jewel in crown, 200, 436
  posteriors without priors, 56, 185
  predictive, 188, 356

fiducialis, fiduci, fidere, 201
Fieller problem, 100, 117–119, 121, 143, 147, 192, 399, 420, 431
Fisher information, 25, 50, 51, 129, 134, 145, 146, 150, 151, 206, 220, 231, 253, 266, 455, 461
Fisher’s step-by-step, 189–191, 196, 197, 199, 202, 203, 369, 431
Fisherian, xiv, xvi, xvii, 3, 17–19, 40, 154, 296, 302, 304, 305, 422, 423, 427, 428
football, 269, 270, 442
Federation Internationale de Ski (FIS), 256, 443

gamma model, 9, 176, 290, 468
gamma process crossings regression, 128, 149
gamma regression, 247, 249–252, 260, 263, 264, 268, 271, 438, 443
gastrointestinal bleeding, 414, 415
general regression, 28, 41, 44, 45, 49, 186, 198, 209, 217, 247, 248, 330, 350, 376, 450, 456
generalised linear models (GLM), xv, xvii, 35, 56, 154, 173, 174, 233–235, 246–248, 252, 264, 265, 435, 455
generalised linear-linear models (GLLM), 233, 252, 260, 261, 263, 265, 272

German tanks, 84, 85, 92, 97
Gini index, 383, 397, 413
golf, 383–387, 412, 413, 445
Gompertz model, 12, 14, 71, 437
Google Scholar, 383, 409–413, 416, 445
growing older, 300, 347, 358, 444

half-correction, 32, 62, 63, 69, 83, 84, 89, 92, 93, 102, 144, 174, 199, 210, 240, 257, 301, 313, 345, 420, 432

handball, bivariate, 241–246, 268–270, 442
Hardy–Weinberg equilibrium, 144, 145

hazards regression, 128, 149, 251, 333, 440
health assessment questionnaire (HAQ), see modified health assessment questionnaire (MHAQ)
heteroscedastic regression, 261–263, 265, 272, 443
Hjort index, 99, 136–138, 143, 440
Hodges–Lehmann estimator, 157, 161, 325, 326, 335
hypergeometric model, 84, 240

income, Norwegian, 275, 291, 292, 383, 396–401, 413, 445

Intergovernmental Panel on Climate Change (IPCC), 1, 2
International Olympic Committee (IOC), 392
International Skating Union (ISU), 392, 445
International Whaling Commission (IWC), xiii, 442
invariance, 24, 50, 68, 92, 154, 157–161, 169, 179, 182, 275, 283
inverse regression, 120, 121, 147, 148, 439
Ising model, 233, 256, 259–261, 420
isotonic regression, 237, 239, 260

JHHB, see BHHJ

Kaplan–Meier estimator, 289
Kimura model, 129–131, 149, 150, 440
Kola, 136–138, 441
kriging, 336, 337, 350, 352
Kullback–Leibler distance, 23, 44, 46, 49, 330, 417, 454–456, 458, 462, 463, 468, 469
Kullback–Leibler distance, weighted, 44, 46, 456, 458, 461, 462, 469

least false parameter, 23, 43, 45, 46, 49, 454–456, 460–462, 468, 469

length problem, 18, 19, 191, 196, 202, 279, 357, 428, 430, 431

leukaemia, 250–252, 270, 271, 443
lidocaine, 414
life lengths
  becoming longer, 300, 347, 358, 444
  in Roman era Egypt, 12–14, 71, 72, 94, 149, 290, 294, 298, 299, 318, 319, 334, 341, 342, 437
likelihood principle, 7, 8, 17, 30–32, 48, 62, 92, 170
likelihood synthesis, 312
linear regression, 20, 29, 30, 62, 72, 75–77, 94, 121, 148, 179, 203, 209, 262, 267, 291, 300, 325, 346, 347, 349, 442, 444, 467

location model, 169
log-likelihood, weighted, 44, 46, 456, 458, 461, 462, 469
logistic regression, xviii, 38, 39, 42, 52, 94, 148, 234, 246–248, 377, 383–388, 412, 413, 438, 445, 467
loss and risk functions, see confidence distribution, loss and risk

magic formula, Barndorff-Nielsen’s, 219–221, 229, 364
Markov chain, 130–133, 143, 149


Markov chain Monte Carlo (MCMC), 16–18, 128, 244–246, 257, 258, 260, 266

mating
  fruit flies, 122, 124–126, 143, 145, 439
  humans, 79, 80, 96, 151, 152, 438

mean-to-variance ratio, 223–225, 271, 441
median
  confidence for, 320, 322, 324
  population, 147, 160, 161, 182, 320–322, 362, 468
  sample, 52, 139, 140, 147, 157, 468

median confidence estimator, 18, 57, 58, 66–69, 81, 86, 93, 105, 106, 114, 115, 145, 224, 260, 261, 337, 376, 415, 420

median lethal dose parameter, LD-50, 122
median survival time, 94, 126–128, 149, 440
median-bias correction, 214–217, 229, 231, 278, 279, 313, 314
meta-analysis, xvii, xix, 21, 81, 142, 190, 295, 296, 305, 311, 360–364, 374–376, 379, 380, 383, 393, 401, 405, 406, 414, 421, 444

mixed effects, 107–109, 111, 142, 145
model
  Aalen, see Aalen model
  Bernoulli, see Bernoulli model
  beta, see beta model
  Beta-binomial, see Beta-binomial model
  binomial, see binomial model
  Cauchy, see Cauchy model
  exponential, see exponential model
  extreme value, see extreme value model
  gamma, see gamma model
  Gompertz, see Gompertz model
  hypergeometric, see hypergeometric model
  Ising, see Ising model
  Kimura, see Kimura model
  multinomial, see multinomial model
  multinormal, see multinormal model
  negative binomial, see negative binomial model
  normal, see normal model
  Poisson, see Poisson model
  Potts, see Potts model
  Strauss, see Strauss model
  three-parameter t, see three-parameter t model
  uniform, see uniform model
  Weibull, see Weibull model

model selection, see AIC, BIC, FIC
modified health assessment questionnaire (MHAQ), 237–239, 250, 254, 267, 271, 442
multinomial model, 79, 240, 286, 293
multinormal model, 81, 209, 224, 225

negative binomial model, 31, 32, 63, 65, 297
negative binomial regression, 151, 272
Nelson–Aalen estimator, 124, 143, 289, 290, 294, 440
Neyman–Pearson theory, xvii, 4, 56, 61, 154, 165, 168, 170, 171, 174, 180, 420, 427
Neyman–Scott problem, 100, 112–115, 142, 216, 428

Nils, 357
Nobel Prize, 16, 102, 143, 383, 389, 412
noncentral chi-squared, 18, 328, 435, 466
noncentral hypergeometric distribution, 236
noncentral t, 190, 197
normal model, 18, 29, 34, 74, 155, 156, 175, 198, 216, 247, 298, 335

objective Bayes, 4, 20, 180, 428, 436
odds ratio, 14, 236, 266, 267, 314, 315, 383, 401, 403–406, 414–416
Odin, sons of, 83, 98, 438
Olympics, 73–76, 212, 214, 241, 254–256, 273, 284, 285, 383, 391, 392, 394–396, 412, 413, 438, 442, 443, 445

optimal
  combination of information, xv, 296, 365, 376, 379
  confidence, xvii, 4, 61, 62, 69, 102, 103, 109, 112, 114, 174–180, 184, 215, 223–225, 227, 230–232, 267–269, 271, 345, 364, 403–409, 413–416, 420
  estimator, 295, 461
  performance, xv
  tests, 61, 180, 420

Organisation for Economic Co-operation and Development (OECD), 396, 397

orthogonal transformation, 158
overconfidence, 434
overdispersion, 142, 150–153, 248, 272, 332, 381, 387, 412, 441, 457

pairwise comparisons, 89, 160, 175, 439
pivot
  approximate, 16, 33, 34, 73, 205, 217, 306, 352
  construction, 32, 38, 52, 145, 155, 159, 183, 345, 346, 350
  fiducial argument, 14, 15, 20, 195
  fine-tuning, 209, 229, 353
  from likelihoods, 55
  giving a confidence distribution, 55, 59, 72, 94, 117, 157, 209, 301, 370, 429
  multivariate, 274, 276, 292, 312
  property, 15, 33, 55

Poisson model, 9, 10, 40–42, 46, 51, 62, 63, 93, 102, 103, 144, 145, 150, 174, 192, 203, 210, 212, 231, 241, 242, 247, 266, 270, 297, 301, 313, 332, 387, 401, 406–409, 413, 416, 465

Poisson model, stretched, 272
Poisson model, tilted, 271
Poisson regression, 42, 46, 122, 150, 152, 247, 248, 272, 332, 441, 453, 457, 461
Poisson, bivariate, 241–244, 246, 268–270
pornoscope, 122, 124–126, 143, 145, 439
posterior distribution
  ‘objective’ priors, 19, 187, 193, 196, 429
  approximation to confidence, xiv, 4, 19, 40, 41, 193, 343, 432
  flat priors, 7, 192, 193


  trouble with, xiii, xiv, 3, 18, 65, 187, 196, 279, 389, 390, 445
  without priors, xiii, xv, 3, 105, 419, 430
Potts model, 233, 256
primary biliary cirrhosis (PBC), 263, 264, 272, 443
private attributes, 144
probability

  epistemic, xiv–xvii, 2–7, 11, 17, 19, 22, 61, 91, 92, 185, 187, 189, 194, 201, 418–420, 422–426, 429–431, 433, 434, 436
  frequentist, or aleatory, xiv, xvi, xvii, 2, 5, 6, 17, 19, 22, 61, 194, 419, 420, 436

probit regression, 94, 95, 247, 307
profile log-likelihood, xv, 23, 25, 34–37, 39, 40, 45, 49, 52, 53, 67, 78, 93–95, 113, 115–117, 122, 130, 134, 137, 143, 144, 149, 150, 204, 213, 216, 221, 224–226, 230, 231, 243, 267, 268, 272, 275, 296–299, 352, 364, 366, 367, 380, 385, 420, 433, 457

profile log-likelihood, modified, 219, 221, 228, 298
proportional hazards, xviii, 125, 127, 128, 251, 333, 453
Pushkin, Onegin, 131–133, 143, 440

quantile confidence, 107, 320–324, 333
quantile regression, 291, 413

random effects, 107, 109, 142, 145, 360–362, 376, 380, 393, 435

Rao–Blackwell theorem, 56, 154, 165–170, 180, 183, 434
ratio of scale parameters, 61, 219, 220, 278, 281
rats, 160, 161, 441
regime shift, 87, 88, 98, 99
regression
  gamma, see gamma regression
  gamma process, see gamma process crossings regression
  general, see general regression
  GLLM, see generalised linear-linear models (GLLM)
  GLM, see generalised linear models (GLM)
  hazards, see hazards regression
  heteroscedastic, see heteroscedastic regression
  inverse, see inverse regression
  isotonic, see isotonic regression
  linear, see linear regression
  logistic, see logistic regression
  negative binomial, see negative binomial regression
  Poisson, see Poisson regression
  probit, see probit regression
  quantile, see quantile regression
  robust, see robust regression
  spatial, see spatial regression

revolution, xiii, xviii, xix, 2, 3, 7, 20, 136, 295, 436
robust
  AIC, 47
  confidence, 457, 469
  deviance, 45, 457, 469
  estimation, 44, 45, 139, 325, 458–460
  inference, 23, 42, 44, 49, 330, 376, 457, 460
  likelihood, 464, 469
  regression, 459, 461
  sandwich, 44, 253

robust regression, 459, 461
rugby, 95, 96, 439

scale parameter, 59, 181
scandal, the quiet, 434
shadow, Hjort’s, 136
skiing, see cross-country skiing
skulls, Egyptian, 81, 82, 371–374, 381, 382, 438
smoking, 38–40, 267, 268, 380, 438
spatial regression, 134, 135, 353, 440
speedskating, 73–76, 262, 272, 284, 285, 383, 391–393, 395, 396, 413, 414, 438, 443, 445
Strauss model, xvii, 233, 256–259, 265, 271, 420
Strawderman, W. E., 425
Student pivot, 160
sudden infant death syndrome, 379
sufficiency
  approximate, 180, 253
  concept, 156, 165, 170, 179, 181
  conditioning on, 167, 434
  dimension reduction, 30, 154, 157, 159, 189, 195, 234, 235, 253
  minimal, 31, 157, 158, 166, 167, 182, 198, 199, 267
  principle, 7, 30, 31, 154, 157, 166, 168, 433
  property, 7, 168, 170, 172–174, 180, 183, 187, 189, 191, 193, 199, 215, 239, 264, 302, 434
  statistic, 7, 8, 14, 30, 31, 84, 107, 145, 156, 159, 165–167, 169, 171, 172, 181, 189, 193, 198, 202, 203, 238, 240, 250, 259, 303, 432

sun, rising tomorrow, 346
surprise score, 211–214, 231, 441

temperatures, 121, 136–138, 147, 148, 439, 441
three-parameter t model, 105
time series, 133–138, 143, 147, 148, 150, 336, 337, 350, 352–356, 439–441, 444
Tirant lo Blanc (novel), 99
twins, 174, 175, 441
two-by-two table, 79, 236, 266, 403, 405, 414–416

underdispersion, 151, 242, 441
unemployment, US, 347–350, 358, 444
unfairness, Olympic, 254–256, 383, 391–396, 412, 413
uniform model, 8, 14, 84, 154, 155, 314

variance components, 101, 107

Weibull model, 51, 52, 127, 128, 149, 271, 290, 341, 342, 350, 359, 462

whales
  baleen, 248, 442


  blubber, 248, 249, 267, 442
  bowhead, 83, 144, 240, 241, 267, 301, 303, 312, 364, 387, 412, 442
  humpback, 248, 315, 444
  minke, 248

Wilcoxon
  confidence, 160, 161, 325, 326, 335
  statistic, 160, 161, 204, 255, 325

Wilks theorem, 13, 35, 48, 54, 78, 228, 279, 292, 298, 328, 330, 332, 370, 421, 433

World Health Organization (WHO), 324, 361, 379

Yule’s Q, 79, 96