
STATISTICS IN MEDICINE, VOL. 16, 203—213 (1997)

SOME COMMENTS ON MISSPECIFICATION OF PRIORS IN BAYESIAN MODELLING OF MEASUREMENT ERROR PROBLEMS

SYLVIA RICHARDSON AND LAURENT LEBLOND

Institut National de la Santé et de la Recherche Médicale-U.170, 16 Avenue Paul Vaillant-Couturier, 94807 VILLEJUIF Cedex 07, France

SUMMARY

In this paper we discuss some aspects of misspecification of prior distributions in the context of Bayesian modelling of measurement error problems. A Bayesian approach to the treatment of common measurement error situations encountered in epidemiology has been recently proposed. Its implementation involves, first, the structural specification, through conditional independence relationships, of three submodels (a measurement model, an exposure model and a disease model) and secondly, the choice of functional forms for the distributions involved in the submodels. We present some results indicating how the estimation of the regression parameters of interest, which is carried out using Gibbs sampling, can be influenced by a misspecification of the parametric shape of the prior distribution of exposure.

1. INTRODUCTION

Imperfectly observed covariate data is a common problem in many epidemiological studies. The fields of nutritional, environmental or occupational epidemiology abound with examples where surrogates, for example, a food frequency questionnaire or a job title, are used to measure in a coarse way the risk factor of particular interest.1 Consequently, the development of methods for correcting the effects of measurement errors has been an active area of research,2–7 particularly in the epidemiological context.8–14

The statistical treatment of measurement error problems can be seen as part of the wider framework of incomplete data problems.15 In that area, Bayesian modelling has made substantial contributions as it proposes a unified treatment of all unobservables, which encompasses the case of covariates that are missing or unknown because of the sampling protocol or because only surrogates are recorded, as well as, more broadly, grouped or censored data, latent variables or unobserved stages in a longitudinal process.16–18 We use the term ‘missing covariate’ to denote both missing and imperfectly measured covariates. In a Bayesian framework, all unobservables are considered as random quantities which are assigned a prior distribution expressing probabilistically uncertainty or degrees of belief about their values. Statistical parameters are also treated as unobservables. Inference about the (few) parameters of interest will then be based on their marginal posterior distribution, resulting from an integration with respect to the missing data. By explicitly introducing the probabilistic structure of the missing data into the model, one ensures that the uncertainty derived from the missing data is correctly propagated onto the estimation of the parameters of interest. Thus, even though missing covariates and parameters are formally treated symmetrically in the model building step, when it comes to inference the situation is no longer symmetric as one is concerned with integrating over the structure of the missing data.

CCC 0277–6715/97/020203–11 © 1997 by John Wiley & Sons, Ltd.

Particularly suited to this integration is the treatment of these problems by stochastic algorithms of the family of Markov chain Monte Carlo (MCMC) methods.19 Indeed, the missing data structure is fully exploited by this approach, which provides dependent, approximate samples from the joint posterior distribution of all unobservables given the data. From this joint posterior, any marginal posterior distribution of interest is immediately retrievable.
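The integration step described above can be written out as a short worked formula. The notation is introduced here for illustration only and is not taken from the paper: $\beta$ stands for the parameters of interest, $\psi$ for all remaining unobservables (missing covariates and nuisance parameters), $D$ for the observed data, and $(\beta^{(t)}, \psi^{(t)})$, $t = 1, \dots, T$, for the MCMC draws retained after burn-in.

```latex
% Marginal posterior of the parameters of interest: integrate out everything else
[\beta \mid D] \;=\; \int [\beta, \psi \mid D] \, d\psi

% With joint MCMC draws, marginalisation amounts to ignoring the \psi^{(t)} coordinates,
% so that, for example, the posterior mean is approximated by
E(\beta \mid D) \;\approx\; \frac{1}{T} \sum_{t=1}^{T} \beta^{(t)}
```

In other words, no explicit integral needs to be evaluated: keeping only the $\beta$ component of each joint draw gives an approximate sample from the marginal posterior.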

In the measurement error problem, the key steps are:

(i) the specification of a measurement model linking surrogate measures Z and true covariates X;

(ii) the specification of an exposure model (prior distribution for X), also sometimes referred to as structural modelling in the literature.

The aim of this note is to present some results indicating how the estimation of the regression parameters of interest can be influenced by misspecification of the parametric form for the prior distribution of X. Clearly regression parameters may be influenced by other types of misspecification, in particular concerning the measurement model. In this paper, we restrict ourselves to misspecification of the prior distribution of X as it is a key element of the Bayesian formulation for which there is often only weak prior information. Let us note that other approaches to measurement error problems, in particular along semi-parametric lines,3,5,6,25 are specifically trying to avoid making assumptions about the distribution of X.

We start by summarizing a Bayesian approach to measurement error problems in epidemiology using conditional independence relationships, which has been outlined in a series of papers.17,20,21 Using simulated data sets, this approach has been shown to deal successfully with a wide range of measurement error problems. Related approaches have been discussed by several authors.22–25

2. CONDITIONAL INDEPENDENCE MODELLING OF MEASUREMENT ERROR PROBLEMS

The structure of the measurement problem in epidemiology can be formulated as follows. Risk factors (covariates) for each individual are to be related to the disease status (response variable) Y of that individual. However, for many or all individuals, while some risk factors C are truly known, other risk factors X are unknown, although one or several surrogate measures Z of X are recorded. To model this situation we shall distinguish three submodels (following the terminology of Clayton26):

(i) a disease model, which expresses the relationship between the risk factors C and X and the disease status Y;

(ii) a measurement model, which expresses the relationship between the surrogate measures Z and the true unknown risk factor X;

(iii) an exposure model, which specifies the distribution of the unknown risk factor X in the general population.

The underlying structure of these three submodels can be fundamentally characterized by expressing the following conditional independence assumptions:

disease model        $[Y_i \mid X_i, C_i, \beta]$        (1)

measurement model    $[Z_i \mid X_i, \lambda]$        (2)

exposure model       $[X_i \mid \pi]$        (3)

where the index i denotes individual i; $\beta$, $\lambda$ and $\pi$ denote model parameters; and $[U \mid V]$ generically denotes the conditional distribution of U given V. The variables in (1), (2) and (3) can be scalar or vector. Since we are in a Bayesian framework, prior distributions for $\beta$, $\lambda$ and $\pi$ are also required (denoted, respectively, by $[\beta]$, $[\lambda]$ and $[\pi]$). Equations (1), (2) and (3) are called model conditional distributions (model conditionals for short).

We also imply additional conditional independence assumptions (the directed Markov assumption) which specify that the joint distribution of all the variables can be written as the product of the prior distributions and all the model conditionals:

$[\beta][\lambda][\pi] \prod_i [X_i \mid \pi] \prod_i [Z_i \mid X_i, \lambda] \prod_i [Y_i \mid X_i, C_i, \beta]$.        (4)
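To make the factorization in equation (4) concrete, the following minimal Python sketch evaluates the logarithm of the product of model conditionals for one particular choice of functional forms. The choices are assumptions made for illustration (they anticipate the simulation set-up of Section 3): a logistic disease model, a normal measurement model with precision `theta`, and a univariate normal exposure model with mean `mu` and variance `sigma2`. The function names and the `log_prior` argument are hypothetical.

```python
import math


def log_normal_pdf(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def log_bernoulli_logit(y, eta):
    """Log probability of y in {0, 1} when logit P(y = 1) = eta (numerically stable)."""
    return y * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))


def log_joint(y, z, c, x, beta, theta, mu, sigma2, log_prior=0.0):
    """Log of equation (4): prior term plus one term per model conditional and individual."""
    total = log_prior  # log of the prior densities of the parameters, supplied by the caller
    for y_i, z_i, c_i, x_i in zip(y, z, c, x):
        total += log_normal_pdf(x_i, mu, sigma2)        # exposure model [X_i | pi]
        total += log_normal_pdf(z_i, x_i, 1.0 / theta)  # measurement model [Z_i | X_i, lambda]
        eta = beta[0] + beta[1] * x_i + beta[2] * c_i
        total += log_bernoulli_logit(y_i, eta)          # disease model [Y_i | X_i, C_i, beta]
    return total


one = log_joint([1], [0.0], [0.0], [0.0], (0.0, 0.0, 0.0), 1.0, 0.0, 1.0)
two = log_joint([1, 1], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], (0.0, 0.0, 0.0), 1.0, 0.0, 1.0)
```

Because the joint distribution is a product over individuals, the log joint is a sum: in the toy call above, two identical individuals contribute exactly twice the value of one.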

Equation (1) states that we place ourselves in the classical case where, conditionally on the true exposure being known, the surrogate measures $Z_i$ do not add any information on the disease status, a hypothesis also referred to in the epidemiological literature as ‘non-differential errors’. Equation (2) states that, by conditioning on appropriately defined parameters $\lambda$ and the true exposure $X_i$, the surrogate measures $Z_i$ are independent amongst individuals. Equation (3) models the population distribution of unknown risk factors amongst individuals in terms of parameters $\pi$.

By specifying the conditional distribution of the surrogate Z given the true exposure X as in equation (2), we are placing ourselves in the Bayesian analogue of what is traditionally referred to as the ‘classical error model’, where measurement error is independent of X. Another type of error model which has been considered in the literature8,14 is the Berkson error model, where equation (2) is replaced by $[X_i \mid Z_i, \lambda]$. With the Berkson error model, usually no model is specified for the marginal distribution of Z, and so, in the formulation of the measurement model, there is implicit conditioning on the data.
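The practical difference between the two error models can be seen in a small simulation. The sketch below is illustrative only: it assumes additive normal errors with precision `theta`, and a N(2; 4) distribution for whichever variable is generated first; none of these choices is prescribed by the text at this point. Under the classical model the surrogate Z is more variable than X, whereas under the Berkson model the true exposure X is more variable than Z.

```python
import random
import statistics


def simulate_classical(n, theta, rng):
    """Classical error: draw X first, then Z = X + error, error independent of X."""
    xs = [rng.gauss(2.0, 2.0) for _ in range(n)]
    zs = [x + rng.gauss(0.0, theta ** -0.5) for x in xs]
    return xs, zs


def simulate_berkson(n, theta, rng):
    """Berkson error: Z is given, then X = Z + error, error independent of Z."""
    zs = [rng.gauss(2.0, 2.0) for _ in range(n)]
    xs = [z + rng.gauss(0.0, theta ** -0.5) for z in zs]
    return xs, zs


rng = random.Random(0)
x_cl, z_cl = simulate_classical(20000, 0.9, rng)
x_bk, z_bk = simulate_berkson(20000, 0.9, rng)

# Ratio Var(Z) / Var(X): above 1 for classical error, below 1 for Berkson error
var_ratio_classical = statistics.variance(z_cl) / statistics.variance(x_cl)
var_ratio_berkson = statistics.variance(z_bk) / statistics.variance(x_bk)
```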

Equations (1) to (3) have only specified generically the structure of the measurement error problem. To use our conditional modelling approach for a given epidemiological study, one must write down model equations corresponding to the type of study design (assessment of a gold standard and existence of a validation group, use of repeated measures, of several instruments, of ancillary risk factor information) and the specific measurement instruments used in the study.

Models appropriate for studies in nutritional epidemiology have been discussed by several authors9,10,17 and a measurement error structure common to many occupational studies has been detailed in Gilks and Richardson.20 After the structural part, the functional part of the model equations has to be specified. This entails choosing particular parametric forms for the distributions involved in equations (1) to (3), as well as specifying the prior distributions of the parameters.

As indicated in the introduction, Bayesian estimation in the general framework outlined above can be carried out straightforwardly by Gibbs sampling. Gibbs sampling is a Markov chain Monte Carlo method for generating samples from the joint posterior distribution of the model parameters. It was originally proposed by Hastings.27 As the joint posterior distribution of interest, the target distribution, is highly multidimensional, direct simulation is not possible; hence an irreducible Markov chain is constructed whose stationary distribution is the target distribution of interest. The wide applicability of the algorithm to general statistical modelling was emphasized by Gelfand and Smith,28 and Gelfand et al.,29 and has since been demonstrated by many authors. A general introduction and many examples of application, including measurement error problems, can be found in Gilks et al.30 For computational details of the implementation of some measurement error models the reader is referred to references 17 and 20.
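As a minimal illustration of how a Gibbs sampler alternates between full conditional distributions, the sketch below treats a deliberately reduced version of the problem: only an exposure model $X_i \sim N(\mu, \sigma^2)$ and a measurement model $Z_i \mid X_i \sim N(X_i, \theta^{-1})$, with $\sigma^2$ and $\theta$ assumed known and a normal prior on $\mu$. This is not the implementation used in the paper (for which see references 17 and 20); the function name, default values and the reduction to a single unknown parameter are assumptions made for illustration.

```python
import random


def gibbs_exposure_mean(z, theta, sigma2, n_iter=2000, burn_in=500,
                        prior_mean=2.0, prior_prec=0.015, seed=1):
    """Gibbs sampler for mu, treating the unobserved exposures X_i as missing data."""
    rng = random.Random(seed)
    n = len(z)
    mu = sum(z) / n                    # starting value
    prec_x = theta + 1.0 / sigma2      # precision of each full conditional [X_i | Z_i, mu]
    prec_mu = prior_prec + n / sigma2  # precision of the full conditional [mu | X]
    draws = []
    for iteration in range(n_iter):
        # Step 1: update every missing exposure given its surrogate and the current mu
        x = [rng.gauss((theta * z_i + mu / sigma2) / prec_x, prec_x ** -0.5) for z_i in z]
        # Step 2: update mu given the current values of the exposures
        mu = rng.gauss((prior_prec * prior_mean + sum(x) / sigma2) / prec_mu,
                       prec_mu ** -0.5)
        if iteration >= burn_in:
            draws.append(mu)           # keep post burn-in draws only
    return draws


# Simulated surrogates: X ~ N(2, 4), Z | X ~ N(X, 1/0.9)
data_rng = random.Random(0)
z_obs = [data_rng.gauss(data_rng.gauss(2.0, 2.0), 0.9 ** -0.5) for _ in range(200)]
mu_draws = gibbs_exposure_mean(z_obs, theta=0.9, sigma2=4.0)
posterior_mean_mu = sum(mu_draws) / len(mu_draws)
z_bar = sum(z_obs) / len(z_obs)
```

With such a vague prior, the posterior mean of $\mu$ should lie very close to the sample mean of the surrogates, which gives a simple check that the chain targets the intended distribution.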

3. INFLUENCE OF THE MISSPECIFICATION OF THE EXPOSURE MODEL

At each step of the approach we have outlined, conditional distributions need to be explicitly specified in a parametric way. While some of these parametric distributions arise naturally, such as the choice of the logistic model for the disease risk, other assumed distributional forms are more arbitrary. In particular, there are some cases where little is known about the distribution of the exposure X and an appropriate model for it. In a context of radiation exposure through fall-out of nuclear tests, Thomas et al.22 have preferred to use a discrete distribution with a variable number of atoms for modelling the exposure.

In this section we present a series of examples with the aim of investigating how misspecification of the exposure model influences the performance of our method of analysis. We have used simulated data sets throughout.

3.1. Design set-up

Two risk factors are involved in the disease model. The first one, X, is measured with error and the second one, C, is known accurately. We consider the case of a logistic link between risk factors and disease status. Specifically, we suppose that $Y_i$ follows a Bernoulli distribution with parameter $\alpha_i$, where $\operatorname{logit} \alpha_i = \beta_0 + \beta_1 X_i + \beta_2 C_i$.

Concerning the measurement process, we consider the simple case of an unbiased instrument Z, possibly recorded twice in a subgroup of individuals. We thus suppose that the measurement model conditional, for the rth repeat $Z_{ir}$ of the instrument, is a normal distribution with mean $X_i$ and variance $\theta^{-1}$:

$[Z_{ir} \mid X_i, \theta] \sim N(X_i, \theta^{-1}), \quad r = 1, 2.$

Two cases of study design are considered:

Design 1: It comprises a main study consisting of 1000 individuals where only the surrogate Z has been recorded. In addition to the main study, there is a validation group, planned in advance, of n = 50 individuals, that is, a group where, besides Z, it is assumed that it has been possible to measure accurately the true exposure X by means of a gold standard.

Design 2: The main study is as above but, instead of a validation group, the design now includes a subgroup of n = 50 individuals where Z has been recorded independently twice.
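To see why repeated measures carry information about the measurement precision, note that under the measurement model above the difference of two independent repeats on the same individual has variance $2\theta^{-1}$, whatever the distribution of X. The sketch below uses this fact to form a simple moment estimate of $\theta$. It is only an illustration and not the Bayesian analysis carried out in the paper; the function names are hypothetical and the exposure is drawn from N(2; 4) purely for convenience.

```python
import random


def estimate_precision_from_repeats(pairs):
    """Moment estimate of theta from pairs (Z_i1, Z_i2), using E[(Z_i1 - Z_i2)^2] = 2 / theta."""
    mean_sq_diff = sum((z1 - z2) ** 2 for z1, z2 in pairs) / len(pairs)
    return 2.0 / mean_sq_diff


def simulate_repeats(n, theta, rng):
    """Two independent unbiased measurements of the same unobserved exposure."""
    sd = theta ** -0.5
    pairs = []
    for _ in range(n):
        x = rng.gauss(2.0, 2.0)  # unobserved true exposure (illustrative choice)
        pairs.append((rng.gauss(x, sd), rng.gauss(x, sd)))
    return pairs


rng = random.Random(42)
theta_hat_small = estimate_precision_from_repeats(simulate_repeats(50, 0.9, rng))
theta_hat_large = estimate_precision_from_repeats(simulate_repeats(5000, 0.9, rng))
```

With only 50 pairs, as in design 2, the estimate is noticeably noisy; the larger sample is included only to show that the estimate settles near the value used in the simulation.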

3.2. Simulation set-up

For each design, a baseline case (a) where X is normally distributed and four cases (b) to (e) of non-Gaussian distributions for X are considered:

(a) $X \sim N(2; 4)$
(b) $X \sim 0.5\, N(0.26; 1.0) + 0.5\, N(3.73; 1.0)$
(c) $X \sim 0.5\, N(0.26; 2.22) + 0.5\, N(3.73; 0.74)$
(d) $X \sim \text{log-normal}(0.346; 0.832)$
(e) $X \sim \chi^2_2$.
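The five generating distributions can be sampled with the Python standard library, as in the sketch below. Two readings of the notation are assumptions on our part: the second argument of N(·; ·) is taken to be a variance (which agrees with the correlation of 0.6 quoted for case (a)), and the second argument of the log-normal is taken to be the standard deviation on the log scale (which gives an expected value close to 2). The function name `draw_exposure` is hypothetical.

```python
import math
import random


def draw_exposure(case, rng):
    """Draw one exposure value X under generating distribution (a) to (e)."""
    if case == "a":
        return rng.gauss(2.0, math.sqrt(4.0))
    if case == "b":  # symmetric two-component normal mixture
        mean = 0.26 if rng.random() < 0.5 else 3.73
        return rng.gauss(mean, 1.0)
    if case == "c":  # asymmetric two-component normal mixture
        if rng.random() < 0.5:
            return rng.gauss(0.26, math.sqrt(2.22))
        return rng.gauss(3.73, math.sqrt(0.74))
    if case == "d":  # log-normal: mean and standard deviation on the log scale
        return rng.lognormvariate(0.346, 0.832)
    if case == "e":  # chi-square with 2 degrees of freedom = gamma(shape 1, scale 2)
        return rng.gammavariate(1.0, 2.0)
    raise ValueError("unknown case: " + repr(case))


rng = random.Random(0)
n_draws = 20000
sample_means = {
    case: sum(draw_exposure(case, rng) for _ in range(n_draws)) / n_draws
    for case in "abcde"
}
```

Under these readings all five sample means should fall close to 2, the common expected value of X.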


For all the data sets, the disease status Y was generated assuming logistic regression parameters fixed at $\beta_0 = -0.8$, $\beta_1 = 0.9$, $\beta_2 = 1.2$. Throughout we suppose that X and C are associated by a linear relation: $C = \gamma X + \varepsilon$, where $\varepsilon$ is a standard Gaussian variable, independent of X, and $\gamma = 0.375$. In our baseline case (a) this corresponds to a correlation between X and C equal to 0.6. Note that the expected value of X is 2.0 for all examples. For each design and each case (a) to (e), 10 data sets were simulated and analysed by Gibbs sampling.
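The data-generating mechanism just described can be sketched as follows for the baseline case (a). The code is an illustration under stated assumptions, not the authors' simulation program: the function and constant names are hypothetical, and the precision 0.9 of the measurement model is the true value reported in the tables. The sketch also checks the stated correlation, which follows from $\mathrm{corr}(X, C) = \gamma\,\mathrm{sd}(X) / \sqrt{\gamma^2 \mathrm{var}(X) + 1}$.

```python
import math
import random

BETA = (-0.8, 0.9, 1.2)  # logistic regression parameters beta_0, beta_1, beta_2
GAMMA = 0.375            # slope of the linear relation between C and X
THETA = 0.9              # precision of the measurement model


def simulate_individual(rng):
    """Generate (X, C, Z, Y) for one individual in baseline case (a)."""
    x = rng.gauss(2.0, 2.0)              # true exposure, N(2; 4)
    c = GAMMA * x + rng.gauss(0.0, 1.0)  # accurately known covariate
    z = rng.gauss(x, THETA ** -0.5)      # surrogate, unbiased for X
    eta = BETA[0] + BETA[1] * x + BETA[2] * c
    prob = 1.0 / (1.0 + math.exp(-eta))  # inverse logit
    y = 1 if rng.random() < prob else 0  # Bernoulli disease status
    return x, c, z, y


def pearson(u, v):
    """Sample Pearson correlation coefficient."""
    mu_u, mu_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


rng = random.Random(0)
records = [simulate_individual(rng) for _ in range(5000)]
xs = [r[0] for r in records]
cs = [r[1] for r in records]

analytic_corr = GAMMA * 2.0 / math.sqrt(GAMMA ** 2 * 4.0 + 1.0)  # 0.75 / 1.25
empirical_corr = pearson(xs, cs)
```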

3.3. Prior distribution used in the Bayesian analysis

Each of the simulated data sets was analysed by Gibbs sampling, under the assumption that the exposure model for $(X, C)^T$ was specified as a bivariate normal distribution with mean $\mu$ and variance-covariance matrix $\Sigma$, with a vague normal prior distribution for $\mu$ centred around $(2.0, 2.0)^T$ with precision matrix $\begin{pmatrix} 0.015 & 0.0 \\ 0.0 & 0.015 \end{pmatrix}$ and a Wishart prior distribution for $\Sigma$ with 5 degrees of freedom and identity scale matrix.

Hence, apart from data set (a), the parametric shape of the prior distribution of the exposure used to carry out the Bayesian analysis is misspecified. Concerning the prior distribution for the regression parameters, we assumed $\beta_0 \sim N(-1.0; 4.0)$, $\beta_i \sim N(0.0; 4.0)$, $i = 1, 2$, and $\theta \sim \text{gamma}(0.1; 0.1)$.

3.4. Results

The results are presented in Table I for the design with a validation group (design 1) and in Table II for the design with repeated measures in a subgroup (design 2). For each case and each of the 10 data sets, we ran the Gibbs sampler for 5000 iterations, discarding the first 500 iterations as a burn-in; good convergence behaviour of the Gibbs sampler in measurement error problems has been previously reported17 and visual inspection of the sequences of parameter values confirmed this. We have summarized marginal posterior distributions of the parameters of interest by reporting posterior means and posterior standard deviations averaged over the 10 simulations, as well as the empirical standard deviation between the mean estimates in each of the 10 simulations.
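The summaries reported in the tables can be computed from the sampler output along the following lines. This is a sketch under assumptions: `chains` is taken to be a list with one sequence of draws of a single parameter per simulated data set, and the function and key names are hypothetical. The between-replication standard deviation of the posterior SDs is included because the tables report a bracketed value on the posterior SD rows as well.

```python
import statistics


def summarise_replicates(chains, burn_in):
    """Summarize one parameter over replicated data sets, after discarding burn-in."""
    kept = [chain[burn_in:] for chain in chains]
    post_means = [statistics.mean(draws) for draws in kept]  # one posterior mean per data set
    post_sds = [statistics.stdev(draws) for draws in kept]   # one posterior SD per data set
    return {
        "mean_of_posterior_means": statistics.mean(post_means),
        "sd_between_posterior_means": statistics.stdev(post_means),
        "mean_of_posterior_sds": statistics.mean(post_sds),
        "sd_between_posterior_sds": statistics.stdev(post_sds),
    }


# Toy input: two replicates, the first two draws of each being burn-in
toy_chains = [[9.0, 9.0, 1.0, 3.0], [9.0, 9.0, 2.0, 4.0]]
toy_summary = summarise_replicates(toy_chains, burn_in=2)
```

In the setting of the paper, `chains` would hold 10 sequences of 5000 draws each and `burn_in` would be 500.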

3.4.1. Standard case (data set (a))

As expected from previous simulations, the results show that our estimation method has performed satisfactorily for the two designs, with all the estimated posterior means of the parameters very close to the values set in the simulation.

Note that the prior mean $\mu_X$ and the precision $\theta$ have also been well estimated. As expected, the design with a validation group leads to lower posterior standard deviations for all the parameters directly influenced by the measurement design, that is, $\beta_0$, $\beta_1$ and $\theta$. Note also that $\beta_2$, which corresponds to the covariate C measured without error but correlated to X, is well estimated with similar precision for the two designs.

3.4.2. Mixture case (data sets (b) and (c))

In data sets (b), the underlying true exposure distribution is symmetric and bimodal, with well separated peaks, whereas in data sets (c) the mixture distribution is asymmetric (Figure 1). The corresponding shapes for the distribution of the surrogate Z are also outlined in Figure 1. Note that the bimodality is clearly attenuated in the surrogate, so that choosing a parametric distribution for X from the histogram of the Z's is not straightforward.
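The attenuation of bimodality can be quantified directly for case (b). Since Z is X plus an independent normal error of variance $1/\theta$, the density of Z is again a two-component normal mixture with the same means and with each component variance increased by $1/\theta$. The sketch below compares, for X and for Z, the density at the midpoint between the two component means with the density at the highest peak; a ratio near 1 means that the dip between the modes has almost disappeared. The function names are hypothetical, and the component variance of 1.0 assumes that the second argument of N(·; ·) is a variance.

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x, var):
    """Equal-weight mixture of normals centred at 0.26 and 3.73 with common variance var."""
    return 0.5 * normal_pdf(x, 0.26, var) + 0.5 * normal_pdf(x, 3.73, var)


def dip_ratio(var):
    """Density at the midpoint between the component means divided by the peak density."""
    grid = [-6.0 + 0.01 * k for k in range(1601)]  # grid from -6 to 10
    peak = max(mixture_pdf(t, var) for t in grid)
    midpoint = 0.5 * (0.26 + 3.73)
    return mixture_pdf(midpoint, var) / peak


THETA = 0.9
ratio_x = dip_ratio(1.0)                # true exposure X, case (b)
ratio_z = dip_ratio(1.0 + 1.0 / THETA)  # surrogate Z = X + normal error
```

For X the midpoint density is less than half of the peak density, whereas for Z it comes to roughly nine tenths of it, so a histogram of a moderate number of Z values can easily look unimodal.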

Overall we see some deterioration in the estimation of $\beta_0$ and $\beta_1$. The estimated posterior means are not so close to the set values and the posterior standard deviations are noticeably

Table I. Gibbs sampling analysis of a design with a validation subgroup (n = 50) under different generating distributions of the exposure X. Data sets (b) to (e) are analysed with a misspecified prior distribution for the exposure X

Parameter                     $\beta_0$       $\beta_1$       $\beta_2$       $\mu_X$         $\theta$
True value                    -0.8            0.9             1.2             2.0             0.9

Data sets (a) (normal)
  posterior mean              -0.77 (0.25)    0.89 (0.11)     1.24 (0.08)     2.01 (0.07)     0.94 (0.05)
  posterior SD                0.16 (0.03)     0.11 (0.02)     0.12 (0.008)    0.07 (0.002)    0.09 (0.01)

Data sets (b) (mixture)
  posterior mean              -1.13 (0.15)    1.05 (0.12)     1.18 (0.13)     2.04 (0.11)     0.89 (0.26)
  posterior SD                0.23 (0.05)     0.17 (0.04)     0.13 (0.009)    0.09 (0.004)    0.16 (0.07)

Data sets (c) (mixture)
  posterior mean              -0.85 (0.25)    0.94 (0.16)     1.31 (0.08)     2.02 (0.05)     1.02 (0.24)
  posterior SD                0.19 (0.06)     0.14 (0.05)     0.14 (0.02)     0.09 (0.002)    0.19 (0.06)

Data sets (d) (log-normal)
  posterior mean              -0.57 (0.13)    0.67 (0.09)     1.14 (0.15)     1.95 (0.04)     0.72 (0.13)
  posterior SD                0.22 (0.05)     0.16 (0.04)     0.12 (0.009)    0.08 (0.005)    0.14 (0.03)

Data sets (e) (chi-square)
  posterior mean              -0.65 (0.15)    0.69 (0.09)     1.15 (0.10)     1.97 (0.11)     0.88 (0.12)
  posterior SD                0.19 (0.03)     0.14 (0.04)     0.11 (0.009)    0.08 (0.004)    0.17 (0.03)

Each line summarizes results over 10 independent replications: mean and between replication standard deviations (given in brackets).
(a) $(X, C)^T \sim N\left( (2.0, 1.5)^T ; \begin{pmatrix} 4.0 & 1.5 \\ 1.5 & 1.56 \end{pmatrix} \right)$
(b) $X \sim 0.5\, N(0.26; 1.0) + 0.5\, N(3.73; 1.0)$
(c) $X \sim 0.5\, N(0.26; 2.22) + 0.5\, N(3.73; 0.74)$
(d) $X \sim \text{log-normal}(0.346; 0.832)$
(e) $X \sim \chi^2_2$

Table II. Gibbs sampling analysis of a design with two repeated measures in a subgroup (n = 50) under different generating distributions of the exposure X. Data sets (b) to (e) are analysed with a misspecified prior distribution for the exposure X

Parameter                     $\beta_0$       $\beta_1$       $\beta_2$       $\mu_X$         $\theta$
True value                    -0.8            0.9             1.2             2.0             0.9

Data sets (a) (normal)
  posterior mean              -0.79 (0.12)    0.90 (0.12)     1.24 (0.11)     2.02 (0.04)     0.95 (0.17)
  posterior SD                0.22 (0.04)     0.17 (0.04)     0.13 (0.01)     0.09 (0.003)    0.18 (0.04)

Data sets (b) (mixture)
  posterior mean              -1.34 (0.37)    1.25 (0.29)     1.15 (0.06)     2.00 (0.07)     0.75 (0.11)
  posterior SD                0.45 (0.24)     0.36 (0.20)     0.16 (0.03)     0.09 (0.003)    0.15 (0.04)

Data sets (c) (mixture)
  posterior mean              -1.31 (0.56)    1.35 (0.46)     1.07 (0.16)     2.00 (0.07)     0.78 (0.20)
  posterior SD                0.37 (0.22)     0.32 (0.19)     0.17 (0.04)     0.10 (0.005)    0.14 (0.05)

Data sets (d) (log-normal)
  posterior mean              -0.38 (0.28)    0.62 (0.25)     1.16 (0.24)     2.02 (0.05)     0.90 (0.20)
  posterior SD                0.20 (0.09)     0.15 (0.09)     0.13 (0.05)     0.09 (0.005)    0.19 (0.06)

Data sets (e) (chi-square)
  posterior mean              -0.69 (0.21)    0.75 (0.19)     1.22 (0.14)     1.99 (0.09)     0.92 (0.25)
  posterior SD                0.25 (0.12)     0.19 (0.11)     0.13 (0.02)     0.09 (0.006)    0.19 (0.04)

Each line summarizes results over 10 independent replications: mean and between replication standard deviations (given in brackets).
(a) $(X, C)^T \sim N\left( (2.0, 1.5)^T ; \begin{pmatrix} 4.0 & 1.5 \\ 1.5 & 1.56 \end{pmatrix} \right)$
(b) $X \sim 0.5\, N(0.26; 1.0) + 0.5\, N(3.73; 1.0)$
(c) $X \sim 0.5\, N(0.26; 2.22) + 0.5\, N(3.73; 0.74)$
(d) $X \sim \text{log-normal}(0.346; 0.832)$
(e) $X \sim \chi^2_2$

Figure 1. Histograms of the distribution of the surrogate Z for cases (b) to (e) of non-Gaussian distribution for the exposure X, with $Z \sim N(X, (0.9)^{-1})$. The density of X is plotted as a full line and a smooth density estimate of Z as a broken line:
(b) $X \sim 0.5\, N(0.26; 1.0) + 0.5\, N(3.73; 1.0)$
(c) $X \sim 0.5\, N(0.26; 2.22) + 0.5\, N(3.73; 0.74)$
(d) $X \sim \text{log-normal}(0.346; 0.832)$
(e) $X \sim \chi^2_2$

increased, particularly for data sets (b). The regression parameter $\beta_1$ is overestimated, particularly for design 2, and there are also wider fluctuations between the results of the 10 replicated data sets.

As expected, the estimation of $\beta_2$ is not much affected by the misspecification, the correlation between X and C leading only to a slightly larger posterior standard deviation for $\beta_2$. Results for $\mu_X$ and $\theta$ are still satisfactory. Note that the misspecification of X has increased the posterior standard deviation for $\theta$ in the validation design but not in the repeated measure design. This is because in design 1 the validation group provides information on both $\theta$ and $\pi$, whereas in design 2 the repeated measures carry only weak information on $\pi$. Thus $\theta$ is more precisely assessed in design 1, provided that there is no conflict between the values of X in the validation group and those generated in the main study with the help of the prior distribution of X.

3.4.3. Log-normal and chi-square cases (data sets (d) and (e))

There is again a deterioration of the estimation of $\beta_0$ and $\beta_1$, with underestimation of $\beta_1$ rather than overestimation as in cases (b) and (c). In contrast to the mixture cases (b) and (c), there is no clear pattern of increase of the posterior standard deviations for these parameters. Consequently, in the log-normal case, the posterior mean of $\beta_0$ is biased and more than two posterior standard deviations away from the set value of $-0.8$. As previously, the parameters $\beta_2$, $\mu_X$ and $\theta$ are well estimated, the misspecification being only reflected by increased fluctuations between the 10 data sets, in particular for $\theta$.

4. DISCUSSION

In this short note we have reviewed some of the aspects of the Bayesian approach to measurement error problems via the specification of conditional independence models and the implementation of stochastic simulation algorithms. There are several advantages of this approach over methods previously proposed, which have been extensively discussed in Richardson and Gilks.17 Of paramount importance is its flexibility, which enables the modelling of a wide range of measurement error situations without resorting to artificial simplifying assumptions. This has important design implications for future studies and an important area for research is to develop guidelines for complex designs.

A first step in the construction of such models is the stipulation of suitable conditional independence assumptions. Careful thought has to be given to the implications of each of these assumptions in any particular context. As an example, the conditional independence between the repeated measures of a surrogate given the true risk factor, assumed in design 2, would not hold if there is a systematic bias in the measurement instrument.

At a second step, parametric distributions are specified. Misspecification can thus occur in a variety of ways. The influence of misspecification of the unknown exposure distribution on regression parameter estimates gives cause for concern and we have centred our discussion on this problem. With respect to the regression coefficient for X we have shown some sensitivity to misspecification, the overall picture being that of a moderate bias in the estimates and increased posterior standard deviations. On the other hand, misspecification of the prior distribution of X has little influence on the estimation of the regression coefficient of a covariate measured without error (even when correlated with X) or on the estimation of the precision of the measurement error model.

There is strong interest in being able to relax the fully parametric set-up, in a way which is not data dependent, while keeping the flexibility of the conditional independence modelling and the Bayesian approach. The use of flexible mixture distributions is a natural way to go in that direction. Indeed, mixtures of standard distributions are often used in a semi-parametric way to approximate distributions which are not easily modelled by standard parametric families. Gibbs sampling analysis of finite mixtures has been described by Diebolt and Robert31 and the next stage of development of our modelling approach will be to incorporate the possibility of using a mixture model for the exposure distribution. Furthermore, advantage might be taken of recent developments in the class of MCMC algorithms32 which will enable the use of mixtures with a variable number of components, thus increasing the flexibility of this semi-parametric approach. Finally, let us note that in recent work, Mallick and Gelfand33 have used mixtures of beta distributions to model unknown link functions and have also applied these ideas in the context of measurement error problems.25

REFERENCES

1. Willett, W. ‘An overview of issues related to the correction of non-differential exposure measurement error in epidemiologic studies’, Statistics in Medicine, 8, 1031–1040 (1989).

2. Fuller, W. A. Measurement Error Models, Wiley, New York, 1987.

3. Carroll, R. J. and Wand, M. P. ‘Semiparametric estimation in logistic measurement error models’, Journal of the Royal Statistical Society, Series B, 53, 573–585 (1991).

4. Chesher, A. ‘The effect of measurement error’, Biometrika, 78, 451–462 (1991).

5. Pepe, M. S. and Fleming, T. R. ‘A non-parametric method for dealing with mismeasured covariate data’, Journal of the American Statistical Association, 86, 108–113 (1991).

6. Robins, J. M., Rotnitzky, A. and Zhao, L. P. ‘Estimation of regression coefficients when some regressors are not always observed’, Journal of the American Statistical Association, 89, 846–866 (1994).

7. Carroll, R. J., Ruppert, D. and Stefanski, L. A. Measurement Error in Nonlinear Models, Chapman & Hall, New York, 1995.

8. Armstrong, B. G. ‘The effects of measurement errors on relative risk regression’, American Journal of Epidemiology, 132, 1176–1184 (1990).

9. Rosner, B., Willett, W. C. and Spiegelman, D. ‘Correction of logistic regression relative risk estimates and confidence intervals for systematic within-person measurement error’, Statistics in Medicine, 8, 1051–1069 (1989).

10. Rosner, B., Spiegelman, D. and Willett, W. C. ‘Correction of logistic regression relative risk estimates and confidence intervals for measurement error: the case of multiple covariate measured with error’, American Journal of Epidemiology, 132, 734–745 (1990).

11. Pierce, D. A., Stram, D. O., Vaeth, M. and Schafer, D. W. ‘The errors in variables problem: considerations provided by radiation dose-response analyses of the A-bomb survivor data’, Journal of the American Statistical Association, 87, 351–359 (1992).

12. Duffy, S. W., Maximovitch, D. and Day, N. E. ‘External validation, repeat determination and precision of risk estimation in misclassified exposure data in epidemiology’, Journal of Epidemiology and Community Health, 46, 620–624 (1992).

13. Carroll, R. J., Gail, M. H. and Lubin, J. H. ‘Case-control studies with errors in covariates’, Journal of the American Statistical Association, 88, 421, 185–199 (1993).

14. Thomas, D., Stram, D. and Dwyer, J. ‘Exposure measurement error: influence on exposure-disease relationship and methods of correction’, Annual Review of Public Health, 14, 69–93 (1993).

15. Tanner, M. A. Tools for Statistical Inference, Springer Verlag, New York, 1993.

16. Gelfand, A. E. and Smith, A. F. M. ‘Bayesian analysis of constrained parameters and truncated data problems using Gibbs sampling’, Journal of the American Statistical Association, 87, 523–532 (1992).

17. Richardson, S. and Gilks, W. R. ‘Conditional independence models for epidemiological studies with covariate measurement error’, Statistics in Medicine, 12, 1703–1722 (1993).

18. Kirby, A. J. and Spiegelhalter, D. J. Statistical Modelling for the Precursors of Cervical Cancer, Case Studies in Biometry, N. Lange (ed), Wiley, New York, 1994.

19. Besag, J., Green, P. J., Higdon, D. and Mengersen, K. ‘Bayesian computation and stochastic systems’, Statistical Science, 10, 1, 3–41 (1995).

20. Gilks, W. R. and Richardson, S. ‘Analysis of disease risks using ancillary risk factors, with application to job-exposure matrices’, Statistics in Medicine, 11, 1443–63 (1992).

21. Richardson, S. and Gilks, W. R. ‘A Bayesian approach to measurement error problems in epidemiology using conditional independence models’, American Journal of Epidemiology, 138, 6, 430–442 (1993).

22. Thomas, D. C., Gauderman, J. and Kerber, R. ‘A non-parametric Monte Carlo approach to adjustment for covariate measurement errors in regression problems’, Technical report, Department of Preventive Medicine, University of Southern California, 1991.

23. Stephens, D. A. and Dellaportas, P. ‘Bayesian analysis of generalised linear models with covariate measurement error’, in Bernardo, J. M., Berger, J. O., Dawid, A. P. and Smith, A. F. M. (eds), Bayesian Statistics 4, Oxford University Press, Oxford, 1992.

24. Schmid, C. H. and Rosner, B. ‘A Bayesian approach to logistic regression models having measurement error following a mixture distribution’, Statistics in Medicine, 12, 1141–1153 (1993).

25. Mallick, B. K. and Gelfand, A. E. ‘Semiparametric errors in variables models: a Bayesian approach’, Journal of Statistical Planning and Inference, 52, 307–321 (1996).

26. Clayton, D. G. ‘Models for the analysis of cohort and case-control studies with inaccurately measured exposures’, in Dwyer, J. H., Feinleib, N., Lippert, P. and Hoffmeister, H. (eds), Statistical Models for Longitudinal Studies of Health, Oxford University Press, New York, 1992.

27. Hastings, W. K. ‘Monte-Carlo sampling methods using Markov chains and their applications’, Biometrika, 57, 97–109 (1970).

28. Gelfand, A. E. and Smith, A. F. M. ‘Sampling based approaches to calculating marginal densities’, Journal of the American Statistical Association, 85, 398–409 (1990).

29. Gelfand, A. E., Hills, S. E., Racine-Poon, A. and Smith, A. F. M. ‘Illustration of Bayesian inference in normal data models using Gibbs sampling’, Journal of the American Statistical Association, 85, 972–985 (1990).

30. Gilks, W. R., Richardson, S. and Spiegelhalter, D. J. (eds) Practical Markov Chains Monte Carlo, Chapman and Hall, London, 1996.

31. Diebolt, J. and Robert, C. P. ‘Estimation of finite mixture distributions through Bayesian sampling’, Journal of the Royal Statistical Society, Series B, 56, 163–175 (1994).

32. Green, P. J. ‘Reversible jump MCMC computation and Bayesian model determination’, Biometrika, 82, 4, 711–732 (1995).

33. Mallick, B. K. and Gelfand, A. E. ‘Generalized linear models with unknown link functions’, Biometrika, 81, 237–245 (1995).
