UNIVERSITY OF CALIFORNIA
Los Angeles
Analysis of longitudinal data with missing values
A dissertation submitted in partial satisfaction
of the requirements for the degree
Doctor of Philosophy in Statistics
by
Jinhui Li
2006
-
Copyright by
Jinhui Li
2006
-
The dissertation of Jinhui Li is approved.
Xiaowei Yang
Thomas R. Belin
Rick Paik Schoenberg
Hongquan Xu
Yingnian Wu, Committee Chair
University of California, Los Angeles
2006
-
To my family
-
TABLE OF CONTENTS
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Models for longitudinal data analysis and missing data issue . . . . . . 5
2.1 Models for longitudinal data analysis . . . . . . . . . . . . . . . . . . 6
2.1.1 Linear models . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Generalized linear models . . . . . . . . . . . . . . . . . . . 7
2.1.3 Transition models . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.4 Random effect models . . . . . . . . . . . . . . . . . . . . . 9
2.1.5 Estimation and inference . . . . . . . . . . . . . . . . . . . . 10
2.2 Missing data issue . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Ignorable versus Nonignorable . . . . . . . . . . . . . . . . . 13
2.3 Modelling with missing values . . . . . . . . . . . . . . . . . . . . . 15
2.3.1 Multiple partial imputation . . . . . . . . . . . . . . . . . . . 16
2.3.2 Selection model . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.3 REMTM for binary response variables . . . . . . . . . . . . . 20
3 Bayesian framework with MCMC approach . . . . . . . . . . . . . . . 24
3.1 Bayesian methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.1 Bayes Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.2 Prior distribution . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.3 Bayesian inference . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 MCMC methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
-
3.2.1 Monte Carlo methods . . . . . . . . . . . . . . . . . . . . . . 28
3.2.2 MCMC fundamentals . . . . . . . . . . . . . . . . . . . . . 30
3.2.3 Gibbs sampling . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.4 Metropolis-Hastings algorithm . . . . . . . . . . . . . . . . . 34
3.2.5 Adaptive rejection sampling for Gibbs sampling . . . . . . . 35
4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1 Bayesian Inference using MCMC . . . . . . . . . . . . . . . . . . . 39
4.1.1 Prior specification and monitoring convergence . . . . . . . 39
4.1.2 MCMC: data augmentation via Gibbs sampler . . . . . . . . 40
4.2 Implementation for selection models . . . . . . . . . . . . . . . . . . 42
4.2.1 Selection Models with AR(1) Covariance . . . . . . . . . . . 42
4.2.2 Selection Models with Random-Effects . . . . . . . . . . . . 44
4.3 Implementation for shared-parameter models . . . . . . . . . . . . . 46
4.3.1 Count response variables . . . . . . . . . . . . . . . . . . . . 47
4.3.2 Binary response variables . . . . . . . . . . . . . . . . . . . 49
4.3.3 Continuous response variables . . . . . . . . . . . . . . . . . 50
4.4 Model extension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.5 MPI package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.1 Selection models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.1.1 A simulation study . . . . . . . . . . . . . . . . . . . . . . . 55
5.1.2 Selection Model for Continuous Carbon Monoxide Levels . . 58
-
5.2 Shared-parameter models . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2.1 A Simulation Study on count response variables . . . . . . . 63
5.2.2 Application to Smoking Cessation Data . . . . . . . . . . . . 65
5.3 Multiple-model sensitivity analysis . . . . . . . . . . . . . . . . . . . 71
5.3.1 Pattern-Mixture Models for Continuous Carbon Monoxide Levels . . . . 72
5.3.2 REMTM for Continuous Carbon Monoxide Data . . . . . . . 74
6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
-
LIST OF FIGURES
2.1 Three Representations of Missingness Mechanisms. Three ways of modeling
incomplete data are depicted here. Parameters and symbols in the figures are
defined as: y_i = (y_i^obs, y_i^mis), the observed and missing values for
subject i; r_i, the missingness indicators for repeated measures on subject i;
θ, the parameters of the data; ψ, the parameters of the missingness indicators;
and b_i, the parameters shared by the data and the missingness indicators. . . . 12
3.1 The figure shows the idea of Adaptive Rejection Sampling, with functions
defined as: h(x) = log(g(x)), the logarithm of a function g(x) proportional
to a statistical density function f(x); u(x), the upper hull of h(x); and
l(x), the lower hull of h(x). . . . . . . . . . . . . . . . . . . . . . 36
5.1 The average and SD curves for the log-scaled carbon monoxide levels.
The four mean curves of the log-scaled carbon monoxide levels and the
corresponding point-wise standard errors are drawn for each of the four
treatment conditions: Control, RP-only, CM-only, and RP+CM (RP = Relapse
Prevention, CM = Contingency Management). Vertical bars indicate the
estimated standard errors of the average carbon monoxide levels. The stars
(*) above the x-axis mark the time points (i.e., visit numbers) where the
carbon monoxide levels differ significantly according to a point-wise ANOVA
(P-value < 0.001). The y-axis gives the carbon monoxide levels after a
log(1 + x) transform; the x-axis gives the clinic visit number for study
participants (1, . . . , 36). . . . . . . . . . . . . . . . . . . . . . 60
-
5.2 Missingness patterns for the carbon monoxide levels across treatment
conditions. For each treatment condition, an image depicts the missingness
indicators of carbon monoxide levels for each smoker at each research visit.
Dark areas indicate that the corresponding carbon monoxide levels were
observed, while white areas indicate that the corresponding data were missing
intermittently or missing after dropout. The four treatment conditions are
Control, RP-only, CM-only, and RP+CM (RP = Relapse Prevention,
CM = Contingency Management). . . . . . . . . . . . . . . . . . . . . . 61
5.3 The Sampling Process and Posterior Distribution of Parameter β1 in
the Simulation Study. The left panel draws three chains of the augmented
Gibbs sampler, each starting from a randomly selected point. After around
100 iterations, the Gibbs sampler converges. The right panel depicts the
posterior distribution of the parameter of interest (i.e., β1); the mean of
the posterior distribution is very close to the true value (i.e., -1.0). . . 66
5.4 Histograms of estimators of β1 in the Simulation Study. . . . . . . . 66
-
5.5 Repeated Count Measures Over the 12-week Study Period for the Smoking
Cessation Clinical Trial. The left graph plots the repeatedly measured
number of smoking episodes for the 90 smokers in the treatment group who
received contingency management. The right graph plots the repeatedly
measured number of smoking episodes for the 85 smokers in the control group
who did not receive contingency management. In both plots, the y-axis
indicates the count of tobacco uses in the previous week, and the x-axis
corresponds to the week number (1-12). The thick solid curves depict the
mean profiles in the two groups, while the dashed curves represent the
individual profiles. . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.6 Mean Carbon Monoxide Levels for Completers and Early Terminators.
By dividing the 174 smokers into two groups, Completers (n1 = 112) and
Early terminators (n2 = 62), the mean curves of carbon monoxide levels for
subjects receiving CM (contingency management) and for subjects receiving
no CM are depicted within each of the two groups (completers and early
terminators). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.7 Pattern-Dependent Distribution of Carbon Monoxide Levels. Using the
software package MPI 2.0, profiles and mean curves of carbon monoxide
levels are drawn within each of the five groups determined by dropout time:
dropout before week 5 (a), 7 (b), 9 (c), 11 (d), and 12 (e). In plots (a)
to (e), green curves correspond to the mean carbon monoxide levels of
subjects who received CM (contingency management), red curves indicate the
mean curves of subjects who did not receive CM, and gray dashed lines
depict the profiles of all subjects within each group. Plot (f) depicts all
the mean profiles corresponding to the five dropout patterns. . . . . . . 75
-
LIST OF TABLES
5.1 Averaged percentage bias (%) in the estimate of slope difference from the 100 simulated data sets. . . . . . . . . . . . . . . . . . . . . 57
5.2 Coverage rate (%) in the estimate of slope difference from the 100 simulated data sets. . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3 Estimated Treatment Effects and Parameters of the Dropout Model for
the Four Partially Imputed Carbon Monoxide Data Sets. . . . . . . . . . 63
5.4 Relative Bias (%), Standard Deviations and 95% Posterior Credible Intervals of the Averaged Treatment Effect Estimators (β1) in the Simulation Study. . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.5 Parameter Estimation with Standard Deviations and 95% Posterior Credible
Intervals for the Smoking Cessation Data Set. . . . . . . . . . . . . . 69
5.6 Estimation of the Treatment Effect Parameter (β1) in the Smoking Cessation Study with Artificially Generated Intermittent Missing Values. . . . 69
5.7 Estimation of Parameters in the Smoking Cessation Study with the
Method of Multiple Partial Imputation. . . . . . . . . . . . . . . . . . 71
5.8 Estimated Treatment Effect of Contingency Management (β1 (S.D.)) using the Pattern-Mixture Model with Two Patterns (Complete vs. Dropout). . . 74
5.9 Estimated Treatment Effect of Contingency Management using the
Pattern-Mixture Models within the Framework of 2-stage MPI. . . . . 74
5.10 Posterior Parameter Estimation with Standard Deviations and 95% Credible
Intervals Using the REMTM on the Continuous Carbon Monoxide
Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
-
ACKNOWLEDGMENTS
First and foremost, I would like to express my deepest gratitude to my advisor, my
collaborator, and my friend, Prof. Yingnian Wu. His continuous guidance, support,
and care made the research presented in this dissertation possible.
I also owe my heartfelt gratitude to Prof. Xiaowei Yang of UC Davis, one of the
top researchers on missing data modeling, for giving Yingnian and me this most valuable
opportunity to work with him and for guiding me throughout. This dissertation is based
on that fruitful collaboration. I admire Prof. Yang's scientific achievements, and also
cherish his warmth and kindness.
I would also like to thank the former fellow students in the vision lab for helpful
discussions. In particular, I thank Dr. Xiangrong Chen, Dr. Cheng-en Guo, and Zijian Xu
for their help with programming.
My sincere thanks go to Prof. Hongquan Xu, Prof. Rick Paik Schoenberg, and
Prof. Thomas R. Belin for serving on my doctoral committee and giving useful advice
on my dissertation.
Finally, I am forever grateful to my family and friends for their care and help.
-
VITA
1976 Born, Hubei, P.R. China
1999 B.S. in Statistics, Peking University, Beijing, P.R. China
2002 M.S. in Statistics, Peking University, Beijing, P.R. China
2004 M.S. in Statistics, UCLA
2002-2006 Graduate Student Researcher & Teaching Assistant, Department of
Statistics, UCLA
PUBLICATIONS
J. Li, X. Yang, Y. Wu, and S. Shoptaw, A Random-effects Markov Transition Model
for Poisson-distributed Repeated Measures with Nonignorable Missing Values, 2005
Proceedings of the American Statistical Association, Biometrics Section [CD-ROM],
Alexandria, VA: American Statistical Association.
J. Li, X. Yang, Y. Wu, and S. Shoptaw, A Random-effects Markov Transition Model
for Poisson-distributed Repeated Measures with Nonignorable Missing Values,
accepted by Statistics in Medicine.
-
X. Yang and J. Li (2006), Selection Models with Augmented Gibbs Samplers for
Continuous Repeated Measures with Nonignorable Dropout, to be submitted.
X. Yang, J. Li, and S. Shoptaw (2006), Multiple Partial Imputation for Longitudinal
Data with Missing Values in Clinical Trials, resubmitted to Statistics in Medicine.
Y. N. Wu, S.-C. Zhu, J. Li, and S. Bahrami (2006), Information Scaling in Natural
Images, resubmitted to JASA.
Y. N. Wu, J. Li, Z. Liu, and S.-C. Zhu (2006), Statistical Principles in Low-Level
Vision, tentatively accepted by Technometrics.
-
ABSTRACT OF THE DISSERTATION
Analysis of longitudinal data with missing values
by
Jinhui Li
Doctor of Philosophy in Statistics
University of California, Los Angeles, 2006
Professor Yingnian Wu, Chair
Biomedical research is plagued by problems of missing data, especially in clinical
trials of medical and behavioral therapies adopting a longitudinal design. After a
comprehensive literature review on modeling incomplete longitudinal data based on
full-likelihood functions, this dissertation proposes a Bayesian framework with
MCMC strategies for implementing two kinds of advanced models, selection models
and shared-parameter models, for handling intermittent missing values and dropouts
that are potentially nonignorable according to various criteria. We combine the
advantages of mixed-effect models and Markov transition models. Simulation studies
and applications to practical data were performed, and comparisons with likelihood
methods were made, to show the efficacy and consistency of our methods.
KEY WORDS: Selection Model; Shared Parameter Model; Markov Transition
Model; Nonignorable Missing Values.
-
CHAPTER 1
Introduction
Longitudinal data refer to data points collected over time on the same subjects. For
example, if some selected persons measure their weights once a week for ten consecutive
weeks, the collection of weights through time forms a longitudinal data set. Data
points repeatedly collected on the same subjects are called repeated measures.
Longitudinal data are very common in biomedical research and clinical trials, where
some characteristic or measurement of a person or a thing, for example the status
of a disease of one person or the value of a car, evolves or develops over time.
Each such characteristic or measurement is a variable.
We often have several variables, among which one is the target or response variable
and the others are potential explanatory variables. In principle any of them
can change and be measured over time, but usually it is the response variable that
constitutes the repeated measures. We are interested in how the change or variation
of the response variable can be explained by the explanatory variables, and
statistical modeling is used to approximate the relationship between the two kinds
of variables. Diggle, Liang and Zeger ([DLZ02]) presented a detailed summary of the
models used. Popular choices are linear models, generalized linear models, transition
models, and mixed-effect models. The simplest is the linear or additive model, where
we use a linear combination of the explanatory variables and random errors to
describe the response variable. We can also include generated new variables, i.e.,
functions of the original explanatory variables. If we use a linear combination of
the explanatory variables to describe a function of the expected value of the
response variable, we have a generalized linear model. Furthermore, if we let the
model for the current measure explicitly depend on its previous measures, it becomes
a (Markov) transition model. Finally, if we allow the coefficients of the explanatory
variables to vary from subject to subject, we have mixed-effect models, in contrast
with models whose coefficients are the same for all subjects and therefore describe
the population trend, which are called fixed-effect models. Based on the models used,
we can write down the likelihood function (or conditional likelihood function) as
the joint probability distribution function over all subjects and all time points.
MLE (maximum likelihood estimation) methods are then used for estimation and
inference.
With the same kinds of models, the computation gets much more complicated when
some of the data points are missing. Yet missing data are very common in biomedical
research with longitudinal studies, reflecting the problematic nature of the
phenomena under study, such as substance abuse ([NC02]), sexual behavior ([CS94])
and mental health disorders ([TSBU02]). The proportion of missing data is sometimes
noticeably large, e.g., as large as 70% at termination in a randomized study of
buprenorphine versus methadone ([FW95]). Although investigators may devote
substantial effort to minimizing the number of missing values, some amount of
missing data is inevitable in the practice of randomized medical clinical trials.
In this dissertation, we focus on missing values of the response variables. Missing
data can be categorized into two kinds: intermittent missing values (i.e., missing
values due to occasional withdrawal), with observed values afterwards, and dropout
values (i.e., missing values due to early withdrawal), with no more observed values.
Reasons for dropout are often study-related, e.g., negative side effects of the
tested medicines, ineffectiveness of the intervention, and inappropriate conduct of
the therapy ([YS05]). Without careful handling of dropouts, one may end up with
biased parameter estimates or invalid inferences. This may also be true for
intermittent missing values.
-
With missing data, it is not possible to compute the original joint probability
function of all the repeated measures directly. There are at least three ways to
deal with this situation: complete cases only, analyze as incomplete, and
imputation. Complete cases only means keeping only the complete cases, with
appropriate weights, and dropping all cases with missing values. Imputation fills
in the missing values with randomly generated ones; how to generate good numbers
that approximate the original ones is critical here. Analyze as incomplete
integrates out the intermittent missing values and ignores the dropout missing
values, which seems a straightforward choice, but it is computationally expensive
and not always stable.
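As a toy illustration of the first and third of these options is beyond a few lines, the sketch below contrasts complete-case analysis with a naive single imputation. It is illustrative only, not code from this dissertation, and the array values are made up:

```python
import numpy as np

nan = float("nan")
# Rows are subjects, columns are repeated measures; NaN marks a missing value.
y = np.array([[1.0, 2.0, 3.0],
              [2.0, nan, 4.0],
              [3.0, 4.0, nan]])

# "Complete cases only": keep subjects with no missing values at all.
complete = y[~np.isnan(y).any(axis=1)]      # only the first subject survives

# A naive single imputation: fill each hole with its column's observed mean.
col_mean = np.nanmean(y, axis=0)
imputed = np.where(np.isnan(y), col_mean, y)
```

Even this crude imputation shows the core idea: the filled-in matrix can be analyzed with complete-data methods, at the cost of understating uncertainty unless the imputation is done repeatedly, as in multiple imputation.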
All three approaches above focus on the response variable, since we are mainly
concerned with estimating the effect of the covariates on the response variable.
If the missing data do not affect the estimation at all, we can just use the
available data directly and ignore the missing ones. This raises the problem of
ignorability of the missing values. Usually ignorability is unknown to us, so
the safe and natural way is to model the repeated measures and the missingness
indicators jointly. There exist three ways to factor the joint distribution of the
complete data and missingness indicators: outcome-dependent factorization,
pattern-dependent factorization, and parameter-dependent factorization.
Correspondingly, there are three kinds of models: selection models, pattern-mixture
models and shared-parameter models. No matter which model is used, there are still
gaps in computing the joint probability function of all the repeated measures.
Computing this joint probability by integration is a common choice, but it is
usually slow and sometimes unstable. Because of this, we propose a Bayesian
framework with MCMC implementation to overcome the difficulty of integration.
Following this introductory chapter, this dissertation is organized as follows:
we review models commonly used for longitudinal data analysis, the missing data
issue, and ignorability in Chapter 2; we then review the Bayesian framework and
MCMC methodology in Chapter 3. Building on those methods, we describe our
methodology for modeling longitudinal data with missing values in Chapter 4.
Next, in Chapter 5 we present the results of simulation studies and applications
to practical data, making comparisons with maximum likelihood methods and showing
the consistency of our models. Finally, we discuss open problems and future work
in the last chapter.
-
CHAPTER 2
Models for longitudinal data analysis and missing data
issue
In most scientific research, we are concerned with relationships between different
quantities. For a longitudinal data set with a balanced design, our main interest
is in the relationship between the response variable (usually some measurement
related to the treatment effect) and the covariates (especially the treatment, if
there is any). Repeated measures are potentially observed on the ith subject at
the jth time point t_ij (i = 1, . . . , N; j = 1, . . . , J). In the following
discussion, we use capital symbols to represent variables, e.g., Y_ij for response
variables and X_ij = (X_ij1, . . . , X_ijp) for covariates or explanatory
variables. Symbols in lower case represent observed or missing values, i.e., the
realizations of variables: y_ij denotes the value of Y_ij and x_ijk denotes the
value of X_ijk recorded at time t_ij (i = 1, . . . , N; j = 1, . . . , J;
k = 1, . . . , p). Bold symbols represent vectors or matrices, e.g.,
y_i = (y_i1, . . . , y_iJ)^T is a vector of values for the repeated measures, and
x_i = [x_ijk]_{J×p} is a matrix of values of time-varying or time-independent
covariates on the ith subject.
For our current study, the response variables are of three common types:
continuous, count, and binary. The covariates can be time-dependent, but usually
they are time-independent in practice.
-
2.1 Models for longitudinal data analysis
The review in this section relies mainly on the detailed and systematic summary of
the widely used models by Diggle, Liang and Zeger ([DLZ02]). We can use linear
models, generalized linear models, transition models, and mixed-effect models to
approximate the relationship between the response variables and the covariates.
Different models need different assumptions, e.g., normality or independence.
Those assumptions should be checked before or after applying any model.
Although no model is absolutely correct to reveal the underlying mechanisms for
generating the real-world data, some provide useful approximations and are helpful for
us to understand the mechanisms.
2.1.1 Linear models
For continuous responses, the simplest model is a linear or additive model, where
we assume Y_ij can be expressed as a linear combination of the covariates plus a
random error,

Y_ij = Σ_{k=1}^p β_k x_ijk + ε_ij = x_ij β + ε_ij

where we assume the random error ε_ij ~ N(0, σ²), the most widely used normality
assumption. Note that usually x_ij1 = 1, so that β_1 is an intercept term.

Usually we assume the subjects come from a random sample of the population and are
therefore independent of each other. Without loss of generality, we can describe
the model for the ith subject. In matrix form, we have the model for Y_i as

Y_i ~ N(x_i β, σ² V)

where V is a general positive definite symmetric matrix. If we assume that V is a
diagonal matrix, or even an identity matrix, it means that the repeated measures
from the same subject are independent of each other. Usually they are dependent
and positively correlated, especially for neighboring measures, and V has no
simple structure.

We can set constraints on the elements v_jk of the variance-covariance structure
matrix V. Some common choices are:

v_jj = 1, v_jk = ρ

which is called compound symmetric, since the correlation between any pair
Y_ij, Y_ik is ρ; and

v_jk = ρ^{|j-k|}

which is called AR(1) (first-order autoregressive), where the correlation between
any pair Y_ij, Y_ik is determined by ρ and the length of time between the two time
points; since -1 ≤ ρ ≤ 1, the correlation is weaker for measures farther apart.
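The two correlation structures above can be built directly; the following sketch (illustrative only, not part of the dissertation's software) constructs the compound symmetric and AR(1) matrices for J repeated measures:

```python
import numpy as np

def compound_symmetric(J, rho):
    """v_jj = 1 and v_jk = rho for every pair j != k."""
    V = np.full((J, J), rho)
    np.fill_diagonal(V, 1.0)
    return V

def ar1(J, rho):
    """v_jk = rho ** |j - k|: correlation decays with the time lag."""
    idx = np.arange(J)
    return rho ** np.abs(idx[:, None] - idx[None, :])

V_cs = compound_symmetric(4, 0.5)   # every off-diagonal entry is 0.5
V_ar = ar1(4, 0.5)                  # entries 0.5, 0.25, 0.125 off the diagonal
```

Both matrices are symmetric and, for the values of ρ used here, positive definite, so either can serve as the V in the model above.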
2.1.2 Generalized linear models
Instead of directly linking the response to the covariates as for continuous
responses, we can link the marginal expectation of the response μ_ij = E(Y_ij) to
the covariates for other kinds of responses, giving the generalized linear models.

For binary responses, e.g., the status of smoking or not, or the indicator of
having a certain disease or not, let μ_ij = E(Y_ij) = P(Y_ij = 1); the link
function is the logit function

logit(μ_ij) = log(μ_ij / (1 - μ_ij)) = log(P(Y_ij = 1) / P(Y_ij = 0)) = x_ij β

which models the log transformation of the odds P(Y_ij = 1)/P(Y_ij = 0).

For count responses, e.g., the number of smoking episodes in a certain period of
time, it is often reasonable to assume the repeated measures follow a Poisson
distribution, i.e.,

P(Y_ij = y_ij) = (μ_ij^{y_ij} / y_ij!) exp(-μ_ij)

For the marginal expectation of the response, a log function serves as the link,

log(μ_ij) = x_ij β

Note that these models for the marginal expectation of the response assume
independence between the repeated measures, since there are only covariates and
fixed parameters in the model. This is usually not true, and we need more complex
models.
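The two link functions can be checked numerically; this short sketch is illustrative (the vectors x and β are made-up values, not estimates from the dissertation):

```python
import numpy as np

def inv_logit(eta):
    """Inverse of logit(mu) = log(mu / (1 - mu))."""
    return 1.0 / (1.0 + np.exp(-eta))

# Binary response: the linear predictor eta = x beta gives P(Y = 1) via inverse logit.
beta = np.array([-1.0, 0.5])
x = np.array([1.0, 2.0])       # x_ij1 = 1 plays the role of the intercept term
eta = x @ beta                 # -1.0 + 0.5 * 2 = 0.0
p = inv_logit(eta)             # P(Y_ij = 1)

# Count response: with a log link the Poisson mean is mu = exp(x beta).
mu = np.exp(eta)
```

Applying logit to p recovers eta exactly, which is what makes the logit a proper link: it maps the (0, 1) probability scale one-to-one onto the whole real line where x_ij β lives.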
2.1.3 Transition models
The above way of modeling the marginal expectation of the response through
generalized linear models ignores the fact that the repeated measures on the same
subject are usually not independent. It is quite straightforward to make the
distribution of the current repeated measure depend on its predecessors, usually
through some kind of residuals. This results in the transition models.

Assume a q-step transition, i.e., the current repeated measure depends on its q
immediate predecessors. For continuous variables, we may have

Y_ij = x_ij β + Σ_{r=1}^q α_r (Y_{i,j-r} - x_{i,j-r} β) + ε_ij

where ε_ij ~ N(0, σ²) as before. Here we connect Y_ij to its q immediate
predecessors through the residuals Y_{i,j-r} - x_{i,j-r} β, r = 1, . . . , q.
This is just a re-interpretation of the autoregressive covariance structure for
continuous variables.

For binary variables, we may model the one-step transition probabilities
P_kl = Pr(y_ij = l | y_{i,j-1} = k) (k = 0 or 1; l = 0 or 1) by logistic
regressions,

logit(P_01) = logit(Pr(y_ij = 1 | y_{i,j-1} = 0, x_ij)) = x_ij β_01   (2.1)
logit(P_10) = logit(Pr(y_ij = 0 | y_{i,j-1} = 1, x_ij)) = x_ij β_10   (2.2)

Similarly, for a count response following a Poisson distribution, we can make the
intensity (mean) depend on the q immediate predecessors,

log(μ_ij) = x_ij β + Σ_{r=1}^q α_r (log(max(1, y_{i,j-r})) - x_{i,j-r} β)

So transition models provide a natural way to directly connect the autocorrelated
repeated measures.
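The continuous one-step (q = 1) transition model can be simulated directly from its defining equation; the sketch below is illustrative and not taken from the dissertation's implementation (the design matrix and parameter values are invented):

```python
import numpy as np

def simulate_transition(x, beta, alpha, sigma, rng):
    """Simulate Y_ij = x_ij beta + alpha (Y_{i,j-1} - x_{i,j-1} beta) + eps_ij,
    a one-step (q = 1) transition model for a continuous response."""
    mean = x @ beta                      # marginal means x_ij beta
    y = np.empty(len(mean))
    y[0] = mean[0] + rng.normal(0.0, sigma)
    for j in range(1, len(mean)):
        # carry over alpha times the previous residual, then add fresh noise
        y[j] = mean[j] + alpha * (y[j - 1] - mean[j - 1]) + rng.normal(0.0, sigma)
    return y

rng = np.random.default_rng(0)
x = np.column_stack([np.ones(5), np.arange(5.0)])   # intercept and time
y = simulate_transition(x, np.array([1.0, 2.0]), alpha=0.5, sigma=0.3, rng=rng)
```

With sigma set to zero the residuals vanish, so the simulated trajectory collapses onto the marginal mean curve x β, which is a quick sanity check on the recursion.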
2.1.4 Random effect models
The previous models all assume fixed effects, i.e., the coefficients of the
covariates are the same for all subjects. Since there is variation from subject to
subject, it is reasonable to give each subject a different (random) coefficient on
each covariate to account for the between-subject variation. This introduces many
more parameters into the models and with them the danger of over-parameterization.
To avoid this, the natural constraint is to assume the coefficients for the same
covariate come from a common distribution, which usually is a normal distribution.

As an example, for continuous variables, we may have

Y_ij = x_ij β_i + ε_ij                    (2.3)
     = x_ij β + x_ij b_i + ε_ij           (2.4)

where ε_ij ~ N(0, σ²) as before, and the subject-specific coefficient
β_i ~ N(β, Σ) is decomposed into β and b_i to account for the population effect
and the subject (random) effect, respectively. b_i ~ N(0, Σ) is commonly used, and
a specification of Σ is needed. In practice, we often assume Σ is diagonal if
appropriate.

The covariates associated with the random effects can differ from those associated
with the fixed effects, but usually the former are a subset of the latter, which
means we do not have to assume a random effect for each covariate.
Random-intercept-only or random-intercept-and-slope structures are widely used.
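A random-intercept-and-slope version of equations (2.3)-(2.4) can be simulated as below; this is an illustrative sketch, with the design, parameter values, and function name invented for the example:

```python
import numpy as np

def simulate_mixed(t, beta, Sigma, sigma, n_subjects, rng):
    """Simulate Y_ij = x_ij (beta + b_i) + eps_ij with x_ij = (1, t_j):
    each subject draws a random intercept and slope b_i ~ N(0, Sigma)."""
    X = np.column_stack([np.ones_like(t), t])             # J x 2 design matrix
    b = rng.multivariate_normal(np.zeros(2), Sigma, size=n_subjects)
    eps = rng.normal(0.0, sigma, size=(n_subjects, len(t)))
    return (beta + b) @ X.T + eps                         # n_subjects x J matrix

rng = np.random.default_rng(0)
t = np.arange(4.0)
Y = simulate_mixed(t, np.array([1.0, 2.0]), 0.2 * np.eye(2), 0.5, 10, rng)
```

Setting Sigma and sigma to zero removes both sources of variation, and every subject's trajectory reduces to the common population line X β, making the decomposition in (2.4) concrete.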
2.1.5 Estimation and inference
With complete data and appropriate model assumptions, we can write down the
likelihood function. Estimation and inference based on MLE (maximum likelihood
estimation) or REML (restricted maximum likelihood estimation) are widely used;
see Diggle, Liang and Zeger ([DLZ02]) for details.
2.2 Missing data issue
Missing data make parameter estimation and inference much more complicated:
the likelihood function requires integration to evaluate. Missing values can occur
in both the response variables and the covariates, but in the current study we
assume that only the response variables have missing values.
When some values of the repeated measures are missing, we partition y_i into two
parts, y_i = (y_i^obs, y_i^mis), with y_i^obs indicating the observed values and
y_i^mis indicating the values that would have been observed had they not been
missing. When missing values are introduced by dropout, the pattern of missingness
can be indicated by a scalar d_i, which represents the actual time of withdrawal
for subject i (i.e., d_i = 2, . . . , J + 1), with d_i = J + 1 indicating
completion of the study. Since a subject who drops out at baseline does not
contribute to the likelihood function, the case d_i = 1 is excluded from our
consideration. A vector of missingness indicators is defined as
r_i = (r_i1, . . . , r_iJ)^T with elements

r_it = 0  if y_it is observed
       1  if y_it is intermittently missing
       2  if y_it is missing after dropout.

In principle, the joint distribution of the complete data (i.e., y_i) and the
missingness patterns (i.e., r_i) should be modeled in a statistical analysis, and
based on that we can write down the full likelihood function,

L(θ, ψ | y, r) ∝ Π_{i=1}^N ∫ f(y_i, r_i | θ, ψ) dy_i^mis

where θ represents the parameters of the model for repeated measures, and ψ
represents the parameters of the missingness mechanism.
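The coding of r_it can be computed mechanically from a subject's response vector. The sketch below is illustrative (not from the dissertation's MPI package): it classifies each NaN as intermittent when some later value is observed, and as post-dropout otherwise:

```python
import numpy as np

def missingness_indicators(y):
    """Code r_it = 0 (observed), 1 (intermittently missing: some later
    value is observed) or 2 (missing after dropout) for one subject's
    response vector, with NaN marking missing values."""
    y = np.asarray(y, dtype=float)
    miss = np.isnan(y)
    observed_later = np.zeros(len(y), dtype=bool)
    seen = False
    for j in range(len(y) - 1, -1, -1):   # scan backwards from the last visit
        observed_later[j] = seen
        seen = seen or not miss[j]
    r = np.zeros(len(y), dtype=int)
    r[miss & observed_later] = 1          # intermittent: a value follows later
    r[miss & ~observed_later] = 2         # nothing observed afterwards: dropout
    return r
```

For a subject measured at five visits with values (1.0, NaN, 2.0, NaN, NaN), the indicators come out as (0, 1, 0, 2, 2): the first gap is intermittent because visit 3 was observed, while the trailing gaps mark dropout.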
According to the possible graphical representations, there exist three ways to
factor the joint distribution of the complete data and the missingness indicators:
outcome-dependent factorization, pattern-dependent factorization, and
parameter-dependent factorization. Figure 2.1 shows the general directions of the
factorizations together with the associated models. Outcome-dependent factorization
means the repeated measures or outcomes affect the distribution of the missingness
indicators; pattern-dependent factorization means the missingness indicators
(i.e., missingness patterns) mark differences in the distribution of the repeated
measures; and parameter-dependent factorization means some common parameters
(covariates or random effects) affect both the repeated measures and the
missingness indicators.
The three corresponding models available for incomplete longitudinal data analysis
are:
Selection Model, which factors the joint distribution into a marginal distribution
for y_i and a conditional distribution of r_i given y_i, i.e.,

f(y_i, r_i | x_i, θ, ψ) = f(y_i | x_i, θ) f(r_i | y_i, x_i, ψ)

where f(r_i | y_i, x_i, ψ) can be interpreted as the self-selection of the ith
subject into a specific missingness group.

Figure 2.1: Three Representations of Missingness Mechanisms. Three ways of
modeling incomplete data are depicted here. Parameters and symbols in the figures
are defined as: y_i = (y_i^obs, y_i^mis), the observed and missing values for
subject i; r_i, the missingness indicators for repeated measures on subject i;
θ, the parameters of the data; ψ, the parameters of the missingness indicators;
and b_i, the parameters shared by the data and the missingness indicators.
Pattern-Mixture Model, which is a pattern-dependent model, assumes that the distribution of the repeated measures varies with the missingness patterns, and the joint distribution is factored as

f(y_i, r_i | x_i, θ, ψ) = f(y_i | r_i, x_i, θ) f(r_i | x_i, ψ).

Assuming that there are P patterns of missingness in a data set, the marginal distribution of y_i is a mixture,

f(y_i | x_i, θ) = Σ_{p=1}^{P} f(y_i | r_i = r_i^{(p)}, x_i, θ^{(p)}) π_p
where θ^{(p)} represents the parameters of f(y_i) in the pth pattern, π_p = Pr(r_i = r_i^{(p)} | x_i, ψ), and r_i^{(1)}, ..., r_i^{(P)} enumerate the P patterns. In pattern-mixture models, θ^{(1)}, ..., θ^{(P)} can differ in dimensionality or in value.
Shared-Parameter Model, in which we assume that y_i and r_i are conditionally independent of each other given a group of parameters b_i,

f(y_i, r_i | x_i, θ, ψ) = ∫ f(y_i | b_i, x_i, θ) f(r_i | b_i, x_i, ψ) f(b_i) db_i    (2.5)

The shared parameters b_i affect both y_i and r_i, and can be either observable variables (e.g., gender) or latent variables (e.g., random effects or latent scores). In the case of observed confounders, model (2.5) is in fact a mixture model and the analysis can be conducted as a stratified analysis.
My focus is on the selection model and the shared-parameter model, while Dr. Yang has done extensive research on the pattern-mixture model. One example is given in Yang and Li ([YL06]).
2.2.1 Ignorable versus Nonignorable
The above models are natural, and it is true that in certain biomedical studies both the missingness patterns and the values of the repeated measures are of interest. For example, in a heart-disease study, the repeatedly measured blood pressures and the survival times of the patients can be modeled jointly. In these scenarios, the above selection, pattern-mixture, and shared-parameter models can be applied directly or after some modification ([HL97]). In the majority of biomedical studies, however, only the parameters of the repeated measures themselves are of interest, while the parameters related to the missing values are viewed as nuisance. In this latter case, it is desirable that the missing data be ignored if possible.
Within the setting of outcome-dependent missingness, the concept of ignorability was defined and extensively addressed. According to Rubin ([R76]), missing values are ignorable when
(i) r_i is independent of y_i^{mis}, given y_i^{obs} and x_i;
(ii) θ and ψ are distinct.
Under this ignorability, the log-likelihood function for θ can be separated from the log-likelihood function for ψ, i.e.,

l(θ, ψ | y_i^{obs}, r_i) = l(θ | y_i^{obs}) + l(ψ | y_i^{obs}, r_i).
In Little and Rubin [LR02], outcome-dependent missingness was further divided into sub-categories:
missing completely at random (MCAR): r_i ⊥ y_i | x_i;
missing at random (MAR): r_i ⊥ y_i^{mis} | y_i^{obs}, x_i;
not missing at random (NMAR): none of the above.
For intermittent missing values, ignorability can be interpreted as whether the values can be interpolated from neighboring observed values. For dropouts, the assumption corresponds to whether the missing values after dropout can be extrapolated from the previously observed values. In certain applications, occasional omissions or nonresponses are due to reasons that are purely random in nature, e.g., schedule conflicts or bad weather, and thus can be assumed to be ignorable. Nonetheless, subjects usually withdraw from a study because of study-related reasons, e.g., dissatisfaction with the intervention or serious side effects of a medical therapy, and hence dropouts are nonignorable ([VM00, DLZ02]).
The definition of ignorability may be extended to meet the needs of pattern-mixture and shared-parameter models. Informally, we interpret ignorability
as a condition under which the observed repeated measures can be used to estimate θ without bias. For selection and pattern-mixture models, as long as r_i and y_i^{mis} are independent of each other given y_i^{obs} and x_i, the missing data can be ignored. For shared-parameter models, ignorability corresponds only to the case where the b_i are observable variables, which are usually viewed as a subset of x_i. Unless r_i and y_i share no random effects, a shared-parameter model generally carries a nonignorability assumption.
However, it is difficult or impossible to check ignorability directly. Therefore, we take an approach that jointly models the missingness patterns and the repeated measures without assuming ignorability. In the end, we can assess ignorability from the results of our models.
2.3 Modelling with missing values
In this section, we give a brief summary of several common approaches to modelling with missing values, including imputation-based methods, illustrated by multiple partial imputation, and integration-based methods ("analyze as incomplete" approaches), illustrated by the selection model for continuous data with dropout and the shared-parameter model for binary responses with both intermittent missingness and dropout.

Imputation-based methods usually need critical assumptions on the missingness mechanisms; wrong assumptions lead to imputed values whose distribution is inconsistent with that of the observed values. The previously proposed integration-based methods are generally computationally expensive and tend to be unstable.
2.3.1 MULTIPLE PARTIAL IMPUTATION
For incomplete longitudinal data sets, the method of multiple imputation (see Rubin [R87]) is especially useful. Accurately predicting missing values is possible because repeated measures are often highly correlated with each other. When imputing, all three of the above modeling options can be used. In longitudinal data sets, the missingness patterns and mechanisms for intermittent missing values and dropouts are apt to be distinct, thus requiring different treatment. Empirical experience suggests that in certain clinical trials intermittent missing values are ignorable, being due to factors unrelated to the theme of the study, while dropouts should not be simply ignored. In Yang and Shoptaw ([YS05]), a partial version of multiple imputation, MPI, was first proposed, within which only the intermittent missing values are imputed. As seen in applications of the pattern-mixture model, imputation methods can be further employed to implement various identifying-restriction schemes for managing dropouts. This leads to a further extension of multiple imputation, termed 2-stage MPI here. Depending on the assumptions about the mechanisms of intermittent missingness and dropout, there exist many specific forms of MPI and 2-stage MPI.
They further partitioned y_i^{mis} into (y_i^{IM}, y_i^{DM}) to denote intermittent missing values and dropouts. For MPI, they draw m > 1 independent values y_i^{IM(1)}, y_i^{IM(2)}, ..., y_i^{IM(m)} from the posterior predictive distribution p(y_i^{IM} | y_i^{obs}, r_i). Within the 2-stage MPI, for each of the partial imputations of the intermittent missing values, n conditionally independent values y_i^{DM(j,1)}, y_i^{DM(j,2)}, ..., y_i^{DM(j,n)} are additionally drawn from the predictive distribution p(y_i^{DM} | y_i^{obs}, y_i^{IM(j)}), j = 1, ..., m. The 2-stage MPI provides a natural framework for fitting pattern-mixture models under identifying restrictions, with information borrowed from completers, neighboring cases, or available cases. If we use selection models or shared-parameter models, imputations for dropouts can be similarly conducted by applying appropriate MCMC algorithms.
A main concern for multiple imputation is how to combine the multiple point estimators to make an overall inferential statement. A set of combination rules was originally developed by Rubin and Schenker ([RS86]), which can be used directly for MPI. In Shen [S00], the idea was extended to the case of 2-step multiple imputation, which can be viewed as a general approach for 2-stage MPI. More specifically, m × n complete data sets are obtained eventually in the 2-stage MPI,

{ y_i^{(j,k)} = (y_i^{obs}, y_i^{IM(j)}, y_i^{DM(j,k)}) : j = 1, ..., m, k = 1, ..., n }.

A noticeable problem with these complete data sets is that they are not independent of each other, because each block or nest (y_i^{DM(j,1)}, y_i^{DM(j,2)}, ..., y_i^{DM(j,n)}) contains identical values of y_i^{IM(j)}. Denoting by Q^{(j,k)} and U^{(j,k)} the point and variance estimates for Q from the (j,k)th completed data set, the overall point estimate for Q is still the simple grand average, i.e.,

Q̄ = (1/mn) Σ_{j=1}^{m} Σ_{k=1}^{n} Q^{(j,k)}    (2.6)
The associated variance for Q̄ involves three components, i.e.,

T = Ū + (1 + 1/m) B + (1 − 1/n) W    (2.7)

where

Ū = (1/mn) Σ_{j=1}^{m} Σ_{k=1}^{n} U^{(j,k)}

estimates the complete-data variance,

B = (1/(m − 1)) Σ_{j=1}^{m} (Q̄^{(j,·)} − Q̄)²

indicates the between-nest variance,

W = (1/m) Σ_{j=1}^{m} (1/(n − 1)) Σ_{k=1}^{n} (Q^{(j,k)} − Q̄^{(j,·)})²

represents the within-nest variance, and Q̄^{(j,·)} = (1/n) Σ_{k=1}^{n} Q^{(j,k)}. Inferences about Q are based on the Student's t-distribution, (Q − Q̄)/√T ∼ t_ν, with degrees of freedom ν given by

ν^{−1} = (1/(m(n − 1))) [(1 − 1/n) W / T]² + (1/(m − 1)) [(1 + 1/m) B / T]².

Other formulas, such as the rates of missing information and the relative efficiency, can be found in Shen [S00].
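Equations (2.6) and (2.7) translate directly into code. The sketch below is an illustrative implementation; the function name and the toy m × n grid of estimates are made up for demonstration:

```python
import numpy as np

def combine_2stage_mpi(Q, U):
    """Combine point estimates Q[j, k] and variances U[j, k] from the
    m x n completed data sets of a 2-stage MPI, following eqs. (2.6)-(2.7)."""
    Q = np.asarray(Q, dtype=float)
    U = np.asarray(U, dtype=float)
    m, n = Q.shape
    Q_bar = Q.mean()                      # grand average, eq. (2.6)
    U_bar = U.mean()                      # average complete-data variance
    nest_means = Q.mean(axis=1)           # nest means Q^(j,.)
    B = nest_means.var(ddof=1)            # between-nest variance
    W = np.mean([Q[j].var(ddof=1) for j in range(m)])  # within-nest variance
    T = U_bar + (1 + 1/m) * B + (1 - 1/n) * W          # total variance, eq. (2.7)
    inv_df = ((1 - 1/n) * W / T) ** 2 / (m * (n - 1)) \
           + ((1 + 1/m) * B / T) ** 2 / (m - 1)
    return Q_bar, T, 1.0 / inv_df

# toy example: m = 3 nests, n = 4 imputations per nest
rng = np.random.default_rng(0)
Q = 5.0 + rng.normal(0.0, 0.2, size=(3, 4))
U = np.full((3, 4), 0.5)
Q_bar, T, df = combine_2stage_mpi(Q, U)
```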
Harel and Schafer ([HS02]) proposed using two-stage multiple imputation to handle two qualitatively different types of missingness, not necessarily classified as intermittent missingness and dropout. This idea might be extended to multiple-stage multiple imputation to handle several qualitatively different types of missingness.
2.3.2 Selection model
Diggle and Kenward ([DK94]) studied modelling continuous response variables with dropout, where the missingness indicator r_ij = 1 means dropout and r_ij = 0 means observed.

Let t_{d_i} denote the dropout time for the ith subject, where 2 ≤ d_i ≤ J + 1 (d_i = J + 1 indicates a subject who has completed the study). Then the missingness indicator r_i is a vector of d_i − 1 consecutive zeros followed by J + 1 − d_i consecutive ones. Suppressing the dependence on covariates, the selection model of Diggle and Kenward [DK94] assumes:
(i) Pr(r_ij = 1 | j > d_i) = 1;
(ii) for j ≤ d_i, Pr(r_ij = 1) depends on y_ij and its history H_ij = (y_i1, ..., y_{i,j−1})^T;
(iii) the conditional distribution of y_ij given H_ij is f_ij(y_ij | H_ij, θ).
Under these assumptions, the full likelihood function for the ith subject is expressed as

L_i(θ, ψ | y_i^{obs}, r_i) ∝ Π_{j=1}^{d_i−1} f_ij(y_ij | H_ij, θ) · Π_{j=1}^{d_i−1} [1 − p_j(y_ij, H_ij)] · Pr(r_{i,d_i} = 1 | H_{i,d_i})

where p_j(y_ij, H_ij) = Pr(r_ij = 1 | y_ij, H_ij, ψ) is the probability of dropout at time t_ij. The dropout probability is

Pr(r_{i,d_i} = 1 | H_{i,d_i}) = ∫ Pr(r_{i,d_i} = 1 | y, H_{i,d_i}, ψ) f_{i,d_i}(y | H_{i,d_i}, θ) dy    (2.8)

if d_i < J + 1, and Pr(r_{i,d_i} = 1 | H_{i,d_i}) = 1 if d_i = J + 1.
A natural choice for Pr(r_ij = 1 | y_ij, H_ij, ψ) is a logistic regression model,

logit(Pr(r_ij = 1 | y_ij, H_ij, ψ)) = ψ_0 + H_ij^T ψ_1 + ψ_2 y_ij

where ψ_2 ≠ 0 implies that the dropout process is outcome-dependent and nonignorable.
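For illustration, the dropout model can be evaluated directly once the linear predictor is specified. In the sketch below the history H_ij is summarized by the previous response only, and the ψ values are hypothetical, not estimates from any data set:

```python
import math

def dropout_prob(y_prev, y_curr, psi0=-3.0, psi1=0.2, psi2=0.5):
    """Dropout probability under a Diggle-Kenward-type logistic model
    with the history H_ij summarized by the previous response
    (all psi values are illustrative)."""
    eta = psi0 + psi1 * y_prev + psi2 * y_curr
    return 1.0 / (1.0 + math.exp(-eta))

# psi2 != 0: dropout depends on the possibly unobserved current value,
# so the mechanism is outcome-dependent and nonignorable
p_low = dropout_prob(y_prev=1.0, y_curr=0.0)
p_high = dropout_prob(y_prev=1.0, y_curr=4.0)
```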
The full log-likelihood function of the whole data set for (θ, ψ) can be partitioned as

l(θ, ψ) = l_1(θ) + l_2(ψ) + l_3(θ, ψ)

where

l_1(θ) = Σ_{i=1}^{N} log f(y_i^{obs} | θ)

corresponds to the observed-data log-likelihood function for θ, while

l_2(ψ) = Σ_{i=1}^{N} Σ_{j=2}^{d_i−1} log[1 − Pr(r_ij = 1 | y_ij, H_ij, ψ)]

and

l_3(θ, ψ) = Σ_{i ≤ N; d_i ≤ J} log Pr(r_{i,d_i} = 1 | H_{i,d_i})

together determine the log-likelihood function of the dropout process, which contains partial information on θ.

If dropouts are ignorable, then l_3(θ, ψ) depends only on ψ and thus can be absorbed into l_2(ψ). In this case, estimation of θ can be derived solely from l_1(θ).
For a normal longitudinal data set, y_i ∼ N(X_i β, Σ_i(τ)) with parameters θ = (β^T, τ^T)^T, the conditional distribution f_ij(y | H_ij, θ) is a scalar normal, and the marginal distribution, Π_{j=1}^{d_i−1} f(y_ij | H_ij, θ) = f(y_i^{obs}), is also multivariate normal. To evaluate the integral in equation (2.8), they used a probit approximation to the logit transformation of the dropout probability and converted the problem into moment computations for the conditional normal distribution of the first dropout missing value.
With the likelihood function at hand, Diggle and Kenward resorted to the simplex algorithm ([NM65]) to obtain the maximum likelihood estimates. The simplex algorithm does not depend on derivatives, but it converges unacceptably slowly and provides no Fisher information matrix.
Selection models originated from the Tobit model of Heckman ([H76]). Verbeke and Molenberghs ([VM00]) addressed the theoretical translation from the Tobit model to Diggle and Kenward's selection model. Subsequently, Troxel, Harrington, and Lipsitz ([THL98]) extended it to the non-monotone setting. Selection models for categorical and other types of measures were also developed; see Fitzmaurice, Molenberghs, and Lipsitz [FML], Molenberghs, Kenward, and Lesaffre [MKL97], Nordheim [N84], and Kenward and Molenberghs [KM99].
2.3.3 REMTM for binary response variables
When the dynamic features of the transition pattern in longitudinal data are of interest, an appropriate longitudinal approach is a transition model. For binary repeated measures with nonignorable missing values, Albert and Follmann ([AF03]) developed a Markov transition model with random effects shared by the sub-model for the measurements and the sub-model for the missingness indicators. We let REMTM stand for random-effects Markov transition model and review their models in this section.
In the REMTM for incomplete binary repeated measures, the sub-model for the measurement process is assumed to be a first-order Markov chain for each series of binary measures. The response variable can jump between the two states 0 and 1. The transition probabilities P_kl = Pr(y_ij = l | y_{i,j−1} = k) (k = 0, 1; l = 0, 1) can be modeled by logistic regressions with random intercepts,

logit(P_01) = logit(Pr(y_ij = 1 | y_{i,j−1} = 0, x_ij, b_i)) = x_ij^T β_01 + b_i    (2.9)
logit(P_10) = logit(Pr(y_ij = 0 | y_{i,j−1} = 1, x_ij, b_i)) = x_ij^T β_10 + γ b_i    (2.10)

where b_i ∼ iid N(0, σ_b²) denotes the random intercept and γ is the heterogeneity parameter indicating the correlation between P_10 and P_01.
The distribution of the missingness indicators r_i = (r_i1, ..., r_iJ)^T can be modeled by another Markov transition model. Here

r_ij = 0 if y_ij is observed; 1 if y_ij is intermittently missing; 2 if y_ij is missing due to dropout.

They used a first-order Markov process with 3 × 3 transition probabilities, P_kl = Pr(r_ij = l | r_{i,j−1} = k) (k = 0, 1, 2; l = 0, 1, 2). By certain structural restrictions, the following transition probabilities are always equal to zero: P_12 = P_20 = P_21 = 0. For the other combinations of r_{i,j−1} and r_ij, the transition
probabilities are calculated in the following way. First, if the previous repeated measure is observed (i.e., r_{i,j−1} = 0), then the current one can be observed, intermittently missing, or a dropout; a 3-category multinomial-logit model is used to calculate the transition probabilities:

P(r_ij = k | b_i, x_ij, r_{i,j−1} = 0) =
    1 / (1 + Σ_{l=1}^{2} exp(x_ij^T α_l + τ_l b_i))                          if k = 0,
    exp(x_ij^T α_k + τ_k b_i) / (1 + Σ_{l=1}^{2} exp(x_ij^T α_l + τ_l b_i))  if k = 1 or 2.
Second, if the previous measure is intermittently missing, then the current one may only be observed or intermittently missing again. Correspondingly, a logistic regression model is used to calculate P_10 and P_11, i.e.,

P(r_ij = k | b_i, x_ij, r_{i,j−1} = 1) =
    1 / (1 + exp(x_ij^T α_1 + τ_1 b_i))                          if k = 0,
    exp(x_ij^T α_1 + τ_1 b_i) / (1 + exp(x_ij^T α_1 + τ_1 b_i))  if k = 1.
Third, state 2 is absorbing, so we always have P(r_ij = 2 | b_i, x_ij, r_{i,j−1} = 2) = 1. In the above multinomial-logit and logistic models, the regression coefficients α_1 and α_2 respectively indicate whether intermittent missingness and dropout depend on the covariates, while the coefficients τ_1 and τ_2 respectively indicate whether intermittent missingness and dropout are informative (i.e., nonignorable).
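Under these restrictions, the full 3 × 3 transition matrix of the missingness indicator has a simple closed form. The sketch below assembles it for a scalar covariate; the function name and all parameter values are hypothetical:

```python
import numpy as np

def missingness_transitions(x, b, alpha1, alpha2, tau1, tau2):
    """3 x 3 matrix P[k, l] = Pr(r_ij = l | r_{i,j-1} = k) for the REMTM
    missingness sub-model (states: 0 observed, 1 intermittent, 2 dropout)."""
    e1 = np.exp(x * alpha1 + tau1 * b)   # odds term for intermittent missing
    e2 = np.exp(x * alpha2 + tau2 * b)   # odds term for dropout
    P = np.zeros((3, 3))
    P[0] = np.array([1.0, e1, e2]) / (1.0 + e1 + e2)  # multinomial logit
    P[1] = np.array([1.0, e1, 0.0]) / (1.0 + e1)      # logistic; P[1, 2] = 0
    P[2] = np.array([0.0, 0.0, 1.0])                  # dropout is absorbing
    return P

P = missingness_transitions(x=1.0, b=0.3, alpha1=-1.5, alpha2=-2.0,
                            tau1=0.8, tau2=0.4)
```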
By combining the above sub-models for measurement and missingness, the likelihood function for the parameters θ = (β_01, β_10, γ, σ_b²) and ψ = (α_1, α_2, τ_1, τ_2) is expressed as

L(θ, ψ) ∝ Π_{i=1}^{n} ∫ { Π_{j=1}^{T_i} p(y_ij | x_ij, y_{i,j−1}, b_i, θ) } { Π_{j=1}^{J} p(r_ij | b_i, x_ij, r_{i,j−1}, ψ) } p(b_i) db_i

where p(b_i) is the pdf of N(0, σ_b²) and T_i is the last observed time point.
They derived the K-step transition probabilities from the one-step transition probabilities to handle the intermittent missing values. They then maximized the above likelihood function with a Newton-Raphson algorithm, evaluating the integral by Gaussian quadrature.
It should be remarked here that the above REMTM can easily be extended to deal with other types of repeated measures. In Li et al. [LYWS06] we applied REMTMs to Poisson-distributed repeated measures with nonignorable missing values. The random intercept in the models can be replaced with other types of random effects, including random slopes and random cohort effects. The REMTM is only one specific example of shared-parameter models; other longitudinal models, such as marginal models or random-effects models, can also be used to implement shared-parameter modeling. The shared-parameter model was first developed by Wu and Carroll ([WC88]), where certain parameters are shared by the measurement model and a censoring process.
CHAPTER 3
Bayesian framework with MCMC approach
As stated before, for longitudinal models based on full-likelihood functions, parameter estimation based on asymptotic normal theory is difficult to apply, mainly because of the complicated form of the likelihood functions. Without an analytical solution for the score function and Hessian matrix, optimization is challenging. Diggle and Kenward ([DK94]) resorted to the simplex algorithm ([NM65]), which does not depend on derivatives; unfortunately, being derivative-free, it converges unacceptably slowly and provides no Fisher information matrix. We implemented the nonlinear optimization algorithms of Dennis and Schnabel ([DS96]) with numerical derivatives, but found that the global maximum remained difficult to obtain in practical settings. Calculating the dropout probabilities in the selection model or integrating over the random effects in the REMTMs demands time-consuming numerical integration, such as Gauss-Hermite quadrature. Bayesian methods based on MCMC provide a more affordable and appropriate alternative for parameter estimation and inference. By sampling both parameters and missing values, an MCMC method using the Gibbs sampler offers a natural option for integration and optimization, without relying on fully determined density functions or expensive derivatives.
In this chapter, we review the basics of the Bayesian framework and MCMC methods, based on the systematic description by Carlin and Louis [CL00].
3.1 Bayesian methods
3.1.1 Bayes Rule
Basically, if we have data D, i.e., collected samples, at hand, and we want to estimate some parameters θ with the data and a specified model, the likelihood function is the joint probability distribution of the data given θ,

L(θ) = f(D | θ).

The maximum likelihood method estimates the parameters, treated as unknown constants, entirely from this likelihood function. The Bayesian approach differs in that it takes the parameters as random draws from a prior distribution q(θ) and bases estimation and inference entirely on the posterior distribution P(θ | D), which follows from simple calculus in the form of Bayes' rule,

P(θ | D) = q(θ) f(D | θ) / f(D)

where

f(D) = ∫ q(θ) f(D | θ) dθ.
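As a concrete illustration of Bayes' rule, the posterior for a coin's heads-probability θ can be computed on a discrete grid; the grid, the flat prior, and the data below are made up for demonstration:

```python
import numpy as np

theta = np.linspace(0.01, 0.99, 99)        # grid over the parameter
prior = np.ones_like(theta) / theta.size   # flat prior q(theta)
heads, tosses = 7, 10                      # observed data D
likelihood = theta**heads * (1 - theta)**(tosses - heads)  # f(D | theta)
marginal = np.sum(prior * likelihood)                      # f(D)
posterior = prior * likelihood / marginal                  # P(theta | D)
post_mean = np.sum(theta * posterior)      # close to (heads + 1)/(tosses + 2)
```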
3.1.2 Prior distribution
For the Bayesian approach, introducing a prior distribution provides extra information. There are three major ways to specify it:

Elicited priors: the prior distribution is specified by experts with experience and knowledge in the field;

Conjugate priors: the prior distribution is selected from a family that is conjugate to the likelihood function, so that the posterior distribution belongs to the same distributional family as the prior;
Noninformative priors: a normal distribution with infinite variance or a uniform distribution over the support of θ is used as the prior distribution.

Among the three kinds of priors, noninformative priors are simple and provide essentially no extra information, and thus avoid the danger of introducing incorrect extra information. With enough data, the information from the data usually dominates the information from the prior, so the choice does not matter much.
3.1.3 Bayesian inference
Within the Bayesian framework, we can perform point estimation, interval estimation, and hypothesis testing based on the posterior distribution. The first two are the most important.

Point estimation: First let us look at the univariate parameter case. The estimator θ̂(D) can be a summary feature such as the mean, median, or mode of the posterior distribution P(θ | D). Basic observations on these three features are:

The posterior mode equals the maximum likelihood estimator with a flat prior;
The mean and the median are identical for symmetric posterior densities;
All three measures coincide for symmetric unimodal posteriors.
The posterior variance with respect to θ̂(D), or mean squared error, defined as

E_{θ|D}[(θ − θ̂(D))²],

provides a measure of accuracy of the point estimator. Denote by μ(D) = E_{θ|D}(θ) the posterior mean, which of course depends on the data D. We have the following decomposition:

E_{θ|D}[(θ − θ̂(D))²] = E_{θ|D}[(θ − μ(D) + μ(D) − θ̂(D))²]
                     = E_{θ|D}[(θ − μ(D))²] + (μ(D) − θ̂(D))²
                     ≥ E_{θ|D}[(θ − μ(D))²] = Var_{θ|D}(θ).

Equality is achieved if and only if θ̂(D) = μ(D), i.e., when we use the posterior mean as the estimator.

When θ is a vector, the posterior mode is more difficult to obtain, and we still have

E_{θ|D}[(θ − θ̂(D))(θ − θ̂(D))^T] = Var_{θ|D}(θ) + (μ(D) − θ̂(D))(μ(D) − θ̂(D))^T

with the minimum still achieved at θ̂(D) = μ(D). So generally we use the posterior mean as the estimator to attain the smallest variance.
Interval estimation: The counterpart of a frequentist confidence interval (CI) in the Bayesian framework is called a credible set, or Bayesian confidence interval. A 100(1 − α)% credible set for θ is a subset S of the support of θ such that

1 − α ≤ P(S | D) = ∫_S P(θ | D) dθ.
However, there is a computational issue with the Bayesian approach. The posterior distribution

P(θ | D) = q(θ) f(D | θ) / ∫ q(θ) f(D | θ) dθ

is the basis for all estimation and inference, but the integral in the denominator is difficult to compute explicitly. In addition, we still need to compute the posterior mean and variance, which involve additional integration, usually in a high-dimensional space,

E_{θ|D}(θ) = ∫ θ P(θ | D) dθ
Var_{θ|D}(θ) = ∫ (θ − E_{θ|D}(θ))² P(θ | D) dθ.

Fortunately, we can perform estimation and inference through sampling, even in the presence of missing data; this is the idea of Monte Carlo methods. MCMC is a powerful tool for overcoming this computational difficulty. We review the basics of MCMC in the next section.
3.2 MCMC methods
MCMC (Markov chain Monte Carlo) methods are now the most widely used methods for statistical computing. They are based on a class of algorithms that sample from probability distributions by constructing a Markov chain whose stationary distribution is the desired distribution. After a large number of steps, a state of the chain can be used as a sample from the target distribution. Gilks et al. ([GRS96]) provide a comprehensive summary of the related theory and practice.
3.2.1 Monte Carlo methods
Monte Carlo methods refer to integration by sampling and averaging. Suppose we have θ ∼ q(θ) and want to compute the mean of g(θ),

μ = E(g(θ)) = ∫ g(θ) q(θ) dθ.

Given an iid (independent and identically distributed) sample θ_1, ..., θ_N ∼ q(θ), the average

μ̂ = (1/N) Σ_{i=1}^{N} g(θ_i)

converges to μ with probability 1 as N → ∞ by the Law of Large Numbers. This is a great property: as long as we have such an iid sample θ_1, ..., θ_N ∼ q(θ), we can approximate the expectation of any function of θ. So the problem of integration becomes how to get a good sample.
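This sample-and-average idea can be sketched in a few lines, using the known case E(θ²) = 1 for θ ∼ N(0, 1) as the target:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100_000
theta = rng.standard_normal(N)   # iid sample from q(theta) = N(0, 1)
mu_hat = np.mean(theta**2)       # converges to E(theta^2) = 1 by the LLN
```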
Usually the target distribution q(θ) is quite complex and hard to sample from directly. When it can be evaluated, two commonly used sampling techniques are importance sampling and rejection sampling.
Importance sampling: we sample from a simpler proposal distribution h(θ) instead of q(θ) and define the weight function w(θ) = q(θ)/h(θ); then we have

μ = E(g(θ)) = ∫ g(θ) q(θ) dθ = ∫ g(θ) w(θ) h(θ) dθ

and

μ̂ = (1/N) Σ_{i=1}^{N} g(θ_i) w(θ_i),    θ_1, ..., θ_N ∼ h(θ).

The optimal proposal distribution is

h*(θ) = |g(θ)| q(θ) / ∫ |g(θ)| q(θ) dθ,

which minimizes the variance of the estimator. The variance of the weighted summand is

Var_{h(θ)}(g(θ) w(θ)) = E_{h(θ)}(g²(θ) w²(θ)) − μ².
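A minimal importance-sampling sketch for the same kind of expectation, with target q = N(0, 1), proposal h = N(0, 2²), and g(θ) = θ², so that the true value is 1; all choices here are illustrative:

```python
import numpy as np

def normal_pdf(x, sd):
    return np.exp(-0.5 * (x / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(1)
N = 200_000
theta = rng.normal(0.0, 2.0, size=N)                 # sample from proposal h
w = normal_pdf(theta, 1.0) / normal_pdf(theta, 2.0)  # weights w = q / h
mu_hat = np.mean(theta**2 * w)                       # estimates E_q(theta^2)
```

Note that the wide proposal keeps the weights bounded; a proposal narrower than the target would make the weights explode in the tails.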
Rejection sampling: if there exists a simpler distribution h(θ) and a constant M such that q(θ) ≤ M h(θ), we can sample in this way:

1. Sample θ_i from h(θ);
2. Sample u from the uniform distribution U(0, 1);
3. Keep θ_i if u < q(θ_i) / (M h(θ_i)); otherwise reject it and return to step 1.
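These three steps can be sketched as follows, drawing from a standard normal target through a Laplace proposal; for this pairing the smallest valid envelope constant works out to M = √(2e/π):

```python
import numpy as np

rng = np.random.default_rng(2)

def q(x):                          # target density: standard normal
    return np.exp(-0.5 * x * x) / np.sqrt(2.0 * np.pi)

def h(x):                          # proposal density: Laplace(0, 1)
    return 0.5 * np.exp(-np.abs(x))

M = np.sqrt(2.0 * np.e / np.pi)    # envelope: q(x) <= M * h(x) for all x

samples = []
while len(samples) < 10_000:
    x = rng.laplace(0.0, 1.0)      # step 1: sample from h
    u = rng.uniform()              # step 2: sample u ~ U(0, 1)
    if u < q(x) / (M * h(x)):      # step 3: accept with prob q / (M h)
        samples.append(x)
samples = np.array(samples)
```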
Unfortunately, the posterior distribution in the Bayesian approach is most of the time not even computable. In this case, we can turn to MCMC. It samples in the following way: start from some initial values and keep updating iteratively with a Markov chain (or chains) until the chain reaches its stationary distribution. The Metropolis-Hastings algorithm and Gibbs sampling are the most popular methods. Gibbs sampling allows us to sample from a distribution known only up to proportionality, and thus avoids computing the integral in the denominator of the posterior distribution, so it is very useful for us.
3.2.2 MCMC fundamentals
Mathematically, starting from an initial distribution θ^(0) ∼ q(θ^(0)), the strategy for generating new samples θ^(i), while exploring the state space with a Markov chain mechanism, is to sample from

P(θ^(i) | θ^(i−1), ..., θ^(0)) = K(θ^(i) | θ^(i−1)).

To eventually sample from P(θ), the requirement is that

q(θ^(0)) K^t → P(θ)  as  t → ∞.

Then the stationary distribution of the Markov chain must be P(θ), where stationarity means

P K = P.

If this is the case, we can start from an arbitrary state, run the Markov chain for a long time, then stop and output the current state θ^(t), which is a sample from P(θ).
To ensure that the Markov chain converges to a unique stationary distribution the
following conditions are sufficient:
30
-
Irreducibility: every state is eventually reachable from any starting state θ^(0); for any state θ there exists a t such that

Pr^(t)(θ | θ^(0)) > 0;

Aperiodicity: the chain does not get caught in cycles.
The process is ergodic if it is both irreducible and aperiodic.
To ensure that the stationary distribution of the Markov chain is P(θ), it is sufficient for the functions P and K to satisfy the detailed balance (reversibility) condition:

P(θ^(i)) K(θ^(i−1) | θ^(i)) = P(θ^(i−1)) K(θ^(i) | θ^(i−1)).
Usually, with some care, these conditions are satisfied for an MCMC algorithm. In practice, however, we need to tell when the stationary distribution, i.e., convergence, is reached; since the functions K and P are difficult or impossible to compute, it is not practical to use the above formula directly. This task is called convergence diagnostics.
The pre-convergence period is called the burn-in period. Basically, there are multiple-chain approaches, which require independently running multiple chains, and single-chain approaches, for which running a single chain is enough. The advantage of multiple-chain approaches is that we obtain independent samples across chains; the disadvantage is a huge waste of samples. By contrast, single-chain approaches are less expensive, but yield only correlated samples, which must be thinned to obtain roughly independent ones.
One representative and popular multiple-chain approach was proposed by Gelman and Rubin ([GR92]). They run a small number n of parallel chains from different starting points that are over-dispersed with respect to the true posterior, for 2N iterations each, and keep the samples from the latter N iterations; they then compare the variation within the chains, for a given parameter of interest, to the total variation across the chains. Specifically, they monitor convergence by the estimated scale reduction factor R̂, defined by

R̂ = ( (N − 1)/N + ((n + 1)/(nN)) (B/W) ) · df/(df − 2)

where B/N is the variance between the means of the n parallel chains, W is the average of the n within-chain variances, and df is the degrees of freedom of an approximating t density to the posterior distribution. They showed that

R̂ → 1  as  t → ∞,

which means that after a long running time the between-chain variance equals the within-chain variance. Since this diagnostic applies to one parameter at a time, we can randomly select some parameters to monitor; there are also ways to construct a criterion that monitors the convergence of all parameters.
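A minimal sketch of this scale-reduction computation, omitting the df/(df − 2) correction factor; the chains below are artificial draws from a common distribution, so the statistic should be close to 1:

```python
import numpy as np

def scale_reduction(chains):
    """(N-1)/N + ((n+1)/(n*N)) * B / W for an (n_chains, N) array of
    post burn-in draws of a single parameter (df correction omitted)."""
    chains = np.asarray(chains, dtype=float)
    n, N = chains.shape
    B = N * chains.mean(axis=1).var(ddof=1)   # B/N = variance of chain means
    W = chains.var(axis=1, ddof=1).mean()     # mean within-chain variance
    return (N - 1) / N + (n + 1) / (n * N) * B / W

rng = np.random.default_rng(3)
well_mixed = rng.standard_normal((4, 5_000))  # 4 chains, same distribution
R = scale_reduction(well_mixed)
```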
One representative single-chain approach was proposed by Raftery and Lewis ([RL92]). They retain only every nth sample after burn-in, with n large enough that the retained samples are approximately independent. More specifically, their procedure supposes the goal is to estimate a posterior quantile q = P(θ < a | D) to within ±r with probability s. For example, with q = .025, r = .005, and s = .95, the reported 95% credible sets would have true coverage between .94 and .96. To accomplish this, they form the indicator functions Z_j = I(θ^(j) < a), consider the thinned series Z_t^(n) = Z_{1+(t−1)n}, and apply results from Markov chain convergence theory. So basically they wait until selected posterior quantiles become very stable.
The multiple-chain approach of Gelman and Rubin is less subject to failure to detect lack of convergence than single-chain approaches, but it too fails occasionally. And since all diagnostic methods use only a finite realization of the chain, no diagnostic can prove convergence of an MCMC algorithm. Statisticians still rely heavily on such diagnostics, because even a weak diagnostic is useful.
We also need to provide point and interval estimates based on the samples we generate. Again there are multiple-chain approaches and single-chain approaches. For a multiple-chain approach, we run a large number, say N, of independent Markov chains; after they have all converged, we collect the current sample θ^(j) from chain j, and the posterior mean can be estimated as

Ê(θ | D) = θ̄_N = (1/N) Σ_{j=1}^{N} θ^(j).

The variance of the posterior mean can be estimated as

V̂ar(θ̄_N) = (1/(N(N − 1))) Σ_{j=1}^{N} (θ^(j) − θ̄_N)².
For single-chain approaches, one representative method is called batching. We divide a single long run of length N into m successive batches of length k (i.e., N = mk), with k large enough that the correlation between batches is negligible, and m large enough to estimate the between-batch variance reliably. We then compute the mean of each batch to get B_1, ..., B_m, and the posterior mean is estimated by

Ê(θ | D) = θ̄_N = (1/m) Σ_{j=1}^{m} B_j.

The variance of the posterior mean can be estimated by

V̂ar(θ̄_N) = (1/(m(m − 1))) Σ_{j=1}^{m} (B_j − θ̄_N)².
The 95% credible interval estimates for E(θ | D) are given by

θ̄_N ± z_{0.025} √V̂ar(θ̄_N)

or

θ̄_N ± t_{m−1, 0.025} √V̂ar(θ̄_N),

where z_{0.025} and t_{m−1, 0.025} are the upper .025 points of the standard normal distribution and the t distribution with m − 1 degrees of freedom, respectively, depending on the method used and the sample or batch size.
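The batching formulas can be sketched as follows; the "chain" here is artificial iid noise standing in for post burn-in MCMC output, so the between-batch correlation is negligible by construction:

```python
import numpy as np

rng = np.random.default_rng(4)
m, k = 50, 200                      # m batches of length k, N = m * k
chain = rng.standard_normal(m * k)  # stand-in for a converged MCMC run

batch_means = chain.reshape(m, k).mean(axis=1)   # B_1, ..., B_m
theta_bar = batch_means.mean()                   # estimate of E(theta | D)
var_hat = np.sum((batch_means - theta_bar)**2) / (m * (m - 1))
half = 1.96 * np.sqrt(var_hat)                   # normal-quantile half-width
ci = (theta_bar - half, theta_bar + half)        # approx 95% interval
```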
3.2.3 Gibbs sampling
Gibbs sampling provides a way to convert a multivariate sampling problem into univariate sampling problems. Suppose we have

θ = (θ_1, ..., θ_p) ∼ q(θ_1, ..., θ_p).

If all full conditional distributions { q_i(θ_i | θ_j, j ≠ i), i = 1, ..., p } are available for sampling, we start from an arbitrary set of initial values (θ_1^(0), ..., θ_p^(0)); at iteration t (t ≥ 0), we take the following sampling steps to get (θ_1^(t+1), ..., θ_p^(t+1)):

θ_1^(t+1) ∼ q_1(θ_1 | θ_2^(t), ..., θ_p^(t))
θ_2^(t+1) ∼ q_2(θ_2 | θ_1^(t+1), θ_3^(t), ..., θ_p^(t))
...
θ_p^(t+1) ∼ q_p(θ_p | θ_1^(t+1), ..., θ_{p−1}^(t+1))

Geman and Geman ([GG84]) proved the following results.

Gibbs Convergence Theorem:
(i) (θ_1^(t), ..., θ_p^(t)) →_d (θ_1, ..., θ_p) ∼ q(θ_1, ..., θ_p) as t → ∞;
(ii) under the L1 norm, the convergence rate is exponential in t.
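For a concrete case, the full conditionals of a bivariate normal with correlation ρ are univariate normals, so the update scheme above reduces to alternating normal draws (ρ and the chain length are illustrative):

```python
import numpy as np

rho = 0.8
rng = np.random.default_rng(5)
T = 20_000
theta = np.zeros((T, 2))
sd = np.sqrt(1.0 - rho**2)   # conditional standard deviation
for t in range(1, T):
    # theta_1 | theta_2 ~ N(rho * theta_2, 1 - rho^2)
    theta[t, 0] = rng.normal(rho * theta[t - 1, 1], sd)
    # theta_2 | theta_1 ~ N(rho * theta_1, 1 - rho^2), using the new theta_1
    theta[t, 1] = rng.normal(rho * theta[t, 0], sd)
draws = theta[2_000:]        # discard burn-in
```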
3.2.4 Metropolis-Hastings algorithm
The Metropolis-Hastings algorithm is one of the most popular MCMC methods. Metropolis et al. ([MRR+53]) presented the method in its simplest form. We choose an auxiliary proposal function h(x, y) such that h(·, y) is a pdf with respect to x for any y and is symmetric, i.e., h(x, y) = h(y, x). At step t, given the current state θ^(t), we do the following:

1. Sample θ* ∼ h(·, θ^(t));
2. Compute the ratio r = min{ 1, q(θ*) / q(θ^(t)) };
3. Sample u ∼ U(0, 1) and let

θ^(t+1) = θ*  if u ≤ r;  θ^(t+1) = θ^(t)  if u > r.

It can be proved that under mild conditions θ^(t) →_d q(θ) as t → ∞.

Hastings ([H70]) made a simple but important generalization of the Metropolis algorithm in 1970. He removed the requirement that the proposal function h(x, y) be symmetric, so that the ratio becomes

r = min{ 1, q(θ*) h(θ^(t), θ*) / [ q(θ^(t)) h(θ*, θ^(t)) ] };

the rest of the procedure remains the same.
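A minimal random-walk Metropolis sketch targeting the standard normal, where the symmetric proposal makes the ratio reduce to min{1, q(θ*)/q(θ^(t))}; the tuning choices are illustrative:

```python
import numpy as np

def log_q(x):
    return -0.5 * x * x        # log target up to an additive constant

rng = np.random.default_rng(6)
T = 50_000
theta = np.zeros(T)
for t in range(1, T):
    prop = theta[t - 1] + rng.normal(0.0, 1.0)   # symmetric proposal
    # accept with probability min{1, q(prop) / q(current)}
    if np.log(rng.uniform()) < log_q(prop) - log_q(theta[t - 1]):
        theta[t] = prop
    else:
        theta[t] = theta[t - 1]
draws = theta[5_000:]          # discard burn-in
```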
3.2.5 Adaptive rejection sampling for Gibbs sampling
Even with Gibbs sampling, where at every step we only need to sample from the univariate conditional distribution q_i(θ_i | θ_j, j ≠ i), direct sampling is still difficult under most circumstances. Adaptive rejection sampling, proposed by Gilks and Wild ([GW92]), is an efficient method that can be used when the conditional distribution is log-concave, directly or after some transformation. The basic idea of the sampling method is illustrated in Figure 3.1.
Figure 3.1: The figure shows the idea of adaptive rejection sampling, with functions defined as: $h(x) = \log(g(x))$, the logarithm of a function $g(x)$ proportional to a statistical density function $f(x)$; $u(x)$, the upper hull of $h(x)$; and $l(x)$, the lower hull of $h(x)$.
Suppose $g(x)$ is proportional to the target distribution and let $h(x) = \log(g(x))$, which is concave and has domain $D$. We keep a set of sorted points $\{x_i, i = 1, \ldots, n\}$, $x_1 < x_2 < \ldots < x_n$, at which $h(x)$ has been evaluated. The lower hull $l(x)$ is formed by the chords between $(x_i, h(x_i))$ and $(x_{i+1}, h(x_{i+1}))$, while the upper hull $u(x)$ is formed by the tangents at $\{x_i, i = 1, \ldots, n\}$. Denote the abscissae as $\{z_i, i = 1, \ldots, n-1\}$, where $z_i$ is the intersection of the tangents at $x_i$ and $x_{i+1}$.
The kernel of the algorithm to sample one point is the following:
1. Sample $x^*$ from the normalized exponential of the upper hull, $s(x)$;
2. Sample $q$ from the uniform distribution on the interval $(0,1)$;
3. If $q \leq \exp[l(x^*) - u(x^*)]$, accept $x^*$ and stop. Else evaluate $h(x^*)$; if $q \leq \exp[h(x^*) - u(x^*)]$, accept $x^*$ and stop. Otherwise, update the upper and lower hulls with the rejected $x^*$ and $(h(x^*), h'(x^*))$, resulting in closer upper and lower bounds, and go back to step 1.
From here we can see that one rejected sample leads to closer upper and lower bounds and higher efficiency for subsequent sampling; this is why the method is called adaptive.
Gilks and Wild ([GW92]) also provided the following formulas. For the abscissae and tangent functions, we have
$$z_i = x_i + \frac{h(x_i) - h(x_{i+1}) + h'(x_{i+1})(x_{i+1} - x_i)}{h'(x_{i+1}) - h'(x_i)}$$
and, for $x \in [z_{i-1}, z_i]$,
$$u(x) = h(x_i) + h'(x_i)(x - x_i).$$
These can be derived from the tangent functions and the formula for the intersection of two lines.
Let $c_u$ denote the normalizing factor of the exponential upper hull, which is the integral of the exponential upper hull over its domain. We have
$$c_u = \int \exp[u(x)]\,dx = \sum_{j=0}^{n-1} \frac{1}{h'(x_{j+1})}\left\{\exp[u(z_{j+1})] - \exp[u(z_j)]\right\}$$
where $z_0$ is the lower bound of $x$, or $-\infty$ if $x$ has no lower bound, and $z_n$ is the upper bound of $x$, or $+\infty$ if $x$ has no upper bound. Denote the cumulative distribution function of $s(x)$ as $s_{cum}(x)$; its value at one abscissa is
$$s_{cum}(z_i) = \frac{1}{c_u} \sum_{j=0}^{i-1} \frac{1}{h'(x_{j+1})}\left\{\exp[u(z_{j+1})] - \exp[u(z_j)]\right\}$$
To sample $x^*$ from $s(x)$, we first sample $q$ from the uniform distribution on the interval $(0,1)$, find $i$ such that $i = \arg\max_i\{s_{cum}(z_i) < q\}$, and compute $x^*$ by
$$x^* = z_i + \frac{1}{h'(x_{i+1})} \log\left[1 + \frac{h'(x_{i+1})\, c_u\, [q - s_{cum}(z_i)]}{\exp[u(z_i)]}\right]$$
The adaptive rejection sampling method allows us to use a function proportional to the target distribution and constrains it with the upper and lower hulls, which are simple piecewise straight-line segments. The introduction of the upper and lower hulls exploits the advantage of rejection sampling with squeezing, so it is efficient and is the method most used in practice in our current research.
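A simplified version of the method, for a log-concave log-density on a bounded interval, might look like the following (a sketch under our own simplifications: bounded domain, analytically supplied derivative, hulls rebuilt on every draw):

```python
import math
import random

def ars(h, h_prime, a, b, xs_init, n_samples=2000, seed=3):
    """Simplified adaptive rejection sampling (after Gilks & Wild) for a
    log-concave log-density h on a bounded domain [a, b]."""
    rng = random.Random(seed)
    xs = sorted(xs_init)            # abscissae where h has been evaluated
    hs = [h(x) for x in xs]
    dhs = [h_prime(x) for x in xs]

    def intersections():
        # z_0 = a and z_n = b; interior z_i from the tangent-crossing formula
        zs = [a]
        for i in range(len(xs) - 1):
            num = hs[i] - hs[i + 1] + dhs[i + 1] * (xs[i + 1] - xs[i])
            zs.append(xs[i] + num / (dhs[i + 1] - dhs[i]))
        zs.append(b)
        return zs

    def u(x, zs):
        # upper hull: tangent at x_i is active on the segment [z_i, z_{i+1}]
        for i in range(len(xs)):
            if x <= zs[i + 1]:
                return hs[i] + dhs[i] * (x - xs[i])
        return hs[-1] + dhs[-1] * (x - xs[-1])

    def l(x):
        # lower hull: chords between adjacent abscissae; -inf outside them
        if x < xs[0] or x > xs[-1]:
            return -math.inf
        for i in range(len(xs) - 1):
            if x <= xs[i + 1]:
                w = (x - xs[i]) / (xs[i + 1] - xs[i])
                return (1 - w) * hs[i] + w * hs[i + 1]
        return hs[-1]

    def sample_upper(zs):
        # piecewise-exponential sampling from s(x) proportional to exp[u(x)]
        masses = []
        for i in range(len(xs)):
            d = dhs[i]
            u0 = hs[i] + d * (zs[i] - xs[i])
            u1 = hs[i] + d * (zs[i + 1] - xs[i])
            if abs(d) < 1e-12:
                masses.append(math.exp(u0) * (zs[i + 1] - zs[i]))
            else:
                masses.append((math.exp(u1) - math.exp(u0)) / d)
        q = rng.random() * sum(masses)
        i = 0
        while i < len(masses) - 1 and q > masses[i]:
            q -= masses[i]
            i += 1
        d = dhs[i]
        u0 = hs[i] + d * (zs[i] - xs[i])
        if abs(d) < 1e-12:
            return zs[i] + q / math.exp(u0)
        return zs[i] + math.log1p(d * q / math.exp(u0)) / d

    out = []
    while len(out) < n_samples:
        zs = intersections()
        x = sample_upper(zs)
        q = rng.random()
        if q <= math.exp(l(x) - u(x, zs)):      # squeeze accept (no h call)
            out.append(x)
        elif q <= math.exp(h(x) - u(x, zs)):    # full accept
            out.append(x)
        else:                                   # adapt: refine the hulls
            j = 0
            while j < len(xs) and xs[j] < x:
                j += 1
            near = (j < len(xs) and abs(xs[j] - x) < 1e-9) or \
                   (j > 0 and abs(xs[j - 1] - x) < 1e-9)
            if not near:                        # avoid duplicate abscissae
                xs.insert(j, x)
                hs.insert(j, h(x))
                dhs.insert(j, h_prime(x))
    return out
```

For example, $g(x) \propto \exp(-x - x^2)$ on $[0, 4]$ is log-concave ($h'' = -2$), so `ars(lambda x: -x - x*x, lambda x: -1 - 2*x, 0.0, 4.0, [0.5, 2.0])` returns exact draws from the corresponding truncated normal.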
CHAPTER 4
Methodology
In this chapter, we present our approach, a Bayesian framework with an MCMC implementation, to the analysis of longitudinal data with missing values. This approach is able to handle different assumptions on the missingness mechanism and can be applied to various kinds of repeated measures.
4.1 Bayesian Inference using MCMC
4.1.1 Priors specification and monitoring convergence
In our Bayesian framework, we need to specify prior distributions for all parameters involved. Since most of the time there is no actual prior information or enough historical studies, non-informative or flat priors are adopted, which usually yields results consistent with those based on the stable maximum likelihood estimate, if one exists. More specifically, a normal distribution with infinite variance, or a uniform distribution over the support of the target, is used for all regression coefficients and for the logarithms of the variances of the random effects and random errors. We tried some informative priors, e.g., an inverse Gamma distribution on the variance parameters, but did not observe much difference, since with enough data the information from the data usually dominates the information from the priors. Flat uniform distributions are used for parameters with restricted support, such as the correlation parameter $\rho$ in the AR(1) structure, which has the natural constraint $-1 \leq \rho \leq 1$.
On monitoring convergence, we tried two approaches: the first is to start two or
more Markov chains from quite different initial values and wait until they interweave
with each other and the between chain variance is about the same as the within chain
variance. The second is to work with one chain, retaining only every kth sample after
burn-in, with k large enough so that the retained samples are approximately indepen-
dent. The stopping time is when some selected quantiles become stable.
4.1.2 MCMC:Data augmentation via Gibbs Sampler
We employ the method of data augmentation via Gibbs sampler to alleviate the com-
putation difficulty with missing values by sampling from the posterior distribution. We
first impute the missing values and then draw parameters one by one conditional on
the observed and imputed data. More specifically, the Gibbs sampler is an iterative
procedure with each iteration consisting of two steps.
(I) Imputation-Step, where the missing values are updated by drawing from the conditional predictive distribution. That is, for $i = 1$ to $N$, draw $y^{mis}_{ij}$ from
$$y^{mis}_{ij} \sim f_{ij}(y \mid y_i^{\backslash j}, x_i, r_i, \theta),$$
where $y_i^{\backslash j}$ means the vector $y_i$ excluding $y_{ij}$. For multivariate-normally distributed measures, this predictive distribution is a scalar normal distribution.
We still use $d_i$ to denote the dropout time point for subject $i$. To fill in the missing values, we actually only need to impute all intermittent missing values and the first dropout value for the purpose of computation, because the dropout probability at $d_i$ only depends on the current and previous values (i.e., $y_{i,d_i}$ and $H_{i,d_i}$). Imputing more dropout values is not necessary for computation and won't provide additional information, since all information actually comes from the available data and the stability assumptions we have: the models remain unchanged through time. Note that imputing intermittent missing values is different from imputing the first dropout value, since the former has information from both its history and its future, while the latter only has information from its history.
(II) Posterior-Step, where the parameters are drawn from the posterior distribution $P(\theta \mid Y, X, R)$, one at a time, according to the decomposition of the joint distribution into full-conditional distributions,
$$\theta_k \sim f(\theta_k \mid \theta_{\backslash \theta_k}, y_1, \ldots, y_N, X, R)$$
where $y_i = (y_{i1}, \ldots, y_{i,d_i-1}, y^{mis}_{i,d_i})^T$, $Y = (y_1, \ldots, y_N)^T$, and $\theta_{\backslash \phi}$ means $\theta$ excluding $\phi$ (e.g., $(\alpha, \beta)_{\backslash \alpha} = \beta$).
Note that in the above algorithm, missing values are simulated from the predictive function, while parameters are simulated from full-conditional distributions. But essentially, it is a process in which one unobserved quantity, a missing value or a parameter, is drawn at a time from its distribution conditional on all other related quantities: observed data, imputed data, and parameters. Similar ideas were adopted by Schafer ([S97]) in his data augmentation algorithms for creating imputations of missing values in multivariate data sets.
Starting from an initial point and repeating the two sampling stages for enough iterations, the procedure converges to its stationary distribution, i.e., the joint distribution of the parameters and the missing values. Thus, after a long enough burn-in period, the simulated missing values and parameters can be used for parameter estimation. Note that this algorithm can also be used for conducting multiple imputation ([R87]), where multiple imputed data sets are first created and then analyzed using standard longitudinal models for complete data.
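The two-step loop can be written as a generic skeleton (all function and variable names here are hypothetical placeholders for the model-specific draws described above):

```python
import random

def data_augmentation(y, is_missing, draw_missing, draw_param, theta0,
                      n_iter=1000, seed=0):
    """Generic data-augmentation Gibbs sampler skeleton.

    draw_missing(rng, y, i, j, theta): draw of the missing y[i][j] from its
    conditional predictive distribution (user-supplied).
    draw_param(rng, y, theta, name): draw of one parameter from its full
    conditional given the completed data (user-supplied).
    """
    rng = random.Random(seed)
    theta = dict(theta0)
    history = []
    for _ in range(n_iter):
        # (I) Imputation-Step: fill in every missing value
        for i in range(len(y)):
            for j in range(len(y[i])):
                if is_missing[i][j]:
                    y[i][j] = draw_missing(rng, y, i, j, theta)
        # (II) Posterior-Step: update the parameters one at a time
        for name in theta:
            theta[name] = draw_param(rng, y, theta, name)
        history.append(dict(theta))
    return history
```

After burn-in, the `history` entries are (correlated) draws from the joint posterior of the parameters, and snapshots of `y` taken at well-separated iterations can serve as multiple imputations.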
4.2 Implementation for selection models
We implemented this framework for continuous response data with dropout missing values, using the same modeling strategy as in Diggle and Kenward [DK94]. For the autocorrelation, we tried the first-order autoregressive (AR(1)) variance structure and an explicit random-effects (random intercept) model.
4.2.1 Selection Models with AR(1) Covariance
For the AR(1) covariance matrix, we have
$$\Sigma_J^{-1} = \frac{1}{\sigma^2(1-\rho^2)}
\begin{pmatrix}
1 & -\rho & 0 & \cdots & 0 \\
-\rho & 1+\rho^2 & -\rho & \cdots & 0 \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
0 & \cdots & -\rho & 1+\rho^2 & -\rho \\
0 & \cdots & 0 & -\rho & 1
\end{pmatrix}$$
and $\det(\Sigma_J) = (\sigma^2)^J (1-\rho^2)^{J-1}$.
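The tridiagonal inverse and the determinant formula can be checked numerically with a short script (an illustration using plain Python lists; the helper names are ours):

```python
def ar1_covariance(J, sigma2, rho):
    """AR(1) covariance: Sigma[j][k] = sigma^2 * rho^|j-k|."""
    return [[sigma2 * rho ** abs(j - k) for k in range(J)] for j in range(J)]

def ar1_precision(J, sigma2, rho):
    """Closed-form tridiagonal inverse of the AR(1) covariance."""
    c = 1.0 / (sigma2 * (1.0 - rho * rho))
    m = [[0.0] * J for _ in range(J)]
    for j in range(J):
        m[j][j] = c * (1.0 + rho * rho if 0 < j < J - 1 else 1.0)
        if j + 1 < J:
            m[j][j + 1] = m[j + 1][j] = -c * rho
    return m

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def det(a):
    """Determinant by Gaussian elimination (pivots are nonzero here)."""
    n = len(a)
    m = [row[:] for row in a]
    d = 1.0
    for i in range(n):
        d *= m[i][i]
        for r in range(i + 1, n):
            f = m[r][i] / m[i][i]
            for c2 in range(i, n):
                m[r][c2] -= f * m[i][c2]
    return d
```

Multiplying `ar1_covariance` by `ar1_precision` should give the identity, and `det` should match $(\sigma^2)^J(1-\rho^2)^{J-1}$.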
The Gibbs sampler consists of the following steps.
I. Imputation-Step: draw missing values from
$$y_{i,d_i} \sim f_{i,d_i}(y \mid y_i^{obs}, x_i^{obs}, \theta) \propto \frac{1}{\sqrt{2\pi\tau_i}} \exp\left\{-\frac{\left(y - \sum_{k=1}^K x_{i,d_i,k}\beta_k - \mu_i\right)^2}{2\tau_i}\right\} \cdot \frac{\exp\left(\psi_0 + \psi_1 y + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\psi_k\right)}{1 + \exp\left(\psi_0 + \psi_1 y + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\psi_k\right)}$$
where
$$\mu_i = C_{(d_i-1)}^T \Sigma_{d_i-1}^{-1} \left(y_i^{obs} - x_i^{obs}\beta\right)$$
and
$$\tau_i = \sigma^2 \left(1 - C_{(d_i-1)}^T \Sigma_{d_i-1}^{-1} C_{(d_i-1)}\right),$$
with $x_i^{obs} = (x_{i1}, \ldots, x_{i,d_i-1})^T$ and $C_{(j)} = (\rho^{j}, \rho^{j-1}, \ldots, \rho)^T$. This can be derived by regressing $y_{i,d_i}$ on $y_i^{obs}$.
II. Posterior-Step: draw parameters one by one in the following order.
(1) For $k = 1, \ldots, K$, draw the fixed parameters:
$$\beta_k \sim f(\beta_k \mid \theta_{\backslash \beta_k}, Y, X) \propto \prod_{i=1}^N \exp\left\{-\frac{(y_i - x_i\beta)^T \Sigma_{d_i}^{-1} (y_i - x_i\beta)}{2}\right\}$$
(2) Draw the autocorrelation parameter in the variance structure:
$$\rho \sim f(\rho \mid \theta_{\backslash \rho}, Y, X) \propto \prod_{i=1}^N \frac{1}{\sqrt{\det(\Sigma_{d_i})}} \exp\left\{-\frac{(y_i - x_i\beta)^T \Sigma_{d_i}^{-1} (y_i - x_i\beta)}{2}\right\}$$
(3) Draw the variance of the residuals:
$$\sigma^2 \sim f(\sigma^2 \mid \theta_{\backslash \sigma^2}, Y, X) \propto \prod_{i=1}^N \frac{1}{\sqrt{\det(\Sigma_{d_i})}} \exp\left\{-\frac{(y_i - x_i\beta)^T \Sigma_{d_i}^{-1} (y_i - x_i\beta)}{2}\right\}$$
(4) For $k = 1, \ldots, J$, draw the parameters of the dropout mechanism:
$$\psi_k \sim f(\psi_k \mid \theta_{\backslash \psi_k}, Y) \propto \prod_{i=1}^N \left[\prod_{j=2}^{d_i-1} \frac{1}{1 + \exp\left(\psi_0 + \psi_1 y_{ij} + \sum_{k'=2}^{j} y_{i,j+1-k'}\psi_{k'}\right)}\right] \frac{\exp\left(\psi_0 + \psi_1 y_{i,d_i} + \sum_{k'=2}^{d_i} y_{i,d_i+1-k'}\psi_{k'}\right)}{1 + \exp\left(\psi_0 + \psi_1 y_{i,d_i} + \sum_{k'=2}^{d_i} y_{i,d_i+1-k'}\psi_{k'}\right)}$$
In the above density functions, $x_i = (x_{i1}, \ldots, x_{i,d_i})^T$ and $X = (x_1, \ldots, x_N)^T$. When drawing each parameter, all other parameters are viewed as known constants. All the above conditional distributions except the one for $\rho$ have log-concave forms, directly or after some transformation, and thus can be simulated using the method of adaptive rejection sampling. A convenient sampling method for $\rho$ is the Metropolis-Hastings algorithm. The proof is simple:
(1) For the missing values, the predictive function is log-concave because
$$\frac{\partial^2 \log f_{i,d_i}(y \mid y_i^{obs}, x_i^{obs}, \theta)}{\partial y_{i,d_i}^2} = -\frac{1}{\tau_i} - \frac{\psi_1^2 \exp\left(\psi_0 + \psi_1 y + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\psi_k\right)}{\left(1 + \exp\left(\psi_0 + \psi_1 y + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\psi_k\right)\right)^2} < 0$$
where $\tau_i$ is the positive conditional variance.
(2) For the fixed parameters, the conditional density function is log-concave because
$$\frac{\partial^2 \log f(\beta_k \mid \theta_{\backslash \beta_k}, Y, X)}{\partial \beta_k^2} = -\sum_{i=1}^N (x_{i1k}, \ldots, x_{i,d_i,k})\, \Sigma_{d_i}^{-1}\, (x_{i1k}, \ldots, x_{i,d_i,k})^T < 0$$
(3) For the residual variance, the conditional density function after a logarithm transform, $s = \log(\sigma^2)$, is log-concave:
$$\frac{\partial^2 \log f(s \mid \theta_{\backslash s}, Y, X)}{\partial s^2} = -\sum_{i=1}^N \frac{(y_i - x_i\beta)^T \left(\sigma^2 \Sigma_{d_i}^{-1}\right) (y_i - x_i\beta)}{2 e^s} < 0$$
(4) For the parameters of the dropout mechanism, the conditional density function is log-concave because
$$\frac{\partial^2 \log f(\psi_k \mid \theta_{\backslash \psi_k}, Y)}{\partial \psi_k^2} = -\sum_{i=1}^N \left[\sum_{j=2}^{d_i-1} \frac{y_{i,j+1-k}^2 \exp\left(\psi_0 + \psi_1 y_{ij} + \sum_{k'=2}^{j} y_{i,j+1-k'}\psi_{k'}\right)}{\left(1 + \exp\left(\psi_0 + \psi_1 y_{ij} + \sum_{k'=2}^{j} y_{i,j+1-k'}\psi_{k'}\right)\right)^2} + \frac{y_{i,d_i+1-k}^2 \exp\left(\psi_0 + \psi_1 y_{i,d_i} + \sum_{k'=2}^{d_i} y_{i,d_i+1-k'}\psi_{k'}\right)}{\left(1 + \exp\left(\psi_0 + \psi_1 y_{i,d_i} + \sum_{k'=2}^{d_i} y_{i,d_i+1-k'}\psi_{k'}\right)\right)^2}\right] < 0$$
4.2.2 Selection Models with Random-Effects
In the selection model with random effects, the random-effects model for the repeated measures can be rewritten as
$$y_{ij} = \sum_{k=1}^{K} x_{ijk}\beta_k + \sum_{k=1}^{q} z_{ijk}\gamma_{ik} + \epsilon_{ij},$$
where $\epsilon_{ij} \sim N(0, \sigma^2)$. Here, we restrict the random effects to be independent of each other, with $\gamma_{ik} \sim N(0, \sigma_k^2)$ ($k = 1, \ldots, q$). In the Gibbs sampler, the missing values $(y_{1,d_1}, \ldots, y_{N,d_N})$ and the parameters $\theta = (\beta, \gamma, \sigma^2, \{\sigma_k^2\}, \psi)$ are simulated in the following steps.
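Under this measurement model, data for one subject can be simulated directly (a minimal sketch for the random-intercept special case $q = 1$, $z_{ij1} = 1$; all parameter values below are made up for illustration):

```python
import random

def simulate_subject(beta, x, gamma_sd, sigma, rng):
    """Simulate one subject's repeated measures from a random-intercept
    model: y_ij = x_ij . beta + gamma_i + eps_ij."""
    gamma_i = rng.gauss(0.0, gamma_sd)           # subject-level random effect
    ys = []
    for x_ij in x:                               # one covariate row per visit
        mean = sum(xk * bk for xk, bk in zip(x_ij, beta)) + gamma_i
        ys.append(rng.gauss(mean, sigma))        # add the residual noise
    return ys
```

The shared intercept $\gamma_i$ induces within-subject correlation: marginally $\mathrm{Var}(y_{ij}) = \sigma^2 + \sigma_\gamma^2$ while $\mathrm{Cov}(y_{ij}, y_{ij'}) = \sigma_\gamma^2$ for $j \neq j'$.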
I. Imputation-Step: draw missing values from
$$y_{i,d_i} \sim f_{i,d_i}(y \mid y_i^{obs}, x_{i,d_i}, z_{i,d_i}, \theta) \propto \exp\left\{-\frac{\left(y - \sum_{k=1}^K x_{i,d_i,k}\beta_k - \sum_{k=1}^q z_{i,d_i,k}\gamma_{ik}\right)^2}{2\sigma^2}\right\} \cdot \frac{\exp\left(\psi_0 + \psi_1 y + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\psi_k\right)}{1 + \exp\left(\psi_0 + \psi_1 y + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\psi_k\right)}$$
II. Posterior-Step: draw parameters one by one.
(1) For the fixed-effect parameters:
$$\beta_k \sim f(\beta_k \mid \theta_{\backslash \beta_k}, Y, X, Z) \propto \prod_{i=1}^N \prod_{j=1}^{d_i} \exp\left\{-\frac{\left(y_{ij} - \sum_{k=1}^K x_{ijk}\beta_k - \sum_{k=1}^q z_{ijk}\gamma_{ik}\right)^2}{2\sigma^2}\right\}$$
(2) For $i = 1, \ldots, N$; $k = 1, \ldots, q$, simulate the random effects:
$$\gamma_{ik} \sim f(\gamma_{ik} \mid \theta_{\backslash \gamma_{ik}}, Y, X, Z) \propto \prod_{j=1}^{d_i} \exp\left\{-\frac{\left(y_{ij} - \sum_{k=1}^K x_{ijk}\beta_k - \sum_{k=1}^q z_{ijk}\gamma_{ik}\right)^2}{2\sigma^2}\right\} \exp\left\{-\frac{\gamma_{ik}^2}{2\sigma_k^2}\right\}$$
(3) For $k = 1, \ldots, q$, draw the variance of the random effects:
$$\sigma_k^2 \sim f(\sigma_k^2 \mid \theta_{\backslash \sigma_k^2}) \propto \prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\left\{-\frac{\gamma_{ik}^2}{2\sigma_k^2}\right\}$$
(4) For the residual variance:
$$\sigma^2 \sim f(\sigma^2 \mid \theta_{\backslash \sigma^2}, Y, X, Z) \propto \prod_{i=1}^N \prod_{j=1}^{d_i} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{\left(y_{ij} - \sum_{k=1}^K x_{ijk}\beta_k - \sum_{k=1}^q z_{ijk}\gamma_{ik}\right)^2}{2\sigma^2}\right\}$$
(5) For the parameters of the dropout model:
$$\psi_k \sim f(\psi_k \mid \theta_{\backslash \psi_k}, Y) \propto \prod_{i=1}^N \left[\prod_{j=2}^{d_i-1} \frac{1}{1 + \exp\left(\psi_0 + \psi_1 y_{ij} + \sum_{k'=2}^{j} y_{i,j+1-k'}\psi_{k'}\right)}\right] \frac{\exp\left(\psi_0 + \psi_1 y_{i,d_i} + \sum_{k'=2}^{d_i} y_{i,d_i+1-k'}\psi_{k'}\right)}{1 + \exp\left(\psi_0 + \psi_1 y_{i,d_i} + \sum_{k'=2}^{d_i} y_{i,d_i+1-k'}\psi_{k'}\right)}$$
In the above densities, $z_i = (z_{i1}, \ldots, z_{i,d_i})^T$ and $Z = (z_1, \ldots, z_N)^T$.
As shown here, all the above conditional distributions have log-concave forms, directly or after some transformation, and thus can be simulated using the method of adaptive rejection sampling.
(1) For the missing values, the predictive function is log-concave because
$$\frac{\partial^2 \log f_{i,d_i}(y \mid x_{i,d_i}, z_{i,d_i}, \theta)}{\partial y_{i,d_i}^2} = -\frac{1}{\sigma^2} - \frac{\psi_1^2 \exp\left(\psi_0 + \psi_1 y + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\psi_k\right)}{\left(1 + \exp\left(\psi_0 + \psi_1 y + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\psi_k\right)\right)^2} < 0$$
(2) For the fixed parameters, the conditional density function is log-concave because
$$\frac{\partial^2 \log f(\beta_k \mid \theta_{\backslash \beta_k}, Y, X, Z)}{\partial \beta_k^2} = -\sum_{i=1}^N \sum_{j=1}^{d_i} \frac{x_{ijk}^2}{\sigma^2} < 0$$
(3) For the random effects, the conditional density function is log-concave because
$$\frac{\partial^2 \log f(\gamma_{ik} \mid \theta_{\backslash \gamma_{ik}}, Y, X, Z)}{\partial \gamma_{ik}^2} = -\sum_{j=1}^{d_i} \frac{z_{ijk}^2}{\sigma^2} - \frac{1}{\sigma_k^2} < 0$$
(4) For the variance of the random effects, the conditional density function after a logarithm transform, $s = \log(\sigma_k^2)$, is log-concave:
$$\frac{\partial^2 \log f(s \mid \theta_{\backslash s}, Y)}{\partial s^2} = -\sum_{i=1}^N \frac{\gamma_{ik}^2}{2 e^s} < 0$$
(5) For the residual variance, the conditional density function after a logarithm transform, $s = \log(\sigma^2)$, is log-concave:
$$\frac{\partial^2 \log f(s \mid \theta_{\backslash s}, Y, X, Z)}{\partial s^2} = -\sum_{i=1}^N \sum_{j=1}^{d_i} \frac{\left(y_{ij} - \sum_{k=1}^K x_{ijk}\beta_k - \sum_{k=1}^q z_{ijk}\gamma_{ik}\right)^2}{2 e^s} < 0$$
(6) For the parameters of the dropout mechanism, the conditional density function is log-concave because
$$\frac{\partial^2 \log f(\psi_k \mid \theta_{\backslash \psi_k}, Y)}{\partial \psi_k^2} = -\sum_{i=1}^N \left[\sum_{j=2}^{d_i-1} \frac{y_{i,j+1-k}^2 \exp\left(\psi_0 + \psi_1 y_{ij} + \sum_{k'=2}^{j} y_{i,j+1-k'}\psi_{k'}\right)}{\left(1 + \exp\left(\psi_0 + \psi_1 y_{ij} + \sum_{k'=2}^{j} y_{i,j+1-k'}\psi_{k'}\right)\right)^2} + \frac{y_{i,d_i+1-k}^2 \exp\left(\psi_0 + \psi_1 y_{i,d_i} + \sum_{k'=2}^{d_i} y_{i,d_i+1-k'}\psi_{k'}\right)}{\left(1 + \exp\left(\psi_0 + \psi_1 y_{i,d_i} + \sum_{k'=2}^{d_i} y_{i,d_i+1-k'}\psi_{k'}\right)\right)^2}\right] < 0$$
4.3 Implementation for shared-parameter models
Similar to the models for binary response variables (see Section 2.3.3) that were used by Albert and Follmann ([AF03]), for count and continuous response variables we also combine the advantages of mixed-effects models and Markov transition models to set up the submodels for the repeated measures and the missingness indicators.
4.3.1 Count response variables
For Poisson-distributed count response variables, a one-step (Markov) transition model is assumed for $y_i = (y_{i1}, \ldots, y_{iT})$, where at any time point, $y_{it}$ ($t > 1$) is independent of $(y_{i1}, \ldots, y_{i,t-2})$ given the previous observation $y_{i,t-1}$. A random intercept is used to capture the baseline heterogeneity across subjects. In this random-intercept Markov transition model, the distribution of the response variable depends on the covariates under investigation and a random intercept:
$$P(y_{it} \mid x_{it}, y_{i,t-1}, \gamma_i) = \frac{\mu_{it}^{y_{it}}}{y_{it}!} e^{-\mu_{it}}$$
where $x_{it} = (x_{i1t}, \ldots, x_{ipt})$ and $\mu_{it} = E(y_{it} \mid x_{it}, y_{i,t-1}, \gamma_i)$. A Poisson linear regression model is used to connect the covariates ($x_{it}$), the random effects ($\gamma_i \sim N(0, \sigma^2)$), and the previous observation ($y_{i,t-1}$) with the current observation ($y_{it}$):
$$\log(\mu_{it}) = x_{it}\beta + \alpha\left(\log(\max(1, y_{i,t-1})) - x_{i,t-1}\beta\right) + \gamma_i.$$
Here, $\beta$ contains the fixed parameters, which are of interest when making inferences regarding the effect of covariates in modifying the transition among outcomes. The parameter $\alpha$ indicates the influence of the previous counts through the residual
$$\log(\max(1, y_{i,t-1})) - x_{i,t-1}\beta,$$
where $\max(1, y_{i,t-1})$ is used to ensure a positive value for the logarithmic operation.
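The transition mean $\mu_{it}$, and a draw of $y_{it}$ from it, can be sketched as follows (covariate and parameter values are made up for illustration; `draw_poisson` is a simple Knuth-style sampler, adequate for small means, not from the dissertation):

```python
import math
import random

def transition_mean(x_t, x_tm1, y_tm1, beta, alpha, gamma_i):
    """mu_it from log(mu_it) = x_it.beta
       + alpha * (log(max(1, y_{i,t-1})) - x_{i,t-1}.beta) + gamma_i."""
    xb_t = sum(x * b for x, b in zip(x_t, beta))
    xb_tm1 = sum(x * b for x, b in zip(x_tm1, beta))
    resid = math.log(max(1, y_tm1)) - xb_tm1     # residual of previous count
    return math.exp(xb_t + alpha * resid + gamma_i)

def draw_poisson(mu, rng):
    """Knuth's multiplicative Poisson sampler."""
    L = math.exp(-mu)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1
```

With $\alpha = 0$ the model reduces to an ordinary Poisson log-linear regression with a random intercept; $\alpha \neq 0$ pulls the current mean toward (or away from) the previous count's residual.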
To impute intermittent missing values in the imputation step, the Metropolis random-walk sampling method is used. For $i = 1, \ldots, n$ and $t = 1, \ldots, T_i$, if $y_{it}$ is missing, an imputation is drawn with a symmetric proposal distribution; the acceptance probability is evaluated conditionally on the observed or imputed $y_{i,t-1}$ and $y_{i,t+1}$, i.e.,
$$\frac{f(y^{mis}_{it,new} \mid y_{i,t-1}, y_{i,t+1}, \theta)}{f(y^{mis}_{it,current} \mid y_{i,t-1}, y_{i,t+1}, \theta)}$$
where
$$f(y^{mis}_{it} \mid y_{i,t-1}, y_{i,t+1}, \theta) \propto \frac{\mu_{it}^{y_{it}}}{y_{it}!} e^{-\mu_{it}}\, \mu_{i,t+1}^{y_{i,t+1}} e^{-\mu_{i,t+1}}.$$
For count response variables, in this formula $y^{mis}_{it,new} = y^{mis}_{it,current} \pm 1$, since we use random-walk sampling.
For the posterior step, the adaptive rejection sampling method is used to draw the parameters in the following order. First, the counting-process parameters $\beta$ and $\alpha$ of the Poisson regression model are drawn using
$$f(\beta \mid \theta_{\backslash \beta}, Y^{mis}) \propto \prod_{i=1}^n \prod_{t=1}^{T_i} P(y_{it} \mid x_{it}, y_{i,t-1}, \gamma_i)$$
$$f(\alpha \mid \theta_{\backslash \alpha}, Y^{mis}) \propto \prod_{i=1}^n \prod_{t=1}^{T_i} P(y_{it} \mid x_{it}, y_{i,t-1}, \gamma_i)$$
where $\theta_{\backslash \beta}$ refers to the subvector of $\theta$ excluding $\beta$, and $\theta_{\backslash \alpha}$ has the same sense, as do the other notations that follow.
Second, for the parameters $\eta$ and $\lambda$ of the missingness mechanism model, the following conditional distributions are used:
$$f(\eta \mid \theta_{\backslash \eta}, Y^{mis}) \propto \prod_{i=1}^n \prod_{t=1}^{D_i} P(r_{it} \mid \gamma_i, x_{it}, r_{i,t-1})$$
$$f(\lambda \mid \theta_{\backslash \lambda}, Y^{mis}) \propto \prod_{i=1}^n \prod_{t=1}^{D_i} P(r_{it} \mid \gamma_i, x_{it}, r_{i,t-1}).$$
Third, for the random intercept effects $\gamma_i$ and their variance $\sigma^2$, we use the following conditional distributions:
$$f(\gamma_i \mid \theta_{\backslash \gamma_i}, Y^{mis}) \propto \prod_{t=1}^{T_i} P(y_{it} \mid x_{it}, y_{i,t-1}, \gamma_i) \prod_{t=1}^{D_i} P(r_{it} \mid \gamma_i, x_{it}, r_{i,t-1}) \exp\left\{-\frac{\gamma_i^2}{2\sigma^2}\right\}$$
$$f(\sigma^2 \mid \theta_{\backslash \sigma^2}, Y^{mis}) \propto (\sigma^2)^{-n/2} \exp\left\{-\frac{\sum_{i=1}^n \gamma_i^2}{2\sigma^2}\right\} b^a \exp(-b/\sigma^2)\, (\sigma^2)^{-(a+1)}$$
It is not difficult to verify that the above density functions for $\beta$, $\alpha$, $\eta$, $\lambda$, and $\gamma_i$ are log-concave by computing the second derivatives of the corresponding log-transformed functions. For example,
$$\frac{\partial^2 \log f(\alpha \mid \theta_{\backslash \alpha}, Y^{mis})}{\partial \alpha^2} = -\sum_{i=1}^n \sum_{t=1}^{T_i} \mu_{it}\left(\log(\max(1, y_{i,t-1})) - x_{i,t-1}\beta\right)^2 < 0.$$
By denoting $s = \log(\sigma^2)$ and letting $a = b = 0$, it can be shown that the logarithmic form of
$$f(s \mid \theta_{\backslash s}, Y^{mis}) \propto \exp\left\{-\frac{(n+2)s}{2} - \frac{\sum_{i=1}^n \gamma_i^2}{2 e^s}\right\}$$
also has a negative second derivative. So we can use the adaptive rejection sampling method on all of them.
4.3.2 Binary response variables
For binary response variables, to impute intermittent missing values in the imputation step, the Metropolis random-walk sampling method is used. For $i = 1, \ldots, n$ and $t = 1, \ldots, T_i$, if $y_{it}$ is missing, an imputation is drawn with a symmetric proposal distrib