STATISTICAL METHODS FOR INDIVIDUALIZED HEALTH:
ETIOLOGY, DIAGNOSIS, AND INTERVENTION EVALUATION
by
Zhenke Wu
A dissertation submitted to The Johns Hopkins University in conformity with the
requirements for the degree of Doctor of Philosophy.
Baltimore, Maryland
September, 2014
© Zhenke Wu 2014
All rights reserved
Abstract
The term individualized health represents a goal of the next generation of health sys-
tem: to treat the right person, at the right place, at the right time taking account of the
individuals’ characteristics, circumstances and preferences. To advance this goal, a new
partnership of statistical and biomedical science is needed to intelligently use information
to better understand disease etiology, to improve diagnoses and treatment decisions and
to accurately evaluate health interventions. In two parts, this thesis addresses statistical
methods in support of the individualized health goal.
In Part I, the key objective is to characterize an individual’s underlying health state given
imprecise measurements. We assume that the health states can be usefully represented by
categorical latent variables. We describe a statistical framework, termed nested partially-
latent class models (npLCM), to estimate the population fraction of individuals in each
class, and to predict an individual’s health state given multivariate binary measurements
from case-control studies. We assume each observation is a draw from a mixture model
whose components represent latent health state classes. Conditional dependence among
the binary measurements on an individual is induced by nesting subclasses within each
latent health/disease class. Measurement precision and dependence among measurements
can be estimated using the control sample for whom the class is known. Model estimation,
model checking, and individual diagnosis are carried out using posterior samples drawn by
a Gibbs sampler. We illustrate the model using a subset of data from the motivating
Pneumonia Etiology Research for Child Health (PERCH) study, which examines the distribution
of pneumonia-causing bacterial or viral pathogens in developing countries.
The second part of this thesis focuses on improving the efficiency of estimating the ef-
fect of individualized intervention using data from matched-pair cluster randomized (MPCR)
designs, where person-level or cluster-level covariates are available. Covariate imbalances
between pairs are commonly observed under MPCR even after matching. We show that the
naive approaches that ignore such imbalance are biased. We propose a covariate-calibrated
approach to achieve both consistency and greater efficiency. We use the new method to
evaluate the effect of an individualized health care intervention in the Guided Care study.
Advisor:
Scott Zeger, PhD
Committee:
Gary Rosner, PhD (chair); Brian Caffo, PhD; Maria Deloria-Knoll, PhD
Alternates:
Roger Peng, PhD; Elizabeth Stuart, PhD
Acknowledgments
My advisor, Dr. Scott Zeger, has been incredibly supportive during my study at Hop-
kins. He led me into the pneumonia etiology project with extreme patience and original
insights. He encouraged me to go where data are, and demonstrated how to design and
communicate inventive statistical methods. His genuine enthusiasm for doing quality sci-
ence and constant pursuit of effective and creative methods have been, and will continue to
be, my source of inspiration in my future career. I am very fortunate to have worked with
him during my PhD program.
I would also like to thank another of my mentors, Dr. Constantine Frangakis, whose
support, guidance, and friendship have fostered my interest in clinical trials, causal infer-
ence, and many other areas.
This thesis has also benefited from discussions with many other professors. The detailed
comments from my thesis committee members, Drs. Gary Rosner, Brian Caffo, and
Maria Deloria-Knoll, were educational and have led to a greatly improved presentation of the
methods. I also thank Drs. Elizabeth Stuart and Roger Peng for being willing to spend time
reading my thesis. Drs. Thomas Louis and Daniel Scharfstein have also generously offered
me much advice in better formulating and communicating methodological ideas during our
collaborations, which led to Chapter 5.
I would like to thank Jiawei Bai, a very close friend, and many other professors and
fellow students who have created a pleasant working place and helped me at Johns Hopkins.
The Pneumonia Etiology Research for Child Health (PERCH) study has helped me
understand the elements of a real scientific investigation. Direct communications with
medical doctors and epidemiologists have led to much improved statistical modeling over
the years. I am very fortunate to have collaborated with a team of wonderful scientists
led by Dr. Katherine O’Brien. Kate has been very supportive and responsive during our
model development phase. Her pursuit of scientifically interpretable statistical results has
also led to heated discussions in steering committee meetings. It was challenging as well
as educational for me to participate and contribute during these discussions.
More importantly, the weekly analysis meetings with Drs. Maria Deloria-Knoll, Laura
Hammitt, Daniel Feikin, and many other investigators have always been constructive and
fun. Our conversations have helped shape the statistical approach presented in this thesis.
I would also like to thank the PERCH coordinators, Wei Fu, Daniel Park, Christine Pros-
peri, Melissa Higdon, and Mengying Li for their strong support and trust that have helped
demonstrate the pneumonia etiology analysis. I also thank the members of PERCH Expert
Group who provided external advice to further improve the statistical models.
Finally, I am also grateful for the generous financial support from the Department of
Biostatistics, Bill & Melinda Gates Foundation, and the Sommer Scholar program, which
made my life at Hopkins more pleasant.
On a personal note, I would like to thank my wife Ruoping Chai, my parents and
parents-in-law, Changlin Xu, Juan Wu, Ruizhen Sun, Shiduo Chai, for their love, encour-
agement and constant support through the five years seeking my degree.
This thesis is dedicated to you, and also to my son, Tyler.
Contents
Abstract
Acknowledgments
List of Tables
List of Figures
1 Introduction
1.1 Statistical challenges in individualized health
1.2 Organizational overview
1.3 Software
2 Latent Class Models
2.1 Brief history and formulation
2.2 Identifiability
2.3 Estimation by Markov chain Monte Carlo
2.4 Grade-of-Membership model
2.5 Approximation properties
3 Partially-Latent Class Models (pLCM) for Case-Control Studies of Childhood Pneumonia Etiology
3.1 Introduction
3.2 A partially-latent class model for multiple indirect measurements
3.2.1 Model identifiability
3.2.2 Parameter estimation and individual etiology prediction
3.3 Simulation for three pathogens case with GS and BrS data
3.4 Analysis of PERCH data
3.5 Discussion
4 Nested Partially-Latent Class Models (npLCM) for Estimating Disease Etiology in Case-Control Studies
4.1 Introduction
4.2 Model specification of npLCM
4.2.1 npLCM likelihood
4.2.2 Prior specifications
4.3 Model properties
4.3.1 Non-interference submodels
4.3.2 Mean and covariance structure
4.3.3 Alternate approaches to borrowing information from the control population
4.3.4 Modeling choices
4.4 Posterior computations
4.5 Simulation studies
4.6 Analysis of PERCH data
4.6.1 Estimation of etiologic fractions
4.6.2 Model checking
4.7 Discussion
5 Estimation of Treatment Effects in Matched-Pair Cluster Randomized Trials by Calibrating Covariate Imbalance Between Clusters with Application to Guided Care Study
5.1 Introduction
5.2 The goal and design using potential outcomes
5.3 Complications with existing methods
5.3.1 Consequences when ignoring covariates
5.3.2 Complications with existing covariate methods
5.4 Addressing the Problems
5.4.1 Calibration of observed covariate differences between clinical practices
5.4.2 Estimation of quantities of original interest
5.4.3 Assessment of the hypothesis of no effect
5.5 Discussion
6 Conclusions and Future Work
Appendices
A1 Appendix to Chapter 3
A1.1 Full conditional distributions in Gibbs sampler
A1.2 Pathogen names and their abbreviations
A1.3 Additional simulation results
A2 Appendix to Chapter 4
A2.1 Posterior computations
A2.2 Full pathogen names with abbreviations
A2.3 Stick-breaking prior
A2.4 Mean and covariance structure
A3 Appendix to Chapter 5
A3.1 Proof of Result 1
A3.2 Proof of Result (4)
A3.3 Proof of part (a) in Result 2
Bibliography
Curriculum Vitae
List of Tables
4.1 Results for simulated data sets separately fitted by the npLCM and pLCM.
5.1 Summary of average SF36 outcomes for uncalibrated versus calibrated approaches. The first row block displays sample sizes; the second row block displays average outcomes that are uncalibrated and calibrated, respectively.
5.2 Checking covariate imbalances within each pair. For a continuous covariate (indicated by (a)), we calculate effect size as difference divided by pooled standard deviation. For a categorical covariate (indicated by (b)), odds ratio is calculated comparing rates of occurrence of each category between two clusters in a pair. To prevent infinite odds ratio, 0.5 is added to all the cells when calculating sample odds ratios.
5.3 Results from MLE, profile MLE, Bayes estimates and permutation test in the Guided Care program study. The covariates used for calibration are listed in the first column of Table 5.2; the outcome is the physical component summary of the Short Form 36 (SF36). Results from different methods
List of Figures
3.1 Directed acyclic graph (DAG) illustrating relationships among lung infection state ($I^L$), imperfect lab measurements on the presence/absence of each of a list of pathogens at each site ($M^{NP}$, $M^{B}$ and $M^{L}$), disease outcome, and covariates ($X$). For a subject missing one or more of the three types of measurements, we remove the corresponding measurement component(s). For example, if a case does not have lung aspirate (LA) measurement, we remove $M^{L}$ from the DAG.
3.2 Population and individual etiology estimations for a single sample with 500 cases and 500 controls with true $\pi = (0.67, 0.26, 0.07)^T$ and either 1% (N = 5) or 10% (N = 50) GS data on cases. In (a) or (b), the red circled plus shows the true population etiology distribution $\pi$. The closed curves are 95 percent credible regions: blue dashed lines “- - -”, light green solid lines “—”, black dotted lines “· · ·” correspond to analysis using BrS data only, BrS+GS data, GS data only, respectively; solid square/dot/triangle are the corresponding posterior means of $\pi$. The 95 percent highest density region of the uniform prior distribution is also visualized by red “· − · −” for comparison. 8 (= $2^3$) BrS measurement patterns and predictions for individual children are shown with different shapes, with measurement patterns attached to them. The radii of circles and numbers at the vertices show empirical frequencies of GS measurements belonging to A, B, or C.
3.3 Results using expert priors on TPRs. The observed BrS rates (with 95% confidence intervals) for cases and controls are shown on the far left. The conditional odds ratio given the other pathogens is listed with 95% confidence interval in the box to the right of the BrS data summary. Below the case and control observed rates is a horizontal line with a triangle. From left to right, the line starts at the estimated false positive rate (FPR) and ends at the estimated true positive rate (TPR), both obtained from the model. Below the TPR are two boxplots summarizing its posterior (top) and prior (bottom) distributions. The location of the triangle, expressed as a fraction of the distance from the FPR to the TPR, is the model-based point estimate of the etiologic fraction for each pathogen. The SS data are shown in a similar fashion to the right of the BrS data. The observed rate for the cases is shown with its 95% confidence interval. The estimated SS TPR ($\theta^{SS}_j$) with prior and posterior distributions is shown as for the BrS data, except that we plot 95% and 50% credible intervals for SS TPR above the boxplot for its prior distribution. See Appendix for pathogen name abbreviations.
3.4 Results using uniform priors on TPRs. As in Figure 3.3, but with uniform priors on the TPRs.
3.5 Summary of posterior distribution of pneumonia etiology estimates using expert (left) and uniform (right) priors on TPRs. In each subfigure, top: posterior (solid) and prior (dashed) distribution of viral etiology; bottom left: posterior etiology distribution for top two bacterial causes given bacteria is a cause; bottom right: posterior etiology distribution for top two viral causes given virus is a cause. B-rest and V-rest stand for the rest of bacteria and viruses other than the top two species, respectively. The nested blue circles are 95%, 80%, and 50% credible regions for population etiology estimates within bacterial or viral group.
3.6 Posterior predictive checking for 10 most frequent BrS measurement patterns among cases and controls with expert priors on TPRs.
3.7 Posterior predictive checking for pairwise odds ratios separately for cases (lower right triangle) and controls (upper left triangle) with expert priors on TPRs. Each entry is a standardized log odds ratio (SLOR): the observed log odds ratio for a pair of BrS measurements minus the mean LOR for the posterior predictive distribution divided by the standard deviation of the posterior predictive distribution. The first significant digit of each absolute SLOR is shown in red for positive and blue for negative values, and only those greater than 2 are shown.
4.1 Model structure that incorporates conditional dependence within each disease class illustrated by J = 5 pathogens (called A, B, C, D, and E) in the PERCH study. On the left are the control measurements that arise from a mixture of K = 2 conditionally independent subclass measurement profiles with mixing weights $\nu_1$ and $\nu_2$. Here $\psi_k^{(j)}$ is the false positive rate for pathogen $j$ in a subclass $k$. On the right are the J = 5 disease classes, one for each possible pathogen. Each case is assumed to be caused by a unique pathogen indicated by $I^L$ taking values in $\{1, \ldots, J\}$. For a class containing all cases whose $I^L = j_0$, the K = 2 subclasses of measurement profiles are assumed equal to the control false positive rates $\psi_k^{(j)}$ for $j \neq j_0$, and equal to the true positive rate $\theta_k^{(j)}$ for $j = j_0$, $k = 1, \ldots, K$. Within each disease class, two subclass measurement profiles are nested. The mixing weights of subclasses nested in the $j$th disease class are $\eta_1^{(j)}$ and $\eta_2^{(j)}$. $\pi = (\pi_1, \ldots, \pi_J)'$ are disease class mixing weights, and are called etiologic fractions.
4.2 Directed acyclic graph for the npLCM.
4.3 Misclassification rate comparisons between the pLCM and npLCM predictions. 50 simulated training data sets are generated under (a) scenario I (pLCM), or (b) scenario II (npLCM). Each training data set is then fitted by the pLCM (clear boxplots) or npLCM (filled boxplots) to produce individual predictions. In (a) and (b), the first 5 pairs of boxplots compare class-specific misclassification rates; the last pair compares the overall misclassification rates.
4.4 Comparison of population etiologic fraction posterior distributions between the pLCM (black) and npLCM (blue). On the left, the positive observation rates for cases and controls are plotted for each pathogen using connected blue dots; “+” and “*” denote posterior mean of $\theta^M_j$ and $\psi^M_j$, respectively; the fitted case rate is indicated by “δ”. On the right, the blue/black curves, numbers, and credible intervals above the curves denote the marginal posterior density, mean, and 50% and 95% credible intervals for $\pi_j$, $j = 1, \ldots, 10$ for the pLCM/npLCM models.
4.5 Individual diagnoses for the most frequent measurement patterns among the cases, separately predicted from the pLCM and the npLCM.
4.6 Posterior predictive distributions for checking of pairwise log odds ratios for the controls (top) and the cases (bottom).
5.1 The underlying structure of the paired-cluster randomized design. The top part (observed pair p) and bottom part (observed pair p′) are the two possible ways in which a single pair can be manifested in the design. Observed pair p has two clinical practices (represented by the two squares). For each clinical practice, the first row shows the mean and variance of patient outcomes if the clinical practice is assigned control and the second row shows the mean and variance if assigned intervention. The clinical practice actually assigned control is indicated by its placement in column “1”, and the clinical practice actually assigned intervention is in column “2”. The solid (nonsolid) ellipsoids show the means and variances that can (cannot) be estimated directly. Observed pair p′ shows how the same pair would be manifested in the design if the assignment of treatment to clinical practices were in reverse (a line with arrows connects the same clinical practice in these two different assignments). Condition 1 means that each of the two manifestations, p and p′, has the same probability.
5.2 Checking second level dependence.
Chapter 1
Introduction
1.1 Statistical challenges in individualized health
Due to the biotechnology and information technology revolutions of the past two decades,
novel individual-level health information is accumulating at an unprecedented pace. Biomed-
ical and public health researchers seek to intelligently use this new information to advance
their understanding of the mechanisms of population health and disease, and to create,
evaluate and iteratively improve treatments for the right person at the right time and right
place.
For example, pneumonia is a clinical syndrome associated with lung infection that can
be caused by a variety of bacteria, viruses or fungi. Recent studies estimated that pneumonia
kills more children than any other illness, more than AIDS, malaria and measles combined.
Over 1 million children died from pneumonia in 2010, accounting for almost one in five
deaths for children under five years old (Liu et al., 2012). In 2009, the Pneumonia Etiology
Research for Child Health (PERCH) study (Levine et al., 2012) was launched
at the Johns Hopkins Bloomberg School of Public Health with the goal of 1) identifying
the top pneumonia-causing pathogens for children under five, and 2) establishing an evi-
dence base for future patient pneumonia diagnosis. The study sites encompass 7 countries
in South Asia and Sub-Saharan Africa. In PERCH, for the first time in pneumonia etiol-
ogy research, comprehensive and standardized bioassays and other biotechnology-enabled
tests are systematically used to assess the presence or absence of more than 30 pneumonia-
causing pathogens in body fluids.
In PERCH and other health studies, there are at least four statistical problems that are
central to advancing individualized health goals: estimation of population etiology, al-
gorithms for individual diagnosis, intervention selections optimized given an individual’s
characteristics, circumstances and preferences, and evaluation of individualized interven-
tions. Each is briefly considered in turn.
Population etiology estimation. Here, population disease etiology is defined as the
population distribution of health states. The health states are commonly not directly ob-
served, and, for most diseases, such as cancer, they are time-varying. For example, in
PERCH, the relevant health state for pneumonia cases is defined by the pathogen currently
infecting a child’s lung. To capture relevant aspects about these unobserved quantities, a
series of measurements are collected for analyses. Measurements can be biological (DNA
sequence, epigenetic marks, biomarkers), clinical (symptom reports, patient history, medication
use), environmental (exposures), or others, and can be of variable quality. In
statistical analysis to estimate the distribution of health states in a population, it is important
to integrate measures to account for differential measurement errors in the data collection
process.
Individual diagnosis. Individual diagnosis can be improved by embedding a new
patient within a subpopulation with similar observed characteristics, either through post-
stratification or by study design. This reference to population data is indispensable given
the current state of medical knowledge because we lack the “laws of biology” that can
reliably predict the current or future health states of an individual using only her low-
dimensional measurements. For example, in cancer screening, the posterior probability
that a new person has cancer given her positive test result (positive predictive value) can be
estimated from data on the subpopulation with similar covariate profiles. This embedding
process proceeds iteratively as new patients continuously enrich our database and serve
as potential references for future cases with similar characteristics. The patient’s decision
about how to use such a probabilistic diagnosis depends on her loss function given the
available intervention options.
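
The positive predictive value mentioned above follows from Bayes' rule. The sketch below is only an illustration: the function name and the numerical prevalence, sensitivity and specificity are hypothetical values, not estimates from any study discussed in this thesis.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test) by Bayes' rule.

    All three inputs are assumed to describe the reference subpopulation
    with covariate profiles similar to the new patient's.
    """
    true_positive = prevalence * sensitivity
    false_positive = (1.0 - prevalence) * (1.0 - specificity)
    return true_positive / (true_positive + false_positive)


# Hypothetical subpopulation: 1% prevalence, 90% sensitivity, 95% specificity.
ppv = positive_predictive_value(prevalence=0.01, sensitivity=0.90, specificity=0.95)
print(round(ppv, 4))
```

Even with a fairly accurate test, a low prevalence in the reference subpopulation keeps the positive predictive value modest, which is why the choice of reference subpopulation matters for individual diagnosis.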
For example, in the PERCH study, we define $I^L_i$ to be individual $i$’s true infection state in the
lung. It is the individual’s health state, taking values in $\{0, 1, \ldots, J\}$, where we denote no
lung infection as 0 (observed for the controls) and infection caused by different pathogens
as $1, \ldots, J$ (unobserved for the cases, for a prespecified list of potential causes). An individual’s
latent health state and her measurements can be affected by a set of covariates $X_i$.
The study generates multivariate binary data $M_i$ from samples with potentially different
error rates. Using these data, the investigators want 1) to estimate the population etiologic
fractions $(\pi_1, \ldots, \pi_J)$, which are the fractions of pneumonia cases caused by each of the $J$
pathogens, and 2) to facilitate individual diagnosis by calculating the posterior probability
that a pneumonia case is caused by each of the putative pathogens in light of the individual’s
and the similar subpopulation’s measurements, i.e. $p_i = P(I^L_i = j \mid M_i, X_i, \text{Data})$.
Because clinicians may treat cases caused by different pathogens differently, understanding
the characteristics of the underlying health state, $I^L_i$, is essential for making
medical decisions.
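
The individual posterior probability above can be sketched in a few lines for a simplified setting. The sketch assumes conditional independence of the binary measurements given the causal pathogen, ignores covariates, and plugs in fixed values for the etiologic fractions and the true and false positive rates instead of averaging over their posterior distribution. The function names and all numbers are hypothetical.

```python
from math import prod


def bernoulli(m, p):
    """Probability of a binary measurement m when the positive rate is p."""
    return p if m == 1 else 1.0 - p


def etiology_posterior(m, pi, tpr, fpr):
    """Posterior probability that each of J pathogens caused a case.

    m   : list of J binary measurements (one per pathogen)
    pi  : etiologic fractions (plug-in values, summing to 1)
    tpr : true positive rate per pathogen
    fpr : false positive rate per pathogen
    Under cause j, measurement j is positive at rate tpr[j];
    every other measurement l is positive at rate fpr[l].
    """
    J = len(pi)
    weights = []
    for j in range(J):
        likelihood = prod(
            bernoulli(m[l], tpr[l] if l == j else fpr[l]) for l in range(J)
        )
        weights.append(pi[j] * likelihood)
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical case testing positive for the first of three pathogens only.
posterior = etiology_posterior(
    m=[1, 0, 0],
    pi=[0.5, 0.3, 0.2],
    tpr=[0.9, 0.9, 0.9],
    fpr=[0.1, 0.1, 0.1],
)
print([round(p, 4) for p in posterior])
```

In a full Bayesian analysis these probabilities would be averaged over posterior draws of the parameters; the plug-in version is used here only to keep the example short.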
Treatment option selection. After individual diagnosis, patients and their clinicians
choose the most suitable treatments based on the individual’s estimated health state, the
current knowledge of intervention effects for the subpopulation with similar measured char-
acteristics, and the individual’s preferences (loss function).
For example, many forms of breast cancer are based in part on genetic characteristics,
such as human epidermal growth factor receptor 2 (HER2)-positive and HER2-negative,
each with different prognoses (ASCO, 2007). Drugs like Trastuzumab (Herceptin) and
Lapatinib (Tykerb) specifically target HER2 and are recommended for women whose tumors
are HER2 positive (National Cancer Institute, 2014). HER2-negative patients can be
eligible for other targeted medicine (Genentech, 2014). That is, we have a health state, the
HER2 positive/negative status, on an individual that can guide the selection of therapies. If
we can estimate the health states well through screening tests, the treatment benefits for the
patient can be optimized by selecting the right medication.
If the underlying disease mechanism is less well understood, statistical techniques can
potentially be used to estimate the treatment effect on a subpopulation with a particular co-
variate profile. For example, after a clinical trial on a large population has been conducted,
we want to estimate the likely effect of the intervention for a subpopulation with specific
baseline covariates. Although the trial was originally designed to detect the treatment ef-
fect on the whole population, subgroup analysis methods have been developed for both
consistent and efficient (with smallest possible variance) estimation of the treatment effect
on subpopulations (Cai et al., 2011; Zhao et al., 2013).
Besides analytic approaches to improve efficiency in estimating subpopulation treat-
ment effect, efficient designs can also help. For example, adaptive clinical trial (designs)
(Berry et al., 2010) can confirm the effectiveness of a drug and identify the subpopulation
who benefit the most. In an adaptive design, a new patient’s probability of randomization
to the different treatment arms changes as a function of the accumulated outcomes to date.
Adaptive designs can result in the same amount of information about the relative interven-
tion effect on a subpopulation but with shorter trial duration or fewer study subjects.
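
To make the idea of outcome-dependent randomization concrete, the sketch below uses one simple and purely hypothetical rule: each arm's allocation probability is proportional to its posterior mean response rate under independent Beta(1, 1) priors. This is an illustrative assumption, not the design of Berry et al. (2010) or of any trial discussed here.

```python
def allocation_probabilities(successes, totals):
    """Randomization probabilities for the next patient.

    successes[k] and totals[k] are the accumulated number of responders and
    enrolled patients on arm k. With a Beta(1, 1) prior, the posterior mean
    response rate on arm k is (successes[k] + 1) / (totals[k] + 2).
    """
    posterior_means = [(s + 1) / (n + 2) for s, n in zip(successes, totals)]
    norm = sum(posterior_means)
    return [mean / norm for mean in posterior_means]


# Hypothetical interim data: 8/10 responders on arm A, 2/10 on arm B.
probs = allocation_probabilities(successes=[8, 2], totals=[10, 10])
print(probs)
```

As outcomes accumulate, the probabilities shift toward the arm that appears to perform better, which is the mechanism by which such designs can shorten a trial or reduce the number of subjects.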
The goal of subgroup analysis and adaptive designs is to find the right subpopulation
for a particular treatment. A related goal is to match treatment for each of multiple subpop-
ulations according to their covariates. When each subject receives multiple treatments over
time, for example in a crossover trial (Senn, 2002), we can estimate the patient’s health
states after each treatment, which can then be combined with other patients’ data to col-
lectively learn the optimal treatments specific for subgroups as defined by covariates. The
“N-of-1” trial is an extreme example where sequential treatments are assigned to a single
study subject and the variation in the measured health outcomes over time can be used to
estimate the intervention effects specific for each individual (Guyatt et al., 1990).
We review relevant literature in the final chapter for treatment option selection, but in
this thesis focus on the following question.
Evaluation of individualized interventions. After an individualized decision rule has
been applied to each patient, a key question is to what extent the individualized rules im-
proved the health outcomes for the entire population? Individualized rules will have clear-
cut benefits for subjects with some specific baseline profiles, but not for others. The latter
group of patients usually have baseline profiles under which benefits to all available treat-
ments are estimated with less precision. Therefore, even if individualized intervention rules
are adopted, assessment at the population level is useful for health care policy makers to
decide whether to adopt these individualized rules in their local populations. Similar to es-
timating the subpopulation intervention effects, consistency and efficiency considerations
are essential for objective and precise evaluation of the individualized interventions.
For example, in the Guided Care study (Boult et al., 2013), for each of the chronically
ill older patients in a clinical practice, a specially educated registered nurse is in place to 1)
help create an evidence-based plan of care, 2) coordinate the efforts of all clinicians who
provide the patient’s health care, 3) smooth the patient’s transitions between sites of care,
4) coach the patient’s self-management, and 5) educate and support family caregivers,
etc. To assess the hypothesis that such individualized health care can improve functional
outcomes measured by Short-Form (SF)-36 version 2 (Ware and Kosinski, 2001) and other
measures on quality of care and health services utilization, the investigators enrolled eli-
gible patients from 14 practices in Baltimore and DC areas and assigned specially trained
nurses to 7 of the practices under a matched-pair cluster randomized (MPCR) design. The
goal of policy interest is to consistently and efficiently compare the average measured out-
comes if all the study subjects in the population are assigned such specially trained nurses
versus the average if no such nurses are assigned.
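
As a point of reference for this comparison, the sketch below computes the simplest unadjusted estimate under a matched-pair cluster design: the average over pairs of the difference in cluster mean outcomes. It ignores covariates entirely, so it corresponds to the kind of naive approach that the abstract describes as biased under covariate imbalance; it is not the calibrated method developed in this thesis. The function name and the data are hypothetical.

```python
from statistics import mean


def unadjusted_pair_difference(pairs):
    """Average within-pair difference in cluster mean outcomes.

    pairs is a list of (intervention_outcomes, control_outcomes) tuples,
    one tuple per matched pair of clusters.
    """
    differences = [mean(treated) - mean(control) for treated, control in pairs]
    return mean(differences)


# Hypothetical outcomes for two matched pairs of clinical practices.
pairs = [
    ([3, 5], [2, 2]),  # pair 1: intervention practice, control practice
    ([4], [1, 3]),     # pair 2
]
estimate = unadjusted_pair_difference(pairs)
print(estimate)
```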
1.2 Organizational overview
Since the first part of this thesis will develop an extension of the latent class model
(LCM) (Goodman, 1974), in Chapter 2, we review the history and statistical properties of
the LCM. Then we discuss a related method, the Grade-of-Membership (GoM) model, that
can be shown to be equivalent to the LCM, usually with a smaller number of classes, to describe
the same marginal multivariate discrete distribution. Common to the LCM and GoM
is the flexibility for characterizing higher-order moments and their regression extensions.
The final section gives some technical theorems that describe the approximation properties
of the LCM using nonnegative matrix decompositions (Bhattacharya and Dunson, 2012),
which we will use in the development of npLCM in Chapter 4.
Chapter 3 presents the partially-latent class model (pLCM), which enables population
etiology estimation and individual diagnosis using data from a case-control design. Specif-
ically, the pLCM assumes that the health states of the cases are latent, while the health
states of the controls are observed, hence the name partially-latent. Given the unobserved
health state of a case subject, the pLCM, like the LCM, assumes conditional independence
among multivariate binary measurements, with the positive rate for each dimension equaling
either the true positive rate or the false positive rate; the latter can be estimated from the
controls. Measurements with different error rates, which we term gold, silver and bronze standards,
are integrated systematically in the pLCM through combined likelihood specifications. The
detailed model specification can be found in Section 3.2. Also, to inform allocation of sam-
pling and laboratory resources and to improve future study designs, in Section 3.3 and the
appendix to Chapter 3, we quantify the fraction of information about the mixing weight of
a particular class of the health states that derives from each level of measurement. Section
3.4 introduces graphical displays of the population data and inferred latent-class frequen-
cies, which are effective tools for communications between the statisticians and the domain
experts during the course of this research.
Building upon the pLCM, Chapter 4 introduces the nested partially-latent class model
(npLCM) to allow for conditional dependence. Sections 4.2 and 4.3 present the model likelihood
and noninterference submodels, which formalize the heuristics that the joint measurement
distribution for controls informs the case model. We show by simulation studies in Section
4.4 the degree to which ignoring conditional dependence can lead to bias in the estimation
of population etiology and individual diagnosis. Analyses of subsets of the PERCH data
under both the pLCM and the npLCM are presented and compared.
Chapter 5 begins Part II of the thesis in which we develop methods for evaluating an
individualized intervention when it is applied to a population of clinical practices. We aim
to obtain consistent and efficient evaluation by leveraging individual-level or cluster-level
covariates when the data has been collected from a special design, called the matched-pair
cluster randomized design. One goal of policy interest is to estimate the average outcome
if all clusters in all pairs are assigned to control versus if all clusters in all pairs are assigned
to intervention. Section 5.2 formulates the study design and the observed data likelihood in
terms of the potential outcome framework. Under this framework, Section 5.3 shows that
previous meta-analytic approaches have implicitly assumed conditional independence be-
tween pair-specific mean outcome differences and variances given the population treatment
effect, hence may lead to bias if such an assumption is inappropriate. Bias can also occur
if we do not account for the covariate imbalances that may still exist between clusters in a
pair after matching. We propose a covariate-calibrated estimator to reduce these biases and
improve efficiency (Wu et al., 2014b). Lastly, the methodology is illustrated by an analysis
using data from the Guided Care study.
Chapter 6 summarizes the contribution of this dissertation and suggests open questions
for future research.
1.3 Software
The R and WinBUGS programs to reproduce all the results in this thesis work are
available at the following website:
http://www.biostat.jhsph.edu/~zhwu/software/thesis.code.zip
The programs are organized into two main folders: npLCM and MPCR. The npLCM
folder contains all of the programs used for fitting the nested partially-latent class models
(npLCM) and creating our proposed visualizations described in Chapters 3 and 4. The MPCR
folder contains the function for producing the covariate-calibrated estimator described in
Chapter 5 and the graphics/tables presented therein. Both folders contain descriptions of
how to use the programs including data format requirements, model specifications, and
options for outputs.
Chapter 2
Latent Class Models
2.1 Brief history and formulation
Latent variables are used in statistical models to represent individual characteristics
usually of scientific interest that are not directly observable. A goal of analysis is to infer
the values of the latent variables from other observable quantities. Some examples of latent
variables include depression or other mental states, disability, and intelligence. These latent
variables are clearly understood in their respective contexts while not directly measured.
Models that relate the latent variables and manifest (observed) responses (e.g. mea-
surements of symptoms, detection of pathogens) for an individual can be described by the
general term “latent structure model”, as summarized and discussed at book length
by Lazarsfeld and Henry (1968). The observed variables are usually assumed to be conditionally
independent given the values of the latent variables. These models serve the
purpose of summarizing the observed variations of measurements in terms of a vector of
low-dimensional latent variables, which are underlying constructs of interest. McCutcheon
(1987) defined four types of latent structure models: factor analysis (continuous outcomes
and continuous latent variables), latent trait model (discrete outcomes and continuous latent
variables), latent profile model (continuous outcomes and discrete latent variables), and latent
class model (discrete outcomes and discrete latent variables). In this chapter, we focus on
the latent class model (LCM) for multivariate binary data with latent variables that take
values in a finite set of classes, because it is relevant to the motivating PERCH application
detailed in Chapters 3 and 4.
Let $M_i = (M_{i1}, \ldots, M_{iJ})'$ be a $J$-dimensional binary measurement for subject
$i = 1, 2, \ldots, N$, comprising the observed binary outcomes on $J$ items, e.g., questions in
psychometrics, or pathogen presence/absence in infectious disease research. Let the latent
variable for individual $i$ be discrete and take its value in a finite set, $Z_i \in \{1, \ldots, K\}$,
where $K$ is the number of possible classes. Let $\nu_k = P(Z_i = k)$ denote the probability that the
$i$th person is in the $k$th class. In the PERCH application, the control measurement $M_i$ is a
vector of measurements on $J$ different species of pathogens. Given the latent variable $Z_i$
for individual $i$, the LCM assumes that the $J$ measurements are conditionally independent
of one another, with the conditional distribution
\[
\mathrm{pr}(M_i \mid Z_i = k, p_{Z_i}) = \prod_{j=1}^{J} p_{jk}^{M_{ij}} (1 - p_{jk})^{1 - M_{ij}}, \tag{2.1.1}
\]
where $p_k = \{p_{jk} = P(M_{ij} = 1 \mid Z_i = k),\ j = 1, \ldots, J\}$ is the vector of conditional
probabilities, with $p_{jk}$ the probability that an individual $i$ who is in latent class $k$ will
have a positive response on the $j$th dimension. In words, the conditional independence says
that if the latent variable $Z_i$ is observed, then the observed multivariate measurements,
$M_{i1}, M_{i2}, \ldots, M_{iJ}$, are not informative about one another. For any pair of margins
$(j, j')$, the observed marginal association between $M_{ij}$ and $M_{ij'}$ is induced by their
separate associations with the shared latent variable $Z_i$.
When the study population is a mixture of subjects with unknown latent variables $Z_i$, the
observed data distribution is a finite mixture distribution
\[
\mathrm{pr}(M_i; \nu, P) = \sum_{k=1}^{K} \nu_k \cdot \prod_{j=1}^{J} p_{jk}^{M_{ij}} (1 - p_{jk})^{1 - M_{ij}}, \tag{2.1.2}
\]
where P is a J × K matrix with columns being conditional probability vectors pk, k =
1, ..., K. Assuming that the sampled individuals are mutually independent, we obtain the
full likelihood specification of the LCM
\[
\mathrm{pr}(M_i, i = 1, \ldots, N \mid \nu, P) = \prod_{i=1}^{N} \sum_{k=1}^{K} \nu_k \cdot \prod_{j=1}^{J} p_{jk}^{M_{ij}} (1 - p_{jk})^{1 - M_{ij}}. \tag{2.1.3}
\]
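To make (2.1.2) and (2.1.3) concrete, the following minimal Python sketch evaluates the LCM log-likelihood for binary data. It is an illustration only: the function names and the toy values of ν and P are assumptions made for this example and are not part of the thesis software.

```python
import math

def lcm_pattern_prob(m, nu, P):
    """Marginal probability (2.1.2) of one binary pattern m of length J.

    nu: list of K mixing weights; P: J x K nested list with P[j][k] = p_jk.
    """
    total = 0.0
    for k, nu_k in enumerate(nu):
        prod = nu_k
        for j, m_j in enumerate(m):
            p = P[j][k]
            prod *= p if m_j == 1 else (1.0 - p)
        total += prod
    return total

def lcm_loglik(M, nu, P):
    """Log of the full likelihood (2.1.3), assuming independent subjects."""
    return sum(math.log(lcm_pattern_prob(m, nu, P)) for m in M)

# Toy values (assumed for illustration): J = 3 items, K = 2 classes.
nu = [0.6, 0.4]
P = [[0.9, 0.2],
     [0.8, 0.1],
     [0.7, 0.3]]
M = [[1, 1, 1], [0, 0, 0], [1, 0, 1]]
print(lcm_loglik(M, nu, P))
```

Summing `lcm_pattern_prob` over all 2^J binary patterns returns one, which is a quick check that (2.1.2) defines a proper distribution.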
Remark 1. Goodman (1974) and Haberman (1979) also noted that the LCM can be equivalently
formulated as a log-linear model for a contingency table where one of the categorical
variables, the latent variable, is unobserved.
The popularity of the LCM lies in its ability to describe correlated binary measurements,
which otherwise lack a standard joint distribution analogous to the multivariate
Gaussian distribution in the continuous case. As discussed in Section 2.5, the LCM can
approximate any multivariate binary distribution with arbitrary precision if the number of
latent classes K is large enough.
The LCM is often used as a clustering tool since each observation is assigned a probability
of being in each of the K classes. One area in which the LCM is particularly
useful is the evaluation of medical diagnostic tests, especially when no gold standard is
available to directly observe the disease status (Albert et al., 2001; Pepe and Janes, 2007).
More specifically, suppose J diagnostic tests are applied to an individual and that we cannot
directly observe the disease status $Z_i \in \{0, 1\}$ of the individual. We use the J-dimensional
measurements to infer $Z_i$. The LCM collectively uses the information provided by different
diagnostic tests with possibly varied error rates, to accomplish three tasks: 1) estimate the
mixing weights ν, 2) estimate the conditional probabilities P that describe the measure-
ment characteristics (e.g. error rates) given known disease status, and 3) predict the vector
of probabilities that an individual i belongs to each of the classes.
Task 1) is related to understanding the population structure, i.e., estimating the prevalence
of each latent class. It is sometimes useful to stratify the population into groups
within each of which the measurement characteristics are homogeneous and can be sum-
marized by the conditional probabilities estimated in 2).
By inspecting estimates obtained in 2), we can assign substantive meanings to each es-
timated latent class. For example, in the research of functional disability related to aging in
the US population (Corder and Manton, 1991), one of the research questions is “what are
the characteristics of each functional disability severity category in terms of measured ac-
tivities?”. The conditional probabilities can describe the response probabilities to questions
like “can you do heavy/light housework?” or “can you get about outside?”. A class
estimated with larger values of these conditional probabilities comprises individuals
with high functional disabilities, while another class with small estimates of
the conditional probabilities represents the subgroup of people whose physical status is
relatively good.
The final task 3) can be accomplished in the Bayesian framework, where the individual
probability is the posterior probability of class membership given her measurements and
the population data. It assigns each study subject probabilistically to the estimated latent
classes, which can guide clinical decisions for this individual.
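Task 3) reduces to Bayes' rule once values of ν and P are given. The sketch below is a plug-in illustration under assumed parameter values; in a fully Bayesian analysis these probabilities would instead be averaged over posterior draws of (ν, P). The function name and the numbers are hypothetical.

```python
def posterior_class_probs(m, nu, P):
    """P(Z_i = k | M_i = m) for k = 0..K-1, by Bayes' rule with fixed (nu, P).

    m: binary pattern of length J; nu: K mixing weights;
    P: J x K nested list with P[j][k] = P(M_ij = 1 | Z_i = k).
    """
    joint = []
    for k, nu_k in enumerate(nu):
        lik = 1.0
        for j, m_j in enumerate(m):
            lik *= P[j][k] if m_j == 1 else 1.0 - P[j][k]
        joint.append(nu_k * lik)
    norm = sum(joint)
    return [x / norm for x in joint]

# Assumed toy setting: two diagnostic tests, class 0 = non-diseased, class 1 = diseased.
nu = [0.8, 0.2]                 # prevalence of disease is 0.2
P = [[0.1, 0.9],                # test 1: false positive rate 0.1, sensitivity 0.9
     [0.2, 0.8]]                # test 2: false positive rate 0.2, sensitivity 0.8
print(posterior_class_probs([1, 1], nu, P))
```

With both tests positive, the joint terms are 0.8 × 0.1 × 0.2 = 0.016 and 0.2 × 0.9 × 0.8 = 0.144, so the posterior probability of disease is 0.144/0.160 = 0.9, up from the prior 0.2.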
More applications of the LCM and their substantive interpretations can be found in
the areas of diagnosis and rater agreement (e.g., Albert et al. (2001), Uebersax (1988),
Uebersax and Grove (1993), Dillon and Mulani (1984), Gelfand and Solomon (1973)),
psychiatry (e.g., Young (1983); Eaton et al. (1989); Sullivan et al. (1998)), education (e.g.,
Aitkin et al. (1981); Uebersax (1997)), and infectious disease studies (e.g., Jokinen and
Scott (2010)).
2.2 Identifiability
Potential non-identifiability of the LCM parameters is well known. For example, an
LCM with four observed binary indicators and three latent classes is not identifiable de-
spite providing 15 degrees-of-freedom to estimate 14 parameters (Goodman, 1974). In
latent variable analysis, the model identifiability is usually discussed in a local sense as
first described by McHugh (1956). We call a distribution F “locally identifiable” if at the
parameter ψ0, there exists some neighborhood N (ψ0) such that
\[
F_M(m; \psi_0) = F_M(m; \psi)\ \ \forall m \in \mathrm{supp}(F) \iff \psi = \psi_0, \quad \forall \psi \in \mathcal{N}(\psi_0) \cap \Psi, \tag{2.2.1}
\]
where M denotes the random vector of measurements, Ψ is the parameter space and
supp(F ) is the support set of distribution F .
In the standard LCM with K classes, the parameter vector ψ = (ν, P) is of dimension
K(J + 1) − 1 as defined in the previous section. Goodman (1974) concluded that a model
is identifiable if $K < 2^{(J-1)/2}$ and not identifiable if $K > 2^J/(J + 1)$. Models where
neither of the two inequalities holds may or may not be identifiable. Such identifiability
is theoretical and may require a very large sample size in order to distinguish the model
likelihood over two parameter values.
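The counting argument behind these conditions can be checked numerically. The sketch below assumes the two bounds are read as K < 2^((J−1)/2) and K > 2^J/(J + 1); the function names are hypothetical and the code only restates the inequalities, it does not establish identifiability of any particular model.

```python
def lcm_num_params(J, K):
    """Free parameters: (K - 1) mixing weights plus J * K response probabilities."""
    return K * (J + 1) - 1

def lcm_degrees_of_freedom(J):
    """Free cell probabilities in the 2^J table of J binary indicators."""
    return 2 ** J - 1

def goodman_check(J, K):
    """Classify (J, K) by the two bounds as read above (an assumption)."""
    if K < 2 ** ((J - 1) / 2):
        return "identifiable"
    if K > 2 ** J / (J + 1):
        return "not identifiable"
    return "undetermined"

# The four-indicator, three-class example: 14 parameters, 15 degrees of freedom.
print(lcm_num_params(4, 3), lcm_degrees_of_freedom(4), goodman_check(4, 3))
```

For J = 4 and K = 3 neither bound applies, which is consistent with that model needing a separate argument to show that it is not identifiable.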
Remark 2. Jones et al. (2010) discuss from a geometric perspective the global identifiability,
weak identifiability, and partial identifiability in the context of multiple diagnostic
testing in the absence of a gold standard.
When the sample size is finite or the number of individuals in a latent class is small,
the data may not be fully informative about the parameters in that class, and weak estimability
(Dawid, 1979; Gelfand and Sahu, 1999) can occur. Weak estimability occurs when the
technical conditions for identifiability are met, but the data provide little information about
particular parameters, so that their posterior and prior distributions are similar. Suppose
the model is denoted by L(ψ;M) and ψ is partitioned as ψ = (ψ1, ψ2). If f(ψ2 |
ψ1,M) = f(ψ2 | ψ1), then we say ψ2 is weakly estimable. The data M do not provide
extra information about ψ2 given ψ1 beyond the prior conditional distribution f(ψ2 | ψ1).
Therefore, ψ2 cannot be identified from the data. Note, however, that this does not mean
f(ψ2 | M) = f(ψ2), because if ψ1 is identifiable from the data, then ψ2 can be indirectly
learned through integrating over [ψ1 | Data] in the prior conditional distribution f(ψ2 | ψ1),
where the conditioning set has been estimated from the data (Gustafson et al., 2001).
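The indirect learning described above can be written out in one line. The following short derivation uses only the defining condition f(ψ2 | ψ1, M) = f(ψ2 | ψ1) and the law of total probability:

```latex
\[
f(\psi_2 \mid M)
  = \int f(\psi_2 \mid \psi_1, M)\, f(\psi_1 \mid M)\, d\psi_1
  = \int f(\psi_2 \mid \psi_1)\, f(\psi_1 \mid M)\, d\psi_1 .
\]
```

The right-hand side replaces the prior f(ψ1) in the prior marginal ∫ f(ψ2 | ψ1) f(ψ1) dψ1 by the posterior f(ψ1 | M), so the marginal posterior of ψ2 generally differs from its prior whenever the data update ψ1 and ψ2 depends on ψ1 a priori.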
When a model is not locally identifiable, we cannot estimate an LCM with likelihood
methods. But if we have sources of prior information about a subset of the parameters,
we can estimate the latent class model by Bayesian methods. The Bayesian framework
can avoid the identifiability issue by supplying prior information on model parameters, and
the posterior distribution is a legitimate summary of both prior and likelihood information
(Gustafson, 2009).
2.3 Estimation by Markov chain Monte Carlo
In the Bayesian approach, there is no distinction between latent variables and parameters;
all are considered random quantities whose distributions are to be updated given data.
A survey of Markov chain Monte Carlo (MCMC) methods can be found in Robert and
Casella (1999), Gilks et al. (1996), or Brooks et al. (2011). MCMC methods have been
used for a variety of latent variable models, including generalized linear mixed models
(e.g., Zeger and Karim (1991); Clayton (1996)), multilevel models, covariate measurement
models, etc. For the LCM, Garrett and Zeger (2000) detailed the MCMC algorithm to draw
approximating samples from the joint posterior distribution of all the unknowns (model pa-
rameters and latent variables). An important advantage of MCMC is that the approach
can be used to estimate complex models for which other methods are either unfeasible
or work poorly. Another advantage is that any characteristics of the posterior distribution
can be investigated based on stationary simulated values, for instance posterior means and
percentiles.
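For readers who want to see the mechanics, the following is a minimal Gibbs sampler for the LCM written with the Python standard library only. It is a sketch under stated assumptions, not the algorithm of Garrett and Zeger (2000) nor the thesis software: we assume a Dirichlet(1, ..., 1) prior on ν and independent Beta(1, 1) priors on each p_jk, and all function names are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution via normalized gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def gibbs_lcm(M, K, n_iter=200, burn_in=100, seed=1):
    """Return post-burn-in draws of (nu, P) for binary data M (N x J nested list)."""
    rng = random.Random(seed)
    N, J = len(M), len(M[0])
    nu = [1.0 / K] * K
    P = [[rng.random() for _ in range(K)] for _ in range(J)]
    kept = []
    for it in range(n_iter):
        # Step 1: sample each class indicator Z_i given (nu, P) and M_i.
        Z = []
        for m in M:
            weights = []
            for k in range(K):
                w = nu[k]
                for j in range(J):
                    w *= P[j][k] if m[j] == 1 else 1.0 - P[j][k]
                weights.append(w)
            Z.append(rng.choices(range(K), weights=weights)[0])
        # Step 2: sample the mixing weights given the class counts.
        counts = [Z.count(k) for k in range(K)]
        nu = sample_dirichlet([1.0 + c for c in counts], rng)
        # Step 3: sample each response probability given class members' data.
        for j in range(J):
            for k in range(K):
                pos = sum(M[i][j] for i in range(N) if Z[i] == k)
                P[j][k] = rng.betavariate(1.0 + pos, 1.0 + counts[k] - pos)
        if it >= burn_in:
            kept.append((list(nu), [row[:] for row in P]))
    return kept
```

Posterior means and percentiles are then computed from the retained draws. Because the class labels are exchangeable under these symmetric priors, the draws may exhibit label switching, which this sketch does not address.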
There are several tuning parameters that must be chosen in using MCMC methods.
The first is the burn-in period, or the number of initial iterations to discard while the
Markov chain is converging to its stationary distribution. In WinBUGS, this
period is also used to tune the parameters of the Metropolis-Hastings (MH) proposal
distributions (Spiegelhalter et al., 2003). The second tuning constant is the total length of
the MCMC run. It is considerably more difficult to monitor convergence to a distribution than
to a point. A popular approach is to use an arbitrarily large number of iterations or to run a
number of chains with different initial values to assess convergence (Gelman et al., 2013). Approaches
based on Monte Carlo errors have been proposed (Flegal et al., 2008) to ensure acceptable
precision of the estimates like the posterior means. It can be particularly hard to judge
convergence of the estimates when there is slow mixing, that is, when the chain moves
slowly through the bottlenecks of the target distribution. When the mixing is poor, the
chain has to be run for a very long time to obtain accurate estimates.
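One common diagnostic built on multiple chains is the potential scale reduction factor, which compares between-chain and within-chain variability of a scalar quantity. The sketch below implements the basic (non-split) version; whether this exact statistic was used for the analyses in this thesis is not stated, and the function name is hypothetical.

```python
def potential_scale_reduction(chains):
    """Basic potential scale reduction factor for one scalar parameter.

    chains: list of m >= 2 lists, each holding n >= 2 post-burn-in draws.
    Values near 1 suggest the chains have mixed; larger values suggest
    that more iterations are needed. Requires nonzero within-chain variance.
    """
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # Between-chain variance B and average within-chain variance W.
    B = n * sum((mu - grand) ** 2 for mu in means) / (m - 1)
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_plus = (n - 1) / n * W + B / n
    return (var_plus / W) ** 0.5

# Two chains stuck in different regions give a value far above 1.
print(potential_scale_reduction([[0, 1, 0, 1], [10, 11, 10, 11]]))
```

In practice the statistic is computed for each monitored parameter, and slow mixing typically shows up as values that stay well above 1 as the chains are lengthened.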
2.4 Grade-of-Membership model
The LCM (2.1.3) represents the joint distribution of multivariate binary responses as
a mixture of conditionally independent (product) distributions, one for each class. The
Grade-of-Membership (GoM) model, developed by Max Woodbury in the 1970s for medical
classification (Woodbury et al., 1978; Clive et al., 1983), is another approach to characterize
the multivariate distribution of categorical variables, with a potentially more parsimonious
representation compared to the LCM (Manton et al., 1994; Singer, 1989; Erosheva et al.,
2007; Bhattacharya and Dunson, 2012). The GoM model is especially useful when the majority
of the cells in the observed contingency table have small or zero counts. The GoM
has been applied in genetic studies (Pritchard et al., 2000), studies of functional disabilities
(Erosheva et al., 2007), and topic modeling (Blei et al., 2003).
Specifically, let $g_i = (g_{i1}, g_{i2}, \ldots, g_{iK})'$ be a latent partial membership vector for
individual $i$ comprising $K$ nonnegative random variables that sum to one. Define an “extreme
profile” to be a vector of conditional response probabilities
$\lambda_{kjm_j} = P(M_{ij} = m_j \mid g_{ik} = 1, g_{ik'} = 0, k' \neq k)$ when the individual is
entirely a member of class $k$, that is $g_{ik} = 1$ and $g_{ik'} = 0$ for all $k' \neq k$, for
$k = 1, 2, \ldots, K$, $j = 1, 2, \ldots, J$, and $m_j = 1, 2, \ldots, D_j$, with $D_j$ being the number
of categories on the $j$th dimension of measurements. The set of
conditional response probabilities must satisfy the following constraint
\[
\sum_{m_j=1}^{D_j} \lambda_{kjm_j} = 1, \tag{2.4.1}
\]
for $k = 1, 2, \ldots, K$, and $j = 1, 2, \ldots, J$.
Given the partial membership vector $g_i \in [0, 1]^K$, the conditional distribution of the observed
measurement $M_{ij}$ is given by a convex combination of the extreme profiles’ conditional
response probabilities, that is
\[
P(M_{ij} = m_j \mid g_i) = \sum_{k=1}^{K} g_{ik} \lambda_{kjm_j}, \tag{2.4.2}
\]
for $j = 1, 2, \ldots, J$, and $m_j = 1, 2, \ldots, D_j$. Similar to the LCM, the local independence
assumption states that the manifest variables, $M_{i1}, \ldots, M_{iJ}$, are conditionally independent given
the latent variables $g_i$. Under this assumption, the conditional probability of observing
response pattern $M_i = m$, with $m = (m_1, \ldots, m_J)'$, is
\[
P(M_i = m \mid g_i) = \prod_{j=1}^{J} \left( \sum_{k=1}^{K} g_{ik} \lambda_{kjm_j} \right). \tag{2.4.3}
\]
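By marginalizing over the distribution of the latent vector $g_i$ (denoted as $G(\cdot)$), we obtain the
observed marginal distribution for response pattern $m$:
\begin{align*}
P(M_i = m) &= \int P(M_i = m \mid g_i)\, dG(g_i) \tag{2.4.4} \\
&= \int \prod_{j=1}^{J} \left( \sum_{k=1}^{K} g_{ik} \lambda_{kjm_j} \right) dG(g_i). \tag{2.4.5}
\end{align*}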
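The integral in (2.4.5) generally has no closed form, but it can be approximated by simulation. The sketch below assumes, purely for illustration, that G is a Dirichlet distribution with parameter `alpha`; the thesis text leaves G unspecified at this point. Categories are coded 0, ..., D_j − 1 rather than 1, ..., D_j, and all names and toy values are hypothetical.

```python
import random

def gom_conditional_prob(m, g, lam):
    """Conditional pattern probability (2.4.3).

    m: response pattern of length J; g: K membership scores summing to one;
    lam[k][j][c] = P(M_ij = c | individual is entirely in extreme profile k).
    """
    prob = 1.0
    for j, c in enumerate(m):
        prob *= sum(g[k] * lam[k][j][c] for k in range(len(g)))
    return prob

def gom_marginal_prob_mc(m, lam, alpha, n_draws=5000, seed=0):
    """Monte Carlo approximation of (2.4.5), assuming G = Dirichlet(alpha)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        gam = [rng.gammavariate(a, 1.0) for a in alpha]
        s = sum(gam)
        g = [x / s for x in gam]          # one draw of g_i from G
        total += gom_conditional_prob(m, g, lam)
    return total / n_draws

# Assumed toy values: K = 2 extreme profiles, J = 2 binary items.
lam = [[[0.9, 0.1], [0.8, 0.2]],
       [[0.2, 0.8], [0.3, 0.7]]]
print(gom_marginal_prob_mc([0, 1], lam, alpha=[1.0, 1.0]))
```

Because each conditional distribution sums to one over response patterns, the Monte Carlo estimates for all patterns, computed with the same seed, also sum to one up to floating-point error.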
Both the LCM and the GoM are finite mixture models but they differ in the level of
mixture. The LCM’s marginal distribution or integrated likelihood (2.1.3) can simplify
to a summation of K components, which is usually termed population-level mixture. In
contrast, the functional form of the marginal distribution of responses in the GoM (2.4.5)
cannot simplify to a finite sum, and is similar to the structure in the random-effects model
where the random effects gi follow a continuous distribution G(·). Erosheva et al. (2007)
termed the GoM an individual-level mixture model because an individual has her specific
vector of mixing weights gi over K extreme profiles.
If the number of mixture components used in the LCM can be different from that in
the GoM, connections between LCM and GoM can be established. Specifically, the book
Woodbury et al. (1978) first pointed out the nested property of the LCM and GoM. In its
book review, Haberman (1995) suggested that the GoM is a special case of the LCM with
a set of restrictions imposed upon a latent class model. We can construct an LCM such
that its marginal distribution of manifest variables is exactly the same as under the GoM
model. Erosheva et al. (2007) showed that an equivalence between the individual-level and
the population-level mixture models exists and can be summarized by the following theorem.
Theorem 2.4.1. (Fundamental Representation Theorem, Theorem 3.2, Erosheva et al. (2007))
Given J manifest variables, any individual-level mixture model with K components can
be represented as a constrained population-level mixture model with $K^J$ components.
The fundamental representation theorem indicates that the MCMC algorithm with data
augmentation that has been developed for the LCM can also be used for posterior calculation
under the GoM model (Erosheva et al., 2007).
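The count of K^J components can be seen by expanding the product over j in (2.4.3) into a sum over all assignments (z_1, ..., z_J) of a class to each manifest variable. The sketch below checks this algebraic identity numerically for a fixed membership vector g; it illustrates the expansion only and is not a proof of the theorem. Function names and toy values are hypothetical, and categories are coded from 0.

```python
from itertools import product

def product_of_mixtures(m, g, lam):
    """Left-hand side: product over items of K-term mixtures, as in (2.4.3)."""
    prob = 1.0
    for j, c in enumerate(m):
        prob *= sum(g[k] * lam[k][j][c] for k in range(len(g)))
    return prob

def expanded_mixture(m, g, lam):
    """Right-hand side: one mixture over all K**J item-level class assignments.

    Returns the probability and the number of mixture components visited.
    """
    K, J = len(g), len(m)
    total, n_terms = 0.0, 0
    for z in product(range(K), repeat=J):
        weight, lik = 1.0, 1.0
        for j in range(J):
            weight *= g[z[j]]                 # weight of assignment z
            lik *= lam[z[j]][j][m[j]]         # product-form component for z
        total += weight * lik
        n_terms += 1
    return total, n_terms

# Assumed toy values: K = 2 profiles, J = 3 binary items.
lam = [[[0.9, 0.1], [0.8, 0.2], [0.6, 0.4]],
       [[0.2, 0.8], [0.3, 0.7], [0.5, 0.5]]]
g = [0.25, 0.75]
print(product_of_mixtures([1, 0, 1], g, lam), expanded_mixture([1, 0, 1], g, lam))
```

Each of the K^J terms is a conditionally independent (product) component, and the components share the same K × J table of response probabilities, which is where the constraints in the population-level representation come from.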
When covariates are considered to influence the probability that an individual belongs
to different latent classes, Dayton and Macready (1988), Bandeen-Roche et al. (1997), and
Huang and Bandeen-Roche (2004) have extended the LCM to the regression setting and
termed such extensions latent class regression models (LCRM). The GoM regression
extension has also been developed in several PhD theses (Connor, 2006; Manrique-Vallier,
2010) in the longitudinal study of disability survey data.
2.5 Approximation properties
The LCM can approximate a multivariate discrete distribution arbitrarily well if the
number of classes (K) is sufficiently large. When the dimension is J = 2, the resulting
$D_1 \times D_2$ contingency table has cell $(m_1, m_2)$ containing the count
$\sum_{i=1}^{n} 1\{M_{i1} = m_1, M_{i2} = m_2\}$, for $m_1 = 1, \ldots, D_1$ and $m_2 = 1, \ldots, D_2$.
Let the contingency table be represented by $p_0 = \{P(M_{i1} = m_1, M_{i2} = m_2)\}$. The LCM
is equivalent to the finite mixture specification
\[
P(M_{i1} = m_1, M_{i2} = m_2) = \sum_{k=1}^{K} \nu_k \psi^{(1)}_{km_1} \psi^{(2)}_{km_2}, \tag{2.5.1}
\]
where $\nu = (\nu_1, \ldots, \nu_K)$ is a vector of mixture probabilities.
Good (1969) first noted the similarity between the singular value decomposition (SVD)
and the latent structure model, without providing a proof that any probability matrix
$p_0 \in \Pi_{D_1 D_2}$ can be decomposed as in (2.5.1). When the dimension is larger than 2, let
\[
p_0 = \{P(M_{i1} = m_1, \ldots, M_{iJ} = m_J) = p_{m_1 \cdots m_J},\ m_j = 1, \ldots, D_j,\ j = 1, \ldots, J\} \in \Pi_{D_1 \cdots D_J}
\]
denote a higher order tensor, with $\Pi_{D_1 \cdots D_J}$ denoting the set of all probability tensors of size
$D_1 \times D_2 \times \cdots \times D_J$, where the probability tensors have nonnegative elements and
satisfy the constraint
\[
\sum_{m_1=1}^{D_1} \cdots \sum_{m_J=1}^{D_J} |p_{m_1 m_2 \cdots m_J}| = 1.
\]
The following corollary states that any such probability tensor can be decomposed into
a form equivalent to that in the LCM formulation.
Theorem 2.5.1. (Corollary 1, Dunson and Xing (2009))
\[
p = \sum_{k=1}^{K} \nu_k \Psi_k, \qquad \Psi_k = \psi^{(1)}_k \otimes \psi^{(2)}_k \otimes \cdots \otimes \psi^{(J)}_k,
\]
where $\otimes$ is the Kronecker product, $\nu = (\nu_1, \ldots, \nu_K)'$ is a probability vector that sums to one,
$\Psi_k \in \Pi_{D_1 \cdots D_J}$, and $\psi^{(j)}_k$ is a $D_j \times 1$ probability vector, for $k = 1, \ldots, K$ and $j = 1, \ldots, J$.
This makes clear that any multivariate categorical data distribution can be expressed as
a latent structure model,
\[
P(M_{i1} = m_1, \ldots, M_{iJ} = m_J) = \sum_{k=1}^{K} \nu_k \prod_{j=1}^{J} \psi^{(j)}_{km_j},
\]
where $\nu$ is a vector of component probabilities, $Z_i \in \{1, \ldots, K\}$ is a latent class index,
the elements of $M_i = (M_{i1}, \ldots, M_{iJ})'$ are conditionally independent given $Z_i$, and
$P(M_{ij} = m_j \mid Z_i = k) = \psi^{(j)}_{km_j}$ is the probability of $M_{ij} = m_j$ given allocation of
individual $i$ to class $k$.
The decomposition above is referred to as nonnegative PARAFAC decomposition (Shashua
and Hazan, 2005), which is one way of generalizing the matrix singular value decomposi-
tion. Its goal is to express the tensor as a sum of K rank 1 tensors.
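As a small illustration of the rank-one sum, the sketch below builds a probability tensor from assumed values of ν and the vectors ψ_k^(j), storing it as a dictionary keyed by cell. It shows the forward construction only, not how to find a decomposition of a given tensor; names and numbers are hypothetical, and categories are coded from 0.

```python
from itertools import product

def parafac_tensor(nu, psi):
    """Build p = sum_k nu_k * (psi_k^(1) x ... x psi_k^(J)) cell by cell.

    nu: K component probabilities; psi[k][j]: probability vector of length D_j.
    Returns a dict mapping each cell (m_1, ..., m_J) to its probability.
    """
    K, J = len(nu), len(psi[0])
    dims = [len(psi[0][j]) for j in range(J)]
    p = {}
    for cell in product(*[range(d) for d in dims]):
        value = 0.0
        for k in range(K):
            term = nu[k]
            for j in range(J):
                term *= psi[k][j][cell[j]]    # entry of the k-th rank-one tensor
            value += term
        p[cell] = value
    return p

# Assumed toy values: K = 2 components, J = 3 variables with 2, 3 and 2 categories.
nu = [0.4, 0.6]
psi = [[[0.7, 0.3], [0.2, 0.5, 0.3], [0.6, 0.4]],
       [[0.1, 0.9], [0.3, 0.3, 0.4], [0.5, 0.5]]]
p = parafac_tensor(nu, psi)
print(len(p), sum(p.values()))
```

The resulting tensor has nonnegative entries that sum to one, so it belongs to the set of probability tensors defined above.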
The GoM also has approximation properties that are summarized in Bhattacharya and
Dunson (2012). It is a nonnegative higher-order singular value decomposition (HOSVD),
which was first proposed by Tucker (1966) for three-way data, and was later extended
to arbitrary tensors by De Lathauwer et al. (2000). The nonnegative HOSVD achieves
better data compression and requires fewer components compared with the nonnegative
PARAFAC decomposition as it uses all combinations of the mode vectors (Bhattacharya
and Dunson, 2012). The GoM allows manifest variables, Mi1,Mi2...,MiJ , to be allocated
to different classes via the local class indicator $Z_{ij}$ specific to each dimension j and individual
i. The LCM requires all the manifest variables on an individual to fall into the same
class. In general, the GoM requires fewer classes than the LCM. However, in
our PERCH applications, we find that the LCM provides sufficient approximation for the
control population’s distribution. Bhattacharya and Dunson (2012) also suggested that the
GoM formulation can be extended to accommodate measurements with mixed data types across
dimensions, for example, some continuous, some discrete, using kernel techniques.
Chapter 3
Partially-Latent Class Models (pLCM)
for Case-Control Studies of Childhood
Pneumonia Etiology
Abstract
In population studies on the etiology of disease, one goal is the estimation of the fraction
of cases attributable to each of several causes. For example, pneumonia is a clinical diag-
nosis of lung infection that may be caused by viral, bacterial, fungal, or other pathogens.
The study of pneumonia etiology is challenging because directly sampling from the lung
to identify the etiologic pathogen is not standard clinical practice in most settings. In-
stead, measurements from multiple peripheral specimens are made. This paper considers
the problem of estimating the population etiology distribution and the individual etiology
probabilities. We formulate the scientific problem in statistical terms as estimating mixing
weights and latent class indicators under a partially-latent class model (pLCM) that com-
bines heterogeneous measurements with different error rates obtained from a case-control
study. We introduce the pLCM as an extension of the latent class model. We also intro-
duce graphical displays of the population data and inferred latent-class frequencies. The
methods are illustrated with simulated and real data sets. The paper closes with a brief
description of extensions of the pLCM to the regression setting and to the case where con-
ditional independence among the measures is relaxed.
3.1 Introduction
Identifying the pathogens responsible for infectious diseases in a population poses significant
statistical challenges. Consider the measurement problem in the Pneumonia Etiology
Research for Child Health (PERCH), a case-control study that has enrolled 9,500
children from 7 sites around the world. Pneumonia is a clinical syndrome that devel-
ops because of an infection of the lung tissue by bacteria, viruses, mycobacteria or fungi
(Levine et al., 2012). The appropriate treatment and public health control measures vary
by pathogen. Which pathogen is infecting the lung usually cannot be directly observed
and must therefore be inferred from multiple peripheral measurements with differing error
rates. The primary goals of the PERCH study are to integrate the multiple sources of data
to: (1) aid the attribution of which pathogen or pathogens have caused a particular case’s
lung infection, and (2) estimate the prevalences of the etiologic pathogens in a population
of children.
The basic statistical framework of the problem is pictured in Figure 3.1. Let $Y_i$ represent
whether the child is a pneumonia case ($Y_i = 1$) or control ($Y_i = 0$). For a child
with pneumonia, let $I^L_i$ indicate which pathogen causes the lung infection. $I^L_i$ takes values
in $\{0, 1, 2, \ldots, J\}$, where 0 represents no infection (control) and $I^L_i = j$, $j = 1, \ldots, J$,
represents the $j$th pathogen from a pre-specified cause-of-pneumonia or pneumonia etiology
list. Among the $J$ candidate pathogens being tested, we assume only one is the primary
cause. Because, for most cases, it is not possible to directly sample the lung, we do not
know with certainty which pathogen infected the lung, so we seek to infer the infection
status $I^L_i$ based upon a series of laboratory measurements of specimens from various body
fluids and body sources $S$ ($M^S_i$).
The measurement error rates differ by type of measurement. In the motivating PERCH
application and the following discussions, the error rates refer to epidemiologic error rates
that characterize the probability of the pathogen’s presence/absence in specimen tests given
whether it infected the lung. For this and possibly other applications, it is convenient to cat-
egorize measures into three subgroups referred to as “gold”, “silver”, and “bronze” stan-
dard measurements. A gold-standard (GS) measurement is assumed to have both perfect
sensitivity and specificity. A silver-standard (SS) measurement is assumed to have per-
fect specificity, but imperfect sensitivity. Culturing bacteria from blood samples (B-Cx)
is an example of silver standard measurements in PERCH. Finally, bronze-standard (BrS)
measurements are assumed to have imperfect sensitivity and specificity. Polymerase chain
reaction (PCR) evaluation of bacteria and viruses from nasopharyngeal samples is an example.
In the PERCH study, both SS and BrS measurements are available for all cases.
BrS measures are also available for controls. A goal of this study is to develop a statistical
model that combines GS and SS measurements from cases with BrS data from cases
and controls to estimate the distribution of pathogens in the population of pneumonia cases,
and the conditional probability that each of the J pathogens is the primary cause of an indi-
vidual child’s pneumonia given her or his set of measurements. Even in applications where
GS data is not available, a flexible modeling framework that can accommodate GS data is
useful for both the evaluation of statistical information from BrS data (Section 3.3) and the
incorporation of GS data if it becomes available as measurement technology improves.
Figure 3.1: Directed acyclic graph (DAG) illustrating relationships among lung infection state ($I^L$), imperfect lab measurements on the presence/absence of each of a list of pathogens at each site ($M^{NP}$, $M^B$ and $M^L$), disease outcome, and covariates ($X$). For a subject missing one or more of the three types of measurements, we remove the corresponding measurement component(s). For example, if a case does not have lung aspirate (LA) measurement, we remove $M^L$ from the DAG.
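The three measurement standards can be mimicked in a small simulation. The sketch below is an illustration under assumptions that go beyond this section: each pathogen's measurements are generated independently, the sensitivities and false positive rates are hypothetical inputs, and all three measurement types are generated for every subject even though in PERCH they are not all available for cases and controls. The formal model specification is given in Section 3.2.

```python
import random

def simulate_measurements(cause, J, tpr_brs, fpr_brs, tpr_ss, rng):
    """Simulate GS, SS and BrS binary measurements on J pathogens for one child.

    cause: 0 for a control (no lung infection), or j in 1..J for a case whose
    lung infection is caused by pathogen j.
    tpr_brs, fpr_brs, tpr_ss: assumed per-pathogen rates (lists of length J).
    """
    gs, ss, brs = [], [], []
    for j in range(1, J + 1):
        is_cause = (cause == j)
        # Gold standard: perfect sensitivity and specificity.
        gs.append(1 if is_cause else 0)
        # Silver standard: perfect specificity, imperfect sensitivity.
        ss.append(1 if is_cause and rng.random() < tpr_ss[j - 1] else 0)
        # Bronze standard: imperfect sensitivity and specificity.
        rate = tpr_brs[j - 1] if is_cause else fpr_brs[j - 1]
        brs.append(1 if rng.random() < rate else 0)
    return gs, ss, brs

rng = random.Random(2014)
print(simulate_measurements(2, 3, [0.9, 0.8, 0.7], [0.3, 0.2, 0.1], [0.2, 0.2, 0.2], rng))
```

A silver-standard positive therefore always points to the true cause, while a bronze-standard positive may reflect either the cause or a false positive at the control rate, which is the information that control data help to calibrate.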
Latent class models (LCM) (Goodman, 1974) have been successfully used to integrate
multiple diagnostic tests or raters’ assessments to estimate a binary latent status $D \in \{0, 1\}$
for all study subjects (Hui and Walter, 1980; Qu and Hadgu, 1998; Albert et al., 2001; Albert
and Dodd, 2008). (In these applications, $D = 1$ if $I^L > 0$.) In the LCM framework,
conditional distributions $[M \mid D = j]$, $j = 0, 1$, are specified, and the multivariate measurements
$M$ are used to maximize the likelihood as a function of the disease prevalence, sensitivities
and specificities. This framework has also been extended to infer ordinal latent status
(Wang et al., 2011).
There are three salient features of the PERCH childhood pneumonia problem that require
extension of the typical LCM approach. First, we have partial knowledge of the latent
lung state $I^L$ for some subjects as a result of the case-control design. In the standard LCM
approach, the study population comprises subjects with completely unknown class membership
$D$. In this study, the latent etiology $I^L = 0$ is assigned to all controls because, absent
clinical disease, the lung is assumed to be non-infected. Also, were gold standard measurements
available from the lung for some cases, their latent variable would be directly
observed. As the latent state is known for a non-trivial subset of the study population, we
refer to the model posited below as a partially-latent class model, or pLCM.
Second, in most LCM applications, the number of diagnostic test results on a subject
is much larger than the number of latent state categories. Here, the number of diagnostic
tests is of the same order as, and often equal to, the number of categories that $I^L$ can assume.
For example, if we consider only the PERCH study BrS data, we simultaneously observe
the presence/absence of J pathogens for each child. The large number of latent categories
of $I^L$ leads to weak model identifiability, as is discussed in more detail in Section 3.2.1.
Lastly, measurements with differing error rates (i.e. GS, SS, BrS) need to be inte-
grated in this application. Understanding the relative value of each level of measurements
is important to optimally invest resources into data collection (number of subjects, type
of samples) and laboratory assays. An important goal is therefore to estimate the relative
information from each type of measurements about the population and individual etiology
distributions. Albert and Dodd (2008) studied a model where some subjects are selected to
verify their latent status (i.e. collect from them GS measurements) with the probability of
verification depending on the previous test results or completely at random. They showed
GS data can make model estimates more robust to model misspecifications. We quantify
how much GS data reduces the variance of model parameter estimates for design purposes.
Also, they considered binary latent status and did not have available control data. Another
related literature that uses both GS and BrS data is on verbal autopsy (VA) in the setting
where no complete vital registry system is established in the community (King and Lu,
2008). Quite similar to the goal of inferring pneumonia etiology from lab measurements,
the goal of VA is to infer the cause of death ($I^D$) from a pre-specified list by asking close
family members questions about the presence/absence of $K$ symptoms. King and Lu (2008)
proposed estimating the cause-of-death distribution in the community, $P(I^D = j)$, $j = 1, \ldots, J$,
(similar to etiology) using data on K dichotomous symptoms and GS data from the hospi-
tal where cause-of-death and symptoms are both recorded. However, their method involves
nonparametric estimation of J K-way probability contingency tables and therefore requires
a sizable sample of GS data, especially when the number of symptoms is large. In addition,
a key difference between VA and most infectious disease etiology studies is that the VA
studies are by definition case-only.
Another approach previously used with case and control data is to perform logistic
regression of case status Y on laboratory measurements M and then to calculate point es-
timates of population attributable risks for each pathogen (Bruzzi et al., 1985; Blackwelder
et al., 2012). This method does not account for imperfect laboratory measurements and
cannot use GS data if available. Also, zero prevalence is assigned to pathogens whose esti-
mated odds ratios are smaller than 1, without taking account of their statistical uncertainty.
In this paper, we define and apply a partially-latent class model (pLCM) with conditional
independence assumptions to incorporate these three features: known infection status
for controls, a large number of latent classes, and multiple types of measurements. We use
a hierarchical Bayesian formulation to estimate: (1) the population etiology distribution or
etiology fraction —the frequency with which each pathogen “causes” clinical pneumonia
in the case population; (2) the individual etiology probabilities—the probabilities that a
case is “caused” by each of the candidate pathogens, given observed specimen measure-
ments for that individual; and (3) the relative information content of GS, SS, and BrS data
(Sections 3.3 and 3.4).
The remainder of this paper proceeds as follows. In Section 3.2, we formulate the
pLCM and the Gibbs sampling algorithms for implementation. In Section 3.3, we evaluate
our method through simulations tailored for the childhood pneumonia application. Section
3.4 presents the application of our methodology to a subsample of the PERCH data to
demonstrate its applicability. The last section concludes with a discussion of results and
limitations, a few natural extensions of the pLCM also motivated by the PERCH data, as
well as future directions of research.
3.2 A partially-latent class model for multiple
indirect measurements
We develop the pLCM to address two characteristics of the motivating pneumonia problem:
(1) a partially-latent state variable, because the pathogen infection status is known for
controls but not for cases; and (2) multiple categories of measurements with different error rates
across classes. As shown in Figure 3.1, let $I^L_i$, taking values in $\{0, 1, 2, \ldots, J\}$, represent
the true state of child $i$’s lung ($i = 1, \ldots, N$), where 0 represents no infection (control) and
$I^L_i = j$, $j = 1, \ldots, J$, represents the $j$th pathogen from a pre-specified cause-of-pneumonia
list that is assumed to be exhaustive. Let $M^S_i$ represent the $J \times 1$ vector of binary
indicators of the presence/absence of each pathogen in the measurement at site $S$, where, in
our application, $S$ can be nasopharyngeal (NP), blood (B), or lung (L). Let $m^S_i$ be the
actual observed values. In the following, we replace $S$ with BrS, SS, or GS, because they
correspond to the measurement types at NP, B, and L, respectively.
Let $Y_i = y_i \in \{0, 1\}$ indicate whether child $i$ is a control ($Y_i = 0$) or a case ($Y_i = 1$). Note that
$I^L_i = 0$ given $Y_i = 0$. To formalize the pLCM, we define three sets of parameters:
• $\pi = (\pi_1, \ldots, \pi_J)^T$ for the probability $\Pr(I^L = j \mid Y = 1, \pi)$, $j = 1, \ldots, J$
• $\psi^S_j = \Pr(M^S_j = 1 \mid I^L = 0)$, the marginal false positive rate (FPR) for measurement $j$
at site $S$
• $\theta^S_j = \Pr(M^S_j = 1 \mid I^L = j)$, the marginal true positive rate (TPR) for measurement $j$
at site $S$ for a person whose lung is infected by pathogen $j$.
We further let $\psi^S = (\psi^S_1, \ldots, \psi^S_J)^T$ and $\theta^S = (\theta^S_1, \ldots, \theta^S_J)^T$. Using these definitions, we
have FPR $\psi^{GS}_j = 0$ and TPR $\theta^{GS}_j = 1$ for GS measurements, so that $M^{GS}_{ij} = 1$ if and
only if $I^L_i = j$ (perfect sensitivity and specificity). Let $\delta_i$ be the binary indicator of a case
$i$ having GS measurements; it equals 1 if the case has available GS data and 0 otherwise.
For SS measurements, FPR $\psi^{SS}_j = 0$, so that $M^{SS}_{ij} = 0$ if $I^L_i \neq j$ (perfect specificity).
We formalize the model likelihood for each type of measurement. We first describe the
model for the BrS measurement $M^{BrS}$ for a control or a case. For control $i$, positive detection
of the $j$th pathogen is a false positive representation of the non-infected lung. Therefore,
we assume that for control $i$, $M^{BrS}_{ij} \mid \psi^{BrS} \sim \mathrm{Bernoulli}(\psi^{BrS}_j)$, $j = 1, \ldots, J$, with conditional
independence, or equivalently,
$$P^{0,BrS}_i = \Pr(M^{BrS}_i = m \mid \psi^{BrS}) = \prod_{j=1}^{J} \left(\psi^{BrS}_j\right)^{m_j} \left(1 - \psi^{BrS}_j\right)^{1 - m_j}, \tag{3.2.1}$$
where $m = m^{BrS}_i$. For a case $i'$ infected by pathogen $j$, the positive detection rate for the
$j$th pathogen in BrS assays is $\theta^{BrS}_j$. Since we assume a single cause for each case, detection
of pathogens other than $j$ will be false positives with probability equal to the marginal FPR as
in controls: $\psi^{BrS}_l$, $l \neq j$. This nondifferential misclassification across the case and control
populations is the essential assumption of the latent class approach because it allows us to
borrow information from control BrS data to distinguish the true cause from background
colonization. We further discuss it in the context of the pneumonia etiology problem in the
final section. Then,
$$P^{1,BrS}_{i'} = \Pr(M^{BrS}_{i'} = m \mid \pi, \theta^{BrS}, \psi^{BrS}) = \sum_{j=1}^{J} \pi_j \left(\theta^{BrS}_j\right)^{m_j} \left(1 - \theta^{BrS}_j\right)^{1 - m_j} \prod_{l \neq j} \left(\psi^{BrS}_l\right)^{m_l} \left(1 - \psi^{BrS}_l\right)^{1 - m_l}, \tag{3.2.2}$$
where $m = m^{BrS}_{i'}$, is the likelihood contributed by BrS measurements from case $i'$. For
convenience in Gibbs sampling, we introduce the latent lung infection state $I^L_{i'}$ and represent
(3.2.2) by the following two-stage sampling scheme:
(i) multinomial sampling of lung infection state among cases: $I^L_{i'} \mid \pi, Y_{i'} = 1 \sim \mathrm{Multinomial}(\pi)$,
(ii) measurement stage given lung infection state:
$M^{BrS}_{i'j} \mid I^L_{i'}, \theta^{BrS}, \psi^{BrS} \sim \mathrm{Bernoulli}\left(1_{\{I^L_{i'} = j\}}\,\theta^{BrS}_j + \left(1 - 1_{\{I^L_{i'} = j\}}\right)\psi^{BrS}_j\right)$, $j = 1, \ldots, J$,
conditionally independent, where $1_{\{\cdot\}}$ is the indicator function, which equals one if the
statement in $\{\cdot\}$ is true and zero otherwise.
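
The two-stage scheme and the likelihoods (3.2.1) and (3.2.2) can be sketched in a few lines of Python. This is only an illustrative sketch, not the thesis implementation: the function names and the parameter values for $J = 3$ are hypothetical. The final lines check that both likelihoods sum to one over the $2^J$ possible BrS patterns.

```python
import itertools
import math
import random

def bernoulli_product(m, rates):
    """Product of independent Bernoulli probabilities for a 0/1 pattern m."""
    return math.prod(p if mj else 1.0 - p for mj, p in zip(m, rates))

def control_likelihood(m, psi):
    """Equation (3.2.1): every positive in a control is a false positive."""
    return bernoulli_product(m, psi)

def case_likelihood(m, pi, theta, psi):
    """Equation (3.2.2): mixture over the J candidate causes."""
    J = len(pi)
    total = 0.0
    for j in range(J):
        # Pathogen j is detected at its TPR, every other pathogen at its FPR.
        rates = [theta[l] if l == j else psi[l] for l in range(J)]
        total += pi[j] * bernoulli_product(m, rates)
    return total

def simulate_case(pi, theta, psi, rng):
    """Two-stage scheme: (i) draw the cause, (ii) draw the BrS indicators."""
    J = len(pi)
    cause = rng.choices(range(J), weights=pi)[0]
    m = tuple(int(rng.random() < (theta[j] if j == cause else psi[j]))
              for j in range(J))
    return cause, m

# Hypothetical values for J = 3 pathogens (assumed for illustration only).
pi = (0.5, 0.3, 0.2)
theta = (0.9, 0.8, 0.7)
psi = (0.3, 0.1, 0.05)

patterns = list(itertools.product((0, 1), repeat=3))
total_case = sum(case_likelihood(m, pi, theta, psi) for m in patterns)
total_control = sum(control_likelihood(m, psi) for m in patterns)

rng = random.Random(1)
cause, m = simulate_case(pi, theta, psi, rng)
```

Marginalizing the simulated cause out of `simulate_case` gives exactly the mixture in `case_likelihood`, which is why the latent indicator is convenient for sampling.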
Similarly, the likelihood contribution from case $i'$’s SS measurements can be written as
$$P^{1,SS}_{i'} = \Pr(M^{SS}_{i'} = m \mid \pi, \theta^{SS}) = \sum_{j=1}^{J'} \pi_j \left(\theta^{SS}_j\right)^{m_j} \left(1 - \theta^{SS}_j\right)^{1 - m_j} 1_{\left\{\sum_{l=1}^{J'} m_l \le 1\right\}}, \tag{3.2.3}$$
for $m = m^{SS}_{i'}$, noting the perfect specificity of SS measurements, where $J' \le J$ represents
the number of actual SS measurements on each case, and $\theta^{SS} = (\theta^{SS}_1, \ldots, \theta^{SS}_{J'})$. SS
measurements only test for a subset of all $J$ pathogens; e.g., blood culture only detects bacteria,
and $J'$ is the number of bacteria that are potential causes. Finally, the GS measurement
$M^{GS}_{i'}$, which accurately indicates the actual cause for case $i'$, is assumed to follow a multinomial
distribution with likelihood
$$P^{1,GS}_{i'} = \Pr(M^{GS}_{i'} = m \mid \pi) = \prod_{j=1}^{J} \pi_j^{1_{\{m_j = 1\}}}\, 1_{\left\{\sum_j m_j = 1\right\}}, \quad m = m^{GS}_{i'}. \tag{3.2.4}$$
Combining likelihood components (3.2.1) through (3.2.4), the total model likelihood for BrS,
SS, and GS data across independent cases and controls, $L(\gamma; \mathcal{D})$, can be expressed as
$$\prod_{i: Y_i = 0} P^{0,BrS}_i \prod_{i': Y_{i'} = 1, \delta_{i'} = 1} P^{1,BrS}_{i'} \cdot P^{1,SS}_{i'} \cdot P^{1,GS}_{i'} \prod_{i'': Y_{i''} = 1, \delta_{i''} = 0} P^{1,BrS}_{i''} \cdot P^{1,SS}_{i''}, \tag{3.2.5}$$
where $\gamma = (\theta^{BrS}, \psi^{BrS}, \theta^{SS}, \pi)^T$ stacks all unknown parameters, and the data
$$\mathcal{D} = \left\{m^{BrS}_i\right\}_{i: Y_i = 0} \cup \left\{m^{BrS}_{i'}, m^{GS}_{i'}, m^{SS}_{i'}\right\}_{i': Y_{i'} = 1, \delta_{i'} = 1} \cup \left\{m^{BrS}_{i''}, m^{SS}_{i''}\right\}_{i'': Y_{i''} = 1, \delta_{i''} = 0}$$
collect all the available measurements on study subjects. Our primary statistical goal is
to estimate the posterior distribution of the population etiology distribution $\pi$, and to obtain
the individual etiology ($I^L_*$) prediction given a case’s measurements ($m^{BrS}_*$, $m^{SS}_*$), i.e.,
$$\Pr(I^L_* = j \mid m^{BrS}_*, m^{SS}_*, \mathcal{D}), \quad j = 1, \ldots, J.$$
To enable Bayesian inference, prior distributions on model parameters are specified as
follows: $\pi \sim \mathrm{Dirichlet}(a_1, \ldots, a_J)$, $\psi^{BrS}_j \sim \mathrm{Beta}(b_{1j}, b_{2j})$, $\theta^{BrS}_j \sim \mathrm{Beta}(c_{1j}, c_{2j})$, $j = 1, \ldots, J$,
and $\theta^{SS}_j \sim \mathrm{Beta}(d_{1j}, d_{2j})$, $j = 1, \ldots, J'$. Hyperparameters for the etiology prior,
$a_1, \ldots, a_J$, are usually set to 1 to denote equal and non-informative prior weights for each pathogen
if expert prior knowledge is unavailable. The FPR for the $j$th pathogen, $\psi^{BrS}_j$, can generally
be well estimated from control data; thus $b_{1j} = b_{2j} = 1$ is the default choice. For the TPR
parameters $\theta^{BrS}_j$ and $\theta^{SS}_j$, if prior knowledge on TPRs is available, we choose $(c_{1j}, c_{2j})$ so that
the 2.5% and 97.5% quantiles of the Beta distribution with parameters $(c_{1j}, c_{2j})$ match the prior
minimum and maximum TPR values elicited from pneumonia experts. Otherwise, we use
the default value of 1 for the Beta hyperparameters. Similarly, we choose values of $(d_{1j}, d_{2j})$
either from prior knowledge or as the default value of 1. We finally assume prior independence of the
parameters as $[\gamma] = [\pi][\psi^{BrS}][\theta^{BrS}][\theta^{SS}]$, where $[A]$ represents the distribution of random
variable or vector A. These priors represent a balance between explicit prior knowledge
about measurement error rates and the desire to be as objective as possible for a particular
study. As described in the next section, the identifiability constraints on the pLCM re-
quire specifying a reasonable subset of parameter values to identify parameters of greatest
scientific interest.
3.2.1 Model identifiability
Potential non-identifiability of LCM parameters is well-known. For example, an LCM
with four observed binary indicators and three latent classes is not identifiable despite
providing 15 degrees of freedom to estimate 14 parameters (Goodman, 1974). In principle, the
Bayesian framework avoids the non-identifiability problem in LCMs by incorporating prior
information about unidentified parameter subspaces (Garrett and Zeger, 2000). Many au-
thors point out that the posterior variance for non-identifiable parameters does not decrease
to zero as sample size approaches infinity (e.g., Kadane (1974); Gustafson et al. (2001);
Gustafson (2005)). For scientific investigations, when data are not fully informative about
a parameter, an identified set of parameter values consistent with the observed data can
nevertheless be valuable in a complex application (Gustafson, 2009) like PERCH.
This identifiability issue for the pLCM only occurs in the absence of GS data. Here
we restrict attention to the scenario with only BrS data for simplicity but similar arguments
pertain to the BrS + SS scenario. The problem can be understood from the form of the
marginal positive measurement rates for pathogens among cases. In the pLCM likelihood
for BrS data (only retaining components in (3.2.5) with superscripts BrS), the marginal
positive rate for pathogen $j$ is a convex combination of the TPR and FPR:
$$\Pr\left(M^{BrS}_{i'j} = 1 \mid \pi_j, \theta^{BrS}_j, \psi^{BrS}_j\right) = \pi_j \theta^{BrS}_j + (1 - \pi_j)\psi^{BrS}_j, \tag{3.2.6}$$
where the left-hand side of the above equation can be estimated by the observed marginal
positive rate of pathogen $j$ among cases. Although the control data provide estimates of $\psi^{BrS}_j$,
the two parameters $\pi_j$ and $\theta^{BrS}_j$ are not both identified. GS data, if available, identify
$\pi_j$ and resolve the lack of identifiability. Otherwise, we need to incorporate prior scientific
information on one of them, usually the TPR ($\theta^{BrS}_j$), derived from infectious disease
and laboratory experts (Murdoch et al., 2012) and/or from vaccine probe studies (Feikin
et al., 2014). If the observed case marginal positive rate is much higher than the rate in
controls ($\psi^{BrS}_j$), only large values of the TPR ($\theta^{BrS}_j$) are supported by the data, making etiology
estimation more precise (Section 3.2.2).
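
A small numerical illustration of (3.2.6) may help. The sketch below uses hypothetical rates (a control rate of 0.10 and a case rate of 0.30, neither taken from PERCH) and shows that several $(\pi_j, \theta^{BrS}_j)$ pairs reproduce the same marginal case rate. It considers the marginal rate only, so it illustrates the argument rather than the full identifiability analysis.

```python
def marginal_case_rate(pi_j, theta_j, psi_j):
    """Right-hand side of (3.2.6)."""
    return pi_j * theta_j + (1.0 - pi_j) * psi_j

def implied_pi(case_rate, theta_j, psi_j):
    """Etiology fraction that solves (3.2.6) for a given TPR and FPR."""
    return (case_rate - psi_j) / (theta_j - psi_j)

# Hypothetical observed rates (assumed for illustration only).
psi_j = 0.10      # positive rate among controls, i.e. the FPR
case_rate = 0.30  # marginal positive rate among cases

results = []
for theta_j in (0.5, 0.7, 0.9):
    pi_j = implied_pi(case_rate, theta_j, psi_j)
    results.append((theta_j, pi_j, marginal_case_rate(pi_j, theta_j, psi_j)))
    print(f"theta={theta_j:.1f}  pi={pi_j:.3f}  case rate={results[-1][2]:.3f}")
```

Every row reproduces the same case rate, so without prior information on the TPR (or GS data) the marginal rates alone cannot separate $\pi_j$ from $\theta^{BrS}_j$. Values of the TPR below the observed case rate would imply $\pi_j > 1$ and are therefore excluded.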
More generally, full model identification can be characterized by inspecting the
Jacobian matrix of the transformation ($F$) from model parameters ($\gamma$) to the distribution of
the observables ($p$): $p = F(\gamma)$. Let $\gamma = (\theta^{BrS}, \psi^{BrS}, \pi_1, \ldots, \pi_{J-1})^T$ represent the
$(3J - 1)$-dimensional unconstrained model parameters. The pLCM defines the transformation
$(p_1, p_0)^T = F(\gamma)$, where $p_1$ and $p_0$ are the two contingency probability distributions for
the BrS measurements in the case and control populations. It can be shown that the Jacobian
matrix $\Gamma(\gamma)$ has $J - 1$ of its singular values being zero, which means model parameters $\gamma$
are not fully identified from the data. The FPRs ($\psi^{BrS}_j$, $j = 1, \ldots, J$) in pLCM are, however,
identifiable parameters that can be estimated from control data. Therefore, pLCM is termed
partially identifiable (Jones et al., 2010).
3.2.2 Parameter estimation and individual etiology pre-
diction
The parameters in likelihood (3.2.5) include the population etiology distribution ($\pi$),
TPRs and FPRs for BrS measurements ($\theta^{BrS}$ and $\psi^{BrS}$), and TPRs for SS measurements
($\theta^{SS}$). The posterior distribution of these parameters can be estimated by constructing
approximating samples from the joint posterior via the Gibbs sampler. The full conditional
distributions for the Gibbs sampler are detailed in Section 1 of the supplementary material.
We use the freely available software WinBUGS 1.4 to fit the partially-latent class model.
Convergence was monitored via Markov chain Monte Carlo (MCMC) chain histories,
autocorrelations, kernel density plots, and Brooks-Gelman-Rubin statistics (Brooks and Gelman,
1998). The statistical results below are based on 10,000 iterations of burn-in followed
by 10,000 production samples from each of three parallel chains.
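
For readers who want to see the mechanics, the following is a minimal Gibbs sampler sketch for the BrS-only pLCM. It is not the WinBUGS code used in this chapter, and the full conditionals from the supplementary material are not reproduced here. Instead, the updates are the standard conjugate ones implied by (3.2.1), (3.2.2), and the Dirichlet/Beta priors of Section 3.2, with all hyperparameters set to 1 except an optional TPR prior. The function name, arguments, and starting values are assumptions made for illustration.

```python
import random

def gibbs_plcm_brs(cases, controls, n_iter=2000, burn_in=1000,
                   theta_prior=(1.0, 1.0), seed=0):
    """Gibbs sampler sketch for the BrS-only pLCM (illustrative, not the thesis code).

    cases, controls: lists of length-J tuples of 0/1 BrS indicators.
    Assumed priors: pi ~ Dirichlet(1, ..., 1), psi_j ~ Beta(1, 1),
    theta_j ~ Beta(*theta_prior).
    Returns a list of (pi, theta, psi) draws recorded after burn-in.
    """
    rng = random.Random(seed)
    J = len(cases[0])
    n_ctrl = len(controls)
    ctrl_pos = [sum(m[j] for m in controls) for j in range(J)]
    # Arbitrary starting values.
    pi = [1.0 / J] * J
    theta = [0.9] * J
    psi = [(ctrl_pos[j] + 1.0) / (n_ctrl + 2.0) for j in range(J)]
    draws = []
    for it in range(n_iter):
        # Step 1: sample each case's cause with weight pi_j * l_j(m; gamma).
        causes = []
        for m in cases:
            weights = []
            for j in range(J):
                w = pi[j]
                for l in range(J):
                    p = theta[l] if l == j else psi[l]
                    w *= p if m[l] else 1.0 - p
                weights.append(w)
            causes.append(rng.choices(range(J), weights=weights)[0])
        # Step 2: pi | causes ~ Dirichlet(1 + counts), via normalized gamma draws.
        counts = [causes.count(j) for j in range(J)]
        g = [rng.gammavariate(1.0 + c, 1.0) for c in counts]
        g_sum = sum(g)
        pi = [x / g_sum for x in g]
        # Step 3: conjugate Beta updates for the TPRs and FPRs.
        for j in range(J):
            pos_own = sum(m[j] for m, c in zip(cases, causes) if c == j)
            pos_other = sum(m[j] for m, c in zip(cases, causes) if c != j)
            n_other = len(cases) - counts[j]
            # theta_j learns only from cases currently assigned to cause j.
            theta[j] = rng.betavariate(theta_prior[0] + pos_own,
                                       theta_prior[1] + counts[j] - pos_own)
            # psi_j learns from controls and from cases not assigned to j.
            psi[j] = rng.betavariate(
                1.0 + ctrl_pos[j] + pos_other,
                1.0 + (n_ctrl - ctrl_pos[j]) + (n_other - pos_other))
        if it >= burn_in:
            draws.append((list(pi), list(theta), list(psi)))
    return draws
```

In practice several chains with different seeds would be run and compared, as described above; convergence diagnostics are deliberately left out of this sketch.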
The Bayesian framework naturally allows individual within-sample classification (in-
fection diagnosis) and out-of-sample prediction. This section describes how we calculate
the etiology probabilities for an individual with measurements m∗. We focus on the more
challenging inference scenario when only BrS data are available; the general case follows
directly.
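The within-sample classification for case $i'$ is based on the posterior distribution of
latent indicators given the observed data, i.e., $\Pr(I^L_{i'} = j \mid \mathcal{D})$, $j = 1, \ldots, J$, which can be
obtained by averaging along the cause indicator ($I^L_{i'}$) chain from MCMC samples. For a
case with new BrS measurements $m_*$, we have
$$\Pr(I^L_{i'} = j \mid m_*, \mathcal{D}) = \int \Pr(I^L_{i'} = j \mid m_*, \gamma)\, \Pr(\gamma \mid m_*, \mathcal{D})\, d\gamma, \quad j = 1, \ldots, J, \tag{3.2.7}$$
where the second factor in the integrand can be approximated by the posterior distribution
given current data, i.e., $\Pr(\gamma \mid \mathcal{D})$. For the first factor in the integrand, we
explicitly obtain the model-based, one-sample conditional posterior distribution,
$$\Pr(I^L_{i'} = j \mid m_*, \gamma) = \frac{\pi_j\, \ell_j(m_*; \gamma)}{\sum_{k=1}^{J} \pi_k\, \ell_k(m_*; \gamma)}, \quad j = 1, \ldots, J,$$
where
$$\ell_k(m_*; \gamma) = \left(\theta^{BrS}_k\right)^{m_{*k}} \left(1 - \theta^{BrS}_k\right)^{1 - m_{*k}} \prod_{l \neq k} \left(\psi^{BrS}_l\right)^{m_{*l}} \left(1 - \psi^{BrS}_l\right)^{1 - m_{*l}}$$
is the $k$th mixture component likelihood function evaluated at $m_*$. The log relative probability of $I^L_{i'} = j$ versus
$I^L_{i'} = l$ is
$$R_{jl} = \log\left(\frac{\pi_j}{\pi_l}\right) + \log\left\{\left(\frac{\theta^{BrS}_j}{\psi^{BrS}_j}\right)^{m_{*j}} \left(\frac{1 - \theta^{BrS}_j}{1 - \psi^{BrS}_j}\right)^{1 - m_{*j}}\right\} + \log\left\{\left(\frac{\psi^{BrS}_l}{\theta^{BrS}_l}\right)^{m_{*l}} \left(\frac{1 - \psi^{BrS}_l}{1 - \theta^{BrS}_l}\right)^{1 - m_{*l}}\right\}.$$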
The form of $R_{jl}$ informs us about what is required for correct diagnosis of an individual.
Suppose $I^L_i = j$; then, averaging over $m_*$, we have
$$E[R_{jl}] = \log(\pi_j / \pi_l) + I(\theta^{BrS}_j; \psi^{BrS}_j) + I(\psi^{BrS}_l; \theta^{BrS}_l),$$
where $I(v_1; v_2) = v_1 \log(v_1 / v_2) + (1 - v_1) \log\left((1 - v_1)/(1 - v_2)\right)$ is the
information divergence (Kullback, 2012) that represents the expected amount of information
in $m_{*j} \sim \mathrm{Bernoulli}(v_1)$ for discriminating against $m_{*j} \sim \mathrm{Bernoulli}(v_2)$. If $v_1 = v_2$, then
$I(v_1; v_2) = 0$. The form of $E[R_{jl}]$ shows that a person’s BrS data carry additional information
about that individual’s etiology only when there is a difference between
$\theta^{BrS}_j$ and $\psi^{BrS}_j$, $j = 1, \ldots, J$.
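
The expectation above can be checked numerically. The sketch below, with hypothetical function names, computes the information divergence and the log relative probability for one pair of pathogens $(j, l)$. The closed form for $E[R_{jl}]$ can then be compared against a brute-force average over the four possible values of $(m_{*j}, m_{*l})$ when the true cause is $j$.

```python
import math

def info_divergence(v1, v2):
    """I(v1; v2) for Bernoulli(v1) versus Bernoulli(v2), with 0 < v1, v2 < 1."""
    return (v1 * math.log(v1 / v2)
            + (1.0 - v1) * math.log((1.0 - v1) / (1.0 - v2)))

def log_relative_probability(m_j, m_l, pi_j, pi_l, theta_j, psi_j, theta_l, psi_l):
    """R_jl for observed indicators m_j and m_l (each 0 or 1)."""
    term_j = (math.log(theta_j / psi_j) if m_j
              else math.log((1.0 - theta_j) / (1.0 - psi_j)))
    term_l = (math.log(psi_l / theta_l) if m_l
              else math.log((1.0 - psi_l) / (1.0 - theta_l)))
    return math.log(pi_j / pi_l) + term_j + term_l

def expected_log_relative_probability(pi_j, pi_l, theta_j, psi_j, theta_l, psi_l):
    """Closed form of E[R_jl] when the true cause is pathogen j."""
    return (math.log(pi_j / pi_l)
            + info_divergence(theta_j, psi_j)
            + info_divergence(psi_l, theta_l))
```

When the true cause is $j$, the indicator for pathogen $j$ is positive with probability $\theta^{BrS}_j$ and the indicator for pathogen $l$ with probability $\psi^{BrS}_l$, which is what produces the two divergence terms.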
Following (3.2.7), we average $\Pr(I^L_{i'} = j \mid m_*, \gamma)$ over MCMC iterations with $\gamma$
replaced by its simulated value $\gamma^{(*)}$ at each iteration. Repeating for $j = 1, \ldots, J$, we
obtain a $J$-dimensional probability vector, $p_{i'} = (p_{i'1}, \ldots, p_{i'J})^T$, that sums to one. This scheme is
especially useful when a newly examined case has a BrS measurement pattern not observed
in $\mathcal{D}$, which often occurs when $J$ is large. The final decision regarding which pathogen
to treat can then be based upon the estimated $p_{i'}$. In particular, the pathogen with the largest
posterior value might be selected; this choice is Bayes optimal under mean misclassification loss.
Individual etiology predictions described here generalize the positive/negative predictive
value (PPV/NPV) from single to multivariate binary measurements and can aid diagnosis
of case subjects under other user-specified misclassification loss functions.
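
The averaging step can be sketched as follows. The function names are hypothetical, and the two parameter draws at the bottom are made-up numbers standing in for MCMC output; in a real analysis they would be the retained posterior draws of $(\pi, \theta^{BrS}, \psi^{BrS})$.

```python
def conditional_etiology(m, pi, theta, psi):
    """Pr(I = j | m, gamma) for j = 1, ..., J: proportional to pi_j * l_j(m; gamma)."""
    J = len(pi)
    weights = []
    for j in range(J):
        w = pi[j]
        for l in range(J):
            p = theta[l] if l == j else psi[l]
            w *= p if m[l] else 1.0 - p
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]

def predict_etiology(m, draws):
    """Monte Carlo version of (3.2.7): average over posterior draws of gamma."""
    J = len(m)
    avg = [0.0] * J
    for pi, theta, psi in draws:
        probs = conditional_etiology(m, pi, theta, psi)
        for j in range(J):
            avg[j] += probs[j] / len(draws)
    return avg

def most_probable_cause(probs):
    """Index of the largest posterior probability (0-based)."""
    return max(range(len(probs)), key=lambda j: probs[j])

# Two made-up parameter draws standing in for MCMC output (assumption).
draws = [
    ((0.60, 0.30, 0.10), (0.90, 0.85, 0.90), (0.55, 0.03, 0.05)),
    ((0.70, 0.22, 0.08), (0.92, 0.90, 0.88), (0.62, 0.02, 0.06)),
]
p_new = predict_etiology((1, 0, 0), draws)
best = most_probable_cause(p_new)
```

Because each per-draw vector sums to one, the averaged vector does too, and it is defined for any pattern $m_*$ whether or not that pattern appeared in the data.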
3.3 Simulation for three pathogens case with GS
and BrS data
One key question for studies like PERCH is what fraction of the total evidence about
etiology derives from the BrS sources relative to GS or SS sources, if available. In
this simulation, we illustrate the extent to which BrS case-control data can supplement
observation of the etiologic agent directly from the site of infection. We discuss the role of
SS measurements in Section 3.4 through application to the PERCH data set.
We simulate BrS data sets with 500 cases and 500 controls for three pathogens, A, B,
and C, using pLCM specifications. We focus on three pathogens to facilitate viewing of the
$\pi$ estimates and individual predictions in the two-dimensional simplex $S^2$. We use the ternary
diagram (Aitchison, 1986) representation, where the vector $\pi = (\pi_A, \pi_B, \pi_C)^T$ is encoded
as a point with each component being the perpendicular distance to one of the three sides.
The parameters involved are fixed at TPR $= \theta = (\theta_A, \theta_B, \theta_C)^T = (0.9, 0.9, 0.9)^T$, FPR $=
\psi = (\psi_A, \psi_B, \psi_C)^T = (0.6, 0.02, 0.05)^T$, and $\pi = (\pi_A, \pi_B, \pi_C)^T = (0.67, 0.26, 0.07)^T$.
We focus on BrS and GS data here and drop the “BrS” superscript on the parameters for
simplicity. We further let the fraction of cases with GS measurements ($\Delta$) be either 1%
or 10%. Although GS measurements are rare in the PERCH study, we investigate a large
range of $\Delta$ to understand in general how much statistical information is contained in BrS
measurements relative to GS measurements.
For any given data set, three distinct subsets of the data can be used: BrS-only, GS-
only, and BrS+GS, each producing its posterior mean of π, and 95% credible region by
transformed Gaussian kernel density estimator for compositional data (Chacon et al., 2011).
To study the relative importance of the GS and BrS data, the primary quantity of interest in
the simulations is the relative sizes of the credible regions for each data mix. Here, we use
uniform priors on θ, ψ, and Dirichlet(1, ..., 1) prior for π. The results are shown in Figure
3.2.
Figure 3.2: Population and individual etiology estimation for a single sample with 500 cases and 500 controls, with true $\pi = (0.67, 0.26, 0.07)^T$ and either 1% (N = 5) or 10% (N = 50) GS data on cases. In (a) and (b), the red circled plus shows the true population etiology distribution $\pi$. The closed curves are 95 percent credible regions: blue dashed lines “- - -”, light green solid lines “—”, and black dotted lines “· · ·” correspond to analysis using BrS data only, BrS+GS data, and GS data only, respectively. The solid square/dot/triangle are the corresponding posterior means of $\pi$. The 95 percent highest density region of the uniform prior distribution is also shown by red “· − · −” for comparison. The 8 (= $2^3$) BrS measurement patterns and the predictions for individual children are shown with different shapes, with measurement patterns attached to them. The radii of circles and the numbers at the vertices show empirical frequencies of GS measurements belonging to A, B, or C.
First, in Figures 3.2(a) (1% GS) and 3.2(b) (10% GS), each region covers the true etiology
$\pi$. In data not shown here, the nominal 95% credible regions cover the truth in slightly more than
95% of 100 simulations. Credible regions narrow in on the truth as we combine BrS and GS
data, and as the fraction of subjects with GS data (∆) increases. Also, the posterior mean
from the BrS+GS analysis is a result of optimal balance between information contained in
the GS and BrS data.
We then fix $\psi$ and $\pi$, while varying the TPR $\theta$ on the grid (0.6, 0.7, 0.8, 0.9, 0.95, 0.99)
to test the estimation performance under a variety of signal-to-noise ratios, measured by
the difference between the TPRs and FPRs. At each ($\theta$, $\Delta$) grid point, we run analyses
on each of 100 simulated data sets. We quantify the gain in precision from adding the
BrS data to the GS data following Xu and Zeger (2001). For pathogen A, let
$$g_A(\theta) = \frac{d^{0}_A - d^{BrS+GS}_A(\theta)}{d^{0}_A - d^{GS}_A(\theta)},$$
where $d^{0}_A$, $d^{GS}_A(\theta)$, and $d^{BrS+GS}_A(\theta)$ are the length of the
95% highest density interval from the prior, the length of the 95% credible interval using GS data,
and the length of the 95% credible interval using BrS and GS data, respectively. This quantity
($g_A(\theta)$) is the ratio of the reductions in 95% interval width, relative to the prior, achieved with
and without the BrS data at TPR value $\theta$. If $g_A(\theta) = 1$, then there is no additional gain in the precision of $\pi_A$
when BrS data are added to GS data. When $\Delta = 1\%$, we observe the expected increase in
$g_A$ as the TPR $\theta$ approaches 1. For pathogen A, $g_A(0.8)$ has mean value 1.7 across 100 simulated
data sets with standard error 0.3; $g_A(0.95)$ further increases to 3.0 (standard error 0.3).
Similar patterns are also observed for pathogens B and C.
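
As a worked example of the gain measure, the short sketch below evaluates $g_A$ for hypothetical interval lengths (these are not results from the simulation study, and the function name is an assumption).

```python
def relative_precision_gain(d_prior, d_gs, d_brs_gs):
    """g = (d_prior - d_brs_gs) / (d_prior - d_gs) for 95% interval lengths."""
    if d_prior <= d_gs:
        raise ValueError("GS data must shorten the prior interval for g to be defined")
    return (d_prior - d_brs_gs) / (d_prior - d_gs)

# Hypothetical interval lengths (assumed for illustration only).
g_example = relative_precision_gain(d_prior=0.8, d_gs=0.6, d_brs_gs=0.4)
g_no_gain = relative_precision_gain(d_prior=0.8, d_gs=0.6, d_brs_gs=0.6)
```

With these inputs the GS data shorten the prior interval by 0.2 and the combined data by 0.4, so the gain is approximately 2; when adding BrS data leaves the interval unchanged, the gain equals 1.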
Using the same simulated data sets, Figures 3.2(a) and 3.2(b) also show individual etiology
predictions for each of the 8 (= $2^3$) possible BrS measurements $(m_A, m_B, m_C)^T$, $m_j \in \{0, 1\}$,
obtained by the methods from Section 3.2.2. Consider the example of a newly enrolled
case without GS data and with no pathogen observed in her BrS data: $m = (0, 0, 0)$.
Suppose she is part of a case population with 10% GS data. In the case illustrated in
Figure 3.2(b), her posterior predictive distribution has highest posterior probability (0.76)
on pathogen A, reflecting two competing forces: the FPRs that describe background colonization
(colonization among the controls) and the population etiology distribution. Given
other parameters, m = (0, 0, 0) gives the smallest likelihood for ILi = A because of its
high FPR that reflects its background colonization rate, ψA = 0.6. However, prior to ob-
serving (0, 0, 0), πA is well estimated to be much larger than πB and πC . Therefore the
posterior distribution for this case is heavily weighted towards pathogen A.
For a case with observation (1, 1, 1), because it is rare to observe pathogen B in a case
whose pneumonia is not caused by B, the prediction favors B. Although B is not the most
prevalent cause among cases, the presence of B in the BrS measurements gives the largest
likelihood when ILi = B. For any measurement pattern with a single positive, the case is
always classified into that category in this example.
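
The classification pattern described above can be reproduced approximately by plugging the true simulation parameters into the conditional posterior of Section 3.2.2. This is a simplification: the predictions in Figure 3.2 average over posterior draws, whereas the sketch below (with hypothetical function and variable names) treats the parameters as known, so its probabilities will differ somewhat from those in the figure.

```python
import itertools

def conditional_etiology(m, pi, theta, psi):
    """Pr(I = j | m, gamma): proportional to pi_j * l_j(m; gamma)."""
    J = len(pi)
    weights = []
    for j in range(J):
        w = pi[j]
        for l in range(J):
            p = theta[l] if l == j else psi[l]
            w *= p if m[l] else 1.0 - p
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]

labels = ("A", "B", "C")
# True parameter values used in the simulation of this section.
pi = (0.67, 0.26, 0.07)
theta = (0.9, 0.9, 0.9)
psi = (0.6, 0.02, 0.05)

predictions = {}
for m in itertools.product((0, 1), repeat=3):
    probs = conditional_etiology(m, pi, theta, psi)
    best = labels[max(range(3), key=lambda j: probs[j])]
    predictions[m] = (best, probs)
    print(m, best, [round(p, 3) for p in probs])
```

Under these plug-in values, every single-positive pattern is assigned to the detected pathogen, (1, 1, 1) is assigned to B, and (0, 0, 0) is assigned to A with probability of about 0.83, compared with 0.76 in Figure 3.2(b), where parameter uncertainty is averaged over.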
Most predictions are stable with increasing ∆. Only 000 cases have predictions that
move from near the center to the corner of A. This is mainly because the TPR θ and
etiology fractions π are not as precisely estimated in GS-scarce scenarios relative to GS-
abundant ones. Averaging over a wider range of θ and π produces 000 case predictions
that are ambiguous, i.e. near the center. As ∆ increases, parameters are well estimated, and
precise predictions result.
3.4 Analysis of PERCH data
The Pneumonia Etiology Research for Child Health (PERCH) study is a standardized
and comprehensive evaluation of etiologic agents causing severe and very severe pneumo-
nia among hospitalized children aged 1-59 months in seven low and middle income coun-
tries. The study sites include countries with a significant burden of childhood pneumonia
and a range of epidemiologic characteristics (Levine et al., 2012). PERCH is a case-control
study that has enrolled over 4,000 patients hospitalized for severe or very severe pneumonia
and over 5,000 controls selected randomly from the community, frequency-matched on
age in each month. More details about the PERCH design are available in Deloria-Knoll
et al. (2012).
To illustrate the application of the pLCM to the analysis of PERCH study data, we
have focused on preliminary data from one site with good availability of both SS and BrS
laboratory results. Results for all 7 countries will be reported elsewhere upon study com-
pletion. Included in the current illustrative analysis are BrS data (nasopharyngeal specimen
with PCR detection of pathogens) for 432 cases and 479 frequency-matched controls on 11
species of pathogens (7 viruses and 4 bacteria with their abbreviations in Figure 3.3, and
full names in Section 2 of the supplementary material), and SS data (blood culture results)
on the 4 bacteria for only the cases.
In PERCH, prior scientific knowledge of misclassification rates is incorporated into
the analysis. The TPR of our BrS measurements, $\theta^{BrS}_j$, is assumed to be in the range of
90%−97% (Murdoch et al., 2012). Observations from vaccine probe studies—randomized
clinical trials of pathogen-specific vaccines in which non-specific clinical endpoints such
as clinical pneumonia are evaluated thereby revealing the contribution of the pathogen to
the burden of that syndrome— illustrate that the total number of clinical pneumonia cases
prevented by the vaccine is much larger than the few laboratory-confirmed cases prevented.
Comparing the total preventable disease burden to the number of blood culture (SS) pos-
itive cases prevented provides information about the TPR of the bacterial blood culture
measurements, $\theta^{SS}_j$, $j = 1, \ldots, 4$. In our analysis, we use the range 10% to 20% for the SS
TPRs of four bacteria. We set Beta priors that match these ranges (Section 3.2) and as-
sumed Dirichlet(1, ..., 1) prior on etiology fractions π.
In latent variable models like the pLCM, key variables are not directly observed. It is
therefore essential to picture the model inputs and outputs side-by-side to better understand
the analysis performed. In this spirit, Figure 3.3 displays for each of the 11 pathogens, a
summary of the BrS and SS data in the left two columns, along with some of the interme-
diate model results; and the prior and posterior distributions for the etiology fractions on
the right (rows ordered by posterior means). The observed BrS rates (with 95% confidence
intervals) for cases and controls are shown on the far left with solid dots. The conditional
odds ratio contrasting the case and control rates given the other pathogens is listed with
95% confidence interval in the box to the right of the BrS data summary. Below the case
and control observed rates is a horizontal line with a triangle. From left to right, the line
starts at the estimated false positive rate (FPR, $\psi^{BrS}_j$) and ends at the estimated true positive
rate (TPR, $\theta^{BrS}_j$), both obtained from the model. Below the TPR are two boxplots summarizing
its posterior (top) and prior (bottom) distributions for that pathogen. These box
plots show how the prior assumption influences the TPR estimate as expected given the
identifiability constraints discussed in Section 3.2.1. The triangle on the line is the model
estimate of the case rate to compare to the observed value above it. As discussed in Section
3.2.1, the model-based case rate is a linear combination of the FPR and TPR with mixing
fraction equal to the estimated etiology fraction. Therefore, the location of the triangle,
expressed as a fraction of the distance from the FPR to the TPR, is the model-based point
estimate of the etiologic fraction for each pathogen. The SS data are shown in a similar
fashion to the right of the BrS data. By definition, the FPR is 0.0 for SS measures and
there is no control data. The observed rate for the cases is shown with its 95% confidence
interval. The estimated SS TPR ($\theta^{SS}_j$) with prior and posterior distributions is shown as
for the BrS data, except that we plot 95% and 50% credible intervals for SS TPR above its
prior distribution boxplot.
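The relationship between the triangle and the etiologic fraction can be written out explicitly. As a sketch in the notation of this section, with $\hat{r}_j$ introduced here only as an illustrative symbol for the model-based BrS case rate of pathogen j:

```latex
\[
\hat{r}_j = \pi_j\,\theta^{BrS}_j + (1 - \pi_j)\,\psi^{BrS}_j
\quad\Longrightarrow\quad
\pi_j = \frac{\hat{r}_j - \psi^{BrS}_j}{\theta^{BrS}_j - \psi^{BrS}_j},
\qquad \theta^{BrS}_j > \psi^{BrS}_j .
\]
```

The right-hand side is the position of the triangle measured as a fraction of the distance from the FPR to the TPR.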
On the right side of the display are the marginal posterior and prior distributions of the
etiologic fraction for each pathogen. We normalized each density so that the prior and
posterior curves have matching heights. The posterior mean with 50% and 95% credible
intervals is shown above the density.
Figure 3.3 shows that respiratory syncytial virus (RSV), Streptococcus pneumoniae
(PNEU), rhinovirus (RHINO), and human metapneumovirus (HMPV A B) occupy the
greatest fractions of the etiology distribution, from 10% to 30% each. That RSV has the
largest estimated mean etiology fraction reflects the large discrepancy between case and
control positive rates in the BrS data: 25.3% versus 0.8% (marginal odds ratio 38.5, 95% CI
(18, 128.7)), as shown on the left of the display. RHINO has marginal case and control
rates that are close to each other, yet its estimated mean etiology fraction is 15.9%. This
is because the model considers the joint distribution of the pathogens, not the marginal
rates. The conditional odds ratio of case status with RHINO given all the other pathogen
measures is estimated to be 1.5 (1.1, 2.1) as compared to the marginal odds ratio close to 1
(0.8, 1.3).
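For readers who want to reproduce a marginal odds ratio of the kind quoted above, the following is a minimal sketch. The text does not state how the confidence intervals were computed; the Wald interval on the log scale used here is an assumption, and the counts are hypothetical rather than PERCH data. A conditional odds ratio would instead require a regression of case status on all pathogen measurements, which is not shown.

```python
import math


def odds_ratio_with_ci(case_pos, case_neg, ctrl_pos, ctrl_neg, z=1.96):
    """Marginal odds ratio of case status with a positive test.

    Returns (odds ratio, lower limit, upper limit) using a Wald interval
    on the log odds ratio scale. All four counts must be positive.
    """
    odds_ratio = (case_pos * ctrl_neg) / (case_neg * ctrl_pos)
    se_log_or = math.sqrt(
        1 / case_pos + 1 / case_neg + 1 / ctrl_pos + 1 / ctrl_neg
    )
    log_or = math.log(odds_ratio)
    return (
        odds_ratio,
        math.exp(log_or - z * se_log_or),
        math.exp(log_or + z * se_log_or),
    )


# Hypothetical 2x2 table: 20 of 100 cases and 10 of 100 controls test positive.
or_hat, lower, upper = odds_ratio_with_ci(20, 80, 10, 90)
```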
As discussed in Section 3.2.1, the data alone cannot precisely estimate both the etiologic
fractions and TPRs absent prior knowledge. This is evidenced by comparing the prior and
posterior distributions for the TPRs in the BrS boxes for each pathogen (i.e. left hand
column of Figure 3.3). The posteriors are similar to their priors indicating little else about
TPR is learned from the data. The posteriors for some pathogens making up π (i.e. shown
in the right hand column of Figure 3.3) are likely to be sensitive to the prior specifications
of the TPRs.
We performed sensitivity analyses using multiple sets of priors for the TPRs. At one
extreme, we ignored background scientific knowledge and let the priors on the FPR and
TPR be uniform for both the BrS and SS data. The results are shown in Figure 3.5. Ignor-
ing prior knowledge about error rates lowers the etiology estimates of the bacteria PNEU
and Haemophilus influenzae (HINF). The substantial reduction in the etiology fraction for
PNEU, for example, is a result of the difference in the TPR prior for the SS measurements.
In the original analysis (Figure 3.3), the informative prior on the SS sensitivity (TPR) places
95% of its mass between 10% and 20%. Hence the model assumes almost 85% of the PNEU infections
are being missed in the SS sampling. When a uniform prior is substituted (Figure
3.4), the fraction assumed missed is greatly reduced. For RSV, its posterior mean etiology
fraction increases from 27.3% to 31.7%. The etiology estimates for other pathogens are
fairly stable, with changes in posterior means between −0.4% and 3.3%.
Figure 3.3: Results using expert priors on TPRs. The observed BrS rates (with 95% confidence intervals) for cases and controls are shown on the far left. The conditional odds ratio given the other pathogens is listed with 95% confidence interval in the box to the right of the BrS data summary. Below the case and control observed rates is a horizontal line with a triangle. From left to right, the line starts at the estimated false positive rate (FPR, $\psi^{BrS}_j$) and ends at the estimated true positive rate (TPR, $\theta^{BrS}_j$), both obtained from the model. Below the TPR are two boxplots summarizing its posterior (top) and prior (bottom) distributions. The location of the triangle, expressed as a fraction of the distance from the FPR to the TPR, is the model-based point estimate of the etiologic fraction for each pathogen. The SS data are shown in a similar fashion to the right of the BrS data. The observed rate for the cases is shown with its 95% confidence interval. The estimated SS TPR ($\theta^{SS}_j$) with prior and posterior distributions is shown as for the BrS data, except that we plot 95% and 50% credible intervals for SS TPR above the boxplot for its prior distribution. See Appendix for pathogen name abbreviations.
Under the original priors for TPR, PARA1 has an estimated etiologic fraction of 5.2%,
even though it has conditional odds ratio 5.8 (2.5, 15). In general, pathogens with larger
conditional odds ratios have larger etiology fraction estimates. Also, a pathogen still needs
a reasonably high observed case positive rate to be allocated a high etiology fraction. The
posterior etiology fraction estimate of 5.2% for PARA1 results because the prior for the
TPR takes values in the range of 0.9 − 0.97. By Equation (3.2.6), the TPR weight in the
convex combination with FPR (around 1.5%) has to be very small to explain the small
observed case rate 5.5%. When a uniform prior is placed on TPR instead, the PARA1
etiology fraction increases to 10.2% with a wider 95% credible interval (Figure 3.4).
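The PARA1 argument can be checked with a short calculation. This is a sketch that assumes the convex-combination relationship described earlier (case rate = π × TPR + (1 − π) × FPR) and, as a simplification, plugs in the observed case rate of 5.5% in place of the model-based one; the function name is illustrative.

```python
def implied_etiology_fraction(case_rate, tpr, fpr):
    """Solve case_rate = pi * tpr + (1 - pi) * fpr for pi."""
    if tpr <= fpr:
        raise ValueError("TPR must exceed FPR for the fraction to be identified")
    return (case_rate - fpr) / (tpr - fpr)


# Values quoted in the text for PARA1: case rate 5.5%, FPR around 1.5%.
# TPRs of 0.90 and 0.97 span the expert prior range; 0.50 is a hypothetical
# lower value of the kind a uniform prior would allow.
for tpr in (0.90, 0.97, 0.50):
    pi = implied_etiology_fraction(0.055, tpr, 0.015)
    print(f"TPR = {tpr:.2f} -> implied etiology fraction = {pi:.3f}")
```

With the TPR confined to 0.90 to 0.97 the implied fraction stays between roughly 0.042 and 0.045, whereas smaller TPR values permit larger fractions, in line with the wider interval reported under the uniform prior.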
Figure 3.4: Results using uniform priors on TPRs. As in Figure 3.3, but with uniform priors on the TPRs.
Figure 3.5: Summary of posterior distribution of pneumonia etiology estimates using expert (left) and uniform (right) priors on TPRs. In each subfigure, top: posterior (solid) and prior (dashed) distribution of viral etiology; bottom left: posterior etiology distribution for top two bacterial causes given bacteria is a cause; bottom right: posterior etiology distribution for top two viral causes given virus is a cause. B-rest and V-rest stand for the rest of bacteria and viruses other than the top two species, respectively. The nested blue circles are 95%, 80%, and 50% credible regions for population etiology estimates within bacterial or viral group.
Furthermore, when uniform priors on TPR and FPR are used, PARA1 is still allocated
a smaller etiology fraction than RHINO despite PARA1 having a larger conditional odds
ratio. This is related to the dependence structure among case measurements. RHINO
has the highest negative association with RSV among cases (standardized log odds ratio
−14). Under the conditional independence assumption of the pLCM, this dependence is
partly induced by multinomial correlation among the latent cause indicators: the covariance
between the indicators of $I^L_i = \text{RSV}$ and $I^L_i = \text{RHINO}$ is $-\pi_{RSV}\pi_{RHINO}$.
RSV has strong evidence as a frequent cause
with a stable estimate $\pi_{RSV}$ around 30%. The strong negative association in the cases’
measurements between RHINO and RSV is contributing to the increased etiologic fraction
estimate $\pi_{RHINO}$ relative to other pathogens that have less or no association with RSV
among the cases. The conditional independence assumption is leveraging information from
the associations between pathogens in estimation of the etiologic fractions.
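The negative dependence induced by the latent cause indicators can be made explicit. As a sketch under the pLCM's conditional independence assumption, write $\theta_j$ and $\psi_j$ for the BrS TPR and FPR of pathogen j (superscripts dropped here for brevity) and $I$ for a case's latent cause; the second line applies the law of total covariance, using the fact that measurements are independent given $I$:

```latex
\begin{align*}
\mathrm{Cov}\big(1\{I = j\},\, 1\{I = l\}\big) &= -\pi_j \pi_l, \qquad j \neq l, \\
\mathrm{Cov}(M_j, M_l)
  &= \mathrm{Cov}\big(E[M_j \mid I],\, E[M_l \mid I]\big),
  \qquad E[M_j \mid I] = \psi_j + (\theta_j - \psi_j)\, 1\{I = j\}, \\
  &= -(\theta_j - \psi_j)(\theta_l - \psi_l)\, \pi_j \pi_l .
\end{align*}
```

When both TPRs exceed their FPRs, the covariance is negative and grows in magnitude with $\pi_j \pi_l$, so a strong observed negative association with a frequent cause such as RSV is most easily explained by a larger etiologic fraction for the other pathogen.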
We have checked the model in two ways, each comparing a characteristic of the observed
measurements’ joint distribution with the same characteristic for the distribution of
new measurements generated by the model from a population of the same size. By generating
the new data characteristics at every iteration of the MCMC chain, we obtain the
predictive distribution for the new data by averaging over the posterior distribution of the parameters,
as discussed in Garrett and Zeger (2000). Figure 3.6 displays the observed frequency
of the 10 most common measurement outcomes for the BrS data, separately for cases and
controls to compare to the predictive distributions based upon the model. Among the cases,
the 95% predictive interval includes the observed values in all but two of the BrS patterns
and even there the fits are reasonable. Among the controls, there is evidence of lack of fit
for the most common BrS pattern with only PNEU and HINF. There are fewer cases with
this pattern observed than predicted under the pLCM. This lack of fit is due to associations
of pathogen measurements in control subjects. Note that the FPR estimates remain consistent
regardless of such correlation as the number of controls increases; however, posterior
variances for them may be underestimated.
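The pattern-frequency check described above can be sketched in a few lines. The sketch covers controls only and assumes the pLCM's conditional independence, so each replicated control vector is a set of independent Bernoulli draws at the FPRs. The posterior draws below are hypothetical stand-ins for MCMC output, and all function names are illustrative.

```python
import random


def simulate_controls(fpr, n, rng):
    """Draw n control vectors with independent Bernoulli(fpr[j]) entries."""
    return [tuple(int(rng.random() < p) for p in fpr) for _ in range(n)]


def pattern_frequency(data, pattern):
    """Fraction of rows in data equal to the given binary pattern."""
    return sum(row == pattern for row in data) / len(data)


def predictive_interval(posterior_fpr_draws, n, pattern, rng, level=0.95):
    """Simulate one replicated data set of size n per posterior draw and
    return an empirical predictive interval for the pattern frequency."""
    freqs = sorted(
        pattern_frequency(simulate_controls(fpr, n, rng), pattern)
        for fpr in posterior_fpr_draws
    )
    last = len(freqs) - 1
    lower = freqs[round((1 - level) / 2 * last)]
    upper = freqs[round((1 + level) / 2 * last)]
    return lower, upper


rng = random.Random(0)
# Hypothetical stand-in for MCMC output: 200 draws of FPRs for 3 pathogens.
draws = [
    [min(max(rng.gauss(mean, 0.02), 0.0), 1.0) for mean in (0.6, 0.3, 0.05)]
    for _ in range(200)
]
lower, upper = predictive_interval(draws, n=500, pattern=(1, 1, 0), rng=rng)
observed = 0.15  # hypothetical observed frequency of the pattern
covered = lower <= observed <= upper
```

An observed frequency falling outside the interval, as for the PNEU and HINF pattern among controls, signals lack of fit for that pattern.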
Figure 3.6: Posterior predictive checking for 10 most frequent BrS measurement patterns among cases and controls with expert priors on TPRs.
Figure 3.7: Posterior predictive checking for pairwise odds ratios separately for cases (lower right triangle) and controls (upper left triangle) with expert priors on TPRs. Each entry is a standardized log odds ratio (SLOR): the observed log odds ratio for a pair of BrS measurements minus the mean LOR for the posterior predictive distribution divided by the standard deviation of the posterior predictive distribution. The first significant digit of absolute SLORs is shown in red for positive and blue for negative values, and only those greater than 2 are shown.
Figure 3.7 presents standardized log odds ratios (SLORs) for cases (lower triangle)
and controls (upper triangle). Each entry is the observed log odds ratio (LOR) for a pair of BrS
measurements minus the mean of the LOR predictive distribution, divided by
the standard deviation of the LOR predictive distribution. The first significant digit of the
absolute SLOR is shown in blue for negative and red for positive values. Absolute SLORs
less than 2 are omitted from the table for graphical effect. We see two large deviations
among the cases: RSV with RHINO and RSV with HMPV. These are caused by strong
seasonality in RSV that is out of phase with weaker seasonality in the other two. Otherwise,
the associations are roughly what is expected under the assumed model.
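The SLOR defined above is straightforward to compute once predictive LORs are available. The following is a minimal sketch; the 0.5 continuity correction in the log odds ratio is an assumption added here to keep sparse tables finite and is not taken from the text, and the function names are illustrative.

```python
import math
import statistics


def log_odds_ratio(x, y, correction=0.5):
    """Log odds ratio for two equal-length binary vectors.

    `correction` is added to each cell of the 2x2 table (an assumption made
    here to avoid division by zero when a cell is empty).
    """
    n11 = sum(a == 1 and b == 1 for a, b in zip(x, y)) + correction
    n10 = sum(a == 1 and b == 0 for a, b in zip(x, y)) + correction
    n01 = sum(a == 0 and b == 1 for a, b in zip(x, y)) + correction
    n00 = sum(a == 0 and b == 0 for a, b in zip(x, y)) + correction
    return math.log((n11 * n00) / (n10 * n01))


def standardized_log_odds_ratio(observed_lor, predictive_lors):
    """(observed LOR - predictive mean) / predictive standard deviation."""
    mean = statistics.mean(predictive_lors)
    sd = statistics.stdev(predictive_lors)
    return (observed_lor - mean) / sd


# Hypothetical predictive draws of the LOR for one pair of pathogens.
slor = standardized_log_odds_ratio(2.0, [-1.0, 0.0, 1.0])
```

In practice `predictive_lors` would hold one LOR per MCMC iteration, each computed from a data set simulated at that iteration's parameter values.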
An attractive feature of using MCMC to estimate posterior distributions is the ease of
estimating posteriors for functions of the latent variables and/or parameters. One interesting
question from a clinical perspective is whether viruses or bacteria are the major cause and
among each subgroup, which species predominate. Figure 3.5 shows the posterior distribu-
tion using expert TPR prior for viruses versus bacteria on the top, and then the conditional
distributions of the two leading bacteria (viruses) among bacterial (viral) causes below. The
posterior shape of the viral etiologic fraction is more concentrated compared to the prior
shape, with mode around 63% and 95% credible interval (54%, 71%). Of all viral cases,
RSV is estimated to cause about 43% (36%, 51%), and RHINO about 25% (17%, 34%).
PNEU accounts for most bacterial cases (71% (48%, 87%)), and HINF accounts for 19%
(4%, 42%). In both the viral and bacterial categories, the 95% credible interval for the
most common pathogen does not overlap that of the second most common one.
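The ease of summarizing functions of parameters from MCMC output can be illustrated as follows. This is a sketch with three hypothetical posterior draws of the etiologic fractions (a real chain would have thousands), and the helper names are illustrative: each functional is simply evaluated draw by draw and then summarized.

```python
import statistics


def quantile(values, q):
    """Nearest-rank empirical quantile of a list of numbers."""
    ordered = sorted(values)
    return ordered[round(q * (len(ordered) - 1))]


def summarize(values):
    """Posterior mean and 95% equal-tailed credible interval."""
    return statistics.mean(values), quantile(values, 0.025), quantile(values, 0.975)


def group_fraction(draw, group):
    """Total etiologic fraction of a pathogen group in one posterior draw."""
    return sum(draw[name] for name in group)


def conditional_share(draw, pathogen, group):
    """Share of one pathogen among causes belonging to its group."""
    return draw[pathogen] / group_fraction(draw, group)


# Hypothetical posterior draws of the etiologic fractions (each sums to one).
pi_draws = [
    {"RSV": 0.30, "RHINO": 0.15, "HMPV": 0.15, "PNEU": 0.30, "HINF": 0.10},
    {"RSV": 0.25, "RHINO": 0.20, "HMPV": 0.15, "PNEU": 0.25, "HINF": 0.15},
    {"RSV": 0.35, "RHINO": 0.15, "HMPV": 0.10, "PNEU": 0.30, "HINF": 0.10},
]
viruses = ("RSV", "RHINO", "HMPV")
viral = [group_fraction(d, viruses) for d in pi_draws]
rsv_given_viral = [conditional_share(d, "RSV", viruses) for d in pi_draws]
viral_summary = summarize(viral)
rsv_summary = summarize(rsv_given_viral)
```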
3.5 Discussion
In this paper, we estimated the frequency with which pathogens cause disease in a
case population using a partially-latent class model (pLCM) to allow for known states
for a subset of subjects and for multiple types of measurements with different error rates.
In a case-control study of disease etiology, measurement error will bias estimates from
traditional logistic regression and attributable fraction methods. The pLCM avoids this
pitfall and more naturally incorporates multiple sources of data. Here we considered three
levels of measurement error rates.
Absent GS data, we show that the pLCM is only partially identified because of the
relationship between the estimated TPR and prevalence of the associated pathogen in the
population. Therefore, the inferences are sensitive to the assumptions about the TPR. Un-
certainty about their values persists in the final inferences from the pLCM regardless of the
number of subjects studied.
The current model provides a novel solution to the analytic problems raised by the
PERCH Study. This paper illustrates the design and application of the pLCM using a
preliminary and limited set of data from one PERCH study site. Confirmatory laboratory
testing, incorporation of additional pathogens, and adjustment for various factors are likely
to change the scientific findings that will be reported in the complete analysis of the study
results.
An essential assumption relied upon in the pLCM is that the probability of detecting
one pathogen at a peripheral body site depends on whether that pathogen is infecting the
child’s lung, but is unaffected by the presence of other pathogens in the lung, that is, the
non-differential misclassification error assumption, $[M^{SS}_{ij} \mid I^L_i = l] = [M^{SS}_{ij} \mid I^L_i = k]$,
$\forall\, l, k \neq j$. We have formulated the model to include GS measures even though they are
infrequently available from PERCH cases. In general, the availability of GS measures
makes it possible to test this assumption as has been discussed by Albert and Dodd (2008).
Several extensions have potential to improve the quality of inferences drawn and are
being developed for PERCH. First, because the control subjects have known class, we can
model the dependence structure among the BrS measurements and use this to avoid aspects
of the conditional independence assumption central to most LCM methods. The approach
is to extend the pLCM to have K subclasses within each of the current disease classes.
These subclasses can introduce correlation among the BrS measurements given the true
disease state. An interesting question is the bias-variance trade-off for different values
of K. This idea follows previous work on the PARAFAC decomposition of probability
distributions for multivariate categorical data (Dunson and Xing, 2009). This extension will
enable model-based checking of the standard pLCM.
Second, in our analyses to date, we have assumed that the pneumonia case definition
is error-free. Given new biomarkers and the availability of chest radiographs that can improve
upon the clinical diagnosis of pneumonia, one can introduce an additional latent variable
to indicate true disease status and use these measurements to probabilistically assign each
subject as a case or control. Finally, regression extensions of the pLCM will allow PERCH
investigators to study how the etiology distributions vary with age group and season.
Chapter 4
Nested Partially-Latent Class Models
(npLCM) for Estimating Disease
Etiology in Case-Control Studies
Abstract
The Pneumonia Etiology Research for Child Health (PERCH) study attempts to infer the
distribution of pneumonia-causing bacterial or viral pathogens in developing countries from
measurements outside of the lung. Recent developments in test standardization make it possible
to collect multiple specimens to detect a large number of pathogens at once with varying
degrees of etiologic relevance and measurement precision. With these data, researchers
seek to estimate the population fraction of cases caused by each pathogen, and to develop
algorithms to assist clinical diagnosis when presented with complex data on an individual
case.
We describe a latent variable model to address these two analytic goals using data from
a case-control design. We assume each observation is a draw from a mixture model for
which each component represents one pathogen. Conditional dependence among multi-
variate binary measurements on a single subject is induced by nesting subclasses within
each disease class. Measurement precision can be estimated using the control sample for
whom the etiologic class is known. We assume the measurement precision is independent
of the disease status. We use stick-breaking priors on the subclass weights to estimate the
population and individual etiologic distributions that are averaged across models indexed
by different numbers of subclasses. Assessment of model fit and individual diagnosis are
done using posterior samples drawn by Gibbs Sampling. We demonstrate the method’s op-
erating characteristics via a simulation study tailored to the motivating scientific problem
and illustrate the model with a detailed analysis of PERCH study data.
4.1 Introduction
Multivariate binary data are a common outcome in disease etiology studies (Hammitt
et al., 2012), verbal autopsy studies (King and Lu, 2008; King et al., 2010) and genomic
studies (Hoff, 2005). For example, in the Pneumonia Etiology Research for Child Health
(PERCH) study of childhood pneumonia (Levine et al., 2012), a vector of presence/absence
for up to 30 different pathogens is measured by polymerase chain reaction (PCR) using
specimens from the nasopharyngeal cavity. A goal is to use the multivariate binary re-
sponses to infer the pathogen in the child’s lung causing pneumonia. Assuming only one
unknown pathogen has caused each case’s disease, public health researchers are interested
in clustering cases into groups, each with a different pathogen causing its pneumonia, and
then estimating the fraction of each group. Such knowledge about compositional structure
in the case population is useful for designing disease prevention programs and prioritizing
treatments. We term these fractions as etiologic fractions: probabilities that sum to one
with each component corresponding to a pathogen cause or disease class.
The dependence structure among the observed binary measurements has two primary
sources: 1) the multinomial variation in unobserved disease class indicators among cases,
and 2) given disease class, the conditional dependence among the imperfect measurements.
To distinguish these two sources and to infer the disease class for individual cases, latent
class models (LCM) are commonly used to connect an individual’s measurement M to her
unobserved class indicator I through the likelihood [M | I,Θ], where Θ denotes the col-
lection of parameters (Goodman, 1974). For binary measures, Θ includes sensitivities and
specificities. Under local identifiability conditions (Jones et al., 2010), joint maximization
of the model likelihood $\sum_{j=1}^{J} [M \mid I = j, \Theta]\Pr(I = j)$ with respect to all unknowns gives
parameter estimates $\hat{\Theta}$ and etiologic fraction estimates $\widehat{\Pr}(I = j)$. Individual classification
can then be done by applying Bayes’ rule using the estimated parameters.
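The classification step can be sketched as follows, assuming the standard conditional independence form of [M | I, Θ] in which each binary measurement has its own class-specific positive rate. The function and argument names are illustrative, and the numbers are hypothetical.

```python
def class_posterior(m, class_probs, positive_rates):
    """Pr(I = j | M = m) under a conditional-independence latent class model.

    class_probs[j]       : Pr(I = j), the mixing proportions
    positive_rates[j][d] : Pr(M_d = 1 | I = j)
    """
    joint = []
    for prior, rates in zip(class_probs, positive_rates):
        likelihood = 1.0
        for obs, p in zip(m, rates):
            likelihood *= p if obs == 1 else 1.0 - p
        joint.append(prior * likelihood)
    total = sum(joint)  # the mixture likelihood of the observed vector m
    return [value / total for value in joint]


# Hypothetical two-class example with two binary measurements.
posterior = class_posterior(
    m=(1, 0),
    class_probs=[0.5, 0.5],
    positive_rates=[[0.9, 0.1], [0.1, 0.9]],
)
```

The denominator `total` is exactly the mixture likelihood in the sentence above, evaluated at a single observation.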
As noted by many authors, misspecification of the conditional distribution [M | I] will
likely bias model parameter and mixing proportion estimates (Albert and Dodd, 2004; Pepe
and Janes, 2007). Therefore, in many applications where the conditional independence
model for [M | I] is assumed, model adequacy is studied to ensure valid model-based
conclusions about test sensitivities/specificities and mixing weights (Garrett and Zeger,
2000). A leading example is diagnostic test evaluation without gold-standard data.
In applications where deviations from conditional independence are substantial, conditional
dependence in [M | I] has been introduced. For example, generalized mixed-effects
models with Gaussian random intercepts have been used to introduce within-subject
correlation for diagnostic tests (Qu and Hadgu, 1998). The Gaussian assumption used to describe
the heterogeneity across individuals implies symmetry of correlation on the linear
predictor scale and is sometimes not appropriate (Albert et al., 2001). Albert et al. (2001)
described an alternative finite-mixture model that introduced an extra latent subclass nested
within each class to represent subjects whose measurements were made without error. With
these and many other possible conditional dependence specifications, Albert and Dodd
(2004) noted that the fits of different models are sometimes equally adequate and indistinguishable
if the sample size is small and M has a low dimension.
The model identification problem can be partly addressed by collecting gold-standard
data on latent class indicators on some subjects (Albert and Dodd, 2008), or by collecting
extra data that provides consistent estimates of model parameters, e.g. sensitivities or speci-
ficities. In the motivating example for this paper, the PERCH study collected control data
that provides direct evidence about the specificities of tests and thereby enables estimation
of models with conditional dependence among the binary measurements.
Wu et al. (2014a) described a “partially-latent” class model (pLCM) for case and control
data to estimate the etiologic fractions, $\pi := \{\Pr(I = j)\}_{j=1,\dots,J} = (\pi_1, \dots, \pi_J)'$. For
controls, there is no infection in the lung, hence $I = 0$; for cases, there is infection so
that $I \neq 0$, and $I$ takes a value in $\{1, \dots, J\}$ indicating which pathogen causes the infection.
They referred to this as a “partially-latent class model” (pLCM) since control states are
known but case states are latent. They structured the pLCM to integrate measurements
with differing error rates that are collected in the PERCH case-control design. Estimated by
Markov chain Monte Carlo, their pLCM approximates with arbitrary precision the posterior
distribution of the population and individual etiologic fractions as well as functions of
unknowns in the model.
In their original formulation, Wu et al. (2014a) assumed conditional independence of
the J binary measurements within each disease class. The model fit well for the 10 most
frequent measurement patterns. However, several pairs of pathogens had observed log odds
ratios deviating significantly from the mass of the posterior predictive distribution. This
indicates that model fit might be further improved by considering conditional dependence
extensions of the pLCM. The associations from these models also have scientific value in
their own right.
In this paper, we extend the pLCM to introduce dependence among the J measurements
for an individual. We assume there are K subclasses nested within each of the J + 1 (J
case, 1 control) disease classes. Measurements within a subclass are assumed independent.
We assume the same number of subclasses, K, for each disease class (also see Remark 1).
This extension of the pLCM adds 2 J(K − 1) +K − 1 additional parameters compared
to the original pLCM with K J . We refer to the model as a “nested partially-latent
class model” or npLCM. We use a Bayesian penalty to encourage small values of K which
parsimoniously approximate the dependence among the multivariate binary responses and
avoid overfitting. As explained in the next section, our approach has the advantage of
easier interpretation of sensitivities/specificities without having to condition on continuous
random effects.
In this paper, we develop a hierarchical Bayesian model to extend the pLCM to intro-
duce flexible dependence among the binary responses. The control sample provides the
requisite information about specificities. Prior knowledge about sensitivities can be in-
corporated to facilitate estimation of the etiologic fractions. The method is based on the
nonnegative PARAFAC decomposition (Shashua and Hazan, 2005) that enables parsimo-
nious approximation of a high-dimensional contingency table. The approach is especially
helpful if the sample size is small compared to the total number, $2^{J+1}$, of cells in the joint
distribution of the measurements, M.
Our model is estimated via Markov chain Monte Carlo with data augmented by latent
indicators of disease class and nested indicators of subclass. Throughout the paper, we
rely on the scientific assumption that each child’s pneumonia is caused by a single primary
pathogen. The more general case where disease can be attributed to multiple pathogens can
be developed through our model formulation, but with computational complexities (Section
4.7).
The remainder of this paper proceeds as follows. Section 4.2 introduces the model for-
mulation of the npLCM. Section 4.3 discusses several inherent model properties. Section
4.4 details the posterior computing algorithms. Section 4.5 uses simulated data sets that are
tailored to the motivating application to illustrate the benefits of using the npLCM relative
to the pLCM, a special case. Section 4.6 applies the proposed method to PERCH study
data. Section 4.7 concludes with remarks on model extensions.
4.2 Model specification of npLCM
In this section, we fully specify the nested partially latent class model (npLCM). We
discuss the model properties and its parameter interpretations using the PERCH study as
an example. Let $M_i = (M_{i1}, \dots, M_{iJ})$ comprise a J-dimensional multivariate binary measurement
collected for subjects $i = 1, \dots, n_1 + n_0$, where the first $n_1$ subjects are cases and
the remaining $n_0$ are controls. $Y_i = 1$ denotes a case and $Y_i = 0$ a control.
4.2.1 npLCM likelihood
Figure 4.1 pictures the general structure of the npLCM with J = 5 dimensional measurements,
one pathogen measurement per row in the matrix. With 5 pathogens, there are 6
classes: one for the control state (pathogen-free) on the left of the dashed vertical line; and
5 states for the possible etiologic pathogens on the right. In the figure, the control measurements
have a joint distribution that is approximated by a mixture of K = 2 subclasses, with
K-dimensional mixing weights $\nu = (\nu_1, \dots, \nu_K)'$. Here $\Psi_{k_0} = \{\psi^{(j)}_{k_0}\}_{1 \le j \le J}$ is the vector
of false positive rates for measurements $j = 1, \dots, J$ in the subclass $k_0 = 1, \dots, K$. To the
right of the dashed line are the J = 5 classes for cases. The mixing weights of the K subclasses
in the case population are assumed to be $\eta^{(j)} = (\eta^{(j)}_1, \dots, \eta^{(j)}_K)'$, for $j = 1, \dots, J$. The etiologic
fractions are defined to be the J-dimensional mixing weights for the J classes in the case
population, denoted $\pi = (\pi_1, \dots, \pi_J)'$.
Figure 4.1: Model structure that incorporates conditional dependence within each disease class illustrated by J = 5 pathogens (called A, B, C, D, and E) in the PERCH study. On the left are the control measurements that arise from a mixture of K = 2 conditionally independent subclass measurement profiles with mixing weights $\nu_1$ and $\nu_2$. Here $\psi^{(j)}_k$ is the false positive rate for pathogen j in a subclass k. On the right are the J = 5 disease classes, one for each possible pathogen. Each case is assumed to be caused by a unique pathogen indicated by $I^L$ taking values in $\{1, \dots, J\}$. For a class containing all cases whose $I^L = j_0$, the K = 2 subclasses of measurement profiles are assumed equal to the control false positive rates $\psi^{(j)}_k$ for $j \neq j_0$, and equal to the true positive rate $\theta^{(j)}_k$ for $j = j_0$, $k = 1, \dots, K$. Within each disease class, two subclass measurement profiles are nested. The mixing weights of subclasses nested in the jth disease class are $\eta^{(j)}_1$ and $\eta^{(j)}_2$. $\pi = (\pi_1, \dots, \pi_J)'$ are disease class mixing weights, and are called etiologic fractions.
The control measurement distribution is assumed to take the form of a latent class model
(Goodman, 1974). For control i with measurement $m_i$, the J-way contingency table with
cell probabilities $P^0 = \{\Pr(M = m \mid \nu, \Psi, Y = 0)\}_{m \in \{0,1\}^J}$ can be decomposed as
$$P^0_i = \sum_{k=1}^{K^*} \nu_k \prod_{j=1}^{J} \left\{\psi^{(j)}_k\right\}^{m_{ij}} \left\{1 - \psi^{(j)}_k\right\}^{1 - m_{ij}}, \qquad (4.2.1)$$
where $K^*$ is a positive integer, $\nu = (\nu_1, \dots, \nu_{K^*})'$ is a vector that sums to one, and $\Psi$ is a
parameter matrix with $(j, k)$th element $\psi^{(j)}_k \in [0, 1]$ (Shashua and Hazan, 2005).
We introduce subclass indicator $Z_i$ that takes values in $\{1, \dots, K^*\}$ for control subject i.
Then (4.2.1) is equivalent to the two-stage model:
$$Z_i \sim \text{Multinomial}(\{1, \dots, K^*\}, \nu), \qquad (4.2.2)$$
$$M_{ij} \mid Z_i = k \sim \text{Bernoulli}\left(\psi^{(j)}_k\right), \quad \text{independently for } j = 1, \dots, J, \qquad (4.2.3)$$
where $\nu_k = \Pr(Z_i = k \mid Y_i = 0)$ and $\psi^{(j)}_k = \Pr(M_{ij} = 1 \mid Z_i = k, Y_i = 0)$. Here $\nu$ is
the vector of mixing weights of $K^*$ subclass measurement profiles; $\psi^{(j)}_k$ is the probability
of $m_{ij} = 1$ given this control subject is allocated to subclass k. In the application of
pneumonia etiology estimation, $\psi_k = \{\psi^{(j)}_k\}_{j=1,\dots,J}$ is the kth false positive rate (FPR)
profile, because any positive detection of pathogens from a control subject will be a false
positive.
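To make the control model concrete, the following sketch evaluates the mixture of product Bernoullis in (4.2.1) for a given pattern and shows how nesting subclasses induces dependence between measurements that are independent within each subclass. The parameter values are hypothetical and the function names are illustrative; indices are zero-based, with `psi[j][k]` standing for $\psi^{(j)}_k$.

```python
from itertools import product


def control_pattern_probability(m, nu, psi):
    """Equation (4.2.1): probability of binary pattern m for a control.

    nu[k]     : mixing weight of subclass k
    psi[j][k] : false positive rate of measurement j in subclass k
    """
    total = 0.0
    for k, weight in enumerate(nu):
        term = weight
        for j, obs in enumerate(m):
            term *= psi[j][k] if obs == 1 else 1.0 - psi[j][k]
        total += term
    return total


def pairwise_covariance(j, l, nu, psi):
    """Cov(M_j, M_l) among controls implied by the subclass mixture."""
    mean_j = sum(w * psi[j][k] for k, w in enumerate(nu))
    mean_l = sum(w * psi[l][k] for k, w in enumerate(nu))
    joint = sum(w * psi[j][k] * psi[l][k] for k, w in enumerate(nu))
    return joint - mean_j * mean_l


# Hypothetical values: J = 2 measurements, K = 2 subclasses.
nu = [0.5, 0.5]
psi = [
    [0.5, 0.1],  # measurement 1: FPR in subclass 1 and subclass 2
    [0.5, 0.1],  # measurement 2: FPR in subclass 1 and subclass 2
]
probs = {
    m: control_pattern_probability(m, nu, psi)
    for m in product((0, 1), repeat=2)
}
cov = pairwise_covariance(0, 1, nu, psi)
```

With a single subclass the covariance is zero; here the two subclasses have different FPR profiles, so the two measurements are positively associated marginally even though they are independent given the subclass.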
The vector of binary measurements for a case is assumed to be generated from a mixture
of latent class models, one for each possible cause. In addition, given each potential cause,
we assume the distribution is the same as for the controls except for the causal pathogen.
Specifically, for disease class $j_0$, the K subclasses of measurement profiles are assumed
equal to the control false positive rates $\psi^{(j)}_k$ for $j \neq j_0$, $k = 1, \dots, K$. We assume the true positive
rates to be $\theta^{(j)}_k$ for $j = j_0$, $k = 1, \dots, K$. For a case $i'$ with measurement $m_{i'}$, the joint
measurement distribution for the cases, $P^1 = \{\Pr(M = m \mid \pi, \eta, \Theta, \Psi, Y = 1)\}$, for
$m \in \{0, 1\}^J$, is therefore given by
$$P^1_{i'} = \sum_{j=1}^{J} \pi_j \sum_{k=1}^{K^*} \left[ \eta^{(j)}_k \left\{\theta^{(j)}_k\right\}^{m_{i'j}} \left\{1 - \theta^{(j)}_k\right\}^{1 - m_{i'j}} \prod_{l \neq j} \left\{\psi^{(l)}_k\right\}^{m_{i'l}} \left\{1 - \psi^{(l)}_k\right\}^{1 - m_{i'l}} \right], \qquad (4.2.4)$$
where $\pi = (\pi_1, \dots, \pi_J)'$ is a vector that sums to one, and $\Theta$ is a parameter matrix with $(j, k)$th
element $\theta^{(j)}_k \in [0, 1]$. Parameters in (4.2.4) are better interpreted using subclass indicator
$Z_{i'}$ and an extra class indicator $I_{i'} \in \{1, \dots, J\}$ (or disease class indicator),
$$I_{i'} \mid Y_{i'} = 1 \sim \text{Multinomial}(\{1, \dots, J\}, \pi), \qquad (4.2.5)$$
$$Z_{i'} \mid I_{i'} = j \sim \text{Multinomial}(\{1, \dots, K\}, \eta^{(j)}), \quad j = 1, \dots, J, \qquad (4.2.6)$$
$$M_{i'j} \mid Z_{i'} = k, I_{i'} \overset{\text{indep}}{\sim} \text{Bernoulli}\left(\theta^{(j)}_k 1_{\{I_{i'} = j\}} + \psi^{(j)}_k 1_{\{I_{i'} \neq j\}}\right), \quad j = 1, \dots, J. \qquad (4.2.7)$$
In the PERCH application, $\pi = (\pi_1, \dots, \pi_J)'$ is the vector of probabilities that a case belongs
to class 1 through J, i.e. the etiologic fractions, which is the primary target of
inference; $\eta^{(j)} = (\eta^{(j)}_1, \dots, \eta^{(j)}_{K^*})'$ mixes $K^*$ subclasses nested in disease class j; $\theta^{(j)}_k$ is
the true positive rate (TPR) for a case belonging to the jth disease class and kth subclass
measurement profile. Equation (4.2.7) indicates that $\theta^{(j)} = \{\theta^{(j)}_k\}_{1 \le k \le K^*}$ replaces
$\psi^{(j)} = \{\psi^{(j)}_k\}_{1 \le k \le K^*}$ in otherwise similar controls to indicate the change in positive detection
rate induced by pathogen infection in a disease class j.
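The generative reading of (4.2.5) to (4.2.7) can be sketched as a three-step simulation for a single case: draw the cause, draw the nested subclass, then draw each measurement at the TPR for the causal pathogen and at the FPR otherwise. Indices are zero-based, the function names are illustrative, and the parameter values in the example are hypothetical.

```python
import random


def draw_index(weights, rng):
    """Sample an index with the given probabilities (weights sum to one)."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def simulate_case(pi, eta, theta, psi, rng):
    """One draw (I, Z, M) from (4.2.5)-(4.2.7).

    pi[j]       : etiologic fraction of disease class j
    eta[j][k]   : weight of subclass k nested in disease class j
    theta[j][k] : TPR of measurement j in subclass k
    psi[j][k]   : FPR of measurement j in subclass k
    """
    cause = draw_index(pi, rng)             # (4.2.5)
    subclass = draw_index(eta[cause], rng)  # (4.2.6)
    m = []
    for j in range(len(pi)):                # (4.2.7)
        rate = theta[j][subclass] if j == cause else psi[j][subclass]
        m.append(int(rng.random() < rate))
    return cause, subclass, tuple(m)


rng = random.Random(1)
# Hypothetical parameters: J = 3 pathogens, K = 2 subclasses.
pi = [0.5, 0.3, 0.2]
eta = [[0.7, 0.3], [0.5, 0.5], [0.2, 0.8]]
theta = [[0.9, 0.8], [0.95, 0.7], [0.85, 0.9]]
psi = [[0.3, 0.1], [0.2, 0.05], [0.1, 0.4]]
cause, subclass, m = simulate_case(pi, eta, theta, psi, rng)
```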
Combining (4.2.1) and (4.2.4), the joint model likelihood across independent cases and
controls is
$$L(\pi, \Theta, \Psi, \nu, \eta; D) = \prod_{i: Y_i = 0} P^0_i \prod_{i': Y_{i'} = 1} P^1_{i'}, \qquad (4.2.8)$$
where $D = \{m_i : Y_i = 0\} \cup \{m_{i'} : Y_{i'} = 1\}$ collects all the measurement data on the cases
and controls.
Remark 1. We assumed that dependence of measurements within each case class can
be explained by allowing the same number of conditionally independent subclasses as in
controls. If the disease class I is directly observed for all or a subset of cases, extra case
subclasses can be included. However, without direct observations of I , as is the case in
PERCH, we use the same number of subclasses in the cases so that subclass parameters
can be partly informed by the control population using (4.2.7).
Remark 2. While we assume $K$, the number of subclasses per class, is the same for cases
as for controls, we do not add the restriction that the mixing distribution across
subclasses is also the same, i.e., $\nu = \eta^{(j)}$, $j = 1, \ldots, J$. In this way, the dependence among
the measurements $M_{[-j]}$ is not required to be identical for controls and cases caused by
pathogen $j$.
Remark 3. Any multivariate binary distribution can be expressed as a mixture of product
Bernoullis for a sufficiently large K = K∗ in (4.2.1). However, the choice of K∗ is not
straightforward. Also, since our inferential goal is to estimate the etiology fractions, π,
after marginalizing over subclass indicators ($Z_{i'} \in \{1, \ldots, K\}$), the dependence structure
within each disease class is represented by nuisance parameters. Therefore, rather than
fixing K, we perform model averaging across a range of plausible values of K, so that
its uncertainty is incorporated into the inference about π. This is particularly desirable
when the observed contingency table P 0 has a large proportion of empty cells (> 97% in
PERCH), and we want to prevent model overfitting in finite sample sizes using K ≤ K∗ as
discussed by Dunson and Xing (2009).
4.2.2 Prior specifications
Prior distributions on unknown parameters are specified as follows:

$$ \pi \sim \text{Dirichlet}(a_1, \ldots, a_J), \tag{4.2.9} $$

$$ \psi_k^{(j)} \sim \text{Beta}(b_{1kj}, b_{2kj}), \quad j = 1, \ldots, J; \ k = 1, \ldots, \infty, \tag{4.2.10} $$

$$ \theta_k^{(j)} \sim \text{Beta}(c_{1kj}, c_{2kj}), \quad j = 1, \ldots, J; \ k = 1, \ldots, \infty, \tag{4.2.11} $$

$$ Z_{i'} \mid I^L_{i'} = j \sim \sum_{k=1}^{\infty} U_k^{(j)} \prod_{l<k} \left[1 - U_l^{(j)}\right] \delta_k, \quad \text{for all cases}, \tag{4.2.12} $$

$$ U_k^{(j)} \sim \text{Beta}(1, \alpha_0), \tag{4.2.13} $$

$$ Z_i \sim \sum_{k=1}^{\infty} V_k \prod_{l<k} \left[1 - V_l\right] \delta_k, \quad \text{for all controls}, \tag{4.2.14} $$

$$ V_k \sim \text{Beta}(1, \alpha_0), \tag{4.2.15} $$

$$ \alpha_0 \sim \text{Gamma}(0.25, 0.25), \tag{4.2.16} $$

where $\delta_k$ is a point mass on $k$, and prior independence is also assumed among these parameters.
In (4.2.12)-(4.2.15), we have specified stick-breaking priors for both $\eta^{(j)}$ and $\nu$,
which place decreasing weights on the $k$th measurement profile as $k$ increases (Ishwaran
and James, 2001).
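The stick-breaking construction in (4.2.12)-(4.2.15) can be sketched with a finite truncation. The sketch below is an illustration, not the thesis's sampler: the function name is hypothetical, the concentration `alpha0` is fixed at an arbitrary value instead of being drawn from its Gamma prior, and the truncation level of 10 follows the value quoted for the PERCH analysis in Section 4.6.1.

```python
import random


def stick_breaking_weights(alpha0, truncation, rng):
    """Truncated stick-breaking weights with Beta(1, alpha0) stick proportions."""
    sticks = [rng.betavariate(1.0, alpha0) for _ in range(truncation - 1)]
    sticks.append(1.0)  # the last stick takes all remaining mass, so weights sum to one
    weights, remaining = [], 1.0
    for u in sticks:
        weights.append(u * remaining)  # U_k * prod_{l<k} (1 - U_l)
        remaining *= 1.0 - u
    return weights


rng = random.Random(0)
alpha0 = 0.5  # hypothetical fixed value; the thesis places a Gamma(0.25, 0.25) prior on it
weights = stick_breaking_weights(alpha0, truncation=10, rng=rng)
```

Smaller values of `alpha0` push most of the mass onto the first few subclasses, which is how the prior favors mixtures with fewer components.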
4.3 Model properties
4.3.1 Non-interference submodels
A fundamental premise of the pLCM and this extended npLCM is that the etiologic
pathogen in the lung, $I^L$, is differentially expressed in the peripheral measurements $M$.
That is, if one case has disease caused by pathogen $j$ and another by pathogen $j'$, then the
joint distributions for the measurements of the remaining pathogens (all but $j$ and $j'$), call
this $\Pr(M_{[-(j,j')]} \mid I^L, Y = 1)$, will be the same. This premise is essential if we expect
to infer the lung status from peripheral measurements.
Specifically, the case measurement likelihood (4.2.4) implies that, among cases, if
$\eta^{(j)} = \eta^{(j')}$, or $K = 1$,

$$ \Pr(M_{[-(j,j')]} \mid I^L = j, Y = 1) = \Pr(M_{[-(j,j')]} \mid I^L = j', Y = 1), \tag{4.3.1} $$

for $1 \le j < j' \le J$. If $\nu = \eta^{(j)}$, or $K = 1$, we further have, between the controls and the
cases, that

$$ \Pr(M_{[-j]} \mid Y = 0) = \Pr(M_{[-j]} \mid I^L = j, Y = 1), \quad j = 1, \ldots, J. \tag{4.3.2} $$

In the PERCH application, equality (4.3.1) implies that measurements on pathogens have
the same distribution for cases belonging to two different disease classes, as long as these
pathogens are not infecting either of them. Equality (4.3.2) also implies that measurements
on pathogens other than j have the same distribution for the controls and the cases
caused by pathogen j. Conditions (4.3.1) and (4.3.2) are hence termed non-interference conditions.
The observed data can be used to support or reject the non-interference submodels (4.3.1)
and (4.3.2), as discussed in more detail in Section 4.6.
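A short derivation may help to see where the two conditions come from. It assumes the case model (4.2.4) with false positive rate $\psi_k^{(l)}$ for each non-causal pathogen $l$, and it assumes that the control model (4.2.1), which is not reproduced in this section, is the analogous mixture with weights $\nu$. Summing (4.2.4) over the measurements of pathogens $j$ and $j'$ within disease class $j$ gives

```latex
\Pr\left(M_{[-(j,j')]} = m \mid I^L = j, Y = 1\right)
  = \sum_{k=1}^{K} \eta_k^{(j)} \prod_{l \notin \{j, j'\}}
    \left\{\psi_k^{(l)}\right\}^{m_l} \left\{1 - \psi_k^{(l)}\right\}^{1 - m_l},
\qquad
\Pr\left(M_{[-j]} = m \mid Y = 0\right)
  = \sum_{k=1}^{K} \nu_k \prod_{l \neq j}
    \left\{\psi_k^{(l)}\right\}^{m_l} \left\{1 - \psi_k^{(l)}\right\}^{1 - m_l}.
```

The first expression depends on the disease class only through the weights $\eta^{(j)}$, so it is unchanged when $j$ is replaced by $j'$ whenever $\eta^{(j)} = \eta^{(j')}$; with $K = 1$ the single weight equals one and the equality is automatic. Comparing the second expression with the class-$j$ case distribution of $M_{[-j]}$ gives (4.3.2) under $\nu = \eta^{(j)}$ in the same way.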
4.3.2 Mean and covariance structure
In the appendix at the end of this chapter, we provide straightforward expressions
for the marginal means and pairwise associations for J pathogens separately for the cases
and controls, and discuss how information borrowed from the controls is manifested in
the cases’ measurements. These formulas are used to generate posterior distributions for
observable characteristics of the measurements that are essential to model checking. That
use is illustrated in Section 4.6 for the PERCH data.
In addition, these formulas allow comparison of the pLCM and npLCM models in terms
of their estimates of the etiologic fractions π, which are of primary interest in our application.
In Section 4.5, through simulations, we assess the bias-variance trade-off of inferring
etiologic fractions π when using the npLCM compared to the pLCM in finite sample sizes.
4.3.3 Alternate approaches to borrowing information from the control population
It is also clear from (A.2.3) that the control measurements provide direct evidence about
the marginal false positive rates (MFPRs), $\Psi^M = \{\Pr(M_{ij} = 1 \mid Y_i = 0), j = 1, \ldots, J\}$. One
may estimate $\Psi^M$ without joint modeling of the control measurements, or by assuming a
working independence model. Both approaches provide consistent estimates of the MFPRs.
One can then use the nonparametric bootstrap to obtain a robust covariance estimate on
the logit scale, $\Sigma_M$, and place a multivariate normal prior, $\text{logit}(\Psi^M) \sim N_J(0, \Sigma_M)$, to inform
the case model. Similarly, marginal moments beyond the first order can be borrowed to
the case model through GEE2 (Liang et al., 1992) or the more computationally efficient
alternating logistic regression (ALR) (Carey et al., 1993).
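The bootstrap step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names, the 0.5 pseudo-count that keeps the logit finite, the simulated control data, and the number of bootstrap replicates are all hypothetical choices made here, not details from the thesis.

```python
import math
import random


def logit(p):
    return math.log(p / (1.0 - p))


def marginal_rates(rows):
    """Marginal positive rate per measurement, with a 0.5 pseudo-count
    (an assumption made here so the logit stays finite)."""
    n, n_meas = len(rows), len(rows[0])
    return [(sum(r[j] for r in rows) + 0.5) / (n + 1.0) for j in range(n_meas)]


def bootstrap_logit_cov(rows, n_boot, rng):
    """Nonparametric bootstrap covariance of the logit marginal rates.
    Whole subjects (rows) are resampled, so within-subject dependence is kept."""
    n, n_meas = len(rows), len(rows[0])
    draws = []
    for _ in range(n_boot):
        resample = [rows[rng.randrange(n)] for _ in range(n)]
        draws.append([logit(p) for p in marginal_rates(resample)])
    means = [sum(d[j] for d in draws) / n_boot for j in range(n_meas)]
    return [[sum((d[a] - means[a]) * (d[b] - means[b]) for d in draws) / (n_boot - 1)
             for b in range(n_meas)] for a in range(n_meas)]


# Hypothetical control data: 300 controls, 3 independent binary measurements.
rng = random.Random(2014)
true_rates = [0.1, 0.3, 0.5]
controls = [[int(rng.random() < p) for p in true_rates] for _ in range(300)]
psi_m_hat = marginal_rates(controls)
sigma_m = bootstrap_logit_cov(controls, n_boot=200, rng=rng)
```

Resampling rows rather than individual cells is what makes the covariance estimate robust to dependence among a control's measurements.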
These ad hoc approaches to sharing measurement error rate information between the
control and the case populations have at least two limitations. First, one needs to specify the
order of moments and then obtain the moment estimates using other robust statistical
procedures. Second, one needs to formulate the case model explicitly in terms of moment
parameters, on which priors elicited from the controls are placed. The npLCM framework
overcomes these two inconveniences using the nonnegative PARAFAC decomposition.
Estimated FPR profile parameters ($\Psi$) from the control population can inform moments of
arbitrary order among the cases, with similarity determined by the subclass mixing weights
$\nu$ and $\{\eta^{(j)}, j = 1, \ldots, J\}$.
4.3.4 Modeling choices
As an alternative to the npLCM introduced here, covariation among multivariate binary
measurements can also be formulated by generalized linear mixed-effects models
(GLMM). Suppose the control measurements follow the distribution

$$ g\left(\Pr(M_{ij} = 1 \mid Y_i = 0, \delta_i)\right) = \delta_{ij} + \psi^C_j, \quad j = 1, \ldots, J; \quad \text{and} \quad \delta_i = \Lambda \xi_i + \epsilon_i, \tag{4.3.3} $$

$$ \xi_i = (\xi_{i1}, \ldots, \xi_{iS})' \sim N_S(0, I_S), \quad \epsilon_i \sim N_J\left(0, \Omega = \text{diag}(\sigma_1^2, \ldots, \sigma_J^2)\right), \tag{4.3.4} $$

where $g(\cdot)$ is a link function and $\psi^C_j$ is the conditional FPR given random effect $\delta_{ij} = 0$. Here,
$\Lambda$ is a $J \times S$ factor loading matrix that characterizes the covariance structure of the random
effect $\delta_i$ through $\text{Cov}[\delta_i] = \Lambda\Lambda' + \Omega$. Sparse estimation of the factor loading matrix $\Lambda$ in
the GLMM can be generalized from previous Bayesian methods in the context of continuous
data (Bhattacharya and Dunson, 2011; Pati et al., 2014), possibly at greater computational
expense. We can borrow the control parameters to the $j$th class of cases by replacing $\psi^C_j$
and the $j$th row of $\Lambda$ and $\Omega$ with new parameter values. We have chosen the npLCM over
the GLMM formulation for two reasons. First, in the control population, the normal
random-effect distribution in the GLMM constrains the possible dependence structure among the
multivariate binary measurements. In contrast, the npLCM has the advantage of
approximating the control distribution arbitrarily closely using $JK + K - 1$ parameters, a much
smaller number than the $JS + 2J$ required in the GLMM when $S \approx K \ll J$. Second, the
use of the control information in the case population is more natural in the npLCM
framework. The $j$th marginal FPR in the controls or cases is a linear function of the $j$th row
of $\Psi$. In the GLMM formulation, by contrast, we need the $j$th row of $\Lambda$ and $\Omega$, $\psi^C_j$, the link
function $g(\cdot)$, and numerical integration if $g(\cdot)$ is not the probit function. This also makes
the GLMM formulation more computationally intensive relative to the npLCM.
4.4 Posterior computations
The parameters in likelihood (4.2.8) include the population etiology distribution (π),
TPRs Θ and FPRs Ψ. The posterior distribution of these parameters can be estimated by
constructing approximating samples from the joint posterior via a Gibbs sampler. The full
conditional distributions for Gibbs sampler updating are detailed in the Appendix. Figure
4.2 is the directed acyclic graph (DAG) that shows the model structure and observed and
latent variables in the npLCM.
In this work, we are able to use the freely available software WinBUGS 1.4 to fit the
npLCM. Convergence was monitored via Markov chain Monte Carlo (MCMC) chain his-
tories, auto-correlations, kernel density plots, and Brooks-Gelman-Rubin statistics (Brooks
and Gelman, 1998). The statistical results below are based on 10,000 iterations of burn-in
followed by 40,000 production samples from each of three parallel chains. Samples from
every 40 iterations are retained for inference.

Figure 4.2: Directed acyclic graph for the npLCM. Quantities in circles are unknown parameters or auxiliary variables; quantities in solid squares are observed (multivariate binary measurements here). The etiologic fraction π is of primary scientific interest. The solid arrows represent probabilistic relationships between the connected variables. The “cut” valve “A B” means that when updating node A in the Gibbs sampler, we drop the likelihood terms that involve node B (see Section 4.4).

Note that the false positive rate parameters $\Psi$ are included in both the control and
case likelihoods (4.2.1) and (4.2.4), so that the posterior distribution of $\Psi$ depends on both
the control and case models. This is referred to as “feedback” because the case model
will indirectly inform $\Psi$. If we only want the control data to inform the case model but not
vice versa, we can “cut” (Lunn et al., 2009) this source of feedback through approximate
conditional updating in the Gibbs sampler. That is, we update $\psi_k^{(j)}$ by
$\Pr(\psi_k^{(j)} \mid M_{ij}, i : Y_i = 0)$ instead of step (7) of the Gibbs sampler (see Appendix). This cuts the information
flow from the case model to the FPR parameters $\Psi$ and is indicated by the check-bit valves
in Figure 4.2. Cutting is desirable when certain parts of the joint model are considered not
reliable enough to inform a subset of parameters, and it can be implemented by the cut function
in WinBUGS 1.4. Such “cut-the-feedback” approximate Bayesian computation offers
gains in both computational speed and inferential robustness, and has also been suggested in other
contexts (Liu et al., 2009; Warren et al., 2012; Zigler and Dominici, 2014).
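A control-only update of one FPR parameter might look like the sketch below. It is not the Gibbs step from the Appendix, which is not reproduced in this section: it assumes the Beta prior (4.2.10), conditions on the current control subclass indicators, and uses Beta-Bernoulli conjugacy. The function names and the toy data are hypothetical.

```python
import random


def control_only_beta_params(control_m, control_z, j, k, b1, b2):
    """Beta parameters for psi_k^(j) using controls only (case terms are dropped).

    control_m[i][j] : binary measurement j of control i
    control_z[i]    : current subclass indicator of control i
    b1, b2          : Beta prior parameters for psi_k^(j)
    """
    in_k = [m[j] for m, z in zip(control_m, control_z) if z == k]
    positives = sum(in_k)
    negatives = len(in_k) - positives
    return b1 + positives, b2 + negatives


def cut_update_psi(control_m, control_z, j, k, b1, b2, rng):
    """Draw psi_k^(j) from its control-only conditional distribution."""
    a, b = control_only_beta_params(control_m, control_z, j, k, b1, b2)
    return rng.betavariate(a, b)


# Hypothetical toy data: 4 controls, 2 measurements, subclass labels in {0, 1}.
control_m = [(1, 0), (0, 0), (1, 1), (0, 1)]
control_z = [0, 0, 1, 0]
params = control_only_beta_params(control_m, control_z, j=0, k=0, b1=1.0, b2=1.0)
draw = cut_update_psi(control_m, control_z, j=0, k=0, b1=1.0, b2=1.0, rng=random.Random(1))
```

With feedback retained, the counts would also include cases currently assigned to subclass `k` whose disease class differs from `j`; the cut version simply leaves them out.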
4.5 Simulation studies
We compare the pLCM and npLCM in terms of π estimation and individual classifica-
tion error rates in a small simulation study with sample size n1 = n0 = 500 and J = 5
binary measures (pathogens A,...,E). We estimate the bias of not accounting for conditional
dependence when it exists and the bias-variance trade-off when choosing between pLCM
and npLCM.
In Scenario I, we simulated data with conditional independence, $K = 1$ (pLCM),
where the etiology is split evenly across the five pathogens. The FPRs are set to
$\Psi = (0.1, 0.2, 0.3, 0.4, 0.5)'$ and the TPRs to $\Theta = (0.9, 0.9, 0.9, 0.9, 0.9)'$. In Scenario
II, we simulated the data under an npLCM specification with true etiologic fraction
$\pi = (0.5, 0.2, 0.15, 0.1, 0.05)'$. We then create associations between the binary measurements by
defining two subclasses ($K = 2$) for the 6 disease states; the FPR profiles are $\Psi = [\psi_1, \psi_2]$,
where $\psi_1 = (0.1, 0.1, 0.1, 0.1, 0.5)'$ and $\psi_2 = (0.1, 0.5, 0.1, 0.1, 0.1)'$; the TPR profiles are
$\Theta = [\theta_1, \theta_2]$, where $\theta_1 = (0.9, 0.9, 0.9, 0.9, 0.9)'$ and $\theta_2 = (0.1, 0.9, 0.9, 0.9, 0.9)'$. The
true subclass mixing weights in the controls ($\lambda$) and cases ($\eta^{(j)}$, $j = 1, \ldots, 5$) are all set
equal to $(0.9, 0.1)'$. With this set of parameter values, the pair of pathogens (B,E) is
negatively associated in the controls. In the cases infected by pathogen A, we created a negative
association between the pair (A,B) and a positive association between the pair (A,E). This
mimics the situation where the pathogen infecting a case can inhibit the growth of some
pathogens in the periphery while promoting the growth of another.
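The signs of these associations can be checked directly from the Scenario II parameter values. The sketch below uses the fact that measurements are independent within a subclass, so the marginal covariance of two measurements is a weighted sum of products minus the product of weighted sums; the function name and dictionary layout are illustrative choices, not the thesis's code.

```python
def mixture_covariance(weights, rates_a, rates_b):
    """Covariance of two binary measurements under a mixture of product Bernoullis."""
    e_ab = sum(w * a * b for w, a, b in zip(weights, rates_a, rates_b))
    e_a = sum(w * a for w, a in zip(weights, rates_a))
    e_b = sum(w * b for w, b in zip(weights, rates_b))
    return e_ab - e_a * e_b


weights = [0.9, 0.1]  # subclass mixing weights, identical for controls and cases
# Per-pathogen FPRs across the two subclasses, read off psi_1 and psi_2.
psi = {"A": [0.1, 0.1], "B": [0.1, 0.5], "E": [0.5, 0.1]}
# TPRs of pathogen A across the two subclasses, read off theta_1 and theta_2.
theta_a = [0.9, 0.1]

cov_be_controls = mixture_covariance(weights, psi["B"], psi["E"])  # about -0.0144
cov_ab_class_a = mixture_covariance(weights, theta_a, psi["B"])    # about -0.0288
cov_ae_class_a = mixture_covariance(weights, theta_a, psi["E"])    # about +0.0288
```

The negative (B,E) covariance in the controls and the negative (A,B) and positive (A,E) covariances among class-A cases match the description above.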
For each scenario, we generated 50 data sets; the npLCM and pLCM algorithms were
fitted separately to each simulated data set using the posterior computing algorithm de-
scribed in Section 4.4. We obtained good mixing behavior of the Gibbs sampler that con-
verged in each case for the etiologic fractions π as determined by visual inspections of the
chain history, auto- and cross-correlations.
Table 4.1 shows the simulation results from the npLCM and pLCM, respectively, for
both scenarios. The row beginning with $\bar{\pi}_j$ is the average of posterior means across
simulation replicates; $S_{\pi_j}$ is the sample standard deviation of posterior means across simulation
replicates; $(V_{\pi_j})^{1/2}$ is the square root of the average posterior variance of $\pi_j$ across
simulation replicates; coverage is the proportion of simulation replicates that produced 95%
credible intervals covering the true $\pi_j$. With 50 replicates, the empirical coverage rate
of nominal 95% intervals is expected to be 95% and will most likely fall in the range of
(88, 100)%.
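The thesis does not state how the (88, 100)% range was obtained. One plausible reading, sketched below as an assumption, is the nominal coverage plus or minus two binomial standard errors with 50 replicates, truncated to the unit interval.

```python
import math

n_replicates = 50
nominal = 0.95

# Standard error of an empirical proportion when the true coverage equals the nominal level.
standard_error = math.sqrt(nominal * (1.0 - nominal) / n_replicates)

# Two-standard-error band, truncated to [0, 1] (an assumed convention, not from the thesis).
lower = max(0.0, nominal - 2.0 * standard_error)
upper = min(1.0, nominal + 2.0 * standard_error)
```

Under this reading the lower limit is just under 89% and the upper limit is capped at 100%, which is consistent with the quoted range.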
In Scenario I, the estimates from the pLCM and the npLCM are comparable. The
average posterior means are similar and close to the truth. The coverage rates of the nominal
95% credible intervals are within the (88, 100)% range. As expected, the npLCM
estimates generally have larger posterior variances, because the npLCM is a larger model and
includes the pLCM as a special case.
In Scenario II, ignoring the conditional dependence in class A of the cases leads to
a downward bias, $\bar{\pi}_A = 0.44$ compared to the true value of 0.5. We also observe
under-coverage (84%) of the nominal 95% credible interval. The npLCM recovers $\pi_A$
accurately, with an average posterior mean of 0.51. The coverage of the nominal 95% interval is
estimated to be 90%, a reasonable value. In this simulation, the other estimates from the pLCM
remain robust to the conditional dependence that is present in the true data generating
mechanism. The sample standard deviations of the posterior means are slightly smaller for the pLCM
than for the npLCM (see row 3 of the bottom panel), indicating that the smaller model,
the pLCM, has accepted bias in the π estimates in exchange for reduced variance.
Table 4.1: Results for simulated data sets separately fitted by the npLCM and pLCM.

SCENARIO I: truth is pLCM
                          fitted by npLCM                  fitted by pLCM
                     A     B     C     D     E        A     B     C     D     E
π                   0.2   0.2   0.2   0.2   0.2      0.2   0.2   0.2   0.2   0.2
$\bar{\pi}_j$       0.22  0.21  0.20  0.20  0.16     0.22  0.21  0.20  0.20  0.16
$S_{\pi_j}$         0.03  0.03  0.04  0.03  0.05     0.03  0.03  0.03  0.03  0.04
$(V_{\pi_j})^{1/2}$ 0.05  0.05  0.05  0.06  0.06     0.04  0.04  0.04  0.05  0.05
coverage            98%   98%   100%  100%  96%      96%   98%   100%  100%  96%

SCENARIO II: truth is npLCM
                          fitted by npLCM                  fitted by pLCM
                     A     B     C     D     E        A     B     C     D     E
π                   0.5   0.2   0.15  0.1   0.05     0.5   0.2   0.15  0.1   0.05
$\bar{\pi}_j$       0.51  0.19  0.16  0.10  0.04     0.44  0.22  0.17  0.12  0.05
$S_{\pi_j}$         0.03  0.02  0.02  0.02  0.02     0.02  0.02  0.02  0.02  0.01
$(V_{\pi_j})^{1/2}$ 0.04  0.04  0.04  0.04  0.03     0.04  0.04  0.04  0.04  0.03
coverage            90%   94%   100%  98%   100%     84%   90%   100%  98%   100%

For both scenarios, we have also assessed the out-of-sample predictive performance of
the individual diagnosis based on the pLCM or the npLCM. We used 500 cases and 500
controls ($D_0$) to train the models and predicted an individual’s underlying class indicator
$I^L_{i^*}$ given her $J$ binary measurements. Under a particular model, we classify an individual
into the class that gives the highest posterior probability:
$\hat{I}^L_{i^*} = \arg\max_{j=1,2,\ldots,J} P(I^L_{i^*} = j \mid M_{i^*}, D_0)$. Here, the posterior probabilities are estimated as the frequencies of the
Gibbs sampler imputing $I^L_{i^*}$ as $j$ after the burn-in period. For a class of cases with $I^L_{i^*} = j$,
we simulate 10,000 subjects’ multivariate binary measurements $D_j$ using the model
specification. The predictions for new cases, $\hat{I}^L_{i^*}$, and their actual values of $I^L_{i^*}$ $(= j)$ are then
compared to calculate the estimated misclassification rate:

$$ r_j = \#\left\{ i^* \in D_j : \hat{I}^L_{i^*} \neq j \text{ and } I^L_{i^*} = j \right\} / 10{,}000, \quad \text{for } j = 1, \ldots, J. $$

The overall misclassification rate is calculated as $r_o = \sum_{j=1}^{J} r_j \cdot \pi_j$.
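The two rates can be computed as in the following sketch. The function names, the ten hypothetical predictions, and the class-specific rates for classes 2 to 5 are made up for illustration; only the Scenario II etiologic fractions come from the text.

```python
def class_misclassification_rate(predicted, true_class):
    """Share of simulated subjects from one true class that were assigned to another class."""
    return sum(1 for p in predicted if p != true_class) / len(predicted)


def overall_misclassification_rate(class_rates, pi):
    """Class-specific rates weighted by the etiologic fractions: r_o = sum_j r_j * pi_j."""
    return sum(r * p for r, p in zip(class_rates, pi))


pi = [0.5, 0.2, 0.15, 0.1, 0.05]  # Scenario II etiologic fractions

# Hypothetical predictions for ten subjects whose true class is 0 (three are wrong).
predicted_for_class_0 = [0, 0, 1, 0, 4, 0, 0, 2, 0, 0]
r_0 = class_misclassification_rate(predicted_for_class_0, true_class=0)

# Hypothetical rates for the remaining four classes.
class_rates = [r_0, 0.4, 0.5, 0.6, 0.7]
r_overall = overall_misclassification_rate(class_rates, pi)
```

Because the weights are the etiologic fractions, the overall rate is dominated by performance in the most common disease classes.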
Figure 4.3 compares the misclassification rates obtained using the pLCM and the npLCM
across simulation replications. In Scenario I, where the data generation mechanism
complies with the conditional independence assumption, both models have similar
classification performance. In Scenario II, the npLCM has a lower average misclassification
rate in class A relative to the pLCM (see the leftmost pair of boxplots), as expected, since
Scenario II challenges the pLCM, which cannot account for the conditional
dependence within class A.
4.6 Analysis of PERCH data
The Pneumonia Etiology Research for Child Health (PERCH) study is a standardized
and comprehensive evaluation of etiologic agents causing severe and very severe pneumonia
among hospitalized children aged 1-59 months in seven low and middle income countries.
The study sites include countries with a significant burden of childhood pneumonia
and a range of epidemiologic characteristics (Levine et al., 2012). PERCH is a case-control
study that has enrolled over 4,000 patients hospitalized for severe or very severe pneumonia
and over 5,000 controls selected randomly from the community, frequency-matched on
age in each month. More details about the PERCH design are available in Deloria-Knoll
et al. (2012).

Figure 4.3: Misclassification rate comparisons between the pLCM and npLCM predictions. 50 simulated training data sets generated under (a) scenario I (pLCM), or (b) scenario II (npLCM). Each training data set is fitted by the pLCM (clear boxplots) or npLCM (filled boxplots) to produce individual predictions. In (a) and (b), the first 5 pairs of boxplots compare class-specific misclassification rates; the last pair compares the overall misclassification rates.

To illustrate the application of the npLCM model for the analysis of PERCH study data,
we have focused on preliminary data from one site with good availability of laboratory
results on nasopharyngeal (NP) specimens with PCR detection of pathogens. Results for
all 7 countries will be reported elsewhere upon study completion. Included in the current
illustrative analysis are NPPCR data for 578 cases and 603 frequency-matched controls on
9 species of pathogens (6 viruses and 3 bacteria with their abbreviations in Figure 4.4, and
full names in Appendix).
4.6.1 Estimation of etiologic fractions
We have compared the population etiology fractions, π, estimated separately by two
related methods: (a) npLCM with K fixed at 1 (conditional independence submodel, or
pLCM), and (b) npLCM with stick-breaking prior on the subclass weights (truncation level
K = 10). The results are shown in Figure 4.4 using the visualization introduced by Wu
et al. (2014a).
It is desirable to compare the objective evidence in the data (input) and the posterior
distribution of the parameters of main scientific interest, here the etiology fractions π (out-
put). The left panel of Figure 4.4 displays for each pathogen (row) the positive observation
rates from cases and controls, and the estimated conditional odds ratios with 95% confi-
dence intervals of the pathogen with case status adjusted for the presence or absence of
other pathogens using standard logistic regression.
In the right panel of Figure 4.4 are the marginal prior and posterior distributions of the
etiologic fraction for each pathogen by method (a) pLCM (black) and (b) npLCM (blue).
The posterior mean with 50% and 95% credible intervals is shown above the density.
With the exception of one virus, the differences in the estimated etiologic fractions from
the two approaches are small. The npLCM estimates that pathogen RSV caused 26.6%
(95% CI: 17.6−43.7%) of the disease in the case population. A very similar result,
27.4% (18.0−48.4%), is obtained by the pLCM, which assumes conditional independence.
The large conditional odds ratio (COR) of RSV with case status (31.6 (14.9−81.8)) cannot
be explained away by strong conditional association with another pathogen.
The one exception is the virus RHINO, which has a substantially larger etiologic fraction,
22.1% (8.4−41.1%), estimated by the npLCM as compared to 10.5% (0.6−28.6%)
from the pLCM analysis. This is a result of RHINO’s strong negative association with
RSV in the cases (log OR: −1.8 (s.e. 0.3)). From equation (A.2.4), the npLCM assumes
that the pairwise log odds ratio receives contributions from all J classes of cases. For RSV
and RHINO, the involved parameters have posterior means $\theta^{(\text{RSV})} = (0.75, 0.42, \ldots)'$,
$\psi^{(\text{RHINO})} = (0.26, 0.61, \ldots)'$, $\theta^{(\text{RHINO})} = (0.72, 0.27, \ldots)'$, and $\psi^{(\text{RSV})} = (0.01, 0.02, \ldots)'$.
Here, only estimates for the largest two subclasses are shown because the estimated
subclass mixing weights in the cases, $\eta_k^{(j)}$, are negligible when k ≥ 2 for j = 1, ..., 10.
Within the RSV class of cases, the relevant parameters are TPRs $\theta^{(\text{RSV})} = (0.75, 0.42, \ldots)'$
and FPRs $\psi^{(\text{RHINO})} = (0.26, 0.61, \ldots)'$. The two vectors are “out-of-phase” with each other
and so induce negative conditional dependence when we marginalize the latent subclass
indicators by subclass mixing weights $\eta^{(\text{RSV})}$. However, the posterior mean of the subclass
mixing weights is $(0.981, 0.017, \ldots)'$, highly concentrated on the first subclass, which
results in small variation in the subclass indicators. The observed negative
association between RSV and RHINO is therefore only partly accounted for by the RSV
class.

Figure 4.4: Comparison of population etiologic fraction posterior distributions between the pLCM (black) and npLCM (blue). On the left, the positive observation rates for cases and controls are plotted for each pathogen using connected blue dots; “+” and “*” denote the posterior means of $\theta^M_j$ and $\psi^M_j$, respectively; the fitted case rate is indicated by “δ”. On the right, the black/blue curves, numbers, and credible intervals above the curves denote the marginal posterior density, mean, and 50% and 95% credible intervals for $\pi_j$, j = 1, .., 10 for the pLCM/npLCM models.

The npLCM tries to account for the extra negative association by assigning higher
etiology to the RHINO and other classes of cases where additional negative associations can
be induced. In this data set, this leads to the observed increase in the etiologic fraction of
RHINO. We also observe that the posterior distribution of the RHINO etiologic fraction
is more spread out under the npLCM than under the pLCM (blue versus black curve, row 2,
right panel). This indicates that although the npLCM considers the RHINO class adequate
to induce some extra negative association between RSV and RHINO, the evidence is not
strong. The smaller increases in estimated etiology fractions for pathogens PARA1 and
HMPV A/B are similarly explained by their negative associations with RSV in the cases;
the magnitude of increase is smaller because these negative associations are not
as strongly supported by the case measurements as the one between RSV and RHINO.
A strength of the npLCM is its Bayesian formulation and flexible posterior inference
about functions of unobservables through post-processing of MCMC samples. Prediction
of latent variables given an individual’s measurements,

$$ p_i = \left\{ \Pr(I^L_i = j \mid M_i, \text{Data}), \ j = 1, \ldots, J \right\}, $$

generalizes positive and negative predictive values for multivariate binary measurements.
To illustrate individual prediction, Figure 4.5 shows the posterior etiology probabilities, $p_i$,
for individuals with the most frequent measurement patterns, predicted separately under
conditional independence (clear bars) and conditional dependence (filled bars) assumptions.
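For fixed parameter values, the class probabilities follow from Bayes' rule applied to the case mixture model. The sketch below averages these conditional probabilities over posterior draws; this is a Rao-Blackwellized alternative to the imputation-frequency estimator described in Section 4.5, named here as a substitute and not the thesis's procedure. The function names, the dictionary layout of a draw, and the two toy draws are hypothetical.

```python
def class_posterior(m, pi, eta, theta, psi):
    """Pr(I = j | M = m) for one parameter draw, by Bayes' rule on the case mixture.

    theta[l][k] and psi[l][k] are the TPR and FPR of measurement l in subclass k;
    eta[j][k] is the weight of subclass k within disease class j.
    """
    joint = []
    for j in range(len(pi)):
        likelihood = 0.0
        for k, weight in enumerate(eta[j]):
            term = weight
            for l, observed in enumerate(m):
                rate = theta[l][k] if l == j else psi[l][k]
                term *= rate if observed == 1 else 1.0 - rate
            likelihood += term
        joint.append(pi[j] * likelihood)
    total = sum(joint)
    return [value / total for value in joint]


def average_over_draws(m, draws):
    """Average the per-draw class probabilities across posterior draws."""
    per_draw = [class_posterior(m, **draw) for draw in draws]
    return [sum(p[j] for p in per_draw) / len(per_draw) for j in range(len(per_draw[0]))]


# Two hypothetical posterior draws for J = 2 pathogens and K = 1 subclass.
draws = [
    {"pi": [0.5, 0.5], "eta": [[1.0], [1.0]], "theta": [[0.9], [0.9]], "psi": [[0.1], [0.1]]},
    {"pi": [0.6, 0.4], "eta": [[1.0], [1.0]], "theta": [[0.8], [0.85]], "psi": [[0.2], [0.1]]},
]
p_i = average_over_draws((1, 0), draws)
```

With a single measurement and one subclass, the same calculation reduces to the usual positive or negative predictive value.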
In general, predictions from the pLCM and the npLCM differ only in RHINO, with the
npLCM favoring RHINO. On an individual level, this increase in the RHINO probabilities
explains the increase in the estimated RHINO etiologic fraction shown in Figure 4.4. In the
second row of Figure 4.5, the first (1000001000) and the last (100001010) patterns have
positive HINF and HMPV but differ for RHINO. Counter-intuitively, the predicted
RHINO probability is higher (37%) where RHINO is absent in the NP than where it is present
(25%). A naive expectation runs the other way: the model estimates for RHINO give a marginal
specificity (0.67) that is greater than one minus the marginal sensitivity (0.4), so observing a
negative RHINO measurement should make RHINO a less likely cause. In the npLCM, however,
beyond the first-order marginal moment parameters (e.g. marginal sensitivities/specificities),
the association parameters are also terms in the model likelihood. The last pattern (both
HMPV and RHINO positive) is more control-like and has a higher likelihood in support of
$I^L_i \neq$ HMPV, RHINO versus infection by either pathogen. This reflects the strong observed positive
association (log OR 1 (s.e. 0.2)) between HMPV and RHINO in the controls, which is
recognized by the npLCM and borrowed to the cases (see the model structure in Figure 4.1). The
optimal Bayesian weighting inherent in the posterior calculation balances the evidence for
the marginal parameters and pairwise associations, and determines that the latter dominates
and predicts RHINO to be a less likely cause for the last than the first pattern.

Figure 4.5: Individual diagnoses for the most frequent measurement patterns among the cases, separately predicted from the pLCM and the npLCM. In each subfigure, the multivariate binary pattern denotes the observed measurement for a case; the percentage beneath is the observed frequency in the cases; the clear bar on the left is the prediction from pLCM; the filled bar on the right is from npLCM.

4.6.2 Model checking
To compare the fit of the npLCM relative to the pLCM, we have compared posterior
predictive distributions (Gelman et al., 1996) of pairwise log odds ratios (LOR) to
the observed values separately in the controls and the cases. To assess the differences,
we calculate a standardized difference for each pair of measurements: the observed LOR
minus the mean of the LOR posterior predictive distribution, divided by the standard
deviation of that predictive distribution. Figure 4.6 shows pairs of pathogens that have significant
deviations of model-predicted LOR from the observed ones, under either the pLCM or the npLCM.
The size of the circles for the empirical estimates is proportional to the precision of the
observed log odds ratios, shown as the solid dots.
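The standardized difference can be computed as in the sketch below. The 0.5 continuity correction in the log odds ratio, the function names, and the toy inputs are assumptions introduced for illustration; the thesis does not specify how empty cells of the 2x2 table are handled.

```python
import math


def log_odds_ratio(x, y):
    """Log odds ratio of two binary vectors, with a 0.5 continuity correction
    (an assumption made here to keep the ratio finite when a cell is empty)."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return math.log((n11 + 0.5) * (n00 + 0.5) / ((n10 + 0.5) * (n01 + 0.5)))


def standardized_difference(observed_lor, predictive_lors):
    """(observed LOR - predictive mean) / predictive standard deviation."""
    n = len(predictive_lors)
    mean = sum(predictive_lors) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in predictive_lors) / (n - 1))
    return (observed_lor - mean) / sd


# Hypothetical observed measurements for one pair of pathogens.
x = [1, 1, 0, 0, 1, 0]
y = [1, 1, 0, 0, 0, 1]
observed = log_odds_ratio(x, y)

# Hypothetical LORs computed from posterior predictive replicate data sets.
predictive = [0.0, 0.2, -0.2, 0.1, -0.1]
z = standardized_difference(observed, predictive)
```

In practice each entry of `predictive` would come from applying `log_odds_ratio` to one data set simulated from the fitted model.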
In the controls, the pathogen pairs (1,7), (1,9), and (7,9) have log odds ratios
estimated with relatively high precision. They are missed under the conditional independence
assumption, but are well captured by the npLCM. In the cases, the pair of pathogen
measurements (7,8) has a positive log odds ratio with high precision, which is adequately
described by the npLCM. The association between the pair of measurements (9,10) is not well
described by either model. But we observe that the npLCM posterior predictive distribution
(rightmost boxplot in the bottom panel) has moved towards explaining some negative
association, compared to the neutral position of the boxplot under the pLCM. In the PERCH
study, we observed that the seasonal variations in the rates of detection for the 10th pathogen,
RSV, and the 9th pathogen, RHINO, were out of phase; regression adjustment, discussed
elsewhere, may account for such strong negative association.
In the cases, the npLCM has predicted frequencies similar to those obtained under the
conditional independence assumption. The underestimation of the measurement pattern
with HINF and RSV positive (third from left) is due to the strong negative association between
RHINO and RSV, which is not captured sufficiently by the npLCM and requires further work
on regression adjustment.
4.7 Discussion
In this paper, we estimated the population frequencies with which the putative pathogens
cause disease among the cases using a nested partially-latent class model (npLCM) that
allows for conditional dependence of measurements. Using multivariate binary
measurements from a case-control design, the model first approximates the probability distribution
for the control measurements by a mixture of product Bernoulli distributions with mixing
weights penalized towards a mixture with fewer components. The estimated control
dependence structure is then applied to the case model, with modifications for the latent disease
state in which true positive rates replace false positive rates.

Figure 4.6: Posterior predictive distributions for checking of pairwise log odds ratios (LORs) for the controls (top) and the cases (bottom). For each of 45 pairs denoted on the horizontal axis, the left boxplot displays the posterior predictive distribution for the pLCM (black) and the right boxplot that for the npLCM (blue). Only pairs that have significant deviations from the observed log odds ratios, either by pLCM or npLCM, are shown. The estimated LORs are denoted by red dots; the size of the circles is proportional to the precision of the estimated LORs. Pathogen numbers and names are given in Figure 4.4 on right.

We illustrate by simulation that ignoring conditional dependence in each disease class
can lead to bias in the estimates of the population and individual etiologic fractions.
By recognizing similar covariations among pathogen measurements, the npLCM can reduce
this bias.
In the analysis of 10 leading pathogens from the PERCH study, RSV is estimated to
be the most prevalent infectious cause of childhood pneumonia. That evidence is robust
to the conditional dependence assumption. In contrast, accounting for the conditional
dependence structure leads to an increased RHINO etiologic fraction estimate, so its role is
less robust to the model for measurement dependence. For other pathogens, we did not
observe substantial differences in the estimated etiologic fractions between the npLCM and
the pLCM, indicating that the deviation from conditional independence has only limited
influence in this data set.
When scientific knowledge on true positive rates (TPRs) exists, it can be incorporated
into the npLCM by specifying appropriate prior distributions on the subclass-specific TPRs,
$\theta^{(j)}_k$, $k = 1, \ldots, K$, for the $j$th disease class. This prior knowledge, however, is usually
available only on the marginal TPRs, $\theta^{M}_j = \sum_{k=1}^{K} \theta^{(j)}_k \eta^{(j)}_k$, $j = 1, \ldots, J$,
which are functionals of a large number of model parameters. In future work, we are interested
in placing marginally specified priors on $\theta^{M}_j$ and developing efficient sampling algorithms
that can improve the quality of inference about $\pi$, while leaving other aspects of the model
(e.g. conditional
dependence structure) maximally flexible. In this spirit, Kessler et al. (2014) considered
marginally specified priors that are approximately independent for a finite set of functionals
(e.g. margins of large probability contingency tables). Extensions to hierarchical marginal
priors on the vector of marginal TPRs $\theta^{M}_j$, $j = 1, \ldots, J$, can allow information to be
borrowed across pathogens when the marginal TPRs are considered similar.
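As a small illustration of why the marginal TPR is a functional of several parameters, the sketch below computes $\theta^{M}_j$ for one disease class from its subclass-specific TPRs and subclass weights. The function name `marginal_tpr` and the numerical values are hypothetical, and the check that the weights sum to one assumes that the $\eta^{(j)}_k$ are mixing weights over the $K$ subclasses.

```python
def marginal_tpr(theta_subclass, eta_weights):
    """Marginal TPR for one disease class: sum_k theta_k * eta_k.

    theta_subclass: subclass-specific TPRs (hypothetical values).
    eta_weights: subclass weights, assumed here to sum to one.
    """
    if len(theta_subclass) != len(eta_weights):
        raise ValueError("need one weight per subclass")
    if abs(sum(eta_weights) - 1.0) > 1e-8:
        raise ValueError("subclass weights are assumed to sum to one")
    return sum(t * e for t, e in zip(theta_subclass, eta_weights))


# Two subclasses with TPRs 0.9 and 0.5 and weights 0.75 and 0.25.
example_marginal = marginal_tpr([0.9, 0.5], [0.75, 0.25])
print(example_marginal)
```

Many different combinations of subclass TPRs and weights give the same marginal value, which is one way to see why a prior stated only on the margin leaves the remaining parameters flexible.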
The PERCH study motivates an extension of the npLCM to the regression setting, so
that observed covariates, including seasonality, can be included to study how the population
etiologic fraction and individual diagnoses vary across subgroups. Such extensions are
natural and underway.
In this paper, we assumed a single primary cause for each pneumonia case in the
npLCM. This framework also extends to multiple pathogen causes in the lung by using
a latent vector for case $i$, $I^L_i \in \{0, 1\}^J$, where a 1 in position $j$ indicates that pathogen $j$
is one of possibly multiple causes. For estimation, Hoff (2005) uses Dirichlet process mixture
models to identify multiple abnormal genomic locations that are jointly responsible for each
case’s disease, but uses only case data under a conditional independence assumption.
Alternatively, one can place an exponential penalty on the number of causes (e.g., Zhang and
Liu, 2007), or use conditionally specified models $\Pr[I^L_{ij} = 1 \mid I^L_{ij'}, j' \neq j, X_{ij}]$ to characterize
interactions between pathogens (Besag, 1974), where $X_{ij}$ is a vector of covariates predictive
of pathogen $j$ being a cause in case $i$. The computational cost of fitting these models increases
substantially because the search space for the latent vector $I^L_i$ expands exponentially in $J$.
Development of efficient and reliable posterior sampling algorithms can allow investigators
to assess the evidence of multiple-pathogen etiologies as more measurements accrue.
Other pathogen measurements, for example, from blood culture, have also been col-
lected in the PERCH study and can be integrated. This paper used only “bronze-standard”
(BrS) data from the NP for which case and control samples are available. A BrS measure is
assumed to have imperfect sensitivity and imperfect specificity. Blood cultures for bacteria
are an example of “silver-standard” (SS) measures assumed to have perfect specificity and
imperfect sensitivity. The integration of BrS and SS data to estimate π using the pLCM
is described in detail in Wu et al. (2014a). The same approach can be carried over to this
application of npLCM.
The npLCM was originally envisioned for a study with not only BrS and SS data but also
“gold-standard” (GS) data, defined by perfect sensitivity and specificity. As Albert and Dodd
(2008) have shown, GS data enable internal validation of a latent class model, including the
npLCM. Absent GS data, external evidence is required to validate model predictions. For
example, the model prediction about the effect of a pneumococcal conjugate vaccine program
could be compared to the observed results. A key final point is that inferences about the lung
infection from peripheral measurements necessarily depend on the key model assumption
that there is a direct link between the observations and the state of the lung.
Chapter 5
Estimation of Treatment Effects in
Matched-Pair Cluster Randomized
Trials by Calibrating Covariate
Imbalance Between Clusters with
Application to Guided Care Study
CHAPTER 5. COVARIATE CALIBRATION FOR TREATMENT EFFECTS IN MATCHED-PAIR CLUSTER RANDOMIZED TRIALS
Abstract
We address estimation of intervention effects in experimental designs in which (a) in-
terventions are assigned at the cluster level; (b) clusters are selected to form pairs, matched
on observed characteristics; and (c) intervention is assigned to one cluster at random within
each pair. One goal of policy interest is to estimate the average outcome if all clusters in
all pairs are assigned control versus if all clusters in all pairs are assigned to intervention.
In such designs, inference that ignores individual level covariates can be imprecise because
cluster-level assignment can leave substantial imbalance in the covariate distribution be-
tween experimental arms within each pair. However, most existing methods that adjust for
covariates have estimands that are not of policy interest. We propose a methodology that
explicitly balances the observed covariates among clusters in a pair to obtain more efficient
estimators, and retains the original estimand of interest. We demonstrate our approach
through the evaluation of the Guided Care program.
5.1 Introduction
Some useful experimental designs have the following three features: interventions are
assigned at the cluster level; clusters are selected to form pairs, matched on observed co-
variates; and interventions are assigned to one cluster at random within each pair. One goal
of policy interest is to estimate the average outcome if all clusters in all pairs are assigned
control versus if all clusters in all pairs are assigned to intervention. The effect of such a
policy is easy to understand, because its definition does not depend on models, even though
its estimation can be assisted by models. Such designs are useful when individual-level
randomization is not feasible due to practical constraints, and when cluster assignment also
reflects how the assignment would scale in practice.
The Guided Care program is a recent example of such a study (Boult et al., 2013).
The study’s goal was to assess the effect of Guided Care versus a control condition on
functional health and other patient outcomes among clinical practices serving chronically
ill older adults. In Guided Care, a trained nurse works closely with patients and their
physicians to provide coordinated care. The control group does not have access to such a
nurse. To assess the effect of the Guided Care intervention, the study recruited 14 clinical
practices and matched them in 7 pairs using clinical practice and patient characteristics,
and within each pair randomly assigned one clinical practice to Guided Care and the other
to control.
A problem with cluster-level assignment is that it can leave substantial imbalances in
the covariates within pairs. However, existing methods to estimate effects in such designs
rarely use covariates in order to adjust for these imbalances. As a consequence, such
methods, including nonparametric as well as hierarchical (meta-analysis) approaches, although
useful in other ways (Imai et al., 2009), can leave large uncertainty in the results. Methods
that do use covariates usually estimate effects conditionally on covariates and cluster-specific
random effects (Thompson et al., 1997; Feng et al., 2001; Hill and Scott, 2009).
With such methods, the estimands are no longer of policy interest and lack meaning when
the model is misspecified.
We propose an approach that explicitly balances the observed covariates between clusters
in a pair and still estimates causal effects of policy interest. In Section 5.2, we formulate
the matched-pair cluster randomized design through potential outcomes. Then, in Section 5.3,
we characterize the existing approaches to causal effect estimation and their complications.
In Section 5.4, we propose a covariate-calibration approach and develop inferences
with and without the need for assumptions for a hierarchical second level. Throughout
these sections, the arguments are demonstrated through the evaluation of the recent Guided
Care program. Section 5.5 concludes with discussion.
5.2 The goal and design using potential outcomes
Consider a design that operates in pairs p = 1, . . . , n of clusters. In each pair p, the
design recruits two clusters (e.g., clinical practices) indexed by i = 1, 2, matched on quali-
tative and quantitative characteristics, such as percentage of patients with private insurance,
and where each clinical practice serves a community, say with a large number $N_{p,i}$ of
patients. The design then assigns to each clinic one of two treatments, namely control ($t = 1$)
or intervention ($t = 2$). If clinical practice $i$ of pair $p$ is assigned treatment $t$, then
potential outcomes $Y_{p,i,k}(t)$ (Rubin, 1974, 1978) are to be measured on a random sample of
$k = 1, \ldots, n_{p,i}$ patients from the $N_{p,i}$ patients served in that clinical practice. We label
$F_{p,i}(y; t)$, $\mu_{p,i}(t)$, and $\sigma^2_{p,i}(t)$ the distribution (at value $y$), mean and variance of the
potential outcome $Y_{p,i,k}(t)$ within clinical practice $i$ of pair $p$. The average outcomes in pair $p$
are
\[ \mu_p(t) := \mu_{p,i=1}(t)\,\pi_{p,i=1} + \mu_{p,i=2}(t)\,\pi_{p,i=2}, \tag{5.2.1} \]
where “$:=$” means “define”, $\pi_{p,i=1}$ is the fraction of patients served by clinic $i = 1$, i.e.
$N_{p,i=1}/(N_{p,i=1} + N_{p,i=2})$, and similarly for $\pi_{p,i=2}$. One goal of policy interest is to estimate
the average outcome if all clinical practices in all pairs are assigned control versus if all
clinical practices in all pairs are assigned intervention. In terms of the model, the goal is to
estimate a contrast between
\[ \mu(1) := E\{\mu_p(t = 1)\} \quad \text{and} \quad \mu(2) := E\{\mu_p(t = 2)\}, \qquad \text{for example } \delta^{\text{effect}} := \mu(1) - \mu(2), \tag{5.2.2} \]
which is the average outcome if all clusters had been assigned treatment 1 versus if all
clusters had been assigned treatment 2. Here, the expectations are taken over a larger
population $P$ of pairs from which $p = 1, \ldots, n$ can be considered a random sample. Alternative
estimands (e.g. conditionally on the sample of pairs, Imai et al. (2009)) can be considered,
although this does not change the main issues discussed here.
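To make the estimand in (5.2.1) and (5.2.2) concrete, the following sketch evaluates it for a small hypothetical set of pairs in which both potential-outcome means are treated as known for every clinical practice. This is never the case in real data, where only one treatment is observed per practice; the numbers, the dictionary keys and the function names are illustrative assumptions, and the average over the listed pairs stands in for the expectation over the population of pairs.

```python
def pair_average(mu_1, mu_2, n_served_1, n_served_2):
    """mu_p(t) of (5.2.1): practice means weighted by the fraction of patients served."""
    pi_1 = n_served_1 / (n_served_1 + n_served_2)
    return mu_1 * pi_1 + mu_2 * (1.0 - pi_1)


def effect(pairs):
    """delta_effect of (5.2.2), with the expectation replaced by an average over pairs."""
    mu_control = [pair_average(*p["mu_control"], *p["served"]) for p in pairs]
    mu_interv = [pair_average(*p["mu_intervention"], *p["served"]) for p in pairs]
    return sum(mu_control) / len(pairs) - sum(mu_interv) / len(pairs)


# Hypothetical pairs: (practice 1, practice 2) values for each quantity.
hypothetical_pairs = [
    {"served": (1000, 3000), "mu_control": (36.0, 40.0), "mu_intervention": (35.0, 38.0)},
    {"served": (2000, 2000), "mu_control": (30.0, 34.0), "mu_intervention": (31.0, 33.0)},
]
print(effect(hypothetical_pairs))
```

The definition involves only averages of potential outcomes and patient counts, which is the sense in which the estimand does not depend on a model.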
Within each pair, the design assigns at random the intervention to one clinical practice
and the control to the other, independently across pairs. Because in this design the original
ordering i is arbitrary, and in order to ease comparisons with the existing meta-analytic
approach (e.g. Thompson et al. (1997)), for each pair p we relabel by c = 1 the clinical
practice that is assigned control, and by c = 2 the clinical practice that is assigned
intervention. The quantities $Y_{p,c,k}(t)$, $F_{p,c}(y; t)$, $\mu_{p,c}(t)$ and $\sigma^2_{p,c}(t)$ are then redefined based on this
relabeling and the above definitions. Then, the paired cluster randomized design implies
the following:
CONDITION 1. The potential outcomes under treatments 1 and 2 in clinical practice c, and
the number of patients served by clinical practice c are exchangeable (in distribution over
pairs) between clinical practices c = 1 and c = 2, i.e.,
where the arrows connect equal entries in arguments, and distribution pr is over pairs p in
the larger population P of pairs.
Condition 1 implies, for example, that over the population of pairs, the joint distribution of the
means and variances of potential outcomes under exposure to intervention (t = 2) is the
same for the clinical practices that are actually assigned the intervention (c = 2) as it is
for the clinical practices that are actually assigned the control (c = 1). Figure 5.1 illus-
trates the structure of pairs, clinical practices, and assigned treatments in this paired cluster
randomized design, along with means and variances of potential outcome distributions.
Here we connect the observed data and existing methods to the above framework of
potential outcomes, because this helps understand the meaning of the assumptions, explicit
or implicit, required by the existing methods.
Figure 5.1: The underlying structure of the paired-cluster randomized design. The top part (observed pair p) and bottom part (observed pair p′) are the two possible ways in which a single pair can be manifested in the design. Observed pair p has two clinical practices (represented by the two squares). For each clinical practice, the first row shows the mean and variance of patient outcomes if the clinical practice is assigned control and the second row shows the mean and variance if assigned intervention. The clinical practice actually assigned control is indicated by its placement in column “1”, and the clinical practice actually assigned intervention is in column “2”. The solid (nonsolid) ellipsoids show the means and variances that can (cannot) be estimated directly. Observed pair p′ shows how the same pair would be manifested in the design if the assignment of treatment to clinical practices were in reverse (a line with arrows connects the same clinical practice in these two different assignments). Condition 1 means that each of the two manifestations, p and p′, has the same probability.
In order to estimate an effect such as $\delta^{\text{effect}}$ of (5.2.2), consider first a particular pair
$p$: we can directly estimate the average potential outcome under control for the clinical
practice assigned to the control, namely $\mu_{p,c=1}(t = 1)$; and the average potential
outcome under intervention for the clinical practice assigned to the intervention, namely
$\mu_{p,c=2}(t = 2)$. Specifically, for the control clinical practice ($c = 1$) of pair $p$, let
\[ \hat\mu_{p,c=1}(t = 1) := \frac{1}{n_{p,c=1}} \sum_{k=1}^{n_{p,c=1}} Y_{p,c=1,k}(t = 1) \]
denote the average of the observed outcomes, i.e., the potential outcomes under $t = 1$; and
for the intervention clinical practice ($c = 2$) of pair $p$, let
\[ \hat\mu_{p,c=2}(t = 2) := \frac{1}{n_{p,c=2}} \sum_{k=1}^{n_{p,c=2}} Y_{p,c=2,k}(t = 2) \]
denote the average of the observed outcomes, i.e., the potential outcomes under $t = 2$. Then,
letting $\hat\delta^{\text{crude}}_p = \hat\mu_{p,1}(1) - \hat\mu_{p,2}(2)$, and conditionally on pairs $p$ whose clinical practices
have particular values of $(\delta^{\text{crude}}_p, v^{\text{crude}}_p)$, we have that
\[ \mathrm{pr}(\hat\delta^{\text{crude}}_p \mid \delta^{\text{crude}}_p, v^{\text{crude}}_p) \approx \mathrm{Normal}(\delta^{\text{crude}}_p, v^{\text{crude}}_p), \tag{5.2.3} \]
where
\[ \delta^{\text{crude}}_p := \mu_{p,1}(1) - \mu_{p,2}(2) \quad \text{and} \quad v^{\text{crude}}_p = \frac{\sigma^2_{p,1}(1)}{n_{p,1}} + \frac{\sigma^2_{p,2}(2)}{n_{p,2}}. \]
Here, “$\approx$” means “approximately”, and the notation $\mathrm{pr}(A_p \mid B_p)$ and $E(A_p \mid B_p)$ means the
distribution and expectation, respectively, of characteristic $A_p$ among pairs in the larger
population $P$ that have characteristic $B_p$ (if $B_p$ is empty, the distribution and expectation
are over all pairs).
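The pair-level quantities entering (5.2.3) can be computed from patient-level outcomes along the following lines. This is a minimal sketch: the function name `crude_contrast` and the toy outcome values are hypothetical, and plugging the sample variances in for the unknown within-practice variances is an assumption made here for illustration.

```python
from statistics import mean, variance


def crude_contrast(y_control, y_intervention):
    """Crude pair-level contrast and its estimated variance.

    y_control: observed outcomes in the practice assigned control (c = 1).
    y_intervention: observed outcomes in the practice assigned intervention (c = 2).
    Returns (delta_hat, v_hat), where delta_hat is control minus intervention
    and v_hat = s1^2 / n1 + s2^2 / n2 uses sample variances as plug-ins.
    """
    delta_hat = mean(y_control) - mean(y_intervention)
    v_hat = (variance(y_control) / len(y_control)
             + variance(y_intervention) / len(y_intervention))
    return delta_hat, v_hat


# Toy outcomes for a single hypothetical pair.
delta_hat, v_hat = crude_contrast([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(delta_hat, v_hat)
```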
Remark 1. In a pair, the directly estimable (crude) contrast $\delta^{\text{crude}}_p$ is not a causal effect
because it compares different clinical practices under different treatments (Thompson et al.,
1997). However, the average, $E(\delta^{\text{crude}}_p)$, over pairs is a causal effect, because the
exchangeability of potential outcomes and of the numbers of patients served between clinical
practices 1 and 2 (Condition 1 above) implies (proof omitted) that
\[ E(\delta^{\text{crude}}_p) = E\{\mu_p(t = 1)\} - E\{\mu_p(t = 2)\}, \quad \text{which is } \delta^{\text{effect}}. \tag{5.2.4} \]
Thus, one can use the estimated differences, $\hat\delta^{\text{crude}}_p$, within each pair as in (5.2.3), and
expression (5.2.4), to estimate $\delta^{\text{effect}}$, either with no additional assumptions (i.e., by simply
averaging $\hat\delta^{\text{crude}}_p$ over pairs), or under a hierarchical second-level model.
Remark 2. The objective meaning that the potential outcomes assign to the terms in the
model (5.2.3) implies the following subtle fact: if the pair-specific $\delta^{\text{crude}}_p$ are to be
eliminated (i.e., marginalized over) from the conditional likelihood (5.2.3), then $\delta^{\text{crude}}_p$ should
first be integrated out of (5.2.3) based on the conditional distribution $\mathrm{pr}(\delta^{\text{crude}}_p \mid v^{\text{crude}}_p)$,
i.e.,
\[ \mathrm{pr}(\hat\delta^{\text{crude}}_p \mid v^{\text{crude}}_p) = \int \mathrm{pr}(\hat\delta^{\text{crude}}_p \mid \delta^{\text{crude}}_p, v^{\text{crude}}_p) \cdot \mathrm{pr}(\delta^{\text{crude}}_p \mid v^{\text{crude}}_p) \, d\delta^{\text{crude}}_p. \tag{5.2.5} \]
This becomes relevant when examining the existing hierarchical modeling methods.
Next, we discuss complications of existing methods for estimating the effect of
intervention $\delta^{\text{effect}}$. We demonstrate the arguments by assessing the effect of the Guided Care
intervention on the functional health outcome of the patients as measured by the physical
component summary of the Short Form (SF)-36 version 2 (Ware and Kosinski, 2001).
5.3 Complications with existing methods
5.3.1 Consequences when ignoring covariates.
Table 5.1 displays the observed average SF-36 scores for each of the seven pairs of prac-
tices in the Guided Care study (see outcome rows denoted as uncalibrated). Also displayed
are the within pair differences in average SF-36 outcomes between control and intervention.
Using these, Table 5.3 reports the estimate of the overall effect $\delta^{\text{effect}}$, first based only
on the design-derived fact (5.2.4) that the average of $\delta^{\text{crude}}_p$ equals the effect of
interest $\delta^{\text{effect}}$ (see 1st level, “uncalibrated on covariates”). Because this first-level approach
makes no further assumptions about the joint distribution $\mathrm{pr}(\delta^{\text{crude}}_p, v^{\text{crude}}_p)$, the MLE
of $\delta^{\text{effect}}$ is simply the unweighted sample average of $\hat\delta^{\text{crude}}_p$, with its standard error
estimated by the jackknife. Table 5.3 also reports the permutation test of no true effect for any
person, obtained by randomly permuting the labels of treatment within each pair.
Table 5.1: Summary of average SF36 outcomes for uncalibrated versus calibrated approaches. The first row block displays sample sizes; the second row block displays average outcomes that are uncalibrated and calibrated, respectively.
pair p                                      1      2      3      4      5      6      7
sample size
  $n_{p,c=1}$                              17     16     42     23     52     23     28
  $n_{p,c=2}$                              38     44     43     33     42     31     43
outcome, uncalibrated on covariates
  $\hat\mu_{p,1}(1)$                     36.4   36.5   39.6   39.1   39.7   33.8   39.6
  $\hat\mu_{p,2}(2)$                     37.3   36.6   39.3   35.3   35.2   36.4   40.9
  $\hat\delta^{\text{crude}}_p$          -0.8   -0.1    0.3    3.8    4.5   -2.6   -1.3
  $(v^{\text{crude}}_p)^{1/2}$            2.7    2.6    2.0    2.7    2.1    2.6    2.2
outcome, calibrated on covariates
  * $\hat\mu^{\text{calibr}}_{p,1}$      37.6   38.8   39.5   38.0   38.7   35.5   40.9
  * $\hat\mu^{\text{calibr}}_{p,2}$      36.7   35.8   39.4   36.0   36.4   35.1   40.0
  $\hat\delta^{\text{calibr}}_p$          0.9    3.0    0.1    1.9    2.3    0.5    0.8
  † $(v^{\text{calibr}}_p)^{1/2}$         2.1    2.4    1.5    2.0    1.7    2.2    1.7
*: calibration based on $n_{p,1}$ and $n_{p,2}$ observations in pair $p$
†: $v^{\text{calibr}}_p$ is the $p$th diagonal element of $\Sigma_{\delta^{\text{calibr}}}$ in expression (5.4.8)
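The first-level analysis just described can be sketched with the rounded crude differences from Table 5.1. The function names are hypothetical, and two simplifications are assumed: the permutation test is carried out by exact enumeration of all within-pair label swaps, each of which flips the sign of that pair's crude difference, and the inputs are the rounded table entries rather than the underlying patient-level data.

```python
from itertools import product
from math import sqrt

# Rounded crude pair differences (control minus intervention) from Table 5.1.
delta_crude = [-0.8, -0.1, 0.3, 3.8, 4.5, -2.6, -1.3]


def first_level_estimate(deltas):
    """Unweighted average of the pair-level differences."""
    return sum(deltas) / len(deltas)


def jackknife_se(deltas):
    """Leave-one-pair-out jackknife standard error of the unweighted average."""
    n = len(deltas)
    total = sum(deltas)
    leave_one_out = [(total - d) / (n - 1) for d in deltas]
    center = sum(leave_one_out) / n
    return sqrt((n - 1) / n * sum((m - center) ** 2 for m in leave_one_out))


def sign_flip_p_value(deltas):
    """Two-sided p-value over all 2^n within-pair swaps of treatment labels."""
    n = len(deltas)
    observed = abs(first_level_estimate(deltas))
    as_extreme = 0
    for signs in product((1, -1), repeat=n):
        statistic = abs(sum(s * d for s, d in zip(signs, deltas)) / n)
        if statistic >= observed - 1e-12:  # tolerance for floating-point ties
            as_extreme += 1
    return as_extreme / 2 ** n


print(first_level_estimate(delta_crude), jackknife_se(delta_crude))
print(sign_flip_p_value(delta_crude))
```

With these rounded inputs, the estimate and its jackknife standard error round to 0.5 and 1.0. Because the inputs are rounded, the enumerated p-value need not reproduce a published value exactly.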
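For a hierarchical second-level (meta-analytic) inference, the current approach for
paired-clustered designs (e.g., Thompson et al., 1997; Feng et al., 2001; Hill and Scott, 2009) is
based on integrating the likelihood in (5.2.3) over the marginal distribution $\mathrm{pr}(\delta^{\text{crude}}_p)$, to
obtain:
\[ \mathrm{pr}^{*}(\hat\delta^{\text{crude}}_p \mid v^{\text{crude}}_p, \delta^{\text{effect}}) = \int \mathrm{pr}(\hat\delta^{\text{crude}}_p \mid \delta^{\text{crude}}_p, v^{\text{crude}}_p) \cdot \mathrm{pr}(\delta^{\text{crude}}_p) \, d\delta^{\text{crude}}_p; \tag{5.3.1} \]
where $\mathrm{pr}(\delta^{\text{crude}}_p) = \mathrm{Normal}(\delta^{\text{effect}}, v)$.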
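When both levels are normal, the integral in (5.3.1) is a convolution of normal densities, so each pair contributes a $\mathrm{Normal}(\delta^{\text{effect}}, v^{\text{crude}}_p + v)$ term to the marginal likelihood. The sketch below maximizes that likelihood by a simple grid search over the between-pair variance. The function names and the grid are hypothetical choices, the first-level variances are treated as known, and this is not presented as the fitting procedure of the cited papers.

```python
from math import log, pi


def weighted_mean(delta_hat, v_crude, v_between):
    """Maximizer over delta_effect for a fixed between-pair variance."""
    weights = [1.0 / (vp + v_between) for vp in v_crude]
    return sum(w * d for w, d in zip(weights, delta_hat)) / sum(weights)


def log_marginal_likelihood(delta_effect, v_between, delta_hat, v_crude):
    """Sum over pairs of the log Normal(delta_effect, v_p + v_between) density."""
    total = 0.0
    for d, vp in zip(delta_hat, v_crude):
        var = vp + v_between
        total += -0.5 * log(2.0 * pi * var) - 0.5 * (d - delta_effect) ** 2 / var
    return total


def fit_hierarchical(delta_hat, v_crude, v_grid):
    """Grid search over v_between; returns (delta_effect, v_between, log-likelihood)."""
    best = None
    for v in v_grid:
        mu = weighted_mean(delta_hat, v_crude, v)
        ll = log_marginal_likelihood(mu, v, delta_hat, v_crude)
        if best is None or ll > best[2]:
            best = (mu, v, ll)
    return best


# Rounded inputs from Table 5.1 (differences and squared standard errors).
deltas = [-0.8, -0.1, 0.3, 3.8, 4.5, -2.6, -1.3]
variances = [s ** 2 for s in [2.7, 2.6, 2.0, 2.7, 2.1, 2.6, 2.2]]
grid = [0.1 * i for i in range(101)]  # hypothetical grid from 0 to 10
print(fit_hierarchical(deltas, variances, grid))
```

For a fixed between-pair variance the estimate is a precision-weighted mean, so pairs with smaller first-level variance receive more weight.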
Table 5.3 (see 1st+2nd level, “uncalibrated on covariates”) shows inference for the
effect $\delta^{\text{effect}}$ using the above likelihood (5.3.1), namely, the method of Thompson et al.
(1997) with and without profiling out the variance $v$ (see rows 3 and 4); and also inference
based on the mean of the posterior distribution of $\delta^{\text{effect}}$ using the uniform shrinkage prior
on $v$ as suggested by Daniels (1999) (see row 5). For comparison, we also obtained the
two-sided tail probability from the distribution of the MLE from (5.3.1) as obtained from
all possible permutations of the intervention and control labels of clinical practices,
independently across pairs. None of these results suggests any substantial effect of the
intervention.
In general, the hierarchical and non-hierarchical methods without covariates can be
inaccurate for at least one of the following two reasons. First, any substantial covariate
imbalances between clinical practices within a pair can result in substantial uncertainty,
which is reflected in the variance of the estimators of the effect, and which may have
influenced the point estimate. For the Guided Care study, Table 5.2 shows that a number
of covariates show substantial imbalance between intervention and control groups. For
example, the continuous covariate Chronic Illness Burden has severe imbalances
between the clinical practices in pairs 2, 5 and 7, with t-statistics being −3.07, −4.81 and
2.52, respectively.
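The two imbalance measures described in the caption of Table 5.2 can be sketched as follows. The function names and toy inputs are hypothetical. The pooled standard deviation uses the usual degrees-of-freedom weighting, and the direction of each comparison (first cluster versus second) is an assumption, since the text does not spell out either detail.

```python
from math import sqrt
from statistics import mean, variance


def standardized_difference(x1, x2):
    """Difference in means divided by the pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    return (mean(x1) - mean(x2)) / sqrt(pooled_var)


def corrected_odds_ratio(events1, n1, events2, n2):
    """Sample odds ratio with 0.5 added to every cell of the 2x2 table."""
    a = events1 + 0.5        # category present, cluster 1
    b = n1 - events1 + 0.5   # category absent, cluster 1
    c = events2 + 0.5        # category present, cluster 2
    d = n2 - events2 + 0.5   # category absent, cluster 2
    return (a * d) / (b * c)


# Toy data: a continuous covariate and a category that never occurs in cluster 1.
print(standardized_difference([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
print(corrected_odds_ratio(0, 10, 5, 10))
```

The 0.5 correction keeps the odds ratio finite and non-zero even when a category is absent from one cluster.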
The hierarchical model approach, in addition to its normality assumption, can be
questioned for the following subtle reason. In order to integrate out $\delta^{\text{crude}}_p$ from the likelihood
(5.2.3) to obtain a likelihood that, like (5.3.1), still depends on the variances $v^{\text{crude}}_p$, one
must integrate $\delta^{\text{crude}}_p$ with respect to the conditional distribution of the estimand $\delta^{\text{crude}}_p$
given the variance $v^{\text{crude}}_p$, as in (5.2.5) of Remark 2, and not with respect to the marginal
distribution $\mathrm{pr}(\delta^{\text{crude}}_p)$ as in (5.3.1). The comparison of (5.3.1) to (5.2.5) shows that (5.3.1)
implicitly assumes the following:
CONDITION 2. The estimand $\delta^{\text{crude}}_p$ and the variance $v^{\text{crude}}_p$ of $\hat\delta^{\text{crude}}_p$ at the first level
are independent across pairs $p$.
The motivation for using the likelihood (5.3.1) can be traced to Thompson et al. (1997,
Section 5, Paragraph 2). There, inference for the paired-clustered design is assumed to
have the same random effects structure as that of DerSimonian and Laird (1986), who
also assume Condition 2 but for a design that first randomly samples subjects from the
population that a pair serves and then completely randomizes them, regardless of their
clinical practice. Call this simpler design a “paired-strata” design. We show below that
violation of Condition 2 has more severe implications for the paired-clustered than for the
paired-strata design.
Table 5.2: Checking covariate imbalances within each pair. For a continuous covariate (indicated by (a)), we calculate effect size as difference divided by pooled standard deviation. For a categorical covariate (indicated by (b)), odds ratio is calculated comparing rates of occurrence of each category between two clusters in a pair. To prevent infinite odds ratio, 0.5 is added to all the cells when calculating sample odds ratios.
pair                                      1      2      3      4      5      6      7
age at interview (a)                    0.3   -0.3    0.1    0.6    0.0    0.1   -0.1
Chronic Illness Burden (a)              0.5   -0.6    0.0    0.0   -1.1    0.1    0.6
SF36 Mental (a)                        -0.3    0.1    0.3    0.2    0.3   -0.6   -0.5
SF36 Physical (a)                      -0.1   -0.4    0.1    0.5    0.4   -0.6   -0.3
lives alone (b)                         1.4    0.8    0.7    0.7    1.6    0.9    0.5
>high school education (b)              0.4    0.5    0.7    1.4    0.8    0.8    1.1
Female (b)                              2.4    0.6    1.0    0.6    1.0    2.5    1.1
race (b)
  Caucasian                             0.5    0.2    0.9    0.8    1.5    0.5    0.7
  African American                      2.2    0.9    1.2    1.2    0.8    1.6    1.2
  other                                 2.2   15.0    1.0    1.4    0.6    1.3    1.5
finances at end of month (b)
  some money left over                  0.0    0.7    1.4    0.7    1.5    0.7    0.6
  just enough to make ends meet         8.9    1.0    0.3    1.3    0.6    1.2    1.4
  not enough to make ends meet         18.2    8.4    7.0    1.0    1.2    2.0    1.6
self rated health (b)
  ≥very good                            0.3    0.3    0.8    2.2    0.3    0.8    0.6
  good                                  2.6    3.4    1.4    0.4    2.5    0.8    1.4
  fair                                  0.9    0.9    0.4    0.3    2.5    4.2    0.5
  poor                                  6.8    1.5    3.1    4.4    2.0    4.2    2.1
Table 5.3: Results from MLE, profile MLE, Bayes estimates and permutation test in the Guided Care program study. The covariates used for calibration are listed in the first column of Table 5.2; the outcome is the physical component summary of the Short Form 36 (SF36).
                     $\hat\delta^{\text{effect}}$   95% C.I.       s.e.($\hat\delta^{\text{effect}}$)   var($\delta^{*}_p$)   p-value (two-sided)
uncalibrated on covariates
  1st level
    MLE              0.5                            (−1.4, 2.5)    1.0                                  −                     0.59
    permutation      −                              −              −                                    −                     0.61
  1st+2nd level
    MLE              0.6                            (−1.2, 2.5)    0.9                                  0.7                   0.50
    pMLE             0.6                            (−1.5, 2.7)    −                                    0.7                   −
    Bayes            0.6                            (−1.7, 3.0)    1.2                                  4.3                   0.60
    permutation      −                              −              −                                    −                     0.60
calibrated on covariates
  1st level
    MLE              1.4                            (0.5, 2.2)     0.4#                                 −                     <0.01
    permutation      −                              −              −                                    −                     0.02
  1st+2nd level
    MLE              1.2                            (−0.2, 2.6)    0.7                                  0.0                   0.08
    pMLE             1.2                            (−0.2, 2.6)    −                                    0.0                   −
    Bayes            1.3                            (−0.4, 2.9)    0.9                                  1.5                   0.13
    permutation      −                              −              −                                    −                     0.02
*: represents $\delta^{\text{crude}}_p$ for the uncalibrated approach and $\delta^{\text{calibr}}_p$ for the calibrated approach.
#: estimated by the jackknife.
In the paired-strata design, the observed difference, say $\hat\delta'_p$, in average outcomes
between intervention and control individuals within a pair has mean, say $\delta'_p$, equal to the
causal effect $\mu_p(2) - \mu_p(1)$ of (5.2.2). This means that, if the intervention has no effect in
any pair, i.e., the null hypothesis, $\mu_p(1) = \mu_p(2)$ for all $p$, is correct, then $\delta'_p$ is a constant
(0) and so Condition 2 is satisfied. As a result, an approach based on (5.3.1) is valid for
testing $\mu_p(1) = \mu_p(2)$ for all $p$ because Condition 2 is correct under the null hypothesis
being tested in that design.
In the paired-clustered design, however, the mean, $\delta^{\text{crude}}_p$, of $\hat\delta^{\text{crude}}_p$ is not a causal
effect (see Remark 1 above) even if the intervention has no effect in any cluster, i.e., even
if the null hypothesis, $\mu_{p,c}(1) = \mu_{p,c}(2)$ for all $p$ and $c$, is correct. In particular, under this
null, the mean $\delta^{\text{crude}}_p$ is $\mu_{p,1}(1) - \mu_{p,2}(1)$, i.e., the difference between clinical practices
1 and 2 if they had both been assigned control. In practice, even after matching, the two
clinical practices are expected to have imbalances in characteristics of the patients or the
doctors, so that $\delta^{\text{crude}}_p$ is not expected to be zero and, hence, Condition 2 can be violated. We
have the following result (proof in Appendix):
RESULT 1. If the intervention has no effect, $\mu(1) = \mu(2)$, but Condition 2 is violated, then
the MLE of the causal effect $\delta^{\text{effect}}$ based on (5.3.1) can converge to a non-zero value as
the number of sampled practices increases.
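A toy calculation can illustrate the mechanism behind Result 1; it is not the proof referred to above. For a fixed between-pair variance, the estimate based on (5.3.1) is a precision-weighted mean of the pair differences, so its large-sample limit is a ratio of weighted expectations. The sketch evaluates that limit in a hypothetical population with a few pair types in which the crude estimands average to zero. The function name, the pair types and the fixed value of the between-pair variance are all assumptions made for illustration.

```python
def limiting_weighted_mean(pair_types, v_between):
    """Large-sample limit of the precision-weighted mean for fixed v_between.

    pair_types: list of (probability, delta_crude, v_crude) describing a
    hypothetical population of pairs.
    """
    numerator = sum(p * d / (v + v_between) for p, d, v in pair_types)
    denominator = sum(p / (v + v_between) for p, d, v in pair_types)
    return numerator / denominator


# Condition 2 violated: positive contrasts come with small variances,
# negative contrasts with large ones; the average contrast is still zero.
dependent = [(0.5, 1.0, 0.5), (0.5, -1.0, 4.0)]

# Condition 2 satisfied: the same contrasts and variances, but unrelated.
independent = [(0.25, 1.0, 0.5), (0.25, -1.0, 0.5),
               (0.25, 1.0, 4.0), (0.25, -1.0, 4.0)]

print(limiting_weighted_mean(dependent, v_between=1.0))
print(limiting_weighted_mean(independent, v_between=1.0))
```

In the dependent population the limit is pulled towards the contrasts that are measured more precisely, even though the population average contrast is zero.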
Therefore, it is important to try to assess the plausibility of Condition 2. For the Guided
Care study, Figure 5.2 (left) plots the estimated values of $\sqrt{v^{\text{crude}}_p}$ against those of $\delta^{\text{crude}}_p$.
Here there appear to be no noticeable warnings against independence. However, the covariate
imbalances shown in Table 5.2 could still be contributing to inaccurate estimates through large
variances, as discussed earlier.
5.3.2 Complications with existing covariate methods.
Some existing proposals do incorporate covariates into the model for $\mathrm{pr}(\delta^{\text{crude}}_p)$ on
the right-hand side of likelihood (5.3.1). However, these approaches stop short of addressing
the goal of estimating effects of policy interest. In particular, such existing approaches (e.g.,
Thompson et al. (1997), Sec. 5.5; Feng et al. (2001)) define the treatment effect to be a
contrast in the treatment coefficients of the posited model after conditioning on a particular
value of the covariates and/or of random effects specific to the clusters. The first problem
with such a treatment effect is that its meaning is not objective: if, for example, the model
is misspecified, then an effect set equal to a contrast of coefficients in the model does not
have a well-defined physical interpretation. The second problem is that, even if the model
is correct, a treatment effect that is conditional on the covariates and/or the clusters is not
usually equal to, but is only partially related to, the overall effect.
5.4 Addressing the Problems
5.4.1 Calibration of observed covariate differences between clinical practices
In order to use covariates to estimate the treatment effects in (5.2.2), we propose to first
construct calibrated pair-specific averages, for each treatment t = 1, 2, in the sense that the
distribution of the covariates reflected in the averages will be the same as the distribution of
covariates combined from both clinical practices of the pair. Inference for these calibrated
averages will then lead to inference for overall effects (5.2.2) with the gained precision of
accounting for the difference in observed covariates between the matched clinical practices.
This section uses notation for the following additional structure for pair $p$:
$X_{p,c,k}$, for the measurement of a covariate vector before treatment administration, for
the $k$th sampled patient of clinical practice $c$ in pair $p$;
$G_{p,c}(x)$, for the joint cumulative distribution function of the covariate vector $X_{p,c,k}$ in
clinical practice $c$, evaluated at value $x$; and $G_p(x)$ for the joint cumulative distribution
function (evaluated at $x$) of the covariate vector of a patient selected at random
from pair $p$ (i.e., from the two clinical practices of that pair, combined);
$F_{p,c}(y \mid x; t)$, for the cumulative distribution function of the potential outcome $Y_{p,c,k}(t)$
in clinical practice $c$, evaluated at value $y$ among covariate levels $x$, if clinical practice
$c$ is assigned treatment $t$; and $\mu_{p,c}(x; t)$, for the mean of the latter distribution.
For pair $p$, consider now the estimable quantity, labelled as $\mu^{\text{calibr}}_{p,c=1}$, that is
constructed by, first, stratifying the average outcome into the covariate levels of the clinical
practice $c = 1$ (assigned to treatment 1), namely $\mu_{p,c=1}(x; t = 1)$, and then re-calibrating
it with respect to the covariate distribution of the two clinical practices combined (and
similarly for $t = 2$):
\[ \mu^{\text{calibr}}_{p,c=1} := \int_x \mu_{p,c=1}(x; t = 1)\, dG_p(x), \qquad \mu^{\text{calibr}}_{p,c=2} := \int_x \mu_{p,c=2}(x; t = 2)\, dG_p(x). \tag{5.4.1} \]
To understand the above estimand, consider for example two clinical practices in a pair,
that, although matched as closely as possible with respect to, say, the percentage of patients
with a “low” or “high” risk covariate (x = low or high), the percentage of low risk in
clinical practices 1 and 2 is 75% and 85% respectively, i.e., still differs appreciably between
the clinical practices. Suppose also that clinical practice 2 serves twice as many patients as
clinical practice 1. Ignoring covariates, the quantity that can be directly estimated from the
data for representing the average outcome if both clinical practices are assigned treatment 1
is simply the average outcome within clinical practice 1, µp,c=1(1), which can be expressed
in terms of the covariate as 0.75 · µp,c=1(x = low; t = 1) + 0.25 · µp,c=1(x = high; t =
1). When using covariates, the calibrated average µcalibrp,c=1 is 0.82 · µp,c=1(x = low; t =
1) + 0.18 · µp,c=1(x = high; t = 1), because it generalizes the covariate-specific outcome
averages under treatment 1 to the covariate distribution for both clinical practices in which
0.75 · (1/3) + 0.85 · (2/3) ≈ 0.82 have low risk.
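The arithmetic in this example can be checked with a short sketch. The percentages and the 1:2 size ratio come from the text; the stratum-specific mean outcomes (50 and 40) are hypothetical values chosen only for illustration.

```python
# Worked check of the calibration example (shares and size ratio from the text).
low_share = {1: 0.75, 2: 0.85}      # fraction of low-risk patients per practice
size_weight = {1: 1 / 3, 2: 2 / 3}  # practice 2 serves twice as many patients

# Pooled covariate distribution G_p: size-weighted mixture of the two practices.
pooled_low = sum(size_weight[c] * low_share[c] for c in (1, 2))
pooled_high = 1.0 - pooled_low


def uncalibrated_mean(mu_low, mu_high, practice=1):
    """Average outcome weighted by the practice's own covariate distribution."""
    return low_share[practice] * mu_low + (1 - low_share[practice]) * mu_high


def calibrated_mean(mu_low, mu_high):
    """Average outcome re-weighted to the pooled covariate distribution."""
    return pooled_low * mu_low + pooled_high * mu_high


# Hypothetical stratum means under treatment 1 in practice 1.
mu_low, mu_high = 50.0, 40.0
print(round(pooled_low, 2))
print(uncalibrated_mean(mu_low, mu_high), calibrated_mean(mu_low, mu_high))
```

The two averages differ only through the weights placed on the low-risk and high-risk strata, which is the point of the calibration.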
More generally, one should expect that the calibrated contrasts $\mu^{\mathrm{calibr}}_{p,c=1} - \mu^{\mathrm{calibr}}_{p,c=2}$, though
still not equal to the target causal effect µp(t = 1)−µp(t = 2) of (5.2.1) in each pair, should,
(a) share the property with the uncalibrated estimands, i.e., that they average over pairs to
the average causal effect δeffect of (5.2.4); and (b) provide a basis for more efficient estima-
tors than the uncalibrated contrasts. This is true if the design is more carefully formalized
as follows:
CONDITION 3. The characteristics of a clinical practice, i.e., the distribution of potential
outcomes under treatments 1 and 2 conditionally on covariates, the distribution of covariates,
and the number of people served by clinical practice c, namely the vector of functions
$[F_{p,c}(\cdot \mid \cdot, t=1),\ F_{p,c}(\cdot \mid \cdot, t=2),\ G_{p,c}(\cdot),\ N_{p,c}]$, is exchangeable (in distribution over
pairs) between clinical practices c = 1 and c = 2.
Then we have the following:
RESULT 2. (a) Under Condition 3, the average over pairs of the covariate-calibrations,
µcalibrp,c=1 , i.e., based on the clinical practice assigned to treatment 1 in each pair (see (5.4.1))
equals the average of the potential outcomes if the entire population had been assigned
treatment 1 (similarly for treatment 2); hence the estimable contrast
\[
E\,\mu^{\mathrm{calibr}}_{p,c=1} \quad \text{vs.} \quad E\,\mu^{\mathrm{calibr}}_{p,c=2} \tag{5.4.2}
\]
equals the causal contrast (5.2.2); (b) if µp,c(x; t = c) are correctly specified, then the
MLEs of $E\,\mu^{\mathrm{calibr}}_{p,c=1}$ in (5.4.2) (and of the target estimands µ(t) in (5.2.2), due to (a)
and the invariance property of the MLE) are the averages, over the observed pairs, of the
empirical analogues of (5.4.1):
\[
\int \mu_{p,c}(x; t=c)\, d\hat{G}_p(x), \quad c = 1, 2, \tag{5.4.3}
\]
where $\hat{G}_p$ is the weighted empirical distribution of covariates in pair p (the weight is
determined by $N_{p,c}$).
Condition 3 implies Condition 1. The proof of Result 2 (a) follows by iterated expectations;
the proof of (b) follows because the empirical distribution $\hat{G}_p(x)$ as defined above
is, under no other assumptions, the MLE of $G_p(x)$.
In practice, and simplifying the notation for the estimable averages µp,c(x; t = c) to
µp,c(x), one can consider modelling µp,c(x) for each (pair p, cluster c), with µp,c(x; θ),
where
\[
h\{\mu_{p,c}(x, \theta)\} = \theta_{p,c} + \theta_{\mathrm{cov}}' \cdot x \tag{5.4.4}
\]
and h is a link function.
Since these models condition on the pairs and clusters, the parameter θ can be estimated by
the weighted least squares estimator $\hat{\theta}$, based on the first moment residuals $Y_{p,c,k} - \mu_{p,c}(X_{p,c,k}, \theta)$,
where approximately
\[
\hat{\theta} \mid \theta, \Sigma_{\theta} \sim \mathrm{Normal}(\theta, \Sigma_{\theta}), \tag{5.4.5}
\]
and where $\Sigma_{\theta}$ is the true variance-covariance matrix of $\hat{\theta}$, which can be estimated by the
robust variance-covariance matrix denoted by $\hat{\Sigma}_{\theta}$.
Based on these, the calibrated estimands in (5.4.1) can be estimated within each pair
and clinical practice, by
\[
\hat{\mu}^{\mathrm{calibr}}_{p,c} = \int \mu_{p,c}(x, \hat{\theta})\, d\hat{G}_p(x), \quad \text{for all } p, c, \tag{5.4.6}
\]
whose joint distribution can be approximated by the delta method as
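\[
\text{level 1}: \quad
\begin{pmatrix}
\hat{\mu}^{\mathrm{calibr}}_{p=1,c=1} & \hat{\mu}^{\mathrm{calibr}}_{p=1,c=2} \\
\vdots & \vdots \\
\hat{\mu}^{\mathrm{calibr}}_{p=N,c=1} & \hat{\mu}^{\mathrm{calibr}}_{p=N,c=2}
\end{pmatrix}
\,\Bigg|\; \theta, \Sigma_{\mu^{\mathrm{calibr}}} \sim \mathrm{Normal}\left(
\begin{pmatrix}
\mu^{\mathrm{calibr}}_{p=1,c=1} & \mu^{\mathrm{calibr}}_{p=1,c=2} \\
\vdots & \vdots \\
\mu^{\mathrm{calibr}}_{p=N,c=1} & \mu^{\mathrm{calibr}}_{p=N,c=2}
\end{pmatrix},
\Sigma_{\mu^{\mathrm{calibr}}} \right), \tag{5.4.7}
\]
and where $\Sigma_{\mu^{\mathrm{calibr}}}$ can be estimated by $\hat{\Sigma}_{\mu^{\mathrm{calibr}}}$.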
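As an illustration of the plug-in estimator (5.4.6), the following sketch computes calibrated means for a single pair. It assumes an identity link, a single scalar covariate, ordinary (unweighted) least squares fitted within the pair, and that the observed sample sizes reflect the relative sizes of the two practices; the function name and data layout are hypothetical and are not taken from the thesis.

```python
import numpy as np


def calibrated_means(x1, y1, x2, y2):
    """Plug-in calibrated means for one pair (identity link, scalar covariate).

    x1, y1: covariate and outcome for patients in clinical practice c = 1
    x2, y2: the same for clinical practice c = 2
    Returns (mu_calibr_c1, mu_calibr_c2).
    """
    x1, y1, x2, y2 = (np.asarray(a, dtype=float) for a in (x1, y1, x2, y2))
    n1, n2 = len(x1), len(x2)
    # Design matrix: one intercept per practice plus a shared covariate slope.
    design = np.zeros((n1 + n2, 3))
    design[:n1, 0] = 1.0
    design[n1:, 1] = 1.0
    design[:, 2] = np.concatenate([x1, x2])
    y = np.concatenate([y1, y2])
    theta, *_ = np.linalg.lstsq(design, y, rcond=None)
    # Pooling patients weights each practice by its observed size, giving the
    # empirical covariate distribution of the pair. For a linear mean model the
    # integral over that distribution only needs the pooled covariate mean.
    pooled_x_mean = design[:, 2].mean()
    return (theta[0] + theta[2] * pooled_x_mean,
            theta[1] + theta[2] * pooled_x_mean)
```

With a non-identity link the integral no longer reduces to the pooled covariate mean, and the fitted mean would have to be averaged over the pooled covariate values instead.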
5.4.2 Estimation of quantities of original interest
Expression (5.4.7) can be used for estimation of the causal contrast µ(1) vs. µ(2)
(because of Result 2(a)); here we focus on δeffect = µ(1) − µ(2). Specifically, setting
$\delta^{\mathrm{calibr}}_p = \mu^{\mathrm{calibr}}_{p,c=1} - \mu^{\mathrm{calibr}}_{p,c=2}$ and
$\hat{\delta}^{\mathrm{calibr}}_p = \hat{\mu}^{\mathrm{calibr}}_{p,c=1} - \hat{\mu}^{\mathrm{calibr}}_{p,c=2}$, we can consider the first or
both levels of the following two-level model
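\[
\text{level } 1': \quad
\begin{pmatrix} \hat{\delta}^{\mathrm{calibr}}_1 \\ \vdots \\ \hat{\delta}^{\mathrm{calibr}}_N \end{pmatrix}
\,\Bigg|\,
\begin{pmatrix} \delta^{\mathrm{calibr}}_1 \\ \vdots \\ \delta^{\mathrm{calibr}}_N \end{pmatrix},
\theta, \Sigma_{\delta^{\mathrm{calibr}}} \sim \mathrm{Normal}\left(
\begin{pmatrix} \delta^{\mathrm{calibr}}_1 \\ \vdots \\ \delta^{\mathrm{calibr}}_N \end{pmatrix},
\Sigma_{\delta^{\mathrm{calibr}}} \right), \tag{5.4.8}
\]
\[
\text{level } 2': \quad \delta^{\mathrm{calibr}}_p \mid \delta^{\mathrm{effect}}, \tau^2 \sim \mathrm{Normal}(\delta^{\mathrm{effect}}, \tau^2), \quad p = 1, \ldots, N, \tag{5.4.9}
\]
where expression (5.4.8) follows from (5.4.7); the covariance matrix $\Sigma_{\delta^{\mathrm{calibr}}}$, obtained by
the delta method, can be estimated by $\hat{\Sigma}_{\delta^{\mathrm{calibr}}}$; and $\tau^2$ is the variance of $\delta^{\mathrm{calibr}}_p$ over pairs p.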
Table 5.1 shows the results for the calibrated estimates as derived from expressions
(5.4.7) and (5.4.8) (see rows for outcome “calibrated on covariates”) for each of the seven
pairs in the Guided Care study. The covariates that are involved in the calibration are listed
in Table 5.2. It is notable that these calibrated differences, δcalibrp , are positive, in favor of
the control condition, for all pairs p.
Using these, Table 5.3 also reports the estimate of the overall effect δeffect, first based
only on the design-derived fact Result 2(a) that the average of $\delta^{\mathrm{calibr}}_p$ equals the effect
of interest δeffect and on the estimation of each $\delta^{\mathrm{calibr}}_p$ by $\hat{\delta}^{\mathrm{calibr}}_p$ as in (5.4.8) (see
1st level, “calibrated on covariates”). As with the uncalibrated first-level approach, this
first-level calibrated approach makes no further assumptions about the joint distribution
pr($\delta^{\mathrm{calibr}}_p$, $\Sigma_{\delta^{\mathrm{calibr}}}$), and the MLE of δeffect is the unweighted sample average of $\hat{\delta}^{\mathrm{calibr}}_p$
(here, its standard error is estimated by the jackknife, although in general it is difficult
to trust a normal approximation with seven pairs). For this reason, we also calculated
the significance level of the MLE by permutation of the treatment labels, thus testing the
hypothesis of no true effect in any person. In this case, and because all calibrated estimated
differences have the same sign, the permutation based significance level is $2/2^7 \approx 0.016$
in favor of the control condition.
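The first-level calculations described above can be sketched as follows. The sketch assumes that permuting the treatment labels within a pair amounts to flipping the sign of that pair's estimated difference, and it uses seven hypothetical positive differences rather than the Guided Care estimates; the function name is likewise illustrative.

```python
from itertools import product
from math import sqrt


def first_level_summary(deltas):
    """Unweighted mean, jackknife SE, and two-sided sign-flip p-value."""
    n = len(deltas)
    total = sum(deltas)
    mean = total / n
    # Leave-one-pair-out means for the jackknife standard error.
    loo = [(total - d) / (n - 1) for d in deltas]
    loo_bar = sum(loo) / n
    jack_se = sqrt((n - 1) / n * sum((m - loo_bar) ** 2 for m in loo))
    # Enumerate all 2^n within-pair label swaps (sign flips of each difference).
    extreme = 0
    for signs in product((1, -1), repeat=n):
        perm_mean = sum(s * d for s, d in zip(signs, deltas)) / n
        if abs(perm_mean) >= abs(mean) - 1e-12:
            extreme += 1
    return mean, jack_se, extreme / 2 ** n


# Hypothetical differences, all with the same sign, for seven pairs.
example_deltas = [1.0, 2.0, 0.5, 1.5, 3.0, 0.8, 1.2]
mean, jack_se, p_value = first_level_summary(example_deltas)
print(mean, jack_se, p_value)
```

When all differences share one sign, only the observed labelling and its complete reversal give a mean at least as extreme, which is where the value 2/2^7 comes from.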
For a two-level approach based on (5.4.8) and (5.4.9), one can estimate δeffect by
first obtaining the marginalized likelihood, say, $L(\delta^{\mathrm{effect}}, \tau^2, \Sigma_{\delta^{\mathrm{calibr}}})$. We then estimated
δeffect by (i) the MLE after $\hat{\Sigma}_{\delta^{\mathrm{calibr}}}$ replaces $\Sigma_{\delta^{\mathrm{calibr}}}$; (ii) the MLE after profiling $\tau^2$ out; and
(iii) the posterior distribution of δeffect using noninformative priors for $\tau^2$ and δeffect. We
use a uniform shrinkage prior for the second-level variance $\tau^2$ advocated by Daniels (1999).
These results for the two-level approach are given in Table 5.3 (see rows 1st+ 2nd level;
MLE, pMLE, and Bayes, respectively).
As with the uncalibrated approach, the marginalized likelihood that uses (5.4.8) and
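(5.4.9) assumes that $\delta^{\mathrm{calibr}}_p$ is independent of $\Sigma_{\delta^{\mathrm{calibr}}}$. Figure 5.2, right panel, plots estimates
of the square root of the diagonals of $\Sigma_{\delta^{\mathrm{calibr}}}$, $\sqrt{v^{\mathrm{calibr}}_p}$, versus estimates of $\delta^{\mathrm{calibr}}_p$.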
Although the plot can be to some degree affected by measurement error, the R2 of 0.19
suggests that some dependence exists. Although this dependence could be modeled in
a modified second level, it is unclear how convincing such an approach would be as it
would introduce even more modeling assumptions. To avoid this, we calculated instead
the significance level of the two-level MLE estimate when evaluated from the permutation
distribution of the treatment labels.
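Figure 5.2: Checking second level dependence. Left: estimates of $\sqrt{v^{\mathrm{crude}}_p}$ versus $\hat{\delta}^{\mathrm{crude}}_p$;
Right: estimates of $\sqrt{v^{\mathrm{calibr}}_p}$ versus $\hat{\delta}^{\mathrm{calibr}}_p$, where $v^{\mathrm{calibr}}_p$ are the diagonal elements of
$\Sigma_{\delta^{\mathrm{calibr}}}$.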
5.4.3 Assessment of the hypothesis of no effect
The proposed approach, in addition to being robust for hypothesis testing when eval-
uated by permutation, is likely to have a more general robustness property analogous to
the one arising in a simpler design. Specifically, in the design of complete randomization
of units (unpaired, unclustered), Rosenblum and van der Laan (2010) have shown that a
certain class of parametric models for covariates yield MLEs for the causal effect that are
consistent for the null value if indeed there is no effect on any person, even if the models
are incorrect. Shinohara et al. (2012) showed that an extended class of models has this
robustness property if the models satisfy an easy to check symmetry criterion.
For the matched-paired clustered-randomization design, analogous classes of models
with such robustness property may also exist. Specifically, suppose that, more generally
than model (5.4.4), we conceptualize a parametric model as one that allows distributions
mp,c(y | x) for the outcome at value y given covariate at value x for each (pair,cluster)
labelled (p, c). Many flexible models mp,c(· | ·) (or, for brevity, mp,c), including (5.4.4),
have the property that if, for two pairs and their clusters
\[
\begin{pmatrix} p_1c_1 & p_1c_2 \\ p_2c_1 & p_2c_2 \end{pmatrix},
\]
the model allows the distributions
\[
\begin{pmatrix} m_{1,1} & m_{1,2} \\ m_{2,1} & m_{2,2} \end{pmatrix},
\]
then it also allows the distributions
\[
\begin{pmatrix} m_{2,2} & m_{1,2} \\ m_{1,2} & m_{2,2} \end{pmatrix}
\quad \text{and} \quad
\begin{pmatrix} m_{1,1} & m_{2,1} \\ m_{2,1} & m_{1,1} \end{pmatrix}.
\]
The intuition of this property is that the model allows exchangeable distributions between
any two observed pairs. Following a similar reasoning to that of Shinohara et al. (2012),
we hypothesize that if (a) there is no effect of intervention in the distribution of any cluster,
i.e., in the true distributions defined in Condition 3, Fp,c(· | ·; t = 1) = Fp,c(· | ·; t = 2) for
all p, c, and (b) a model that has the above symmetry property is used, then the limit of the
MLE of the causal effect (5.4.2) is null even if the model is incorrect. A detailed treatment
of this issue can allow for combining validity with increased efficiency in such designs as
well.
5.5 Discussion
For the design that matches clusters of units and assigns interventions to clusters within
pairs, we proposed an approach that estimates the average causal effect while also explicitly
calibrating possible covariate imbalance between the clusters. The approach can use only
one level of inference, or can be used in a hierarchical model.
In the Guided Care study, a first-level inference with the new approach reports esti-
mates of the causal effect with smaller estimated variance than without using covariates
(see Table 5.3). Although it is difficult to know if this is objectively true in this small
sample of pairs, the results from the permutation tests between the two approaches are
also consistent with this conclusion. A simple two-level approach, with or without covari-
ates, makes an implicit assumption which can invalidate causal comparison of the inter-
ventions, and explicitly addressing the assumption would introduce additional modeling.
The covariate-calibrated approach reports that the control condition leads to higher, albeit
clinically insignificant, average overall SF36 score compared to that under Guided Care
Nurse intervention, using either a single-level (approximate or permutation-based) analysis
or a two-level permutation-based analysis.
The proposed approach is expected to be more generally robust to model misspecifica-
tion when assessing the hypothesis of no effect, if the model (5.4.4) belongs in a relatively
broad class. This expectation needs formal verification, but, if confirmed, can lead to more
efficient estimation, and, hence, more efficient use of resources.
An alternative to the proposed approach can be to break the matching and then use
regression-assisted (Donner et al., 2007) or doubly-robust estimators (Rosenblum and van der
Laan, 2010) to estimate the treatment effect. Based on Rubin’s (Rubin, 1978) theory, the
matched design is still ignorable (and so the matching can be broken) if the variables that
were used to create the matching are still available and are included in the outcomes model.
In contrast, if these variables are not used in a model, then the matching design cannot be
ignored (namely, the matching cannot be broken), as this could generally lead to bias at
least in the expression of the uncertainty in inference. In this case, methods that explicitly
acknowledge the matching design are needed.
Chapter 6
Conclusions and Future Work
This thesis has developed statistical methods to advance the goal of individualized
health: to intelligently use information to optimize each person’s health given their unique
characteristics, circumstances, and preferences. In Part I, we have developed and demonstrated
how nested partially-latent class models (npLCM) can be used to estimate population etiology
and to better diagnose individuals. Our approach to the estimation of population etiologic
fractions in a case-control design has been to formulate a hierarchical Bayesian model
that represents the case population as a mixture of different classes of patients. The con-
trol distribution provides essential evidence about the measurement error rates and about
dependence among the binary measurements about a lung infection. Efficient and easy-
to-implement Gibbs sampling algorithms are derived and implemented for realistic sample
sizes and dimension of measurements.
Our model has multiple advantages over the population attributable fraction method
(Bruzzi et al., 1985). In particular, it allows for multiple sources of measurements and
accounts for possible differential laboratory measurement errors. As measurements be-
come more abundant, this integrative approach could be helpful for assessing the value of
different data sources.
Several features of the PERCH study warrant future statistical research. First, the pneu-
monia case definition is not perfect. We can introduce one more latent variable to indicate
true disease status and use biomarkers to probabilistically assign each prospective case as
a control.
Second, besides the multivariate binary measurements on pathogen presence/absence,
the PERCH study also collected some continuous-scale measurements on pathogen quan-
tities in specimens. We can investigate differences in the density of pathogen for cases
and controls to determine its importance in etiology. Bayesian nonparametric density estimation
and dictionary learning methods can be developed for this purpose to capture and
compare the flexible density shapes. We also need to consider zero-inflated model extensions
to accommodate the observation that most densities are zero-inflated, meaning many
cases/controls have an exact zero or a value below the limit of detection for certain pathogens,
even if continuous measurement is the protocol.
Third, although multiple sites are involved in the PERCH study, the current research
focus is to infer site-specific etiologies. Clinical experience suggests that many sites share
the same major pathogen causes but with potential site-specific etiology and laboratory
testing characteristics. We can therefore extend our npLCM to have one more site level in
the hierarchical formulation.
Lastly, in the current formulation of the npLCM, cross-sectional data is used to inform
about an individual’s health state at a single time point. Extensions of the framework to
incorporate longitudinal measurements, time-varying latent health trajectories, and time-
varying treatment assignment probabilities can support clinicians to make decisions about
treating individual patients as new evidence is obtained over time.
In Part II, we have developed an inferential methodology to evaluate the individualized
interventions if applied to a population. The method accounted for both the matched-pair
cluster randomized (MPCR) design and the potential covariate imbalances between clusters
in a pair even after matching. The proposed class of covariate-calibrated estimators can
correct potential bias that may be present in the MPCR design while retaining the original
estimand of interest.
Finally, an important question not discussed in this thesis is the individualized treatment
selection. In clinical practice, patients usually have heterogeneous responses to a particular
treatment. Also, many health disorders (e.g. ADHD, HIV, cystic fibrosis, non-small cell
lung cancer) are of chronic nature, and usually require multiple treatment reconsiderations
or replacements over time. The overarching goal of individualized health is thus to provide
clinically meaningful improved health outcomes for patients by delivering the right drug, at
the right dose, and at the right time. We seek to develop reproducible, statistically efficient
trial designs and analytic methods to learn the individualized treatment rules.
A related but simpler question is to find the right subpopulation of patients for a particu-
lar treatment. Consider the data from a single-stage randomized trial that compares between
treatments. The investigators wish to assess if the treatment effect varies across subgroups
of individuals defined by covariates. However, a full answer to this problem is unfeasible
if the covariates have a large dimension. Recent methodology by Cai et al. (2011) has ad-
vanced this area by using a two-stage approach: first, using a working parametric model,
the covariates are reduced to a scalar summary; and second, a nonparametric regression is
used to estimate the treatment effect as a function of that summary. A remaining problem with such
approaches is that the number of possible working models is too large to be
explored. We can approach the problem by first focusing attention on all possible models
that would ultimately stratify individuals to a finite number of strata, for example, small
enough to be of practical use to clinicians. Statistically, this allows one to search over all
possible functions of covariates that characterize the strata, or even over submodels from
that set.
In the special case where subpopulations that partition the whole population have been
prespecified using pretreatment covariates, Rosenblum et al. (2013) have developed optimal
testing procedures to detect the treatment effect on the overall population, or on the
subpopulations jointly.
Novel study designs can also help with efficient estimation of subpopulation treatments.
For example, in adaptive clinical trials, when a person presents for randomization, we can
perform adaptive randomization based on the individual’s covariates, where the probability
of assignment to each of available therapies varies over time as a function of the current
estimates of the treatment effects in subgroups of people (Rosenblum and van der Laan,
2011). The adaptive design modifies the trial as it progresses: it discards ineffective or
harmful therapies early and finds subpopulations of patients who can benefit from a particular
therapy, and hence can reduce overall costs, improve patients’ adherence, and save time. For
example, Zhou et al. (2008) described Bayesian adaptive randomization trial designs to use
biomarker profiles for identifying effective targeted therapies in the biomarker-integrated
approaches of targeted therapy of lung cancer elimination (BATTLE) study, which is also
discussed in Berry et al. (2010) together with other adaptively designed studies like the
I-SPY 2 study (Barker et al., 2009).
Returning to the original question of selecting the right treatment for each patient, the
statistical challenges lie in at least three aspects. First, we have to learn individualized
treatment rules from the training data where the optimal treatments are unknown. For ex-
ample, using the data from a traditional clinical trial, we need to learn the rule or treatment
regime, d, that optimally assigns a treatment, among a set of possible treatments, to a pa-
tient as a function of her observed characteristics (X) hence individualizing treatments to
the patient. Qian and Murphy (2011) proposed to find the rule d∗ that optimizes the popula-
tion average response V(d) if a rule d is applied to the whole population. Qian and Murphy
(2011) then used regression models for the outcome for learning d∗. Zhao et al. (2012)
showed that finding the optimal rule d∗ is equivalent to a weighted classification problem
for treatments which motivated their outcome weighted learning method. Other methods
that combine biomarkers for treatment selection have also been developed, see, for exam-
ple, Gunter et al. (2007); Brinkley et al. (2010); Foster et al. (2011); Gunter et al. (2011);
Zhao et al. (2012); Zhang et al. (2012). Huang et al. (2012) also developed measures to
characterize biomarkers’ capacity to help with treatment selections.
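To make the population average response V(d) concrete, the sketch below estimates it from randomized-trial data by inverse-probability weighting: only patients whose assigned treatment agrees with the rule contribute. This estimator is a standard device assumed here for illustration and is not a method developed in this thesis; the function name, the data layout, and the toy data are hypothetical.

```python
def ipw_value(rule, data, prob_assigned):
    """Inverse-probability-weighted estimate of V(d) from trial data.

    rule: function mapping covariates x to a treatment label
    data: iterable of (x, a, y) = covariates, assigned treatment, outcome
    prob_assigned: function (a, x) -> randomization probability of a given x
    """
    total, weight = 0.0, 0.0
    for x, a, y in data:
        if rule(x) == a:  # assignment agrees with what the rule would choose
            w = 1.0 / prob_assigned(a, x)
            total += w * y
            weight += w
    return total / weight  # normalized weighted average of agreeing outcomes


# Toy data: scalar covariate, treatments 1 and 2 randomized with probability 1/2.
toy_trial = [(0.2, 1, 5.0), (0.9, 1, 2.0), (0.1, 2, 3.0), (0.8, 2, 6.0)]
half = lambda a, x: 0.5
threshold_rule = lambda x: 1 if x < 0.5 else 2
always_one = lambda x: 1
print(ipw_value(threshold_rule, toy_trial, half), ipw_value(always_one, toy_trial, half))
```

Comparing such estimates across candidate rules is one way to see why finding d∗ can be recast as a weighted classification problem.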
The second challenge is to intelligently use predictors from a combination of diagnostic
tests, imaging, genetics, genomics, proteomics, etc., that may be high-dimensional with
potential high orders of interactions. The two-stage approach proposed by Cai et al. (2011)
is a useful tool to cope with high dimensionality. More work is still needed to integrate
data of different types.
The third challenge is to account for the longitudinal dependence of measurements
within a subject under sequential treatments, for example, in the sequential multiple as-
signment randomized trials (SMART, Lavori and Dawson (2000); Murphy (2005)). The
SMART randomizes a subject into treatments depending on the success or failure of previ-
ously randomized treatments on this individual, e.g., observed improvements in her health
outcome, side effects, burden, etc. For its ethical assignments of treatments, the SMART
has gradually been adopted in areas like mental health (Pelham Jr and Fabiano, 2008; Almi-
rall et al., 2012) and addiction research (Murphy et al., 2007; Strecher et al., 2008). Here
the statistical goal is to find a list of sequential decision rules or dynamic treatment regime
for assigning treatments based on a patient’s history. Thall et al. (2000) and Thall et al.
(2002) used likelihood based methods to model the conditional distribution of outcome
given past information in the sequential trial; Watkins (1989), Watkins and Dayan (1992),
Murphy (2005) and Robins (2004) used Q-learning (“Q” for “quality”) / A-learning (“A”
for “advantage”) methods to model all or part of the conditional mean outcome; other approaches
based on weighting methods have also been proposed (Zhao et al., 2012; Zhang
et al., 2012).
It remains an open question how to design individualized treatment rules for survival
outcomes, multiple outcomes of competing importance (e.g. drug efficacy and toxicity),
and continuous dosing levels and timings. Future work is also needed in more complex
settings involving incompleteness of outcome measurements or observational data.
APPENDICES
A1 Appendix to Chapter 3
A1.1 Full conditional distributions in Gibbs sampler
In this section, we provide analytic forms of the full conditional distributions that are essential
for the Gibbs sampling algorithm. We use a data augmentation scheme that introduces the
latent lung state $I^L_i$ into the sampling chain, and we have the following full conditional
distributions:
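• $[I^L_i \mid \text{others}]$. If $M^{GS}_i$ is available, $\Pr(I^L_i = j \mid \text{others}) = 1$ if $M^{GS}_{ij} = 1$ and
$M^{GS}_{il} = 0$ for $l \neq j$; otherwise zero. If $M^{GS}_i$ is missing, according as whether
$M^{SS}_i$ is available, the full conditional is given as
\[
\begin{aligned}
\Pr(I^L_i = j \mid \text{others}) \propto{}& \left(\theta^{BrS}_j\right)^{M^{BrS}_{ij}} \left(1-\theta^{BrS}_j\right)^{1-M^{BrS}_{ij}} \prod_{l \neq j} \left(\psi^{BrS}_l\right)^{M^{BrS}_{il}} \left(1-\psi^{BrS}_l\right)^{1-M^{BrS}_{il}} \\
& \cdot \left[ \left(\theta^{SS}_j\right)^{M^{SS}_{ij}} \left(1-\theta^{SS}_j\right)^{1-M^{SS}_{ij}} \mathbf{1}_{\{\sum_{l \neq j} M^{SS}_{il} = 0\}} \right]^{\mathbf{1}_{\{j \leq J'\}}} \cdot \pi_j;
\end{aligned}
\tag{A.1.1}
\]
if SS measurement is not available for case i, we remove terms involving $M^{SS}_{ij}$.
• $[\psi^{BrS}_j \mid \text{others}] \sim \mathrm{Beta}\left(N_j + b_{1j},\ n_1 - \sum_{i: Y_i = 1} \mathbf{1}_{\{I^L_i = j\}} + n_0 - N_j + b_{2j}\right)$, where
$n_1$ and $n_0$ are the numbers of cases and controls, respectively, and
\[
N_j = \sum_{i: Y_i = 1,\, I^L_i \neq j} M^{BrS}_{ij} + \sum_{i: Y_i = 0} M^{BrS}_{ij}
\]
is the number of positives at position j for cases with $I^L_i \neq j$ and all controls.
• $[\theta^{BrS}_j \mid \text{others}] \sim \mathrm{Beta}\left(S_j + c_{1j},\ \sum_{i: Y_i = 1} \mathbf{1}_{\{I^L_i = j\}} - S_j + c_{2j}\right)$, where
$S_j = \sum_{i: Y_i = 1,\, I^L_i = j} M^{BrS}_{ij}$ is the number of positives for cases with the jth pathogen as
their cause.
• $[\theta^{SS}_j \mid \text{others}] \sim \mathrm{Beta}\left(T_j + d_{1j},\ \sum_{i: Y_i = 1,\, \text{SS available}} \mathbf{1}_{\{I^L_i = j\}} - T_j + d_{2j}\right)$, where
\[
T_j = \sum_{i: Y_i = 1,\, I^L_i = j,\, \text{SS available}} M^{SS}_{ij}.
\]
When no SS data is available, this conditional distribution reduces to $\mathrm{Beta}(d_{1j}, d_{2j})$,
the prior.
• $[\pi \mid I^L_i, i : Y_i = 1] \sim \mathrm{Dirichlet}(a_1 + U_1, ..., a_J + U_J)$, where $U_j = \sum_{i: Y_i = 1} \mathbf{1}_{\{I^L_i = j\}}$.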
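Two of the conjugate updates listed above can be sketched in a few lines. The sketch covers only the Beta update for $\theta^{BrS}_j$ and the Dirichlet update for π; the function names, the 0-based class labels, and the tiny data set are assumptions made for illustration.

```python
import numpy as np


def theta_brs_beta_params(M_brs, case_class, j, c1j, c2j):
    """Beta parameters of the full conditional of theta_j^BrS.

    M_brs: (n_cases, J) binary BrS measurements for the cases
    case_class: length n_cases array of current latent states I_i^L (0-based)
    """
    in_class = case_class == j
    S_j = int(M_brs[in_class, j].sum())  # positives among cases caused by j
    n_j = int(in_class.sum())            # cases currently assigned to class j
    return S_j + c1j, n_j - S_j + c2j


def pi_dirichlet_params(case_class, a):
    """Dirichlet parameters (a_1 + U_1, ..., a_J + U_J) for pi."""
    U = np.bincount(case_class, minlength=len(a))  # cases per class
    return np.asarray(a, dtype=float) + U


rng = np.random.default_rng(0)
M_brs = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
case_class = np.array([0, 0, 1, 1])
theta_draw = rng.beta(*theta_brs_beta_params(M_brs, case_class, 0, 1.0, 1.0))
pi_draw = rng.dirichlet(pi_dirichlet_params(case_class, [1.0, 1.0]))
```

The updates for $\psi^{BrS}_j$ and $\theta^{SS}_j$ follow the same pattern with the counts $N_j$ and $T_j$ in place of $S_j$.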
A1.2 Pathogen names and their abbreviations
Bacteria: HINF- Haemophilus influenzae; PNEU-Streptococcus pneumoniae;
SASP-Salmonella species; SAUR-Staphylococcus aureus.
Viruses: ADENOVIRUS-adenovirus; COR 43-coronavirus OC43; FLU C-influenza virus
type C; HMPV A B-human metapneumovirus type A or B; PARA1-parainfluenza type 1
virus; RHINO-rhinovirus; RSV A B-respiratory syncytial virus type A or B.
A1.3 Additional simulation results
Figure A.1.1 (panels (a) and (b)): Reduction in marginal diameter of 95% credible region as θ approaches 1. In each subfigure, each boxplot describes the variation of pathogen-specific marginal diameters of 95% credible regions across 100 simulated datasets. Each curve connects the mean values from boxplots across increasing true positive rates. “—”, “- - -”, and “· · ·” denote marginal diameters calculated from BrS+GS, BrS-only, GS-only analyses, respectively; “·−·−” corresponds to prior. Rows of subfigures correspond to different fractions of gold-standard measurements available, 1% and 10%. The blue dashed lines are the same across rows for fair comparisons. They are obtained from simulated data sets with the same sets of random seeds.
A2 Appendix to Chapter 4
A2.1 Posterior computations
This section details the full conditional distributions of unknown parameters and aux-
iliary variables as well as their sampling strategy in the Gibbs sampler. [A | B] represents
the conditional probability density or probability mass function for entityA given the value
of entity B; If B is null, [A] represents the marginal distribution of A. The super index (j)
is reserved for class-specific quantities; subclass index k appears only in the subscript.
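1. Sample the class indicator $I^L_{i'}$ for cases $i' = 1, ..., n_1$, from a multinomial distribution
with probabilities
\[
\begin{aligned}
P(I^L_{i'} = j \mid \cdots) = p^{(j)}_{i'} &\propto [M_{i'} \mid Z_{i'}, \Theta, \Psi, I^L_{i'} = j]\,[Z_{i'} \mid \eta^{(j)}, I^L_{i'} = j]\,[I^L_{i'} = j \mid \pi] \\
&\propto \left\{\theta^{(j)}_{Z_{i'}}\right\}^{M_{i'j}} \left\{1-\theta^{(j)}_{Z_{i'}}\right\}^{1-M_{i'j}} \prod_{l \neq j} \left\{\psi^{(l)}_{Z_{i'}}\right\}^{M_{i'l}} \left\{1-\psi^{(l)}_{Z_{i'}}\right\}^{1-M_{i'l}} \cdot \eta^{(j)}_{Z_{i'}} \cdot \pi_j,
\end{aligned}
\]
for $j = 1, ..., J$.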
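2. Sample subclass indicators $Z_{i'}$ for case $i' = 1, ..., n_1$, from a multinomial distribution
with probabilities
\[
\begin{aligned}
P(Z_{i'} = k \mid \cdots) = q_{i'k} &\propto [M_{i'} \mid Z_{i'}, I^L_{i'}, \Theta, \Psi]\,[Z_{i'} \mid I^L_{i'}, \eta^{(I^L_{i'})}] \\
&\propto \eta^{(I^L_{i'})}_k \cdot \left\{\theta^{(I^L_{i'})}_k\right\}^{M_{i' I^L_{i'}}} \left\{1-\theta^{(I^L_{i'})}_k\right\}^{1-M_{i' I^L_{i'}}} \times \prod_{l \neq I^L_{i'}} \left\{\psi^{(l)}_k\right\}^{M_{i'l}} \left\{1-\psi^{(l)}_k\right\}^{1-M_{i'l}}.
\end{aligned}
\]
Sample subclass indicators $Z_i$ for control $i = n_1 + 1, ..., n_1 + n_0$, from a multinomial
distribution with probabilities
\[
\begin{aligned}
P(Z_i = k \mid \cdots) = q_{ik} &\propto [M_i \mid Z_i = k, \Psi]\,[Z_i = k \mid \nu] \\
&\propto \nu_k \cdot \prod_{j=1}^{J} \left\{\psi^{(j)}_k\right\}^{M_{ij}} \left\{1-\psi^{(j)}_k\right\}^{1-M_{ij}}, \quad k = 1, ..., K.
\end{aligned}
\]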
3. Sample the case subclass weights $\eta^{(j)}$ for $j = 1, ..., J$ from
\[
\mathrm{pr}(\eta^{(j)} \mid \cdots) \propto \prod_{i': I^L_{i'} = j} [Z_{i'} \mid \eta^{(j)}, I^L_{i'}]\,[\eta^{(j)} \mid \alpha]
\]
which can be accomplished by first setting $u^{(j)*}_K = 1$ and sampling
\[
u^{(j)*}_k \sim \mathrm{Beta}\left(1 + z'^{(j)}_k,\ \alpha + \sum_{l=k+1}^{K} z'^{(j)}_l\right), \quad k = 1, ..., K - 1,
\]
where $z'^{(j)}_k$ is the number of cases assigned to subclass k in class j. We write
$z'^{(j)}_k = \#\{i' : Y_{i'} = 1, Z_{i'} = k, I^L_{i'} = j\}$, for $k = 1, ..., K$, where “#A” counts the number
of elements in set A. We then construct $\eta^{(j)}_1 = u^{(j)*}_1$, $\eta^{(j)}_k = u^{(j)*}_k \prod_{l=1}^{k-1} \left\{1 - u^{(j)*}_l\right\}$,
$k = 2, ..., K$.
4. Sample the control subclass weights $\nu = (\nu_1, ..., \nu_K)^T$ from
\[
\mathrm{pr}(\nu \mid \cdots) \propto \prod_{i: Y_i = 0} [Z_i \mid \nu] \cdot [\nu \mid \alpha],
\]
which can be accomplished by first setting $v^*_K = 1$ and sampling
\[
v^*_k \sim \mathrm{Beta}\left(1 + z_k,\ \alpha + \sum_{l=k+1}^{K} z_l\right), \quad k = 1, ..., K - 1,
\]
where $z_k$ is the number of controls assigned to subclass k, and then constructing
$\nu_1 = v^*_1$, $\nu_k = v^*_k \prod_{l=1}^{k-1} (1 - v^*_l)$, $k = 2, ..., K$.
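5. Sample concentration parameter α for stick-breaking prior from
\[
\mathrm{pr}(\alpha \mid \cdots) \propto \prod_{j=1}^{J} [\eta^{(j)} \mid \alpha] \cdot [\nu \mid \alpha]\,[\alpha] \propto \alpha^{(K-1)J+1} \exp(-\alpha \cdot r) \cdot \mathrm{pr}(\alpha),
\]
where $r = -\left\{ \sum_{j=1}^{J} \sum_{k=1}^{K-1} \log(1 - u^{(j)*}_k) + \sum_{k=1}^{K-1} \log(1 - v^*_k) \right\}$. If a conditionally
conjugate prior for α is used, i.e. $\alpha \sim \mathrm{Gamma}(a_\alpha, b_\alpha)$, then the full conditional
distribution reduces to $\mathrm{Gamma}(a_\alpha + (K - 1)J + 1,\ b_\alpha + r)$.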
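6. Sample the vector of subclass TPR for $j = 1, ..., J$ from
\[
\begin{aligned}
\mathrm{pr}(\theta^{(j)} \mid \cdots) &\propto \prod_{i': I^L_{i'} = j} [M_{i'} \mid \theta^{(j)}, Z_{i'}, I^L_{i'}]\,[\theta^{(j)}] \\
&\propto \prod_{k=1}^{K} \left\{\theta^{(j)}_k\right\}^{m^{(j)}_{k1}} \left\{1 - \theta^{(j)}_k\right\}^{m^{(j)}_{k0}} \cdot [\theta^{(j)}],
\end{aligned}
\tag{A.2.1}
\]
where $m^{(j)}_{kc} = \#\{i' : Y_{i'} = 1, Z_{i'} = k, I^L_{i'} = j, M_{i'j} = c\}$, $c = 0, 1$. If the priors for the TPRs
are independent Beta distributions, then this is a product of Beta distributions.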
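7. Sample subclass-specific FPRs $\psi^{(j)}_k$ for $j = 1, ..., J$, $k = 1, ..., K$ from
\[
\begin{aligned}
\mathrm{pr}(\psi^{(j)}_k \mid \cdots) &\propto \prod_{i': Y_{i'} = 1,\, I^L_{i'} \neq j,\, Z_{i'} = k} [M_{i'j} \mid \psi^{(j)}, Z_{i'}, I^L_{i'}] \prod_{i: Y_i = 0} [M_{ij} \mid \psi^{(j)}, Z_i] \cdot [\psi^{(j)}_k] \\
&\propto \left\{\psi^{(j)}_k\right\}^{s^{(-j)}_{k1}} \left\{1 - \psi^{(j)}_k\right\}^{s^{(-j)}_{k0}} \cdot \mathrm{pr}(\psi^{(j)}_k),
\end{aligned}
\]
where $s^{(-j)}_{kc} = \#\{i' : Y_{i'} = 1, Z_{i'} = k, I^L_{i'} \neq j, M_{i'j} = c\} + \#\{i : Y_i = 0, Z_i = k, M_{ij} = c\}$,
for $c = 0, 1$. If the prior on FPRs is $\mathrm{Beta}(a_1, b_1)$, then the above
conditional distribution is $\mathrm{Beta}(a_1 + s^{(-j)}_{k1},\ b_1 + s^{(-j)}_{k0})$.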
8. Sample π from $\mathrm{Dirichlet}(d_1 + t^{(1)}, ..., d_J + t^{(J)})$, where $t^{(j)}$ is the number of cases
assigned to class j, i.e. $t^{(j)} = \#\{i' : Y_{i'} = 1, I^L_{i'} = j\}$, $j = 1, ..., J$.
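The stick-breaking updates in steps 3 and 4 share one construction, which can be sketched as below. The function name and argument layout are hypothetical; `counts[k]` plays the role of $z_k$ (or $z'^{(j)}_k$), and the last stick is fixed at 1 as in the truncated model.

```python
import numpy as np


def sample_stick_weights(counts, alpha, rng):
    """Draw truncated stick-breaking weights given subclass counts.

    counts: length-K array, counts[k] = number of units assigned to subclass k
    alpha: concentration parameter of the stick-breaking prior
    """
    counts = np.asarray(counts, dtype=float)
    K = len(counts)
    # tail[k] = number of units in subclasses after k (sum over l > k).
    tail = np.concatenate([np.cumsum(counts[::-1])[::-1][1:], [0.0]])
    v = np.ones(K)  # the last stick is fixed at 1
    v[:-1] = rng.beta(1.0 + counts[:-1], alpha + tail[:-1])
    # weights[k] = v[k] * prod_{l < k} (1 - v[l])
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    return v * remaining


rng = np.random.default_rng(1)
weights = sample_stick_weights([5, 3, 0, 2], alpha=1.0, rng=rng)
print(weights, weights.sum())
```

Fixing the last stick at 1 is what makes the K truncated weights sum to one.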
A2.2 Full pathogen names with abbreviations
Bacteria: 1.HINF- Haemophilus influenzae; 2.SASP-Salmonella species;
3.SAUR-Staphylococcus aureus.
Viruses: 4.ADENOVIRUS-adenovirus; 5.COR 43-coronavirus OC43; 6.FLU C-influenza
virus type C; 7.HMPV A B-human metapneumovirus type A or B; 8.PARA1-parainfluenza
type 1 virus; 9. RHINO-rhinovirus; 10. RSV A B-respiratory syncytial virus type A or B.
A2.3 Stick-breaking prior
The stick-breaking mixture model has a countably infinite number of subclasses. However,
because the $\nu_k$ and $\eta^{(j)}_k$ decrease exponentially quickly, a priori, we expect that only
a small number of subclasses will be used to model the data. The expected number of
subclasses used is logarithmic in the number of observations (Hjort et al., 2010). This is
different than a finite mixture model, which uses a fixed number of clusters to model the
data. In the stick-breaking mixture model, the actual number of clusters used to model data
is not fixed, and can be automatically inferred from data using the usual Bayesian posterior
inference framework (Neal, 2000).
Equations (4.2.12)-(4.2.16) place exchangeable prior weight on the subclasses. Following
Ishwaran and James (2002), in our computations, we truncate the infinite sum to the
first K terms with K sufficiently large to balance computing speed and approximating per-
formance of the model. In our simulations and data application K = 10 is usually deemed
adequate. Most subclass measurement profiles are not occupied by either the simulations
or data application, so that a smaller number of K, e.g. 3, is usually sufficient for approxi-
mation. Also, by choosing hyperpriors for stick-breaking parameters α0 as in (4.2.16), we
can let the data inform us about the desired sparsity level for approximating the probability
contingency tables for the control and each disease class. A small value of the estimate α0
suggests that only a small number of subclasses are necessary for the controls (cases).
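The effect of truncation can be gauged from the prior means of the weights. The sketch below assumes that the stick variables are independent Beta(1, α) draws, which is the prior implied by the Beta full conditionals in Section A2.1; under that assumption the k-th weight has prior mean (1/(1+α))·(α/(1+α))^(k−1). The function names are illustrative.

```python
def expected_stick_weight(k, alpha):
    """Prior mean of the k-th weight (k = 1, 2, ...) for Beta(1, alpha) sticks."""
    return (1.0 / (1.0 + alpha)) * (alpha / (1.0 + alpha)) ** (k - 1)


def expected_mass_first_K(K, alpha):
    """Prior mean of the total weight on the first K subclasses (untruncated)."""
    return 1.0 - (alpha / (1.0 + alpha)) ** K


for alpha in (0.5, 1.0, 2.0):
    print(alpha, expected_mass_first_K(3, alpha), expected_mass_first_K(10, alpha))
```

Smaller values of α concentrate the expected mass on the first few subclasses, which is consistent with reading a small estimated α as evidence that few subclasses are needed.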
A2.4 Mean and covariance structure
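The marginal means of the observations take the form
\[
\Pr(M_{ij} = 1 \mid Y_i = 1) = \pi_j \sum_{k=1}^{K} \theta^{(j)}_k \eta^{(j)}_k + \sum_{s \neq j} \pi_s \sum_{k=1}^{K} \psi^{(j)}_k \eta^{(s)}_k, \tag{A.2.2}
\]
\[
\Pr(M_{ij} = 1 \mid Y_i = 0) = \sum_{k=1}^{K} \psi^{(j)}_k \nu_k. \tag{A.2.3}
\]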
Equations (A.2.2) and (A.2.3) indicate that the observed rate of pathogen j among cases is
a mixture of two components: cases whose disease is caused by pathogen j, for which the
observation is a true positive, and cases whose disease is caused by another pathogen,
for which the observation is a false positive. The case and control mean rates for pathogen
j observations are equal when either of the following two situations occurs:

(I) $\psi^{(j)}_1 = \cdots = \psi^{(j)}_K = \psi^{(j)}$ and
$\sum_{k=1}^{K} \theta^{(j)}_k \eta^{(j)}_k = \psi^{(j)}$.

Condition (I) says that a measurement of pathogen j is independent of measurements of the
other pathogens among the controls, and, within the jth disease class, the rate of positive
pathogen j equals the control rate.
(II) $\eta^{(s)} = \nu$ for all $s \neq j$, and
$\sum_{k=1}^{K} \bigl[\theta^{(j)}_k \eta^{(j)}_k - \psi^{(j)}_k \nu_k\bigr] = 0$.

Condition (II) implies that, for a disease class $s \neq j$, the pairwise associations between
pathogen j and the other pathogens are equal between cases and controls.
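As a numerical illustration of (A.2.2), (A.2.3) and Condition (II), the following sketch evaluates both marginal rates for a small hypothetical configuration (J = 3 pathogens, K = 2 subclasses). All parameter values and the helper names `case_rate` and `control_rate` are assumptions made for illustration; they are not estimates from the thesis.

```python
import numpy as np

def case_rate(j, pi, theta, psi, eta):
    """Pr(M_ij = 1 | Y_i = 1) as in (A.2.2).

    pi: (J,) etiologic fractions; theta, psi: (J, K) subclass positive rates
    when pathogen j is / is not the cause; eta: (J, K) subclass weights per
    disease class.
    """
    true_pos = pi[j] * np.dot(theta[j], eta[j])
    false_pos = sum(pi[s] * np.dot(psi[j], eta[s]) for s in range(len(pi)) if s != j)
    return true_pos + false_pos

def control_rate(j, psi, nu):
    """Pr(M_ij = 1 | Y_i = 0) as in (A.2.3)."""
    return np.dot(psi[j], nu)

pi = np.array([0.5, 0.3, 0.2])
nu = np.array([0.7, 0.3])
psi = np.array([[0.10, 0.40], [0.05, 0.10], [0.20, 0.30]])
# Condition (II), first part: eta[s] equals nu for every class s other than j = 0.
eta = np.array([[0.5, 0.5], [0.7, 0.3], [0.7, 0.3]])

# Condition (II), second part: sum_k theta[0,k]*eta[0,k] equals sum_k psi[0,k]*nu[k] = 0.19.
theta_ii = np.array([[0.18, 0.20], [0.9, 0.8], [0.7, 0.6]])
# A contrasting configuration where pathogen 0 is informative.
theta_informative = np.array([[0.90, 0.80], [0.9, 0.8], [0.7, 0.6]])

rate_controls = control_rate(0, psi, nu)
rate_cases_ii = case_rate(0, pi, theta_ii, psi, eta)
rate_cases_informative = case_rate(0, pi, theta_informative, psi, eta)
print(rate_controls, rate_cases_ii, rate_cases_informative)
```

In the first configuration the case and control rates coincide, so the marginal rate of pathogen 0 alone carries no information about its etiologic fraction; in the second they differ.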
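The marginal pairwise log odds ratio for pathogen pair (j, l) among cases is given by

$$
\begin{aligned}
\omega_{jl} &= \log \frac{\Pr(M_{ij} = 1, M_{il} = 1)\,\Pr(M_{ij} = 0, M_{il} = 0)}
                         {\Pr(M_{ij} = 1, M_{il} = 0)\,\Pr(M_{ij} = 0, M_{il} = 1)} \\
&= \log\left( \sum_{c=1}^{J} \pi_c \left[ \sum_{h=1}^{K}
   \{\theta^{(j)}_h\}^{1_{\{c=j\}}} \{\psi^{(j)}_h\}^{1_{\{c\neq j\}}}
   \{\theta^{(l)}_h\}^{1_{\{c=l\}}} \{\psi^{(l)}_h\}^{1_{\{c\neq l\}}}
   \eta^{(c)}_h \right] \right) \\
&\quad + \log\left( \sum_{c=1}^{J} \pi_c \left[ \sum_{h=1}^{K}
   \{1-\theta^{(j)}_h\}^{1_{\{c=j\}}} \{1-\psi^{(j)}_h\}^{1_{\{c\neq j\}}}
   \{1-\theta^{(l)}_h\}^{1_{\{c=l\}}} \{1-\psi^{(l)}_h\}^{1_{\{c\neq l\}}}
   \eta^{(c)}_h \right] \right) \\
&\quad - \log\left( \sum_{c=1}^{J} \pi_c \left[ \sum_{h=1}^{K}
   \{\theta^{(j)}_h\}^{1_{\{c=j\}}} \{\psi^{(j)}_h\}^{1_{\{c\neq j\}}}
   \{1-\theta^{(l)}_h\}^{1_{\{c=l\}}} \{1-\psi^{(l)}_h\}^{1_{\{c\neq l\}}}
   \eta^{(c)}_h \right] \right) \\
&\quad - \log\left( \sum_{c=1}^{J} \pi_c \left[ \sum_{h=1}^{K}
   \{1-\theta^{(j)}_h\}^{1_{\{c=j\}}} \{1-\psi^{(j)}_h\}^{1_{\{c\neq j\}}}
   \{\theta^{(l)}_h\}^{1_{\{c=l\}}} \{\psi^{(l)}_h\}^{1_{\{c\neq l\}}}
   \eta^{(c)}_h \right] \right).
\end{aligned}
\tag{A.2.4}
$$
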
To illustrate the meaning of the above formula, suppose nearly all pneumonia is caused
by pathogen j, so that $\pi_j \approx 1$. If $\theta^{(j)}_k$, k = 1, ..., K, are equal, then
measurements on the jth pathogen are approximately marginally independent of the others.
If K = 2 and $\theta^{(j)}_k$, k = 1, 2, are very different, e.g. 1 versus 0 as an extreme
example, then $\omega_{jl} = \mathrm{logit}(\psi^{(l)}_1) - \mathrm{logit}(\psi^{(l)}_2)$,
which is completely determined by the variation in subclass FPRs for the lth pathogen.
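The two special cases can be checked numerically. The sketch below computes the case log odds ratio from the four joint probabilities in its definition, with within-subclass conditional independence and $\pi_j = 1$ exactly; the function name `case_log_odds_ratio` and all parameter values are hypothetical.

```python
import numpy as np

def case_log_odds_ratio(j, l, pi, theta, psi, eta):
    """Marginal log odds ratio between measurements j and l among cases."""
    def joint(mj, ml):
        total = 0.0
        for c in range(len(pi)):
            # Subclass-specific positive rates in disease class c: theta if c is
            # the measured pathogen, psi otherwise.
            pj = theta[j] if c == j else psi[j]
            pl = theta[l] if c == l else psi[l]
            fj = pj if mj == 1 else 1.0 - pj
            fl = pl if ml == 1 else 1.0 - pl
            total += pi[c] * np.sum(fj * fl * eta[c])
        return total
    return (np.log(joint(1, 1)) + np.log(joint(0, 0))
            - np.log(joint(1, 0)) - np.log(joint(0, 1)))

def logit(p):
    return np.log(p / (1.0 - p))

pi = np.array([1.0, 0.0])                    # all disease caused by pathogen j = 0
psi = np.array([[0.10, 0.20], [0.30, 0.05]])
eta = np.array([[0.6, 0.4], [0.5, 0.5]])

theta_equal = np.array([[0.8, 0.8], [0.9, 0.7]])
theta_extreme = np.array([[1.0, 0.0], [0.9, 0.7]])

omega_equal = case_log_odds_ratio(0, 1, pi, theta_equal, psi, eta)
omega_extreme = case_log_odds_ratio(0, 1, pi, theta_extreme, psi, eta)
print(omega_equal, omega_extreme, logit(psi[1, 0]) - logit(psi[1, 1]))
```

With equal subclass rates for pathogen j the log odds ratio vanishes, and in the extreme 1-versus-0 case it matches the difference in logits of the subclass FPRs for pathogen l.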
A3 Appendix to Chapter 5
A3.1 Proof of Result 1
We show that the MLE of $\delta^{\text{effect}}$ based on the standard meta-analytic likelihood
(5.3.1) is generally inconsistent. To do this, consider the simple but informative case of a
population of pairs of practices as shown in Figure A.3.2, where $\mu$ follows the positive
half of the standard normal distribution across such pairs. Because $\delta^{\text{crude}}_p$
is $\mu$ or $-\mu$ with probabilities $(\tfrac{1}{2}, \tfrac{1}{2})$, marginally the normality
of the distribution of $\delta^{\text{crude}}_p$ at the second level of (5.3.1) holds with
$\delta^{\text{effect}}\ (= E(\delta^{\text{crude}}_p)) = 0$ and with
$\mathrm{var}(\delta^{\text{crude}}_p) = 1$. Consider also, for simplicity, that
$\mathrm{var}(\delta^{\text{crude}}_p)$ is known, and that within clinical practices the number
of patients sampled is a constant n and the variances $\sigma^2_{p,c}(t)$ are known and are as
given in Figure A.3.2. Then the maximizer $\hat{\delta}^{\text{effect}}$ of the likelihood in
(5.3.1) is $\sum_p u_p \delta^{\text{crude}}_p / \sum_p u_p$, where
$(u_p)^{-1} = \mathrm{var}(\delta^{\text{crude}}_p) + v^{\text{crude}}_p$, and

$$
v^{\text{crude}}_p =
\begin{cases}
w_1 = \dfrac{2}{n}, & \text{if practice pair } p \text{ is of } \text{type}_p = 1; \\[6pt]
w_2 = \dfrac{1+\sigma^2}{n}, & \text{if practice pair } p \text{ is of } \text{type}_p = 2.
\end{cases}
$$

The probability limit of $\hat{\delta}^{\text{effect}}$ is
$E(u_p \delta^{\text{crude}}_p)/E(u_p)$, and its sign will be the sign of
$E(u_p \delta^{\text{crude}}_p)$. Here, although $E(\delta^{\text{crude}}_p) = 0$, Condition 2
fails because the sign of $\delta^{\text{crude}}_p$ depends on the magnitude of the variance
$v_p$. In particular,

$$
E(u_p \delta^{\text{crude}}_p) = E\{E(u_p \delta^{\text{crude}}_p \mid \text{type}_p)\}
= \frac{\mu}{2}\Bigl\{ \bigl[\mathrm{var}(\delta^{\text{crude}}_p) + w_2\bigr]^{-1}
 - \bigl[\mathrm{var}(\delta^{\text{crude}}_p) + w_1\bigr]^{-1} \Bigr\},
$$

which is nonzero if $\sigma^2 \neq 1$. This means that even if the null hypothesis of no
intervention effect on the means is correct, the standard meta-analytic approach (5.3.1) is
inappropriate if the intervention has an effect on the variance in at least one clinical
practice.
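The argument can be illustrated with a rough numerical sketch under stated assumptions: observed pair-level differences are taken to equal the true difference plus normal sampling noise with variance $v^{\text{crude}}_p$; pairs of type 2 are assigned $+\mu$ and pairs of type 1 are assigned $-\mu$ (the sign convention implied by the bracketed expression); and the limit is averaged over $\mu$, so $\mu$ is replaced by its half-normal mean. The function names and the values of n and $\sigma^2$ are hypothetical.

```python
import numpy as np

def meta_analytic_limit(n, sigma2, var_delta=1.0):
    """Probability limit E(u * delta) / E(u) of the inverse-variance weighted mean."""
    w1 = 2.0 / n
    w2 = (1.0 + sigma2) / n
    u1 = 1.0 / (var_delta + w1)
    u2 = 1.0 / (var_delta + w2)
    mean_mu = np.sqrt(2.0 / np.pi)  # E(mu) when mu is half-standard-normal
    return 0.5 * mean_mu * (u2 - u1) / (0.5 * (u1 + u2))

def simulate_estimate(n, sigma2, n_pairs, rng):
    """Inverse-variance weighted estimate from simulated pairs (assumed noise model)."""
    mu = np.abs(rng.standard_normal(n_pairs))
    type2 = rng.random(n_pairs) < 0.5
    delta = np.where(type2, mu, -mu)            # assumed sign convention per type
    v = np.where(type2, (1.0 + sigma2) / n, 2.0 / n)
    delta_hat = delta + np.sqrt(v) * rng.standard_normal(n_pairs)
    u = 1.0 / (1.0 + v)                          # var(delta) = 1 treated as known
    return np.sum(u * delta_hat) / np.sum(u)

rng = np.random.default_rng(531)
limit_equal_var = meta_analytic_limit(n=10, sigma2=1.0)
limit_unequal_var = meta_analytic_limit(n=10, sigma2=9.0)
estimate_unequal_var = simulate_estimate(n=10, sigma2=9.0, n_pairs=200_000, rng=rng)
print(limit_equal_var, limit_unequal_var, estimate_unequal_var)
```

When $\sigma^2 = 1$ the two weights coincide and the limit is zero; otherwise the weighted mean converges to a nonzero value even though the true mean effect is zero.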
A3.2 Proof of Result (4)
[Note: the reason that we want to prove the following results is that we find the sample
sizes (np,1, np,2) potentially not being exchangeably distributed across the 7 pairs. It is thus
desirable to show that 1st level inference is still valid, e.g. (4) still holds, under even weaker
conditional exchangeability.
CONDITION 1’. The potential outcomes under treatments 1 and 2 in clinical practice c,
and the number of patients served by clinical practice c are exchangeable (in distribution
over pairs) between clinical practices c = 1 and c = 2, i.e.,
Figure A.3.2: Structure for the example used in the proof of Result 1 (Appendix 1). Shown
is one true type of pair and the two types of observed pairs to which it can give rise,
depending on which clinical practice is assigned control. In each pair of parentheses are
shown the mean and variance of the potential outcomes of patients of the corresponding
clinical practice under a given treatment, as denoted in Figure 5.1.
where the arrows connect equal entries in arguments, and distribution pr is over pairs p in
the larger population P of pairs. ]
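We show that under Condition 1’, we have the following properties that facilitate 1st-level
only inference. More specifically, we are to show: (i) $E\{\mu_{p,1}(t)\} = E\{\mu_{p,2}(t)\}$
for t = 1, 2; and (ii) $E(\delta^{\text{crude}}_p) = E\{\mu_p(t = 1)\} - E\{\mu_p(t = 2)\}$.
For equalities (i), we only show the validity under t = 1. Denote the conditioning event
$\{N_{p,c}, c = 1, 2\}$ as N. We also use $E_A$ to denote the expectation with respect to the
distribution of A.

$$
\begin{aligned}
E_p\{\mu_{p,1}(1)\}
&= E_N\bigl[E_{p \mid N}\{\mu_{p,1}(1) \mid N\}\bigr] \\
&= E_N\left[\int \left\{\int a\, \mathrm{pr}\bigl(\mu_{p,1}(1) = a, \mu_{p,1}(2) = b,
   \mu_{p,2}(1) = s, \mu_{p,2}(2) = t \mid N\bigr)\, db\, ds\, dt\right\} da\right] \\
&= E_N\left[\int \left\{\int a \times \mathrm{pr}\bigl(\mu_{p,1}(1) = s, \mu_{p,1}(2) = t,
   \mu_{p,2}(1) = a, \mu_{p,2}(2) = b \mid N\bigr)\, db\, ds\, dt\right\} da\right] \\
&\qquad \text{(Condition 1’: conditional exchangeability)} \\
&= E_N\bigl[E_{p \mid N}\{\mu_{p,2}(1) \mid N\}\bigr] \\
&= E\{\mu_{p,2}(1)\}.
\end{aligned}
$$
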
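To prove (ii), we only need to show that $E\{\mu_{p,c=t}(t)\} = E\{\mu_p(t)\}$ for t = 1, 2.
For t = 1,

$$
\begin{aligned}
E_p\{\mu_p(1)\}
&= E_p\{\mu_{p,1}(1)\pi_{p,1} + \mu_{p,2}(1)\pi_{p,2}\} \\
&= E_N\bigl[E_{p \mid N}\{\mu_{p,1}(1)\pi_{p,1} + \mu_{p,2}(1)\pi_{p,2} \mid N\}\bigr] \\
&= E_N\bigl[\pi_{p,1} E_{p \mid N}\{\mu_{p,1}(1) \mid N\}
   + \pi_{p,2} E_{p \mid N}\{\mu_{p,2}(1) \mid N\}\bigr],
\end{aligned}
\tag{A.3.1}
$$

where the second term can be rewritten using Condition 1’ as

$$
\begin{aligned}
E_{p \mid N}\{\mu_{p,2}(1) \mid N\}
&= \int \left\{\int s \times \mathrm{pr}\bigl(\mu_{p,1}(1) = a, \mu_{p,1}(2) = b,
   \mu_{p,2}(1) = s, \mu_{p,2}(2) = t \mid N\bigr)\, da\, db\, dt\right\} ds \\
&= \int \left\{\int s \times \mathrm{pr}\bigl(\mu_{p,1}(1) = s, \mu_{p,1}(2) = t,
   \mu_{p,2}(1) = a, \mu_{p,2}(2) = b \mid N\bigr)\, da\, db\, dt\right\} ds \\
&= E_{p \mid N}\{\mu_{p,1}(1) \mid N\}.
\end{aligned}
\tag{A.3.2}
$$

Plugging (A.3.2) into (A.3.1), we obtain $E\{\mu_{p,1}(1)\}$ by the fact that
$\pi_{p,1} + \pi_{p,2} = 1$, hence equality (ii).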
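A small Monte Carlo sketch can make equalities (i) and (ii) concrete. It assumes one simple way of generating data that satisfy exchangeability conditional on N: practice labels are assigned by a fair coin that is independent of the practice sizes, while the sizes themselves are deliberately not exchangeable. The distributions, sample sizes and variable names are all hypothetical, and $\pi_{p,c}$ is taken to be the share of patients in practice c.

```python
import numpy as np

rng = np.random.default_rng(7)
n_pairs = 200_000

# Practice sizes, deliberately NOT exchangeable between practices c = 1 and c = 2.
n1 = rng.integers(20, 100, size=n_pairs)
n2 = rng.integers(100, 200, size=n_pairs)
pi1 = n1 / (n1 + n2)
pi2 = 1.0 - pi1

# Two practice-level means under treatment 1, with different distributions.
m_a = rng.normal(0.0, 1.0, size=n_pairs)
m_b = rng.normal(2.0, 1.0, size=n_pairs)

# Labels assigned by a fair coin independent of the sizes, so that, given N,
# the potential-outcome means are exchangeable between the two practices.
swap = rng.random(n_pairs) < 0.5
mu1 = np.where(swap, m_b, m_a)
mu2 = np.where(swap, m_a, m_b)

mean_mu1 = mu1.mean()                        # estimates E{mu_{p,1}(1)}
mean_mu2 = mu2.mean()                        # estimates E{mu_{p,2}(1)}
mean_pooled = (pi1 * mu1 + pi2 * mu2).mean() # estimates E{mu_p(1)}
print(mean_mu1, mean_mu2, mean_pooled)
```

Up to Monte Carlo error the three averages agree, even though the practice sizes differ systematically between the two labels.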
A3.3 Proof of part (a) in Result 2
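We are to show that $E\{\mu^{\text{calibr}}_{p,c}\} = E\{\mu_p(c)\}$ for c = 1, 2. We only
prove that it holds for c = 1. Denote the conditioning events
$\{G_{p,c}(\cdot), N_{p,c}, c = 1, 2\}$ as $G \cap N$. We have

$$
\begin{aligned}
E_p\{\mu^{\text{calibr}}_{p,1}\}
&= E_p\left\{\int \mu_{p,1}(x; 1)\, dG_p(x)\right\} \\
&= E_{G \cap N}\left[E_{p \mid G \cap N}\left\{\int_x \mu_{p,1}(x; 1)\, dG_p(x)
   \,\Big|\, G \cap N\right\}\right] \\
&= E_{G \cap N}\left[\int_x E_{p \mid G \cap N}\{\mu_{p,1}(x; 1) \mid G \cap N\}\,
   dG_p(x)\right],
\end{aligned}
\tag{A.3.3}
$$

where the integral indicated by $\int_x$ can be further expanded using the definition of
$G_p(\cdot)$,

$$
\int_x \pi_{p,1} E_{p \mid G \cap N}\{\mu_{p,1}(x; 1) \mid G \cap N\}\, dG_{p,1}(x)
+ \int_x \pi_{p,2} E_{p \mid G \cap N}\{\mu_{p,\underline{1}}(x; 1) \mid G \cap N\}\,
  dG_{p,2}(x).
$$

The exchangeability Condition 3 (exchangeability conditional on $G \cap N$ is sufficient) now
implies that we may change the underlined 1 to 2 without changing the value of the
expression. Hence, after changing the order of $\int_x$ and $E_{p \mid G \cap N}$ and
marginalizing over covariates x in both terms, we further simplify it to
$\pi_{p,1} E_{p \mid G \cap N}\{\mu_{p,1}(1) \mid G \cap N\}
+ \pi_{p,2} E_{p \mid G \cap N}\{\mu_{p,2}(1) \mid G \cap N\}$. Plugging it into (A.3.3) and
recalling the definition of $\mu_p(1)$ in equation (5.2.1), we have that

$$
\text{(A.3.3)} = E_{G \cap N}\bigl[E_{p \mid G \cap N}\{\mu_p(1)\}\bigr] = E_p\{\mu_p(1)\}.
$$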
Bibliography
Aitchison, J. (1986). The statistical analysis of compositional data. Chapman & Hall, Ltd.
Aitkin, M., Anderson, D., and Hinde, J. (1981). Statistical modelling of data on teaching
styles. Journal of the Royal Statistical Society. Series A (General), 144(4):419–461.
Albert, P. and Dodd, L. (2008). On estimating diagnostic accuracy from studies with mul-
tiple raters and partial gold standard evaluation. Journal of the American Statistical
Association, 103(481):61–73.
Albert, P., McShane, L., and Shih, J. (2001). Latent class modeling approaches for as-
sessing diagnostic error without a gold standard: with applications to p53 immunohisto-
chemical assays in bladder tumors. Biometrics, 57(2):610–619.
Albert, P. S. and Dodd, L. E. (2004). A cautionary note on the robustness of latent class
models for estimating diagnostic error without a gold standard. Biometrics, 60(2):427–
435.
Almirall, D., Compton, S. N., Gunlicks-Stoessel, M., Duan, N., and Murphy, S. A. (2012).
Designing a pilot sequential multiple assignment randomized trial for developing an
adaptive treatment strategy. Statistics in Medicine, 31(17):1887–1902.
ASCO (2007). What to know: ASCO’s guideline on tumor markers for breast cancer.
Comprehensive Cancer Network. Clinical Practice Guidelines in Oncology V.2.
Bandeen-Roche, K., Miglioretti, D. L., Zeger, S. L., and Rathouz, P. J. (1997). Latent
variable regression for multiple discrete outcomes. Journal of the American Statistical
Association, 92(440):1375–1386.
Barker, A., Sigman, C., Kelloff, G., Hylton, N., Berry, D., and Esserman, L. (2009).
I-SPY 2: an adaptive breast cancer trial design in the setting of neoadjuvant chemotherapy.
Clinical Pharmacology & Therapeutics, 86(1):97–100.
Berry, S. M., Carlin, B. P., Lee, J. J., and Muller, P. (2010). Bayesian adaptive methods for
clinical trials. CRC Press.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal
of the Royal Statistical Society: Series B (Statistical Methodology), 36(2):192–236.
Bhattacharya, A. and Dunson, D. B. (2011). Sparse Bayesian infinite factor models.
Biometrika, 98(2):291–306.
Bhattacharya, A. and Dunson, D. B. (2012). Simplex factor models for multivariate un-
ordered categorical data. Journal of the American Statistical Association, 107(497):362–
377.
Blackwelder, W., Biswas, K., Wu, Y., Kotloff, K., Farag, T., Nasrin, D., Graubard, B.,
Sommerfelt, H., and Levine, M. (2012). Statistical methods in the Global Enteric Multicenter
Study (GEMS). Clinical Infectious Diseases, 55(suppl 4):S246–S253.
Blei, D., Ng, A., and Jordan, M. (2003). Latent Dirichlet allocation. The Journal of Machine
Learning Research, 3:993–1022.
Boult, C., Leff, B., Boyd, C., Wolff, J., Marsteller, J., Frick, K., Wegener, S., Reider, L.,
Frey, K., Mroz, T., Karm, L., and Scharfstein, D. (2013). A matched-pair cluster-randomized
trial of guided care for multi-morbid older patients. Journal of General
Internal Medicine, 28:612–621.
Brinkley, J., Tsiatis, A., and Anstrom, K. J. (2010). A generalized estimator of the at-
tributable benefit of an optimal treatment regime. Biometrics, 66(2):512–522.
Brooks, S. and Gelman, A. (1998). General methods for monitoring convergence of itera-
tive simulations. Journal of Computational and Graphical Statistics, 7(4):434–455.
Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. (2011). Handbook of Markov Chain
Monte Carlo. CRC Press.
Bruzzi, P., Green, S., Byar, D., Brinton, L., and Schairer, C. (1985). Estimating the popula-
tion attributable risk for multiple risk factors using case-control data. American Journal
of Epidemiology, 122(5):904–914.
Cai, T., Tian, L., Wong, P. H., and Wei, L. (2011). Analysis of randomized comparative
clinical trial data for personalized treatment selections. Biostatistics, 12(2):270–282.
Carey, V., Zeger, S. L., and Diggle, P. (1993). Modelling multivariate binary data with
alternating logistic regressions. Biometrika, 80(3):517–526.
Chacón, J., Mateu-Figueras, G., and Martín-Fernández, J. (2011). Gaussian kernels for
density estimation with compositional data. Computers & Geosciences, 37(5):702–711.
Clayton, D. G. (1996). Generalized linear mixed models. In Markov Chain Monte Carlo
in Practice, pages 275–301. Springer.
Clive, J., Woodbury, M. A., and Siegler, I. C. (1983). Fuzzy and crisp set-theoretic-based
classification of health and disease. Journal of Medical Systems, 7(4):317–332.
Connor, J. T. (2006). Multivariate mixture models to describe longitudinal patterns of
frailty in American seniors. PhD thesis, Carnegie Mellon University.
Corder, L. S. and Manton, K. G. (1991). National surveys and the health and functioning
of the elderly: The effects of design and content. Journal of the American Statistical
Association, 86(414):513–525.
Daniels, M. J. (1999). A prior for the variance in hierarchical models. Canadian Journal
of Statistics, 27(3):567–578.
Dawid, A. (1979). Conditional independence in statistical theory. Journal of the Royal
Statistical Society. Series B (Methodological), 41(1):1–31.
Dayton, C. M. and Macready, G. B. (1988). Concomitant-variable latent-class models.
Journal of the American Statistical Association, 83(401):173–178.
De Lathauwer, L., De Moor, B., and Vandewalle, J. (2000). A multilinear singular value
decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4):1253–1278.
Deloria-Knoll, M., Feikin, D., Scott, J., O’Brien, K., DeLuca, A., Driscoll, A., Levine, O.,
et al. (2012). Identification and selection of cases and controls in the Pneumonia Etiology
Research for Child Health project. Clinical Infectious Diseases, 54(suppl 2):S117–S123.
DerSimonian, R. and Laird, N. (1986). Meta-analysis in clinical trials. Controlled Clinical
Trials, 7(3):177–188.
Dillon, W. R. and Mulani, N. (1984). A probabilistic latent class model for assessing inter-
judge reliability. Multivariate Behavioral Research, 19(4):438–458.
Donner, A., Taljaard, M., and Klar, N. (2007). The merits of breaking the matches: A
cautionary tale. Statistics in Medicine, 26(9):2036–2051.
Dunson, D. and Xing, C. (2009). Nonparametric Bayes modeling of multivariate categorical
data. Journal of the American Statistical Association, 104(487):1042–1051.
Eaton, W. W., Dryman, A., Sorenson, A., and McCutcheon, A. (1989). DSM-III major
depressive disorder in the community: a latent class analysis of data from the NIMH
Epidemiologic Catchment Area programme. The British Journal of Psychiatry, 155(1):48–54.
Erosheva, E. A., Fienberg, S. E., and Joutard, C. (2007). Describing disability through
individual-level mixture models for multivariate binary data. The Annals of Applied
Statistics, 1(2):346–384.
Feikin, D., Scott, J., and Gessner, B. (2014). Use of vaccines as probes to define disease
burden. The Lancet, 383(9930):1762–1770.
Feng, Z., Diehr, P., Peterson, A., and McLerran, D. (2001). Selected statistical issues in
group randomized trials. Annual Review of Public Health, 22(1):167–187.
Flegal, J. M., Haran, M., and Jones, G. L. (2008). Markov chain Monte Carlo: Can we trust
the third significant figure? Statistical Science, 23(2):250–260.
Foster, J. C., Taylor, J. M., and Ruberg, S. J. (2011). Subgroup identification from random-
ized clinical trial data. Statistics in Medicine, 30(24):2867–2880.
Garrett, E. and Zeger, S. (2000). Latent class model diagnosis. Biometrics, 56(4):1055–
1067.
Gelfand, A. E. and Sahu, S. K. (1999). Identifiability, improper priors, and Gibbs
sampling for generalized linear models. Journal of the American Statistical Association,
94(445):247–253.
Gelfand, A. E. and Solomon, H. (1973). A study of Poisson’s models for jury verdicts in
criminal and civil trials. Journal of the American Statistical Association, 68(342):271–278.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013).
Bayesian data analysis. CRC Press.
Gelman, A., Meng, X.-L., and Stern, H. (1996). Posterior predictive assessment of model
fitness via realized discrepancies. Statistica Sinica, 6(4):733–760.
Genentech (Accessed August 2nd, 2014). http://www.gene.com/patients/disease-
education/breast-cancer.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996). Introducing Markov chain
Monte Carlo. In Markov Chain Monte Carlo in Practice, pages 1–19. Springer.
Good, I. J. (1969). Some applications of the singular decomposition of a matrix. Techno-
metrics, 11(4):823–831.
Goodman, L. (1974). Exploratory latent structure analysis using both identifiable and
unidentifiable models. Biometrika, 61(2):215–231.
Gunter, L., Zhu, J., and Murphy, S. (2007). Variable selection for optimal decision making.
In Artificial Intelligence in Medicine, pages 149–154. Springer.
Gunter, L., Zhu, J., and Murphy, S. (2011). Variable selection for qualitative interactions in
personalized medicine while controlling the family-wise error rate. Journal of Biophar-
maceutical Statistics, 21(6):1063–1078.
Gustafson, P. (2005). On model expansion, model contraction, identifiability and prior
information: two illustrative scenarios involving mismeasured variables. Statistical
Science, 20(2):111–140.
Gustafson, P. (2009). What are the limits of posterior distributions arising from nonidenti-
fied models, and why should we care? Journal of the American Statistical Association,
104(488):1682–1695.
Gustafson, P., Le, N., and Saskin, R. (2001). Case–control analysis with partial knowledge
of exposure misclassification probabilities. Biometrics, 57(2):598–609.
Guyatt, G. H., Keller, J. L., Jaeschke, R., Rosenbloom, D., Adachi, J. D., and Newhouse,
M. T. (1990). The N-of-1 randomized controlled trial: clinical usefulness. Our three-year
experience. Annals of Internal Medicine, 112(4):293–299.
Haberman, S. J. (1979). Analysis of qualitative data. vol. 2, new developments. Academic
Press.
Haberman, S. J. (1995). Book review of statistical applications using fuzzy sets, by Kenneth
G. Manton, Max A. Woodbury, and H. Dennis Tolley. Journal of the American Statistical
Association, 90:1131–1133.
Hammitt, L., Kazungu, S., Morpeth, S., Gibson, D., Mvera, B., Brent, A., Mwarumba,
S., Onyango, C., Bett, A., Akech, D., et al. (2012). A preliminary study of pneumonia
etiology among hospitalized children in Kenya. Clinical Infectious Diseases, 54(suppl
2):S190–S199.
Hill, J. and Scott, M. (2009). Comment: The essential role of pair matching. Statistical
Science, 24(1):54.
Hjort, N. L., Holmes, C., Müller, P., and Walker, S. G. (2010). Bayesian nonparametrics.
Cambridge University Press.
Hoff, P. D. (2005). Subset clustering of binary sequences, with an application to genomic
abnormality data. Biometrics, 61(4):1027–1036.
Huang, G.-H. and Bandeen-Roche, K. (2004). Building an identifiable latent class model
with covariate effects on underlying and measured variables. Psychometrika, 69(1):5–
32.
Huang, Y., Gilbert, P. B., and Janes, H. (2012). Assessing treatment-selection markers
using a potential outcomes framework. Biometrics, 68(3):687–696.
Hui, S. and Walter, S. (1980). Estimating the error rates of diagnostic tests. Biometrics,
36(1):167–171.
Imai, K., King, G., and Nall, C. (2009). The essential role of pair matching in cluster-
randomized experiments, with application to the Mexican universal health insurance
evaluation. Statistical Science, 24(1):29–53.
Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors.
Journal of the American Statistical Association, 96(453):161–173.
Ishwaran, H. and James, L. F. (2002). Approximate Dirichlet process computing in finite
normal mixtures. Journal of Computational and Graphical Statistics, 11(3):508–532.
Jokinen, J. and Scott, J. A. G. (2010). Estimating the proportion of pneumonia attributable
to pneumococcus in Kenyan adults: latent class analysis. Epidemiology (Cambridge,
Mass.), 21(5):719–725.
Jones, G., Johnson, W., Hanson, T., and Christensen, R. (2010). Identifiability of models
for multiple diagnostic testing in the absence of a gold standard. Biometrics, 66(3):855–
863.
Kadane, J. (1974). The role of identification in Bayesian theory. Studies in Bayesian
Econometrics and Statistics, pages 175–191.
Kessler, D. C., Hoff, P. D., and Dunson, D. B. (2014). Marginally specified priors for
non-parametric Bayesian estimation. Journal of the Royal Statistical Society: Series B
(Statistical Methodology), doi: 10.1111/rssb.12059.
King, G. and Lu, Y. (2008). Verbal autopsy methods with multiple causes of death. Statis-
tical Science, 23(1):78–91.
King, G., Lu, Y., and Shibuya, K. (2010). Designing verbal autopsy studies. Population
Health Metrics, 8:19.
Kullback, S. (2012). Information theory and statistics. Courier Dover Publications.
Lavori, P. W. and Dawson, R. (2000). A design for testing clinical strategies: biased adap-
tive within-subject randomization. Journal of the Royal Statistical Society: Series A
(Statistics in Society), 163(1):29–38.
Lazarsfeld, P. F. and Henry, N. W. (1968). Latent structure analysis. Houghton, Mifflin.
Levine, O., O’Brien, K., Deloria-Knoll, M., Murdoch, D., Feikin, D., DeLuca, A., Driscoll,
A., Baggett, H., Brooks, W., Howie, S., et al. (2012). The pneumonia etiology research
for child health project: A 21st century childhood pneumonia etiology study. Clinical
Infectious Diseases, 54(suppl 2):S93–S101.
Liang, K.-Y., Zeger, S. L., and Qaqish, B. (1992). Multivariate regression analyses for
categorical data. Journal of the Royal Statistical Society. Series B (Methodological),
54(1):3–40.
Liu, F., Bayarri, M., Berger, J., et al. (2009). Modularization in Bayesian analysis, with
emphasis on analysis of computer models. Bayesian Analysis, 4(1):119–150.
Liu, L., Johnson, H. L., Cousens, S., Perin, J., Scott, S., Lawn, J. E., Rudan, I., Campbell,
H., Cibulskis, R., Li, M., et al. (2012). Global, regional, and national causes of child
mortality: an updated systematic analysis for 2010 with time trends since 2000. The
Lancet, 379(9832):2151–2161.
Lunn, D., Best, N., Spiegelhalter, D., Graham, G., and Neuenschwander, B. (2009).
Combining MCMC with ‘sequential’ PKPD modelling. Journal of Pharmacokinetics and
Pharmacodynamics, 36(1):19–38.
Manrique-Vallier, D. (2010). Longitudinal Mixed Membership Models with Applications
to Disability Survey Data. PhD thesis, Carnegie Mellon University.
Manton, K. G., Tolley, H. D., and Woodbury, M. A. (1994). Statistical applications using
fuzzy sets. New York: John Wiley & Sons, cop.
McCutcheon, A. L. (1987). Latent class analysis. Number 64. Sage.
McHugh, R. B. (1956). Efficient estimation and local identification in latent class analysis.
Psychometrika, 21(4):331–347.
Murdoch, D., O’Brien, K., Driscoll, A., Karron, R., Bhat, N., et al. (2012). Laboratory
methods for determining pneumonia etiology in children. Clinical Infectious Diseases,
54(suppl 2):S146–S152.
Murphy, S. A. (2005). An experimental design for the development of adaptive treatment
strategies. Statistics in Medicine, 24(10):1455–1481.
Murphy, S. A., Lynch, K. G., Oslin, D., McKay, J. R., and TenHave, T. (2007). Developing
adaptive treatment strategies in substance abuse research. Drug and Alcohol Depen-
dence, 88(Suppl 2):S24–S30.
National Cancer Institute (Accessed August 2nd, 2014). Breast cancer treatment.
http://www.cancer.gov/cancertopics/pdq/treatment/breast/patient.
Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models.
Journal of Computational and Graphical Statistics, 9(2):249–265.
Pati, D., Bhattacharya, A., Pillai, N. S., Dunson, D., et al. (2014). Posterior contraction in
sparse Bayesian factor models for massive covariance matrices. The Annals of Statistics,
42(3):1102–1130.
Pelham Jr, W. E. and Fabiano, G. A. (2008). Evidence-based psychosocial treatments for
attention-deficit/hyperactivity disorder. Journal of Clinical Child & Adolescent Psychol-
ogy, 37(1):184–214.
Pepe, M. S. and Janes, H. (2007). Insights into latent class analysis of diagnostic test
performance. Biostatistics, 8(2):474–484.
Pritchard, J. K., Stephens, M., and Donnelly, P. (2000). Inference of population structure
using multilocus genotype data. Genetics, 155(2):945–959.
Qian, M. and Murphy, S. A. (2011). Performance guarantees for individualized treatment
rules. The Annals of Statistics, 39(2):1180–1210.
Qu, Y. and Hadgu, A. (1998). A model for evaluating sensitivity and specificity for corre-
lated diagnostic tests in efficacy studies with an imperfect reference test. Journal of the
American Statistical Association, 93(443):920–928.
Robert, C. P. and Casella, G. (1999). Monte Carlo statistical methods. Springer.
Robins, J. M. (2004). Optimal structural nested models for optimal sequential decisions. In
Proceedings of the Second Seattle Symposium in Biostatistics, pages 189–326. Springer.
Rosenblum, M., Liu, H., and Yen, E.-H. (2013). Optimal tests of treatment ef-
fects for the overall population and two subpopulations in randomized trials, us-
ing sparse linear programming. Journal of the American Statistical Association,
doi:10.1080/01621459.2013.879063.
Rosenblum, M. and van der Laan, M. J. (2010). Simple, efficient estimators of treatment ef-
fects in randomized trials using generalized linear models to leverage baseline variables.
International Journal of Biostatistics, 6.
Rosenblum, M. and van der Laan, M. J. (2011). Optimizing randomized trial designs to
distinguish which subpopulations benefit from treatment. Biometrika, 98(4):845–860.
Rubin, D. (1974). Estimating causal effects of treatments in randomized and nonrandomized
studies. Journal of Educational Psychology, 66(5):688–701.
Rubin, D. (1978). Bayesian inference for causal effects: The role of randomization. The
Annals of Statistics, 6(1):34–58.
Senn, S. (2002). Cross-over trials in clinical research, volume 5. John Wiley & Sons.
Shashua, A. and Hazan, T. (2005). Non-negative tensor factorization with applications to
statistics and computer vision. In Proceedings of the 22nd International Conference on
Machine Learning, pages 792–799. ACM.
Shinohara, R. T., Frangakis, C. E., and Lyketsos, C. G. (2012). A broad symmetry criterion
for nonparametric validity of parametrically based tests in randomized trials. Biometrics,
68(1):85–91.
Singer, B. (1989). Grade of membership representations: Concepts and problems. Prob-
ability, Statistics, and Mathematics: Papers in Honor of Samuel Karlin, TW Andersen,
KB Athreya and DL Iglehart, eds., Academic Press, Inc, New York, pages 317–334.
Spiegelhalter, D., Thomas, A., Best, N., and Lunn, D. (2003). WinBUGS user manual.
Strecher, V. J., McClure, J. B., Alexander, G. L., Chakraborty, B., Nair, V. N., Konkel,
J. M., Greene, S. M., Collins, L. M., Carlier, C. C., Wiese, C. J., et al. (2008). Web-
based smoking-cessation programs: results of a randomized trial. American Journal of
Preventive Medicine, 34(5):373–381.
Sullivan, P. F., Kessler, R. C., and Kendler, K. S. (1998). Latent class analysis of lifetime
depressive symptoms in the national comorbidity survey. American Journal of Psychia-
try, 155(10):1398–1406.
Thall, P. F., Millikan, R. E., Sung, H.-G., et al. (2000). Evaluating multiple treatment
courses in clinical trials. Statistics in Medicine, 19(8):1011–1028.
Thall, P. F., Sung, H.-G., and Estey, E. H. (2002). Selecting therapeutic strategies based
on efficacy and death in multicourse clinical trials. Journal of the American Statistical
Association, 97(457):29–39.
Thompson, S., Pyke, S., and Hardy, R. (1997). The design and analysis of paired cluster
randomized trials: an application of meta-analysis techniques. Statistics in Medicine,
16(18):2063–2079.
Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis. Psychome-
trika, 31(3):279–311.
Uebersax, J. S. (1988). Validity inferences from interobserver agreement. Psychological
Bulletin, 104(3):405–416.
Uebersax, J. S. (1997). Analysis of student problem behaviors with latent trait, latent class,
and related probit mixture models. Applications of Latent Trait and Latent Class Models
in the Social Sciences, J. Rost and R. Langeheine, eds., Waxmann, New York, NY, pages
188–195.
Uebersax, J. S. and Grove, W. M. (1993). A latent trait finite mixture model for the analysis
of rating agreement. Biometrics, 49(3):823–835.
Wang, Z., Zhou, X., and Wang, M. (2011). Evaluation of diagnostic accuracy in detecting
ordered symptom statuses without a gold standard. Biostatistics, 12(3):567–581.
Ware, J. E. and Kosinski, M. (2001). Interpreting SF-36 summary health measures: A
response. Quality of Life Research, 10(5):405–413.
Warren, J., Fuentes, M., Herring, A., and Langlois, P. (2012). Spatial-temporal modeling
of the association between air pollution exposure and preterm birth: Identifying critical
windows of exposure. Biometrics, 68(4):1157–1167.
Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4):279–292.
Watkins, C. J. C. H. (1989). Learning from delayed rewards. PhD thesis, University of
Cambridge.
Woodbury, M. A., Clive, J., and Garson Jr, A. (1978). Mathematical typology: a grade
of membership technique for obtaining disease definition. Computers and Biomedical
Research, 11(3):277–298.
Wu, Z., Deloria-Knoll, M., Hammitt, L., Zeger, S., and for the PERCH Core Team (2014a).
Partially-latent class models (pLCM) for case-control studies of childhood pneumonia
etiology. Johns Hopkins University, Dept. of Biostatistics Working Papers, Working
Paper 267. http://biostats.bepress.com/jhubiostat/paper267.
Wu, Z., Frangakis, C. E., Louis, T. A., and Scharfstein, D. O. (2014b). Estimation of treat-
ment effects in matched-pair cluster randomized trials by calibrating covariate imbalance
between clusters. Biometrics, doi: 10.1111/biom.12214.
Xu, J. and Zeger, S. (2001). The evaluation of multiple surrogate endpoints. Biometrics,
57(1):81–87.
Young, M. A. (1983). Evaluating diagnostic criteria: a latent class paradigm. Journal of
Psychiatric Research, 17(3):285–296.
Zeger, S. and Karim, M. (1991). Generalized linear models with random effects; a Gibbs
sampling approach. Journal of the American Statistical Association, 86(413):79–86.
Zhang, B., Tsiatis, A. A., Laber, E. B., and Davidian, M. (2012). A robust method for
estimating optimal treatment regimes. Biometrics, 68(4):1010–1018.
Zhang, Y. and Liu, J. S. (2007). Bayesian inference of epistatic interactions in case-control
studies. Nature Genetics, 39(9):1167–1173.
Zhao, L., Tian, L., Cai, T., Claggett, B., and Wei, L.-J. (2013). Effectively selecting a
target population for a future comparative study. Journal of the American Statistical
Association, 108(502):527–539.
Zhao, Y., Zeng, D., Rush, A. J., and Kosorok, M. R. (2012). Estimating individualized
treatment rules using outcome weighted learning. Journal of the American Statistical
Association, 107(499):1106–1118.
Zhou, X., Liu, S., Kim, E. S., Herbst, R. S., and Lee, J. J. (2008). Bayesian adaptive design
for targeted therapy development in lung cancer: a step toward personalized medicine.
Clinical Trials, 5(3):181–193.
Zigler, C. M. and Dominici, F. (2014). Uncertainty in propensity score estimation:
Bayesian methods for variable selection and model-averaged causal effects. Journal
of the American Statistical Association, 109(505):95–107.
CURRICULUM VITAE
ZHENKE WU
615 N. Wolfe St. E3136
Baltimore, MD 21205
http://www.biostat.jhsph.edu/~zhwu
Date of Birth: Apr 29th, 1988
Place of Birth: Chun’an, Zhejiang, China
EDUCATION
2009 - 2014 Johns Hopkins Bloomberg School of Public Health, Baltimore, MD
Ph.D. in Biostatistics
Thesis title: Statistical Methods for Individualized Health: Etiology, Di-
agnosis, and Intervention Evaluation
Advisor: Prof. Scott Zeger
2009 Fudan University, Shanghai, China
B.Sc. in Mathematics
PROFESSIONAL EXPERIENCE
2013 - present External Statistical Advisor
Child Health Research Foundation (CHRF), Dhaka, Bangladesh, and
National Center for Immunization and Respiratory Diseases (NCIRD),
The U.S. CDC
2010 - present Research Assistant/Statistician
International Vaccine Access Center (IVAC), Johns Hopkins
Bloomberg School of Public Health
Advisor: Prof. Scott Zeger; Principal Investigator: Prof. Katherine
O’Brien
2008 Research Scholar
California NanoSystems Institute, and Department of Mechanical and
Aerospace Engineering, University of California, Los Angeles
2007 - 2009 Research Scholar
Center for Computational Systems Biology, Fudan University, Shang-
hai, China
HONORS AND AWARDS
JOHNS HOPKINS UNIVERSITY
2014 First Place: Biostatistics Section of the Delta Omega Poster Competition
2013 Joseph Zeger Conference Travel Award
2012 June B. Culley Award, for outstanding achievement on school-wide oral
exam paper
2011-14 Hopkins Sommer Scholar
2009-14 Department of Biostatistics Graduate Fellowship
FUDAN UNIVERSITY
2009 B.Sc. with First Class Honors
2007-09 Chun-Tsung Scholar, Chinese Undergraduate Research Endowment
(CURE) Scholarship
2008 First Class National Scholarship, Ministry of Education, China
2007 Excellent Undergraduate Student, Government of Shanghai
2006-07 First Class People’s Scholarship
2006 First Class Shi Dai Scholarship
PUBLICATIONS
PUBLISHED/SUBMITTED
Wu Z, Frangakis CE, Louis TA, Scharfstein DO (2014). Estimating Treatment Effects in
Cluster Randomized Trials by Calibrating Covariate Imbalances between Clusters.
Biometrics. doi: 10.1111/biom.12214.
Georgiades C, Geschwind J-F, Neil H, Hines-Peralta A, Liapi E, Hong K, Wu Z, Kamel I,
Frangakis CE (2012). Lack of response after initial chemoembolization for hepatocellular
carcinoma: Does it predict failure of subsequent treatment? Radiology 265:115-123.
Wu Z, Deloria-Knoll M, Hammitt LL, and Zeger SL (2014). Partially Latent Class Models
(pLCM) for Case-Control Studies of Childhood Pneumonia Etiology.
(http://biostats.bepress.com/jhubiostat/paper267/)
Frangakis CE, Qian T, Wu Z, Diaz I (2014). Deductive Derivation and Computerization
of Compatible Semiparametric Efficient Estimation. Revision Invited for Biometrics.
(http://biostats.bepress.com/ucbbiostat/paper324/).
WORKING PAPERS
Wu Z, Zeger SL. Nested Partially-Latent Class Models (npLCM) for Estimating Disease
Etiology in Case-Control Studies.
Wu Z, Zeger SL. Partial Latent Class Model in Regression Analysis.
PRESENTATIONS (∗upcoming)
2014 Nested Partially Latent Class Models (npLCM) for Case-Control Studies of
Childhood Pneumonia Etiology. Pneumonia Etiology Research for Child
Health (PERCH) Executive Committee Meeting. December 2, London,
England.∗
2014 Nested Partially Latent Class Models (npLCM) for Case-Control Studies of
Childhood Pneumonia Etiology. Joint Statistical Meetings. August 7, Boston,
MA. (Topic contributed)
2014 Estimating Treatment Effects in Cluster Randomized Trials by Calibrating
Covariate Imbalances between Clusters. Eastern North American Regional meeting
of the International Biometric Society. March 18, Baltimore, MD. (Topic
contributed)
2013 Estimating Infectious Etiology from Hierarchical Dirichlet Process Perspective.
Pneumonia Etiology Research for Child Health (PERCH) Executive Committee
Meeting. December 2, London, England.
2013 Partially Latent Class Models (pLCM) for Case-Control Studies of Childhood
Pneumonia Etiology. US Centers for Disease Control and Child Health Research
Foundation: Aetiology of Neonatal Infection in South Asia (ANISA)
Project Committee Meeting. November 10, San Diego, CA.
2013 Estimating Treatment Effects in Cluster Randomized Trials by Calibrating
Covariate Imbalances between Clusters. Joint Statistical Meetings. August 4,
Montreal, QC, Canada. (Topic contributed)
2013 Hierarchical Bayesian Model for Combining Information from Multiple Biological
Samples with Measurement Errors: An Application to Children Pneumonia
Etiology Study. Eastern North American Regional meeting of the International
Biometric Society. March 12, Orlando, FL. (Topic contributed)
2012 Revealing and Addressing Existing Basic Inadequacies in the Use of Paired
Cluster Randomized Trials. Department of Biostatistics. Johns Hopkins
Biostatistics Causal Inference Working Group. December 6, Baltimore, MD.
TEACHING
GUEST LECTURER
2012 A unified framework for high-dimensional analysis of M-estimators with
decomposable regularizers. Advanced Special Topics, 140.840: Large-scale
Inference, Prof. Han Liu
TEACHING ASSISTANT
2014 Multilevel Statistical Models, Graduate, 140.656.
2014 Analysis of Longitudinal Data, Graduate, 140.655.
2013 Biostatistics in Public Health, Undergraduate, 280.346, advanced. Prof.
Scott Zeger.
2013 Case-based Introduction to Biostatistics, www.coursera.org, Prof. Scott
Zeger.
2013 Bayesian Methods I-II, Graduate, 140.762-763, Prof. Gary Rosner.
2012 Biostatistics in Public Health, Undergraduate, 280.346, advanced. Prof.
Scott Zeger.
2011-12 Advanced Probability Theory I-II, Graduate, 550.620 - 621, Prof. James
Fill.
2010-11 Essentials of Probability and Statistical Inference I-IV, Graduate, 140.646-
649. Profs. Michael Rosenblum and Charles Rohde.
PROFESSIONAL ACTIVITIES
Co-Organizer Hopkins Biostatistics Student Journal Club, 2012-2013
Committee and treasurer Chinese Public Health Forum (CPHF) at Johns Hopkins,
2010-present
Volunteer ENAR Spring Meeting, Washington, DC, 2012
Representative and panelist Department of Biostatistics Student Recruitment
Committee, 2010-2012
Member Hopkins inHealth (HiH) Learning Methodologies Working
Group
JHSPH Causal Inference Working Group
Survival, Longitudinal, and Multilevel Modeling
(SLAM) Working Group
American Statistical Association (ASA), International
Chinese Statistical Association (ICSA), International
Biometric Society (ENAR), Institute of Mathematical
Statistics (IMS), American Public Health Association
(APHA)
Reviewer Journal of Business and Economic Statistics, Annals of
Statistics, Ophthalmic Epidemiology, International Conference
on Artificial Intelligence and Statistics (AISTATS),
Statistical Science