
STATISTICAL METHODS FOR INDIVIDUALIZED HEALTH:

ETIOLOGY, DIAGNOSIS, AND INTERVENTION EVALUATION

by

Zhenke Wu

A dissertation submitted to The Johns Hopkins University in conformity with the

requirements for the degree of Doctor of Philosophy.

Baltimore, Maryland

September, 2014

© Zhenke Wu 2014

All rights reserved

Abstract

The term individualized health represents a goal of the next generation of health systems:

to treat the right person, at the right place, at the right time, taking account of the

individual’s characteristics, circumstances and preferences. To advance this goal, a new

partnership of statistical and biomedical science is needed to intelligently use information

to better understand disease etiology, to improve diagnoses and treatment decisions and

to accurately evaluate health interventions. In two parts, this thesis addresses statistical

methods in support of the individualized health goal.

In Part I, the key objective is to characterize an individual’s underlying health state given

imprecise measurements. We assume that the health states can be usefully represented by

categorical latent variables. We describe a statistical framework, termed nested partially-

latent class models (npLCM), to estimate the population fraction of individuals in each

class, and to predict an individual’s health state given multivariate binary measurements

from case-control studies. We assume each observation is a draw from a mixture model

whose components represent latent health state classes. Conditional dependence among

the binary measurements on an individual is induced by nesting subclasses within each

latent health/disease class. Measurement precision and dependence among measurements

can be estimated using the control sample for whom the class is known. Model estimation,

model checking, and individual diagnosis are carried out using posterior samples drawn by

a Gibbs sampler. We illustrate the model using a subset of data from the motivating

Pneumonia Etiology Research for Child Health (PERCH) study, which examines the distribution

of pneumonia-causing bacterial or viral pathogens in developing countries.

The second part of this thesis focuses on improving the efficiency of estimating the effect

of an individualized intervention using data from matched-pair cluster randomized (MPCR)

designs, where person-level or cluster-level covariates are available. Covariate imbalances

between pairs are commonly observed under MPCR even after matching. We show that the

naive approaches that ignore such imbalance are biased. We propose a covariate-calibrated

approach to achieve both consistency and greater efficiency. We use the new method to

evaluate the effect of an individualized health care intervention in the Guided Care study.

Advisor:

Scott Zeger, PhD

Committee:

Gary Rosner, PhD (chair); Brian Caffo, PhD; Maria Deloria-Knoll, PhD

Alternates:

Roger Peng, PhD; Elizabeth Stuart, PhD

Acknowledgments

My advisor, Dr. Scott Zeger, has been incredibly supportive during my study at Hop-

kins. He led me into the pneumonia etiology project with extreme patience and original

insights. He encouraged me to go where data are, and demonstrated how to design and

communicate inventive statistical methods. His genuine enthusiasm for doing quality sci-

ence and constant pursuit of effective and creative methods have been, and will continue to

be, my source of inspiration in my future career. I am very fortunate to have worked with

him during my PhD program.

I would also like to thank another of my mentors, Dr. Constantine Frangakis, whose

support, guidance, and friendship have fostered my interest in clinical trials, causal infer-

ence, and many other areas.

This thesis has also benefited from discussions with many other professors. The detailed

comments from my thesis committee members, Drs. Gary Rosner, Brian Caffo, and

Maria Deloria-Knoll, were educational and have led to a greatly improved presentation of the

methods. I also thank Drs. Elizabeth Stuart and Roger Peng for their willingness to spend time

reading my thesis. Drs. Thomas Louis and Daniel Scharfstein have also generously offered

me much advice in better formulating and communicating methodological ideas during our

collaborations, which led to Chapter 5.

I would like to thank Jiawei Bai, a very close friend, and many other professors and

fellow students who have created a pleasant working place and helped me at Johns Hopkins.

The Pneumonia Etiology Research for Child Health (PERCH) study has helped me

understand the elements of a real scientific investigation. Direct communications with

medical doctors and epidemiologists have led to much improved statistical modeling over

the years. I am very fortunate to have collaborated with a team of wonderful scientists

led by Dr. Katherine O’Brien. Kate has been very supportive and responsive during our

model development phase. Her pursuit of scientifically interpretable statistical results has

also led to heated discussions in steering committee meetings. It was challenging as well

as educational for me to participate and contribute during these discussions.

More importantly, the weekly analysis meetings with Drs. Maria Deloria-Knoll, Laura

Hammitt, Daniel Feikin, and many other investigators have always been constructive and

fun. Our conversations have helped shape the statistical approach presented in this thesis.

I would also like to thank the PERCH coordinators, Wei Fu, Daniel Park, Christine Pros-

peri, Melissa Higdon, and Mengying Li for their strong support and trust that have helped

demonstrate the pneumonia etiology analysis. I also thank the members of PERCH Expert

Group who provided external advice to further improve the statistical models.

Finally, I am also grateful for the generous financial support from the Department of

Biostatistics, the Bill & Melinda Gates Foundation, and the Sommer Scholar program, which

made my life at Hopkins more pleasant.

On a personal note, I would like to thank my wife Ruoping Chai, my parents and

parents-in-law, Changlin Xu, Juan Wu, Ruizhen Sun, Shiduo Chai, for their love, encour-

agement and constant support through the five years seeking my degree.

This thesis is dedicated to you, and also to my son, Tyler.

Contents

Abstract
Acknowledgments
List of Tables
List of Figures

1 Introduction
1.1 Statistical challenges in individualized health
1.2 Organizational overview
1.3 Software

2 Latent Class Models
2.1 Brief history and formulation
2.2 Identifiability
2.3 Estimation by Markov chain Monte Carlo
2.4 Grade-of-Membership model
2.5 Approximation properties

3 Partially-Latent Class Models (pLCM) for Case-Control Studies of Childhood Pneumonia Etiology
3.1 Introduction
3.2 A partially-latent class model for multiple indirect measurements
3.2.1 Model identifiability
3.2.2 Parameter estimation and individual etiology prediction
3.3 Simulation for three pathogens case with GS and BrS data
3.4 Analysis of PERCH data
3.5 Discussion

4 Nested Partially-Latent Class Models (npLCM) for Estimating Disease Etiology in Case-Control Studies
4.1 Introduction
4.2 Model specification of npLCM
4.2.1 npLCM likelihood
4.2.2 Prior specifications
4.3 Model properties
4.3.1 Non-interference submodels
4.3.2 Mean and covariance structure
4.3.3 Alternate approaches to borrowing information from the control population
4.3.4 Modeling choices
4.4 Posterior computations
4.5 Simulation studies
4.6 Analysis of PERCH data
4.6.1 Estimation of etiologic fractions
4.6.2 Model checking
4.7 Discussion

5 Estimation of Treatment Effects in Matched-Pair Cluster Randomized Trials by Calibrating Covariate Imbalance Between Clusters with Application to Guided Care Study
5.1 Introduction
5.2 The goal and design using potential outcomes
5.3 Complications with existing methods
5.3.1 Consequences when ignoring covariates
5.3.2 Complications with existing covariate methods
5.4 Addressing the Problems
5.4.1 Calibration of observed covariate differences between clinical practices
5.4.2 Estimation of quantities of original interest
5.4.3 Assessment of the hypothesis of no effect
5.5 Discussion

6 Conclusions and Future Work

Appendices
A1 Appendix to Chapter 3
A1.1 Full conditional distributions in Gibbs sampler
A1.2 Pathogen names and their abbreviations
A1.3 Additional simulation results
A2 Appendix to Chapter 4
A2.1 Posterior computations
A2.2 Full pathogen names with abbreviations
A2.3 Stick-breaking prior
A2.4 Mean and covariance structure
A3 Appendix to Chapter 5
A3.1 Proof of Result 1
A3.2 Proof of Result (4)
A3.3 Proof of part (a) in Result 2

Bibliography

Curriculum Vitae

List of Tables

4.1 Results for simulated data sets separately fitted by the npLCM and pLCM.

5.1 Summary of average SF36 outcomes for uncalibrated versus calibrated approaches. The first row block displays sample sizes; the second row block displays average outcomes that are uncalibrated and calibrated, respectively.

5.2 Checking covariate imbalances within each pair. For a continuous covariate (indicated by (a)), we calculate effect size as difference divided by pooled standard deviation. For a categorical covariate (indicated by (b)), odds ratio is calculated comparing rates of occurrence of each category between two clusters in a pair. To prevent infinite odds ratio, 0.5 is added to all the cells when calculating sample odds ratios.

5.3 Results from MLE, profile MLE, Bayes estimates and permutation test in the Guided Care program study. The covariates used for calibration are listed in the first column of Table 5.2; the outcome is the physical component summary of the Short Form 36 (SF36). Results from different methods

List of Figures

3.1 Directed acyclic graph (DAG) illustrating relationships among lung infection state ($I^L$), imperfect lab measurements on the presence/absence of each of a list of pathogens at each site ($M^{NP}$, $M^{B}$ and $M^{L}$), disease outcome, and covariates ($X$). For a subject missing one or more of the three types of measurements, we remove the corresponding measurement component(s). For example, if a case does not have lung aspirate (LA) measurement, we remove $M^{L}$ from the DAG.

3.2 Population and individual etiology estimations for a single sample with 500 cases and 500 controls with true $\pi = (0.67, 0.26, 0.07)^T$ and either 1% ($N = 5$) or 10% ($N = 50$) GS data on cases. In (a) or (b), the red circled plus shows the true population etiology distribution $\pi$. The closed curves are 95 percent credible regions: blue dashed lines “- - -”, light green solid lines “—”, and black dotted lines “· · ·” correspond to analysis using BrS data only, BrS+GS data, and GS data only, respectively; the solid square/dot/triangle are the corresponding posterior means of $\pi$. The 95 percent highest density region of the uniform prior distribution is also visualized by red “· − ·−” for comparison. The 8 ($= 2^3$) BrS measurement patterns and predictions for individual children are shown with different shapes, with measurement patterns attached to them. The radii of circles and numbers at the vertices show empirical frequencies of GS measurements belonging to A, B, or C.

3.3 Results using expert priors on TPRs. The observed BrS rates (with 95% confidence intervals) for cases and controls are shown on the far left. The conditional odds ratio given the other pathogens is listed with 95% confidence interval in the box to the right of the BrS data summary. Below the case and control observed rates is a horizontal line with a triangle. From left to right, the line starts at the estimated false positive rate (FPR) and ends at the estimated true positive rate (TPR), both obtained from the model. Below the TPR are two boxplots summarizing its posterior (top) and prior (bottom) distributions. The location of the triangle, expressed as a fraction of the distance from the FPR to the TPR, is the model-based point estimate of the etiologic fraction for each pathogen. The SS data are shown in a similar fashion to the right of the BrS data. The observed rate for the cases is shown with its 95% confidence interval. The estimated SS TPR ($\theta^{SS}_j$) with prior and posterior distributions is shown as for the BrS data, except that we plot 95% and 50% credible intervals for SS TPR above the boxplot for its prior distribution. See Appendix for pathogen name abbreviations.

3.4 Results using uniform priors on TPRs. As in Figure 3.3, but with uniform priors on the TPRs.

3.5 Summary of posterior distribution of pneumonia etiology estimates using expert (left) and uniform (right) priors on TPRs. In each subfigure, top: posterior (solid) and prior (dashed) distribution of viral etiology; bottom left: posterior etiology distribution for top two bacterial causes given bacteria is a cause; bottom right: posterior etiology distribution for top two viral causes given virus is a cause. B-rest and V-rest stand for the rest of bacteria and viruses other than the top two species, respectively. The nested blue circles are 95%, 80%, and 50% credible regions for population etiology estimates within bacterial or viral group.

3.6 Posterior predictive checking for 10 most frequent BrS measurement patterns among cases and controls with expert priors on TPRs.

3.7 Posterior predictive checking for pairwise odds ratios separately for cases (lower right triangle) and controls (upper left triangle) with expert priors on TPRs. Each entry is a standardized log odds ratio (SLOR): the observed log odds ratio for a pair of BrS measurements minus the mean LOR for the posterior predictive distribution divided by the standard deviation of the posterior predictive distribution. The first significant digit of absolute SLORs are shown in red for positive and blue for negative values, and only those greater than 2 are shown.

4.1 Model structure that incorporates conditional dependence within each disease class, illustrated by $J = 5$ pathogens (called A, B, C, D, and E) in the PERCH study. On the left are the control measurements, which arise from a mixture of $K = 2$ conditionally independent subclass measurement profiles with mixing weights $\nu_1$ and $\nu_2$. Here $\psi_k^{(j)}$ is the false positive rate for pathogen $j$ in subclass $k$. On the right are the $J = 5$ disease classes, one for each possible pathogen. Each case is assumed to be caused by a unique pathogen indicated by $I^L$ taking values in $1, \ldots, J$. For a class containing all cases whose $I^L = j_0$, the $K = 2$ subclasses of measurement profiles are assumed equal to the control false positive rates $\psi_k^{(j)}$ for $j \neq j_0$, and equal to the true positive rate $\theta_k^{(j)}$ for $j = j_0$, $k = 1, \ldots, K$. Within each disease class, two subclass measurement profiles are nested. The mixing weights of subclasses nested in the $j$th disease class are $\eta_1^{(j)}$ and $\eta_2^{(j)}$. $\pi = (\pi_1, \ldots, \pi_J)'$ are disease class mixing weights, and are called etiologic fractions.

4.2 Directed acyclic graph for the npLCM.

4.3 Misclassification rate comparisons between the pLCM and npLCM predictions. 50 simulated training data sets are generated under (a) scenario I (pLCM), or (b) scenario II (npLCM). Each training data set is then fitted by the pLCM (clear boxplots) or npLCM (filled boxplots) to produce individual predictions. In (a) and (b), the first 5 pairs of boxplots compare class-specific misclassification rates; the last pair compares the overall misclassification rates.

4.4 Comparison of population etiologic fraction posterior distributions between the pLCM (black) and npLCM (blue). On the left, the positive observation rates for cases and controls are plotted for each pathogen using connected blue dots; “+” and “*” denote posterior mean of $\theta^{M}_j$ and $\psi^{M}_j$, respectively; the fitted case rate is indicated by “δ”. On the right, the blue/black curves, numbers, and credible intervals above the curves denote the marginal posterior density, mean, and 50% and 95% credible intervals for $\pi_j$, $j = 1, \ldots, 10$ for the pLCM/npLCM models.

4.5 Individual diagnoses for the most frequent measurement patterns among the cases, separately predicted from the pLCM and the npLCM.

4.6 Posterior predictive distributions for checking of pairwise log odds ratios for the controls (top) and the cases (bottom).

5.1 The underlying structure of the paired-cluster randomized design. The top part (observed pair $p$) and bottom part (observed pair $p'$) are the two possible ways in which a single pair can be manifested in the design. Observed pair $p$ has two clinical practices (represented by the two squares). For each clinical practice, the first row shows the mean and variance of patient outcomes if the clinical practice is assigned control and the second row shows the mean and variance if assigned intervention. The clinical practice actually assigned control is indicated by its placement in column “1”, and the clinical practice actually assigned intervention is in column “2”. The solid (nonsolid) ellipsoids show the means and variances that can (cannot) be estimated directly. Observed pair $p'$ shows how the same pair would be manifested in the design if the assignment of treatment to clinical practices were in reverse (a line with arrows connects the same clinical practice in these two different assignments). Condition 1 means that each of the two manifestations, $p$ and $p'$, has the same probability.

5.2 Checking second level dependence.

Chapter 1

Introduction

1.1 Statistical challenges in individualized health

Due to the biotechnology and information technology revolutions of the past two decades,

novel individual-level health information is accumulating at an unprecedented pace.

Biomedical and public health researchers seek to intelligently use this new information to advance

their understanding of the mechanisms of population health and disease, and to create,

evaluate and iteratively improve treatments for the right person at the right time and right

place.

For example, pneumonia is a clinical syndrome associated with lung infection that can

be caused by a variety of bacteria, viruses or fungi. Recent studies estimated that pneumonia

kills more children than any other illness, more than AIDS, malaria and measles combined.

Over 1 million children died from pneumonia in 2010, accounting for almost one in five

deaths among children under five years old (Liu et al., 2012). In 2009, the Pneumonia Etiology

Research for Child Health (PERCH) study (Levine et al., 2012) was launched

at the Johns Hopkins Bloomberg School of Public Health with the goal of 1) identifying

the top pneumonia-causing pathogens for children under five, and 2) establishing an evi-

dence base for future patient pneumonia diagnosis. The study sites encompass 7 countries

in South Asia and Sub-Saharan Africa. In PERCH, for the first time in pneumonia etiol-

ogy research, comprehensive and standardized bioassays and other biotechnology-enabled

tests are systematically used to assess the presence or absence of more than 30 pneumonia-

causing pathogens in body fluids.

In PERCH and other health studies, there are at least four statistical problems that are

central to advancing individualized health goals: estimation of population etiology, al-

gorithms for individual diagnosis, intervention selections optimized given an individual’s

characteristics, circumstances and preferences, and evaluation of individualized interven-

tions. Each is briefly considered in turn.

Population etiology estimation. Here, population disease etiology is defined as the

population distribution of health states. The health states are commonly not directly ob-

served, and, for most diseases, such as cancer, they are time-varying. For example, in

PERCH, the relevant health state for pneumonia cases is defined by the pathogen currently

infecting a child’s lung. To capture relevant aspects about these unobserved quantities, a

series of measurements are collected for analyses. Measurements can be biological (DNA

sequence, epigenetic marks, biomarkers), clinical (symptom reports, patient history,

medication use), environmental (exposures), or others, and can be of variable quality. In

statistical analysis to estimate the distribution of health states in a population, it is important

to integrate measures to account for differential measurement errors in the data collection

process.

Individual diagnosis. Individual diagnosis can be improved by embedding a new

patient within a subpopulation with similar observed characteristics, either through post-

stratification or by study design. This reference to population data is indispensable given

the current state of medical knowledge because we lack the “laws of biology” that can

reliably predict the current or future health states of an individual using only her low-

dimensional measurements. For example, in cancer screening, the posterior probability

that a new person has cancer given her positive test result (positive predictive value) can be

estimated from data on the subpopulation with similar covariate profiles. This embedding

process proceeds iteratively as new patients continuously enrich our database and serve

as potential references for future cases with similar characteristics. The patient’s decision

about how to use such a probabilistic diagnosis depends on her loss function given the

available intervention options.
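
The positive predictive value in the cancer screening example follows from Bayes' rule. The sketch below is a minimal illustration, assuming a binary test whose sensitivity and specificity in the reference subpopulation are known; the function name and all numeric inputs are hypothetical and chosen only for illustration.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test) in a subpopulation, by Bayes' rule."""
    true_positive = prevalence * sensitivity
    false_positive = (1.0 - prevalence) * (1.0 - specificity)
    return true_positive / (true_positive + false_positive)


# Hypothetical values for a subpopulation with a similar covariate profile.
ppv = positive_predictive_value(prevalence=0.01, sensitivity=0.90, specificity=0.95)
print(f"positive predictive value: {ppv:.3f}")
```

With a low prevalence, most positive results are false positives even for an accurate test, which is why the reference subpopulation matters for the diagnosis.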

For example, in the PERCH study, we define $I^L_i$ to be individual $i$’s true infection state in the

lung. It is the individual’s health state, taking values in $\{0, 1, \ldots, J\}$, where 0 denotes no

lung infection (observed for the controls) and $1, \ldots, J$ denote infection caused by each pathogen

on a prespecified list of potential causes (unobserved for the cases). An individual’s

latent health state and her measurements can be affected by a set of covariates $X_i$.

The study generates multivariate binary data $M_i$ from samples with potentially different

error rates. Using these data, the investigators want 1) to estimate the population etiologic

fractions $(\pi_1, \ldots, \pi_J)$, that is, the fractions of pneumonia cases caused by each of the $J$

pathogens, and 2) to facilitate individual diagnosis by calculating the posterior probability

that a pneumonia case is caused by each of the putative pathogens in light of the individual’s

and the similar subpopulation’s measurements, i.e., $p_{ij} = P(I^L_i = j \mid M_i, X_i, \text{Data})$.

Because clinicians may treat cases caused by different pathogens differently,

understanding the characteristics of the underlying health state, $I^L_i$, is essential for making

medical decisions.
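
As a rough illustration of the individual diagnosis calculation, the sketch below computes the posterior probability of each cause for a single case by Bayes' rule. It assumes one binary measurement per pathogen, conditional independence of the measurements given the cause, no covariates, and known etiologic fractions and error rates. All names and numbers are hypothetical. In the thesis these quantities are unknown and are estimated from posterior samples, so this plug-in version is only a simplified sketch.

```python
import numpy as np


def etiology_posterior(m, pi, tpr, fpr):
    """Return P(cause = j | m) for j = 0..J-1 for a single case.

    m   : length-J 0/1 vector, one measurement per pathogen
    pi  : length-J etiologic fractions (sum to 1)
    tpr : length-J true positive rates
    fpr : length-J false positive rates
    """
    m = np.asarray(m)
    pi = np.asarray(pi, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    fpr = np.asarray(fpr, dtype=float)
    J = len(pi)
    # rates[j, k]: probability that measurement k is positive when pathogen j is the cause.
    rates = np.tile(fpr, (J, 1))
    rates[np.arange(J), np.arange(J)] = tpr
    # Likelihood of the observed 0/1 pattern under each candidate cause.
    lik = np.prod(np.where(m == 1, rates, 1.0 - rates), axis=1)
    post = pi * lik
    return post / post.sum()


# Hypothetical inputs for J = 3 pathogens; the case tests positive for the first only.
pi = [0.5, 0.3, 0.2]
tpr = [0.9, 0.9, 0.9]
fpr = [0.1, 0.1, 0.1]
post = etiology_posterior([1, 0, 0], pi, tpr, fpr)
print(post)
```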

Treatment option selection. After individual diagnosis, patients and their clinicians

choose the most suitable treatments based on the individual’s estimated health state, the

current knowledge of intervention effects for the subpopulation with similar measured char-

acteristics, and the individual’s preferences (loss function).

For example, many forms of breast cancer are based in part on genetic characteristics,

such as human epidermal growth factor receptor 2 (HER2)-positive and HER2-negative,

each with different prognoses (ASCO, 2007). Drugs like trastuzumab (Herceptin) and

lapatinib (Tykerb) specifically target HER2 and are recommended for women whose

tumors are HER2-positive (National Cancer Institute, 2014). HER2-negative patients can be

eligible for other targeted medicine (Genentech, 2014). That is, we have a health state, the

HER2 positive/negative status, on an individual that can guide the selection of therapies. If

we can estimate the health states well through screening tests, the treatment benefits for the

patient can be optimized by selecting the right medication.

If the underlying disease mechanism is less well understood, statistical techniques can

potentially be used to estimate the treatment effect on a subpopulation with a particular co-

variate profile. For example, after a clinical trial on a large population has been conducted,

we want to estimate the likely effect of the intervention for a subpopulation with specific

baseline covariates. Although the trial was originally designed to detect the treatment ef-

fect on the whole population, subgroup analysis methods have been developed for both

consistent and efficient (with smallest possible variance) estimation of the treatment effect

on subpopulations (Cai et al., 2011; Zhao et al., 2013).

Besides analytic approaches to improve efficiency in estimating subpopulation treatment

effects, efficient designs can also help. For example, adaptive clinical trial designs

(Berry et al., 2010) can confirm the effectiveness of a drug and identify the subpopulation

that benefits the most. In an adaptive design, a new patient’s probability of randomization

to the different treatment arms changes as a function of the accumulated outcomes to date.

Adaptive designs can result in the same amount of information about the relative interven-

tion effect on a subpopulation but with shorter trial duration or fewer study subjects.
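
One way to make the randomization probability depend on accumulated outcomes is sketched below. This is an assumption-laden illustration, not a design taken from the cited reference: it uses binary outcomes, independent uniform priors on each arm's response rate, and sets each arm's allocation probability to the Monte Carlo estimate of the posterior probability that the arm is best. The function name and the counts are hypothetical.

```python
import numpy as np


def allocation_probabilities(successes, failures, rng, n_draws=5000):
    """Estimate P(arm has the highest response rate) for each arm."""
    successes = np.asarray(successes)
    failures = np.asarray(failures)
    # Beta(1 + s, 1 + f) posterior for each arm's response rate under a uniform prior.
    draws = rng.beta(1 + successes, 1 + failures, size=(n_draws, len(successes)))
    best = draws.argmax(axis=1)
    return np.bincount(best, minlength=len(successes)) / n_draws


rng = np.random.default_rng(0)
# Hypothetical outcomes to date: arm 0 has 12/20 responders, arm 1 has 5/20.
probs = allocation_probabilities([12, 5], [8, 15], rng)
next_arm = rng.choice(len(probs), p=probs)  # arm assigned to the next patient
print(probs, next_arm)
```

In practice such rules are usually tempered, for example by bounding the probabilities away from 0 and 1, so that every arm continues to receive some patients.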

The goal of subgroup analysis and adaptive designs is to find the right subpopulation

for a particular treatment. A related goal is to match treatment for each of multiple subpop-

ulations according to their covariates. When each subject receives multiple treatments over

time, for example in a crossover trial (Senn, 2002), we can estimate the patient’s health

states after each treatment, which can then be combined with other patients’ data to col-

lectively learn the optimal treatments specific for subgroups as defined by covariates. The

“N-of-1” trial is an extreme example where sequential treatments are assigned to a single

study subject and the variation in the measured health outcomes over time can be used to

estimate the intervention effects specific for each individual (Guyatt et al., 1990).

We review the relevant literature on treatment option selection in the final chapter, but in

this thesis we focus on the following question.

Evaluation of individualized interventions. After an individualized decision rule has

been applied to each patient, a key question is to what extent the individualized rules

improved the health outcomes for the entire population. Individualized rules will have

clear-cut benefits for subjects with some specific baseline profiles, but not for others. The latter

group of patients usually have baseline profiles under which benefits to all available treat-

ments are estimated with less precision. Therefore, even if individualized intervention rules

are adopted, assessment at the population level is useful for health care policy makers to

decide whether to adopt these individualized rules in their local populations. Similar to es-

timating the subpopulation intervention effects, consistency and efficiency considerations

are essential for objective and precise evaluation of the individualized interventions.

For example, in the Guided Care study (Boult et al., 2013), for each of the chronically

ill older patients in a clinical practice, a specially educated registered nurse is in place to 1)

help create an evidence-based plan of care, 2) coordinate the efforts of all clinicians who

provide the patient’s health care, 3) smooth the patient’s transitions between sites of care,

4) coach the patient’s self-management, and 5) educate and support family caregivers,

among other duties. To assess the hypothesis that such individualized health care can improve functional

outcomes measured by Short-Form (SF)-36 version 2 (Ware and Kosinski, 2001) and other

measures of quality of care and health services utilization, the investigators enrolled

eligible patients from 14 practices in the Baltimore and Washington, DC areas and assigned specially trained

nurses to 7 of the practices under a matched-pair cluster randomized (MPCR) design. The

goal of policy interest is to consistently and efficiently compare the average measured out-

comes if all the study subjects in the population are assigned such specially trained nurses

versus the average if no such nurses are assigned.
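
For orientation, the sketch below shows the simplest unadjusted comparison under a matched-pair cluster randomized design: the average of within-pair differences in cluster mean outcomes. It ignores covariates entirely, so it is the kind of naive baseline the covariate-calibrated approach is meant to improve on, not the method developed in this thesis. The function name and the outcome values are hypothetical.

```python
import numpy as np


def unadjusted_pair_estimate(pairs):
    """Average within-pair difference in cluster means (intervention minus control).

    pairs: list of (control_outcomes, intervention_outcomes), one tuple per matched pair.
    """
    diffs = np.array([np.mean(treated) - np.mean(control) for control, treated in pairs])
    estimate = diffs.mean()
    # Standard error treats the pair differences as independent draws.
    std_error = diffs.std(ddof=1) / np.sqrt(len(diffs))
    return estimate, std_error


# Hypothetical patient outcomes for three matched pairs of practices.
pairs = [
    ([40.0, 42.0, 44.0], [45.0, 47.0]),
    ([38.0, 40.0], [39.0, 41.0, 43.0]),
    ([50.0, 52.0], [51.0, 53.0]),
]
estimate, std_error = unadjusted_pair_estimate(pairs)
print(estimate, std_error)
```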

1.2 Organizational overview

Since the first part of this thesis will develop an extension of the latent class model

(LCM) (Goodman, 1974), in Chapter 2, we review the history and statistical properties of

the LCM. Then we discuss a related method, the Grade-of-Membership (GoM) model, which

can be shown to be equivalent to the LCM while usually requiring a smaller number of classes to

describe the same marginal multivariate discrete distribution. Common to the LCM and GoM

is the flexibility for characterizing higher-order moments and their regression extensions.

The final section gives some technical theorems that describe the approximation properties

of the LCM using nonnegative matrix decompositions (Bhattacharya and Dunson, 2012),

which we will use in the development of npLCM in Chapter 4.

Chapter 3 presents the partially-latent class model (pLCM), which enables population

etiology estimation and individual diagnosis using data from a case-control design. Specif-

ically, the pLCM assumes that the health states of the cases are latent, while the health

states of the controls are observed, hence the name partially-latent. Given the unobserved

health state of a case subject, the pLCM, like the LCM, assumes conditional independence

among multivariate binary measurements with the positive rate for each dimension equal-

ing the true positive rate, or the false positive rates that can be estimated from the controls.

Measurements with different error rates, which we term gold, silver, and bronze standards,

are integrated systematically in the pLCM through combined likelihood specifications. The

detailed model specification can be found in Section 3.2. Also, to inform allocation of sam-

pling and laboratory resources and to improve future study designs, in Section 3.3 and the



appendix to Chapter 3, we quantify the fraction of information about the mixing weight of

a particular class of the health states that derives from each level of measurement. Section

3.4 introduces graphical displays of the population data and inferred latent-class frequen-

cies, which are effective tools for communications between the statisticians and the domain

experts during the course of this research.

Building upon the pLCM, Chapter 4 introduces the nested partially-latent class model

(npLCM) to allow conditional dependence. Sections 4.2 and 4.3 present the model likelihood

and noninterference submodels, which formalize the heuristics that the joint measurement

distribution for controls informs the case model. We show by simulation studies in Section

4.4 the degree to which ignoring conditional dependence can lead to bias in the estimation

of population etiology and individual diagnosis. Analyses of subsets of the PERCH data

under both the pLCM and the npLCM are presented and compared.

Chapter 5 begins Part II of the thesis in which we develop methods for evaluating an

individualized intervention when it is applied to a population of clinical practices. We aim

to obtain consistent and efficient evaluation by leveraging individual-level or cluster-level

covariates when the data has been collected from a special design, called the matched-pair

cluster randomized design. One goal of policy interest is to estimate the average outcome

if all clusters in all pairs are assigned control versus if all clusters in all pairs are assigned

to intervention. Section 5.2 formulates the study design and the observed data likelihood in

terms of the potential outcome framework. Under this framework, Section 5.3 shows that

previous meta-analytic approaches have implicitly assumed conditional independence be-



tween pair-specific mean outcome differences and variances given the population treatment

effect, hence may lead to bias if such an assumption is inappropriate. Bias can also occur

if we do not account for the covariate imbalances that may still exist between clusters in a

pair after matching. We propose a covariate-calibrated estimator to reduce these biases and

improve efficiency (Wu et al., 2014b). Lastly, the methodology is illustrated by an analysis

using data from the Guided Care study.

Chapter 6 summarizes the contribution of this dissertation and suggests open questions

for future research.

1.3 Software

The R and WinBUGS programs to reproduce all the results in this thesis work are

available at the following website:

http://www.biostat.jhsph.edu/~zhwu/software/thesis.code.zip

The programs are organized into two main folders: npLCM and MPCR. The npLCM

folder contains all of the programs used for fitting the nested partially-latent class models

(npLCM) and creating our proposed visualizations described in Chapters 3 and 4. The MPCR

folder contains the function for producing the covariate-calibrated estimator described in

Chapter 5 and the graphics/tables presented therein. Both folders contain descriptions about

how to use the programs including data format requirements, model specifications, and

options for outputs.



Chapter 2

Latent Class Models



2.1 Brief history and formulation

Latent variables are used in statistical models to represent individual characteristics

usually of scientific interest that are not directly observable. A goal of analysis is to infer

the values of the latent variables from other observable quantities. Some examples of latent

variables include depression or other mental states, disability, and intelligence. These latent

variables are clearly understood in their respective contexts while not directly measured.

Models that relate the latent variables and manifest (observed) responses (e.g. mea-

surements of symptoms, detection of pathogens) for an individual can be described by the

general term “latent structure model”, as summarized and discussed at book length

by Lazarsfeld and Henry (1968). The observed variables are usually assumed to be con-

ditionally independent given the values of the latent variables. These models serve the

purpose of summarizing the observed variations of measurements in terms of a vector of

low-dimensional latent variables, which are underlying constructs of interest. McCutcheon

(1987) defined four types of latent structure models: factor analysis (continuous outcomes

and continuous latent variables), latent trait model (discrete outcomes and continuous latent

variables), latent profile model (continuous outcomes and discrete latent variables), and latent

class model (discrete outcomes and discrete latent variables). In this chapter, we focus on

the latent class models (LCM) for multivariate binary data with the latent variables that take

values in a finite set of classes, because it is relevant to the motivating PERCH application

detailed in Chapters 3 and 4.

Let Mi be a J-dimensional binary measurement for subject i = 1, 2, ..., N, comprising

the observed binary outcomes on J items, e.g., questions in psychometrics, or

pathogen presence/absence in infectious disease research. Let the latent variable for individual

i be discrete and take its value in a finite set, Zi ∈ {1, ..., K}, where K

is the number of possible classes. Let νk = P(Zi = k) denote the probability that the ith

person is in the kth class. In the PERCH application, the control measurement Mi is a

vector of measurements on J different species of pathogens. Given the latent variable Zi

for individual i, the LCM assumes that the J measurements are conditionally independent

of one another, with the conditional distribution

$$\mathrm{pr}(M_i \mid Z_i = k, p_{Z_i}) = \prod_{j=1}^{J} p_{jk}^{M_{ij}} (1 - p_{jk})^{1 - M_{ij}}, \tag{2.1.1}$$

where $p_k = \{p_{jk} = P(M_{ij} = 1 \mid Z_i = k),\ j = 1, \ldots, J\}$ is the vector of conditional

probabilities that an individual i who is in latent class k will have a positive response on the

jth dimension. In words, the conditional independence says that if the latent variable Zi is

observed, then the observed multivariate measurements, Mi1,Mi2, ...,MiJ , are not infor-

mative about one another. For any pair of margins (j, j′), the observed marginal association

between Mij and Mij′ is induced by their separate associations with the shared latent vari-

able Zi.

When the study population is a mixture of subjects with unknown latent variables Zi, the



observed data distribution is a finite mixture distribution

$$\mathrm{pr}(M_i; \nu, P) = \sum_{k=1}^{K} \nu_k \cdot \prod_{j=1}^{J} p_{jk}^{M_{ij}} (1 - p_{jk})^{1 - M_{ij}}, \tag{2.1.2}$$

where P is a J × K matrix with columns being conditional probability vectors pk, k =

1, ..., K. Assuming that the sampled individuals are mutually independent, we obtain the

full likelihood specification of the LCM

$$\mathrm{pr}(M_i, i = 1, \ldots, N \mid \nu, P) = \prod_{i=1}^{N} \sum_{k=1}^{K} \nu_k \cdot \prod_{j=1}^{J} p_{jk}^{M_{ij}} (1 - p_{jk})^{1 - M_{ij}}. \tag{2.1.3}$$
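As a hedged illustration of (2.1.1) to (2.1.3), the following Python sketch evaluates the LCM log-likelihood for given parameter values. The function name and array layout are assumptions made for this example (they are not taken from the thesis software), and the computation is done on the log scale only for numerical convenience.

```python
import numpy as np

def lcm_log_likelihood(M, nu, P):
    """Log of the LCM likelihood (2.1.3).

    M  : (N, J) array of 0/1 measurements
    nu : (K,) mixing weights, summing to one
    P  : (J, K) array with P[j, k] = P(M_ij = 1 | Z_i = k), strictly inside (0, 1)
    """
    M = np.asarray(M, dtype=float)
    nu = np.asarray(nu, dtype=float)
    P = np.asarray(P, dtype=float)
    # log pr(M_i | Z_i = k) from (2.1.1), for every subject i and class k: shape (N, K)
    log_cond = M @ np.log(P) + (1.0 - M) @ np.log1p(-P)
    # log of nu_k * pr(M_i | Z_i = k)
    log_joint = log_cond + np.log(nu)
    # sum over classes as in (2.1.2), on the log scale for numerical stability
    mx = log_joint.max(axis=1, keepdims=True)
    log_marginal = mx[:, 0] + np.log(np.exp(log_joint - mx).sum(axis=1))
    # independence across subjects turns the product in (2.1.3) into a sum of logs
    return float(log_marginal.sum())
```

Evaluating this function on every one of the 2^J possible response patterns and summing the exponentiated values should return one, which is a quick sanity check that P and ν are laid out as intended.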

Remark 1. Goodman (1974) and Haberman (1979) also noted that the LCM can be equivalently

formulated as a log-linear model for a contingency table in which one of the categorical

variables, the latent variable, is unobserved.

The popularity of the LCM lies in its ability to describe correlated binary measurements,

which otherwise lack a standard joint distribution analogous to the multivariate

Gaussian distribution in the continuous case. As discussed in Section 2.5, the LCM can

approximate any multivariate binary distribution with arbitrary precision if the number of

latent classes K is large enough.

The LCM is often used as a clustering tool since each observation is assigned a prob-

ability of being in each of the K classes. One specific area that finds LCM particularly

useful is the evaluation of medical diagnostic tests, especially when no gold-standard is

available to directly observe the disease status (Albert et al., 2001; Pepe and Janes, 2007).



More specifically, suppose J diagnostic tests are applied to an individual and that we cannot

directly observe the disease status Zi ∈ {0, 1} of the individual. We use the J-dimensional

measurements to infer Zi. The LCM collectively uses the information provided by different

diagnostic tests with possibly varied error rates, to accomplish three tasks: 1) estimate the

mixing weights ν, 2) estimate the conditional probabilities P that describe the measure-

ment characteristics (e.g. error rates) given known disease status, and 3) predict the vector

of probabilities that an individual i belongs to each of the classes.

Task 1) is related to understanding the population structure, i.e., estimating the prevalence

of each latent class. It is sometimes useful to stratify the population into groups

within each of which the measurement characteristics are homogeneous and can be sum-

marized by the conditional probabilities estimated in 2).

By inspecting estimates obtained in 2), we can assign substantive meanings to each es-

timated latent class. For example, in the research of functional disability related to aging in

the US population (Corder and Manton, 1991), one of the research questions is “what are

the characteristics of each functional disability severity category in terms of measured ac-

tivities?”. The conditional probabilities can describe the response probabilities to questions

like “can you do heavy/light housework?” or “can you get about outside?”. A class

estimated with larger values of these conditional probabilities means that the individuals

in this class have high functional disabilities, while another class with small estimates of

the conditional probabilities represents the subgroup of people whose physical status is

relatively good.



The final task 3) can be accomplished in the Bayesian framework where the individual

probability is the posterior distribution of class membership given her measurements and

the population data. It assigns each study subject probabilistically to the estimated latent

classes, which can guide clinical decisions on this individual.
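Task 3) can be sketched with Bayes' rule for a single subject. The Python example below is a simplified plug-in version that treats ν and P as known; in the fully Bayesian analysis described in the text, these probabilities would instead be averaged over posterior draws of the parameters. The function name and the numerical values are hypothetical.

```python
import numpy as np

def class_membership_probs(m, nu, P):
    """P(Z_i = k | M_i = m) for k = 1..K, given fixed nu (K,) and P (J, K)."""
    m = np.asarray(m, dtype=float)
    nu = np.asarray(nu, dtype=float)
    P = np.asarray(P, dtype=float)
    # log of nu_k * prod_j p_jk^m_j (1 - p_jk)^(1 - m_j), one entry per class
    log_w = np.log(nu) + m @ np.log(P) + (1.0 - m) @ np.log1p(-P)
    w = np.exp(log_w - log_w.max())
    # normalize over classes
    return w / w.sum()

# Hypothetical two-test example: column 0 = non-diseased, column 1 = diseased.
nu = np.array([0.9, 0.1])                   # prevalence of disease is 0.1
P = np.array([[0.05, 0.90],                 # test 1: false positive rate, sensitivity
              [0.10, 0.80]])                # test 2: false positive rate, sensitivity
print(class_membership_probs([1, 1], nu, P))
```

With both tests positive, the diseased class receives weight 0.1 × 0.9 × 0.8 = 0.072 against 0.9 × 0.05 × 0.1 = 0.0045 for the non-diseased class, so the individual is assigned to the diseased class with probability of about 0.94.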

More applications of the LCM and their substantive interpretations can be found in

the areas of diagnosis and rater agreement (e.g., Albert et al. (2001), Uebersax (1988),

Uebersax and Grove (1993), Dillon and Mulani (1984), Gelfand and Solomon (1973)),

psychiatry (e.g., Young (1983); Eaton et al. (1989); Sullivan et al. (1998)), education (e.g.,

Aitkin et al. (1981); Uebersax (1997)), and infectious disease studies (e.g., Jokinen and

Scott (2010)).

2.2 Identifiability

Potential non-identifiability of the LCM parameters is well known. For example, an

LCM with four observed binary indicators and three latent classes is not identifiable de-

spite providing 15 degrees-of-freedom to estimate 14 parameters (Goodman, 1974). In

latent variable analysis, the model identifiability is usually discussed in a local sense as

first described by McHugh (1956). We call a distribution F “locally identifiable” if at the

parameter ψ0, there exists some neighborhood N (ψ0) such that

$$F_M(m; \psi_0) = F_M(m; \psi)\ \ \forall m \in \mathrm{supp}(F) \iff \psi = \psi_0, \quad \forall \psi \in \mathcal{N}(\psi_0) \cap \Psi, \tag{2.2.1}$$



where M denotes the random vector of measurements, Ψ is the parameter space and

supp(F ) is the support set of distribution F .

In the standard LCM with K classes, the parameter vector ψ = (ν,P ) is of dimension

K(J + 1)− 1 as defined in the previous section. Goodman (1974) concluded that a model

is identifiable if $K < 2^{(J-1)/2}$ and not identifiable if $K > 2^J/(J + 1)$. Models where

neither of the two inequalities holds may or may not be identifiable. Such identifiability

is theoretical and may require a very large sample size in order to distinguish the model

likelihood over two parameter values.
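The parameter counting behind these statements can be made concrete with a small Python sketch. It only compares the number of free LCM parameters, K(J + 1) − 1, with the 2^J − 1 free cell probabilities of the J-way binary table; the function names are illustrative assumptions, and passing this count is a necessary condition rather than a guarantee of identifiability.

```python
def lcm_parameter_count(J, K):
    """(K - 1) free mixing weights plus J * K response probabilities."""
    return K * (J + 1) - 1

def free_cell_probabilities(J):
    """Degrees of freedom in the table of J binary measurements."""
    return 2 ** J - 1

def exceeds_counting_bound(J, K):
    """True when there are more parameters than free cells, i.e. K > 2**J / (J + 1)."""
    return lcm_parameter_count(J, K) > free_cell_probabilities(J)

# The example cited from Goodman (1974): four indicators, three classes.
print(lcm_parameter_count(4, 3), free_cell_probabilities(4))  # 14 15
print(exceeds_counting_bound(4, 3))                           # False
```

The four-indicator, three-class model passes the counting check (14 parameters against 15 degrees of freedom) and yet is reported in the text as not identifiable, which is why the counting rule alone cannot settle the question.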

Remark 2. Jones et al. (2010) discuss, from a geometric perspective, global identifiability,

weak identifiability, and partial identifiability in the context of multiple diagnostic

testing in the absence of a gold standard.

When the sample size is finite or the number of individuals in a latent class is small,

the data may not be fully informative about the parameters in that class, and weak estimability

(Dawid, 1979; Gelfand and Sahu, 1999) can occur. Weak estimability arises when the

technical conditions for identifiability are met, but the data provide little information about

particular parameters, so that their posterior and prior distributions are similar. Suppose

the model is denoted by L(ψ; M) and ψ is partitioned as ψ = (ψ1, ψ2). If f(ψ2 |

ψ1, M) = f(ψ2 | ψ1), then we say ψ2 is weakly estimable. The data M do not provide

extra information about ψ2 given ψ1 beyond the prior conditional distribution f(ψ2 | ψ1).

Therefore, ψ2 cannot be identified from the data. Note, however, that this does not mean

f(ψ2 | M) = f(ψ2), because if ψ1 is identifiable from the data, then ψ2 can be indirectly



learned through integrating over [ψ1 | Data] in the prior conditional distribution f(ψ2 | ψ1),

where the conditioning set has been estimated from the data (Gustafson et al., 2001).
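The indirect learning described above can be written out in one line. The display below is only a restatement of the argument in the text, assuming all the densities involved exist; it is not an additional result.

```latex
\begin{align*}
f(\psi_2 \mid M)
  &= \int f(\psi_2 \mid \psi_1, M)\, f(\psi_1 \mid M)\, d\psi_1 \\
  &= \int f(\psi_2 \mid \psi_1)\, f(\psi_1 \mid M)\, d\psi_1
     && \text{(weak estimability of } \psi_2\text{)} \\
  &\neq \int f(\psi_2 \mid \psi_1)\, f(\psi_1)\, d\psi_1 = f(\psi_2)
     && \text{(in general, when } f(\psi_1 \mid M) \neq f(\psi_1)\text{)}.
\end{align*}
```

The data therefore change the marginal posterior of ψ2 only through the updated distribution of ψ1, and only to the extent that ψ2 depends on ψ1 a priori.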

When a model is not locally identifiable, we cannot estimate an LCM with likelihood

methods. But if we have sources of prior information about a subset of the parameters,

we can estimate the latent class model by Bayesian methods. The Bayesian framework

can avoid the identifiability issue by supplying prior information on model parameters, and

the posterior distribution is a legitimate summary of both prior and likelihood information

(Gustafson, 2009).

2.3 Estimation by Markov chain Monte Carlo

In the Bayesian approach, there is no distinction between latent variables and parameters;

all are considered random quantities whose distributions are to be updated given data.

A survey of Markov chain Monte Carlo (MCMC) methods can be found in Robert and

Casella (1999), Gilks et al. (1996), or Brooks et al. (2011). MCMC methods have been

used for a variety of latent variable models, including generalized linear mixed models

(e.g., Zeger and Karim (1991); Clayton (1996)), multilevel models, covariate measurement

models, etc. For the LCM, Garrett and Zeger (2000) detailed the MCMC algorithm to draw

approximating samples from the joint posterior distribution of all the unknowns (model pa-

rameters and latent variables). An important advantage of MCMC is that the approach

can be used to estimate complex models for which other methods are either unfeasible



or work poorly. Another advantage is that any characteristics of the posterior distribution

can be investigated based on stationary simulated values, for instance posterior means and

percentiles.
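To make the idea of sampling all unknowns (parameters and latent variables) concrete, here is a minimal Gibbs sampler for the K-class LCM written in Python. It is a generic data-augmentation sketch and not the algorithm of Garrett and Zeger (2000) or the WinBUGS programs used in this thesis; the uniform Dirichlet and Beta(1, 1) priors, the function name, and the default iteration counts are all assumptions chosen for illustration. No attempt is made to handle label switching.

```python
import numpy as np

def gibbs_lcm(M, K, n_iter=500, burn_in=100, seed=0):
    """Gibbs sampler for a K-class LCM with binary items.

    Assumed priors: nu ~ Dirichlet(1, ..., 1) and p_jk ~ Beta(1, 1).
    Returns post burn-in draws of nu (n_keep, K) and P (n_keep, J, K).
    """
    rng = np.random.default_rng(seed)
    M = np.asarray(M, dtype=float)
    N, J = M.shape
    eps = 1e-12
    nu = np.full(K, 1.0 / K)
    P = rng.uniform(0.2, 0.8, size=(J, K))
    nu_draws, P_draws = [], []
    for it in range(n_iter):
        # 1) draw class indicators Z_i given nu, P and M_i
        log_w = np.log(nu) + M @ np.log(P) + (1.0 - M) @ np.log1p(-P)
        w = np.exp(log_w - log_w.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        u = rng.random((N, 1))
        Z = np.minimum((w.cumsum(axis=1) < u).sum(axis=1), K - 1)
        # 2) draw mixing weights given the class counts
        counts = np.bincount(Z, minlength=K)
        nu = rng.dirichlet(1.0 + counts)
        # 3) draw response probabilities given Z and M
        onehot = np.eye(K)[Z]              # (N, K) class membership indicators
        pos = M.T @ onehot                 # (J, K) positives per item and class
        P = np.clip(rng.beta(1.0 + pos, 1.0 + counts - pos), eps, 1.0 - eps)
        if it >= burn_in:
            nu_draws.append(nu)
            P_draws.append(P)
    return np.array(nu_draws), np.array(P_draws)
```

Posterior means and percentiles of any parameter can then be approximated by the corresponding summaries of the retained draws, for example `nu_draws.mean(axis=0)`.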

There are several tuning parameters that must be chosen in using MCMC methods.

The first is the burn-in period, or the number of initial iterations to discard while the

Markov chain is still approaching its stationary distribution. In WinBUGS, this

period is also needed to choose good parameters for Metropolis-Hastings (MH) proposal

distributions (Spiegelhalter et al., 2003). The second tuning constant is the total length of

the MCMC. It is considerably more difficult to monitor convergence to a distribution than

to a point. A popular approach is to use an arbitrarily large number of iterations or to run a number of

chains with different initial values to assess convergence (Gelman et al., 2013). Approaches

based on Monte Carlo errors have been proposed (Flegal et al., 2008) to ensure acceptable

precision of the estimates like the posterior means. It can be particularly hard to judge

convergence of the estimates when there is slow mixing, that is, when the chain moves

slowly through the bottlenecks of the target distribution. When the mixing is poor, the

chain has to be run for a very long time to obtain accurate estimates.
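One common way to use several chains started from different initial values is the potential scale reduction factor described in Gelman et al. (2013). The Python sketch below computes the classic (non-split) version for a single scalar quantity; the text does not state which diagnostic was used in this thesis, so this is offered only as an illustration, and the function name is an assumption.

```python
import numpy as np

def potential_scale_reduction(chains):
    """Potential scale reduction factor for one scalar parameter.

    chains : (n_chains, n_draws) array of post burn-in draws.
    Values close to 1 suggest the chains have mixed; larger values suggest
    they have not yet converged to a common distribution.
    """
    chains = np.asarray(chains, dtype=float)
    n_chains, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)            # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # average within-chain variance
    var_plus = (n - 1) / n * W + B / n         # pooled estimate of the posterior variance
    return float(np.sqrt(var_plus / W))
```

A slowly mixing chain can still produce a value near one if all chains are stuck in the same region, so this diagnostic complements, rather than replaces, inspection of trace plots and Monte Carlo standard errors.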

2.4 Grade-of-Membership model

The LCM (2.1.3) represents the joint distribution of multivariate binary responses as

a mixture of conditionally independent (product) distributions, one for each class. The



Grade-of-Membership (GoM) model, developed by Max Woodbury in the 1970s for medical

classification (Woodbury et al., 1978; Clive et al., 1983), is another approach to characterizing

the multivariate distribution of categorical variables, with a potentially more parsimonious

representation than the LCM (Manton et al., 1994; Singer, 1989; Erosheva et al.,

2007; Bhattacharya and Dunson, 2012). The GoM model is especially useful when the ma-

jority of the cells in the observed contingency table have small or zero counts. The GoM

has been applied in genetic studies (Pritchard et al., 2000), studies of functional disabilities

(Erosheva et al., 2007), and topic modeling (Blei et al., 2003).

Specifically, let gi = (gi1, gi2, ..., giK)′ be a latent partial membership vector for indi-

vidual i comprising K nonnegative random variables that sum to one. Define an “extreme

profile” to be a vector of conditional response probabilities $\lambda_{kjm_j} = P(M_{ij} = m_j \mid g_{ik} = 1,\ g_{ik'} = 0,\ k' \neq k)$

when the individual is entirely a member of class k, that is, $g_{ik} = 1$ and $g_{ik'} = 0$ for all $k' \neq k$,

for k = 1, 2, ..., K, j = 1, 2, ..., J, and $m_j = 1, 2, \ldots, D_j$, with $D_j$ being the number of

categories on the jth dimension of measurements. The set of conditional response probabilities

must satisfy the following constraint

$$\sum_{m_j = 1}^{D_j} \lambda_{kjm_j} = 1, \tag{2.4.1}$$

for k = 1, 2, ..., K, and j = 1, 2, ..., J.

Given the partial membership vector $g_i \in [0, 1]^K$, the conditional distribution of the observed

measurement Mij is given by a convex combination of the extreme profiles’ conditional



response probabilities, that is

$$P(M_{ij} = m_j \mid g_i) = \sum_{k=1}^{K} g_{ik} \lambda_{kjm_j}, \tag{2.4.2}$$

for j = 1, 2, ..., J , and mj = 1, 2, ..., Dj . Similar to the LCM, the local independence

assumption states that the manifest variables, Mi1, ..., MiJ, are conditionally independent given

the latent variables gi. Under this assumption, the conditional probability of observing response

pattern $M_i = m = (m_1, \ldots, m_J)$ is

$$P(M_i = m \mid g_i) = \prod_{j=1}^{J} \left( \sum_{k=1}^{K} g_{ik} \lambda_{kjm_j} \right). \tag{2.4.3}$$

By marginalizing over the distribution of latent vector gi (denoted as G(·)), we obtain the

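observed marginal distribution for response pattern m:

\begin{align}
P(M_i = m) &= \int P(M_i = m \mid g_i)\, dG(g_i) \tag{2.4.4} \\
&= \int \prod_{j=1}^{J} \left( \sum_{k=1}^{K} g_{ik} \lambda_{kjm_j} \right) dG(g_i). \tag{2.4.5}
\end{align}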

Both the LCM and the GoM are finite mixture models but they differ in the level of

mixture. The LCM’s marginal distribution or integrated likelihood (2.1.3) can simplify

to a summation of K components, which is usually termed population-level mixture. In

contrast, the functional form of the marginal distribution of responses in the GoM (2.4.5)

cannot simplify to a finite sum, and is similar to the structure in the random-effects model



where the random effects gi follow a continuous distribution G(·). Erosheva et al. (2007)

termed the GoM an individual-level mixture model because each individual has her own

vector of mixing weights gi over K extreme profiles.

If the number of mixture components used in the LCM can be different from that in

the GoM, connections between LCM and GoM can be established. Specifically, the book

Woodbury et al. (1978) first pointed out the nested property of the LCM and GoM. In a

book review, Haberman (1995) suggested that the GoM is a special case of the LCM with

a set of restrictions imposed upon the latent class model. We can construct an LCM such

that its marginal distribution of manifest variables is exactly the same as under the GoM

model. Erosheva et al. (2007) showed that an equivalence between the individual-level and the

population-level mixture models exists and can be summarized by the following theorem.

Theorem 2.4.1. (Fundamental Representation Theorem, Theorem 3.2, Erosheva 2007)

Given J manifest variables, any individual-level mixture model with K components can

be represented as a constrained population-level mixture model with $K^J$ components.

The fundamental representation theorem indicates that the MCMC algorithm with data

augmentation that has been developed for the LCM can also be used for posterior

calculation in the GoM (Erosheva et al., 2007).
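The algebra behind the representation theorem can be checked numerically for a fixed membership vector g. Expanding the product over items in (2.4.3) gives a sum over all K^J ways of assigning one extreme profile to each item, and each term is a product-form (LCM-type) response probability weighted by a product of membership scores. The Python sketch below verifies this identity on random inputs; it conditions on g and so illustrates only the expansion step, not the integration over G(·). Function names are assumptions, and categories are indexed from 0 rather than 1.

```python
import itertools
import numpy as np

def gom_conditional(m, g, lam):
    """Eq. (2.4.3): prod_j sum_k g_k * lam[k, j, m_j]; lam has shape (K, J, D)."""
    J = len(m)
    return float(np.prod([g @ lam[:, j, m[j]] for j in range(J)]))

def expanded_mixture(m, g, lam):
    """Same probability as a mixture over all K**J assignments z = (z_1, ..., z_J)."""
    K, J = lam.shape[0], len(m)
    total = 0.0
    for z in itertools.product(range(K), repeat=J):
        weight = np.prod([g[zj] for zj in z])                      # weight of this latent class
        response = np.prod([lam[z[j], j, m[j]] for j in range(J)])  # product-form response probability
        total += weight * response
    return float(total)

rng = np.random.default_rng(0)
K, J, D = 3, 4, 2
g = rng.dirichlet(np.ones(K))                  # one partial membership vector
lam = rng.dirichlet(np.ones(D), size=(K, J))   # extreme profiles, each row sums to one
m = (1, 0, 1, 1)
print(gom_conditional(m, g, lam), expanded_mixture(m, g, lam))
```

The two printed numbers agree up to floating-point error. The K^J latent classes in the expanded form are constrained because their response probabilities are all built from the same K × J extreme-profile parameters.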

When covariates are considered to influence the probability that an individual belongs

to different latent classes, Dayton and Macready (1988), Bandeen-Roche et al. (1997), and

Huang and Bandeen-Roche (2004) have extended the LCM to the regression setting and

termed such extensions latent class regression models (LCRM). The GoM regression



extension has also been developed in several PhD theses (Connor, 2006; Manrique-Vallier,

2010) in the longitudinal study of disability survey data.

2.5 Approximation properties

The LCM can approximate a multivariate discrete distribution arbitrarily well if the

number of classes (K) is sufficiently large. When the dimension is J = 2, the resulting

$D_1 \times D_2$ contingency table has cell $(m_1, m_2)$ containing the count $\sum_{i=1}^{n} 1\{M_{i1} = m_1, M_{i2} = m_2\}$,

for $m_1 = 1, \ldots, D_1$ and $m_2 = 1, \ldots, D_2$. Let the contingency table be

represented by $p_0 = \{P(M_{i1} = m_1, M_{i2} = m_2)\}$. The LCM is equivalent to the finite

mixture specification introduced by as

$$P(M_{i1} = m_1, M_{i2} = m_2) = \sum_{k=1}^{K} \nu_k \psi^{(1)}_{km_1} \psi^{(2)}_{km_2}, \tag{2.5.1}$$

where ν = (ν1, ..., νK) is a vector of mixture probabilities.

Good (1969) first noted the similarity between the singular value decomposition (SVD)

and the latent structure model without providing a proof that any probability matrix $p_0 \in \Pi_{D_1 D_2}$

can be decomposed as in (2.5.1). When the dimension is larger than 2, let

$$p_0 = \{P(M_{i1} = m_1, \ldots, M_{iJ} = m_J) = p_{m_1 \cdots m_J},\ m_j = 1, \ldots, D_j,\ j = 1, \ldots, J\} \in \Pi_{D_1 \cdots D_J}$$

denote a higher-order tensor, with $\Pi_{D_1 \cdots D_J}$ denoting the set of all probability tensors of size

$D_1 \times D_2 \times \cdots \times D_J$, where the probability tensors have nonnegative elements and satisfy the constraint

$$\sum_{m_1 = 1}^{D_1} \cdots \sum_{m_J = 1}^{D_J} |p_{m_1 m_2 \cdots m_J}| = 1.$$

The following corollary states that any such contingency table can be decomposed into

the following form, equivalent to that in the LCM formulation.

Theorem 2.5.1. (Corollary 1, Dunson and Xing (2009))

$$p = \sum_{k=1}^{K} \nu_k \Psi_k, \qquad \Psi_k = \psi^{(1)}_k \otimes \psi^{(2)}_k \otimes \cdots \otimes \psi^{(J)}_k,$$

where ⊗ is the Kronecker product, $\nu = (\nu_1, \ldots, \nu_K)'$ is a probability vector that sums to one,

$\Psi_k \in \Pi_{D_1 \cdots D_J}$, and $\psi^{(j)}_k$ is a $D_j \times 1$ probability vector, for k = 1, ..., K and j = 1, ..., J.

This makes clear that any multivariate categorical data distribution can be expressed as

a latent structure model,

$$P(M_{i1} = m_1, \ldots, M_{iJ} = m_J) = \sum_{k=1}^{K} \nu_k \prod_{j=1}^{J} \psi^{(j)}_{km_j},$$

where ν is a vector of component probabilities, $Z_i \in \{1, \ldots, K\}$ is a latent class index, the elements of

$M_i = (M_{i1}, \ldots, M_{iJ})'$ are conditionally independent given $Z_i$, and $P(M_{ij} = m_j \mid Z_i = k) = \psi^{(j)}_{km_j}$

is the probability of $M_{ij} = m_j$ given allocation of individual i to class k.

The decomposition above is referred to as nonnegative PARAFAC decomposition (Shashua



and Hazan, 2005), which is one way of generalizing the matrix singular value decomposi-

tion. Its goal is to express the tensor as a sum of K rank 1 tensors.
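The sum-of-rank-one-tensors form can be illustrated by building the probability tensor directly from ν and the ψ vectors. The Python sketch below is a forward construction only (it does not estimate a decomposition from data), with the function name and input layout assumed for this example and categories indexed from 0.

```python
from functools import reduce
import numpy as np

def lcm_probability_tensor(nu, psi):
    """Build p = sum_k nu_k * (psi_k^(1) x ... x psi_k^(J)) as a D_1 x ... x D_J array.

    nu  : (K,) probability vector
    psi : list of J arrays; psi[j] has shape (K, D_j) and each row sums to one
    """
    shape = tuple(p.shape[1] for p in psi)
    tensor = np.zeros(shape)
    for k in range(len(nu)):
        # outer product of the J class-k probability vectors: a rank-one tensor
        rank_one = reduce(np.multiply.outer, [p[k] for p in psi])
        tensor += nu[k] * rank_one
    return tensor

rng = np.random.default_rng(0)
nu = np.array([0.2, 0.5, 0.3])
psi = [rng.dirichlet(np.ones(d), size=3) for d in (2, 3, 4)]
p = lcm_probability_tensor(nu, psi)
print(p.shape, p.sum())
```

The resulting array has nonnegative entries that sum to one, so it is a member of the set of probability tensors defined above, and each entry equals the latent structure probability given in the display following Theorem 2.5.1.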

The GoM also has approximation properties that are summarized in Bhattacharya and

Dunson (2012). It is a nonnegative higher-order singular value decomposition (HOSVD),

which was first proposed by Tucker (1966) for three-way data, and was later extended

to arbitrary tensors by De Lathauwer et al. (2000). The nonnegative HOSVD achieves

better data compression and requires fewer components compared with the nonnegative

PARAFAC decomposition as it uses all combinations of the mode vectors (Bhattacharya

and Dunson, 2012). The GoM allows the manifest variables, Mi1, Mi2, ..., MiJ, to be allocated

to different classes via a local class indicator Zij specific to each dimension j and individual i.

The LCM, in contrast, requires all the manifest variables on an individual to fall into the same

class. In general, the GoM requires fewer classes than the LCM. However, in

our PERCH applications, we find that the LCM provides sufficient approximation for the

control population’s distribution. Bhattacharya and Dunson (2012) also suggested that the

GoM formulation can be extended to accommodate measurements with mixed data type on

each dimension, for example, some continuous, some discrete, using kernel techniques.


Chapter 3

Partially-Latent Class Models (pLCM)

for Case-Control Studies of Childhood

Pneumonia Etiology



Abstract

In population studies on the etiology of disease, one goal is the estimation of the fraction

of cases attributable to each of several causes. For example, pneumonia is a clinical diag-

nosis of lung infection that may be caused by viral, bacterial, fungal, or other pathogens.

The study of pneumonia etiology is challenging because directly sampling from the lung

to identify the etiologic pathogen is not standard clinical practice in most settings. In-

stead, measurements from multiple peripheral specimens are made. This paper considers

the problem of estimating the population etiology distribution and the individual etiology

probabilities. We formulate the scientific problem in statistical terms as estimating mixing

weights and latent class indicators under a partially-latent class model (pLCM) that com-

bines heterogeneous measurements with different error rates obtained from a case-control

study. We introduce the pLCM as an extension of the latent class model. We also intro-

duce graphical displays of the population data and inferred latent-class frequencies. The

methods are illustrated with simulated and real data sets. The paper closes with a brief

description of extensions of the pLCM to the regression setting and to the case where con-

ditional independence among the measures is relaxed.



3.1 Introduction

Identifying the pathogens responsible for infectious diseases in a population poses significant

statistical challenges. Consider the measurement problem in the Pneumonia Etiology

Research for Child Health (PERCH), a case-control study that has enrolled 9,500

children from 7 sites around the world. Pneumonia is a clinical syndrome that devel-

ops because of an infection of the lung tissue by bacteria, viruses, mycobacteria or fungi

(Levine et al., 2012). The appropriate treatment and public health control measures vary

by pathogen. Which pathogen is infecting the lung usually cannot be directly observed

and must therefore be inferred from multiple peripheral measurements with differing error

rates. The primary goals of the PERCH study are to integrate the multiple sources of data

to: (1) aid the attribution of which pathogen or pathogens have caused a particular case’s

lung infection, and (2) estimate the prevalences of the etiologic pathogens in a population

of children.

The basic statistical framework of the problem is pictured in Figure 3.1. Let Yi rep-

resent whether the child is a pneumonia case (Yi = 1) or control (Yi = 0). For a child

with pneumonia, let $I^L_i$ indicate which pathogen causes the lung infection. $I^L_i$ takes values

in {0, 1, 2, ..., J}, where 0 represents no infection (control) and $I^L_i = j$, j = 1, ..., J, represents

the jth pathogen from a pre-specified cause-of-pneumonia or pneumonia etiology

list. Among the J candidate pathogens being tested, we assume only one is the primary

cause. Because, for most cases, it is not possible to directly sample the lung, we do not

know with certainty which pathogen infected the lung, so we seek to infer the infection



status $I^L_i$ based upon a series of laboratory measurements of specimens from various body

fluids and body sources $S$ ($M^S_i$).

The measurement error rates differ by type of measurement. In the motivating PERCH

application and the following discussions, the error rates refer to epidemiologic error rates

that characterize the probability of the pathogen’s presence/absence in specimen tests given

whether it infected the lung. For this and possibly other applications, it is convenient to cat-

egorize measures into three subgroups referred to as “gold”, “silver”, and “bronze” stan-

dard measurements. A gold-standard (GS) measurement is assumed to have both perfect

sensitivity and specificity. A silver-standard (SS) measurement is assumed to have per-

fect specificity, but imperfect sensitivity. Culturing bacteria from blood samples (B-Cx)

is an example of silver standard measurements in PERCH. Finally, bronze-standard (BrS)

measurements are assumed to have imperfect sensitivity and specificity. Polymerase chain

reaction (PCR) evaluation of bacteria and viruses from nasopharyngeal samples is an ex-

ample. In the PERCH study, both SS and BrS measurements are available in all cases.

BrS measures are also available for controls. A goal of this study is to develop a statistical

model that combines GS and SS measurements from cases, with bronze data from cases

and controls to estimate the distribution of pathogens in the population of pneumonia cases,

and the conditional probability that each of the J pathogens is the primary cause of an indi-

vidual child’s pneumonia given her or his set of measurements. Even in applications where

GS data is not available, a flexible modeling framework that can accommodate GS data is

useful for both the evaluation of statistical information from BrS data (Section 3.3) and the



Figure 3.1: Directed acyclic graph (DAG) illustrating relationships among lung infection state ($I^L$), imperfect lab measurements on the presence/absence of each of a list of pathogens at each site ($M^{NP}$, $M^{B}$ and $M^{L}$), disease outcome, and covariates (X). For a subject missing one or more of the three types of measurements, we remove the corresponding measurement component(s). For example, if a case does not have lung aspirate (LA) measurement, we remove $M^{L}$ from the DAG.



incorporation of GS data if it becomes available as measurement technology improves.
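The distinction between silver- and bronze-standard measurements can be illustrated by simulating data under the assumptions stated so far: a single causative pathogen per case, no lung infection in controls, BrS measurements with imperfect sensitivity and specificity, and SS measurements with perfect specificity. The Python sketch below is a simplified reading of that description; the formal model is specified in Section 3.2, and the function name, argument names, and all numerical values here are hypothetical.

```python
import numpy as np

def simulate_plcm(n_cases, n_controls, pi, tpr_brs, fpr_brs, tpr_ss, seed=0):
    """Simulate BrS data for cases and controls and SS data for cases.

    pi      : (J,) etiology fractions among cases, summing to one
    tpr_brs : (J,) BrS true positive rates (sensitivities)
    fpr_brs : (J,) BrS false positive rates, shared by cases and controls
    tpr_ss  : (J,) SS true positive rates; SS false positive rates are zero
    """
    rng = np.random.default_rng(seed)
    pi, tpr_brs, fpr_brs, tpr_ss = (np.asarray(a, dtype=float)
                                    for a in (pi, tpr_brs, fpr_brs, tpr_ss))
    J = len(pi)
    pathogens = np.arange(1, J + 1)
    # latent lung state for cases takes values 1..J; controls have state 0 by design
    cause = rng.choice(pathogens, size=n_cases, p=pi)
    is_cause = cause[:, None] == pathogens[None, :]          # (n_cases, J)
    # BrS in cases: sensitivity for the causative pathogen, false positive rate otherwise
    brs_cases = rng.random((n_cases, J)) < np.where(is_cause, tpr_brs, fpr_brs)
    # BrS in controls: every pathogen is detected at its false positive rate
    brs_controls = rng.random((n_controls, J)) < fpr_brs
    # SS in cases: perfect specificity, so only the causative pathogen can be positive
    ss_cases = is_cause & (rng.random((n_cases, J)) < tpr_ss)
    return cause, brs_cases.astype(int), brs_controls.astype(int), ss_cases.astype(int)
```

In data simulated this way, the control BrS positive rates estimate the false positive rates directly, while an SS-positive result identifies the cause of a case with certainty but is observed only for a fraction of cases determined by the SS sensitivity.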

Latent class models (LCM) (Goodman, 1974) have been successfully used to integrate

multiple diagnostic tests or raters’ assessments to estimate a binary latent status D ∈ {0, 1}

for all study subjects (Hui and Walter, 1980; Qu and Hadgu, 1998; Albert et al., 2001; Albert

and Dodd, 2008). (In these applications, D = 1 if $I^L > 0$.) In the LCM framework,

conditional distributions [M |D = j], j = 0, 1, are specified to use multivariate measure-

ments M to maximize the likelihood as a function of the disease prevalence, sensitivi-

ties and specificities. This framework has also been extended to infer ordinal latent status

(Wang et al., 2011).

There are three salient features of the PERCH childhood pneumonia problem that re-

quire extension of the typical LCM approach. First, we have partial knowledge of the latent

lung state IL for some subjects as a result of the case-control design. In the standard LCM

approach, the study population comprises subjects with completely unknown class mem-

bership D. In this study, the latent etiology IL = 0 is applied to all controls because absent

clinical disease, the lung is assumed to be non-infected. Also, were gold standard mea-

surements available from the lung for some cases, their latent variable would be directly

observed. As the latent state is known for a non-trivial subset of the study population, we

refer to the model posited below as a partially-Latent Class Model or pLCM.

Second, in most LCM applications, the number of diagnostic test results on a subject

is much larger than the number of latent state categories. Here, the number of diagnostic

tests is of the same order, and often equal to the number of categories that IL can assume.



For example, if we consider only the PERCH study BrS data, we simultaneously observe

the presence/absence of J pathogens for each child. The large number of latent categories

of IL leads to weak model identifiability as is discussed in more detail in Section 3.2.1.

Lastly, measurements with differing error rates (i.e. GS, SS, BrS) need to be inte-

grated in this application. Understanding the relative value of each level of measurements

is important to optimally invest resources into data collection (number of subjects, type

of samples) and laboratory assays. An important goal is therefore to estimate the relative

information from each type of measurements about the population and individual etiology

distributions. Albert and Dodd (2008) studied a model where some subjects are selected to

verify their latent status (i.e. collect from them GS measurements) with the probability of

verification depending on the previous test results or completely at random. They showed

GS data can make model estimates more robust to model misspecifications. We quantify

how much GS data reduces the variance of model parameter estimates for design purposes.

Also, they considered binary latent status and did not have available control data. Another

related literature that uses both GS and BrS data is on verbal autopsy (VA) in the setting

where no complete vital registry system is established in the community (King and Lu,

2008). Quite similar to the goal of inferring pneumonia etiology from lab measurements,

the goal of VA is to infer the cause of death (ID) from a pre-specified list by asking close

family members questions about the presence/absence of $K$ symptoms. King and Lu (2008)

proposed estimating the cause-of-death distribution in community $P(I^D = j)$, $j = 1, \ldots, J$,

(similar to etiology) using data on K dichotomous symptoms and GS data from the hospi-



tal where cause-of-death and symptoms are both recorded. However, their method involves

nonparametric estimation of J K-way probability contingency tables and therefore requires

a sizable sample of GS data, especially when the number of symptoms is large. In addition,

a key difference between VA and most infectious disease etiology studies is that the VA

studies are by definition case-only.

Another approach previously used with case and control data is to perform logistic

regression of case status Y on laboratory measurements M and then to calculate point es-

timates of population attributable risks for each pathogen (Bruzzi et al., 1985; Blackwelder

et al., 2012). This method does not account for imperfect laboratory measurements and

cannot use GS data if available. Also, zero prevalence is assigned to pathogens whose esti-

mated odds ratios are smaller than 1, without taking account of their statistical uncertainty.

In this paper, we define and apply a partially-latent class model (pLCM) with conditional

independence assumptions to incorporate these three features: known infection status

for controls, a large number of latent classes, and multiple types of measurements. We use

a hierarchical Bayesian formulation to estimate: (1) the population etiology distribution or

etiology fraction —the frequency with which each pathogen “causes” clinical pneumonia

in the case population; (2) the individual etiology probabilities—the probabilities that a

case is “caused” by each of the candidate pathogens, given observed specimen measure-

ments for that individual; and (3) the relative information content of GS, SS, and BrS data

(Section 3.3 and 3.4).

The remainder of this paper proceeds as follows. In Section 3.2, we formulate the



pLCM and the Gibbs sampling algorithms for implementation. In Section 3.3, we evaluate

our method through simulations tailored for the childhood pneumonia application. Section

3.4 presents the application of our methodology to a subsample of the PERCH data to

demonstrate its applicability. The last section concludes with a discussion of results and

limitations, a few natural extensions of the pLCM also motivated by the PERCH data, as

well as future directions of research.

3.2 A partially-latent class model for multiple indirect measurements

We develop pLCM to address two characteristics of the motivating pneumonia problem:

(1) a partially-latent state variable because the pathogen infection status is known for con-

trols but not cases; and (2) multiple categories of measurements with different error rates

across classes. As shown in Figure 3.1, let $I^L_i$, taking values in $\{0, 1, 2, \ldots, J\}$, represent

the true state of child i’s lung (i = 1, ..., N ) where 0 represents no infection (control) and

ILi = j, j = 1, ..., J , represents the jth pathogen from a pre-specified cause-of-pneumonia

list that is assumed to be exhaustive. Let MSi represent the J × 1 vector of binary indi-

cators of the presence/absence of each pathogen in the measurement at site S, where, in

our application S can be nasopharyngeal (NP), blood (B), or lung (L). Let mSi be the ac-

tual observed values. In the following, we replace S with BrS, SS, or GS, because they

correspond to the measurement types at NP, B, and L, respectively.



Let $Y_i = y_i \in \{0, 1\}$ represent the indicator of whether child $i$ is a control (0) or a case (1). Note

ILi = 0 given Yi = 0. To formalize the pLCM, we define three sets of parameters:

• $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_J)^T$ for the probability $\Pr(I^L = j \mid Y = 1, \boldsymbol{\pi})$, $j = 1, \ldots, J$

• $\psi^S_j = \Pr(M^S_j = 1 \mid I^L = 0)$, the marginal false positive rate (FPR) for measurement $j$ at site $S$

• $\theta^S_j = \Pr(M^S_j = 1 \mid I^L = j)$, the marginal true positive rate (TPR) for measurement $j$ at site $S$ for a person whose lung is infected by pathogen $j$.

We further let $\boldsymbol{\psi}^S = (\psi^S_1, \ldots, \psi^S_J)^T$ and $\boldsymbol{\theta}^S = (\theta^S_1, \ldots, \theta^S_J)^T$. Using these definitions, we have FPR $\psi^{\mathrm{GS}}_j = 0$ and TPR $\theta^{\mathrm{GS}}_j = 1$ for GS measurements, so that $M^{\mathrm{GS}}_{ij} = 1$ if and only if $I^L_i = j$ (perfect sensitivity and specificity). Let $\delta_i$ be the binary indicator of a case $i$ having GS measurements; it equals 1 if the case has available GS data and 0 otherwise. For SS measurements, FPR $\psi^{\mathrm{SS}}_j = 0$ so that $M^{\mathrm{SS}}_{ij} = 0$ if $I^L_i \neq j$ (perfect specificity).

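We formalize the model likelihood for each type of measurement. We first describe the model for the BrS measurement $\boldsymbol{M}^{\mathrm{BrS}}$ for a control or a case. For control $i$, positive detection of the $j$th pathogen is a false positive representation of the non-infected lung. Therefore, we assume that for control $i$, $M^{\mathrm{BrS}}_{ij} \mid \boldsymbol{\psi}^{\mathrm{BrS}} \sim \mathrm{Bernoulli}(\psi^{\mathrm{BrS}}_j)$, $j = 1, \ldots, J$, with conditional independence, or equivalently,

\[
P^{0,\mathrm{BrS}}_i = \Pr(\boldsymbol{M}^{\mathrm{BrS}}_i = \boldsymbol{m} \mid \boldsymbol{\psi}^{\mathrm{BrS}}) = \prod_{j=1}^{J} \left(\psi^{\mathrm{BrS}}_j\right)^{m_j} \left(1 - \psi^{\mathrm{BrS}}_j\right)^{1 - m_j}, \tag{3.2.1}
\]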



where $\boldsymbol{m} = \boldsymbol{m}^{\mathrm{BrS}}_i$. For a case $i'$ infected by pathogen $j$, the positive detection rate for the

$j$th pathogen in BrS assays is $\theta^{\mathrm{BrS}}_j$. Since we assume a single cause for each case, detection

of pathogens other than $j$ will be false positives with probability equal to marginal FPR as

in controls: $\psi^{\mathrm{BrS}}_l$, $l \neq j$. This nondifferential misclassification across the case and control

populations is the essential assumption of the latent class approach because it allows us to

borrow information from control BrS data to distinguish the true cause from background

colonization. We further discuss it in the context of the pneumonia etiology problem in the

final section. Then,

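\[
P^{1,\mathrm{BrS}}_{i'} = \Pr(\boldsymbol{M}^{\mathrm{BrS}}_{i'} = \boldsymbol{m} \mid \boldsymbol{\pi}, \boldsymbol{\theta}^{\mathrm{BrS}}, \boldsymbol{\psi}^{\mathrm{BrS}}) = \sum_{j=1}^{J} \pi_j \cdot \left(\theta^{\mathrm{BrS}}_j\right)^{m_j} \left(1 - \theta^{\mathrm{BrS}}_j\right)^{1 - m_j} \prod_{l \neq j} \left(\psi^{\mathrm{BrS}}_l\right)^{m_l} \left(1 - \psi^{\mathrm{BrS}}_l\right)^{1 - m_l}, \tag{3.2.2}
\]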

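where $\boldsymbol{m} = \boldsymbol{m}^{\mathrm{BrS}}_{i'}$, is the likelihood contributed by BrS measurements from case $i'$. For convenience in Gibbs sampling, we introduce the latent lung infection state $I^L_{i'}$ and represent (3.2.2) by the following two-stage sampling scheme:

(i) multinomial sampling of lung infection state among cases: $I^L_{i'} \mid \boldsymbol{\pi}, Y_{i'} = 1 \sim \mathrm{Multinomial}(\boldsymbol{\pi})$,

(ii) measurement stage given lung infection state: $M^{\mathrm{BrS}}_{i'j} \mid I^L_{i'}, \boldsymbol{\theta}^{\mathrm{BrS}}, \boldsymbol{\psi}^{\mathrm{BrS}} \sim \mathrm{Bernoulli}\left(\mathbf{1}_{\{I^L_{i'} = j\}}\, \theta^{\mathrm{BrS}}_j + \left(1 - \mathbf{1}_{\{I^L_{i'} = j\}}\right) \psi^{\mathrm{BrS}}_j\right)$, $j = 1, \ldots, J$, conditionally independent, where $\mathbf{1}_{\{\cdot\}}$ is the indicator function and equals one if the statement in $\{\cdot\}$ is true; otherwise, zero.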
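The two-stage scheme above can be illustrated with a short simulation. This is a minimal sketch using only the Python standard library; the function name `simulate_plcm_brs` and its arguments are hypothetical and are not part of the software used in this work. Controls are given latent state 0, so every detection among them is a false positive.

```python
import random


def simulate_plcm_brs(n_cases, n_controls, pi, theta, psi, seed=0):
    """Simulate BrS data under the two-stage pLCM sampling scheme.

    pi    : etiology fractions for pathogens 1..J (should sum to one)
    theta : BrS true positive rates, one per pathogen
    psi   : BrS false positive rates, one per pathogen
    Returns (case_states, case_meas, control_meas); states are 1-based.
    """
    rng = random.Random(seed)
    pathogens = list(range(1, len(pi) + 1))

    def measure(state):
        # Pathogen j is detected with probability theta_j when it is the
        # cause (state == j) and with probability psi_j otherwise.
        return [int(rng.random() < (theta[j - 1] if state == j else psi[j - 1]))
                for j in pathogens]

    # Stage (i): draw each case's latent lung state from pi.
    case_states = rng.choices(pathogens, weights=pi, k=n_cases)
    # Stage (ii): draw the binary measurements given the latent state.
    case_meas = [measure(s) for s in case_states]
    # Controls have state 0, so all J indicators use the FPRs.
    control_meas = [measure(0) for _ in range(n_controls)]
    return case_states, case_meas, control_meas
```

For example, `simulate_plcm_brs(500, 500, [0.67, 0.26, 0.07], [0.9, 0.9, 0.9], [0.6, 0.02, 0.05])` draws one data set with the dimensions and parameter values of the simulation in Section 3.3.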



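Similarly, the likelihood contribution from case $i'$’s SS measurements can be written as

\[
P^{1,\mathrm{SS}}_{i'} = \Pr(\boldsymbol{M}^{\mathrm{SS}}_{i'} = \boldsymbol{m} \mid \boldsymbol{\pi}, \boldsymbol{\theta}^{\mathrm{SS}}) = \sum_{j=1}^{J'} \pi_j \cdot \left(\theta^{\mathrm{SS}}_j\right)^{m_j} \left(1 - \theta^{\mathrm{SS}}_j\right)^{1 - m_j} \mathbf{1}_{\left\{\sum_{l=1}^{J'} m_l \leq 1\right\}}, \tag{3.2.3}
\]

for $\boldsymbol{m} = \boldsymbol{m}^{\mathrm{SS}}_{i'}$, noting the perfect specificity of SS measurements, where $J' \leq J$ represents the number of actual SS measurements on each case, and $\boldsymbol{\theta}^{\mathrm{SS}} = (\theta^{\mathrm{SS}}_1, \ldots, \theta^{\mathrm{SS}}_{J'})$. SS measurements only test for a subset of all $J$ pathogens, e.g., blood culture only detects bacteria and $J'$ is the number of bacteria that are potential causes. Finally, the GS measurement $\boldsymbol{M}^{\mathrm{GS}}_{i'}$, which accurately indicates the actual cause for case $i'$, is assumed to follow a multinomial distribution with likelihood: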

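\[
P^{1,\mathrm{GS}}_{i'} = \Pr(\boldsymbol{M}^{\mathrm{GS}}_{i'} = \boldsymbol{m} \mid \boldsymbol{\pi}) = \prod_{j=1}^{J} \pi_j^{\mathbf{1}_{\{m_j = 1\}}}\, \mathbf{1}_{\left\{\sum_j m_j = 1\right\}}, \quad \boldsymbol{m} = \boldsymbol{m}^{\mathrm{GS}}_{i'}. \tag{3.2.4}
\]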

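Combining likelihood components (3.2.1) through (3.2.4), the total model likelihood for BrS, SS, and GS data across independent cases and controls, $L(\boldsymbol{\gamma}; \mathcal{D})$, can be expressed as

\[
\prod_{i: Y_i = 0} P^{0,\mathrm{BrS}}_i \prod_{i': Y_{i'} = 1, \delta_{i'} = 1} P^{1,\mathrm{BrS}}_{i'} \cdot P^{1,\mathrm{SS}}_{i'} \cdot P^{1,\mathrm{GS}}_{i'} \prod_{i'': Y_{i''} = 1, \delta_{i''} = 0} P^{1,\mathrm{BrS}}_{i''} \cdot P^{1,\mathrm{SS}}_{i''}, \tag{3.2.5}
\]

where $\boldsymbol{\gamma} = (\boldsymbol{\theta}^{\mathrm{BrS}}, \boldsymbol{\psi}^{\mathrm{BrS}}, \boldsymbol{\theta}^{\mathrm{SS}}, \boldsymbol{\pi})^T$ stacks all unknown parameters, and data $\mathcal{D}$ is

\[
\left\{\boldsymbol{m}^{\mathrm{BrS}}_i\right\}_{i: Y_i = 0} \cup \left\{\boldsymbol{m}^{\mathrm{BrS}}_{i'}, \boldsymbol{m}^{\mathrm{GS}}_{i'}, \boldsymbol{m}^{\mathrm{SS}}_{i'}\right\}_{i': Y_{i'} = 1, \delta_{i'} = 1} \cup \left\{\boldsymbol{m}^{\mathrm{BrS}}_{i''}, \boldsymbol{m}^{\mathrm{SS}}_{i''}\right\}_{i'': Y_{i''} = 1, \delta_{i''} = 0}
\]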



collects all the available measurements on study subjects. Our primary statistical goal is

to estimate the posterior distribution of the population etiology distribution π, and obtain

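individual etiology ($I^L_*$) prediction given a case’s measurements ($\boldsymbol{m}^{\mathrm{BrS}}_*$, $\boldsymbol{m}^{\mathrm{SS}}_*$), i.e., $\Pr(I^L_* = j \mid \boldsymbol{m}^{\mathrm{BrS}}_*, \boldsymbol{m}^{\mathrm{SS}}_*, \mathcal{D})$, $j = 1, \ldots, J$.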

To enable Bayesian inference, prior distributions on model parameters are specified as

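follows: $\boldsymbol{\pi} \sim \mathrm{Dirichlet}(a_1, \ldots, a_J)$, $\psi^{\mathrm{BrS}}_j \sim \mathrm{Beta}(b_{1j}, b_{2j})$, $\theta^{\mathrm{BrS}}_j \sim \mathrm{Beta}(c_{1j}, c_{2j})$, $j = 1, \ldots, J$, and $\theta^{\mathrm{SS}}_j \sim \mathrm{Beta}(d_{1j}, d_{2j})$, $j = 1, \ldots, J'$. Hyperparameters for etiology prior,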

a1, ..., aJ , are usually 1s to denote equal and non-informative prior weights for each pathogen

if expert prior knowledge is unavailable. The FPR for the jth pathogen, ψBrSj , generally can

be well estimated from control data, thus $b_{1j} = b_{2j} = 1$ is the default choice. For TPR parameters $\theta^{\mathrm{BrS}}_j$ and $\theta^{\mathrm{SS}}_j$, if prior knowledge on TPRs is available, we choose $(c_{1j}, c_{2j})$ so that the 2.5% and 97.5% quantiles of the Beta distribution with parameters $(c_{1j}, c_{2j})$ match the prior minimum and maximum TPR values elicited from pneumonia experts. Otherwise, we use

default value 1s for the Beta hyperparameters. Similarly we choose values of (d1j, d2j) ei-

ther by prior knowledge or default values of 1. We finally assume prior independence of the

parameters as [γ] = [π][ψBrS][θBrS][θSS], where [A] represents the distribution of random

variable or vector A. These priors represent a balance between explicit prior knowledge

about measurement error rates and the desire to be as objective as possible for a particular

study. As described in the next section, the identifiability constraints on the pLCM re-



quire specifying a reasonable subset of parameter values to identify parameters of greatest

scientific interest.
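One way to carry out the quantile matching for the Beta hyperparameters described above is a numerical search. The sketch below is an illustration under stated assumptions, not the procedure used in this work: the names `beta_cdf` and `beta_from_range` are hypothetical, the Beta distribution function is evaluated with a standard continued-fraction expansion, and the search bracket for $c_{1j} + c_{2j}$ (0.5 to $10^4$) is an arbitrary choice that suits ranges such as 10% to 20% or 90% to 97%.

```python
import math


def _betacf(a, b, x, max_iter=500, eps=3e-14, tiny=1e-300):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even and odd steps of the recurrence share the same update.
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def beta_cdf(x, a, b):
    """Distribution function of Beta(a, b) at x."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    bt = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                  + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * _betacf(a, b, x) / a
    return 1.0 - bt * _betacf(b, a, 1.0 - x) / b


def _bisect(f, lo, hi, n_iter=60):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) differ in sign."""
    f_lo = f(lo)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def beta_from_range(q_low, q_high, p_low=0.025, p_high=0.975):
    """Find (c1, c2) so Beta(c1, c2) has quantiles q_low, q_high at p_low, p_high."""
    def mean_given_size(k):
        # For fixed k = c1 + c2, pin the lower quantile by solving for the mean.
        return _bisect(lambda mu: beta_cdf(q_low, mu * k, (1.0 - mu) * k) - p_low,
                       1e-6, 1.0 - 1e-6)

    def upper_gap(log_k):
        k = math.exp(log_k)
        mu = mean_given_size(k)
        return beta_cdf(q_high, mu * k, (1.0 - mu) * k) - p_high

    k = math.exp(_bisect(upper_gap, math.log(0.5), math.log(1e4)))
    mu = mean_given_size(k)
    return mu * k, (1.0 - mu) * k
```

For instance, `beta_from_range(0.10, 0.20)` returns a pair of hyperparameters whose central 95% prior mass lies between 10% and 20%; the result can be checked by evaluating `beta_cdf` at the two endpoints.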

3.2.1 Model identifiability

Potential non-identifiability of LCM parameters is well-known. For example, an LCM

with four observed binary indicators and three latent classes is not identifiable despite providing 15 degrees of freedom to estimate 14 parameters (Goodman, 1974). In principle, the

Bayesian framework avoids the non-identifiability problem in LCMs by incorporating prior

information about unidentified parameter subspaces (Garrett and Zeger, 2000). Many au-

thors point out that the posterior variance for non-identifiable parameters does not decrease

to zero as sample size approaches infinity (e.g., Kadane (1974); Gustafson et al. (2001);

Gustafson (2005)). For scientific investigations, when data are not fully informative about

a parameter, an identified set of parameter values consistent with the observed data shall,

nevertheless, be valuable in a complex application (Gustafson, 2009) like PERCH.

This identifiability issue for the pLCM only occurs in the absence of GS data. Here

we restrict attention to the scenario with only BrS data for simplicity but similar arguments

pertain to the BrS + SS scenario. The problem can be understood from the form of the

marginal positive measurement rates for pathogens among cases. In the pLCM likelihood

for BrS data (only retaining components in (3.2.5) with superscripts BrS), the marginal



positive rate for pathogen j is a convex combination of the TPR and FPR:

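\[
\Pr\left(M^{\mathrm{BrS}}_{i'j} = 1 \mid \pi_j, \theta^{\mathrm{BrS}}_j, \psi^{\mathrm{BrS}}_j\right) = \pi_j \theta^{\mathrm{BrS}}_j + (1 - \pi_j) \psi^{\mathrm{BrS}}_j, \tag{3.2.6}
\]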

where the left-hand side of the above equation can be estimated by the observed marginal

positive rate of pathogen j among cases. Although the control data provide ψBrSj estimates,

the two parameters, πj and θBrSj , are not both identified. GS data, if available, identifies

πj and resolves the lack of identifiability. Otherwise, we need to incorporate prior scien-

tific information on one of them, usually the TPR (θBrSj ), derived from infectious disease

and laboratory experts (Murdoch et al., 2012) and/or from vaccine probe studies (Feikin

et al., 2014). If the observed case marginal positive rate is much higher than the rate in

controls ($\psi^{\mathrm{BrS}}_j$), only large values of TPR ($\theta^{\mathrm{BrS}}_j$) are supported by the data, making etiology

estimation more precise (Section 3.2.2).
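The trade-off between $\pi_j$ and $\theta^{\mathrm{BrS}}_j$ in (3.2.6) can be seen by solving the equation for $\pi_j$ at several assumed TPR values. The sketch below is a plug-in illustration only: it treats the observed marginal rates as if they were the true rates and ignores the joint distribution of pathogens that the full model uses. The function names are hypothetical.

```python
def marginal_case_rate(pi_j, tpr, fpr):
    """Right-hand side of (3.2.6)."""
    return pi_j * tpr + (1.0 - pi_j) * fpr


def implied_etiology_fraction(case_rate, fpr, tpr):
    """Solve (3.2.6) for pi_j given a case positive rate, an FPR and a TPR."""
    if not fpr < tpr:
        raise ValueError("requires fpr < tpr")
    pi_j = (case_rate - fpr) / (tpr - fpr)
    # Clip to [0, 1] when the case rate falls outside [fpr, tpr].
    return min(1.0, max(0.0, pi_j))


if __name__ == "__main__":
    # Case and control positive rates of 25.3% and 0.8%, the marginal
    # BrS rates reported for RSV in Section 3.4.
    for tpr in (0.5, 0.7, 0.9, 0.97):
        print(tpr, implied_etiology_fraction(0.253, 0.008, tpr))
```

Every (TPR, $\pi_j$) pair printed reproduces the same marginal case rate, which is why prior information on the TPR is needed to pin down $\pi_j$ when GS data are absent.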

In more generality, the full model identification can be characterized by inspecting the

Jacobian matrix of the transformation (F ) from model parameters (γ) to the distribution of

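the observables ($\boldsymbol{p}$): $\boldsymbol{p} = F(\boldsymbol{\gamma})$. Let $\boldsymbol{\gamma} = (\boldsymbol{\theta}^{\mathrm{BrS}}, \boldsymbol{\psi}^{\mathrm{BrS}}, \pi_1, \ldots, \pi_{J-1})^T$ represent the $(3J - 1)$-dimensional unconstrained model parameters. The pLCM defines the transformation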

(p1,p0)T = F (γ), where p1 and p0 are the two contingency probability distributions for

the BrS measurements in the case and control populations. It can be shown that the Jacobian

matrix Γ(γ) has J − 1 of its singular values being zero, which means model parameters γ

are not fully identified from the data. The FPRs (ψBrSj , j = 1, ..., J) in pLCM are, however,



identifiable parameters that can be estimated from control data. Therefore, pLCM is termed

partially identifiable (Jones et al., 2010).

3.2.2 Parameter estimation and individual etiology prediction

The parameters in likelihood (3.2.5) include the population etiology distribution (π),

TPRs and FPRs for BrS measurements ($\boldsymbol{\theta}^{\mathrm{BrS}}$ and $\boldsymbol{\psi}^{\mathrm{BrS}}$), and TPRs for SS measurements

(θSS). The posterior distribution of these parameters can be estimated by constructing

approximating samples from the joint posterior via Gibbs sampler. The full conditional

distributions for the Gibbs sampler are detailed in Section 1 of the supplementary material.
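For readers who want to see the structure of such a sampler, the following is a minimal sketch for the BrS-only model. It is derived here from the conjugate Dirichlet and Beta priors of Section 3.2 and is not a reproduction of the full conditionals in the supplementary material or of the WinBUGS implementation. The function name, starting values, and the use of common hyperparameters across pathogens are assumptions made for brevity. With flat priors on the TPRs the model is only partially identified (Section 3.2.1), so informative values for `c` would normally be supplied.

```python
import random


def gibbs_plcm_brs(case_meas, control_meas, n_iter=2000, burn_in=1000,
                   a=1.0, b=(1.0, 1.0), c=(1.0, 1.0), seed=0):
    """Gibbs sampler sketch for the BrS-only pLCM.

    a : common Dirichlet hyperparameter for pi
    b : Beta hyperparameters shared by every FPR psi_j
    c : Beta hyperparameters shared by every TPR theta_j
    Returns a list of (pi, theta, psi) draws kept after burn-in.
    """
    rng = random.Random(seed)
    J = len(case_meas[0])
    n_case, n_ctrl = len(case_meas), len(control_meas)
    # Control counts never change, so tabulate them once.
    ctrl_pos = [sum(m[j] for m in control_meas) for j in range(J)]
    pi, theta, psi = [1.0 / J] * J, [0.9] * J, [0.1] * J
    draws = []
    for it in range(n_iter):
        # 1. Latent cause of each case given the current parameters.
        states = []
        for m in case_meas:
            weights = []
            for j in range(J):
                lik = pi[j]
                for l in range(J):
                    rate = theta[l] if l == j else psi[l]
                    lik *= rate if m[l] else 1.0 - rate
                weights.append(lik)
            states.append(rng.choices(range(J), weights=weights)[0])
        # 2. Etiology fractions: Dirichlet draw via normalized Gamma variates.
        g = [rng.gammavariate(a + states.count(j), 1.0) for j in range(J)]
        pi = [x / sum(g) for x in g]
        # 3. TPRs use cases caused by j; FPRs use controls and all other cases.
        for j in range(J):
            own = [m[j] for m, s in zip(case_meas, states) if s == j]
            other_pos = sum(m[j] for m, s in zip(case_meas, states) if s != j)
            n_other = n_case - len(own)
            theta[j] = rng.betavariate(c[0] + sum(own),
                                       c[1] + len(own) - sum(own))
            psi[j] = rng.betavariate(
                b[0] + ctrl_pos[j] + other_pos,
                b[1] + (n_ctrl - ctrl_pos[j]) + (n_other - other_pos))
        if it >= burn_in:
            draws.append((pi[:], theta[:], psi[:]))
    return draws
```

Posterior means of $\boldsymbol{\pi}$ can then be approximated by averaging the first element of each retained draw; convergence should still be checked with multiple chains as described in the next paragraph.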

We use the freely available software WinBUGS 1.4 to fit the partially-latent class model. Convergence was monitored via Markov chain Monte Carlo (MCMC) chain histories, autocorrelations, kernel density plots, and Brooks-Gelman-Rubin statistics (Brooks and Gelman, 1998). The statistical results below are based on 10,000 iterations of burn-in followed by 10,000 production samples from each of three parallel chains.

The Bayesian framework naturally allows individual within-sample classification (in-

fection diagnosis) and out-of-sample prediction. This section describes how we calculate

the etiology probabilities for an individual with measurements m∗. We focus on the more

challenging inference scenario when only BrS data are available; the general case follows

directly.



The within-sample classification for case i′ is based on the posterior distribution of

latent indicators given the observed data, i.e. Pr(ILi′ = j | D), j = 1, ..., J , which can be

obtained by averaging along the cause indicator (ILi′ ) chain from MCMC samples. For a

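case with new BrS measurements $\boldsymbol{m}_*$, we have

\[
\Pr(I^L_{i'} = j \mid \boldsymbol{m}_*, \mathcal{D}) = \int \Pr(I^L_{i'} = j \mid \boldsymbol{m}_*, \boldsymbol{\gamma}) \Pr(\boldsymbol{\gamma} \mid \boldsymbol{m}_*, \mathcal{D})\, d\boldsymbol{\gamma}, \quad j = 1, \ldots, J, \tag{3.2.7}
\]

where the second factor in the integrand can be approximated by the posterior distribution given current data, i.e., $\Pr(\boldsymbol{\gamma} \mid \mathcal{D})$. For the first term in the integrand, we explicitly obtain the model-based, one-sample conditional posterior distribution,

\[
\Pr(I^L_{i'} = j \mid \boldsymbol{m}_*, \boldsymbol{\gamma}) = \frac{\pi_j \ell_j(\boldsymbol{m}_*; \boldsymbol{\gamma})}{\sum_{m=1}^{J} \pi_m \ell_m(\boldsymbol{m}_*; \boldsymbol{\gamma})}, \quad j = 1, \ldots, J,
\]

where

\[
\ell_j(\boldsymbol{m}_*; \boldsymbol{\gamma}) = \left(\theta^{\mathrm{BrS}}_j\right)^{m_{*j}} \left(1 - \theta^{\mathrm{BrS}}_j\right)^{1 - m_{*j}} \prod_{l \neq j} \left(\psi^{\mathrm{BrS}}_l\right)^{m_{*l}} \left(1 - \psi^{\mathrm{BrS}}_l\right)^{1 - m_{*l}}
\]

is the $j$th mixture component likelihood function evaluated at $\boldsymbol{m}_*$. The log relative probability of $I^L_i = j$ versus $I^L_i = l$ is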

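\[
R_{jl} = \log\left(\frac{\pi_j}{\pi_l}\right) + \log\left\{\left(\frac{\theta^{\mathrm{BrS}}_j}{\psi^{\mathrm{BrS}}_j}\right)^{m_{*j}} \left(\frac{1 - \theta^{\mathrm{BrS}}_j}{1 - \psi^{\mathrm{BrS}}_j}\right)^{1 - m_{*j}}\right\} + \log\left\{\left(\frac{\psi^{\mathrm{BrS}}_l}{\theta^{\mathrm{BrS}}_l}\right)^{m_{*l}} \left(\frac{1 - \psi^{\mathrm{BrS}}_l}{1 - \theta^{\mathrm{BrS}}_l}\right)^{1 - m_{*l}}\right\}.
\]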

The form of Rjl informs us about what is required for correct diagnosis of an individual.

Suppose $I^L_i = j$, then averaging over $\boldsymbol{m}_*$, we have $E[R_{jl}] = \log(\pi_j / \pi_l) + I(\theta^{\mathrm{BrS}}_j; \psi^{\mathrm{BrS}}_j) + I(\psi^{\mathrm{BrS}}_l; \theta^{\mathrm{BrS}}_l)$, where $I(v_1; v_2) = v_1 \log(v_1 / v_2) + (1 - v_1) \log\left((1 - v_1)/(1 - v_2)\right)$ is the information divergence (Kullback, 2012) that represents the expected amount of information in $m_{*j} \sim \mathrm{Bernoulli}(v_1)$ for discriminating against $m_{*j} \sim \mathrm{Bernoulli}(v_2)$. If $v_1 = v_2$, then $I(v_1; v_2) = 0$. The form of $E[R_{jl}]$ shows that the BrS data carry additional information about an individual’s etiology only when there is a difference between $\theta^{\mathrm{BrS}}_j$ and $\psi^{\mathrm{BrS}}_j$, $j = 1, \ldots, J$.
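The expression for $E[R_{jl}]$ can be checked numerically by enumerating the four possible values of $(m_{*j}, m_{*l})$ under the assumption that pathogen $j$ is the true cause. The sketch below does this with hypothetical function names; it is a verification aid rather than part of the estimation procedure.

```python
import math


def info_divergence(v1, v2):
    """I(v1; v2): divergence of Bernoulli(v1) from Bernoulli(v2)."""
    return (v1 * math.log(v1 / v2)
            + (1.0 - v1) * math.log((1.0 - v1) / (1.0 - v2)))


def log_relative_prob(mj, ml, pi_j, pi_l, theta_j, psi_j, theta_l, psi_l):
    """R_jl for observed indicators m_*j = mj and m_*l = ml."""
    term_j = (math.log(theta_j / psi_j) if mj
              else math.log((1.0 - theta_j) / (1.0 - psi_j)))
    term_l = (math.log(psi_l / theta_l) if ml
              else math.log((1.0 - psi_l) / (1.0 - theta_l)))
    return math.log(pi_j / pi_l) + term_j + term_l


def expected_log_relative_prob(pi_j, pi_l, theta_j, psi_j, theta_l, psi_l):
    """E[R_jl] when the true cause is j, by direct enumeration."""
    total = 0.0
    for mj in (0, 1):
        for ml in (0, 1):
            # Under cause j: m_*j ~ Bernoulli(theta_j), m_*l ~ Bernoulli(psi_l).
            p = ((theta_j if mj else 1.0 - theta_j)
                 * (psi_l if ml else 1.0 - psi_l))
            total += p * log_relative_prob(mj, ml, pi_j, pi_l,
                                           theta_j, psi_j, theta_l, psi_l)
    return total
```

The enumerated expectation agrees with $\log(\pi_j/\pi_l) + I(\theta_j; \psi_j) + I(\psi_l; \theta_l)$, and both divergence terms vanish when the TPR and FPR of a pathogen coincide.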

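Following (3.2.7), we average $\Pr(I^L_{i'} = j \mid \boldsymbol{m}_*, \boldsymbol{\gamma})$ over MCMC iterations with $\boldsymbol{\gamma}$ replaced by its simulated values $\boldsymbol{\gamma}^{(*)}$ at each iteration. Repeating for $j = 1, \ldots, J$, we obtain a $J$-dimensional probability vector, $\boldsymbol{p}_{i'} = (p_{i'1}, \ldots, p_{i'J})^T$, that sums to one. This scheme is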

especially useful when a newly examined case has a BrS measurement pattern not observed

in D, which often occurs when J is large. The final decisions regarding which pathogen

to treat can then be based upon estimated pi′ . In particular, the pathogen with largest

posterior value might be selected. It is Bayes optimal under mean misclassification loss.

Individual etiology predictions described here generalize the positive/negative predictive

value (PPV/NPV) from single to multivariate binary measurements and can aid diagnosis

of case subjects under other user-specified misclassification loss functions.
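The averaging step in (3.2.7) and the choice of the most probable pathogen can be sketched as follows. The code assumes that posterior draws are available as a list of `(pi, theta, psi)` tuples, which is a hypothetical storage format; the function names are likewise illustrative.

```python
def conditional_etiology(m, pi, theta, psi):
    """Pr(I^L = j | m_*, gamma), j = 1..J, at one parameter value gamma."""
    weights = []
    for j in range(len(pi)):
        lik = pi[j]
        for l, ml in enumerate(m):
            # TPR for the candidate cause j, FPR for every other pathogen.
            rate = theta[l] if l == j else psi[l]
            lik *= rate if ml else 1.0 - rate
        weights.append(lik)
    total = sum(weights)
    return [w / total for w in weights]


def predict_etiology(m, draws):
    """Average the conditional probabilities over posterior draws of gamma."""
    J = len(draws[0][0])
    avg = [0.0] * J
    for pi, theta, psi in draws:
        for j, p in enumerate(conditional_etiology(m, pi, theta, psi)):
            avg[j] += p / len(draws)
    return avg


def most_probable_cause(probs):
    """1-based index of the pathogen with the largest probability."""
    return max(range(len(probs)), key=probs.__getitem__) + 1
```

Passing a single tuple of fixed parameter values as `draws` gives a plug-in prediction; passing the full set of MCMC draws propagates parameter uncertainty into the individual prediction.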

3.3 Simulation for three pathogens case with GS and BrS data

One key question for studies like PERCH is what fraction of the total evidence about

etiology derives from the BrS sources relative to from GS or SS sources if available. In



this simulation, we illustrate the extent to which BrS case-control data can supplement

observation of the etiologic agent directly from the site of infection. We discuss the role of

SS measurements in Section 3.4 through application to the PERCH data set.

We simulate BrS data sets with 500 cases and 500 controls for three pathogens, A, B,

and C using pLCM specifications. We focus on three pathogens to facilitate viewing of the

π estimates and individual predictions in the 3-dimensional simplex S2. We use the ternary

diagram (Aitchison, 1986) representation where the vector π = (πA, πB, πC)T is encoded

as a point with each component being the perpendicular distance to one of the three sides.

The parameters involved are fixed at TPR = θ = (θA, θB, θC)T = (0.9, 0.9, 0.9)T , FPR =

ψ = (ψA, ψB, ψC)T = (0.6, 0.02, 0.05)T , and π = (πA, πB, πC)T = (0.67, 0.26, 0.07)T .

We focus on BrS and GS data here and drop the “BrS” superscript on the parameters for

simplicity. We further let the fraction of cases with GS measurements (∆) be either 1%

or 10%. Although GS measurements are rare in the PERCH study, we investigate a large

range of ∆ to understand in general how much statistical information is contained in BrS

measurements relative to GS measurements.

For any given data set, three distinct subsets of the data can be used: BrS-only, GS-

only, and BrS+GS, each producing its posterior mean of π, and 95% credible region by

transformed Gaussian kernel density estimator for compositional data (Chacon et al., 2011).

To study the relative importance of the GS and BrS data, the primary quantity of interest in

the simulations is the relative sizes of the credible regions for each data mix. Here, we use

uniform priors on θ, ψ, and Dirichlet(1, ..., 1) prior for π. The results are shown in Figure



3.2.

Figure 3.2: Population and individual etiology estimations for a single sample with 500 cases and 500 controls with true $\boldsymbol{\pi} = (0.67, 0.26, 0.07)^T$ and either 1% ($N = 5$) or 10% ($N = 50$) GS data on cases. In panels (a) and (b), the red circled plus shows the true population etiology distribution $\boldsymbol{\pi}$. The closed curves are 95 percent credible regions: blue dashed lines, light green solid lines, and black dotted lines correspond to analysis using BrS data only, BrS+GS data, and GS data only, respectively; the solid square, dot, and triangle are the corresponding posterior means of $\boldsymbol{\pi}$. The 95 percent highest density region of the uniform prior distribution is also shown as a red dash-dotted line for comparison. The 8 ($= 2^3$) BrS measurement patterns and predictions for individual children are shown with different shapes, with measurement patterns attached to them. The radii of circles and numbers at the vertices show empirical frequencies of GS measurements belonging to A, B, or C.

First, in Figures 3.2(a) (1% GS) and 3.2(b) (10% GS), each region covers the true etiology $\boldsymbol{\pi}$. In data not shown here, the nominal 95% credible regions cover slightly more than

95% of 100 simulations. Credible regions narrow in on the truth as we combine BrS and GS

data, and as the fraction of subjects with GS data (∆) increases. Also, the posterior mean

from the BrS+GS analysis is a result of optimal balance between information contained in

the GS and BrS data.



We then fix ψ and π, while varying the TPR θ on the grid (0.6, 0.7, 0.8, 0.9, 0.95, 0.99)

to test the estimation performance under a variety of signal-to-noise ratios, measured by

the difference between the TPRs and FPRs. At each (θ, ∆) grid point, we run analy-

ses on each of 100 simulated data sets. We quantify the gain in precision by adding the

BrS data to the GS data following Xu and Zeger (2001). For pathogen A, let

\[
g_A(\theta) = \frac{d^{0}_A - d^{\mathrm{BrS+GS}}_A(\theta)}{d^{0}_A - d^{\mathrm{GS}}_A(\theta)},
\]

where $d^{0}_A$, $d^{\mathrm{GS}}_A(\theta)$ and $d^{\mathrm{BrS+GS}}_A(\theta)$ are the length of the 95% highest density interval from the prior, the length of the 95% credible interval using GS data, and the length of the 95% credible interval using BrS and GS data, respectively. This quantity ($g_A(\theta)$) is the ratio of the reduction of the 95% interval widths with and without the BrS data at TPR value $\theta$. If $g_A(\theta) = 1$, then there is no additional gain in the precision of $\pi_A$ when BrS data is added to GS data. When $\Delta = 1\%$, we observe the expected increase in $g_A$ as TPR $\theta$ approaches 1. For pathogen A, $g_A(0.8)$ has mean value 1.7 across 100 simulated data sets with standard error 0.3; $g_A(0.95)$ further increases to 3.0 (standard error 0.3). Similar patterns are also observed for pathogens B and C.
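The gain measure is a simple ratio of interval-length reductions, sketched below with a hypothetical function name. The only number computed here is the prior interval length for three pathogens: under a Dirichlet(1, 1, 1) prior, $\pi_A$ is marginally Beta(1, 2) with decreasing density $2(1 - x)$, so its 95% highest density interval is $[0, q]$ with $1 - (1 - q)^2 = 0.95$. The credible interval lengths themselves must come from fitted models and are not reproduced here.

```python
import math


def precision_gain(d_prior, d_gs, d_brs_gs):
    """g = (d_prior - d_brs_gs) / (d_prior - d_gs) for one pathogen."""
    if d_gs >= d_prior:
        raise ValueError("GS-only interval must be shorter than the prior interval")
    return (d_prior - d_brs_gs) / (d_prior - d_gs)


# Length of the 95% highest density interval of Beta(1, 2), the marginal
# prior of pi_A under Dirichlet(1, 1, 1): q = 1 - sqrt(0.05).
D_PRIOR_A = 1.0 - math.sqrt(0.05)
```

A value of `precision_gain` above 1 means the BrS+GS interval is shorter than the GS-only interval, and a value of exactly 1 means the BrS data added nothing.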

Using the same simulated data sets, Figures 3.2(a) and 3.2(b) also show individual etiology predictions for each of the 8 ($= 2^3$) possible BrS measurements $(m_A, m_B, m_C)^T$, $m_j \in \{0, 1\}$, obtained by the methods from Section 3.2.2. Consider the example of a newly enrolled case without GS data and with no pathogen observed in her BrS data: $\boldsymbol{m} = (0, 0, 0)$.

Suppose she is part of a case population with 10% GS data. In the case illustrated in

Figure 3.2(b), her posterior predictive distribution has highest posterior probability (0.76)

on pathogen A reflecting two competing forces: the FPRs that describe background colonization (colonization among the controls) and the population etiology distribution. Given

other parameters, m = (0, 0, 0) gives the smallest likelihood for ILi = A because of its

high FPR that reflects its background colonization rate, ψA = 0.6. However, prior to ob-

serving (0, 0, 0), πA is well estimated to be much larger than πB and πC . Therefore the

posterior distribution for this case is heavily weighted towards pathogen A.

For a case with observation (1, 1, 1), because it is rare to observe pathogen B in a case

whose pneumonia is not caused by B, the prediction favors B. Although B is not the most

prevalent cause among cases, the presence of B in the BrS measurements gives the largest

likelihood when ILi = B. For any measurement pattern with a single positive, the case is

always classified into that category in this example.

Most predictions are stable with increasing ∆. Only 000 cases have predictions that

move from near the center to the corner of A. This is mainly because the TPR θ and

etiology fractions π are not as precisely estimated in GS-scarce scenarios relative to GS-

abundant ones. Averaging over a wider range of θ and π produces 000 case predictions

that are ambiguous, i.e. near the center. As ∆ increases, parameters are well estimated, and

precise predictions result.

3.4 Analysis of PERCH data

The Pneumonia Etiology Research for Child Health (PERCH) study is a standardized

and comprehensive evaluation of etiologic agents causing severe and very severe pneumo-



nia among hospitalized children aged 1-59 months in seven low and middle income coun-

tries. The study sites include countries with a significant burden of childhood pneumonia

and a range of epidemiologic characteristics (Levine et al., 2012). PERCH is a case-control

study that has enrolled over 4,000 patients hospitalized for severe or very severe pneumonia and over 5,000 controls selected randomly from the community frequency-matched on

age in each month. More details about the PERCH design are available in Deloria-Knoll

et al. (2012).

To illustrate the application of pLCM model for the analysis of PERCH study data, we

have focused on preliminary data from one site with good availability of both SS and BrS

laboratory results. Results for all 7 countries will be reported elsewhere upon study com-

pletion. Included in the current illustrative analysis are BrS data (nasopharyngeal specimen

with PCR detection of pathogens) for 432 cases and 479 frequency-matched controls on 11

species of pathogens (7 viruses and 4 bacteria with their abbreviations in Figure 3.3, and

full names in Section 2 of the supplementary material), and SS data (blood culture results)

on the 4 bacteria for only the cases.

In PERCH, prior scientific knowledge of misclassification rates is incorporated into

the analysis. The TPR of our BrS measurements, θBrSj is assumed to be in the range of

90%−97% (Murdoch et al., 2012). Observations from vaccine probe studies—randomized

clinical trials of pathogen-specific vaccines in which non-specific clinical endpoints such

as clinical pneumonia are evaluated thereby revealing the contribution of the pathogen to

the burden of that syndrome— illustrate that the total number of clinical pneumonia cases



prevented by the vaccine is much larger than the few laboratory-confirmed cases prevented.

Comparing the total preventable disease burden to the number of blood culture (SS) pos-

itive cases prevented provides information about the TPR of the bacterial blood culture

measurements, θSSj , j = 1, ..., 4. In our analysis, we use the range 10 − 20% for the SS

TPRs of four bacteria. We set Beta priors that match these ranges (Section 3.2) and as-

sumed Dirichlet(1, ..., 1) prior on etiology fractions π.

In latent variable models like the pLCM, key variables are not directly observed. It is

therefore essential to picture the model inputs and outputs side-by-side to better understand

the analysis performed. In this spirit, Figure 3.3 displays for each of the 11 pathogens, a

summary of the BrS and SS data in the left two columns, along with some of the interme-

diate model results; and the prior and posterior distributions for the etiology fractions on

the right (rows ordered by posterior means). The observed BrS rates (with 95% confidence

intervals) for cases and controls are shown on the far left with solid dots. The conditional

odds ratio contrasting the case and control rates given the other pathogens is listed with

95% confidence interval in the box to the right of the BrS data summary. Below the case

and control observed rates is a horizontal line with a triangle. From left to right, the line

starts at the estimated false positive rate (FPR, ψBrSj ) and ends at the estimated true positive

rate (TPR, θBrSj ), both obtained from the model. Below the TPR are two boxplots sum-

marizing its posterior (top) and prior (bottom) distributions for that pathogen. These box

plots show how the prior assumption influences the TPR estimate as expected given the

identifiability constraints discussed in Section 3.2.1. The triangle on the line is the model



estimate of the case rate to compare to the observed value above it. As discussed in Section

3.2.1, the model-based case rate is a linear combination of the FPR and TPR with mixing

fraction equal to the estimated etiology fraction. Therefore, the location of the triangle,

expressed as a fraction of the distance from the FPR to the TPR, is the model-based point

estimate of the etiologic fraction for each pathogen. The SS data are shown in a similar

fashion to the right of the BrS data. By definition, the FPR is 0.0 for SS measures and

there is no control data. The observed rate for the cases is shown with its 95% confidence

interval. The estimated SS TPR (θSSj ) with prior and posterior distributions is shown as

for the BrS data, except that we plot 95% and 50% credible intervals for SS TPR above its

prior distribution boxplot.

On the right side of the display are the marginal posterior and prior distributions of the

etiologic fraction for each pathogen. We appropriately normalized each density to match

the height of the prior and posterior curves. The posterior mean with 50% and 95% credible

intervals are shown above the density.

Figure 3.3 shows that respiratory syncytial virus (RSV), Streptococcus pneumoniae

(PNEU), rhinovirus (RHINO), and human metapneumovirus (HMPV A B) occupy the

greatest fractions of the etiology distribution, from 10% to 30% each. That RSV has the

largest estimated mean etiology fraction reflects the large discrepancy between case and

control positive rates in the BrS data: 25.3% versus 0.8% (marginal odds ratio 38.5, 95% CI (18, 128.7)), as shown on the left of the display. RHINO has marginal case and control

rates that are close to each other, yet its estimated mean etiology fraction is 15.9%. This



is because the model considers the joint distribution of the pathogens, not the marginal

rates. The conditional odds ratio of case status with RHINO given all the other pathogen

measures is estimated to be 1.5 (1.1, 2.1) as compared to the marginal odds ratio close to 1

(0.8, 1.3).

As discussed in Section 3.2.1, the data alone cannot precisely estimate both the etiologic

fractions and TPRs absent prior knowledge. This is evidenced by comparing the prior and

posterior distributions for the TPRs in the BrS boxes for each pathogen (i.e. left hand

column of Figure 3.3). The posteriors are similar to their priors, indicating that little about

the TPRs is learned from the data. The posteriors for some pathogens making up π (i.e. shown

in the right hand column of Figure 3.3) are likely to be sensitive to the prior specifications

of the TPRs.

We performed sensitivity analyses using multiple sets of priors for the TPRs. At one

extreme, we ignored background scientific knowledge and let the priors on the FPR and

TPR be uniform for both the BrS and SS data. The results are shown in Figure 3.5. Ignor-

ing prior knowledge about error rates lowers the etiology estimates of the bacteria PNEU

and Haemophilus influenzae (HINF). The substantial reduction in the etiology fraction for

PNEU, for example, is a result of the difference in the TPR prior for the SS measurements.

In the original analysis (Figure 3.3), the informative prior on the SS sensitivity (TPR) places

95% of its mass between 10% and 20%. Hence the model assumes almost 85% of the PNEU infections

are being missed in the SS sampling. When a uniform prior is substituted (Figure

3.4), the fraction assumed missed is greatly reduced. For RSV, its posterior mean etiology



Figure 3.3: Results using expert priors on TPRs. The observed BrS rates (with 95% confidence intervals) for cases and controls are shown on the far left. The conditional odds ratio given the other pathogens is listed with 95% confidence interval in the box to the right of the BrS data summary. Below the case and control observed rates is a horizontal line with a triangle. From left to right, the line starts at the estimated false positive rate (FPR, $\psi_j^{\text{BrS}}$) and ends at the estimated true positive rate (TPR, $\theta_j^{\text{BrS}}$), both obtained from the model. Below the TPR are two boxplots summarizing its posterior (top) and prior (bottom) distributions. The location of the triangle, expressed as a fraction of the distance from the FPR to the TPR, is the model-based point estimate of the etiologic fraction for each pathogen. The SS data are shown in a similar fashion to the right of the BrS data. The observed rate for the cases is shown with its 95% confidence interval. The estimated SS TPR ($\theta_j^{\text{SS}}$) with prior and posterior distributions is shown as for the BrS data, except that we plot 95% and 50% credible intervals for SS TPR above the boxplot for its prior distribution. See Appendix for pathogen name abbreviations.



fraction increases from 27.3% to 31.7%. The etiology estimates for other pathogens are

fairly stable, with changes in posterior means between −0.4% and 3.3%.

Under the original priors for TPR, PARA1 has an estimated etiologic fraction of 5.2%,

even though it has conditional odds ratio 5.8 (2.5, 15). In general, pathogens with larger

conditional odds ratios have larger etiology fraction estimates. Also, a pathogen still needs

a reasonably high observed case positive rate to be allocated a high etiology fraction. The

posterior etiology fraction estimate of 5.2% for PARA1 results because the prior for the

TPR takes values in the range of 0.9 − 0.97. By Equation (3.2.6), the TPR weight in the

convex combination with FPR (around 1.5%) has to be very small to explain the small

observed case rate 5.5%. When a uniform prior is placed on TPR instead, the PARA1

etiology fraction increases to 10.2% with a wider 95% credible interval (Figure 3.4).
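The arithmetic behind the PARA1 result can be sketched with a simple moment-matching calculation. It assumes the observed case positive rate is approximately the convex combination π·TPR + (1 − π)·FPR, in line with the triangle described for Figure 3.3. The function name `implied_fraction` and the TPR value 0.50 are illustrative choices, not quantities taken from the fitted model.

```python
def implied_fraction(case_rate, tpr, fpr):
    """Solve case_rate = pi * tpr + (1 - pi) * fpr for pi."""
    if tpr <= fpr:
        raise ValueError("TPR must exceed FPR")
    return (case_rate - fpr) / (tpr - fpr)


# Values quoted in the text for PARA1: case rate 5.5%, FPR around 1.5%.
case_rate, fpr = 0.055, 0.015

# 0.97 and 0.90 bracket the expert prior range; 0.50 is a hypothetical lower TPR.
for tpr in (0.97, 0.90, 0.50):
    pi = implied_fraction(case_rate, tpr, fpr)
    print(f"TPR = {tpr:.2f} -> implied etiologic fraction = {pi:.3f}")
```

These values only indicate the direction of the effect: a lower TPR requires a larger π to explain the same case rate. They are not the posterior means reported above, which come from the full joint model.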

Furthermore, when uniform priors on TPR and FPR are used, PARA1 is still allocated

a smaller etiology fraction than RHINO despite PARA1 having a larger conditional odds

ratio. This is related to the dependence structure among case measurements. RHINO

has the highest negative association with RSV among cases (standardized log odds ratio

−14). Under the conditional independence assumption of the pLCM, this dependence is

partly induced by the multinomial covariance between the latent cause indicators $1\{I_i^L = \text{RSV}\}$

and $1\{I_i^L = \text{RHINO}\}$, which equals $-\pi_{\text{RSV}}\pi_{\text{RHINO}}$. RSV has strong evidence as a frequent cause

with a stable estimate of $\pi_{\text{RSV}}$ around 30%. The strong negative association in the cases’

measurements between RHINO and RSV is contributing to the increased etiologic fraction

estimate $\pi_{\text{RHINO}}$ relative to other pathogens that have less or no association with RSV



Figure 3.4: Results using uniform priors on TPRs. As in Figure 3.3 with uniform priors on the TPRs.



Figure 3.5: Summary of posterior distribution of pneumonia etiology estimates using expert (left) and uniform (right) priors on TPRs. In each subfigure, top: posterior (solid) and prior (dashed) distribution of viral etiology; bottom left: posterior etiology distribution for top two bacterial causes given bacteria is a cause; bottom right: posterior etiology distribution for top two viral causes given virus is a cause. B-rest and V-rest stand for the rest of bacteria and viruses other than the top two species, respectively. The nested blue circles are 95%, 80%, and 50% credible regions for population etiology estimates within bacterial or viral group.



among the cases. The conditional independence assumption is leveraging information from

the associations between pathogens in estimation of the etiologic fractions.
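The negative dependence induced by a single latent cause can be illustrated by simulation. The sketch below assumes each case has exactly one cause drawn from a categorical distribution; the fractions in `pi` are hypothetical values rounded from the text, and `indicator_covariance` is an illustrative helper, not part of the pLCM software.

```python
import random


def indicator_covariance(pi, a, b, n_draws=100_000, seed=0):
    """Monte Carlo covariance of 1{I = a} and 1{I = b} for I ~ Categorical(pi)."""
    rng = random.Random(seed)
    labels = list(pi)
    weights = [pi[label] for label in labels]
    draws = rng.choices(labels, weights=weights, k=n_draws)
    mean_a = sum(d == a for d in draws) / n_draws
    mean_b = sum(d == b for d in draws) / n_draws
    # With a single cause per case the two indicators are never both 1,
    # so the cross moment is zero and the covariance is -mean_a * mean_b.
    mean_ab = sum((d == a) and (d == b) for d in draws) / n_draws
    return mean_ab - mean_a * mean_b


# Hypothetical etiologic fractions (RSV and RHINO rounded from the text).
pi = {"RSV": 0.30, "RHINO": 0.16, "OTHER": 0.54}
estimate = indicator_covariance(pi, "RSV", "RHINO")
exact = -pi["RSV"] * pi["RHINO"]
print(f"simulated: {estimate:.4f}, exact: {exact:.4f}")
```

Because the covariance scales with the product of the two fractions, a frequent cause such as RSV produces its strongest negative association with the other frequent causes.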

We checked the model in two ways, each comparing a characteristic of the joint distribution

of the observed measurements with the same characteristic of the distribution of

new measurements generated by the model for a population of the same size. By generating

the new data characteristics at every iteration of the MCMC chain, we obtain their

predictive distribution, averaged over the posterior distribution of the parameters,

as discussed in Garrett and Zeger (2000). Figure 3.6 displays the observed frequency

of the 10 most common measurement outcomes for the BrS data, separately for cases and

controls to compare to the predictive distributions based upon the model. Among the cases,

the 95% predictive interval includes the observed values in all but two of the BrS patterns

and even there the fits are reasonable. Among the controls, there is evidence of lack of fit

for the most common BrS pattern with only PNEU and HINF. There are fewer cases with

this pattern observed than predicted under the pLCM. This lack of fit is due to associations

of pathogen measurements in control subjects. Note that the FPR estimates remain consistent

regardless of such correlation as the number of controls increases; however, the posterior

variances for them may be underestimated.

Figure 3.7 presents standardized log odds ratios (SLORs) for cases (lower triangle)

and controls (upper triangle). Each entry is the observed log odds ratio (LOR) for a pair of BrS

measurements minus the mean of the LOR predictive distribution, divided by

the standard deviation of the LOR predictive distribution. The first significant digit of the



Figure 3.6: Posterior predictive checking for 10 most frequent BrS measurement patterns among cases and controls with expert priors on TPRs.



Figure 3.7: Posterior predictive checking for pairwise odds ratios separately for cases (lower right triangle) and controls (upper left triangle) with expert priors on TPRs. Each entry is a standardized log odds ratio (SLOR): the observed log odds ratio for a pair of BrS measurements minus the mean LOR for the posterior predictive distribution divided by the standard deviation of the posterior predictive distribution. The first significant digit of absolute SLORs is shown in red for positive and blue for negative values, and only those greater than 2 are shown.



absolute SLOR is shown in blue for negative and red for positive values. Absolute SLORs

less than 2 are omitted from the table for graphical effect. We see two large deviations

among the cases: RSV with RHINO and RSV with HMPV. These are caused by strong

seasonality in RSV that is out of phase with weaker seasonality in the other two. Otherwise,

the associations are roughly what is expected under the assumed model.
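The SLOR computation described above can be sketched as follows. The 0.5 continuity correction, the counts, and the predictive draws are all hypothetical and chosen only for illustration; in practice the predictive LORs would be computed from data sets replicated at each MCMC iteration.

```python
import math
import statistics


def log_odds_ratio(n11, n10, n01, n00, correction=0.5):
    """Log odds ratio of a 2x2 table; the 0.5 correction is an assumption."""
    return math.log(((n11 + correction) * (n00 + correction))
                    / ((n10 + correction) * (n01 + correction)))


def standardized_lor(observed_lor, predictive_lors):
    """(observed LOR - predictive mean) / predictive standard deviation."""
    mean = statistics.mean(predictive_lors)
    sd = statistics.stdev(predictive_lors)
    return (observed_lor - mean) / sd


# Hypothetical counts for one pathogen pair and hypothetical predictive draws.
observed = log_odds_ratio(n11=5, n10=95, n01=145, n00=255)
predictive = [-0.2, 0.1, 0.0, 0.3, -0.1, 0.2, -0.3, 0.0]
slor = standardized_lor(observed, predictive)
print(f"observed LOR = {observed:.2f}, SLOR = {slor:.1f}")
```

An absolute SLOR above 2 flags a pair whose observed association lies in the tails of what the model predicts.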

An attractive feature of using MCMC to estimate posterior distributions is the ease of

estimating posteriors for functions of the latent variables and/or parameters. One interesting

question from a clinical perspective is whether viruses or bacteria are the major cause and

among each subgroup, which species predominate. Figure 3.5 shows the posterior distribu-

tion using expert TPR prior for viruses versus bacteria on the top, and then the conditional

distributions of the two leading bacteria (viruses) among bacterial (viral) causes below. The

posterior shape of the viral etiologic fraction is more concentrated compared to the prior

shape, with mode around 63% and 95% credible interval (54%, 71%). Of all viral cases,

RSV is estimated to cause about 43% (36%, 51%), and RHINO about 25% (17%, 34%).

PNEU accounts for most bacterial cases (71% (48%, 87%)), and HINF accounts for 19%

(4%, 42%). In both the viral and bacterial categories, the 95% credible interval for the

most common pathogen does not overlap that of the second most common one.



3.5 Discussion

In this paper, we estimated the frequency with which pathogens cause disease in a

case population using a partially-latent class model (pLCM) to allow for known states

for a subset of subjects and for multiple types of measurements with different error rates.

In a case-control study of disease etiology, measurement error will bias estimates from

traditional logistic regression and attributable fraction methods. The pLCM avoids this

pitfall and more naturally incorporates multiple sources of data. Here we considered three

levels of measurement error rates.

Absent GS data, we show that the pLCM is only partially identified because of the

relationship between the estimated TPR and prevalence of the associated pathogen in the

population. Therefore, the inferences are sensitive to the assumptions about the TPR. Un-

certainty about their values persists in the final inferences from the pLCM regardless of the

number of subjects studied.

The current model provides a novel solution to the analytic problems raised by the

PERCH Study. This paper illustrates the design and application of the pLCM using a

preliminary and limited set of data from one PERCH study site. Confirmatory laboratory

testing, incorporation of additional pathogens, and adjustment for various factors are likely

to change the scientific findings that will be reported in the complete analysis of the study

results.

An essential assumption relied upon in the pLCM is that the probability of detecting

one pathogen at a peripheral body site depends on whether that pathogen is infecting the



child’s lung, but is unaffected by the presence of other pathogens in the lung, that is, the

non-differential misclassification error assumption, $[M^{SS}_{ij} \mid I^L_i = l] = [M^{SS}_{ij} \mid I^L_i = k]$,

$\forall\, l, k \neq j$. We have formulated the model to include GS measures even though they are

infrequently available from PERCH cases. In general, the availability of GS measures

makes it possible to test this assumption as has been discussed by Albert and Dodd (2008).

Several extensions have potential to improve the quality of inferences drawn and are

being developed for PERCH. First, because the control subjects have known class, we can

model the dependence structure among the BrS measurements and use this to avoid aspects

of the conditional independence assumption central to most LCM methods. The approach

is to extend the pLCM to have K subclasses within each of the current disease classes.

These subclasses can introduce correlation among the BrS measurements given the true

disease state. An interesting question concerns the bias-variance trade-off for different values

of $K$. This idea follows previous work on the PARAFAC decomposition of probability

distributions for multivariate categorical data (Dunson and Xing, 2009). This extension will

enable model-based checking of the standard pLCM.

Second, in our analyses to date, we have assumed that the pneumonia case definition

is error-free. Given new biomarkers and availability of chest radiograph that can improve

upon the clinical diagnosis of pneumonia, one can introduce an additional latent variable

to indicate true disease status and use these measurements to probabilistically assign each

subject as a case or control. Finally, regression extensions of the pLCM will allow PERCH

investigators to study how the etiology distributions vary with age group and season.


Chapter 4

Nested Partially-Latent Class Models

(npLCM) for Estimating Disease

Etiology in Case-Control Studies



Abstract

The Pneumonia Etiology Research for Child Health (PERCH) study attempts to infer the

distribution of pneumonia-causing bacterial or viral pathogens in developing countries from

measurements outside of the lung. Recent developments in test standardization make it pos-

sible to collect multiple specimens to detect a large number of pathogens at once with vary-

ing degrees of etiologic relevance and measurement precision. With this data, researchers

seek to estimate the population fraction of cases caused by each pathogen, and to develop

algorithms to assist clinical diagnosis when presented with complex data on an individual

case.

We describe a latent variable model to address these two analytic goals using data from

a case-control design. We assume each observation is a draw from a mixture model for

which each component represents one pathogen. Conditional dependence among multi-

variate binary measurements on a single subject is induced by nesting subclasses within

each disease class. Measurement precision can be estimated using the control sample for

whom the etiologic class is known. We assume the measurement precision is independent

of the disease status. We use stick-breaking priors on the subclass weights to estimate the

population and individual etiologic distributions that are averaged across models indexed



by different numbers of subclasses. Assessment of model fit and individual diagnosis are

done using posterior samples drawn by Gibbs Sampling. We demonstrate the method’s op-

erating characteristics via a simulation study tailored to the motivating scientific problem

and illustrate the model with a detailed analysis of PERCH study data.



4.1 Introduction

Multivariate binary data are a common outcome in disease etiology studies (Hammitt

et al., 2012), verbal autopsy studies (King and Lu, 2008; King et al., 2010) and genomic

studies (Hoff, 2005). For example, in the Pneumonia Etiology Research for Child Health

(PERCH) study of childhood pneumonia (Levine et al., 2012), a vector of presence/absence

for up to 30 different pathogens is measured by polymerase chain reaction (PCR) using

specimens from the nasopharyngeal cavity. A goal is to use the multivariate binary re-

sponses to infer the pathogen in the child’s lung causing pneumonia. Assuming only one

unknown pathogen has caused each case’s disease, public health researchers are interested

in clustering cases into groups, each with a different pathogen causing its pneumonia, and

then estimating the fraction of each group. Such knowledge about compositional structure

in the case population is useful for designing disease prevention programs and prioritizing

treatments. We term these fractions as etiologic fractions: probabilities that sum to one

with each component corresponding to a pathogen cause or disease class.

The dependence structure among the observed binary measurements has two primary

sources: 1) the multinomial variation in unobserved disease class indicators among cases,

and 2) given disease class, the conditional dependence among the imperfect measurements.

To distinguish these two sources and to infer the disease class for individual cases, latent

class models (LCM) are commonly used to connect an individual’s measurement M to her

unobserved class indicator I through the likelihood [M | I,Θ], where Θ denotes the col-

lection of parameters (Goodman, 1974). For binary measures, Θ includes sensitivities and



specificities. Under local identifiability conditions (Jones et al., 2010), joint maximization

of the model likelihood $\sum_{j=1}^{J} [M \mid I = j, \Theta] \Pr(I = j)$ with respect to all unknowns gives

parameter estimates $\hat{\Theta}$ and etiologic fraction estimates $\widehat{\Pr}(I = j)$. Individual classification

can then be done by applying Bayes' rule using the estimated parameters.
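The classification step can be sketched as a direct application of Bayes' rule. The sketch assumes a conditional independence structure in which the measurement for the causal class has positive rate `sens[j]` and every other measurement has false positive rate `fpr[j]`; all parameter values and function names are hypothetical.

```python
def pattern_likelihood(m, sens, fpr, cls):
    """[M = m | I = cls] under conditional independence: measurement j has
    positive rate sens[j] if j is the class, and fpr[j] otherwise."""
    prob = 1.0
    for j, m_j in enumerate(m):
        p = sens[j] if j == cls else fpr[j]
        prob *= p if m_j == 1 else 1.0 - p
    return prob


def posterior_class_probs(m, prior, sens, fpr):
    """Pr(I = c | M = m) for each class c, by Bayes' rule."""
    joint = [prior[c] * pattern_likelihood(m, sens, fpr, c)
             for c in range(len(prior))]
    total = sum(joint)
    return [x / total for x in joint]


# Hypothetical estimated parameters for three classes.
prior = [0.5, 0.3, 0.2]
sens = [0.9, 0.8, 0.7]
fpr = [0.1, 0.2, 0.05]

post = posterior_class_probs((0, 1, 0), prior, sens, fpr)
print([round(p, 3) for p in post])
```

Here the single positive measurement on the second pathogen outweighs its smaller prior weight, so the second class receives most of the posterior probability.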

As noted by many authors, misspecification of the conditional distribution [M | I] will

likely bias model parameter and mixing proportion estimates (Albert and Dodd, 2004; Pepe

and Janes, 2007). Therefore, in many applications where the conditional independence

model for [M | I] is assumed, model adequacy is studied to ensure valid model-based

conclusions about test sensitivities/specificities and mixing weights (Garrett and Zeger,

2000). A leading example is diagnostic test evaluation without gold-standard data.

In applications where deviations from conditional independence are substantial, conditional

dependence in $[M \mid I]$ has been introduced. For example, generalized mixed-effects

models with Gaussian random intercepts have been used to introduce within-subject

correlation for diagnostic tests (Qu and Hadgu, 1998). The Gaussian assumption to de-

scribe the heterogeneity across individuals implies symmetry of correlation on the linear

predictor scale and is sometimes not appropriate (Albert et al., 2001). Albert et al. (2001)

described an alternative finite-mixture model that introduced an extra latent subclass nested

within each class to represent subjects whose measurements were made without error. With

these and many other possible conditional dependence specifications, Albert and Dodd

(2004) noted the model fits of different models are sometimes equally adequate and indis-

tinguishable if sample size is small and M has a low dimension.



The model identification problem can be partly addressed by collecting gold-standard

data on latent class indicators on some subjects (Albert and Dodd, 2008), or by collecting

extra data that provides consistent estimates of model parameters, e.g. sensitivities or speci-

ficities. In the motivating example for this paper, the PERCH study collected control data

that provides direct evidence about the specificities of tests and thereby enables estimation

of models with conditional dependence among the binary measurements.

Wu et al. (2014a) described a “partially-latent” class model (pLCM) for case and control

data to estimate the etiologic fractions, $\pi := \{\Pr(I = j)\}_{j=1,\ldots,J} = (\pi_1, \ldots, \pi_J)'$. For

controls, there is no infection in the lung, hence $I = 0$; for cases, there is infection so

that $I \neq 0$, and $I$, indicating which pathogen causes the infection, takes values in $\{1, \ldots, J\}$.

They referred to this as a “partially-latent class model” (pLCM) since control states are

known but case states are latent. They structured the pLCM to integrate measurements

with differing error rates that are collected in the PERCH case-control design. Estimated by

Markov chain Monte Carlo, their pLCM approximates with arbitrary precision the posterior

distribution of the population and individual etiologic fractions as well as functions of

unknowns in the model.

In their original formulation, Wu et al. (2014a) assumed conditional independence of

the J binary measurements within each disease class. The model fit well for the 10 most

frequent measurement patterns. However, several pairs of pathogens had observed log odds

ratios deviating significantly from the mass of the posterior predictive distribution. This

indicates that model fit might be further improved by considering conditional dependence



extensions of the pLCM. The associations from these models also have scientific value in

their own right.

In this paper, we extend the pLCM to introduce dependence among the J measurements

for an individual. We assume there are K subclasses nested within each of the J + 1 (J

case, 1 control) disease classes. Measurements within a subclass are assumed independent.

We assume the same number of subclasses, K, for each disease class (also see Remark 1).

This extension of the pLCM adds $2J(K - 1) + K - 1$ additional parameters compared

to the original pLCM with K J . We refer to the model as a “nested partially-latent

class model” or npLCM. We use a Bayesian penalty to encourage small values of K which

parsimoniously approximate the dependence among the multivariate binary responses and

avoid overfitting. As explained in the next section, our approach has the advantage of

easier interpretation of sensitivities/specificities without having to condition on continuous

random effects.

In this paper, we develop a hierarchical Bayesian model to extend the pLCM to intro-

duce flexible dependence among the binary responses. The control sample provides the

requisite information about specificities. Prior knowledge about sensitivities can be in-

corporated to facilitate estimation of the etiologic fractions. The method is based on the

nonnegative PARAFAC decomposition (Shashua and Hazan, 2005) that enables parsimo-

nious approximation of a high-dimensional contingency table. The approach is especially

helpful if the sample size is small compared to the total number, $2^{J+1}$, of cells in the joint

distribution of the measurements, M .



Our model is estimated via Markov chain Monte Carlo with data augmented by latent

indicators of disease class and nested indicators of subclass. Throughout the paper, we

rely on the scientific assumption that each child’s pneumonia is caused by a single primary

pathogen. The more general case where disease can be attributed to multiple pathogens can

be developed through our model formulation, but with computational complexities (Section

4.7).

The remainder of this paper proceeds as follows. Section 4.2 introduces the model for-

mulation of the npLCM. Section 4.3 discusses several inherent model properties. Section

4.4 details the posterior computing algorithms. Section 4.5 uses simulated data sets that are

tailored to the motivating application to illustrate the benefits of using the npLCM relative

to the pLCM, a special case. Section 4.6 applies the proposed method to PERCH study

data. Section 4.7 concludes with remarks on model extensions.

4.2 Model specification of npLCM

In this section, we fully specify the nested partially latent class model (npLCM). We

discuss the model properties and its parameter interpretations using the PERCH study as

an example. Let $M_i = (M_{i1}, \ldots, M_{iJ})$ comprise a $J$-dimensional multivariate binary measurement

collected for subjects $i = 1, \ldots, n_1 + n_0$, where the first $n_1$ subjects are cases and

the remaining $n_0$ are controls. $Y_i = 1$ denotes a case and $Y_i = 0$ a control.



4.2.1 npLCM likelihood

Figure 4.1 pictures the general structure of the npLCM with J = 5 dimensional mea-

surements, one pathogen measurement per row in the matrix. With 5 pathogens, there are 6

classes: one for the control state (pathogen-free) on the left of the dashed vertical line; and

5 states for the possible etiologic pathogens on the right. In the figure, the control measure-

ments have joint distribution that is approximated by a mixture of K = 2 subclasses, with

$K$-dimensional mixing weights $\nu = (\nu_1, \ldots, \nu_K)'$. Here $\Psi_{k_0} = \{\psi_{k_0}^{(j)}\}_{1 \le j \le J}$ is the vector

of false positive rates for measurements $j = 1, \ldots, J$ in the subclass $k_0 = 1, \ldots, K$. To the

right of the dashed line are the $J = 5$ classes for cases. The mixing weights of $K$ subclasses

in the case population are assumed to be $\eta^{(j)} = (\eta_1^{(j)}, \ldots, \eta_K^{(j)})'$, for $j = 1, \ldots, J$. The etiologic

fractions are defined to be the $J$-dimensional mixing weights for the $J$ classes in the case

population, denoted $\pi = (\pi_1, \ldots, \pi_J)'$.

The control measurement distribution is assumed to take the form of a latent class model

(Goodman, 1974). For control $i$ with measurement $m_i$, the $J$-way contingency table with

cell probabilities $P^0 = \{\Pr(M = m \mid \nu, \Psi, Y = 0)\}_{m \in \{0,1\}^J}$ can be decomposed as

$$P^0_i = \sum_{k=1}^{K^*} \nu_k \prod_{j=1}^{J} \left\{\psi_k^{(j)}\right\}^{m_{ij}} \left\{1 - \psi_k^{(j)}\right\}^{1 - m_{ij}}, \tag{4.2.1}$$

where $K^*$ is a positive integer, $\nu = (\nu_1, \ldots, \nu_{K^*})'$ is a vector that sums to one, and $\Psi$ is a

parameter matrix with $(j, k)$th element $\psi_k^{(j)} \in [0, 1]$ (Shashua and Hazan, 2005).

We introduce subclass indicator $Z_i$ that takes values in $\{1, \ldots, K^*\}$ for control subject $i$.



Figure 4.1: Model structure that incorporates conditional dependence within each disease class illustrated by $J = 5$ pathogens (called A, B, C, D, and E) in the PERCH study. On the left are the control measurements that arise from a mixture of $K = 2$ conditionally independent subclass measurement profiles with mixing weights $\nu_1$ and $\nu_2$. Here $\psi_k^{(j)}$ is the false positive rate for pathogen $j$ in subclass $k$. On the right are the $J = 5$ disease classes, one for each possible pathogen. Each case is assumed to be caused by a unique pathogen indicated by $I^L$ taking values in $\{1, \ldots, J\}$. For a class containing all cases whose $I^L = j_0$, the $K = 2$ subclasses of measurement profiles are assumed equal to the control false positive rates $\psi_k^{(j)}$ for $j \neq j_0$, and equal to the true positive rate $\theta_k^{(j)}$ for $j = j_0$, $k = 1, \ldots, K$. Within each disease class, two subclass measurement profiles are nested. The mixing weights of subclasses nested in the $j$th disease class are $\eta_1^{(j)}$ and $\eta_2^{(j)}$. $\pi = (\pi_1, \ldots, \pi_J)'$ are disease class mixing weights, and are called etiologic fractions.



Then (4.2.1) is equivalent to the two-stage model:

$$Z_i \sim \text{Multinomial}(\{1, \ldots, K^*\}, \nu), \tag{4.2.2}$$

$$M_{ij} \mid Z_i = k \sim \text{Bernoulli}\left(\psi_k^{(j)}\right), \text{ independently for } j = 1, \ldots, J, \tag{4.2.3}$$

where $\nu_k = \Pr(Z_i = k \mid Y = 0)$ and $\psi_k^{(j)} = \Pr(M_{ij} = 1 \mid Z_i = k, Y = 0)$. Here $\nu$ is

the vector of mixing weights of $K^*$ subclass measurement profiles; $\psi_k^{(j)}$ is the probability

of $m_{ij} = 1$ given this control subject is allocated to subclass $k$. In the application of

pneumonia etiology estimation, $\psi_k = \{\psi_k^{(j)}\}_{j=1,\ldots,J}$ is the $k$th false positive rate (FPR)

profile, because any positive detection of pathogens from a control subject will be a false

positive.
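The way the two-stage model induces dependence among control measurements can be checked with a small exact calculation. The sketch below uses hypothetical values for the subclass weights and false positive rates, with `psi[k][j]` holding the rate for measurement j in subclass k; the helper names are illustrative.

```python
def control_marginal(nu, psi, j):
    """Pr(M_j = 1 | Y = 0) = sum_k nu_k * psi[k][j]."""
    return sum(w * row[j] for w, row in zip(nu, psi))


def control_joint(nu, psi, j, l):
    """Pr(M_j = 1, M_l = 1 | Y = 0) = sum_k nu_k * psi[k][j] * psi[k][l]."""
    return sum(w * row[j] * row[l] for w, row in zip(nu, psi))


nu = [0.7, 0.3]                      # hypothetical subclass weights
psi = [[0.05, 0.10], [0.40, 0.50]]   # hypothetical subclass FPR profiles

p1 = control_marginal(nu, psi, 0)
p2 = control_marginal(nu, psi, 1)
p12 = control_joint(nu, psi, 0, 1)
cov = p12 - p1 * p2
print(f"marginals: {p1:.3f}, {p2:.3f}; covariance: {cov:.4f}")
```

Measurements are independent within a subclass, yet marginally they are positively associated here because both rates are higher in the second subclass. With a single subclass the covariance is exactly zero, which recovers the conditional independence of the original pLCM.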

The vector of binary measurements for a case is assumed to be generated from a mixture

of latent class models, one for each possible cause. In addition, given each potential cause,

we assume the distribution is the same as for the controls except for the causal pathogen.

Specifically, for disease class $j_0$, the $K$ subclasses of measurement profiles are assumed

equal to the control positive rates $\psi_k^{(j)}$ for $j \neq j_0$, $k = 1, \ldots, K$. We assume the true positive

rates to be $\theta_k^{(j)}$ for $j = j_0$, $k = 1, \ldots, K$. For a case $i'$ with measurement $m_{i'}$, the joint

measurement distribution for the cases, $P^1 = \{\Pr(M = m \mid \pi, \eta, \Theta, \Psi, Y = 1)\}$, for

$m \in \{0, 1\}^J$, is therefore given by

$$P^1_{i'} = \sum_{j=1}^{J} \pi_j \sum_{k=1}^{K^*} \left[ \eta_k^{(j)} \left\{\theta_k^{(j)}\right\}^{m_{i'j}} \left\{1 - \theta_k^{(j)}\right\}^{1 - m_{i'j}} \prod_{l \neq j} \left\{\psi_k^{(l)}\right\}^{m_{i'l}} \left\{1 - \psi_k^{(l)}\right\}^{1 - m_{i'l}} \right], \tag{4.2.4}$$

where $\pi = (\pi_1, \ldots, \pi_J)'$ is a vector that sums to one, and $\Theta$ is a parameter matrix with $(j, k)$th

element $\theta_k^{(j)} \in [0, 1]$. Parameters in (4.2.4) are better interpreted using subclass indicator

$Z_{i'}$ and an extra class indicator $I_{i'} \in \{1, \ldots, J\}$ (or disease class indicator),

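$$I_{i'} \mid Y_{i'} = 1 \sim \text{Multinomial}(\{1, \ldots, J\}, \pi), \tag{4.2.5}$$

$$Z_{i'} \mid I_{i'} = j \sim \text{Multinomial}(\{1, \ldots, K\}, \eta^{(j)}), \quad j = 1, \ldots, J, \tag{4.2.6}$$

$$M_{i'j} \mid Z_{i'} = k, I_{i'} \overset{\text{indep}}{\sim} \text{Bernoulli}\left(\theta_k^{(j)} 1_{\{I_{i'} = j\}} + \psi_k^{(j)} 1_{\{I_{i'} \neq j\}}\right), \quad j = 1, \ldots, J. \tag{4.2.7}$$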

In the PERCH application, $\pi = (\pi_1, \ldots, \pi_J)'$ is the vector of probabilities that a case belongs

to class 1 through $J$, i.e. the etiologic fractions, which is the primary target of

inference; $\eta^{(j)} = (\eta_1^{(j)}, \ldots, \eta_{K^*}^{(j)})'$ mixes $K^*$ subclasses nested in disease class $j$; $\theta_k^{(j)}$ is

the true positive rate (TPR) for a case belonging to the $j$th disease class and $k$th subclass

measurement profile. Equation (4.2.7) indicates that $\theta^{(j)} = \{\theta_k^{(j)}\}_{1 \le k \le K^*}$ replaces

$\psi^{(j)} = \{\psi_k^{(j)}\}_{1 \le k \le K^*}$ in otherwise similar controls to indicate the change in positive detection

rate induced by pathogen infection in a disease class $j$.

Combining (4.2.1) and (4.2.4), the joint model likelihood across independent cases and

controls is

$$L(\pi, \Theta, \Psi, \nu, \eta; D) = \prod_{i: Y_i = 0} P^0_i \prod_{i': Y_{i'} = 1} P^1_{i'}, \tag{4.2.8}$$

where $D = \{m_i : Y_i = 0\} \cup \{m_{i'} : Y_{i'} = 1\}$ collects all the measurement data on the cases

and controls.
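A direct evaluation of (4.2.1), (4.2.4), and (4.2.8) can be sketched as below, for a fixed number of subclasses. All parameter values are hypothetical, arrays are indexed as `psi[k][j]`, `theta[k][j]`, and `eta[j][k]`, and the function names are illustrative rather than taken from any released implementation.

```python
import math
from itertools import product


def bernoulli_product(m, rates):
    """prod_j rates[j]^m_j * (1 - rates[j])^(1 - m_j)."""
    prob = 1.0
    for m_j, r in zip(m, rates):
        prob *= r if m_j == 1 else 1.0 - r
    return prob


def control_prob(m, nu, psi):
    """Equation (4.2.1): mixture over subclasses of product Bernoullis."""
    return sum(nu_k * bernoulli_product(m, psi_k) for nu_k, psi_k in zip(nu, psi))


def case_prob(m, pi, eta, theta, psi):
    """Equation (4.2.4): class j swaps theta[k][j] in for psi[k][j]."""
    total = 0.0
    for j, pi_j in enumerate(pi):
        for k, eta_jk in enumerate(eta[j]):
            rates = list(psi[k])
            rates[j] = theta[k][j]
            total += pi_j * eta_jk * bernoulli_product(m, rates)
    return total


def log_likelihood(controls, cases, pi, nu, eta, theta, psi):
    """Logarithm of (4.2.8) for lists of binary measurement tuples."""
    ll = sum(math.log(control_prob(m, nu, psi)) for m in controls)
    ll += sum(math.log(case_prob(m, pi, eta, theta, psi)) for m in cases)
    return ll


# Hypothetical parameters with J = 3 pathogens and K = 2 subclasses.
nu = [0.7, 0.3]
psi = [[0.05, 0.10, 0.20], [0.30, 0.40, 0.10]]
theta = [[0.90, 0.80, 0.70], [0.95, 0.60, 0.85]]
pi = [0.5, 0.3, 0.2]
eta = [[0.6, 0.4], [0.5, 0.5], [0.2, 0.8]]

patterns = list(product([0, 1], repeat=3))
total_control = sum(control_prob(m, nu, psi) for m in patterns)
total_case = sum(case_prob(m, pi, eta, theta, psi) for m in patterns)
ll = log_likelihood([(0, 0, 0), (0, 1, 0)], [(1, 0, 0), (0, 0, 1)],
                    pi, nu, eta, theta, psi)
print(total_control, total_case, ll)
```

Summing each distribution over all $2^J$ patterns is a quick sanity check that the cell probabilities are properly normalized. Enumeration is only feasible for small $J$; the likelihood itself requires no enumeration.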

Remark 1. We assumed that dependence of measurements within each case class can

be explained by allowing the same number of conditionally independent subclasses as in

controls. If the disease class I is directly observed for all or a subset of cases, extra case

subclasses can be included. However, without direct observations of I , as is the case in

PERCH, we use the same number of subclasses in the cases so that subclass parameters

can be partly informed by the control population using (4.2.7).

Remark 2. While we assume K, the number of subclasses per class, is the same for cases

as for controls, we do not add the additional restriction that the mixing distribution across

subclasses is also the same, i.e., $\nu = \eta^{(j)}$, $j = 1, \ldots, J$. In this way, the dependence among

the measurements $M_{[-j]}$ is not required to be identical for controls and cases caused by

pathogen j.

Remark 3. Any multivariate binary distribution can be expressed as a mixture of product

Bernoullis for a sufficiently large K = K∗ in (4.2.1). However, the choice of K∗ is not

straightforward. Also, since our inferential goal is to estimate the etiology fractions, π,

after marginalizing over subclass indicators ($Z_{i'} \in \{1, \ldots, K\}$), the dependence structure



within each disease class is represented by nuisance parameters. Therefore, rather than

fixing K, we perform model averaging across a range of plausible values of K, so that

its uncertainty is incorporated into the inference about π. This is particularly desirable

when the observed contingency table $P^0$ has a large proportion of empty cells (> 97% in

PERCH), and we want to prevent model overfitting in finite sample sizes using $K \le K^*$ as

discussed by Dunson and Xing (2009).

4.2.2 Prior specifications

Prior distributions on unknown parameters are specified as follows:

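$$\pi \sim \text{Dirichlet}(a_1, \ldots, a_J), \tag{4.2.9}$$

$$\psi_k^{(j)} \sim \text{Beta}(b_{1kj}, b_{2kj}), \quad j = 1, \ldots, J;\ k = 1, \ldots, \infty, \tag{4.2.10}$$

$$\theta_k^{(j)} \sim \text{Beta}(c_{1kj}, c_{2kj}), \quad j = 1, \ldots, J;\ k = 1, \ldots, \infty, \tag{4.2.11}$$

$$Z_{i'} \mid I^L_{i'} = j \sim \sum_{k=1}^{\infty} U_k^{(j)} \prod_{l<k} \left[1 - U_l^{(j)}\right] \delta_k, \quad \text{for all cases}, \tag{4.2.12}$$

$$U_k^{(j)} \sim \text{Beta}(1, \alpha_0), \tag{4.2.13}$$

$$Z_i \sim \sum_{k=1}^{\infty} V_k \prod_{l<k} \left[1 - V_l\right] \delta_k, \quad \text{for all controls}, \tag{4.2.14}$$

$$V_k \sim \text{Beta}(1, \alpha_0), \tag{4.2.15}$$

$$\alpha_0 \sim \text{Gamma}(0.25, 0.25), \tag{4.2.16}$$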



where δk is a point mass on k, and prior independence is also assumed among these param-

eters.

In (4.2.12) and (4.2.14), we have specified stick-breaking priors for both $\eta^{(j)}$ and $\nu$,

which place weights on the $k$th measurement profile that decrease in expectation as $k$ increases (Ishwaran

and James, 2001).
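The stick-breaking construction can be sketched with a finite truncation. The truncation level, the rule that the last stick absorbs the remaining mass, and the values of `alpha` tried here are assumptions made for illustration; the prior in (4.2.12) to (4.2.15) is stated over infinitely many subclasses.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Draw truncated stick-breaking weights with V_k ~ Beta(1, alpha).

    The last stick takes all remaining mass so the weights sum to one
    (a common finite truncation, assumed here).
    """
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)
        weights.append(remaining * v)   # V_k * prod_{l<k} (1 - V_l)
        remaining *= 1.0 - v
    weights.append(remaining)
    return weights


rng = random.Random(1)
for alpha in (0.1, 1.0, 10.0):          # hypothetical concentration values
    w = stick_breaking_weights(alpha, truncation=10, rng=rng)
    print(alpha, [round(x, 3) for x in w])
```

Since each $V_k$ has mean $1/(1 + \alpha_0)$, smaller values of $\alpha_0$ tend to put most of the mass on the first few subclasses, which is how the prior favors a small effective $K$.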

4.3 Model properties

4.3.1 Non-interference submodels

A fundamental premise of the pLCM and this extended npLCM is that the etiologic

pathogen in the lung $I^L$ is differentially expressed in the peripheral measurements $M$.

That is, if one case has disease caused by pathogen $j$ and another by pathogen $j'$, then the

joint distributions for the measurements of the remaining pathogens (all but $j$ and $j'$), call

this $\Pr(M_{[-(j,j')]} \mid I^L, Y = 1)$, will be the same. This premise is essential if we expect

to infer the lung status from peripheral measurements.

Specifically, the case measurement likelihood (4.2.4) implies that, among cases, if

$\eta^{(j)} = \eta^{(j')}$, or $K = 1$,

$$\Pr(M_{[-(j,j')]} \mid I^L = j, Y = 1) = \Pr(M_{[-(j,j')]} \mid I^L = j', Y = 1), \tag{4.3.1}$$

for $1 \le j < j' \le J$. If $\nu = \eta^{(j)}$, or $K = 1$, we further have, between the controls and the

cases, that

$$\Pr(M_{[-j]} \mid Y = 0) = \Pr(M_{[-j]} \mid I^L = j, Y = 1), \quad j = 1, \ldots, J. \tag{4.3.2}$$

In the PERCH application, equality (4.3.1) implies that measurements on pathogens have

the same distribution for the cases belonging to two different disease classes, as long as these

pathogens are not infecting either of them. Equality (4.3.2) also implies that measurements

on pathogens other than j will have the same distribution for the controls and the cases

caused by pathogen j. (4.3.1) and (4.3.2) are hence termed non-interference conditions.

The observed data can be used to support or reject the non-interference submodels (4.3.1)

and (4.3.2) as discussed in more detail in Section 4.6.

4.3.2 Mean and covariance structure

In the appendix at the end of this chapter, we provide straightforward expressions

for the marginal means and pairwise associations for J pathogens separately for the cases

and controls, and discuss how information borrowed from the controls is manifested in

the cases’ measurements. These formulas are used to generate posterior distributions for

observable characteristics of the measurements that are essential to model checking. That

use is illustrated in Section 4.6 for the PERCH data.

In addition, these formulas allow comparison of the pLCM and npLCM models in terms

of their estimates of the etiologic fractions π, which are of primary interest in our application.



In Section 4.5, through simulations, we assess the bias-variance trade-off of inferring etio-

logic fractions π when using npLCM compared to pLCM in finite sample sizes.

4.3.3 Alternate approaches to borrowing information from the control population

It is also clear from (A.2.3) that the control measurements provide direct evidence about

the marginal false positive rates (MFPR), ΨM = Pr(Mij = 1 | Yi = 0), j = 1, ..., J. One

may estimate ΨM without joint modeling of the control measurements, or by assuming a

working independence model. Both approaches provide consistent estimates of the MFPRs.

One can then use the nonparametric bootstrap to obtain a robust covariance estimate on

the logit scale, ΣM, and place a multivariate normal prior, logit(ΨM) ∼ NJ(0, ΣM), to inform

the case model. Similarly, marginal moments beyond the first order can be borrowed to

the case model through GEE2 (Liang et al., 1992) or the more computationally efficient

alternating logistic regression (ALR) (Carey et al., 1993).
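The bootstrap step described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the thesis: the function name, the add-0.5 smoothing that keeps the logit finite, and the simulated control data are all hypothetical.

```python
import numpy as np

def bootstrap_logit_mfpr(controls, n_boot, rng):
    """controls: (n0, J) array of 0/1 control measurements.
    Returns the logit-scale marginal FPR estimates and their
    nonparametric bootstrap covariance matrix."""
    n0, J = controls.shape

    def logit_rate(x):
        # add-0.5 smoothing (an assumption of this sketch) avoids logit(0) or logit(1)
        p = (x.sum(axis=0) + 0.5) / (x.shape[0] + 1.0)
        return np.log(p / (1.0 - p))

    est = logit_rate(controls)
    draws = np.empty((n_boot, J))
    for b in range(n_boot):
        idx = rng.integers(0, n0, size=n0)   # resample controls with replacement
        draws[b] = logit_rate(controls[idx])
    sigma = np.cov(draws, rowvar=False)      # J x J covariance on the logit scale
    return est, sigma

rng = np.random.default_rng(4)               # hypothetical seed and data
controls = (rng.random((600, 3)) < np.array([0.1, 0.3, 0.5])).astype(int)
est, sigma = bootstrap_logit_mfpr(controls, n_boot=500, rng=rng)
```

Resampling whole individuals (rows) preserves the dependence among the J measurements, which is what makes the resulting covariance estimate robust to misspecification of that dependence.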

These ad hoc approaches to sharing measurement error rate information between the control

and the case populations have at least two limitations. First, one needs to specify the

order of moments and then obtain the moment estimates using other robust statistical pro-

cedures. Second, one needs to formulate the case model explicitly in terms of moment

parameters, on which priors elicited from the controls are placed. The npLCM framework

overcomes these two inconveniences using the nonnegative PARAFAC decomposition. Estimated

FPR profile parameters (Ψ) from the control population can inform moments of

arbitrary order among the cases, with similarity determined by the subclass mixing weights

ν and η(j), j = 1, ..., J.

4.3.4 Modeling choices

As an alternative to the npLCM introduced here, covariation among multivariate bi-

nary measurements can also be formulated by generalized linear mixed-effects models

(GLMM). Suppose the control measurements follow the distribution

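$g\bigl(\Pr(M_{ij} = 1 \mid Y_i = 0, \delta_i)\bigr) = \delta_{ij} + \psi^C_j$, j = 1, ..., J; and $\delta_i = \Lambda \xi_i + \varepsilon_i$, (4.3.3)

$\xi_i = (\xi_{i1}, \ldots, \xi_{iS})' \sim N_S(0, I_S)$, $\varepsilon_i \sim N_J\bigl(0, \Omega = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_J^2)\bigr)$, (4.3.4)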

where g(·) is a link function and ψCj is the conditional FPR (on the link scale) given random effect δij = 0. Here,

Λ is a J × S factor loading matrix that characterizes the covariance structure of random

effect δi through Cov[δi] = ΛΛ′ + Ω. Sparse estimation of the factor loading matrix Λ in

GLMM can be generalized from previous Bayesian methods in the context of continuous

data (Bhattacharya and Dunson, 2011; Pati et al., 2014), possibly at greater computational

expense. We can borrow the control parameters to the jth class of cases by replacing ψCj ,

and the jth row of Λ and Ω with new parameter values. We have chosen npLCM over

GLMM formulation for two reasons. First, in the control population, the normal random-effect

distribution in the GLMM constrains the possible dependence structure among the

multivariate binary measurements. In contrast, the npLCM has the advantage of approximating

the control distribution arbitrarily closely using JK + K − 1 parameters, a much

smaller number than the JS + 2J required in GLMM when S ≈ K J . Second, the

use of the control information in the case population is more natural in the npLCM framework.

The jth marginal FPR in the controls or cases is a linear function of the jth row

of Ψ. In the GLMM formulation, by contrast, we need the jth row of Λ and Ω, ψCj , the link

function g(·), and numerical integration if g(·) is not the probit function. This also makes

the GLMM formulation more computationally intensive relative to the npLCM.
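The random-effect structure in (4.3.3) and (4.3.4) can be checked by simulation. The sketch below uses hypothetical values for Λ and Ω (they are not estimates from the thesis) and verifies that the simulated random effects have covariance close to ΛΛ′ + Ω.

```python
import numpy as np

rng = np.random.default_rng(1)            # hypothetical seed
J, S, n = 4, 2, 200_000                   # measurements, factors, simulated subjects

# Hypothetical J x S factor loading matrix and idiosyncratic variances
Lambda = np.array([[1.0, 0.0],
                   [0.5, 0.5],
                   [0.0, 1.0],
                   [-0.5, 0.5]])
sigma2 = np.array([0.5, 1.0, 1.5, 2.0])

xi = rng.normal(size=(n, S))                      # xi_i ~ N_S(0, I_S)
eps = rng.normal(size=(n, J)) * np.sqrt(sigma2)   # eps_i ~ N_J(0, Omega)
delta = xi @ Lambda.T + eps                       # delta_i = Lambda xi_i + eps_i

implied = Lambda @ Lambda.T + np.diag(sigma2)     # Cov[delta_i] = Lambda Lambda' + Omega
empirical = np.cov(delta, rowvar=False)
max_abs_err = np.max(np.abs(empirical - implied))
```

The dependence among the binary measurements is then inherited from this Gaussian covariance through the link g(·), which is the constraint on the dependence structure that the text refers to.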

4.4 Posterior computations

The parameters in likelihood (4.2.8) include the population etiology distribution (π),

TPRs Θ and FPRs Ψ. The posterior distribution of these parameters can be estimated by

constructing approximating samples from the joint posterior via a Gibbs sampler. The full

conditional distributions for Gibbs sampler updating are detailed in the Appendix. Figure

4.2 is the directed acyclic graph (DAG) that shows the model structure and observed and

latent variables in the npLCM.

In this work, we are able to use the freely available software WinBUGS 1.4 to fit the

npLCM. Convergence was monitored via Markov chain Monte Carlo (MCMC) chain his-

tories, auto-correlations, kernel density plots, and Brooks-Gelman-Rubin statistics (Brooks

and Gelman, 1998). The statistical results below are based on 10,000 iterations of burn-in

followed by 40,000 production samples from each of three parallel chains. Samples from

every 40 iterations are retained for inference.

Figure 4.2: Directed acyclic graph for the npLCM. Quantities in circles are unknown parameters or auxiliary variables; quantities in solid squares are observed (multivariate binary measurements here). The etiologic fraction π is of primary scientific interest. The solid arrows represent probabilistic relationships between the connected variables. The “cut” valve “A B” means that when updating node A in the Gibbs sampler, we drop the likelihood terms that involve node B (see Section 4.4).
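As a rough illustration of the convergence monitoring and thinning described above, the sketch below computes a basic potential scale reduction factor across parallel chains. It is the simple between/within-chain variance ratio, not the corrected statistic reported by WinBUGS, and the simulated chains and function name are hypothetical.

```python
import numpy as np

def psrf(chains):
    """Basic potential scale reduction factor.
    chains: (m, n) array holding m chains with n retained draws each."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()        # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    var_plus = (n - 1) / n * W + B / n           # pooled variance estimate
    return np.sqrt(var_plus / W)

# Thinning arithmetic from the text: every 40th of 40,000 draws, three chains
retained_per_chain = 40_000 // 40
total_retained = 3 * retained_per_chain

rng = np.random.default_rng(2)                   # hypothetical seed
mixed = rng.normal(size=(3, retained_per_chain)) # chains sampling the same target
stuck = mixed + np.array([[0.0], [3.0], [6.0]])  # chains centered at different values
r_hat_mixed = psrf(mixed)
r_hat_stuck = psrf(stuck)
```

Values near 1 are consistent with the chains sampling the same distribution; values well above 1 indicate that the chains disagree.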

Note that the false positive rate parameters Ψ are included in both the control and

case likelihood (4.2.1) and (4.2.4), so that the posterior distribution of Ψ depends on both

the control and case models. This is referred to as “feedback” because the case model

will indirectly inform Ψ. If we want the control data to inform the case model but not

vice versa, we can “cut” (Lunn et al., 2009) this source of feedback through approximate

conditional updating in the Gibbs sampler. That is, we update ψ(j)k from Pr(ψ(j)k | Mij, i : Yi = 0)

instead of step (7) of the Gibbs sampler (see Appendix). This cuts the information

flow from the case model to the FPR parameters Ψ and is indicated by the check-bit valves

in Figure 4.2. It is desirable when certain parts of the joint model are considered not

reliable to inform a subset of parameters, and can be implemented by the cut function

in WinBUGS 1.4. Such “cut-the-feedback” approximate Bayesian computation offers

gains in both computational speed and inferential robustness, and has also been suggested in other

contexts (Liu et al., 2009; Warren et al., 2012; Zigler and Dominici, 2014).
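To make the effect of cutting the feedback concrete, the sketch below contrasts the two updates for a single FPR in the simplest setting. It assumes the K = 1 submodel, a Beta(b1, b2) prior, and the usual conjugate Beta form in which a case currently imputed to a class other than j contributes its jth measurement as a false positive draw. The thesis's own step (7) is in the Appendix and is not reproduced here; the function name, 0-based class labels, and toy data are hypothetical.

```python
import numpy as np

def fpr_beta_params(j, controls, cases, case_class, b1=1.0, b2=1.0, cut=True):
    """Beta parameters for the conditional update of the FPR of measurement j.
    cut=True uses controls only; cut=False also uses cases whose currently
    imputed class is not j (assumed K = 1 conjugate form)."""
    pos = controls[:, j].sum()
    n = controls.shape[0]
    if not cut:
        other = case_class != j              # cases not caused by pathogen j
        pos = pos + cases[other, j].sum()
        n = n + other.sum()
    return b1 + pos, b2 + n - pos

# Toy data: 4 controls, 3 cases, J = 2 measurements
controls = np.array([[1, 0], [0, 0], [0, 1], [0, 0]])
cases = np.array([[1, 1], [1, 0], [0, 1]])
case_class = np.array([0, 0, 1])             # current imputed classes (0-based)

cut_params = fpr_beta_params(1, controls, cases, case_class, cut=True)
full_params = fpr_beta_params(1, controls, cases, case_class, cut=False)
```

With the cut, the imputed case classes never enter the FPR update, so errors in the case model cannot propagate back into Ψ.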

4.5 Simulation studies

We compare the pLCM and npLCM in terms of π estimation and individual classifica-

tion error rates in a small simulation study with sample size n1 = n0 = 500 and J = 5

binary measures (pathogens A,...,E). We estimate the bias of not accounting for conditional

dependence when it exists and the bias-variance trade-off when choosing between pLCM

and npLCM.

In Scenario I, we simulated data with conditional independence K = 1 (pLCM)

where the etiology is split evenly across the five pathogens. The FPRs are set to be

Ψ = (0.1, 0.2, 0.3, 0.4, 0.5)′ and the TPRs Θ = (0.9, 0.9, 0.9, 0.9, 0.9)′. In Scenario

II, we simulated the data under an npLCM specification with true etiologic fraction π =

(0.5, 0.2, 0.15, 0.1, 0.05)′. We then create associations between the binary measurements by

defining two subclasses (K = 2) for the 6 disease states; the FPR profiles are Ψ = [ψ1,ψ2],

where ψ1 = (0.1, 0.1, 0.1, 0.1, 0.5)′ and ψ2 = (0.1, 0.5, 0.1, 0.1, 0.1)′; the TPR profiles are

Θ = [θ1,θ2], where θ1 = (0.9, 0.9, 0.9, 0.9, 0.9)′ and θ2 = (0.1, 0.9, 0.9, 0.9, 0.9)′. The

true subclass mixing weights in the controls (λ) and cases (η(j), j = 1, ..., 5), are all set

equal to (0.9, 0.1)′. With this set of parameter values, the pair of pathogens (B,E) are neg-

atively associated in the controls. In the cases infected by pathogen A, we created negative

association between the pair (A,B), and positive association between the pair (A,E). This is

to mimic the situation where the pathogen infecting a case can inhibit the growth of some

pathogens in the periphery while promoting growth of another.
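The Scenario II data-generating mechanism can be sketched as below, using the parameter values stated above. The sampling scheme is an assumption of this sketch: within subclass k, a case of class j has its jth measurement drawn with the TPR θ_k for pathogen j and all other measurements drawn with the FPRs in ψ_k, and the measurements are independent given class and subclass. The function name and 0-based labels (A = 0, ..., E = 4) are hypothetical.

```python
import numpy as np

def simulate_scenario2(n_cases, n_controls, rng):
    """Simulate controls and cases under the Scenario II parameter values."""
    pi = np.array([0.5, 0.2, 0.15, 0.1, 0.05])        # etiologic fractions
    psi = np.array([[0.1, 0.1, 0.1, 0.1, 0.5],        # FPR profile, subclass 1
                    [0.1, 0.5, 0.1, 0.1, 0.1]])       # FPR profile, subclass 2
    theta = np.array([[0.9, 0.9, 0.9, 0.9, 0.9],      # TPR profile, subclass 1
                      [0.1, 0.9, 0.9, 0.9, 0.9]])     # TPR profile, subclass 2
    w = np.array([0.9, 0.1])                          # subclass weights, all classes

    z0 = rng.choice(2, size=n_controls, p=w)          # control subclass indicators
    controls = (rng.random((n_controls, 5)) < psi[z0]).astype(int)

    cls = rng.choice(5, size=n_cases, p=pi)           # true case classes
    z1 = rng.choice(2, size=n_cases, p=w)             # case subclass indicators
    prob = psi[z1].copy()
    rows = np.arange(n_cases)
    prob[rows, cls] = theta[z1, cls]                  # TPR replaces FPR for the cause
    cases = (rng.random((n_cases, 5)) < prob).astype(int)
    return controls, cases, cls

rng = np.random.default_rng(5)                        # hypothetical seed
controls, cases, cls = simulate_scenario2(500, 500, rng)
```

With large simulated samples, the empirical covariances reproduce the signs described in the text: negative for (B,E) in the controls, and negative for (A,B) and positive for (A,E) among class A cases.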

For each scenario, we generated 50 data sets; the npLCM and pLCM algorithms were

fitted separately to each simulated data set using the posterior computing algorithm de-

scribed in Section 4.4. We obtained good mixing behavior of the Gibbs sampler that con-

verged in each case for the etiologic fractions π as determined by visual inspections of the

chain history, auto- and cross-correlations.



Table 4.1 shows the simulation results from the npLCM and pLCM, respectively, for

both scenarios. The row beginning with $\bar{\pi}_j$ is the average of posterior means across simulation

replicates; $S_{\pi_j}$ is the sample standard deviation of posterior means across simulation

replicates; $(V_{\pi_j})^{1/2}$ is the square root of the average posterior variance of πj across simulation

replicates; coverage is the proportion of simulation replicates that produced 95%

credible intervals covering the true πj . With 50 replicates, the empirical coverage rate

of nominal 95% intervals is expected to be 95% and will most likely fall in the range of

(88, 100)%.
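A range of this kind can be obtained from a normal approximation to the binomial distribution of the empirical coverage. The thesis does not state the calculation, so the sketch below is one plausible route, assuming an interval of two standard errors around the nominal level, truncated to [0, 1].

```python
import math

def coverage_range(p=0.95, n_rep=50, z=2.0):
    """Approximate range for empirical coverage over n_rep replicates
    when the true coverage is p (normal approximation, z standard errors)."""
    se = math.sqrt(p * (1.0 - p) / n_rep)
    return max(0.0, p - z * se), min(1.0, p + z * se)

lower, upper = coverage_range()
```

With p = 0.95 and 50 replicates the standard error is about 0.031, giving a lower limit of about 0.888 and an upper limit capped at 1.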

In scenario I, the estimates from the pLCM and the npLCM are comparable. The average

posterior means are similar and close to the truth. The coverage rates of the nominal

95% credible intervals are within the (88, 100)% range. As expected, the npLCM estimates

generally have larger posterior variances, because the npLCM is a larger model and

includes the pLCM as a special case.

In scenario II, ignoring the conditional dependence in class A of cases leads to

a downward bias, $\bar{\pi}_A$ = 0.44, compared to the true value (0.5). We also observe under-coverage

(84%) of the nominal 95% credible interval. The npLCM recovers πA accurately,

with an average posterior mean of 0.51. The coverage of the nominal 95% interval is

estimated to be 90%, a reasonable value. In this simulation, other estimates by the pLCM

remain robust to the conditional dependence that is present in the true data generating

mechanism. The sample variance of posterior means is slightly smaller from the pLCM

compared to the npLCM (see row 3 of the bottom panel), indicating that the smaller model,

pLCM, has accepted bias in π estimates in exchange for reduced variance.

Table 4.1: Results for simulated data sets separately fitted by the npLCM and pLCM.

SCENARIO I: truth is pLCM

                 fitted by npLCM                  fitted by pLCM
                 A     B     C     D     E        A     B     C     D     E
π                0.2   0.2   0.2   0.2   0.2      0.2   0.2   0.2   0.2   0.2
π̄j               0.22  0.21  0.20  0.20  0.16     0.22  0.21  0.20  0.20  0.16
Sπj              0.03  0.03  0.04  0.03  0.05     0.03  0.03  0.03  0.03  0.04
(Vπj)^(1/2)      0.05  0.05  0.05  0.06  0.06     0.04  0.04  0.04  0.05  0.05
coverage         98%   98%   100%  100%  96%      96%   98%   100%  100%  96%

SCENARIO II: truth is npLCM

                 fitted by npLCM                  fitted by pLCM
                 A     B     C     D     E        A     B     C     D     E
π                0.5   0.2   0.15  0.1   0.05     0.5   0.2   0.15  0.1   0.05
π̄j               0.51  0.19  0.16  0.10  0.04     0.44  0.22  0.17  0.12  0.05
Sπj              0.03  0.02  0.02  0.02  0.02     0.02  0.02  0.02  0.02  0.01
(Vπj)^(1/2)      0.04  0.04  0.04  0.04  0.03     0.04  0.04  0.04  0.04  0.03
coverage         90%   94%   100%  98%   100%     84%   90%   100%  98%   100%

For both scenarios, we have also assessed the out-of-sample predictive performance of

the individual diagnosis based on the pLCM or the npLCM. We used 500 cases and 500

controls (D0) to train the models and predicted an individual’s underlying class indicator

$I^L_{i^*}$ given her J binary measurements. Under a particular model, we classify an individual

into the class that gives the highest posterior probability: $\hat{I}^L_{i^*} = \arg\max_{j=1,2,\ldots,J} P(I^L_{i^*} = j \mid M_{i^*}, D_0)$.

Here, the posterior probabilities are estimated as the frequencies of the

Gibbs sampler imputing $I^L_{i^*}$ as j after the burn-in period. For a class of cases with $I^L_{i^*} = j$,

we simulate 10,000 subjects’ multivariate binary measurements $D_j$ using the model

specification. The predictions for new cases, $\hat{I}^L_{i^*}$, and their actual values of $I^L_{i^*}$ (= j) are then

compared to calculate the estimated misclassification rate:

$r_j = \#\{i^* \in D_j : \hat{I}^L_{i^*} \neq j \text{ and } I^L_{i^*} = j\} / 10{,}000$, for j = 1, ..., J.

The overall misclassification rate is calculated as $r_o = \sum_{j=1}^{J} r_j \cdot \pi_j$.
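The class-specific and overall misclassification rates defined above can be computed as in the following sketch. The function name, the 0-based class labels, and the toy predictions are hypothetical; in the simulation each class would have 10,000 predicted labels rather than four.

```python
import numpy as np

def misclassification_rates(pred_by_class, pi):
    """pred_by_class[j]: predicted classes for test subjects whose true class is j.
    Returns the class-specific rates r_j and the overall rate sum_j r_j * pi_j."""
    r = np.array([np.mean(np.asarray(pred) != j)
                  for j, pred in enumerate(pred_by_class)])
    return r, float(r @ np.asarray(pi))

# Toy example with J = 2 classes and four test subjects per class
pred_by_class = [np.array([0, 0, 1, 0]),   # true class 0: one error
                 np.array([1, 0, 0, 1])]   # true class 1: two errors
r, r_overall = misclassification_rates(pred_by_class, pi=[0.6, 0.4])
```

Weighting by πj makes the overall rate reflect the class mix in the case population rather than the equal-sized test sets.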

Figure 4.3 compares the misclassification rates obtained using the pLCM and the npLCM

across simulation replications. In Scenario I, where the data generation mechanism

complies with the conditional independence assumption, both models have similar classification

performance. In Scenario II, the npLCM has a lower average misclassification

rate in class A relative to the pLCM (see the leftmost pair of boxplots), as expected since

simulation Scenario II challenges the pLCM, which cannot account for the conditional

dependence within class A.

4.6 Analysis of PERCH data

The Pneumonia Etiology Research for Child Health (PERCH) study is a standardized

and comprehensive evaluation of etiologic agents causing severe and very severe pneumo-

nia among hospitalized children aged 1-59 months in seven low and middle income coun-

tries. The study sites include countries with a significant burden of childhood pneumonia

and a range of epidemiologic characteristics (Levine et al., 2012). PERCH is a case-control

study that has enrolled over 4,000 patients hospitalized for severe or very severe pneumonia

and over 5,000 controls selected randomly from the community, frequency-matched on

age in each month. More details about the PERCH design are available in Deloria-Knoll

et al. (2012).

Figure 4.3: Misclassification rate comparisons between the pLCM and npLCM predictions. 50 simulated training data sets generated under (a) scenario I (pLCM), or (b) scenario II (npLCM). Each training data set is fitted by the pLCM (clear boxplots) or npLCM (filled boxplots) to produce individual predictions. In (a) and (b), the first 5 pairs of boxplots compare class-specific misclassification rates; the last pair compares the overall misclassification rates.

To illustrate the application of the npLCM model for the analysis of PERCH study data,

we have focused on preliminary data from one site with good availability of laboratory

results on nasopharyngeal (NP) specimens with PCR detection of pathogens. Results for

all 7 countries will be reported elsewhere upon study completion. Included in the current

illustrative analysis are NP PCR data for 578 cases and 603 frequency-matched controls on

9 species of pathogens (6 viruses and 3 bacteria with their abbreviations in Figure 4.4, and

full names in Appendix).

4.6.1 Estimation of etiologic fractions

We have compared the population etiology fractions, π, estimated separately by two

related methods: (a) npLCM with K fixed at 1 (conditional independence submodel, or

pLCM), and (b) npLCM with stick-breaking prior on the subclass weights (truncation level

K = 10). The results are shown in Figure 4.4 using the visualization introduced by Wu

et al. (2014a).

It is desirable to compare the objective evidence in the data (input) and the posterior

distribution of the parameters of main scientific interest, here the etiology fractions π (out-

put). The left panel of Figure 4.4 displays for each pathogen (row) the positive observation

rates from cases and controls, and the estimated conditional odds ratios, with 95% confidence

intervals, of the pathogen with case status, adjusted for the presence or absence of

other pathogens using standard logistic regression.

In the right panel of Figure 4.4 are the marginal prior and posterior distributions of the

etiologic fraction for each pathogen by method (a) pLCM (black) and (b) npLCM (blue).

The posterior mean with 50% and 95% credible intervals are shown above the density.

With the exception of one virus, the differences in the estimated etiologic fractions from

the two approaches are small. The npLCM estimates that pathogen RSV caused 26.6%

(95% CI: 17.6−43.7%) of the disease in the case population. A very similar result,

27.4% (18.0−48.4%), is obtained by the pLCM, which assumes conditional independence.

The large conditional odds ratio (COR) of RSV with case status (31.6 (14.9−81.8)) cannot

be explained away by strong conditional association with another pathogen.

The one exception is the virus RHINO that has a substantially larger etiologic frac-

tion 22.1%(8.4 − 41.1%) estimated by the npLCM as compared to 10.5%(0.6 − 28.6%)

from the pLCM analysis. This is a result of RHINO’s strong negative association with

RSV in the cases (log OR: −1.8(s.e. 0.3)). From equation (A.2.4), the npLCM assumes

that the pairwise log odds ratio is contributed from all J classes of cases. For RSV

and RHINO, the involved parameters have posterior means: θ(RSV) = (0.75, 0.42, ...)′,

ψ(RHINO) = (0.26, 0.61, ...)′, θ(RHINO) = (0.72, 0.27, ...)′, and ψ(RSV) = (0.01, 0.02, ...)′.

Here, only estimates for the largest two subclasses are shown because the estimated subclass

mixing weights in the cases, η(j)k , are negligible when k > 2 for j = 1, ..., 10.

Within the RSV class of cases, the relevant parameters are TPRs θ(RSV) = (0.75, 0.42, ...)′

and FPRs ψ(RHINO) = (0.26, 0.61, ...)′. The two vectors are “out-of-phase” with each other

and so induce negative conditional dependence when we marginalize the latent subclass

indicators by subclass mixing weights η(RSV). However, the posterior mean of the subclass

mixing weights is (0.981, 0.017, ...)′, highly concentrated on the first subclass, which results

in small variations in the subclass indicators. The amount of the observed negative

association between RSV and RHINO is therefore only partly accounted for by the RSV

class.

Figure 4.4: Comparison of population etiologic fraction posterior distributions between the pLCM (black) and npLCM (blue). On the left, the positive observation rates for cases and controls are plotted for each pathogen using connected blue dots; “+” and “*” denote posterior means of θMj and ψMj, respectively; the fitted case rate is indicated by “δ”. On the right, the black/blue curves, numbers, and credible intervals above the curves denote the marginal posterior density, mean, and 50% and 95% credible intervals for πj, j = 1, ..., 10, for the pLCM/npLCM models.
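The argument about the RSV class can be checked with a short calculation. Within one class, mixing over subclasses gives a covariance between two measurements equal to Σ_k η_k a_k b_k − (Σ_k η_k a_k)(Σ_k η_k b_k), where a_k and b_k are the subclass-specific positive rates. The sketch below plugs in the two leading posterior means quoted in the text and, as a simplifying assumption, renormalizes the two leading weights and ignores the remaining subclasses; the function name is hypothetical.

```python
import numpy as np

def mixture_covariance(w, a, b):
    """Covariance between two binary measurements that are conditionally
    independent given subclass, with subclass weights w and rates a, b."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                       # renormalize (assumption: drop other subclasses)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(w * a * b) - np.sum(w * a) * np.sum(w * b))

theta_rsv = [0.75, 0.42]      # TPRs for RSV in the two leading subclasses
psi_rhino = [0.26, 0.61]      # FPRs for RHINO in the two leading subclasses

cov_rsv_rhino = mixture_covariance([0.981, 0.017], theta_rsv, psi_rhino)
cov_balanced = mixture_covariance([0.5, 0.5], theta_rsv, psi_rhino)
```

The covariance is negative in both cases, but with weights concentrated on the first subclass it is more than ten times smaller in magnitude than with balanced weights, which is why the RSV class alone accounts for only part of the observed negative association.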

The npLCM tries to account for the extra negative association by assigning higher eti-

ology to the RHINO and other classes of cases where additional negative associations can

be induced. In this data set, this leads to the observed increase in the etiologic fraction of

RHINO. We also observe that the posterior distribution of the RHINO etiologic fraction

is more spread out under the npLCM compared to the pLCM (blue versus black curve, row 2,

right panel). This indicates that although the npLCM considers the RHINO class adequate

to induce some extra negative association between RSV and RHINO, the evidence is not

strong. The smaller increases in estimated etiology fractions for pathogens PARA1 and

HMPV A/B are similarly explained by their negative associations with RSV in the cases,

although the magnitude of increase is smaller because these negative associations are not

as strongly supported by the case measurements as was the case for RSV and RHINO.

A strength of the npLCM is its Bayesian formulation and flexible posterior inference

about functions of unobservables through post-processing of MCMC samples. Prediction

of latent variables given an individual’s measurements,

$p_i = \{\Pr(I^L_i = j \mid M_i, \text{Data}),\ j = 1, \ldots, J\}$,

generalizes positive and negative predictive values for multivariate binary measurements.

To illustrate individual prediction, Figure 4.5 shows the posterior etiology probabilities, pi,

for individuals with the most frequent measurement patterns, predicted separately under

conditional independence (clear bars) and conditional dependence (filled bars) assump-

tions.
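For fixed parameter values, the posterior etiology probabilities follow from Bayes' rule. The sketch below is a plug-in illustration only: the thesis estimates these probabilities as frequencies over Gibbs draws, and the case likelihood (4.2.4) is not reproduced in this section, so the form used here (within subclass k of class j, the jth measurement uses the TPR and all others use the FPRs) is an assumption. The function name, 0-based labels, and toy parameter values are hypothetical.

```python
import numpy as np

def class_posterior(m, pi, theta, psi, eta):
    """Plug-in posterior Pr(class = j | m) for one case.
    m: (J,) 0/1 measurements; pi: (J,) etiologic fractions;
    theta, psi: (J, K) TPRs and FPRs (rows = pathogens, columns = subclasses);
    eta: (J, K) subclass weights (row j = weights within case class j)."""
    m = np.asarray(m)
    J, K = theta.shape
    post = np.empty(J)
    for j in range(J):
        rate = psi.copy()
        rate[j] = theta[j]                      # TPR replaces FPR for the causal pathogen
        # likelihood of m within each subclass, measurements independent given subclass
        lik_k = np.prod(np.where(m[:, None] == 1, rate, 1.0 - rate), axis=0)
        post[j] = pi[j] * np.sum(eta[j] * lik_k)
    return post / post.sum()

# Toy example: J = 2 pathogens, K = 1 subclass (conditional independence)
pi = np.array([0.5, 0.5])
theta = np.array([[0.9], [0.9]])
psi = np.array([[0.1], [0.1]])
eta = np.ones((2, 1))
post = class_posterior([1, 0], pi, theta, psi, eta)
```

In the toy example a positive first measurement and a negative second one give posterior probability 0.81/0.82 to the first pathogen. With K > 1 the subclass sum lets association among measurements, not only the marginal rates, shift these probabilities.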

In general, predictions from the pLCM and the npLCM differ only in RHINO, with the

npLCM favoring RHINO. On an individual level, this increase in the RHINO probabilities

explains the increase in estimated RHINO etiologic fraction shown in Figure 4.4. In the

second row of Figure 4.5, the first (1000001000) and the last (100001010) patterns have

positive HINF and HMPV but differ for RHINO. We have a counter-intuitive higher pre-

dicted RHINO probability (37%) where RHINO is absent in the NP than where it is present

(25%). The naive expectation runs the other way: the model estimates that RHINO has a marginal specificity

(0.67) greater than one minus the marginal sensitivity (0.4), so observing a negative

RHINO measurement should make RHINO a less likely cause. In the npLCM, however, beyond

the first-order marginal moment parameters (e.g. marginal sensitivities/specificities),

the association parameters are also terms in the model likelihood. The last pattern (both

HMPV and RHINO positive) is more control-like and has a higher likelihood in support of

ILi ≠ HMPV, RHINO versus infection by either pathogen. This reflects the strong observed positive

association (log OR 1 (s.e. 0.2)) between HMPV and RHINO in the controls, which is recognized

by the npLCM and borrowed to the cases (see the model structure in Figure 4.1). The

optimal Bayesian weighting inherent in the posterior calculation balances the evidence for

the marginal parameters and pairwise associations, and determines that the latter dominates

and predicts RHINO to be a less likely cause for the last than the first pattern.

Figure 4.5: Individual diagnoses for the most frequent measurement patterns among the cases, separately predicted from the pLCM and the npLCM. In each subfigure, the multivariate binary pattern denotes the observed measurement for a case; the percentage beneath is the observed frequency in the cases; the clear bar on the left is the prediction from pLCM; the filled bar on the right is from npLCM.

4.6.2 Model checking

To compare model fitting of the npLCM relative to the pLCM, we have compared pos-

terior predictive distributions (Gelman et al., 1996) of pairwise log odds ratios (LOR) to

the observed values separately in the controls and the cases. To assess the differences,

we calculate a standardized deviation for each pair of measurements: the observed LOR

minus the mean of the LOR posterior predictive distribution, divided by the standard deviation

of that predictive distribution. Figure 4.6 shows pairs of pathogens that have significant deviations of model-predicted

LORs from the observed ones, under either the pLCM or the npLCM. The size of the circles

for the empirical estimates is proportional to the precision of the observed log odds

ratios, shown as the solid dots.
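The standardized deviation used in this check can be sketched as follows. The 0.5 added to each cell of the 2 × 2 table is an assumption of this sketch (it keeps the log odds ratio finite for sparse pairs), and the function names, seed, and toy data are hypothetical; in practice the replicates would be data sets drawn from the posterior predictive distribution.

```python
import numpy as np

def log_odds_ratio(x, y):
    """Log odds ratio of two 0/1 vectors, with 0.5 added to each cell."""
    x = np.asarray(x)
    y = np.asarray(y)
    n11 = np.sum((x == 1) & (y == 1)) + 0.5
    n10 = np.sum((x == 1) & (y == 0)) + 0.5
    n01 = np.sum((x == 0) & (y == 1)) + 0.5
    n00 = np.sum((x == 0) & (y == 0)) + 0.5
    return np.log(n11 * n00 / (n10 * n01))

def standardized_lor_deviation(observed, replicates, a, b):
    """(observed LOR - mean predictive LOR) / sd of predictive LOR for pair (a, b)."""
    obs = log_odds_ratio(observed[:, a], observed[:, b])
    rep = np.array([log_odds_ratio(r[:, a], r[:, b]) for r in replicates])
    return (obs - rep.mean()) / rep.std(ddof=1)

# Toy check: observed data are strongly associated, replicates are independent
rng = np.random.default_rng(3)
n = 500
x = rng.integers(0, 2, size=n)
y = np.where(rng.random(n) < 0.1, 1 - x, x)      # y agrees with x about 90% of the time
observed = np.column_stack([x, y])
replicates = [rng.integers(0, 2, size=(n, 2)) for _ in range(200)]
z = standardized_lor_deviation(observed, replicates, 0, 1)
```

A large absolute value of z flags a pair whose association the fitted model fails to reproduce, as a model assuming independence does here.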

In the controls, the pathogen pairs (1,7), (1,9), and (7,9) have log odds ratios esti-

mated with relatively high precision. They are missed under the conditional independence

assumptions, but are well captured by the npLCM. In the cases, the pair of pathogen measurements

(7,8) has a positive log odds ratio with high precision, which is adequately described

by the npLCM. The association between the pair of measurements (9,10) is not well

described by either model. But we observe that the npLCM posterior predictive distribution

(rightmost boxplot in the bottom panel) has moved towards explaining some negative asso-

ciations, compared to the neutral position of the boxplot under the pLCM. In the PERCH

study, we observed that the seasonal variations in the rates of detection for the 10th pathogen,

RSV, and the 9th pathogen, RHINO, were out of phase; regression adjustment, discussed

elsewhere, may account for such strong negative association.

In the cases, the npLCM has similar predicted frequencies to those obtained from the

conditional independence assumption. The underestimation of the measurement pattern

with HINF and RSV positive (third from left) is due to strong negative association between

RHINO and RSV that is not captured sufficiently by the npLCM and requires further work

on regression adjustment.

4.7 Discussion

In this paper, we estimated the population frequencies with which the putative pathogens

cause disease among the cases using a nested partially-latent class model (npLCM) that

allows for conditional dependence of measurements. Using multivariate binary measure-

ments from a case-control design, the model first approximates the probability distribution

for the control measurements by a mixture of product Bernoulli distributions with mixing

weights penalized towards a mixture with fewer components. The estimated control dependence

structure is then applied to the case model, with modifications for the latent disease

state in which true positive rates replace false positive rates.

Figure 4.6: Posterior predictive distributions for checking of pairwise log odds ratios (LORs) for the controls (top) and the cases (bottom). For each of 45 pairs denoted on the horizontal axis, the left and right boxplots display the posterior predictive distributions for the pLCM (black) and npLCM (blue) models, respectively. Only pairs that have significant deviations from the observed log odds ratios, under either the pLCM or the npLCM, are shown. The estimated LORs are denoted by red dots; the size of the circles is proportional to the precision of the estimated LORs. Pathogen numbers and names are given in Figure 4.4 on the right.

We illustrate by simulation that ignoring conditional dependence in each disease class

can lead to bias in the population and individual etiologic fraction estimates.

By recognizing similar covariations among pathogen measurements, the npLCM can re-

duce bias.

In the analysis of 10 leading pathogens from the PERCH study, RSV is estimated to

be the most prevalent infectious cause of childhood pneumonia. That evidence is robust

to the conditional dependence assumption. In contrast, accounting for conditional depen-

dence structure leads to an increased RHINO etiologic fraction estimate so that its role is

less robust to models for the measurement dependence. For other pathogens, we did not ob-

serve substantial changes in the estimated etiologic fraction from the npLCM and pLCM,

indicating that the deviation from conditional independence has only limited influence in

this data set.

When scientific knowledge on true positive rates (TPRs) exists, it can be incorporated

into the npLCM by specifying appropriate prior distributions on the subclass-specific TPRs,

θ(j)k , k = 1, ..., K, for the jth disease class. This prior knowledge, however, is usually

available only on the marginal TPRs, $\theta^M_j = \sum_{k=1}^{K} \theta^{(j)}_k \eta^{(j)}_k$, j = 1, ..., J, which are functionals of a

large number of model parameters. In future work, we are interested in placing marginally

specified priors on θMj and developing efficient sampling algorithms that can improve the

quality of inference about π, while leaving other aspects of the model (e.g. conditional

dependence structure) maximally flexible. In this spirit, Kessler et al. (2014) considered

marginally specified priors that are approximately independent for a finite set of functionals

(e.g. margins of large probability contingency tables). Extensions to hierarchical marginal

priors on the vector of marginal TPRs θMj , j = 1, ..., J , can allow information to be bor-

rowed across pathogens when the marginal TPRs are considered similar.

The PERCH study motivates an extension of the npLCM to the regression settings so

that observed covariates, including seasonality, can be included to study how the population

etiologic fraction and individual diagnoses vary across subgroups. Such extensions are

natural and underway.

In this paper, we assumed a single primary cause for each pneumonia case in the

npLCM. This framework also extends to multiple pathogen causes in the lung by using

a latent vector for case i, $I^L_i \in \{0, 1\}^J$, where a 1 in position j indicates that pathogen j is one of possibly

multiple causes. For estimation, Hoff (2005) uses Dirichlet process mixture models to

identify multiple abnormal genomic locations that are jointly responsible for each case’s

disease, but using only case data with conditional independence assumption. Alternatively,

one can place an exponential penalty on the number of causes (e.g., Zhang and Liu, 2007),

or use conditionally specified models $\Pr[I^L_{ij} = 1 \mid I^L_{ij'}, j' \neq j, X_{ij}]$ to characterize interactions

between pathogens (Besag, 1974), where Xij is a vector of covariates predictive for

pathogen j being a cause in case i. The computational cost to fit these models increases

substantially because the search space for the latent vector ILi expands exponentially in J .

Development of efficient and reliable posterior sampling algorithms can allow investigators

to assess the evidence of multiple-pathogen etiologies as more measurements accrue.

Other pathogen measurements, for example, from blood culture, have also been col-

lected in the PERCH study and can be integrated. This paper used only “bronze-standard”

(BrS) data from the NP for which case and control samples are available. A BrS measure is

assumed to have imperfect sensitivity and imperfect specificity. Blood cultures for bacteria

are an example of “silver-standard” (SS) measures assumed to have perfect specificity and

imperfect sensitivity. The integration of BrS and SS data to estimate π using the pLCM

is described in detail in Wu et al. (2014a). The same approach can be carried over to this

application of npLCM.

The npLCM was originally envisioned for a study with not only BrS and SS data but also “gold-standard”

(GS) data defined by perfect sensitivity and specificity. As Albert and Dodd (2008) have

shown, GS data enable internal validation of a latent class model including the npLCM.

Absent GS data, external evidence is required to validate model predictions. For example,

the model prediction about the effect of a pneumococcal conjugate vaccine program could

be compared to the observed results. A key final point is that inferences about the lung infection

from peripheral measurements must by definition depend upon the key model

assumption that there is a direct link between the observations and the state of the lung.


Chapter 5

Estimation of Treatment Effects in Matched-Pair Cluster Randomized Trials by Calibrating Covariate Imbalance Between Clusters with Application to Guided Care Study


Abstract

We address estimation of intervention effects in experimental designs in which (a) in-

terventions are assigned at the cluster level; (b) clusters are selected to form pairs, matched

on observed characteristics; and (c) intervention is assigned to one cluster at random within

each pair. One goal of policy interest is to estimate the average outcome if all clusters in

all pairs are assigned control versus if all clusters in all pairs are assigned to intervention.

In such designs, inference that ignores individual level covariates can be imprecise because

cluster-level assignment can leave substantial imbalance in the covariate distribution be-

tween experimental arms within each pair. However, most existing methods that adjust for

covariates have estimands that are not of policy interest. We propose a methodology that

explicitly balances the observed covariates among clusters in a pair to obtain more efficient

estimators, and retains the original estimand of interest. We demonstrate our approach

through the evaluation of the Guided Care program.



5.1 Introduction

Some useful experimental designs have the following three features: interventions are

assigned at the cluster level; clusters are selected to form pairs, matched on observed co-

variates; and interventions are assigned to one cluster at random within each pair. One goal

of policy interest is to estimate the average outcome if all clusters in all pairs are assigned

control versus if all clusters in all pairs are assigned to intervention. The effect of such a

policy is easy to understand, because its definition does not depend on models, even though

its estimation can be assisted by models. Such designs are useful when individual-level

randomization is not feasible due to practical constraints, and when cluster assignment also

reflects how the assignment would scale in practice.

The Guided Care program is a recent example of such a study (Boult et al., 2013).

The study’s goal was to assess the effect of Guided Care versus a control condition on

functional health and other patient outcomes among clinical practices serving chronically

ill older adults. In Guided Care, a trained nurse works closely with patients and their

physicians to provide coordinated care. The control group does not have access to such a

nurse. To assess the effect of the Guided Care intervention, the study recruited 14 clinical

practices and matched them in 7 pairs using clinical practice and patient characteristics,

and within each pair randomly assigned one clinical practice to Guided Care and the other

to control.

A problem with cluster-level assignment is that it can leave substantial imbalances in

the covariates within pairs. However, existing methods to estimate effects in such designs

rarely use covariates to adjust for these imbalances. As a consequence, such methods,

including nonparametric as well as hierarchical (meta-analysis) approaches, although

useful in other ways (Imai et al., 2009), can leave large uncertainty in the results. Meth-

ods that do use covariates usually estimate effects conditionally on covariates and cluster-

specific random effects (Thompson et al., 1997; Feng et al., 2001; Hill and Scott, 2009).

With such methods, the estimands are no longer of policy interest and lack meaning when

the modelling assumptions are misspecified.

We propose an approach that explicitly balances the observed covariates between clusters

in a pair and still estimates causal effects of policy interest. In Section 5.2, we formulate

the matched-pair cluster randomized design through potential outcomes. Then, in Section 5.3,

we characterize the existing approaches to causal effects estimation and their complications.

In Section 5.4, we propose a covariate-calibration approach and develop inferences

with and without the need for assumptions for a hierarchical second level. Throughout

these sections, the arguments are demonstrated through the evaluation of the recent Guided

Care program. Section 5.5 concludes with discussion.

5.2 The goal and design using potential outcomes

Consider a design that operates in pairs p = 1, . . . , n of clusters. In each pair p, the

design recruits two clusters (e.g., clinical practices) indexed by i = 1, 2, matched on quali-

tative and quantitative characteristics, such as percentage of patients with private insurance,



and where each clinical practice serves a community, say with a large number of Np,i pa-

tients. The design then assigns to each clinic one of two treatments, namely control (t = 1)

or intervention (t = 2). If clinical practice i of pair p is assigned treatment t, then po-

tential outcomes Yp,i,k(t) (Rubin, 1974, 1978) are to be measured on a random sample of

k = 1, . . . , np,i patients from the Np,i patients served in that clinical practice. We label

Fp,i(y; t), µp,i(t), and σ2p,i(t) the distribution (at value y), mean and variance of the poten-

tial outcome Yp,i,k(t) within clinical practice i of pair p. The average outcomes in pair p

are

µp(t) := µp,i=1(t)πp,i=1 + µp,i=2(t)πp,i=2, (5.2.1)

where “ := ” means “define”, πp,i=1 is the fraction of patients served by clinic i = 1, i.e.

Np,i=1/(Np,i=1 +Np,i=2), and similarly for πp,i=2. One goal of policy interest is to estimate

the average outcome if all clinical practices in all pairs are assigned control versus if all

clinical practices in all pairs are assigned intervention. In terms of the model, the goal is to

estimate a contrast between

$$\mu(1) := E\,\mu_p(t=1) \quad\text{and}\quad \mu(2) := E\,\mu_p(t=2), \qquad \text{for example } \delta^{\text{effect}} := \mu(1) - \mu(2), \tag{5.2.2}$$

which is the average outcome if all clusters had been assigned treatment 1 versus if all

clusters had been assigned treatment 2. Here, the expectations are taken over a larger pop-

ulation P of pairs from which p = 1, . . . , n can be considered a random sample. Alternative



estimands (e.g. conditionally on the sample of pairs, Imai et al. (2009)) can be considered,

although this does not change the main issues discussed here.
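The definitions (5.2.1) and (5.2.2) can be illustrated numerically. The sketch below is only an illustration: the clinic sizes and mean potential outcomes are hypothetical numbers (in practice only one potential outcome per clinic is observable), and the function names `pair_means` and `delta_effect` are introduced here for convenience.

```python
import numpy as np

# Hypothetical inputs for n = 3 pairs; none of these numbers come from the study.
# N[p, i]     : number of patients served by clinic i of pair p
# mu[p, i, t] : mean potential outcome in clinic i of pair p
#               (last index 0 -> t = 1, control; 1 -> t = 2, intervention)
N = np.array([[1000.0, 2000.0],
              [1500.0, 1500.0],
              [800.0, 1200.0]])
mu = np.array([[[36.0, 37.0], [38.0, 39.5]],
               [[40.0, 40.5], [39.0, 39.0]],
               [[35.0, 36.0], [37.0, 38.0]]])


def pair_means(N, mu):
    """Equation (5.2.1): mu_p(t) = sum_i pi_{p,i} * mu_{p,i}(t)."""
    pi = N / N.sum(axis=1, keepdims=True)        # pi_{p,i}, fraction served by clinic i
    return np.einsum("pi,pit->pt", pi, mu)       # shape (pairs, treatments)


def delta_effect(N, mu):
    """Sample analogue of (5.2.2): average of mu_p(1) minus average of mu_p(2)."""
    mu_p = pair_means(N, mu)
    return mu_p[:, 0].mean() - mu_p[:, 1].mean()


mu_p = pair_means(N, mu)
delta = delta_effect(N, mu)
```

Here the average over the three listed pairs stands in for the expectation over the larger population P of pairs.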

Within each pair, the design assigns at random the intervention to one clinical practice

and the control to the other, independently across pairs. Because in this design the original

ordering i is arbitrary, and in order to ease comparisons with the existing meta-analytic

approach (e.g. Thompson et al. (1997)), for each pair p we relabel by c = 1 the clinical

practice that is assigned control, and by c = 2 the clinical practice that is assigned interven-

tion. The quantities Yp,c,k(t), Fp,c(y; t), µp,c(t) and σ2p,c(t) are then redefined based on this

relabeling and the above definitions. Then, the paired cluster randomized design implies

the following:

CONDITION 1. The potential outcomes under treatments 1 and 2 in clinical practice c, and

the number of patients served by clinical practice c are exchangeable (in distribution over

pairs) between clinical practices c = 1 and c = 2, i.e.,

where the arrows connect equal entries in arguments, and distribution pr is over pairs p in

the larger population P of pairs.

Condition 1 implies, for example, that over the population of pairs, the joint distribution of the

means and variances of potential outcomes under exposure to intervention (t = 2) is the

same for the clinical practices that are actually assigned the intervention (c = 2) as it is



for the clinical practices that are actually assigned the control (c = 1). Figure 5.1 illus-

trates the structure of pairs, clinical practices, and assigned treatments in this paired cluster

randomized design, along with means and variances of potential outcome distributions.

Here we connect the observed data and existing methods to the above framework of

potential outcomes, because this helps understand the meaning of the assumptions, explicit

or implicit, required by the existing methods.

In order to estimate an effect such as δeffect of (5.2.2), consider first a particular pair

p: we can directly estimate the average potential outcome under control for the clini-

cal practice assigned to the control, namely µp,c=1(t = 1); and the average potential

outcome under intervention for the clinical practice assigned to the intervention, namely

µp,c=2(t = 2). Specifically, for the control clinical practice (c = 1) of pair p, let

$$\hat\mu_{p,c=1}(t=1) := \frac{1}{n_{p,c=1}} \sum_{k=1}^{n_{p,c=1}} Y_{p,c=1,k}(t=1)$$

denote the average of the observed outcomes, i.e., the potential outcomes under t = 1; and for the intervention clinical practice (c = 2) of pair p, let

$$\hat\mu_{p,c=2}(t=2) := \frac{1}{n_{p,c=2}} \sum_{k=1}^{n_{p,c=2}} Y_{p,c=2,k}(t=2)$$

denote the average of the observed outcomes, i.e., the potential outcomes under t = 2. Then, letting $\hat\delta^{\text{crude}}_p = \hat\mu_{p,1}(1) - \hat\mu_{p,2}(2)$, and conditionally on pairs p whose clinical practices have particular values of $(\delta^{\text{crude}}_p, v^{\text{crude}}_p)$, we have that

$$\operatorname{pr}(\hat\delta^{\text{crude}}_p \mid \delta^{\text{crude}}_p, v^{\text{crude}}_p) \approx \text{Normal}(\delta^{\text{crude}}_p, v^{\text{crude}}_p), \tag{5.2.3}$$

where

$$\delta^{\text{crude}}_p := \mu_{p,1}(1) - \mu_{p,2}(2) \quad\text{and}\quad v^{\text{crude}}_p = \frac{\sigma^2_{p,1}(1)}{n_{p,1}} + \frac{\sigma^2_{p,2}(2)}{n_{p,2}}.$$

Figure 5.1: The underlying structure of the paired-cluster randomized design. The top part (observed pair p) and bottom part (observed pair p′) are the two possible ways in which a single pair can be manifested in the design. Observed pair p has two clinical practices (represented by the two squares). For each clinical practice, the first row shows the mean and variance of patient outcomes if the clinical practice is assigned control and the second row shows the mean and variance if assigned intervention. The clinical practice actually assigned control is indicated by its placement in column “1”, and the clinical practice actually assigned intervention is in column “2”. The solid (nonsolid) ellipsoids show the means and variances that can (cannot) be estimated directly. Observed pair p′ shows how the same pair would be manifested in the design if the assignment of treatment to clinical practices were in reverse (a line with arrows connects the same clinical practice in these two different assignments). Condition 1 means that each of the two manifestations, p and p′, has the same probability.

Here, “≈” means “approximately”, and the notation pr(Ap | Bp) and E(Ap | Bp) denotes the

distribution and expectation, respectively, of characteristic Ap among pairs in the larger

population P that have characteristic Bp (if Bp is empty, the distribution and expectation

are over all pairs).

Remark 1. In a pair, the directly estimable (crude) contrast δcrudep is not a causal effect be-

cause it compares different clinical practices under different treatments (Thompson et al.,

1997). However, the average, $E(\delta^{\text{crude}}_p)$, over pairs is a causal effect, because the exchangeability of potential outcomes between clinical practices 1 and 2 (Condition 1 above) implies (proof omitted) that

$$E(\delta^{\text{crude}}_p) = E\,\mu_p(t=1) - E\,\mu_p(t=2), \quad\text{which is } \delta^{\text{effect}}. \tag{5.2.4}$$

Thus, one can use the estimated differences, $\hat\delta^{\text{crude}}_p$, within each pair as in (5.2.3), and expression (5.2.4), to estimate $\delta^{\text{effect}}$, either with no additional assumptions (i.e., by simply averaging $\hat\delta^{\text{crude}}_p$ over pairs), or under a hierarchical second level model.
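Remark 1 can be checked in a small simulation. The following is a minimal sketch under stated assumptions, not the omitted proof: the two clinics in a pair serve equally many patients (so the pair mean is the simple average), clinic means under control are drawn from a hypothetical Normal(40, 3²) distribution, and the intervention shifts every clinic mean by the same hypothetical amount.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 50_000
true_effect = 1.0  # hypothetical delta_effect = mu(1) - mu(2)

# Clinic-level mean potential outcomes under control, two clinics per pair.
mu_control = rng.normal(40.0, 3.0, size=(n_pairs, 2))
# Assumption: the intervention shifts every clinic mean by the same amount.
mu_interv = mu_control - true_effect

# Random assignment within each pair: index of the clinic that receives control.
ctrl = rng.integers(0, 2, size=n_pairs)
rows = np.arange(n_pairs)

# Crude estimand: control clinic under control minus the other clinic under intervention.
delta_crude = mu_control[rows, ctrl] - mu_interv[rows, 1 - ctrl]

avg_crude = delta_crude.mean()       # close to true_effect
sd_crude = delta_crude.std(ddof=1)   # large: a single pair's contrast is not the effect
```

Within a pair the crude contrast mixes the treatment effect with the difference between the two clinics, which is why its spread across pairs is large even though its average recovers the effect.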

Remark 2. The objective meaning that the potential outcomes assign to the terms in the

model (5.2.3) implies the following, subtle fact: if the pair-specific δcrudep are to be elimi-

nated (i.e., marginalized over) from the conditional likelihood (5.2.3), then δcrudep should

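be first integrated out of (5.2.3) based on the conditional distribution $\operatorname{pr}(\delta^{\text{crude}}_p \mid v^{\text{crude}}_p)$, i.e.,

$$\operatorname{pr}(\hat\delta^{\text{crude}}_p \mid v^{\text{crude}}_p) = \int \operatorname{pr}(\hat\delta^{\text{crude}}_p \mid \delta^{\text{crude}}_p, v^{\text{crude}}_p) \cdot \operatorname{pr}(\delta^{\text{crude}}_p \mid v^{\text{crude}}_p) \cdot d(\delta^{\text{crude}}_p). \tag{5.2.5}$$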

This becomes relevant when examining the existing hierarchical modeling methods.

Next, we discuss complications of existing methods for estimating the effect of inter-

vention δeffect . We demonstrate the arguments by assessing the effect of the Guided Care

intervention on the functional health outcome of the patients as measured by the physical

component summary of the Short Form (SF)-36 version 2 (Ware and Kosinski, 2001).

5.3 Complications with existing methods

5.3.1 Consequences when ignoring covariates.

Table 5.1 displays the observed average SF-36 scores for each of the seven pairs of prac-

tices in the Guided Care study (see outcome rows denoted as uncalibrated). Also displayed

are the within pair differences in average SF-36 outcomes between control and intervention.

Using these, Table 5.3 reports the estimate of the overall effect $\delta^{\text{effect}}$, first based only on the design-derived fact (5.2.4) that the average of $\delta^{\text{crude}}_p$ equals the effect of interest $\delta^{\text{effect}}$ (see 1st level, “uncalibrated on covariates”). Because this first-level approach makes no further assumptions about the joint distribution $\operatorname{pr}(\delta^{\text{crude}}_p, v^{\text{crude}}_p)$, the MLE of $\delta^{\text{effect}}$ is simply the unweighted sample average of $\hat\delta^{\text{crude}}_p$, with its standard error estimated by the jackknife. Table 5.3 also reports the permutation test of no true effect for any person, by randomly permuting the labels of treatment within each pair.

Table 5.1: Summary of average SF36 outcomes for uncalibrated versus calibrated approaches. The first row block displays sample sizes; the second row block displays average outcomes that are uncalibrated and calibrated, respectively.

                                          pair p
                              1      2      3      4      5      6      7
sample size
  np,c=1                     17     16     42     23     52     23     28
  np,c=2                     38     44     43     33     42     31     43
outcome, uncalibrated on covariates
  µp,1(1)                  36.4   36.5   39.6   39.1   39.7   33.8   39.6
  µp,2(2)                  37.3   36.6   39.3   35.3   35.2   36.4   40.9
  δcrudep                  -0.8   -0.1    0.3    3.8    4.5   -2.6   -1.3
  (vcrudep)^1/2             2.7    2.6    2.0    2.7    2.1    2.6    2.2
outcome, calibrated on covariates
  * µcalibrp,1             37.6   38.8   39.5   38.0   38.7   35.5   40.9
  * µcalibrp,2             36.7   35.8   39.4   36.0   36.4   35.1   40.0
  δcalibrp                  0.9    3.0    0.1    1.9    2.3    0.5    0.8
  † (vcalibrp)^1/2          2.1    2.4    1.5    2.0    1.7    2.2    1.7

*: calibration based on np,1 and np,2 observations in pair p

†: vcalibrp is the pth diagonal element of Σδcalibr in expression (5.4.8)
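The first-level estimate and its jackknife standard error can be sketched directly from the crude within-pair differences listed in Table 5.1. This is an illustrative sketch, not the author's code; the function names are introduced here, and the inputs are the rounded table values, so the results can differ slightly from computations on the raw data.

```python
import numpy as np

# Within-pair crude differences, copied from the delta^crude_p row of Table 5.1.
delta_crude_hat = np.array([-0.8, -0.1, 0.3, 3.8, 4.5, -2.6, -1.3])


def first_level_estimate(d):
    """Unweighted average of the within-pair differences."""
    return d.mean()


def jackknife_se(d):
    """Leave-one-pair-out jackknife standard error of the unweighted average."""
    n = len(d)
    loo = np.array([np.delete(d, p).mean() for p in range(n)])
    return np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2))


est = first_level_estimate(delta_crude_hat)
se = jackknife_se(delta_crude_hat)
```

With these rounded inputs the estimate is about 0.54 and the standard error about 1.0, in line with the first-level uncalibrated row of Table 5.3.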

For a hierarchical second-level (meta-analytic) inference, the current approach for paired-

clustered designs (e.g., Thompson et al., 1997; Feng et al., 2001; Hill and Scott, 2009) is

based on integrating the likelihood in (5.2.3) over the marginal distribution pr(δcrudep ), to

obtain:

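$$\operatorname{pr}^{*}(\hat\delta^{\text{crude}}_p \mid v^{\text{crude}}_p, \delta^{\text{effect}}) = \int \operatorname{pr}(\hat\delta^{\text{crude}}_p \mid \delta^{\text{crude}}_p, v^{\text{crude}}_p) \cdot \operatorname{pr}(\delta^{\text{crude}}_p) \cdot d(\delta^{\text{crude}}_p); \tag{5.3.1}$$

where $\operatorname{pr}(\delta^{\text{crude}}_p) = \text{Normal}(\delta^{\text{effect}}, v)$.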

Table 5.3 (see 1st+2nd level, “uncalibrated on covariates”) shows inference for the ef-

fect δeffect using the above likelihood (5.3.1), namely, the method of Thompson et al.

(1997) with and without profiling out the variance v (see row 3 and 4); and also inference

based on the mean of the posterior distribution of δeffect using the uniform shrinkage prior

on v as suggested by Daniels (1999) (see row 5). For comparison, we also obtained the

two-sided tail probability from the distribution of the MLE from (5.3.1) as obtained from

all the permutation possibilities of the intervention and control labels of clinical practices

independently across pairs. None of these results suggest any substantial effect for the

intervention.
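Under the normal forms in (5.2.3) and (5.3.1), integrating out the pair-specific mean gives a marginal Normal distribution with mean equal to the overall effect and variance v plus the pair's first-level variance. A minimal sketch of the resulting MLE follows. It assumes the first-level variances are treated as known and equal to the squared values in Table 5.1, and it uses a simple grid search over v, a choice made here for transparency and not necessarily the optimizer used for Table 5.3.

```python
import numpy as np

# Rounded values from Table 5.1 (uncalibrated rows).
delta_hat = np.array([-0.8, -0.1, 0.3, 3.8, 4.5, -2.6, -1.3])
sd_hat = np.array([2.7, 2.6, 2.0, 2.7, 2.1, 2.6, 2.2])
v_hat = sd_hat ** 2


def profile_fit(delta_hat, v_hat, v_grid):
    """Maximize the marginal likelihood Normal(delta_effect, v + v_p) over a grid of v."""
    best = None
    for v in v_grid:
        w = 1.0 / (v + v_hat)                        # inverse marginal variances
        mu = np.sum(w * delta_hat) / np.sum(w)       # MLE of delta_effect given v
        loglik = 0.5 * np.sum(np.log(w)) - 0.5 * np.sum(w * (delta_hat - mu) ** 2)
        if best is None or loglik > best[0]:
            best = (loglik, mu, v, np.sqrt(1.0 / np.sum(w)))
    _, mu_best, v_best, se_best = best
    return mu_best, v_best, se_best


delta_mle, v_mle, se_mle = profile_fit(delta_hat, v_hat, np.linspace(0.0, 20.0, 2001))
```

For any fixed v the estimate is an inverse-variance weighted average, so pairs with small first-level variance receive more weight. That weighting is what makes the independence assumption discussed next matter.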

In general, the hierarchical and non-hierarchical methods without covariates can be



inaccurate for at least one of the following two reasons. First, any substantial covariate

imbalances between clinical practices within a pair can result in substantial uncertainty,

which is reflected in the variance of the estimators of the effect, and which may have

influenced the point estimate. For the Guided Care study, Table 5.2 shows that a number

of covariates show substantial imbalance between intervention and control groups. For

example, the continuous covariate Chronic Illness Burden has severe imbalances

between the clinical practices in pairs 2, 5 and 7, with t-statistics being −3.07, −4.81 and

2.52, respectively.
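The two imbalance summaries reported in Table 5.2 (a standardized difference for continuous covariates, and an odds ratio with 0.5 added to every cell for categorical ones) can be sketched as follows. The function names, the choice of which cluster forms the numerator, and the example inputs are assumptions made for illustration.

```python
import numpy as np


def standardized_difference(x1, x2):
    """Difference in means divided by the pooled standard deviation."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    return (x1.mean() - x2.mean()) / np.sqrt(pooled_var)


def odds_ratio_half_corrected(a, b, c, d):
    """Odds ratio for the 2x2 table [[a, b], [c, d]] with 0.5 added to every cell.

    a, b: counts in and out of the category in one cluster;
    c, d: the same counts in the other cluster.
    """
    a, b, c, d = (v + 0.5 for v in (a, b, c, d))
    return (a * d) / (b * c)


# Hypothetical inputs, not study data.
effect_size = standardized_difference([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
odds_ratio = odds_ratio_half_corrected(0, 10, 5, 5)
```

The 0.5 correction keeps the odds ratio finite when a category is empty in one cluster, as in the second example call.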

Second, the hierarchical model approach, in addition to its normality assumption, can be questioned for the following subtle reason. In order to integrate out $\delta^{\text{crude}}_p$ from the likelihood (5.2.3) to obtain a likelihood that, like (5.3.1), still depends on the variances $v^{\text{crude}}_p$, one must integrate $\delta^{\text{crude}}_p$ with respect to the conditional distribution of the estimand $\delta^{\text{crude}}_p$ given the variance $v^{\text{crude}}_p$, as in (5.2.5) of Remark 2, and not with respect to the marginal distribution $\operatorname{pr}(\delta^{\text{crude}}_p)$ as in (5.3.1). The comparison of (5.3.1) to (5.2.5) shows that (5.3.1) implicitly assumes the following:

CONDITION 2. The estimand $\delta^{\text{crude}}_p$ and the variance $v^{\text{crude}}_p$ of $\hat\delta^{\text{crude}}_p$ at the first level are independent across pairs p.

The motivation for using the likelihood (5.3.1) can be traced to Thompson et al. (1997,

Section 5, Paragraph 2). There, inference for the paired-clustered design is assumed to

have the same random effects structure as that of DerSimonian and Laird (1986), who

also assume Condition 2 but for a design that first randomly samples subjects from the



Table 5.2: Checking covariate imbalances within each pair. For a continuous covariate (indicated by (a)), we calculate effect size as difference divided by pooled standard deviation. For a categorical covariate (indicated by (b)), odds ratio is calculated comparing rates of occurrence of each category between two clusters in a pair. To prevent infinite odds ratio, 0.5 is added to all the cells when calculating sample odds ratios.

                                               pair
                                      1      2      3      4      5      6      7
age at interview (a)                0.3   -0.3    0.1    0.6    0.0    0.1   -0.1
Chronic Illness Burden (a)          0.5   -0.6    0.0    0.0   -1.1    0.1    0.6
SF36 Mental (a)                    -0.3    0.1    0.3    0.2    0.3   -0.6   -0.5
SF36 Physical (a)                  -0.1   -0.4    0.1    0.5    0.4   -0.6   -0.3
lives alone (b)                     1.4    0.8    0.7    0.7    1.6    0.9    0.5
>high school education (b)          0.4    0.5    0.7    1.4    0.8    0.8    1.1
Female (b)                          2.4    0.6    1.0    0.6    1.0    2.5    1.1
race (b)
  Caucasian                         0.5    0.2    0.9    0.8    1.5    0.5    0.7
  African American                  2.2    0.9    1.2    1.2    0.8    1.6    1.2
  other                             2.2   15.0    1.0    1.4    0.6    1.3    1.5
finances at end of month (b)
  some money left over              0.0    0.7    1.4    0.7    1.5    0.7    0.6
  just enough to make ends meet     8.9    1.0    0.3    1.3    0.6    1.2    1.4
  not enough to make ends meet     18.2    8.4    7.0    1.0    1.2    2.0    1.6
self rated health (b)
  ≥very good                        0.3    0.3    0.8    2.2    0.3    0.8    0.6
  good                              2.6    3.4    1.4    0.4    2.5    0.8    1.4
  fair                              0.9    0.9    0.4    0.3    2.5    4.2    0.5
  poor                              6.8    1.5    3.1    4.4    2.0    4.2    2.1



Table 5.3: Results from MLE, profile MLE, Bayes estimates and permutation test in the Guided Care program study. The covariates used for calibration are listed in the first column of Table 5.2; the outcome is the physical component summary of the Short Form 36 (SF36).

                      δeffect   95% C.I.       s.e.(δeffect)   var(δ∗p)   p-value (two-sided)
uncalibrated on covariates
  1st level
    MLE                 0.5     (−1.4, 2.5)        1.0            −           0.59
    permutation          −          −               −             −           0.61
  1st+2nd level
    MLE                 0.6     (−1.2, 2.5)        0.9           0.7          0.50
    pMLE                0.6     (−1.5, 2.7)         −            0.7           −
    Bayes               0.6     (−1.7, 3.0)        1.2           4.3          0.60
    permutation          −          −               −             −           0.60
calibrated on covariates
  1st level
    MLE                 1.4     (0.5, 2.2)         0.4#           −          <0.01
    permutation          −          −               −             −           0.02
  1st+2nd level
    MLE                 1.2     (−0.2, 2.6)        0.7           0.0          0.08
    pMLE                1.2     (−0.2, 2.6)         −            0.0           −
    Bayes               1.3     (−0.4, 2.9)        0.9           1.5          0.13
    permutation          −          −               −             −           0.02

*: represents δcrudep for the uncalibrated approach and δcalibrp for the calibrated approach.

#: estimated by the jackknife.



population that a pair serves and then completely randomizes them, regardless of their

clinical practice. Call this simpler design, a “paired-strata” design. We show below that

violation of Condition 2 has more severe implications for the paired-clustered than for the

paired-strata design.

In the paired-strata design, the observed difference, say $\hat\delta'_p$, in average outcomes between intervention and control individuals within a pair has mean, say $\delta'_p$, equal to the causal effect µp(2) − µp(1) of (5.2.2). This means that, if the intervention has no effect in any pair, i.e., the null hypothesis, µp(1) = µp(2) for all p, is correct, then $\delta'_p$ is a constant (0) and so Condition 2 is satisfied. As a result, an approach based on (5.3.1) is valid for testing µp(1) = µp(2) for all p because Condition 2 is correct under the null hypothesis being tested in that design.

In the paired-clustered design, however, the mean, $\delta^{\text{crude}}_p$, of $\hat\delta^{\text{crude}}_p$ is not a causal effect (see Remark 1 above) even if the intervention has no effect in any cluster, i.e., even if the null hypothesis, µp,c(1) = µp,c(2) for all p and c, is correct. In particular, under this null, the mean $\delta^{\text{crude}}_p$ is µp,1(1) − µp,2(1), i.e., the difference between clinical practices 1 and 2 if they had both been assigned control. In practice, even after matching, the two clinical practices are expected to have imbalances in characteristics of the patients or the doctors, so that $\delta^{\text{crude}}_p$ is not expected to be zero, and, hence, Condition 2 can be violated. We have the following result (proof in Appendix):

RESULT 1. If the intervention has no effect, µ(1) = µ(2), but Condition 2 is violated, then

the MLE of the causal effect δeffect based on (5.3.1) can converge to a non-zero value as



the number of sampled practices increases.
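Result 1 can be illustrated with a deliberately extreme simulation. This is a sketch under hypothetical assumptions, not the Appendix proof: the pair-specific means average to zero over pairs (no overall effect) but are tied to the first-level variances, and the second-level variance v is held fixed at its population value instead of being estimated.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pairs = 20_000

# Hypothetical population violating Condition 2:
# precise pairs (small v_p) have delta_p = +1, noisy pairs (large v_p) have delta_p = -1.
precise = rng.random(n_pairs) < 0.5
delta_p = np.where(precise, 1.0, -1.0)   # averages to about 0 over pairs
v_p = np.where(precise, 0.1, 10.0)

# First-level estimates: delta_hat_p ~ Normal(delta_p, v_p).
delta_hat = delta_p + rng.normal(0.0, np.sqrt(v_p))

# Weighted estimator implied by (5.3.1) for a fixed second-level variance v.
v = 1.0                                   # variance of delta_p in this population
w = 1.0 / (v + v_p)
weighted = np.sum(w * delta_hat) / np.sum(w)

# Unweighted first-level estimator.
unweighted = delta_hat.mean()
```

The unweighted average stays near zero, while the weighted estimator settles near 0.8 because it up-weights the precise pairs, whose means are systematically positive in this construction.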

Therefore, it is important to try to assess the plausibility of Condition 2. For the Guided

Care study, Figure 5.2 (left) plots the estimated values of $\sqrt{v^{\text{crude}}_p}$ against $\hat\delta^{\text{crude}}_p$. Here

there appear to be no noticeable warnings against independence. However, the covariate

imbalances shown in Table 5.2 could still be contributing to inaccurate estimates through large

variances as discussed earlier.

5.3.2 Complications with existing covariate methods.

Some existing proposals do incorporate covariates into the model for pr(δcrudep ) on

the RHS of likelihood (5.3.1). However, these approaches stop short of addressing the

goal of estimating effects of policy interest. In particular, such existing approaches (e.g.,

Thompson et al. (1997), Sec.5.5, Feng et al. (2001)) define the treatment effect to be a

contrast in the treatment coefficients of the posited model after conditioning on a particular

value of the covariates and/or of random effects specific to the clusters. The first problem

with such a treatment effect is that its meaning is not objective: if, for example, the model

is misspecified, then an effect set equal to a contrast of coefficients in the model does not

have a well defined physical interpretation. The second problem is that, even if the model

is correct, a treatment effect that is conditional on the covariates and/or the clusters is not

usually equal to but is only partially related to the overall effect.



5.4 Addressing the Problems

5.4.1 Calibration of observed covariate differences between

clinical practices

In order to use covariates to estimate the treatment effects in (5.2.2), we propose to first

construct calibrated pair-specific averages, for each treatment t = 1, 2, in the sense that the

distribution of the covariates reflected in the averages will be the same as the distribution of

covariates combined from both clinical practices of the pair. Inference for these calibrated

averages will then lead to inference for overall effects (5.2.2) with the gained precision of

accounting for the difference in observed covariates between the matched clinical practices.

This section uses notation for the following additional structure for pair p:

Xp,c,k, for the measurement of a covariate vector before treatment administration, for

the kth sampled patient of clinical practice c in pair p;

Gp,c(x), for the joint cumulative distribution function of the covariate vector Xp,c,k in

clinical practice c, evaluated at value x; and Gp(x) for the joint cumulative distribu-

tion function (evaluated at x) of the covariate vector of a patient selected at random

from pair p (i.e., from the two clinical practices of that pair, combined);

Fp,c(y | x; t), for the cumulative distribution function of the potential outcome Yp,c,k(t)

in clinical practice c, evaluated at value y among covariate levels x, if clinical practice

c is assigned treatment t; and µp,c(x; t), for the mean of the latter distribution.

For pair p, consider now the estimable quantity, labelled as µcalibrp (t = 1), that is

constructed by, first, stratifying the average outcome into the covariate levels of the clinical

practice c = 1 (assigned to treatment 1), namely µp,c=1(x; t = 1), and then re-calibrating

it with respect to the covariate distribution of the two clinical practices combined (and

similarly for t = 2):

$$\mu^{\text{calibr}}_{p,c=1} := \int_x \mu_{p,c=1}(x; t=1)\, dG_p(x), \qquad \mu^{\text{calibr}}_{p,c=2} := \int_x \mu_{p,c=2}(x; t=2)\, dG_p(x). \tag{5.4.1}$$

To understand the above estimand, consider for example two clinical practices in a pair

that were matched as closely as possible with respect to, say, the percentage of patients

with a “low” or “high” risk covariate (x = low or high), but in which the percentage of low risk

patients in clinical practices 1 and 2 is 75% and 85% respectively, i.e., still differs appreciably between

the clinical practices. Suppose also that clinical practice 2 serves twice as many patients as

clinical practice 1. Ignoring covariates, the quantity that can be directly estimated from the

data for representing the average outcome if both clinical practices are assigned treatment 1

is simply the average outcome within clinical practice 1, µp,c=1(1), which can be expressed

in terms of the covariate as 0.75 · µp,c=1(x = low; t = 1) + 0.25 · µp,c=1(x = high; t =

1). When using covariates, the calibrated average µcalibrp,c=1 is 0.82 · µp,c=1(x = low; t =

1) + 0.18 · µp,c=1(x = high; t = 1), because it generalizes the covariate-specific outcome

averages under treatment 1 to the covariate distribution for both clinical practices in which

0.75 · (1/3) + 0.85 · (2/3) ≈ 0.82 have low risk.
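The arithmetic of this example can be reproduced in a few lines. The low-risk fractions and the 1:2 size ratio are taken from the text; the covariate-specific outcome means `mu_low` and `mu_high` are hypothetical values added only to show how the crude and calibrated averages differ.

```python
import numpy as np

p_low = np.array([0.75, 0.85])       # fraction of low-risk patients in clinics 1 and 2
size_share = np.array([1.0, 2.0])    # clinic 2 serves twice as many patients as clinic 1
size_share = size_share / size_share.sum()

# Fraction of low-risk patients in the two clinics combined.
pooled_low = float(np.sum(size_share * p_low))


def calibrated_mean(mu_low, mu_high, pooled_low):
    """Covariate-specific means re-weighted to the pooled covariate distribution."""
    return pooled_low * mu_low + (1.0 - pooled_low) * mu_high


# Hypothetical covariate-specific mean outcomes in clinic 1 under treatment 1.
mu_low, mu_high = 40.0, 30.0
crude = 0.75 * mu_low + 0.25 * mu_high                    # uses clinic 1's own mix
calibrated = calibrated_mean(mu_low, mu_high, pooled_low)  # uses the pooled mix
```

The pooled low-risk fraction is about 0.82, so the calibrated average gives the low-risk stratum more weight than clinic 1's own 0.75.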

More generally, one should expect that the calibrated contrasts $\mu^{\text{calibr}}_{p,c=1} - \mu^{\text{calibr}}_{p,c=2}$, though

still not equal to the target causal effect µp(t = 1)−µp(t = 2) of (5.2.1) in each pair, should,

(a) share the property with the uncalibrated estimands, i.e., that they average over pairs to

the average causal effect δeffect of (5.2.4); and (b) provide a basis for more efficient estima-

tors than the uncalibrated contrasts. This is true if the design is more carefully formalized

as follows:

CONDITION 3. The characteristics of a clinical practice, i.e., the distribution of potential

outcomes under treatments 1 and 2 conditionally on covariates, the distribution of covariates,

and the number of people served by clinical practice c, namely the vector of functions

[Fp,c(· | ·, t = 1), Fp,c(· | ·, t = 2), Gp,c(·), Np,c], is exchangeable (in distribution over

pairs) between clinical practices c = 1 and c = 2.

Then we have the following:

RESULT 2. (a) Under Condition 3, the average over pairs of the covariate-calibrations, $\mu^{\text{calibr}}_{p,c=1}$, i.e., based on the clinical practice assigned to treatment 1 in each pair (see (5.4.1)), equals the average of the potential outcomes if the entire population had been assigned treatment 1 (similarly for treatment 2); hence the estimable contrast

$$E\,\mu^{\text{calibr}}_{p,c=1} \quad\text{vs.}\quad E\,\mu^{\text{calibr}}_{p,c=2} \tag{5.4.2}$$

equals the causal contrast (5.2.2); (b) if µp,c(x; t = c) are correctly specified, then the MLEs of $E\,\mu^{\text{calibr}}_{p,c}$ in (5.4.2) (and of the target estimands µ(t) in (5.2.2), due to (a) and the invariance property of the MLE) are the averages, over the observed pairs, of the empirical analogues of (5.4.1):

$$\int \hat\mu_{p,c}(x; t=c)\, d\hat G_p(x), \quad c = 1, 2, \tag{5.4.3}$$

where $\hat G_p$ is the weighted empirical distribution of covariates in pair p (the weight is determined by Np,c).

Condition 3 implies Condition 1. The proof of Result 2 (a) follows by iterated expectations; the proof of (b) follows because the empirical distribution $\hat G_p(x)$ as defined above is, under no other assumptions, the MLE of $G_p(x)$.

In practice, and simplifying the notation for the estimable averages µp,c(x; t = c) to µp,c(x), one can consider modelling µp,c(x) for each (pair p, cluster c), with µp,c(x; θ), where

$$h\{\mu_{p,c}(x, \theta)\} = \theta_{p,c} + \theta_{\text{cov}}' \cdot x \quad\text{and } h \text{ is a link function.} \tag{5.4.4}$$

Since these models condition on the pairs and clusters, the parameter θ can be estimated by the weighted least squares estimator $\hat\theta$, based on the first moment residuals $Y_{p,c,k} - \mu_{p,c}(X_{p,c,k}, \theta)$, where approximately

$$\hat\theta \mid \theta, \Sigma_\theta \sim \text{Normal}(\theta, \Sigma_\theta), \tag{5.4.5}$$

and where $\Sigma_\theta$ is the true variance-covariance matrix of $\hat\theta$, which can be estimated by the robust variance-covariance matrix denoted by $\hat\Sigma_\theta$.

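Based on these, the calibrated estimands in (5.4.1) can be estimated within each pair and clinical practice, by

$$\hat\mu^{\text{calibr}}_{p,c} = \int \mu_{p,c}(x, \hat\theta)\, d\hat G_p(x), \quad\text{for all } p, c, \tag{5.4.6}$$

whose joint distribution can be approximated by the delta method as

$$\text{level 1}: \quad \begin{pmatrix} \hat\mu^{\text{calibr}}_{p=1,c=1} & \hat\mu^{\text{calibr}}_{p=1,c=2} \\ \vdots & \vdots \\ \hat\mu^{\text{calibr}}_{p=N,c=1} & \hat\mu^{\text{calibr}}_{p=N,c=2} \end{pmatrix} \Bigg|\; \theta, \Sigma_{\mu^{\text{calibr}}} \sim \text{Normal}\left( \begin{pmatrix} \mu^{\text{calibr}}_{p=1,c=1} & \mu^{\text{calibr}}_{p=1,c=2} \\ \vdots & \vdots \\ \mu^{\text{calibr}}_{p=N,c=1} & \mu^{\text{calibr}}_{p=N,c=2} \end{pmatrix}, \Sigma_{\mu^{\text{calibr}}} \right), \tag{5.4.7}$$

and where $\Sigma_{\mu^{\text{calibr}}}$ can be estimated by $\hat\Sigma_{\mu^{\text{calibr}}}$.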
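The calibration step in (5.4.4) and (5.4.6) can be sketched for a single pair and a single covariate. The sketch makes several simplifying assumptions that are not in the text: an identity link, ordinary (unweighted) least squares, a fit that uses one pair at a time instead of sharing the covariate coefficient across all pairs, and no variance calculation. The function name and the example data are hypothetical.

```python
import numpy as np


def calibrate_pair(y1, x1, y2, x2, N1, N2):
    """Calibrated means for one pair under an identity-link version of (5.4.4).

    y1, x1: outcomes and one covariate for sampled patients in the control clinic (c = 1)
    y2, x2: the same for the intervention clinic (c = 2)
    N1, N2: numbers of patients served by each clinic
    """
    y = np.concatenate([y1, y2])
    x = np.concatenate([x1, x2])
    is_c2 = np.concatenate([np.zeros(len(y1)), np.ones(len(y2))])
    # Design matrix: clinic-specific intercepts and a common covariate slope.
    design = np.column_stack([1.0 - is_c2, is_c2, x])
    theta, *_ = np.linalg.lstsq(design, y, rcond=None)
    # Weighted empirical covariate distribution of the pair:
    # each sampled patient in clinic c carries weight pi_c / n_c.
    w = np.concatenate([np.full(len(x1), N1 / (N1 + N2) / len(x1)),
                        np.full(len(x2), N2 / (N1 + N2) / len(x2))])
    x_bar = np.sum(w * x)
    # With an identity link, averaging the fitted line over the pooled
    # distribution amounts to evaluating it at the weighted covariate mean.
    return theta[0] + theta[2] * x_bar, theta[1] + theta[2] * x_bar


# Hypothetical noise-free data: both clinics share slope 2, intercepts 10 and 8.
x1 = np.array([0.0, 1.0, 2.0, 3.0]); y1 = 10.0 + 2.0 * x1
x2 = np.array([2.0, 3.0, 4.0, 5.0]); y2 = 8.0 + 2.0 * x2
m1, m2 = calibrate_pair(y1, x1, y2, x2, N1=100, N2=300)
```

In this constructed example the crude means are 13 and 15 (control looks worse), while the calibrated means are 16 and 14, showing how covariate imbalance alone can reverse the sign of a crude contrast.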

5.4.2 Estimation of quantities of original interest

Expression (5.4.7) can be used for estimation of the causal contrast µ(1) vs. µ(2)

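(because of Result 2(a)); here we focus on $\delta^{\text{effect}} = \mu(1) - \mu(2)$. Specifically, setting $\delta^{\text{calibr}}_p = \mu^{\text{calibr}}_{p,c=1} - \mu^{\text{calibr}}_{p,c=2}$ and $\hat\delta^{\text{calibr}}_p = \hat\mu^{\text{calibr}}_{p,c=1} - \hat\mu^{\text{calibr}}_{p,c=2}$, we can consider the first or both levels of the following two-level model

$$\text{level } 1': \quad \begin{pmatrix} \hat\delta^{\text{calibr}}_1 \\ \vdots \\ \hat\delta^{\text{calibr}}_N \end{pmatrix} \Bigg|\; \begin{pmatrix} \delta^{\text{calibr}}_1 \\ \vdots \\ \delta^{\text{calibr}}_N \end{pmatrix}, \theta, \Sigma_{\delta^{\text{calibr}}} \sim \text{Normal}\left( \begin{pmatrix} \delta^{\text{calibr}}_1 \\ \vdots \\ \delta^{\text{calibr}}_N \end{pmatrix}, \Sigma_{\delta^{\text{calibr}}} \right), \tag{5.4.8}$$

$$\text{level } 2': \quad \delta^{\text{calibr}}_p \mid \delta^{\text{effect}}, \tau^2 \sim \text{Normal}(\delta^{\text{effect}}, \tau^2), \quad p = 1, \ldots, N, \tag{5.4.9}$$

where expression (5.4.8) follows from (5.4.7); the covariance matrix $\Sigma_{\delta^{\text{calibr}}}$, obtained by the delta method, can be estimated by $\hat\Sigma_{\delta^{\text{calibr}}}$; and $\tau^2$ is the variance of $\delta^{\text{calibr}}_p$ over pairs p.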
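The step from (5.4.7) to (5.4.8) is a linear contrast, so the covariance of the within-pair differences follows by pre- and post-multiplying the covariance of the calibrated means by a contrast matrix. A minimal sketch follows; the stacking order of the means and the diagonal covariance in the example are assumptions made for illustration.

```python
import numpy as np


def pair_differences(mu_hat, Sigma_mu):
    """Within-pair differences and their covariance.

    mu_hat  : length-2N vector ordered (p=1,c=1), (p=1,c=2), ..., (p=N,c=1), (p=N,c=2)
    Sigma_mu: the 2N x 2N covariance matrix of mu_hat
    Returns delta_hat = A mu_hat and Sigma_delta = A Sigma_mu A'.
    """
    n_pairs = len(mu_hat) // 2
    # Each row of A picks (control mean) - (intervention mean) for one pair.
    A = np.kron(np.eye(n_pairs), np.array([[1.0, -1.0]]))
    return A @ mu_hat, A @ Sigma_mu @ A.T


# Calibrated means for pairs 1 and 2 from Table 5.1, with a hypothetical covariance.
mu_hat = np.array([37.6, 36.7, 38.8, 35.8])
Sigma_mu = np.diag([1.0, 2.0, 0.5, 1.5])
delta_hat, Sigma_delta = pair_differences(mu_hat, Sigma_mu)
```

Off-diagonal entries of the covariance of the differences are generally non-zero in practice, because all calibrated means share the estimated covariate coefficients; the diagonal example above ignores that for simplicity.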

Table 5.1 shows the results for the calibrated estimates as derived from expressions

(5.4.7) and (5.4.8) (see rows for outcome “calibrated on covariates”) for each of the seven

pairs in the Guided Care study. The covariates that are involved in the calibration are listed

in Table 5.2. It is notable that these calibrated differences, δcalibrp , are positive, in favor of

the control condition, for all pairs p.

Using these, Table 5.3 also reports the estimate of the overall effect $\delta^{\text{effect}}$, first based only on the design-derived fact Result 2(a) that the average of $\delta^{\text{calibr}}_p$ equals the effect of interest $\delta^{\text{effect}}$ and on the estimation of each of $\delta^{\text{calibr}}_p$ by $\hat\delta^{\text{calibr}}_p$ as in (5.4.8) (see 1st level, “calibrated on covariates”). As with the uncalibrated first-level approach, this first-level calibrated approach makes no further assumptions about the joint distribution $\operatorname{pr}(\delta^{\text{calibr}}_p, \Sigma_{\delta^{\text{calibr}}})$, and the MLE of $\delta^{\text{effect}}$ is the unweighted sample average of $\hat\delta^{\text{calibr}}_p$ (here, its standard error is estimated by the jackknife, although in general it is difficult to trust a normal approximation with seven pairs). For this reason, we also calculated the significance level of the MLE by permutation of the treatment labels, thus testing the hypothesis of no true effect in any person. In this case, and because all calibrated estimated differences have the same sign, the permutation based significance level is $2/2^7 \approx 0.016$ in favor of the control condition.
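The permutation calculation can be sketched by enumerating all sign patterns of the within-pair differences. The sketch assumes that swapping the treatment labels within a pair simply flips the sign of that pair's estimated difference, and it uses the rounded differences from Table 5.1; the function name is introduced here.

```python
import itertools
import numpy as np


def sign_flip_pvalue(d, tol=1e-9):
    """Two-sided exact permutation p-value for the unweighted mean of d.

    Enumerates all 2^n within-pair label swaps, each of which flips
    the sign of the corresponding difference.
    """
    d = np.asarray(d, float)
    observed = abs(d.mean())
    count = 0
    total = 0
    for signs in itertools.product([1.0, -1.0], repeat=len(d)):
        total += 1
        # tol guards against floating-point ties with the observed value.
        if abs(np.mean(np.array(signs) * d)) >= observed - tol:
            count += 1
    return count / total


delta_calibr_hat = [0.9, 3.0, 0.1, 1.9, 2.3, 0.5, 0.8]      # Table 5.1, calibrated
delta_crude_hat = [-0.8, -0.1, 0.3, 3.8, 4.5, -2.6, -1.3]   # Table 5.1, uncalibrated

p_calibrated = sign_flip_pvalue(delta_calibr_hat)
p_crude = sign_flip_pvalue(delta_crude_hat)
```

Because all seven calibrated differences share a sign, only the observed pattern and its mirror image are as extreme, which gives the 2 out of 128 quoted in the text; the crude differences give a much larger value.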

For a two-level approach based on (5.4.8) and (5.4.9), one can estimate δeffect , by

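first obtaining the marginalized likelihood, say, $L(\delta^{\text{effect}}, \tau^2, \Sigma_{\delta^{\text{calibr}}})$. Then we estimated

$\delta^{\text{effect}}$ by (i) the MLE after $\hat\Sigma_{\delta^{\text{calibr}}}$ replaces $\Sigma_{\delta^{\text{calibr}}}$; (ii) the MLE after profiling $\tau^2$ out; and

(iii) the posterior distribution of $\delta^{\text{effect}}$ using noninformative priors for $\tau^2$ and $\delta^{\text{effect}}$. We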

use a uniform shrinkage prior for the second-level variance τ 2 advocated by Daniels (1999).

These results for the two-level approach are given in Table 5.3 (see rows 1st+ 2nd level;

MLE, pMLE, and Bayes, respectively).

As with the uncalibrated approach, the marginalized likelihood that uses (5.4.8) and

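(5.4.9) assumes that $\delta^{\text{calibr}}_p$ is independent of $\Sigma_{\delta^{\text{calibr}}}$. Figure 5.2, right panel, plots estimates

of the square root of the diagonals of $\Sigma_{\delta^{\text{calibr}}}$, $\sqrt{v^{\text{calibr}}_p}$, versus estimates of $\delta^{\text{calibr}}_p$.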

Although the plot can be to some degree affected by measurement error, the R2 of 0.19

suggests that some dependence exists. Although this dependence could be modeled in

a modified second level, it is unclear how convincing such an approach would be as it

would introduce even more modeling assumptions. To avoid this, we calculated instead

the significance level of the two-level MLE estimate when evaluated from the permutation

distribution of the treatment labels.



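Figure 5.2: Checking second level dependence. Left: estimates of $\sqrt{v^{\text{crude}}_p}$ versus $\delta^{\text{crude}}_p$; Right: estimates of $\sqrt{v^{\text{calibr}}_p}$ versus $\delta^{\text{calibr}}_p$, where $v^{\text{calibr}}_p$ are the diagonal elements of $\Sigma_{\delta^{\text{calibr}}}$.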

124


5.4.3 Assessment of the hypothesis of no effect

The proposed approach, in addition to being robust for hypothesis testing when eval-

uated by permutation, is likely to have a more general robustness property analogous to

the one arising in a simpler design. Specifically, in the design of complete randomization

of units (unpaired, unclustered), Rosenblum and van der Laan (2010) have shown that a

certain class of parametric models for covariates yield MLEs for the causal effect that are

consistent for the null value if indeed there is no effect on any person, even if the models

are incorrect. Shinohara et al. (2012) showed that an extended class of models has this

robustness property if the models satisfy an easy to check symmetry criterion.

For the matched-paired clustered-randomization design, analogous classes of models

with such robustness property may also exist. Specifically, suppose that, more generally

than model (5.4.4), we conceptualize a parametric model as one that allows distributions

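$m_{p,c}(y \mid x)$ for the outcome at value $y$ given covariate at value $x$ for each (pair, cluster) labelled $(p, c)$. Many flexible models $m_{p,c}(\cdot \mid \cdot)$ (or, for brevity, $m_{p,c}$), including (5.4.4), have the property that if, for two pairs and their clusters

$$\begin{array}{cc} p_1c_1 & p_1c_2 \\ p_2c_1 & p_2c_2 \end{array},$$

the model allows the distributions

$$\begin{array}{cc} m_{1,1} & m_{1,2} \\ m_{2,1} & m_{2,2} \end{array}$$

then it also allows the distributions

$$\begin{array}{cc} m_{2,2} & m_{1,2} \\ m_{1,2} & m_{2,2} \end{array} \quad \text{and} \quad \begin{array}{cc} m_{1,1} & m_{2,1} \\ m_{2,1} & m_{1,1} \end{array}.$$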

The intuition of this property is that the model allows exchangeable distributions between

any two observed pairs. Following a similar reasoning to that of Shinohara et al. (2012),

we hypothesize that if (a) there is no effect of intervention in the distribution of any cluster,

i.e., in the true distributions defined in Condition 3, $F_{p,c}(\cdot \mid \cdot\,; t = 1) = F_{p,c}(\cdot \mid \cdot\,; t = 2)$ for

all p, c, and (b) a model that has the above symmetry property is used, then the limit of the

MLE of the causal effect (5.4.2) is null even if the model is incorrect. A detailed treatment

of this issue can allow for combining validity with increased efficiency in such designs as

well.

5.5 Discussion

For the design that matches clusters of units and assigns interventions to clusters within

pairs, we proposed an approach that estimates the average causal effect while also explicitly

calibrating possible covariate imbalance between the clusters. The approach can use only

one level of inference, or can be used in a hierarchical model.

In the Guided Care study, a first-level inference with the new approach reports esti-

mates of the causal effect with smaller estimated variance than without using covariates

(see Table 5.3). Although it is difficult to know if this is objectively true in this small

sample of pairs, the results from the permutation tests between the two approaches are

also consistent with this conclusion. A simple two-level approach, with or without covari-

ates, makes an implicit assumption which can invalidate causal comparison of the inter-

ventions, and explicitly addressing the assumption would introduce additional modeling.

The covariate-calibrated approach reports that the control condition leads to higher, albeit

clinically insignificant, average overall SF36 score compared to that under Guided Care

Nurse intervention, using either a single-level (approximate or permutation-based) analysis

or a two-level permutation-based analysis.

The proposed approach is expected to be more generally robust to model misspecifica-

tion when assessing the hypothesis of no effect, if the model (5.4.4) belongs in a relatively

broad class. This expectation needs formal verification, but, if confirmed, can lead to more

efficient estimation, and, hence, more efficient use of resources.

An alternative to the proposed approach can be to break the matching and then use

regression-assisted (Donner et al., 2007) or doubly-robust estimators (Rosenblum and van der

Laan, 2010) to estimate the treatment effect. Based on the theory of Rubin (1978), the

matched design is still ignorable (and so the matching can be broken) if the variables that

were used to create the matching are still available and are included in the outcomes model.

In contrast, if these variables are not used in a model, then the matching design cannot be

ignored (namely, the matching cannot be broken), as this could generally lead to bias at

least in the expression of the uncertainty in inference. In this case, methods that explicitly

acknowledge the matching design are needed.

Chapter 6

Conclusions and Future Work

This thesis has developed statistical methods to advance the goal of individualized health: to intelligently use information to optimize each person’s health given their unique characteristics, circumstances, and preferences. In Part I, we have developed and demonstrated how nested partially-latent class models (npLCM) can be used to estimate population etiology and to better diagnose individuals. Our approach to the estimation of population etiologic fractions in a case-control design has been to formulate a hierarchical Bayesian model

that represents the case population as a mixture of different classes of patients. The con-

trol distribution provides essential evidence about the measurement error rates and about

dependence among the binary measurements about a lung infection. Efficient and easy-

to-implement Gibbs sampling algorithms are derived and implemented for realistic sample

sizes and dimension of measurements.

Our model has multiple advantages over the population attributable fraction method

(Bruzzi et al., 1985). In particular, it allows for multiple sources of measurements and

accounts for possible differential laboratory measurement errors. As measurements be-

come more abundant, this integrative approach could be helpful for assessing the value of

different data sources.

Several features of the PERCH study warrant future statistical research. First, the pneu-

monia case definition is not perfect. We can introduce one more latent variable to indicate

true disease status and use biomarkers to probabilistically assign each prospective case as

a control.

Second, besides the multivariate binary measurements on pathogen presence/absence,

the PERCH study also collected some continuous-scale measurements on pathogen quan-

tities in specimens. We can investigate differences in the density of pathogen for cases

and controls to determine its importance in etiology. Bayesian nonparametric density estimation and dictionary learning methods can be developed for this purpose to capture and compare the flexible density shapes. We also need to consider zero-inflated model extensions to accommodate the observation that most densities are zero-inflated, meaning that many cases/controls have an exact zero or a value below the limit of detection for certain pathogens, even if continuous measurement is the protocol.

Third, although multiple sites are involved in the PERCH study, the current research

focus is to infer site-specific etiologies. Clinical experience suggests that many sites share

the same major pathogen causes but with potential site-specific etiology and laboratory

testing characteristics. We can therefore extend our npLCM to have one more site level in

the hierarchical formulation.

Lastly, in the current formulation of the npLCM, cross-sectional data is used to inform

about an individual’s health state at a single time point. Extensions of the framework to

incorporate longitudinal measurements, time-varying latent health trajectories, and time-

varying treatment assignment probabilities can support clinicians to make decisions about

treating individual patients as new evidence is obtained over time.

In Part II, we have developed an inferential methodology to evaluate the individualized

interventions if applied to a population. The method accounted for both the matched-pair

cluster randomized (MPCR) design and the potential covariate imbalances between clusters

in a pair even after matching. The proposed class of covariate-calibrated estimators can

correct potential bias that may be present in the MPCR design while retaining the original

estimand of interest.

Finally, an important question not discussed in this thesis is the individualized treatment

selection. In clinical practice, patients usually have heterogeneous responses to a particular

treatment. Also, many health disorders (e.g. ADHD, HIV, cystic fibrosis, non-small cell

lung cancer) are of chronic nature, and usually require multiple treatment reconsiderations

or replacements over time. The overarching goal of individualized health is thus to provide

clinically meaningful improved health outcomes for patients by delivering the right drug, at

the right dose, and at the right time. We seek to develop reproducible, statistically efficient

trial designs and analytic methods to learn the individualized treatment rules.

A related but simpler question is to find the right subpopulation of patients for a particu-

lar treatment. Consider the data from a single-stage randomized trial that compares between

treatments. The investigators wish to assess if the treatment effect varies across subgroups

of individuals defined by covariates. However, a full answer to this problem is unfeasible

if the covariates have a large dimension. Recent methodology by Cai et al. (2011) has ad-

vanced this area by using a two-stage approach: first, using a working parametric model,

the covariates are reduced to a scalar summary; and second, a nonparametric regression is

used to estimate the treatment effect as a function of that summary. A problem with such

approaches is still that the number of possible working models is too large to be explored. We can approach the problem by first focusing attention on all possible models

that would ultimately stratify individuals to a finite number of strata, for example, small

enough to be of practical use to clinicians. Statistically, this allows one to search over all

possible functions of covariates that characterize the strata, or even over submodels from

that set.

In the special case where subpopulations that partition the whole population have been

prespecified using pretreatment covariates, Rosenblum et al. (2013) have developed optimal testing procedures to detect the treatment effect on the overall population, or on the

subpopulations jointly.

Novel study designs can also help with efficient estimation of subpopulation treatments.

For example, in adaptive clinical trials, when a person presents for randomization, we can

perform adaptive randomization based on the individual’s covariates, where the probability

of assignment to each of available therapies varies over time as a function of the current

estimates of the treatment effects in subgroups of people (Rosenblum and van der Laan,

2011). The adaptive design modifies the trial as it progresses: it can discard ineffective or harmful therapies early and find subpopulations of patients who can benefit from a particular therapy, and hence can reduce overall costs, improve patients’ adherence, and save time. For

example, Zhou et al. (2008) described Bayesian adaptive randomization trial designs to use

biomarker profiles for identifying effective targeted therapies in the biomarker-integrated

approaches of targeted therapy of lung cancer elimination (BATTLE) study, which is also

discussed in Berry et al. (2010) together with other adaptively designed studies like the

I-SPY 2 study (Barker et al., 2009).

Returning to the original question of selecting the right treatment for each patient, the

statistical challenges lie in at least three aspects. First, we have to learn individualized

treatment rules from the training data where the optimal treatments are unknown. For ex-

ample, using the data from a traditional clinical trial, we need to learn the rule or treatment

regime, d, that optimally assigns a treatment, among a set of possible treatments, to a pa-

tient as a function of her observed characteristics (X) hence individualizing treatments to

the patient. Qian and Murphy (2011) proposed to find the rule d∗ that optimizes the popula-

tion average response V(d) if a rule d is applied to the whole population. Qian and Murphy

(2011) then used regression models for the outcome for learning d∗. Zhao et al. (2012)

showed that finding the optimal rule d∗ is equivalent to a weighted classification problem

for treatments which motivated their outcome weighted learning method. Other methods

that combine biomarkers for treatment selection have also been developed, see, for exam-

ple, Gunter et al. (2007); Brinkley et al. (2010); Foster et al. (2011); Gunter et al. (2011);

Zhao et al. (2012); Zhang et al. (2012). Huang et al. (2012) also developed measures to

characterize biomarkers’ capacity to help with treatment selections.

The second challenge is to intelligently use predictors from a combination of diagnostic

tests, imaging, genetics, genomics, proteomics, etc., that may be high-dimensional with

potential high orders of interactions. The two-stage approach proposed by Cai et al. (2011)

is a useful tool to cope with high dimensionality. More work is still needed to integrate

data of different types.

The third challenge is to account for the longitudinal dependence of measurements

within a subject under sequential treatments, for example, in the sequential multiple as-

signment randomized trials (SMART, Lavori and Dawson (2000); Murphy (2005)). The

SMART randomizes a subject into treatments depending on the success or failure of previ-

ously randomized treatments on this individual, e.g., observed improvements in her health

outcome, side effects, burden, etc. For its ethical assignments of treatments, the SMART

has gradually been adopted in areas like mental health (Pelham Jr and Fabiano, 2008; Almi-

rall et al., 2012) and addiction research (Murphy et al., 2007; Strecher et al., 2008). Here

the statistical goal is to find a list of sequential decision rules or dynamic treatment regime

for assigning treatments based on a patient’s history. Thall et al. (2000) and Thall et al.

(2002) used likelihood based methods to model the conditional distribution of outcome

given past information in the sequential trial; Watkins (1989), Watkins and Dayan (1992),

Murphy (2005) and Robins (2004) used Q-learning (“Q” for “quality”) / A-learning (“A”

for “advantage”) methods to model all or part of the conditional mean outcome; other approaches based on weighting methods have also been proposed (Zhao et al., 2012; Zhang

et al., 2012).

It remains an open question how to design individualized treatment rules for survival outcomes, multiple outcomes of competing importance (e.g. drug efficacy and toxicity), and continuous dosing levels and timings. Future work is also needed in more complex settings involving incompleteness of outcome measurements or observational data.

APPENDICES

A1 Appendix to Chapter 3

A1.1 Full conditional distributions in Gibbs sampler

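In this section, we provide analytic forms of the full conditional distributions that are essential for the Gibbs sampling algorithm. We use a data augmentation scheme that introduces the latent lung state $I^L_i$ into the sampling chain, which gives the following full conditional distributions:

• $[I^L_i \mid \text{others}]$. If $M^{\mathrm{GS}}_i$ is available, $\Pr(I^L_i = j \mid \text{others}) = 1$ if $M^{\mathrm{GS}}_{ij} = 1$ and $M^{\mathrm{GS}}_{il} = 0$ for $l \neq j$; otherwise zero. If $M^{\mathrm{GS}}_i$ is missing, then, depending on whether $M^{\mathrm{SS}}_i$ is available, the full conditional is given as

$$\Pr(I^L_i = j \mid \text{others}) \propto \left(\theta^{\mathrm{BrS}}_j\right)^{M^{\mathrm{BrS}}_{ij}} \left(1 - \theta^{\mathrm{BrS}}_j\right)^{1 - M^{\mathrm{BrS}}_{ij}} \prod_{l \neq j} \left(\psi^{\mathrm{BrS}}_l\right)^{M^{\mathrm{BrS}}_{il}} \left(1 - \psi^{\mathrm{BrS}}_l\right)^{1 - M^{\mathrm{BrS}}_{il}} \cdot \left[\left(\theta^{\mathrm{SS}}_j\right)^{M^{\mathrm{SS}}_{ij}} \left(1 - \theta^{\mathrm{SS}}_j\right)^{1 - M^{\mathrm{SS}}_{ij}} \mathbf{1}_{\left\{\sum_{l \neq j} M^{\mathrm{SS}}_{il} = 0\right\}}\right]^{\mathbf{1}_{\{j \le J'\}}} \cdot \pi_j; \tag{A.1.1}$$

if SS measurement is not available for case $i$, we remove terms involving $M^{\mathrm{SS}}_{ij}$.

• $[\psi^{\mathrm{BrS}}_j \mid \text{others}] \sim \mathrm{Beta}\left(N_j + b_{1j},\; n_1 - \sum_{i: Y_i = 1} \mathbf{1}_{\{I^L_i = j\}} + n_0 - N_j + b_{2j}\right)$, where $n_1$ and $n_0$ are the numbers of cases and controls, respectively, and

$$N_j = \sum_{i: Y_i = 1,\, I^L_i \neq j} M^{\mathrm{BrS}}_{ij} + \sum_{i: Y_i = 0} M^{\mathrm{BrS}}_{ij}$$

is the number of positives at position $j$ for cases with $I^L_i \neq j$ and all controls.

• $[\theta^{\mathrm{BrS}}_j \mid \text{others}] \sim \mathrm{Beta}\left(S_j + c_{1j},\; \sum_{i: Y_i = 1} \mathbf{1}_{\{I^L_i = j\}} - S_j + c_{2j}\right)$, where $S_j = \sum_{i: Y_i = 1,\, I^L_i = j} M^{\mathrm{BrS}}_{ij}$ is the number of positives for cases with the $j$th pathogen as their cause.

• $[\theta^{\mathrm{SS}}_j \mid \text{others}] \sim \mathrm{Beta}\left(T_j + d_{1j},\; \sum_{i: Y_i = 1,\, \text{SS available}} \mathbf{1}_{\{I^L_i = j\}} - T_j + d_{2j}\right)$, where

$$T_j = \sum_{i: Y_i = 1,\, I^L_i = j,\, \text{SS available}} M^{\mathrm{SS}}_{ij}.$$

When no SS data is available, this conditional distribution reduces to $\mathrm{Beta}(d_{1j}, d_{2j})$, the prior.

• $[\pi \mid I^L_i, i : Y_i = 1] \sim \mathrm{Dirichlet}(a_1 + U_1, \ldots, a_J + U_J)$, where $U_j = \sum_{i: Y_i = 1} \mathbf{1}_{\{I^L_i = j\}}$.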
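The conjugate updates listed above can be sketched in Python as follows. This is only an illustrative sketch under simplifying assumptions, not the implementation used in the thesis: it covers the BrS measurements only (no GS or SS data), it takes the current class indicators $I^L_i$ as given, and it uses the same Beta hyperparameters for every pathogen. The function name `plcm_conjugate_updates` and the toy data are hypothetical.

```python
import numpy as np


def plcm_conjugate_updates(M_brs, y, cause, prior_psi, prior_theta, prior_pi, rng):
    """One sweep of the Beta/Dirichlet updates given current case classes.

    M_brs : (n, J) binary BrS measurements
    y     : (n,) 1 for cases, 0 for controls
    cause : (n,) current class I^L_i in 0..J-1 for cases (ignored for controls)
    """
    M_brs, y, cause = np.asarray(M_brs), np.asarray(y), np.asarray(cause)
    J = M_brs.shape[1]
    is_case = y == 1
    psi, theta = np.empty(J), np.empty(J)
    for j in range(J):
        caused = is_case & (cause == j)   # cases with I^L_i = j
        not_caused = ~caused              # controls and cases with I^L_i != j
        N_j = M_brs[not_caused, j].sum()  # false positives at position j
        S_j = M_brs[caused, j].sum()      # true positives at position j
        psi[j] = rng.beta(N_j + prior_psi[0], not_caused.sum() - N_j + prior_psi[1])
        theta[j] = rng.beta(S_j + prior_theta[0], caused.sum() - S_j + prior_theta[1])
    U = np.bincount(cause[is_case], minlength=J)  # cases per class
    pi = rng.dirichlet(np.asarray(prior_pi, dtype=float) + U)
    return psi, theta, pi


# Toy data: three cases (one per class) and three controls.
M = [[1, 0, 0], [0, 1, 1], [0, 0, 1], [0, 0, 0], [1, 0, 0], [0, 1, 0]]
y = [1, 1, 1, 0, 0, 0]
cause = [0, 1, 2, -1, -1, -1]  # -1 is a placeholder for controls
psi, theta, pi = plcm_conjugate_updates(
    M, y, cause, (1.0, 1.0), (1.0, 1.0), (1.0, 1.0, 1.0), np.random.default_rng(0)
)
```

In a full sampler these draws would alternate with the update of each $I^L_i$ from (A.1.1).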

A1.2 Pathogen names and their abbreviations

Bacteria: HINF-Haemophilus influenzae; PNEU-Streptococcus pneumoniae; SASP-Salmonella species; SAUR-Staphylococcus aureus.

Viruses: ADENOVIRUS-adenovirus; COR 43-coronavirus OC43; FLU C-influenza virus type C; HMPV A B-human metapneumovirus type A or B; PARA1-parainfluenza type 1 virus; RHINO-rhinovirus; RSV A B-respiratory syncytial virus type A or B.

A1.3 Additional simulation results

Figure A.1.1: Reduction in marginal diameter of 95% credible region as θ approaches 1. In each subfigure, each boxplot describes the variation of pathogen-specific marginal diameters of 95% credible regions across 100 simulated datasets. Each curve connects the mean values from boxplots across increasing true positive rates. “—”, “- - -”, and “· · ·” denote marginal diameters calculated from BrS+GS, BrS-only, and GS-only analyses, respectively; “·− ·−” corresponds to the prior. Rows of subfigures correspond to different fractions of gold-standard measurements available, 1% and 10%. The blue dashed lines are the same across rows for fair comparisons. They are obtained from simulated data sets with the same sets of random seeds.


A2.1 Posterior computations

This section details the full conditional distributions of unknown parameters and auxiliary variables as well as their sampling strategy in the Gibbs sampler. $[A \mid B]$ represents the conditional probability density or probability mass function for entity $A$ given the value of entity $B$; if $B$ is null, $[A]$ represents the marginal distribution of $A$. The superscript $(j)$ is reserved for class-specific quantities; the subclass index $k$ appears only in the subscript.

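1. Sample the class indicator $I^L_{i'}$ for cases $i' = 1, \ldots, n_1$, from a multinomial distribution with probabilities

$$\begin{aligned} P(I^L_{i'} = j \mid \cdots) = p^{(j)}_{i'} &\propto [M_{i'} \mid Z_{i'}, \Theta, \Psi, I^L_{i'} = j]\,[Z_{i'} \mid \eta^{(j)}, I^L_{i'} = j]\,[I^L_{i'} = j \mid \pi] \\ &\propto \left\{\theta^{(j)}_{Z_{i'}}\right\}^{M_{i'j}} \left\{1 - \theta^{(j)}_{Z_{i'}}\right\}^{1 - M_{i'j}} \prod_{l \neq j} \left\{\psi^{(l)}_{Z_{i'}}\right\}^{M_{i'l}} \left\{1 - \psi^{(l)}_{Z_{i'}}\right\}^{1 - M_{i'l}} \cdot \eta^{(j)}_{Z_{i'}} \cdot \pi_j, \end{aligned}$$

for $j = 1, \ldots, J$.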

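2. Sample subclass indicators $Z_{i'}$ for case $i' = 1, \ldots, n_1$, from a multinomial distribution with probabilities

$$\begin{aligned} P(Z_{i'} = k \mid \cdots) = q_{i'k} &\propto [M_{i'} \mid Z_{i'}, I^L_{i'}, \Theta, \Psi]\,[Z_{i'} \mid I^L_{i'}, \eta^{(I^L_{i'})}] \\ &\propto \eta^{(I^L_{i'})}_k \cdot \left\{\theta^{(I^L_{i'})}_k\right\}^{M_{i' I^L_{i'}}} \left\{1 - \theta^{(I^L_{i'})}_k\right\}^{1 - M_{i' I^L_{i'}}} \times \prod_{l \neq I^L_{i'}} \left\{\psi^{(l)}_k\right\}^{M_{i'l}} \left\{1 - \psi^{(l)}_k\right\}^{1 - M_{i'l}}. \end{aligned}$$

Sample subclass indicators $Z_i$ for control $i = n_1 + 1, \ldots, n_1 + n_0$, from a multinomial distribution with probabilities

$$\begin{aligned} P(Z_i = k \mid \cdots) = q_{ik} &\propto [M_i \mid Z_i = k, \Psi]\,[Z_i = k \mid \nu] \\ &\propto \nu_k \cdot \prod_{j=1}^{J} \left\{\psi^{(j)}_k\right\}^{M_{ij}} \left\{1 - \psi^{(j)}_k\right\}^{1 - M_{ij}}, \quad k = 1, \ldots, K. \end{aligned}$$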

3. Sample the case subclass weights $\eta^{(j)}$ for $j = 1, \ldots, J$ from

$$\mathrm{pr}(\eta^{(j)} \mid \cdots) \propto \prod_{i': I^L_{i'} = j} [Z_{i'} \mid \eta^{(j)}, I^L_{i'}]\,[\eta^{(j)} \mid \alpha],$$

which can be accomplished by first setting $u^{(j)*}_K = 1$ and sampling

$$u^{(j)*}_k \sim \mathrm{Beta}\left(1 + z'^{(j)}_k,\; \alpha + \sum_{l=k+1}^{K} z'^{(j)}_l\right), \quad k = 1, \ldots, K - 1,$$

where $z'^{(j)}_k$ is the number of cases assigned to subclass $k$ in class $j$. We write $z'^{(j)}_k = \#\{i' : Y_{i'} = 1, Z_{i'} = k, I^L_{i'} = j\}$, for $k = 1, \ldots, K$, where “$\#A$” counts the number of elements in set $A$. We then construct $\eta^{(j)}_1 = u^{(j)*}_1$, $\eta^{(j)}_k = u^{(j)*}_k \prod_{l=1}^{k-1} \left\{1 - u^{(j)*}_l\right\}$, $k = 2, \ldots, K$.

4. Sample the control subclass weights $\nu = (\nu_1, \ldots, \nu_K)^T$ from

$$\mathrm{pr}(\nu \mid \cdots) \propto \prod_{i: Y_i = 0} [Z_i \mid \nu] \cdot [\nu \mid \alpha],$$

which can be accomplished by first setting $v^*_K = 1$ and sampling

$$v^*_k \sim \mathrm{Beta}\left(1 + z_k,\; \alpha + \sum_{l=k+1}^{K} z_l\right), \quad k = 1, \ldots, K - 1,$$

where $z_k$ is the number of controls assigned to subclass $k$, and then constructing $\nu_1 = v^*_1$, $\nu_k = v^*_k \prod_{l=1}^{k-1} (1 - v^*_l)$, $k = 2, \ldots, K$.

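5. Sample the concentration parameter $\alpha$ for the stick-breaking prior from

$$\mathrm{pr}(\alpha \mid \cdots) \propto \prod_{j=1}^{J} [\eta^{(j)} \mid \alpha] \cdot [\nu \mid \alpha]\,[\alpha] \propto \alpha^{(K-1)(J+1)} \exp(-\alpha \cdot r) \cdot \mathrm{pr}(\alpha),$$

where $r = -\left\{\sum_{j=1}^{J} \sum_{k=1}^{K-1} \log\left(1 - u^{(j)*}_k\right) + \sum_{k=1}^{K-1} \log\left(1 - v^*_k\right)\right\}$. The exponent counts one factor of $\alpha$ for each of the $K - 1$ free $\mathrm{Beta}(1, \alpha)$ sticks in the $J$ case classes and in the controls. If the conditionally conjugate prior for $\alpha$ is used, i.e. $\alpha \sim \mathrm{Gamma}(a_\alpha, b_\alpha)$, then the full conditional distribution reduces to $\mathrm{Gamma}\left(a_\alpha + (K - 1)(J + 1),\; b_\alpha + r\right)$.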

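6. Sample the vector of subclass TPRs for $j = 1, \ldots, J$ from

$$\begin{aligned} \mathrm{pr}(\theta^{(j)} \mid \cdots) &\propto \prod_{i': I^L_{i'} = j} [M_{i'} \mid \theta^{(j)}, Z_{i'}, I^L_{i'}]\,[\theta^{(j)}] \\ &\propto \prod_{k=1}^{K} \left\{\theta^{(j)}_k\right\}^{m^{(j)}_{k1}} \left\{1 - \theta^{(j)}_k\right\}^{m^{(j)}_{k0}} \cdot [\theta^{(j)}], \end{aligned} \tag{A.2.1}$$

where $m^{(j)}_{kc} = \#\{i' : Y_{i'} = 1, Z_{i'} = k, I^L_{i'} = j, M_{i'j} = c\}$, $c = 0, 1$. If the priors for the TPRs are independent Beta distributions, then this is a product of Beta distributions.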

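7. Sample subclass-specific FPRs $\psi^{(j)}_k$ for $j = 1, \ldots, J$, $k = 1, \ldots, K$ from

$$\begin{aligned} \mathrm{pr}(\psi^{(j)}_k \mid \cdots) &\propto \prod_{i': Y_{i'} = 1,\, I^L_{i'} \neq j,\, Z_{i'} = k} [M_{i'j} \mid \psi^{(j)}, Z_{i'}, I^L_{i'}] \prod_{i: Y_i = 0,\, Z_i = k} [M_{ij} \mid \psi^{(j)}, Z_i] \cdot [\psi^{(j)}_k] \\ &\propto \left\{\psi^{(j)}_k\right\}^{s^{(-j)}_{k1}} \left\{1 - \psi^{(j)}_k\right\}^{s^{(-j)}_{k0}} \cdot \mathrm{pr}(\psi^{(j)}_k), \end{aligned}$$

where $s^{(-j)}_{kc} = \#\{i' : Y_{i'} = 1, Z_{i'} = k, I^L_{i'} \neq j, M_{i'j} = c\} + \#\{i : Y_i = 0, Z_i = k, M_{ij} = c\}$, for $c = 0, 1$. If the prior on the FPRs is $\mathrm{Beta}(a_1, b_1)$, then the above conditional distribution is $\mathrm{Beta}\left(a_1 + s^{(-j)}_{k1},\; b_1 + s^{(-j)}_{k0}\right)$.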

8. Sample $\pi$ from $\mathrm{Dirichlet}\left(d_1 + t^{(1)}, \ldots, d_J + t^{(J)}\right)$, where $t^{(j)}$ is the number of cases assigned to class $j$, i.e. $t^{(j)} = \#\{i' : Y_{i'} = 1, I^L_{i'} = j\}$, $j = 1, \ldots, J$.
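The stick-breaking updates in steps 3 and 4 share one pattern: draw $K - 1$ Beta variables from the subclass counts, fix the last stick at 1, and convert sticks to weights. A minimal Python sketch of that pattern follows, assuming the counts and $\alpha$ are already available; the function name `sample_subclass_weights` and the example counts are hypothetical.

```python
import numpy as np


def sample_subclass_weights(counts, alpha, rng):
    """Draw truncated stick-breaking weights given subclass occupancy counts.

    counts[k] is the number of subjects currently assigned to subclass k.
    """
    counts = np.asarray(counts, dtype=float)
    K = counts.size
    # tail[k] = sum of counts over subclasses l > k
    suffix = np.cumsum(counts[::-1])[::-1]
    tail = np.concatenate((suffix[1:], [0.0]))
    sticks = np.ones(K)  # the last stick is fixed at 1 (truncation)
    sticks[:-1] = rng.beta(1.0 + counts[:-1], alpha + tail[:-1])
    # weight_k = stick_k * prod_{l < k} (1 - stick_l)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - sticks[:-1])))
    return sticks * remaining


weights = sample_subclass_weights([5, 3, 0, 2], alpha=1.0, rng=np.random.default_rng(1))
```

Because the last stick equals 1, the weights sum to one by construction. The same function would serve for the control weights $\nu$ (with control counts) and for each $\eta^{(j)}$ (with the counts of cases in class $j$).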

A2.2 Full pathogen names with abbreviations

Bacteria: 1. HINF-Haemophilus influenzae; 2. SASP-Salmonella species; 3. SAUR-Staphylococcus aureus.

Viruses: 4. ADENOVIRUS-adenovirus; 5. COR 43-coronavirus OC43; 6. FLU C-influenza virus type C; 7. HMPV A B-human metapneumovirus type A or B; 8. PARA1-parainfluenza type 1 virus; 9. RHINO-rhinovirus; 10. RSV A B-respiratory syncytial virus type A or B.

A2.3 Stick-breaking prior

The stick-breaking mixture model has a countably infinite number of subclasses. However, because the $\nu_k$ and $\eta^{(j)}_k$ decrease exponentially quickly, a priori, we expect that only a small number of subclasses will be used to model the data. The expected number of subclasses used is logarithmic in the number of observations (Hjort et al., 2010). This is different from a finite mixture model, which uses a fixed number of clusters to model the data. In the stick-breaking mixture model, the actual number of clusters used to model the data is not fixed, and can be automatically inferred from the data using the usual Bayesian posterior inference framework (Neal, 2000).

Equations (4.2.12)-(4.2.16) place exchangeable prior weight on the subclasses. Following Ishwaran and James (2002), in our computations, we truncate the infinite sum to the first $K$ terms, with $K$ sufficiently large to balance computing speed and approximating performance of the model. In our simulations and data application, $K = 10$ is usually deemed adequate. Most subclass measurement profiles are not occupied in either the simulations or the data application, so that a smaller $K$, e.g. 3, is usually sufficient for approximation. Also, by choosing hyperpriors for the stick-breaking parameter $\alpha_0$ as in (4.2.16), we can let the data inform us about the desired sparsity level for approximating the probability contingency tables for the controls and each disease class. A small value of the estimate of $\alpha_0$ suggests that only a small number of subclasses are necessary for the controls (cases).
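To make the truncation argument concrete, the sketch below computes prior expectations under the assumption that each stick is drawn independently from $\mathrm{Beta}(1, \alpha)$, as the Beta updates in A2.1 suggest; equations (4.2.12)-(4.2.16) are not reproduced here, so this form is an assumption. The function names are hypothetical.

```python
def expected_stick_weights(alpha, n_sticks):
    """Prior mean of the first n_sticks weights when each stick is Beta(1, alpha)."""
    shrink = alpha / (1.0 + alpha)  # E[1 - u_k]
    # E[weight_k] = E[u_k] * prod_{l < k} E[1 - u_l]
    return [(1.0 - shrink) * shrink ** k for k in range(n_sticks)]


def expected_leftover_mass(alpha, n_sticks):
    """Prior mean of the probability mass not covered by the first n_sticks weights."""
    return (alpha / (1.0 + alpha)) ** n_sticks


leftover_at_10 = expected_leftover_mass(alpha=1.0, n_sticks=10)
mean_weights = expected_stick_weights(alpha=1.0, n_sticks=10)
```

Under this assumption the expected weights decay geometrically at rate $\alpha / (1 + \alpha)$, so a small $\alpha$ leaves very little expected mass beyond the first few subclasses, which is one way to judge whether a chosen $K$ is large enough.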

A2.4 Mean and covariance structure

The marginal means of the observations take the form

$$\Pr(M_{ij} = 1 \mid Y_i = 1) = \pi_j \sum_{k=1}^{K} \theta^{(j)}_k \eta^{(j)}_k + \sum_{s \neq j} \pi_s \sum_{k=1}^{K} \psi^{(j)}_k \eta^{(s)}_k, \tag{A.2.2}$$

$$\Pr(M_{ij} = 1 \mid Y_i = 0) = \sum_{k=1}^{K} \psi^{(j)}_k \nu_k. \tag{A.2.3}$$

Equations (A.2.2) and (A.2.3) indicate that the observed rate of pathogen $j$ among cases is a mixture of two components: cases whose disease is caused by pathogen $j$, for which the observation is a true positive event, and those whose disease is caused by another pathogen, for which the observation is a false positive. The case and control mean rates for pathogen $j$ observations are equal when either of the following two interesting situations occurs:

(I) $\psi^{(j)}_1 = \cdots = \psi^{(j)}_K = \psi^{(j)}$ and $\sum_{k=1}^{K} \theta^{(j)}_k \eta^{(j)}_k = \psi^{(j)}$.

Condition (I) says that a measurement of pathogen $j$ is independent of measurements of the other pathogens among the controls, and, within the $j$th disease class, the rate of positive pathogen $j$ equals the control rate.

(II) $\eta^{(s)} = \nu$, $s \neq j$, and $\sum_{k=1}^{K} \left[\theta^{(j)}_k \eta^{(j)}_k - \psi^{(j)}_k \nu_k\right] = 0$.

Condition (II) implies that, for a disease class $s \neq j$, the pairwise associations between pathogen $j$ and the other pathogens are equal between cases and controls.
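The marginal rates (A.2.2) and (A.2.3) are simple matrix products, which the following Python sketch evaluates and then checks numerically against Condition (I). The array layout (rows indexed by pathogen or class, columns by subclass), the function name `marginal_positive_rates`, and all parameter values are assumptions made for illustration.

```python
import numpy as np


def marginal_positive_rates(pi, theta, psi, eta, nu):
    """Case and control marginal positive rates for every pathogen.

    theta[j, k], psi[j, k] : subclass-specific TPR and FPR for pathogen j
    eta[s, k]              : subclass weights within disease class s
    nu[k]                  : subclass weights among controls
    """
    pi, theta, psi, eta, nu = map(np.asarray, (pi, theta, psi, eta, nu))
    tpr_part = pi * np.sum(theta * eta, axis=1)        # pi_j * sum_k theta_k^(j) eta_k^(j)
    fpr_by_class = psi @ eta.T                         # [j, s] = sum_k psi_k^(j) eta_k^(s)
    off_diag = fpr_by_class * (1.0 - np.eye(len(pi)))  # keep only s != j
    case_rate = tpr_part + off_diag @ pi
    control_rate = psi @ nu
    return case_rate, control_rate


# Pathogen 0 is set up to satisfy Condition (I): constant FPR of 0.2 and
# sum_k theta_k eta_k = 0.1 * 0.5 + 0.3 * 0.5 = 0.2.
pi = [0.7, 0.3]
theta = [[0.1, 0.3], [0.9, 0.8]]
psi = [[0.2, 0.2], [0.1, 0.3]]
eta = [[0.5, 0.5], [0.6, 0.4]]
nu = [0.25, 0.75]
case_rate, control_rate = marginal_positive_rates(pi, theta, psi, eta, nu)
```

For pathogen 0 the case and control rates coincide, as Condition (I) predicts, while pathogen 1 does not satisfy the condition and its two rates differ.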

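The marginal pairwise log odds ratio for pathogen pair $(j, l)$ among cases is given by

$$\begin{aligned} \omega_{jl} &= \log \frac{\Pr(M_{ij} = 1, M_{il} = 1)\,\Pr(M_{ij} = 0, M_{il} = 0)}{\Pr(M_{ij} = 1, M_{il} = 0)\,\Pr(M_{ij} = 0, M_{il} = 1)} \\ &= \log\left(\sum_{c=1}^{J} \pi_c \left[\sum_{h=1}^{K} \left\{\theta^{(j)}_h\right\}^{\mathbf{1}_{c=j}} \left\{\psi^{(j)}_h\right\}^{\mathbf{1}_{c \neq j}} \left\{\theta^{(l)}_h\right\}^{\mathbf{1}_{c=l}} \left\{\psi^{(l)}_h\right\}^{\mathbf{1}_{c \neq l}} \eta^{(c)}_h\right]\right) \\ &\quad + \log\left(\sum_{c=1}^{J} \pi_c \left[\sum_{h=1}^{K} \left\{1 - \theta^{(j)}_h\right\}^{\mathbf{1}_{c=j}} \left\{1 - \psi^{(j)}_h\right\}^{\mathbf{1}_{c \neq j}} \left\{1 - \theta^{(l)}_h\right\}^{\mathbf{1}_{c=l}} \left\{1 - \psi^{(l)}_h\right\}^{\mathbf{1}_{c \neq l}} \eta^{(c)}_h\right]\right) \\ &\quad - \log\left(\sum_{c=1}^{J} \pi_c \left[\sum_{h=1}^{K} \left\{\theta^{(j)}_h\right\}^{\mathbf{1}_{c=j}} \left\{\psi^{(j)}_h\right\}^{\mathbf{1}_{c \neq j}} \left\{1 - \theta^{(l)}_h\right\}^{\mathbf{1}_{c=l}} \left\{1 - \psi^{(l)}_h\right\}^{\mathbf{1}_{c \neq l}} \eta^{(c)}_h\right]\right) \\ &\quad - \log\left(\sum_{c=1}^{J} \pi_c \left[\sum_{h=1}^{K} \left\{1 - \theta^{(j)}_h\right\}^{\mathbf{1}_{c=j}} \left\{1 - \psi^{(j)}_h\right\}^{\mathbf{1}_{c \neq j}} \left\{\theta^{(l)}_h\right\}^{\mathbf{1}_{c=l}} \left\{\psi^{(l)}_h\right\}^{\mathbf{1}_{c \neq l}} \eta^{(c)}_h\right]\right). \end{aligned} \tag{A.2.4}$$

To illustrate the meaning of the above formula, suppose nearly all of pneumonia is caused by pathogen $j$ so that $\pi_j \approx 1$. If $\theta^{(j)}_k$, $k = 1, \ldots, K$ are equal, then we have approximate marginal independence between measurements on the $j$th pathogen and others. If $K = 2$ and $\theta^{(j)}_k$, $k = 1, 2$ are very different, e.g. 1 versus 0 as an extreme example, $\omega_{jl} = \mathrm{logit}(\psi^{(l)}_1) - \mathrm{logit}(\psi^{(l)}_2)$, which is completely determined by the variation in subclass FPRs for the $l$th pathogen.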
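The extreme example above can be checked numerically. The Python sketch below computes the four joint probabilities among cases directly from the model (class $c$ uses the TPR for its own pathogen and FPRs for the others, mixed over subclasses with weights $\eta^{(c)}$) and then forms the log odds ratio. The function name `case_log_odds_ratio`, the array layout, and the parameter values are illustrative assumptions.

```python
import numpy as np


def case_log_odds_ratio(j, l, pi, theta, psi, eta):
    """Marginal log odds ratio between pathogens j and l among cases."""
    pi, theta, psi, eta = map(np.asarray, (pi, theta, psi, eta))

    def joint(mj, ml):
        total = 0.0
        for c in range(len(pi)):
            # Subclass-specific Pr(M = 1 | class c): TPR if c is the cause, else FPR.
            pj = theta[j] if c == j else psi[j]
            pl = theta[l] if c == l else psi[l]
            fj = pj if mj == 1 else 1.0 - pj
            fl = pl if ml == 1 else 1.0 - pl
            total += pi[c] * np.sum(fj * fl * eta[c])
        return total

    return (np.log(joint(1, 1)) + np.log(joint(0, 0))
            - np.log(joint(1, 0)) - np.log(joint(0, 1)))


# All cases caused by pathogen 0, with K = 2 and TPRs of 1 and 0.
pi = [1.0, 0.0]
theta = np.array([[1.0, 0.0], [0.9, 0.8]])
psi = np.array([[0.2, 0.1], [0.3, 0.6]])
eta = np.array([[0.4, 0.6], [0.5, 0.5]])
omega = case_log_odds_ratio(0, 1, pi, theta, psi, eta)
expected = np.log(0.3 / 0.7) - np.log(0.6 / 0.4)  # logit(psi_1^(l)) - logit(psi_2^(l))
```

With these values `omega` agrees with the difference of logits of the two subclass FPRs for pathogen $l$, and the subclass weights cancel out of the ratio.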


A3.1 Proof of Result 1

We show that the MLE of $\delta^{\mathrm{effect}}$ based on the standard meta-analytic likelihood (5.3.1) is generally inconsistent. To do this, consider the simple but informative case of a population of pairs of practices as shown in Figure A.3.2, where $\mu$ follows the positive half of the standard normal distribution across such pairs. Because $\delta^{\mathrm{crude}}_p$ is $\mu$ or $-\mu$ with probabilities $(\tfrac{1}{2}, \tfrac{1}{2})$, marginally the normality of the distribution of $\delta^{\mathrm{crude}}_p$ at the second level of (5.3.1) holds with $\delta^{\mathrm{effect}}$ $(= E(\delta^{\mathrm{crude}}_p)) = 0$ and with $\mathrm{var}(\delta^{\mathrm{crude}}_p) = 1$. Consider also, for simplicity, that $\mathrm{var}(\delta^{\mathrm{crude}}_p)$ is known, and that within clinical practices, the number of patients sampled is a constant $n$ and the variances $\sigma^2_{p,c}(t)$ are known and are as given in Figure A.3.2. Then, the maximizer $\hat{\delta}^{\mathrm{effect}}$ of the likelihood in (5.3.1) is $\sum_p u_p \delta^{\mathrm{crude}}_p / \sum_p u_p$, where $(u_p)^{-1} = \mathrm{var}(\delta^{\mathrm{crude}}_p) + v^{\mathrm{crude}}_p$, and

$$v^{\mathrm{crude}}_p = \begin{cases} w_1 = \dfrac{2}{n}, & \text{if the practice } p \text{ is of } \mathrm{type}_p = 1; \\[1ex] w_2 = \dfrac{1 + \sigma^2}{n}, & \text{if the practice } p \text{ is of } \mathrm{type}_p = 2. \end{cases}$$

The probability limit of $\hat{\delta}^{\mathrm{effect}}$ is $E(u_p \delta^{\mathrm{crude}}_p) / E(u_p)$, and its sign will be the sign of $E(u_p \delta^{\mathrm{crude}}_p)$. Here, although $E(\delta^{\mathrm{crude}}_p) = 0$, Condition 2 fails because the sign of $\delta^{\mathrm{crude}}_p$ depends on the magnitude of the variance $v^{\mathrm{crude}}_p$. In particular,

$$E(u_p \delta^{\mathrm{crude}}_p) = E\{E(u_p \delta^{\mathrm{crude}}_p \mid \mathrm{type}_p)\} = \frac{\mu}{2}\left[\left\{\mathrm{var}(\delta^{\mathrm{crude}}_p) + w_2\right\}^{-1} - \left\{\mathrm{var}(\delta^{\mathrm{crude}}_p) + w_1\right\}^{-1}\right],$$

which is nonzero if $\sigma^2 \neq 1$. This means that even if the null hypothesis of no intervention effect on the means is correct, the standard meta-analytic approach (5.3.1) is inappropriate if the intervention has an effect on the variance in at least one clinical practice.
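A small numerical illustration of this argument is sketched below in Python. It treats $\mu$ as a fixed value, assumes the two observed pair types occur with probability 1/2 each, and assigns the difference $+\mu$ to type 2 and $-\mu$ to type 1, which is the sign convention implied by the expression for $E(u_p \delta^{\mathrm{crude}}_p)$ above; the function name `weighted_mle_limit` and the inputs are hypothetical.

```python
def weighted_mle_limit(mu, n, sigma2, var_delta=1.0):
    """Ratio E(u_p * delta_p) / E(u_p) for the two-type example.

    Assumes type 1 and type 2 pairs each occur with probability 1/2,
    with differences -mu and +mu respectively (sketch assumption).
    """
    w1 = 2.0 / n                # within-pair variance for type 1
    w2 = (1.0 + sigma2) / n     # within-pair variance for type 2
    u1 = 1.0 / (var_delta + w1)
    u2 = 1.0 / (var_delta + w2)
    numerator = 0.5 * mu * (u2 - u1)  # E(u_p * delta_p)
    denominator = 0.5 * (u1 + u2)     # E(u_p)
    return numerator / denominator


limit_equal_variance = weighted_mle_limit(mu=1.0, n=50, sigma2=1.0)
limit_unequal_variance = weighted_mle_limit(mu=1.0, n=50, sigma2=4.0)
```

With $\sigma^2 = 1$ the two weights coincide and the ratio is zero; with $\sigma^2 \neq 1$ the weights differ and the ratio moves away from the null value, in line with the argument above.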

A3.2 Proof of Result (4)

[Note: the reason that we want to prove the following results is that we find the sample sizes $(n_{p,1}, n_{p,2})$ are potentially not exchangeably distributed across the 7 pairs. It is thus desirable to show that 1st-level inference is still valid, e.g. (4) still holds, under an even weaker conditional exchangeability.

CONDITION 1’. The potential outcomes under treatments 1 and 2 in clinical practice $c$, and the number of patients served by clinical practice $c$, are exchangeable (in distribution over pairs) between clinical practices $c = 1$ and $c = 2$, i.e.,

Figure A.3.2: Structure for the example used in the proof of Result 1 (Appendix 1). Shown is one true type of pair and the two types of observed pairs to which it can give rise, depending on which clinical practice is assigned control. In each set of parentheses are shown the mean and variance of the potential outcomes of patients of the corresponding clinical practice and under a given treatment, as denoted in Figure 5.1.

where the arrows connect equal entries in arguments, and distribution pr is over pairs p in

the larger population P of pairs. ]

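We show that under Condition 1’, we have the following properties that facilitate 1st-level only inference. More specifically, we are to show: (i) $E\{\mu_{p,1}(t)\} = E\{\mu_{p,2}(t)\}$ for $t = 1, 2$; and (ii) $E(\delta^{\mathrm{crude}}_p) = E\{\mu_p(t = 1)\} - E\{\mu_p(t = 2)\}$. For equalities (i), we only show the validity under $t = 1$. Denote the conditioning event $\{N_{p,c}, c = 1, 2\}$ as $N$. We also use $E_A$ to denote the expectation with respect to the distribution of $A$.

$$\begin{aligned} E_p\{\mu_{p,1}(1)\} &= E_N\left[E_{p \mid N}\{\mu_{p,1}(1) \mid N\}\right] \\ &= E_N\left[\int \left\{\int a\, \mathrm{pr}(\mu_{p,1}(1) = a, \mu_{p,1}(2) = b, \mu_{p,2}(1) = s, \mu_{p,2}(2) = t \mid N)\, db\, ds\, dt\right\} da\right] \\ &= E_N\left[\int \left\{\int a \times \mathrm{pr}(\mu_{p,1}(1) = s, \mu_{p,1}(2) = t, \mu_{p,2}(1) = a, \mu_{p,2}(2) = b \mid N)\, db\, ds\, dt\right\} da\right] \quad \text{(Condition 1’: conditional exchangeability)} \\ &= E_N\left[E_{p \mid N}\{\mu_{p,2}(1) \mid N\}\right] \\ &= E\{\mu_{p,2}(1)\}. \end{aligned}$$

To prove (ii), we only need to show that $E\{\mu_{p,c=t}(t)\} = E\{\mu_p(t)\}$ for $t = 1, 2$. For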

A3.3 Proof of part (a) in Result 2

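We are to show that $E\{\mu^{\mathrm{calibr}}_{p,c}\} = E\{\mu_p(c)\}$ for $c = 1, 2$. We only prove that it holds for $c = 1$. Denote conditioning events $\{G_{p,c}(\cdot), N_{p,c}, c = 1, 2\}$ as $G \cap N$. We have

$$\begin{aligned} E_p\{\mu^{\mathrm{calibr}}_{p,1}\} &= E_p\left\{\int \mu_{p,1}(x; 1)\, dG_p(x)\right\} \\ &= E_{G \cap N}\left[E_{p \mid G \cap N}\left\{\int_x \mu_{p,1}(x; 1)\, dG_p(x) \,\Big|\, G \cap N\right\}\right] \\ &= E_{G \cap N}\left[\int_x E_{p \mid G \cap N}\{\mu_{p,1}(x; 1) \mid G \cap N\}\, dG_p(x)\right], \end{aligned} \tag{A.3.3}$$

where the integral indicated by $\int_x$ can be further expanded using the definition of $G_p(\cdot)$,

$$\int_x \pi_{p,1} E_{p \mid G \cap N}\{\mu_{p,1}(x; 1) \mid G \cap N\}\, dG_{p,1}(x) + \int_x \pi_{p,2} E_{p \mid G \cap N}\{\mu_{p,\underline{1}}(x; 1) \mid G \cap N\}\, dG_{p,2}(x).$$

The exchangeability Condition 3 (exchangeability conditional on $G \cap N$ is sufficient) now implies that we may change the underlined 1 to the value 2 with the value of the expression being the same. Hence, after changing the order of $\int_x$ and $E_{p \mid G \cap N}$ and marginalizing over covariates $x$ in both terms, we further simplify it to $\pi_{p,1} E_{p \mid G \cap N}\{\mu_{p,1}(1) \mid G \cap N\} + \pi_{p,2} E_{p \mid G \cap N}\{\mu_{p,2}(1) \mid G \cap N\}$. Plugging it into (A.3.3) and recalling the definition of $\mu_p(1)$ in equation (5.2.1), we have that

$$\text{(A.3.3)} = E_{G \cap N} E_{p \mid G \cap N}\{\mu_p(1)\} = E_p\{\mu_p(1)\}.$$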


CURRICULUM VITAE

ZHENKE WU

[email protected]

615 N. Wolfe St. E3136

Baltimore, MD 21205

http://www.biostat.jhsph.edu/~zhwu

Date of Birth: Apr 29th, 1988

Place of Birth: Chun’an, Zhejiang, China

EDUCATION

2009 - 2014 Johns Hopkins Bloomberg School of Public Health, Baltimore, MD

Ph.D. in Biostatistics

Thesis title: Statistical Methods for Individualized Health: Etiology,

Diagnosis, and Intervention Evaluation

Advisor: Prof. Scott Zeger

2009 Fudan University, Shanghai, China

B.Sc. in Mathematics


PROFESSIONAL EXPERIENCE

2013 - present External Statistical Advisor

Child Health Research Foundation (CHRF), Dhaka, Bangladesh, and

National Center for Immunization and Respiratory Diseases (NCIRD),

The U.S. CDC

2010 - present Research Assistant/Statistician

International Vaccine Access Center (IVAC), Johns Hopkins

Bloomberg School of Public Health

Advisor: Prof. Scott Zeger; Principal Investigator: Prof. Katherine

O’Brien

2008 Research Scholar

California NanoSystems Institute, and Department of Mechanical and

Aerospace Engineering, University of California, Los Angeles

2007 - 2009 Research Scholar

Center for Computational Systems Biology, Fudan University,

Shanghai, China


HONORS AND AWARDS

JOHNS HOPKINS UNIVERSITY

2014 First Place: Biostatistics Section of the Delta Omega Poster Competition

2013 Joseph Zeger Conference Travel Award

2012 June B. Culley Award, for outstanding achievement on school-wide oral

exam paper

2011-14 Hopkins Sommer Scholar

2009-14 Department of Biostatistics Graduate Fellowship

FUDAN UNIVERSITY

2009 B.Sc. with First Class Honors

2007-09 Chun-Tsung Scholar, Chinese Undergraduate Research Endowment

(CURE) Scholarship

2008 First Class National Scholarship, Ministry of Education, China

2007 Excellent Undergraduate Student, Government of Shanghai

2006-07 First Class People’s Scholarship

2006 First Class Shi Dai Scholarship


PUBLICATIONS

PUBLISHED/SUBMITTED

Wu Z, Frangakis CE, Louis TA, Scharfstein DO (2014). Estimating Treatment Effects in

Cluster Randomized Trials by Calibrating Covariate Imbalances between Clusters.

Biometrics. doi: 10.1111/biom.12214.

Georgiades C, Geschwind J-F, Neil H, Hines-Peralta A, Liapi E, Hong K, Wu Z, Kamel I,

Frangakis CE (2012). Lack of response after initial chemoembolization for hepatocellular

carcinoma: Does it predict failure of subsequent treatment? Radiology 265:115-123.

Wu Z, Deloria-Knoll M, Hammitt LL, and Zeger SL (2014). Partially Latent Class Models

(pLCM) for Case-Control Studies of Childhood Pneumonia Etiology.

(http://biostats.bepress.com/jhubiostat/paper267/)

Frangakis CE, Qian T, Wu Z, Diaz I (2014). Deductive Derivation and

Computerization of Compatible Semiparametric Efficient Estimation. Revision Invited for Biometrics.

(http://biostats.bepress.com/ucbbiostat/paper324/).

WORKING PAPERS

Wu Z, Zeger SL. Nested Partially-Latent Class Models (npLCM) for Estimating Disease

Etiology in Case-Control Studies.

Wu Z, Zeger SL. Partial Latent Class Model in Regression Analysis.


PRESENTATIONS (∗upcoming)

2014 Nested Partially Latent Class Models (npLCM) for Case-Control Studies of

Childhood Pneumonia Etiology. Pneumonia Etiology Research for Child

Health (PERCH) Executive Committee Meeting. December 2, London,

England.∗

2014 Nested Partially Latent Class Models (npLCM) for Case-Control Studies of

Childhood Pneumonia Etiology. Joint Statistical Meetings. August 7, Boston,

MA. (Topic contributed)

2014 Estimating Treatment Effects in Cluster Randomized Trials by Calibrating

Covariate Imbalances between Clusters. Eastern North American Regional

meeting of the International Biometric Society. March 18, Baltimore, MD.

(Topic contributed)

2013 Estimating Infectious Etiology from Hierarchical Dirichlet Process Perspective.

Pneumonia Etiology Research for Child Health (PERCH) Executive Committee

Meeting. December 2, London, England.

2013 Partially Latent Class Models (pLCM) for Case-Control Studies of Childhood

Pneumonia Etiology. US Centers for Disease Control and Child Health

Research Foundation: Aetiology of Neonatal Infection in South Asia (ANISA)

Project Committee Meeting. November 10, San Diego, CA.


2013 Estimating Treatment Effects in Cluster Randomized Trials by Calibrating

Covariate Imbalances between Clusters. Joint Statistical Meetings. August 4,

Montreal, QC, Canada. (Topic contributed)

2013 Hierarchical Bayesian Model for Combining Information from Multiple

Biological Samples with Measurement Errors: An Application to Children Pneumonia

Etiology Study. Eastern North American Regional meeting of the International

Biometric Society. March 12, Orlando, FL. (Topic contributed)

2012 Revealing and Addressing Existing Basic Inadequacies in the Use of Paired

Cluster Randomized Trials. Department of Biostatistics. Johns Hopkins

Biostatistics Causal Inference Working Group. December 6, Baltimore, MD.

TEACHING

GUEST LECTURER

2012 A unified framework for high-dimensional analysis of M-estimators with

decomposable regularizers. Advanced Special Topics, 140.840: Large-scale

Inference, Prof. Han Liu

TEACHING ASSISTANT

2014 Multilevel Statistical Models, Graduate, 140.656.

2014 Analysis of Longitudinal Data, Graduate, 140.655.


2013 Biostatistics in Public Health, Undergraduate, 280.346, advanced. Prof.

Scott Zeger.

2013 Case-based Introduction to Biostatistics, www.coursera.org, Prof. Scott

Zeger.

2013 Bayesian Methods I-II, Graduate, 140.762-763, Prof. Gary Rosner.

2012 Biostatistics in Public Health, Undergraduate, 280.346, advanced. Prof.

Scott Zeger

2011-12 Advanced Probability Theory I-II, Graduate, 550.620 - 621, Prof. James

Fill.

2010-11 Essentials of Probability and Statistical Inference I-IV, Graduate, 140.646-

649. Profs. Michael Rosenblum and Charles Rohde.

PROFESSIONAL ACTIVITIES

Co-Organizer Hopkins Biostatistics Student Journal Club, 2012-2013

Committee and treasurer Chinese Public Health Forum (CPHF) at Johns Hopkins,

2010-present

Volunteer ENAR Spring Meeting, Washington, DC, 2012

Representative and panelist Department of Biostatistics Student Recruitment

Committee, 2010-2012

Member Hopkins inHealth (HiH) Learning Methodologies

Working Group


JHSPH Causal Inference Working Group

Survival, Longitudinal, and Multilevel Modeling

(SLAM) Working Group

American Statistical Association (ASA), International

Chinese Statistical Association (ICSA), International

Biometric Society (ENAR), Institute of Mathematical

Statistics (IMS), American Public Health Association

(APHA)

Reviewer Journal of Business and Economic Statistics, Annals of

Statistics, Ophthalmic Epidemiology, International

Conference on Artificial Intelligence and Statistics (AISTATS),

Statistical Science
