missing covariate data in matched case-control studies: do...
TRANSCRIPT
Missing covariate data in matchedcase-control studies: Do the usual
paradigms apply?
Bryan Langholz
USC Department of Preventive Medicine
Joint work withMulugeta Gebregziabher
Larry Goldstein
Mark Huberman
1
CONTEXT AND SCOPE
• Missing covariate data in nested case-control studies.
• Epidemiologic individually matched case-control data.
• 1:1 (single case-single control).
• Single covariate that has missing values.
• Missing indicator methods.
2
MISSING INDICATOR METHOD IN THELITERATURE
Reference Context CommentJones (1986) JASA 91:222-230.
Least squares regression “The missing-indicator methodsshow unacceptably large biases inpractical situations and are notadvisable in general.”
Vach and Blettner (1991) AJE134:895-907.
Unmatched case-control studies(logistic)
“Making use of data on subjectswho are neglected in completecase analysis by creating an ad-ditional category always results inbiased estimation.”
Greenland and Finkle (1995)AJE 142:1255-1264.
Unmatched case-control studies(logistic)
“The authors recommend thatepidemiologists avoid using themissing-indicator method...”
Li, et al. (2004) AJE 159:603-10.
Individually matched case-control
“...the missing-indicator methodshould be used cautiously.”
3
THREADS
• THREAD I: Some background: Missing data inmatched pairs.
• THREAD II: Some theory: Missing data inducedintensity and models.
• THREAD IV: Missing indicator methods: A bad rap?
• THREAD III: Is MCAR/MAR/NI classification “asquare peg in a round hole?”
4
THREAD I: SOME BACKGROUND
• Motivation: Childhood leukemia study.
• Old work: “Retain complete pairs” method formatched pairs.
• Data analysis.
5
CASE-CONTROL STUDY OF MAGNETICFIELDS AND CHILDHOOD LEUKEMIA
• Individually matched population based case-control studyin Los Angeles County.
• Cases identified retrospectively from Los Angeles SEERRegistry.
• Controls individually age matched with friends or randomdigit dial.
• 232 Matched pairs.
• Meters used to collect 24-hour magnetic field profile inchild’s bedroom.
• Spot measurements.
0London et al (1991) AJE 134:923-937
6
LOS ANGELES CHILDHOOD LEUKEMIASTUDY
CompleteMissing matched
Information Cases Controls subjects pairs
Interview data 232 232 0% 232
24-hour magnetic fieldmeasurements
164 144 34% 108
Spot magnetic fieldmeasurement in child’sbedroom
140 109 46% 71
7
REASONS FOR MISSING DATA
• Most because the family had moved “relevant” homeso it was not possible to measure fields.
• Instrument problems.
• Some refusal.
8
COMPLETE CASE ANALYSIS
• Drop subjects with missing data.
• Analyze as matched case-control study.
Note that:
• Pair is dropped if either case or control is missing.
Large loss of data.
9
BREAK THE MATCHING
• Drop subjects with missing data.
• Break the matching.
• Analyze as unmatched case-control study, control forconfounding by matching factors by modelling.
Note that:
• Can model age but
• Quantification of “friend” or “telephone number”matching not possible.
Potential residual confounding.
10
SOME PAST WORK: A BETTER METHOD
Compromise between complete case and breakthe matching:
• Retain matching for those pairs in which bothsubjects have covariate data.
• Break matching for subjects with data in pairs with“missing data partner.”
Avoids large loss of data (complete case analysis)and minimizes potential uncontrolledconfounding (break the matching).
11
ANALYSIS
• Matched pairs part contributes conditional logisticlikelihood part.
• Unmatched part contributes unconditional logisticlikelihood part (perhaps control for confounding bymodelling matching variables).
Estimation based on product of conditional andunconditional logistic likelihood parts.
12
SIMULATION STUDY PARAMETERS
• Three age groups - 50%, 40%, 10% youngest tooldest.
• Exposure - Xi −N (µs, 1) with µs - stratum specificmean.
• µs = µage + U(−.5, .5), µage= age = 0,1, or 2.
• Age - quantifiable matching factor, the U(−.5, .5) forunquantifiable.
• Independent “risk sets” of size 50 were generated for400 cases.
13
• Disease status - prob based on rate ratios,multiplicative in age and exposure withrrage = 2, rrX =
√2, βX = .35.
• A single individual was randomly sampled from the 49in the risk set to serve as the control in the matchedpair set.
• 25% missing X completely at random: 56% pairs hadcomplete data.
• Results based on 500 trials.
14
SIMULATION RESULTS
Mean Empirical Estimated
Method parameter s.e. s.e.√
MSE
No missing 0.35 0.080 0.075 0.075
Complete case 0.35 0.098 0.101 0.101
Break the matching:Pooled over age 0.19 0.052 0.061 0.171Age stratified 0.26 0.065 0.073 0.114
Retain matching in complete pairs:Pooled over age 0.29 0.078 0.077 0.099Age-stratified 0.32 0.082 0.083 0.087Age-trend 0.31 0.081 0.082 0.088
15
“RETAIN THE MATCHING” METHOD
• Efficiency is as least as good as complete-caseanalysis.
• Bias is never as bad as breaking the matching.
Good compromise to the two most commonlyused alternatives.
16
DATA ANALYSIS: FITTING THE “RETAINTHE MATCHING” METHOD
Missing indicator analysis is exactly the “retainthe matching in complete pairs” analysis.
0Huberman and Langholz (1999) AJE 150:1340-1345.
17
MISSING INDICATOR ANALYSIS DATASET
Exposure Missing Age-stratified MIC-C Case(1) Age (0 if missing) indicator Age=1 Age=2 Age=3 Age-trendSet control(0) group Z M1 M2 M3 Mtrend38 1 3 2.99688 0 0 0 0 038 0 3 0.00000 1 0 0 1 3
39 1 3 2.91090 0 0 0 0 039 0 3 0.00000 1 0 0 1 3
40 1 2 0.92814 0 0 0 0 040 0 2 0.00000 1 0 1 0 2
41 1 1 0.62488 0 0 0 0 041 0 1 0.90149 0 0 0 0 0
42 1 1 0.51196 0 0 0 0 042 0 1 -0.76174 0 0 0 0 0
43 1 3 2.48709 0 0 0 0 043 0 3 0.00000 1 0 0 1 3
44 1 3 2.65434 0 0 0 0 044 0 3 0.00000 1 0 0 1 3
45 1 2 0.00000 1 0 1 0 245 0 2 0.00000 1 0 1 0 2
46 1 1 0.00000 1 1 0 0 146 0 1 0.00000 1 1 0 0 1
18
THREAD II: SOME THEORY
• Develop a theory for predictable missingness (doesn’tdepend on case-control status) .
– Consider complete data cohort intensity (rates)under proportional hazards.
– Use the double expectation formula to derivegeneral form of missing data “induced” intensity.
– Extend the proportional hazards model toencompass the possible form(s) of the inducedintensity.
– Propose a semi-parametric version of this model.
– Fitting the model.
– Complete case analysis.
• Adapted (depends on case-control status) missingness.
19
KEY ASSUMPTION FOR THIS PART
• I assume that missingness is predictable - does notdepend on case-control status.
20
COHORT - CASE-CONTROL STUDIES
Key point (Ørnulf’s talk):
• If the true intensity is a member of the specifiedproportional hazards model, then the partiallikelihood(s) provide valid estimation of β0.
• Works for both Cox (for full cohort) and conditionallogistic (for individually matched case-control data)regression.
21
INTENSITY FOR COMPLETE COVARIATEDATA
• Let (Ni, Yi, Zi) counting, censoring, and covariateprocesses (i.i.d.).
• Z - corresponding filtration.
• Then the “complete data” Z-intensity is
λi(t;Z)dt = E[dNi(t)|Zt− ]
0Andersen et al (1992) Statistical Models Based on Counting Pro-cesses, Springer.
22
PROPORTIONAL HAZARDS MODEL FORCOMPLETE DATA
• Semi-parametric model for the rate of disease as afunction of factors (e.g., radiation, genes).
• λ(t, y, z;α(·), β) = y α(t) r(z;β).
– α - rate of disease in “unexposed” as a functionof time (nuisance function).
– r - “rate ratio” as a function of z.
– β - rate ratio parameters to be estimated.
• Assume that there exist λ0, β0 such thatλi(t;Z) = Yi(t) λ0(t) r(Zi(t);β0).
23
DATA MODEL FOR (PREDICTABLE)MISSING DATA
• Zi(t) - covariate with no missing data.
• Xi(t) - single covariate with missing data.
• Mi(t) - missing indicator, independent and leftcontinuous.
• Representation for missing covariate data(Ni, Yi, (1−Mi)Xi,Mi, Zi).
• M-filtration generated by{Ni(t), Yi(t), [1−Mi(t)]Xi(t),Mi(t), Zi(t)}.
Note: Mi is M-predictable.
24
INTENSITY PROCESS FOR(PREDICTABLE) MISSING DATA
• The “Innovation Theorem” says that the M (missingdata)-intensity
λi(t;M)dt = E[dNi(t)|Mt− ] = E[λi(t;Z)|Mt− ]dt
(Double expectation formula for processes)
• Proportional hazards:
λi(t;M) = Yi(t)λ0(t)E[r(Xi, Zi(t);β0)|Mt− ]
25
MISSING DATA INDUCED INTENSITY
Assumptions:
• Distribution of X only depends on summaries of otherfactors at t.
• Recall i.i.d.
Define:
• FX(·; z, t; β0, α0) - distribution function for X conditionalon Y (t) = 1, M(t) = 1, and Z(t) = z.
26
MISSING DATA INDUCED INTENSITY
λi(t;M) = Yi(t) λ0(t) r(Xi(t), Zi(t);β0)1−Mi(t)
× η[t, Zi(t);β0, λ0, FX(·; t, Zi(t);β0, λ0)]Mi(t)
where
η[t, z;β0, λ0(·), FX(·; t, z;β0, λ0)] =
E[r(X, z;β0)|Y (t) = 1,M(t) = 1]
=∫
r(x, z;β0)dFX(x; t, z;β0, λ0).
27
MISSING DATA INDUCED MODEL
• We will need to expand the complete data modelwith a component to accommodate η.
• In (very general) the model would look like:
λ(t, y, (x, z),m;α, β) =
y α(t) r(x, z;β)1−m η[t, z;β, α, Fψ(·; t, z;β, α)]m
where
η(t, z;β, α) =∫
r(x, z;β) dFψ(x; t, z;β, α)
and Fψ is a family of distribution functions.
28
“MODELLING THE MISSINGNESS”
• Modelling the missingness is modelling η, the“expected rate ratio.”
• η(t, z;β, α, Fψ(x, z, t;β, α)) depends β, α and Fψ.
• Missing data methods for failure time data can beseen as different ways of structuring η.
• Prentice (1982)1 assumes a distributional form onX|Z ∼ N(µZ , σ2) and take expectation as functionof β.
1Prentice (1982) Biometrika 69:331-342.
29
A SEMI-PARAMETRIC APPROACH
Model that encompasses the “missing data” inducedintensity under proportional hazards:
λ(t, y, (x, z),m;α, β) = y α(t) r(x, z;β)1−m η(t, z; γ)m.
where η(t, z; γ) corresponds to the “expected rate ratio”E[r(X, Z;β)|Y = 1,M = 1, Z = z)].
• Since the model is still proportional hazards, standardpartial likelihood methods are valid for both cohortand case-control data.
• Need to capture the variation in expected rate ratio ηwith t and (known) modelled covariates Z.
30
SOME OBSERVATIONS
• “Non-parametric” with respect to distribution of X(just like complete data model).
• Probably don’t need to model η too well as it is anuisance parameter in the model.
31
DATA ANALYSIS: PARTIAL LIKELIHOODFOR THE SEMI-PARAMETRIC
MISSINGNESS MODEL
“Missing indicator” method.
32
COMPLETE CASE ANALYSIS
Theorem: If η(t, z; γ) = γ(t, z) (“unstructured”) then, thepartial likelihood estimator is the complete case analysis.
(Pseudo) Proof:
y α(t) r(x, z; β)1−m γ(t, z)m =
{y α(t) r(x, z; β) if m = 0y γ̄(t, z) if m = 1
where γ̄(t, z) = α(t)γ(t, z) captures the disease rate in missing.
• “Missing data” stratified model.
• PL contribution from subjects with missing data isconstant, so “dropped” from analysis.
• Complete case analysis.
33
CONSEQUENCES OF THE THEOREM
• Complete case analysis is “infinite parameter” missingindicator-t-Z interaction model.
• Complete case analysis is thenon-parametric-limiting-case of missing indicatormodelling.
• Spectrum:
– Single parameter - most bias, least variance.
– Complete case - least bias, most variance.
Model missingness with “enough but not too many”parameters.
34
SIMULATION STUDY: PREDICTABLEMISSINGNESS
• X - Dichotomous, can be missing (RR=2).
• Z - Three levels, no missing (RR=4/level).
• 1:1 matched data (as before), matching factor is not aconfounder.
• X-missingness dependence (predictable): none, Z, X,Z −X.
• Missing data methods:
– Complete case.
– Z-specific missing indicators (right model).
• Results based on 500 trials.
35
SIMULATION RESULTS: X MISSING IN 50%:
BIAS
Missing Eβ̂X Eβ̂ZX Complete Missing Complete Missing
depend. case indicators case indicatorsnone 0.72 0.71 1.44 1.41Z 0.72 0.69 1.44 1.42X 0.73 0.71 1.44 1.42Z −X 0.71 0.70 1.43 1.42
1No missing: Eβ̂X = 0.69, Eβ̂Z = 1.40
36
SIMULATION RESULTS: X MISSING IN 50%
VARIANCE
Missing var β̂aX var β̂aZX Complete Missing Complete Missing
depend. case indicators case indicatorsnone 0.138 0.064 0.068 0.029Z 0.121 0.062 0.081 0.040X 0.129 0.064 0.062 0.027Z −X 0.105 0.061 0.049 0.026aVariance well estimated by usual inverse information estimator.
37
ADAPTED MISSINGNESS
Definition:
• Missingness depends on case-control status.
38
ADAPTED MISSINGNESSSIMULATION RESULTS: X MISSING IN 50%:
BIAS
Missing Eβ̂X Eβ̂ZX Complete Missing Complete Missing
depend. case indicators case indicatorsD 0.72 0.71 1.46 1.43D − Z 0.72 0.72 0.70 0.69D −X -0.26 -0.24 1.46 1.42
39
THEORY FOR ADAPTED MISSINGNESS:SECOND STAGE SAMPLING
• r1 - case-control set.
• r2 - subjects in the case-control set with completedata (“2nd stage sample”).
• Ni,r1,r2 - counts i,r1,r2 ”events.”
• π(r2|i, r1) - probability of complete data (r2) givencase-control status.
Proportional hazards:
λi,r1,r2(t) = Yi λ0(t) r(Xi, Zi;β0) π(r1|i) π(r2|i, r1)
40
CONJECTURE: PREDICTABLE VS.ADAPTED MISSINGNESS
• Predictable (missingness does not depend oncase-control status) - can estimate all parameters.
• Adapted (missing depends on case-control status)-can’t estimate all parameters. 2
Conjecture:To estimate all parameters in the model predictability issufficient and necessary.
2D dependent missingness: can’t estimate λ0.
41
THREAD III: HAVE MISSING INDICATORMETHODS GOTTEN A BAD RAP?
42
SO, IS EVERYONE ELSE WRONG ABOUTMISSING INDICATORS?
Two considerations:
• Lack of understanding that one needs to “modellingthe missingness:” model the variation in expectedrate ratio.
• Something special about individually matchedcase-control setting.
43
“FULL COHORT” ANALYSIS
Mean EmpiricalMethod parameter s.e.
No missing 0.34 0.05
Complete case 0.34 0.11
Missing indicator modelsSingle MI 0.24 0.09Z-MI interaction 0.34 0.11Z-MI trend 0.32 0.11
280% missing, MCAR
44
LIKELY SITUATION: WE’RE ALL “RIGHT”
• In other settings, there is less (no?!?) efficiency gainusing missing indicator?
– Survival analysis.
– Single stratum unmatched case-control studies.
Need to evaluate each setting.
45
THREAD IV: “CLASSICAL”MCAR/MAR/NI CLASSIFICATION OF
MISSINGNESS
Piak and Sacco (2000) Regression calibration (imputation)Rathouz, et al. (2002) Missing data likelihoodPiak (2004) Regression calibration (imputation)
All assume MAR.
46
SIMULATION RESULTS: X MISSING IN 50%:
BIAS
Missing Eβ̂X Eβ̂ZX Miss Comp Miss Comp Miss
depend. type case ind case indPredictable missingness
none MCAR 0.72 0.71 1.44 1.41Z MAR 0.72 0.69 1.44 1.42X NI 0.73 0.71 1.44 1.42Z −X NI 0.71 0.70 1.43 1.42
Adapted missingnessD MAR 0.72 0.71 1.46 1.43D − Z MAR 0.72 0.72 0.70 0.69D −X NI -0.26 -0.24 1.46 1.42
47
CLASSICAL CLASSIFICATIONS:UNBIASEDNESS OF COMPLETE CASE OR
MISSING INDICATOR
• MCAR OK, but too strong (sufficient but notnecessary).
• MAR can be unbiased (e.g., MAR(Z)) or biased(e.g., MAR(D − Z)).
• NI can be unbiased (e.g., NI(X)) or biased (e.g.,NI(D −X)).
Not very useful for predicting validity ofmethods.
48
IS MCAR/MAR REASONABLE?
Even if asked prior to disease occurrence, missing thefollowing are likely to be dependent on the actual value(NI):
• Did you smoke during pregnancy?
• What is your current income?
• Radiation dose as abstracted from medical records.
Better not to make a MCAR/MAR assumption.
49
MULTIPLE IMPUTATION APPLIED TOMATCHED PAIRS
SIMULATION STUDY
• X and Z covariates, X with missing.
• Used multiple imputation (SAS PROCs MI, PHREG,and MIANALYZE).
• Imputation model has case-control status and Z.
• Only looked at performance for estimating βX = .69.
50
MISSING INDICATOR AND MULTIPLEIMPUTATION
Missing MultipleMissing Miss indicators imputation
depend. type β̂X varβ̂X β̂X varβ̂Xnone MCAR 0.71 0.26 0.70 0.27Z MAR 0.71 0.27 0.71 0.25X NI 0.70 0.27 0.70 0.28Z −X NI 0.70 0.26 0.69 0.28D MAR 0.70 0.27 0.72 0.30D − Z MAR 0.74 0.27 0.70 0.28D −X NI 1.06 – 2.04 –
51
MISSING INDICATOR AND MULTIPLEIMPUTATION FOR 1:1 DATA
Comparison:
• Performance, including precision, very similar tomissing indicators.
• Both methods require “modelling missingness.”
Classical classification and multiple imputation:
• MCAR/MAR/NI not relevant.
52
CLOSING THOUGHTS
53
MISSING INDICATOR FOR INDIVIDUALLYMATCHED CASE-CONTROL STUDIES
• Standard conditional logistic software using standardmodelling techniques.
• Predictability is key assumption for valid estimation.
• Need to model the variation in the expected rateratio (missing indicator-covariate interactions).
54
GENERAL COMMENTS
• Classical classification of missingness was not veryuseful in assessing estimator behavior nor guiding themethod of analysis: Some research to be done here?
• Predictable vs. adapted classifies well and is closelyassociated with the concept of information bias.
• Adding MAR or MCAR assumptions about missingtype may yield estimators with better efficiency.
• MAR and MCAR should be evaluated carefully.
55