
Missing covariate data in matched case-control studies: Do the usual paradigms apply?

Bryan Langholz

USC Department of Preventive Medicine

Joint work with Mulugeta Gebregziabher

Larry Goldstein

Mark Huberman

1

CONTEXT AND SCOPE

• Missing covariate data in nested case-control studies.

• Epidemiologic individually matched case-control data.

• 1:1 (single case-single control).

• Single covariate that has missing values.

• Missing indicator methods.

2

MISSING INDICATOR METHOD IN THE LITERATURE

• Jones (1996) JASA 91:222-230.
  Context: Least squares regression.
  Comment: “The missing-indicator methods show unacceptably large biases in practical situations and are not advisable in general.”

• Vach and Blettner (1991) AJE 134:895-907.
  Context: Unmatched case-control studies (logistic).
  Comment: “Making use of data on subjects who are neglected in complete case analysis by creating an additional category always results in biased estimation.”

• Greenland and Finkle (1995) AJE 142:1255-1264.
  Context: Unmatched case-control studies (logistic).
  Comment: “The authors recommend that epidemiologists avoid using the missing-indicator method...”

• Li, et al. (2004) AJE 159:603-10.
  Context: Individually matched case-control.
  Comment: “...the missing-indicator method should be used cautiously.”

3

THREADS

• THREAD I: Some background: Missing data in matched pairs.

• THREAD II: Some theory: Missing data induced intensity and models.

• THREAD III: Missing indicator methods: A bad rap?

• THREAD IV: Is MCAR/MAR/NI classification “a square peg in a round hole?”

4

THREAD I: SOME BACKGROUND

• Motivation: Childhood leukemia study.

• Old work: “Retain complete pairs” method for matched pairs.

• Data analysis.

5

CASE-CONTROL STUDY OF MAGNETIC FIELDS AND CHILDHOOD LEUKEMIA

• Individually matched population based case-control study in Los Angeles County.

• Cases identified retrospectively from Los Angeles SEER Registry.

• Controls individually age matched with friends or random digit dial.

• 232 Matched pairs.

• Meters used to collect 24-hour magnetic field profile in child’s bedroom.

• Spot measurements.

London et al (1991) AJE 134:923-937

6

LOS ANGELES CHILDHOOD LEUKEMIA STUDY

Information                                          Cases  Controls  Missing subjects  Complete matched pairs
Interview data                                       232    232       0%                232
24-hour magnetic field measurements                  164    144       34%               108
Spot magnetic field measurement in child’s bedroom   140    109       46%               71

7

REASONS FOR MISSING DATA

• Most because the family had moved from the “relevant” home, so it was not possible to measure fields.

• Instrument problems.

• Some refusal.

8

COMPLETE CASE ANALYSIS

• Drop subjects with missing data.

• Analyze as matched case-control study.

Note that:

• Pair is dropped if either case or control is missing.

Large loss of data.

9

BREAK THE MATCHING

• Drop subjects with missing data.

• Break the matching.

• Analyze as unmatched case-control study, control for confounding by matching factors by modelling.

Note that:

• Can model age but

• Quantification of “friend” or “telephone number” matching not possible.

Potential residual confounding.

10

SOME PAST WORK: A BETTER METHOD

Compromise between complete case and break the matching:

• Retain matching for those pairs in which both subjects have covariate data.

• Break matching for subjects with data in pairs with “missing data partner.”

Avoids large loss of data (complete case analysis) and minimizes potential uncontrolled confounding (break the matching).

11

ANALYSIS

• Matched pairs part contributes conditional logistic likelihood part.

• Unmatched part contributes unconditional logistic likelihood part (perhaps control for confounding by modelling matching variables).

Estimation based on product of conditional and unconditional logistic likelihood parts.
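The product of the two likelihood parts can be sketched as a single log-likelihood function. This is a minimal illustration, not the authors' code: the function and argument names are hypothetical, the rate ratio is assumed log-linear in a single exposure, and the unconditional part carries only an intercept `alpha` (terms for age or other matching variables could be added there).

```python
import math


def _log1pexp(v):
    # Numerically stable log(1 + exp(v)).
    return v + math.log1p(math.exp(-v)) if v > 0 else math.log1p(math.exp(v))


def combined_loglik(beta, alpha, complete_pairs, unmatched):
    """Log of (conditional part) x (unconditional part).

    complete_pairs: list of (x_case, x_control) for pairs with both exposures observed.
    unmatched: list of (d, x) for subjects whose partner is missing; d = 1 for a case.
    Both data layouts are assumptions made for this sketch.
    """
    # Conditional logistic part for 1:1 pairs:
    # P(case is the case | pair) = 1 / (1 + exp(-beta * (x_case - x_control))).
    cond = -sum(_log1pexp(-beta * (xc - x0)) for xc, x0 in complete_pairs)
    # Unconditional logistic part with intercept alpha.
    uncond = sum(
        d * (alpha + beta * x) - _log1pexp(alpha + beta * x) for d, x in unmatched
    )
    return cond + uncond
```

Maximizing this function over `beta` and `alpha` with any general-purpose optimizer would give the combined estimate under these assumptions.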

12

SIMULATION STUDY PARAMETERS

• Three age groups: 50%, 40%, 10%, youngest to oldest.

• Exposure: $X_i \sim N(\mu_s, 1)$, with $\mu_s$ the stratum-specific mean.

• $\mu_s = \mu_{\text{age}} + U(-0.5, 0.5)$, with $\mu_{\text{age}} = \text{age} = 0$, $1$, or $2$.

• Age is the quantifiable matching factor; the $U(-0.5, 0.5)$ term stands for the unquantifiable one.

• Independent “risk sets” of size 50 were generated for 400 cases.

13

• Disease status: probability based on rate ratios, multiplicative in age and exposure, with $rr_{\text{age}} = 2$, $rr_X = \sqrt{2}$, $\beta_X = 0.35$.

• A single individual was randomly sampled from the 49 in the risk set to serve as the control in the matched pair set.

• 25% missing X completely at random: 56% of pairs had complete data.

• Results based on 500 trials.
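A quick arithmetic check on two of the numbers above, assuming the rate ratio is log-linear in X (so the coefficient is the log rate ratio) and that missingness is independent between the two members of a pair:

```latex
\beta_X = \log rr_X = \log\sqrt{2} = \tfrac{1}{2}\log 2 \approx 0.347,
\qquad
P(\text{both members observed}) = (1 - 0.25)^2 = 0.5625 \approx 56\%.
```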

14

SIMULATION RESULTS

                                     Mean       Empirical  Estimated
Method                               parameter  s.e.       s.e.       √MSE
No missing                           0.35       0.080      0.075      0.075
Complete case                        0.35       0.098      0.101      0.101
Break the matching:
  Pooled over age                    0.19       0.052      0.061      0.171
  Age stratified                     0.26       0.065      0.073      0.114
Retain matching in complete pairs:
  Pooled over age                    0.29       0.078      0.077      0.099
  Age-stratified                     0.32       0.082      0.083      0.087
  Age-trend                          0.31       0.081      0.082      0.088

15

“RETAIN THE MATCHING” METHOD

• Efficiency is at least as good as complete-case analysis.

• Bias is never as bad as breaking the matching.

Good compromise between the two most commonly used alternatives.

16

DATA ANALYSIS: FITTING THE “RETAIN THE MATCHING” METHOD

Missing indicator analysis is exactly the “retain the matching in complete pairs” analysis.

Huberman and Langholz (1999) AJE 150:1340-1345.

17

MISSING INDICATOR ANALYSIS DATASET

C-C  Case(1)/    Age    Exposure Z      Missing    Age-stratified MI                   Age-trend MI
Set  control(0)  group  (0 if missing)  indicator  M1 (Age=1)  M2 (Age=2)  M3 (Age=3)  Mtrend
38   1           3       2.99688        0          0           0           0           0
38   0           3       0.00000        1          0           0           1           3
39   1           3       2.91090        0          0           0           0           0
39   0           3       0.00000        1          0           0           1           3
40   1           2       0.92814        0          0           0           0           0
40   0           2       0.00000        1          0           1           0           2
41   1           1       0.62488        0          0           0           0           0
41   0           1       0.90149        0          0           0           0           0
42   1           1       0.51196        0          0           0           0           0
42   0           1      -0.76174        0          0           0           0           0
43   1           3       2.48709        0          0           0           0           0
43   0           3       0.00000        1          0           0           1           3
44   1           3       2.65434        0          0           0           0           0
44   0           3       0.00000        1          0           0           1           3
45   1           2       0.00000        1          0           1           0           2
45   0           2       0.00000        1          0           1           0           2
46   1           1       0.00000        1          1           0           0           1
46   0           1       0.00000        1          1           0           0           1
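The derived columns in this dataset follow a simple rule, sketched below. The function name and the use of `None` to mark a missing exposure are assumptions for illustration; the slides do not say how the dataset was actually built.

```python
def missing_indicator_row(x, age_group, n_age_groups=3):
    """Return (Z, M, [M1..Mk], Mtrend) for one subject.

    x is the exposure, or None when it is missing (a convention assumed here).
    age_group is coded 1..n_age_groups, as in the dataset shown.
    """
    missing = x is None
    z = 0.0 if missing else float(x)  # exposure, set to 0 if missing
    m = 1 if missing else 0           # single missing indicator
    # Age-stratified indicators: 1 only for a missing subject in that age group.
    m_age = [1 if (missing and age_group == g) else 0
             for g in range(1, n_age_groups + 1)]
    # Age-trend indicator: the age group code if missing, else 0.
    m_trend = age_group if missing else 0
    return z, m, m_age, m_trend
```

For the control in set 38 (age group 3, exposure missing) this gives Z = 0, M = 1, (M1, M2, M3) = (0, 0, 1), and Mtrend = 3, matching the row shown. The pooled, age-stratified, and age-trend analyses then correspond to including M, the M1 to M3 columns, or Mtrend, respectively, alongside Z in a conditional logistic fit.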

18

THREAD II: SOME THEORY

• Develop a theory for predictable missingness (doesn’t depend on case-control status).

– Consider complete data cohort intensity (rates) under proportional hazards.

– Use the double expectation formula to derive general form of missing data “induced” intensity.

– Extend the proportional hazards model to encompass the possible form(s) of the induced intensity.

– Propose a semi-parametric version of this model.

– Fitting the model.

– Complete case analysis.

• Adapted (depends on case-control status) missingness.

19

KEY ASSUMPTION FOR THIS PART

• I assume that missingness is predictable: it does not depend on case-control status.

20

COHORT - CASE-CONTROL STUDIES

Key point (Ørnulf’s talk):

• If the true intensity is a member of the specified proportional hazards model, then the partial likelihood(s) provide valid estimation of $\beta_0$.

• Works for both Cox (for full cohort) and conditional logistic (for individually matched case-control data) regression.

21

INTENSITY FOR COMPLETE COVARIATE DATA

• Let $(N_i, Y_i, Z_i)$ be the counting, censoring, and covariate processes (i.i.d.).

• $\mathcal{Z}$: the corresponding filtration.

• Then the “complete data” $\mathcal{Z}$-intensity is

$$\lambda_i(t; \mathcal{Z})\,dt = E[dN_i(t) \mid \mathcal{Z}_{t-}]$$

Andersen et al (1992) Statistical Models Based on Counting Processes, Springer.

22

PROPORTIONAL HAZARDS MODEL FOR COMPLETE DATA

• Semi-parametric model for the rate of disease as a function of factors (e.g., radiation, genes).

• $\lambda(t, y, z; \alpha(\cdot), \beta) = y\,\alpha(t)\,r(z; \beta)$.

– $\alpha$: rate of disease in the “unexposed” as a function of time (nuisance function).

– $r$: “rate ratio” as a function of $z$.

– $\beta$: rate ratio parameters to be estimated.

• Assume that there exist $\lambda_0$, $\beta_0$ such that $\lambda_i(t; \mathcal{Z}) = Y_i(t)\,\lambda_0(t)\,r(Z_i(t); \beta_0)$.

23

DATA MODEL FOR (PREDICTABLE) MISSING DATA

• $Z_i(t)$: covariate with no missing data.

• $X_i(t)$: single covariate with missing data.

• $M_i(t)$: missing indicator, independent and left continuous.

• Representation for missing covariate data: $(N_i, Y_i, (1 - M_i) X_i, M_i, Z_i)$.

• $\mathcal{M}$: filtration generated by $\{N_i(t), Y_i(t), [1 - M_i(t)] X_i(t), M_i(t), Z_i(t)\}$.

Note: $M_i$ is $\mathcal{M}$-predictable.

24

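INTENSITY PROCESS FOR (PREDICTABLE) MISSING DATA

• The “Innovation Theorem” says that the $\mathcal{M}$ (missing data)-intensity is

$$\lambda_i(t; \mathcal{M})\,dt = E[dN_i(t) \mid \mathcal{M}_{t-}] = E[\lambda_i(t; \mathcal{Z}) \mid \mathcal{M}_{t-}]\,dt$$

(Double expectation formula for processes)

• Proportional hazards:

$$\lambda_i(t; \mathcal{M}) = Y_i(t)\,\lambda_0(t)\,E[r(X_i, Z_i(t); \beta_0) \mid \mathcal{M}_{t-}]$$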

25

MISSING DATA INDUCED INTENSITY

Assumptions:

• Distribution of X only depends on summaries of other factors at t.

• Recall i.i.d.

Define:

• $F_X(\cdot; t, z; \beta_0, \lambda_0)$: distribution function for $X$ conditional on $Y(t) = 1$, $M(t) = 1$, and $Z(t) = z$.

26

MISSING DATA INDUCED INTENSITY

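$$\lambda_i(t; \mathcal{M}) = Y_i(t)\,\lambda_0(t)\,r(X_i(t), Z_i(t); \beta_0)^{1 - M_i(t)} \times \eta[t, Z_i(t); \beta_0, \lambda_0, F_X(\cdot; t, Z_i(t); \beta_0, \lambda_0)]^{M_i(t)}$$

where

$$\eta[t, z; \beta_0, \lambda_0(\cdot), F_X(\cdot; t, z; \beta_0, \lambda_0)] = E[r(X, z; \beta_0) \mid Y(t) = 1, M(t) = 1] = \int r(x, z; \beta_0)\,dF_X(x; t, z; \beta_0, \lambda_0).$$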

27

MISSING DATA INDUCED MODEL

• We will need to expand the complete data model with a component to accommodate $\eta$.

• In (very) general terms, the model would look like:

$$\lambda(t, y, (x, z), m; \alpha, \beta) = y\,\alpha(t)\,r(x, z; \beta)^{1 - m}\,\eta[t, z; \beta, \alpha, F_\psi(\cdot; t, z; \beta, \alpha)]^{m}$$

where

$$\eta(t, z; \beta, \alpha) = \int r(x, z; \beta)\,dF_\psi(x; t, z; \beta, \alpha)$$

and $F_\psi$ is a family of distribution functions.

28

“MODELLING THE MISSINGNESS”

• Modelling the missingness is modelling $\eta$, the “expected rate ratio.”

• $\eta[t, z; \beta, \alpha, F_\psi(\cdot; t, z; \beta, \alpha)]$ depends on $\beta$, $\alpha$, and $F_\psi$.

• Missing data methods for failure time data can be seen as different ways of structuring $\eta$.

• Prentice (1982)¹ assumes a distributional form $X \mid Z \sim N(\mu_Z, \sigma^2)$ and takes the expectation as a function of $\beta$.

¹ Prentice (1982) Biometrika 69:331-342.
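To illustrate how a distributional assumption structures the expected rate ratio, here is a worked example under assumptions the slides do not make: a log-linear rate ratio $r(x, z; \beta) = \exp(\beta_X x + \beta_Z z)$, and X normal with mean $\mu_z$ and variance $\sigma^2$ among at-risk subjects with missing X and $Z = z$. The expectation is then the normal moment generating function:

```latex
\eta(z; \beta)
  = E\!\left[ e^{\beta_X X + \beta_Z z} \,\middle|\, Z = z \right]
  = \exp\!\left( \beta_Z z + \beta_X \mu_z + \tfrac{1}{2} \beta_X^2 \sigma^2 \right).
```

In this sketch the expected rate ratio is a known function of $\beta_X$ once $\mu_z$ and $\sigma^2$ are known or estimated separately, which is how a parametric assumption lets subjects with missing X contribute information.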

29

A SEMI-PARAMETRIC APPROACH

Model that encompasses the “missing data” induced intensity under proportional hazards:

$$\lambda(t, y, (x, z), m; \alpha, \beta) = y\,\alpha(t)\,r(x, z; \beta)^{1 - m}\,\eta(t, z; \gamma)^{m},$$

where $\eta(t, z; \gamma)$ corresponds to the “expected rate ratio” $E[r(X, Z; \beta) \mid Y = 1, M = 1, Z = z]$.

• Since the model is still proportional hazards, standard partial likelihood methods are valid for both cohort and case-control data.

• Need to capture the variation in the expected rate ratio $\eta$ with $t$ and the (known) modelled covariates $Z$.

30

SOME OBSERVATIONS

• “Non-parametric” with respect to distribution of X (just like complete data model).

• Probably don’t need to model $\eta$ too well as it is a nuisance parameter in the model.

31

DATA ANALYSIS: PARTIAL LIKELIHOOD FOR THE SEMI-PARAMETRIC MISSINGNESS MODEL

“Missing indicator” method.

32

COMPLETE CASE ANALYSIS

Theorem: If $\eta(t, z; \gamma) = \gamma(t, z)$ (“unstructured”), then the partial likelihood estimator is the complete case analysis.

(Pseudo) Proof:

$$y\,\alpha(t)\,r(x, z; \beta)^{1 - m}\,\gamma(t, z)^{m} = \begin{cases} y\,\alpha(t)\,r(x, z; \beta) & \text{if } m = 0 \\ y\,\bar{\gamma}(t, z) & \text{if } m = 1 \end{cases}$$

where $\bar{\gamma}(t, z) = \alpha(t)\,\gamma(t, z)$ captures the disease rate in subjects with missing data.

• “Missing data” stratified model.

• PL contribution from subjects with missing data is constant, so they are “dropped” from the analysis.

• Complete case analysis.
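For 1:1 matched data the argument can be spelled out pair by pair. The following is a sketch only, assuming a log-linear rate ratio $r(x, z; \beta) = e^{\beta x}$ with the Z term suppressed (an illustrative choice, not specified above), for a case $i$ and control $j$ sampled at time $t$:

```latex
\begin{align*}
\text{both complete:} \quad
  & \frac{e^{\beta x_i}}{e^{\beta x_i} + e^{\beta x_j}}
  && \text{(usual conditional logistic term)} \\
\text{both missing, } z_i = z_j = z: \quad
  & \frac{\gamma(t, z)}{\gamma(t, z) + \gamma(t, z)} = \frac{1}{2}
  && \text{(free of } \beta\text{)} \\
\text{case complete, control missing:} \quad
  & \frac{e^{\beta x_i}}{e^{\beta x_i} + \gamma(t, z_j)}
    \;\to\; 1 \text{ as } \gamma(t, z_j) \to 0
  && \text{(supremum free of } \beta\text{)}
\end{align*}
```

Because $\gamma(t, z)$ is unrestricted at each $t$, the mixed pairs can be fitted perfectly whatever the value of $\beta$ (the case-missing, control-complete pair behaves the same way as $\gamma \to \infty$), so only pairs with complete data on both members inform the estimate of $\beta$.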

33

CONSEQUENCES OF THE THEOREM

• Complete case analysis is the “infinite parameter” missing indicator-t-Z interaction model.

• Complete case analysis is the non-parametric limiting case of missing indicator modelling.

• Spectrum:

– Single parameter - most bias, least variance.

– Complete case - least bias, most variance.

Model missingness with “enough but not too many” parameters.

34

SIMULATION STUDY: PREDICTABLE MISSINGNESS

• X - Dichotomous, can be missing (RR=2).

• Z - Three levels, no missing (RR=4/level).

• 1:1 matched data (as before), matching factor is not a confounder.

• X-missingness dependence (predictable): none, Z, X, Z−X.

• Missing data methods:

– Complete case.

– Z-specific missing indicators (right model).

• Results based on 500 trials.

35

SIMULATION RESULTS: X MISSING IN 50%: BIAS

Missing   $E\hat{\beta}_X$                      $E\hat{\beta}_Z$
depend.   Complete case   Missing indicators   Complete case   Missing indicators
none      0.72            0.71                 1.44            1.41
Z         0.72            0.69                 1.44            1.42
X         0.73            0.71                 1.44            1.42
Z−X       0.71            0.70                 1.43            1.42

No missing: $E\hat{\beta}_X = 0.69$, $E\hat{\beta}_Z = 1.40$

36

SIMULATION RESULTS: X MISSING IN 50%: VARIANCE

Missing   $\operatorname{var}\hat{\beta}_X$ (a)            $\operatorname{var}\hat{\beta}_Z$ (a)
depend.   Complete case   Missing indicators   Complete case   Missing indicators
none      0.138           0.064                0.068           0.029
Z         0.121           0.062                0.081           0.040
X         0.129           0.064                0.062           0.027
Z−X       0.105           0.061                0.049           0.026

(a) Variance well estimated by usual inverse information estimator.

37

ADAPTED MISSINGNESS

Definition:

• Missingness depends on case-control status.

38

ADAPTED MISSINGNESS SIMULATION RESULTS: X MISSING IN 50%: BIAS

Missing   $E\hat{\beta}_X$                      $E\hat{\beta}_Z$
depend.   Complete case   Missing indicators   Complete case   Missing indicators
D         0.72            0.71                 1.46            1.43
D−Z       0.72            0.72                 0.70            0.69
D−X       -0.26           -0.24                1.46            1.42

39

THEORY FOR ADAPTED MISSINGNESS: SECOND STAGE SAMPLING

• $r_1$: case-control set.

• $r_2$: subjects in the case-control set with complete data (“2nd stage sample”).

• $N_{i, r_1, r_2}$: counts $(i, r_1, r_2)$ “events.”

• $\pi(r_2 \mid i, r_1)$: probability of complete data ($r_2$) given case-control status.

Proportional hazards:

$$\lambda_{i, r_1, r_2}(t) = Y_i\,\lambda_0(t)\,r(X_i, Z_i; \beta_0)\,\pi(r_1 \mid i)\,\pi(r_2 \mid i, r_1)$$

40

CONJECTURE: PREDICTABLE VS. ADAPTED MISSINGNESS

• Predictable (missingness does not depend on case-control status): can estimate all parameters.

• Adapted (missingness depends on case-control status): can’t estimate all parameters.²

Conjecture: To estimate all parameters in the model, predictability is sufficient and necessary.

² D-dependent missingness: can’t estimate $\lambda_0$.

41

THREAD III: HAVE MISSING INDICATOR METHODS GOTTEN A BAD RAP?

42

SO, IS EVERYONE ELSE WRONG ABOUT MISSING INDICATORS?

Two considerations:

• Lack of understanding that one needs to “model the missingness”: model the variation in the expected rate ratio.

• Something special about the individually matched case-control setting.

43

“FULL COHORT” ANALYSIS

Method                      Mean parameter   Empirical s.e.
No missing                  0.34             0.05
Complete case               0.34             0.11
Missing indicator models:
  Single MI                 0.24             0.09
  Z-MI interaction          0.34             0.11
  Z-MI trend                0.32             0.11

80% missing, MCAR

44

LIKELY SITUATION: WE’RE ALL “RIGHT”

• In other settings, is there less (no?!?) efficiency gain from using missing indicators?

– Survival analysis.

– Single stratum unmatched case-control studies.

Need to evaluate each setting.

45

THREAD IV: “CLASSICAL” MCAR/MAR/NI CLASSIFICATION OF MISSINGNESS

• Paik and Sacco (2000): Regression calibration (imputation)

• Rathouz, et al. (2002): Missing data likelihood

• Paik (2004): Regression calibration (imputation)

All assume MAR.

46

SIMULATION RESULTS: X MISSING IN 50%: BIAS

Missing   Miss   $E\hat{\beta}_X$         $E\hat{\beta}_Z$
depend.   type   Comp case   Miss ind   Comp case   Miss ind
Predictable missingness
none      MCAR   0.72        0.71       1.44        1.41
Z         MAR    0.72        0.69       1.44        1.42
X         NI     0.73        0.71       1.44        1.42
Z−X       NI     0.71        0.70       1.43        1.42
Adapted missingness
D         MAR    0.72        0.71       1.46        1.43
D−Z       MAR    0.72        0.72       0.70        0.69
D−X       NI     -0.26       -0.24      1.46        1.42

47

CLASSICAL CLASSIFICATIONS: UNBIASEDNESS OF COMPLETE CASE OR MISSING INDICATOR

• MCAR OK, but too strong (sufficient but not necessary).

• MAR can be unbiased (e.g., MAR(Z)) or biased (e.g., MAR(D−Z)).

• NI can be unbiased (e.g., NI(X)) or biased (e.g., NI(D−X)).

Not very useful for predicting validity of methods.

48

IS MCAR/MAR REASONABLE?

Even if asked prior to disease occurrence, missingness of the following is likely to depend on the actual value (NI):

• Did you smoke during pregnancy?

• What is your current income?

• Radiation dose as abstracted from medical records.

Better not to make a MCAR/MAR assumption.

49

MULTIPLE IMPUTATION APPLIED TO MATCHED PAIRS

SIMULATION STUDY

• X and Z covariates, X with missing.

• Used multiple imputation (SAS PROCs MI, PHREG, and MIANALYZE).

• Imputation model has case-control status and Z.

• Only looked at performance for estimating βX = .69.

50

MISSING INDICATOR AND MULTIPLE IMPUTATION

Missing   Miss   Missing indicators                                    Multiple imputation
depend.   type   $\hat{\beta}_X$   $\operatorname{var}\hat{\beta}_X$   $\hat{\beta}_X$   $\operatorname{var}\hat{\beta}_X$
none      MCAR   0.71              0.26                                0.70              0.27
Z         MAR    0.71              0.27                                0.71              0.25
X         NI     0.70              0.27                                0.70              0.28
Z−X       NI     0.70              0.26                                0.69              0.28
D         MAR    0.70              0.27                                0.72              0.30
D−Z       MAR    0.74              0.27                                0.70              0.28
D−X       NI     1.06              –                                   2.04              –

51

MISSING INDICATOR AND MULTIPLE IMPUTATION FOR 1:1 DATA

Comparison:

• Performance, including precision, very similar to missing indicators.

• Both methods require “modelling missingness.”

Classical classification and multiple imputation:

• MCAR/MAR/NI not relevant.

52

CLOSING THOUGHTS

53

MISSING INDICATOR FOR INDIVIDUALLY MATCHED CASE-CONTROL STUDIES

• Standard conditional logistic software using standard modelling techniques.

• Predictability is key assumption for valid estimation.

• Need to model the variation in the expected rate ratio (missing indicator-covariate interactions).

54

GENERAL COMMENTS

• Classical classification of missingness was not very useful in assessing estimator behavior or in guiding the method of analysis: Some research to be done here?

• Predictable vs. adapted classifies well and is closely associated with the concept of information bias.

• Adding MAR or MCAR assumptions about missing type may yield estimators with better efficiency.

• MAR and MCAR should be evaluated carefully.

55
