Missing Data Michael C. Neale International Workshop on Methodology for Genetic Studies of Twins and Families Boulder CO 2006 Virginia Institute for Psychiatric and Behavioral Genetics Virginia Commonwealth University Vrije Universiteit Amsterdam


Page 1: Missing Data

Missing Data
Michael C. Neale

International Workshop on Methodology for Genetic Studies of Twins and Families, Boulder CO, 2006

Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University

Vrije Universiteit Amsterdam

Page 2: Missing Data

Various forms of missing data

• Twins volunteer to participate or not
• Twins are obtained from hospital records
• Attrition during longitudinal studies
• Equipment failure:
  • Systematic (if < value then scored zero)
  • Random
• Structured interviews
  • Don't bother asking question 2 if response to Q1 is no
• Designs to increase statistical power

Page 3: Missing Data

Plan of Talk

1. Review likelihood & Mx facilities
2. Conceptual view of ascertainment
   1. Ordinal
   2. Continuous
3. Ascertainment in twin studies
   1. Ordinal
   2. Continuous
4. Missing data
   1. Systematic (if < value then scored zero)
   2. Random
5. Structured interviews
   1. Don't bother asking question 2 if response to Q1 is no

Page 4: Missing Data

Likelihood approach
Advantages & Disadvantages

• Usual nice properties of ML remain
• Flexible
• Simple principle
  • Consideration of possible outcomes
  • Re-normalization
• May be difficult to compute

Page 5: Missing Data

Maximum Likelihood Estimates
Have nice properties

• Asymptotically unbiased
• Minimum variance of all asymptotically unbiased estimators
• Invariant to transformations: $\widehat{a^2} = (\hat{a})^2$

Page 6: Missing Data

Example: Two Coin Toss
3 outcomes

[Bar chart: Frequency (axis 0 to 2.5) by Outcome (HH, HT/TH, TT)]

Probability i = freq i / sum (freqs)

Page 7: Missing Data

Example: Two Coin Toss
2 outcomes

[Bar chart: Frequency (axis 0 to 2.5) by Outcome (axis labels HH, HT/TH, TT)]

Probability i = freq i / sum (freqs)

Page 8: Missing Data

Non-random ascertainment
Example

• Probability of observing TT globally
  • 1 outcome from 4 = 1/4
• Probability of observing TT if HH is not ascertained
  • 1 outcome from 3 = 1/3
  • or 1/4 divided by 'ascertainment correction' of 3/4 = 1/3
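The re-normalization on this slide can be checked by enumeration. The following is a minimal sketch in R; the variable names are illustrative and not part of the original talk.

```r
# Enumerate the four equally likely outcomes of tossing two coins.
outcomes <- c("HH", "HT", "TH", "TT")
p <- rep(1 / 4, 4)
names(p) <- outcomes

# Global probability of TT.
p_tt <- p[["TT"]]

# Suppose HH is never ascertained: the remaining outcomes are
# re-normalized by the total probability of being ascertained.
ascertained <- outcomes != "HH"
correction <- sum(p[ascertained])             # 3/4
p_tt_given_ascertained <- p_tt / correction   # (1/4) / (3/4)

print(c(global = p_tt, correction = correction,
        corrected = p_tt_given_ascertained))
```

Dividing by the ascertainment correction gives the same answer as counting 1 outcome from the 3 that can be observed.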

Page 9: Missing Data

Correcting for ascertainment
Univariate continuous case; only subjects > t ascertained

[Figure: normal curve, likelihood (axis 0 to 0.5) against xi (axis -4 to 4), with the ascertainment threshold t marked]

Page 10: Missing Data

Correcting for ascertainment
Univariate continuous case; only subjects > t ascertained

[Figure: normal curve, likelihood (axis 0 to 0.5) against xi (axis -4 to 4), with the ascertainment threshold t marked]

Page 11: Missing Data

Correcting for ascertainment
Dividing by the realm of possibilities

Without ascertainment, we compute the pdf, $\phi(\mu_{ij}, \Sigma_{ij})$, at observed value $x_i$, divided by:

$\int_{-\infty}^{\infty} \phi(\mu_{ij}, \Sigma_{ij}) \, dx = 1$

With ascertainment, the correction is:

$\int_{t}^{\infty} \phi(\mu_{ij}, \Sigma_{ij}) \, dx < 1$

Does likelihood increase or decrease after correction?
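To make the closing question concrete, here is a small R sketch of the univariate correction. The threshold, the observed score and the standard normal parameters are assumed illustration values, not figures from the talk.

```r
# Assumed illustration values (not from the slides).
t  <- 0       # ascertainment threshold
xi <- 1.2     # an observed score above the threshold
mu <- 0
sigma <- 1

# Likelihood without correction: height of the normal pdf at xi.
uncorrected <- dnorm(xi, mean = mu, sd = sigma)

# Ascertainment correction: area of the pdf above t (less than 1).
correction <- pnorm(t, mean = mu, sd = sigma, lower.tail = FALSE)

# Corrected likelihood: divide by the realm of possibilities.
corrected <- uncorrected / correction

# The corrected density integrates to 1 over the ascertained region.
area <- integrate(function(x) dnorm(x, mu, sigma) / correction,
                  lower = t, upper = Inf)$value

print(c(uncorrected = uncorrected, corrected = corrected, area = area))
```

Because the correction is a probability smaller than 1, dividing by it increases the likelihood of every ascertained observation.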

Page 12: Missing Data

Ascertainment in Practice

• BG Studies
• Studies of patients and controls
• Patients and relatives
• Twin pairs with at least one affected
  • Single ascertainment: pi → 0
  • Complete ascertainment: pi = 1
  • Incomplete: 0 < pi < 1
• Linkage studies
  • Affected sib pairs, DSP etc.
  • Multiple affected families

pi = probability that someone is ascertained given that they are affected

Page 13: Missing Data

Correction depends on model

1. Correction independent of model parameters: "sample weights"
2. Correction depends on model parameters: weights vary during optimization

In twin data almost always case 2
• continuous data
• binary/ordinal data

Page 14: Missing Data

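High correlation
Concordant > t

[Figure: bivariate distribution of the twins' liabilities with thresholds tx and ty dividing each axis into categories 0 and 1; the region above both thresholds is marked]

$\int_{t_x}^{\infty} \int_{t_y}^{\infty} \phi(x, y) \, dy \, dx$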

Page 15: Missing Data

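Medium correlation

[Figure: bivariate distribution of the twins' liabilities with thresholds tx and ty dividing each axis into categories 0 and 1; the region above both thresholds is marked]

$\int_{t_x}^{\infty} \int_{t_y}^{\infty} \phi(x, y) \, dy \, dx$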

Page 16: Missing Data

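Low correlation

[Figure: bivariate distribution of the twins' liabilities with thresholds tx and ty dividing each axis into categories 0 and 1; the region above both thresholds is marked]

$\int_{t_x}^{\infty} \int_{t_y}^{\infty} \phi(x, y) \, dy \, dx$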
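The three correlation slides share one integral, the probability that both twins exceed their thresholds. The sketch below evaluates it in base R for a standard bivariate normal; the function name `concordant_prob`, the correlations 0.1, 0.5 and 0.9, and the thresholds of 1 are assumptions chosen only for illustration.

```r
# P(X > tx, Y > ty) for a standard bivariate normal with correlation r.
# The inner integral over y is available in closed form because
# Y | X = x is normal with mean r * x and variance 1 - r^2.
concordant_prob <- function(r, tx, ty) {
  integrand <- function(x) {
    dnorm(x) * pnorm((ty - r * x) / sqrt(1 - r^2), lower.tail = FALSE)
  }
  integrate(integrand, lower = tx, upper = Inf)$value
}

# Assumed low, medium and high correlations with thresholds at 1.
r_values <- c(low = 0.1, medium = 0.5, high = 0.9)
probs <- sapply(r_values, concordant_prob, tx = 1, ty = 1)
print(probs)
```

The proportion of concordant pairs grows with the correlation, so a correction based on this integral depends on the model parameters and has to be recomputed during optimization.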

Page 17: Missing Data

Two approaches for twin data in Mx

• Contingency table approach
  • Automatic
  • Limited to two variable case
• Raw data approach
  • Manual
  • Multivariate
  • Moderator / Covariates

Page 18: Missing Data

Contingency Table Case
Binary data

• Feed program contingency table as usual
• Use -1 for frequency for non-ascertained cells
• Correction for ascertainment handled automatically

Page 19: Missing Data

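At least one twin affected

[Figure: bivariate distribution with thresholds tx and ty dividing each axis into categories 0 and 1]

Ascertainment Correction:

$1 - \int_{-\infty}^{t_x} \int_{-\infty}^{t_y} \phi(x, y) \, dy \, dx$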

Page 20: Missing Data

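Ascertain iff twin 1 > t

[Figure: bivariate distribution with axes Twin 1 and Twin 2, thresholds tx and ty dividing each axis into categories 0 and 1]

$\int_{t_y}^{\infty} \int_{-\infty}^{\infty} \phi(x, y) \, dx \, dy = \int_{t_y}^{\infty} \phi(y) \, dy$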
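The two ascertainment schemes on the last two slides can be compared numerically. This is a sketch in base R for a standard bivariate normal; the helper `both_below`, the correlation of 0.6 and the threshold of 1 are assumed for illustration only.

```r
# P(X < tx, Y < ty) for a standard bivariate normal with correlation r.
both_below <- function(r, tx, ty) {
  integrand <- function(x) dnorm(x) * pnorm((ty - r * x) / sqrt(1 - r^2))
  integrate(integrand, lower = -Inf, upper = tx)$value
}

r <- 0.6   # assumed twin correlation in liability
t <- 1     # assumed threshold, the same for both twins

# Pairs ascertained when at least one twin is affected:
# one minus the probability that both are below threshold.
corr_any <- 1 - both_below(r, t, t)

# Pairs ascertained only through one designated twin: integrating the
# co-twin out leaves the univariate area above the threshold.
corr_single <- pnorm(t, lower.tail = FALSE)

print(c(at_least_one = corr_any, designated_twin = corr_single))
```

Only the first correction involves the twin correlation; the second depends on the threshold alone.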

Page 21: Missing Data

Contingency Tables

• Use -1 for cells not ascertained
• Can be used for ordinal case
• Need to start thinking about thresholds
  • Supply estimated population values
  • Estimate them jointly with model

Page 22: Missing Data

Mx Syntax
Classical Twin Study: Contingency Table

ftp://views.vcu.edu/pub/mx/examples/ncbook2/categor.mx

G1: Model parameters
 Data Calc NGroups=4
 Begin Matrices;
  X Lower 1 1 Free
  Y Lower 1 1 Free
  Z Lower 1 1 Free
  W Lower 1 1
 End Matrices;
 ! parameters are fixed by default, unless declared free
 Begin Algebra;
  A= X*X';
  C= Y*Y';
  E= Z*Z';
  D= W*W';
 End Algebra;
End

Page 23: Missing Data

Mx Syntax
Group 2

G2: young female MZ twin pairs
 Data Ninput=2
 CTable 2 2
  329 83
  95 83
 Begin Matrices= Group 1
  T full 2 1 Free
 End Matrices;
 Covariances A+C+D+E | A+C+D _
             A+C+D | A+C+D+E ;
 Thresholds T ;
 Options RSidual
End

Page 24: Missing Data

Mx Syntax
Group 3

G3: young female DZ twin pairs
 Data Ninput=2
 CTable 2 2
  201 94
  82 63
 Begin Matrices= Group 1
  H Full 1 1
  Q Full 1 1
  T Full 2 1 Free
 End Matrices;
 Matrix H .5
 Matrix Q .25
 Start .6 All
 Covariances A+C+D+E | H@A+C+Q@D _
             H@A+C+Q@D | A+C+D+E /
 Thresholds T ;
 Options RSidual NDecimals=4
End

Page 25: Missing Data

Mx Syntax
Group 4

Group 4: constrain variance to 1
 Constraint NI=1
 Begin Matrices = Group 1 ;
  I unit 1 1
 End Matrices;
 Constraint I = A+C+E+D ;
 Option Multiple
End
Specify 2 t 8 8
Specify 3 t 8 8
End

Page 26: Missing Data

Raw data approach

• Correction not always necessary
  • ML
  • MCAR/MAR
  • Prediction of missingness
• Correct through weight formula

Page 27: Missing Data

What can we do with Mx?
Normal theory likelihood function for raw data in Mx

$\ln L_i = f_i \ln \left[ \sum_{j=1}^{m} w_j \, g(x_i, \mu_{ij}, \Sigma_{ij}) \right]$

$x_i$ - vector of observed scores on n subjects

$\mu_{ij}$ - vector of predicted means

$\Sigma_{ij}$ - matrix of predicted covariances - functions of parameters

Page 28: Missing Data

Likelihood Function Itself
The guts of it

$\ln L_i = f_i \ln \left[ \sum_{j=1}^{m} w_{ij} \, g(x_i, \mu_{ij}, \Sigma_{ij}) \right]$

$g(x_i, \mu_{ij}, \Sigma_{ij})$ - likelihood function

Example: Normal pdf

Page 29: Missing Data

Normal distribution $\phi(\mu_{ij}, \Sigma_{ij})$
Likelihood is height of the curve

[Figure: normal curve, likelihood (axis 0 to 0.5) against xi (axis -4 to 4)]

Page 30: Missing Data

Weighted mixture of models
Finite mixture distribution

$\ln L_i = f_i \ln \left[ \sum_{j=1}^{m} w_{ij} \, g(x_i, \mu_{ij}, \Sigma_{ij}) \right]$

j = 1....m models
$w_{ij}$ - Weight for subject i model j

e.g., Segregation analysis
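As a rough illustration of the finite mixture likelihood, the R sketch below evaluates it for a single univariate observation. This is a simplification of the multivariate normal case in the slides; the function name `mixture_loglik` and all numeric values are assumptions.

```r
# ln L_i = f_i * ln( sum_j w_ij * g(x_i, mu_ij, sigma_ij) ),
# with g taken to be the univariate normal pdf (an assumption made
# here to keep the sketch short).
mixture_loglik <- function(x, w, mu, sigma, f = 1) {
  stopifnot(abs(sum(w) - 1) < 1e-8)   # weights are mixing proportions
  f * log(sum(w * dnorm(x, mean = mu, sd = sigma)))
}

# Two component models with assumed weights and means.
w     <- c(0.3, 0.7)
mu    <- c(-1, 1)
sigma <- c(1, 1)

ll <- mixture_loglik(x = 0.5, w = w, mu = mu, sigma = sigma)
print(ll)

# A frequency of 2 counts the same observation twice.
ll_twice <- mixture_loglik(x = 0.5, w = w, mu = mu, sigma = sigma, f = 2)
print(ll_twice)
```

With a single model (m = 1, weight 1) the expression reduces to the ordinary normal log-likelihood.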

Page 31: Missing Data

Mixture of Normal Distributions
Two normals, proportions w1 & w2, different means

[Figure: two weighted normal component curves, w1 x l1 and w2 x l2, with means μ1 and μ2, plotted against xi (axis -4 to 3, vertical axis 0 to 0.5)]

But Likelihood Ratio not Chi-Squared - what is it?

Page 32: Missing Data

General Likelihood Function
Finally the frequencies

$\ln L_i = f_i \ln \left[ \sum_{j=1}^{m} w_j \, g(x_i, \mu_{ij}, \Sigma_{ij}) \right]$

$f_i$ - frequency of case i

• Sample frequencies binary data
• Sometimes 'sample weights'
• Might also vary over model j

Page 33: Missing Data

General Likelihood Function
Things that may differ over subjects

$\ln L_i = f_i \ln \left[ \sum_{j=1}^{m} w_{ij} \, g(x_i, \mu_{ij}, \Sigma_{ij}) \right]$

i = 1....n subjects (families)

• Model for Means can differ
• Model for Covariances can differ
• Weights can differ
• Frequencies can differ

Page 34: Missing Data

Raw Ordinal Data Syntax

• Read in ordinal file
• May use frequency command to save space
• Weight uses \mnor function: \mnor(R_M_U_L_K)
  • R - covariance matrix (p x p)
  • M - mean vector (1 x p)
  • U - upper threshold (1 x p)
  • L - lower threshold (1 x p)
  • K - indicator for type of integration in each dimension (1 x p)
    • 0: L = -∞; 1: U = +∞; 2: integrate from L to U; 3: L = -∞, U = +∞

Page 35: Missing Data

Mx Syntax

G1: Model parameters
 Data Calc NGroups=4
 Begin Matrices;
  X Lower 1 1 Free
  Y Lower 1 1 Free
  Z Lower 1 1 Free
  W Lower 1 1
 End Matrices;
 ! parameters are fixed by default, unless declared free
 Begin Algebra;
  A= X*X';
  C= Y*Y';
  E= Z*Z';
  D= W*W';
 End Algebra;
End

Page 36: Missing Data

Mx Syntax

G2: MZ twin pairs
 Data Ninput=3
 Ordinal File=mz.frq
 Labels T1 T2 Freq
 Definition Freq ;
 Begin Matrices= Group 1
  T full 2 1 Free
  F full 1 1 ! Frequency
 End Matrices;
 Specify F Freq
 Covariances A+C+D+E | A+C+D _
             A+C+D | A+C+D+E ;
 Thresholds T ;
 Frequency F;
 Options RSidual
End

Page 37: Missing Data

Mx Syntax

G3: DZ twin pairs
 Data Ninput=3
 Labels T1 T2 Freq
 Ordinal File=dz.frq
 Definition Freq ;
 Begin Matrices= Group 1
  H Full 1 1
  Q Full 1 1
  T Full 2 1 Free
  F full 1 1 ! Frequency
 End Matrices;
 Specify F Freq
 Matrix H .5
 Matrix Q .25
 Start .6 All
 Covariances A+C+D+E | H@A+C+Q@D _
             H@A+C+Q@D | A+C+D+E /
 Thresholds T ;
 Frequency F ;
 Options RSidual NDecimals=4
End

Page 38: Missing Data

Mx Syntax

Group 4: constrain variance to 1
 Constraint NI=1
 Begin Matrices = Group 1 ;
  I unit 1 1
 End Matrices;
 Constraint I = A+C+E+D ;
 Option Multiple
End
Specify 2 t 8 9
Specify 3 t 8 9
End

Page 39: Missing Data

Ascertainment additional commands

Begin Algebra;
 M=(A+C+E|A+C_A+C|A+C+E);
 N=(A+C+E|h@A+C_h@A+C|A+C+E);
 J=I-\mnor(M_Z_T_T_Z); ! Z = [0 0]
 K=I-\mnor(N_Z_T_T_Z); ! DZ case
End Algebra;

Weight J~; ! for MZ group
Weight K~; ! DZ group

Why inverse of J and K?
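One way to read the question on this slide, assuming that `~` denotes inversion in Mx and that `\mnor` with the indicator row Z = [0 0] integrates from minus infinity up to the thresholds T: J is then the probability that an MZ pair has at least one affected twin, and the weight J~ divides each pair's likelihood by that probability. The symbol $L_i^{*}$ for the corrected likelihood is introduced here only for illustration.

```latex
J = 1 - \int_{-\infty}^{t_x} \int_{-\infty}^{t_y} \phi(x, y) \, dy \, dx ,
\qquad
L_i^{*} = \frac{g(x_i, \mu_{ij}, \Sigma_{ij})}{J} = J^{-1} \, g(x_i, \mu_{ij}, \Sigma_{ij})
```

K plays the same role for DZ pairs, using the DZ covariance matrix N. This matches the earlier principle of dividing by the realm of possibilities.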

Page 40: Missing Data

Correcting for ascertainment
Linkage studies

• Multivariate selection: multiple integrals
  • double integral for ASP
  • four double integrals for EDAC
• Use (or extend) weight formula
• Precompute in a calculation group unless they vary by subject

Page 41: Missing Data

When ascertainment is NOT necessary

Various flavors of missing data mechanisms

MCAR: Missing completely at random

MAR: Missing at random

NMAR: Not missing at random

Little & Rubin Terminology

Page 42: Missing Data

Simulation: 3 types of missing data

• Selrand: MCAR
  • missingness function of independent random variable
• Selonx: MAR
  • missingness predicted by other measured variable in analysis
• Selony: NMAR
  • missingness mechanism related to "residual" variance in dependent variable

+ MCAR

Page 43: Missing Data

Method

• Simulate bivariate normal data X, Y
  • Sigma = [1 .5; .5 1]
  • Mu = [0, 0]
• Make some variables missing
  • Generate independent random normal variable, Z; if Z>0 then Y missing
  • If X>0 then Y missing
  • If Y>0 then Y missing
• Estimate elements of Sigma & Mu
• Constrain elements to population values 1, .5, 0 etc.
• Compare fit
• Ideally, repeat multiple times and see if expected 'null' distribution emerges

Page 44: Missing Data

Method Check to see if estimates ‘look right’

Quantify ‘look right’

Test model with parameters fixed to population values

Look at chi-squared

Repeat!

Examine distribution of chi-squareds

Page 45: Missing Data

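R simulation 'model'

[Path diagram: latent variables A, C and E, each with variance 1.00, and observed scores S1 and S2. Paths: A to S1 = sqrt(1-r); C to S1 = sqrt(r); C to S2 = sqrt(r); E to S2 = sqrt(1-r)]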

Page 46: Missing Data

R simulation script: selrand.R

r <- 0.5                   # Correlation (r)
N <- 1000                  # Number of pairs of scores
outfile <- "selrand.rec"   # output file for the simulation

vals <- matrix(rnorm(N * 3), , 3)   # 3 random normals per pair of observed scores

rootr <- sqrt(r)           # Square root of r
root1mr <- sqrt(1 - r)     # Square root of (1-r)

# Data generation follows:
pairs <- matrix(cbind(rootr * vals[, 1] + root1mr * vals[, 2],
                      rootr * vals[, 1] + root1mr * vals[, 3]), , 2)

# pairs now contains pairs of scores drawn from population
# with unit variances and correlation r
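The claim in the closing comment of the script can be verified directly. Writing $Z_1, Z_2, Z_3$ for the three independent standard normal columns of `vals` (these symbols are introduced here for illustration), the two scores are:

```latex
S_1 = \sqrt{r}\, Z_1 + \sqrt{1-r}\, Z_2 , \qquad
S_2 = \sqrt{r}\, Z_1 + \sqrt{1-r}\, Z_3

\operatorname{Var}(S_1) = \operatorname{Var}(S_2) = r + (1 - r) = 1 , \qquad
\operatorname{Cov}(S_1, S_2) = \sqrt{r}\,\sqrt{r}\,\operatorname{Var}(Z_1) = r
```

With r = 0.5 this gives unit variances and a correlation of 0.5, the population values used later when the model parameters are fixed.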

Page 47: Missing Data

R simulation script: selrand.R

# Now to make some of the scores missing, using a loop
for (i in 1:N) {
  x <- rnorm(1, 0, 1)   # x is an independent random number from N(0,1)
  if (x > 0) pairs[i, 2] <- NA
}
write.table(pairs, file = outfile, row.names = FALSE, na = ".", col.names = FALSE)

Page 48: Missing Data

R simulation script: selonx.R

# Now to make some of the scores missing, using a loop
for (i in 1:N) {
  if (pairs[i, 1] > 0) pairs[i, 2] <- NA
}
write.table(pairs, file = outfile, row.names = FALSE, na = ".", col.names = FALSE)

# note how missingness of 2nd column of pairs depends on
# value of 1st column

Page 49: Missing Data

R simulation script: selony.R

# Now to make some of the scores missing, using a loop
for (i in 1:N) {
  if (pairs[i, 2] > 0) pairs[i, 2] <- NA
}
write.table(pairs, file = outfile, row.names = FALSE, na = ".", col.names = FALSE)

# note how missingness of 2nd column of pairs depends on
# value of 2nd column itself

Page 50: Missing Data

Mx Script
Rather basic, like Monday morning

Estimate pop cov matrix of X&Y, with Y observed iff X>0
 Data ngroups=1 ninput=2
 Rectangular file=selonx.rec
 Begin Matrices;
  a sy 2 2 free ! covariance of x,y
  m fu 1 2 free ! mean of x,y
 End Matrices;
 Means M /
 Covariance A /
 matrix a 1 .3 1
 bound .1 2 a 1 1 a 2 2
 option rs mu issat
end

fix all
matrix 1 a 1 .5 1
matrix 1 m 0 0
end

Page 51: Missing Data

Mx Scripts & Data

• Check output:
  • Summary statistics (obs means)
  • Estimated means & covariance matrices
  • Difference in fit between estimated values and population values
• Interpretation?

F:\mcn\2006\sel

Page 52: Missing Data

ML estimation under different missingness mechanisms

Missingness             mean x    mean y    var x    cov xy   var y    LR Chisq
MCAR (rand)  MLE
             <sample>
MAR (on x)   MLE
             <sample>
NMAR (on y)  MLE
             <sample>

Page 53: Missing Data

ML estimation under different missingness mechanisms

Missingness             mean x    mean y    var x    cov xy   var y    LR Chisq
MCAR (rand)  MLE       -0.0116   -0.1       1.0505   0.4998   0.8769     6.492
             sample    -0.0116   -0.0919    1.0505            0.8839
MAR (on x)   MLE        0.0048    0.0998    1.0084   0.4481   1.1025     5.768
             sample     0.0014    0.4437    1.0084            0.9762
NMAR (on y)  MLE       -0.0204    0.6805    0.9996   0.1356   0.2894   227.262
             sample     0.0448    0.7373    0.9996            0.2851
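To quantify how the three fits compare, the LR chi-squared values can be referred to a chi-squared distribution. The sketch below assumes 5 degrees of freedom, on the reading that the comparison model fixes all five parameters (two means, two variances, one covariance) at their population values; the three statistics are copied from the table.

```r
# LR chi-squared values copied from the table above.
lr_chisq <- c(MCAR = 6.492, MAR = 5.768, NMAR = 227.262)

# Assumption: five parameters are fixed to population values.
df <- 5

critical <- qchisq(0.95, df)                          # 5% critical value
p_values <- pchisq(lr_chisq, df, lower.tail = FALSE)  # upper-tail p-values
reject   <- lr_chisq > critical

print(round(critical, 3))
print(data.frame(lr_chisq, p_value = signif(p_values, 3), reject))
```

Under this assumption only the NMAR run is inconsistent with the population values. A single replicate is only suggestive, which is why the method calls for repeating the simulation and examining the distribution of chi-squareds.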

Page 54: Missing Data

BG Studies: Screen + Examination
Only a subset, selected on basis of screen, are examined

• Bivariate analysis of screen & exam
  • No ascertainment correction required
  • Example: all pairs where at least one screens positive are examined
  • Works for continuous & ordinal
• Undersampling: some proportion of pairs concordant negative for screen are also examined
  • Ascertainment correction required
  • Different correction for screen -- vs +-/-+/++

Page 55: Missing Data

Other examples of necessarily missing data

• Consider:
  • Measures of abuse/dependence in non-users
  • Age at onset in the healthy
  • Symptoms of schizophrenia in non-schizophrenics
• How do we know whether, e.g.,
  • Drug use is related to drug abuse
  • Age at onset predicts severity

Page 56: Missing Data

Two binary variables X and Y
Amnesiac researcher, forgot to get data from relatives

• Both X and Y observed
  • XY outcomes: 00, 01, 10, 11
• But Y observed only if X=1
  • XY outcomes: 0?, 10, 11
• Only three cell frequencies: "0/1/2"
  • No use / Use with no abuse / Use and abuse
• No variance in use when abuse is observed

Page 57: Missing Data

Two binary variables in twins
Enlightened researcher, good memory, has data from relatives

[Table: Twin 1 by Twin 2 cross-classification, each twin scored No/Yes on both variables; the 16 cells hold the response patterns below]

1 0 1 0   1 0 1 1   1 1 1 0   1 1 1 1
0 0 0 0   0 0 0 1   0 1 0 0   0 1 0 1
0 0 1 0   0 0 1 1   0 1 1 0   0 1 1 1
1 0 0 0   1 0 0 1   1 1 0 0   1 1 0 1

Page 58: Missing Data

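Censored case
Y iff X=1

[Table: Twin 1 by Twin 2 cross-classification, each twin scored No/Yes; "?" marks a Y score that is not observed because X=0; the cells hold the response patterns below]

0 ? 0 ?   0 ? 1 0   0 ? 1 1
1 0 0 ?   1 1 0 ?
1 0 1 0   1 0 1 1   1 1 1 0   1 1 1 1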
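The reduction from 16 complete patterns to 9 observable ones can be reproduced by enumeration. The R sketch below assumes that each four-digit pattern is ordered X and Y for Twin 1 followed by X and Y for Twin 2; the column names are illustrative.

```r
# All 16 complete response patterns for two binary variables in a pair.
patterns <- expand.grid(X1 = 0:1, Y1 = 0:1, X2 = 0:1, Y2 = 0:1)

# Censoring: Y is observed only if X = 1 for the same twin.
censored <- patterns
censored$Y1[censored$X1 == 0] <- NA
censored$Y2[censored$X2 == 0] <- NA

# Distinct patterns that can actually be observed.
observable <- unique(censored)

print(nrow(patterns))
print(nrow(observable))
print(observable, row.names = FALSE)
```

Each twin has three observable states (X=0 with Y unknown, X=1 with Y=0, X=1 with Y=1), so a pair has 3 x 3 = 9 observable patterns.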

Page 59: Missing Data

Can estimate parameters!

• Correlation for Initiation
• Correlation for Dependence
• Correlation Twin 1 initiation with Twin 2 Dependence ("proxy")
• Indirectly assess correlation between initiation and dependence

[Path diagram: A Twin 1 and A Twin 2, each with variance VA and correlated rA; B Twin 1 and B Twin 2, each with variance VB* and correlated rB*; a path b from A to B within each twin]

Page 60: Missing Data

Multivariate case: Stem and probe
Stem = Yes

[Path diagram: factor F1 with loadings l1 to l6 on the items STEM, P2, P3, P4, P5, P6]

Page 61: Missing Data

Stem and probe items
Stem = No: Probes Missing

[Path diagram: factor F1 with loadings l1 to l6 on the items STEM, P2, P3, P4, P5, P6]

Page 62: Missing Data

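Proxy information from cotwin identifies the model

[Path diagram: factors F1 and F2, one per twin, each with loadings l1 to l6 on that twin's items STEM, P2, P3, P4, P5, P6; the two factors are correlated, R > 0]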

Page 63: Missing Data

Conclusion

Be careful when designing studies with non-random ascertainment

Usually possible to correct

In principle, heritability should not change

In practice, it might