TRANSCRIPT
1
Combining Data in Inference – Calibration vs.
Application of the Missing Information Principle
Ray Chambers University of Wollongong
Presentation at the Workshop on Survey Sampling in Honour of Jean-Claude Deville, Neuchâtel, June 24-26, 2009
2
Overview
• The Missing Information Principle
• Problem 1: Linear regression with marginal population information
• Pseudo-likelihood inference
• Calibrated weighting + pseudo-likelihood for Problem 1
• Empirical comparisons
• Partial marginal information
• Problem 2: Logistic regression with marginal population information
• Empirical comparisons
3
The Missing Information Principle – A General Paradigm for Combining Data for Likelihood Inference
Likelihood-based inference using a 'messy' observed dataset d_s is equivalent to likelihood-based inference using a larger 'clean' but unobserved dataset d_U, provided the sufficient statistics defined by d_U are replaced by their expected values given d_s
Note
1. It doesn't matter what d_U is. The only requirement is that d_s (the data we have) is a subset of d_U (the data we would like to have)
2. First developed (Orchard and Woodbury, 1972) for inference with missing data; forms the basis of the EM algorithm (Dempster, Laird and Rubin, 1977), which is widely used with missing data
3. Applied to the analysis of survey data by Breckling et al. (1994)
4
The Inference Framework
• Population vector y_U generated as a 'random draw' realisation of a random vector Y_U with density parameterised by an unknown θ:
\[ Y_U \sim f(y_U; \theta) \]
• The available data are d_s = {y_resp, r_s, i_U, z_U, g(w_U)}, where
  y_resp = respondents' values of Y
  r_s = response indicators for sampled units
  i_U = sample inclusion indicators
  z_U = population values of auxiliary variables
  g(w_U) = population summary statistics
• The ideal data are d_U = {y_U, r_U, i_U, z_U, w_U} ~ f(d_U; θ), where the density f(d_U; θ) has a 'simple' structure
5
MIP Provided the ideal data include the available data, the available data score sc_s for θ is the conditional expectation, given these data, of the ideal data score sc_U for θ, i.e.
\[ sc_s = E\{sc_U \mid d_s\} = E_s(sc_U) \]
Furthermore, the available data information info_s for θ is the conditional expectation, given these data, of the ideal data information info_U for θ, minus the corresponding conditional variance of the ideal data score sc_U, i.e.
\[ info_s = E\{info_U \mid d_s\} - \mathrm{Var}\{sc_U \mid d_s\} = E_s(info_U) - \mathrm{Var}_s(sc_U) \]
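As a toy illustration (mine, not from the slides): for Y_i iid N(θ, 1) with only the first n of N values observed and no other auxiliary data, the two identities reduce to the familiar observed-data quantities:

```latex
\[
sc_U = \sum_{i=1}^{N}(y_i-\theta), \qquad info_U = N .
\]
Non-sampled values satisfy $E(y_i \mid d_s) = \theta$ under non-informative
sampling, so
\[
sc_s = E\{sc_U \mid d_s\} = \sum_{i=1}^{n}(y_i-\theta),
\qquad
info_s = N - \mathrm{Var}\{sc_U \mid d_s\} = N-(N-n) = n,
\]
which is exactly the score and information from the $n$ observed values alone.
```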
6
Linear Regression with Marginal Population Information
Motivating Scenario
Population U is such that the values yi and xi of two scalar variables, Y and X, are stored on separate registers, each of size N. A sample s of n units from one register is linked to the other via a unique common identifier, thus defining n matched (yi, xi) pairs
Aim To use these linked sample data to estimate the parameters α, β and σ² that characterise the population regression model
\[ y_i = \alpha + \beta x_i + \sigma \varepsilon_i, \qquad \varepsilon_i \sim \text{iid } N(0,1) \]
7
Assumption Y is independent of sample inclusion indicator I given X. That is, sampling method is non-informative for regression
parameters of interest
Maximum likelihood estimators for α, β and σ² that are based on the linked data only are the usual 'sample-based' MLEs
\[ \hat\beta_{smle} = \frac{\sum_s (x_i - \bar x_s)(y_i - \bar y_s)}{\sum_s (x_i - \bar x_s)^2}, \qquad \hat\alpha_{smle} = \bar y_s - \hat\beta_{smle}\,\bar x_s \]
\[ \hat\sigma^2_{smle} = n^{-1} \sum_s \big( y_i - \hat\alpha_{smle} - \hat\beta_{smle}\, x_i \big)^2 \]
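A minimal numpy sketch of these sample-based MLEs (function and variable names are mine, not from the slides):

```python
import numpy as np

def sample_mle(y, x):
    """'Sample-based' MLEs for (alpha, beta, sigma^2) in y = alpha + beta*x + sigma*e."""
    n = len(y)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))  # sum_s (x_i - xbar_s)(y_i - ybar_s)
    sxx = np.sum((x - x.mean()) ** 2)              # sum_s (x_i - xbar_s)^2
    beta = sxy / sxx
    alpha = y.mean() - beta * x.mean()
    sigma2 = np.sum((y - alpha - beta * x) ** 2) / n  # ML divisor n, not n - 2
    return alpha, beta, sigma2
```

For α and β this is ordinary least squares; only the σ² divisor differs from the usual unbiased estimator.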
8
Auxiliary Information Register summary data are available. In particular, we know the population means ȳ_U and x̄_U of Y and X...
- 'sample-based' MLEs are no longer the 'full information' MLEs for α, β and σ²…
- Can use the MIP to combine this population marginal information with the survey data to obtain full information MLEs
9
Components of the available data score function are then
\[ sc_{1s} = \sigma^{-2} \sum_U E_s\{(y_i - \alpha - \beta x_i)\} \]
\[ sc_{2s} = \sigma^{-2} \sum_U x_i\, E_s\{(y_i - \alpha - \beta x_i)\} \]
\[ sc_{3s} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_U \Big[ E_s\{(y_i - \alpha - \beta x_i)\}^2 + \mathrm{Var}_s(y_i - \alpha - \beta x_i) \Big] \]
Here a subscript of s denotes conditioning on the available data, i.e.
the sample values of Y and X and their corresponding population
means
10
If we know the population mean ȳ_U and the sample mean ȳ_s, we know the non-sample mean ȳ_{U−s}. Straightforward to then show that, for a non-sampled unit i,
\[ y_i \mid x_{U-s}, \bar y_{U-s} \;\sim\; N\!\left( \bar y_{U-s} + \beta(x_i - \bar x_{U-s}),\; \sigma^2\Big(1 - \frac{1}{N-n}\Big) \right) \]
Leads to an available data score with components
\[ sc_{1s} = \sigma^{-2}\Big\{ \sum_s (y_i - \alpha - \beta x_i) + (N-n)\big(\bar y_{U-s} - \alpha - \beta \bar x_{U-s}\big) \Big\} \]
\[ sc_{2s} = \sigma^{-2}\Big\{ \sum_s x_i (y_i - \alpha - \beta x_i) + (N-n)\,\bar x_{U-s}\big(\bar y_{U-s} - \alpha - \beta \bar x_{U-s}\big) \Big\} \]
\[ sc_{3s} = -\frac{n+1}{2\sigma^2} + \frac{1}{2\sigma^4}\Big\{ \sum_s (y_i - \alpha - \beta x_i)^2 + (N-n)\big(\bar y_{U-s} - \alpha - \beta \bar x_{U-s}\big)^2 \Big\} \]
11
Full Information MLEs
\[ \hat\beta_{fimle} = \frac{\sum_s (x_i - \bar x_s)(y_i - \bar y_s) + n\bar x_s(\bar y_s - \bar y_U) + (N-n)\bar x_{U-s}(\bar y_{U-s} - \bar y_U)}{\sum_s (x_i - \bar x_s)^2 + n\bar x_s(\bar x_s - \bar x_U) + (N-n)\bar x_{U-s}(\bar x_{U-s} - \bar x_U)} \]
\[ \hat\alpha_{fimle} = \bar y_U - \hat\beta_{fimle}\,\bar x_U \]
and
\[ \hat\sigma^2_{fimle} = \frac{1}{n+1}\Big\{ \sum_s \big( y_i - \hat\alpha_{fimle} - \hat\beta_{fimle}\, x_i \big)^2 + (N-n)\big( \bar y_{U-s} - \hat\alpha_{fimle} - \hat\beta_{fimle}\,\bar x_{U-s} \big)^2 \Big\} \]
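These closed forms can be transcribed directly; a hedged numpy sketch (names are mine; the non-sample means are recovered from the known population means):

```python
import numpy as np

def fimle(y_s, x_s, N, ybar_U, xbar_U):
    """Full-information MLEs for (alpha, beta, sigma^2) when the population
    means of Y and X are known; 'r' denotes non-sample (U - s) means."""
    n = len(y_s)
    ybar_s, xbar_s = y_s.mean(), x_s.mean()
    # Non-sample means implied by the known population means
    ybar_r = (N * ybar_U - n * ybar_s) / (N - n)
    xbar_r = (N * xbar_U - n * xbar_s) / (N - n)
    num = (np.sum((x_s - xbar_s) * (y_s - ybar_s))
           + n * xbar_s * (ybar_s - ybar_U)
           + (N - n) * xbar_r * (ybar_r - ybar_U))
    den = (np.sum((x_s - xbar_s) ** 2)
           + n * xbar_s * (xbar_s - xbar_U)
           + (N - n) * xbar_r * (xbar_r - xbar_U))
    beta = num / den
    alpha = ybar_U - beta * xbar_U
    sigma2 = (np.sum((y_s - alpha - beta * x_s) ** 2)
              + (N - n) * (ybar_r - alpha - beta * xbar_r) ** 2) / (n + 1)
    return alpha, beta, sigma2
```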
12
Note The FIML estimators are identical to the estimators defined by a weighted least squares fit to an extended sample consisting of
• the n data values in s (each with weight equal to 1)
• an additional data value (with weight equal to N – n) defined by the known non-sample means ȳ_{U−s} and x̄_{U−s}
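This equivalence is easy to check numerically: fit WLS to the n sample points (weight 1) plus a single pseudo-observation at the non-sample means with weight N − n. A sketch with illustrative names:

```python
import numpy as np

def fimle_wls(y_s, x_s, N, ybar_U, xbar_U):
    """FIMLE of (alpha, beta) via WLS on the 'extended sample': the n sample
    points with weight 1, plus one pseudo-point at the known non-sample
    means (xbar_r, ybar_r) carrying weight N - n."""
    n = len(y_s)
    ybar_r = (N * ybar_U - n * y_s.mean()) / (N - n)
    xbar_r = (N * xbar_U - n * x_s.mean()) / (N - n)
    w = np.append(np.ones(n), N - n)                      # case weights
    X = np.column_stack([np.ones(n + 1), np.append(x_s, xbar_r)])
    Y = np.append(y_s, ybar_r)
    sw = np.sqrt(w)[:, None]                              # sqrt-weight rows
    coef, *_ = np.linalg.lstsq(X * sw, Y * sw.ravel(), rcond=None)
    return coef[0], coef[1]  # alpha, beta
```

Because the weighted fit passes through the weighted mean point, which here is exactly (x̄_U, ȳ_U), the identity α̂ = ȳ_U − β̂ x̄_U holds automatically.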
13
Variances Can use the MIP information identity, but it is easier to use the WLS formulation to write down the variances of β̂_fimle and α̂_fimle
\[ \mathrm{Var}(\hat\beta_{fimle}) = \frac{\sigma^2}{n}\cdot\frac{ \bar x_s^{(2)} - (1 - nN^{-1})\big(\bar x_s^{(2)} - \bar x_{U-s}^2\big) }{ \bar x_s^{(2)} - \bar x_{U-s}^2 + Nn^{-1}\big(\bar x_{U-s}^2 - \bar x_U^2\big) } \]
\[ \mathrm{Var}(\hat\alpha_{fimle}) = \frac{ n^{-1}\sigma^2\, \bar x_s^{(2)} }{ \bar x_s^{(2)} - \bar x_s^2 + N^{-1}(N-n)\big(\bar x_s - \bar x_{U-s}\big)^2 } \]
Here x̄_s^{(2)} denotes the sample mean of the x_i².
• Var(β̂_fimle) ≤ Var(β̂_smle) – equality only if x̄_s^{(2)} = x̄_s x̄_{U−s} (very unlikely)
• Var(α̂_fimle) ≤ Var(α̂_smle) – equality only if x̄_s = x̄_{U−s} (more likely)
14
An Alternative: Pseudo-Likelihood Inference
Kish and Frankel (1974), Binder (1983), Godambe and Thompson (1986), Pfeffermann (1993)
f(y_U; θ) = probability density of population Y values
• If y_U were observed, θ would be estimated by the solution of sc_U(θ) = 0
• For any specified value of θ, sc_U defines a finite population parameter (the 'census score'), which we can estimate from the sample data
sc_w = sample-weighted estimator of sc_U
The maximum pseudo-likelihood estimator of θ is the solution to sc_w(θ) = 0
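For the linear model of Problem 1 the weighted score equations are just weighted least squares; a sketch with illustrative names, where w would typically be the inverse inclusion probabilities:

```python
import numpy as np

def pseudo_mle(y, x, w):
    """Maximum pseudo-likelihood for the normal linear model: solve the
    sample-weighted score equations, i.e. weighted least squares with
    survey weights w."""
    W = np.sum(w)
    xbar_w = np.sum(w * x) / W
    ybar_w = np.sum(w * y) / W
    beta = np.sum(w * (x - xbar_w) * (y - ybar_w)) / np.sum(w * (x - xbar_w) ** 2)
    alpha = ybar_w - beta * xbar_w
    sigma2 = np.sum(w * (y - alpha - beta * x) ** 2) / W
    return alpha, beta, sigma2
```

With equal weights this reduces to the sample-based MLEs of slide 7.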
15
Weight Calibration + Pseudo-Likelihood Assume SRSWOR. There are three calibration constraints:
- the population size N
- the population mean of X
- the population mean of Y
\[ w^{cal} = \frac{N}{n}\mathbf{1}_n + N\,[\,\mathbf{1}_n \;\; y_s \;\; x_s\,] \begin{bmatrix} \mathbf{1}_n'\mathbf{1}_n & \mathbf{1}_n' y_s & \mathbf{1}_n' x_s \\ y_s'\mathbf{1}_n & y_s' y_s & y_s' x_s \\ x_s'\mathbf{1}_n & x_s' y_s & x_s' x_s \end{bmatrix}^{-1} \begin{bmatrix} 0 \\ \bar y_U - \bar y_s \\ \bar x_U - \bar x_s \end{bmatrix} \]
\[ \hat\beta_{cal} = \Big\{ \sum_s w_i^{cal}\, x_i (x_i - \bar x_{ws}) \Big\}^{-1} \sum_s w_i^{cal}\, x_i (y_i - \bar y_{ws}) \]
\[ \hat\alpha_{cal} = \bar y_{ws} - \hat\beta_{cal}\, \bar x_{ws} \]
\[ \hat\sigma^2_{cal} = N^{-1} \sum_s w_i^{cal} \big( y_i - \hat\alpha_{cal} - \hat\beta_{cal}\, x_i \big)^2 \]
Here x̄_ws and ȳ_ws denote the calibration-weighted sample means of X and Y.
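These weights are an instance of generic linear (chi-squared distance) calibration, which can be sketched as follows (function name is mine). For the slide's setup the design weights are d_i = N/n, the benchmark columns are (1, y_i, x_i), and the targets are (N, Nȳ_U, Nx̄_U):

```python
import numpy as np

def calibrate(d, Z, totals):
    """Linear calibration: minimally adjust design weights d (chi-squared
    distance) so the calibrated totals of the columns of Z hit `totals`
    exactly, i.e. Z'w = totals."""
    lam = np.linalg.solve(Z.T @ (d[:, None] * Z), totals - Z.T @ d)
    return d * (1.0 + Z @ lam)
```

Note that calibrating on Y makes the weights depend on the survey response itself, which is the source of the instability discussed on slide 21.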
16
Model-Based Simulations
Data model y_i = 5 + x_i + ε_i with x_i ~ LN(0, 1) and ε_i ~ N(0, 1)
Sampling methods
  SRS: π_i = nN⁻¹
  PPX: π_i = nN⁻¹ x_i x̄_U⁻¹
  PPY: π_i = nN⁻¹ y_i ȳ_U⁻¹
Auxiliary data ȳ_U, x̄_U
1000 independent repetitions of population generation followed by sample selection
17
SRS % relative efficiencies with respect to 5% trimmed RMSE of unweighted SMLE

Parameter   N = 500, n = 20    N = 1000, n = 50    N = 5000, n = 200
            CAL     MIP        CAL     MIP         CAL     MIP
β           103     134        127     145         143     150
α            81     106         90     102          96     101
σ²           84     102         94     100          99     100
18
PPX % relative efficiencies with respect to 5% trimmed RMSE of π-weighted PMLE

Parameter   N = 500, n = 20    N = 1000, n = 50    N = 5000, n = 200
            CAL     MIP        CAL     MIP         CAL     MIP
β            75     477         70     502          76     623
α            27     198         26     225          28     270
σ²           58     131         70     143          82     146
19
PPY % relative efficiencies with respect to 5% trimmed RMSE of π-weighted PMLE

Parameter   N = 500, n = 20    N = 1000, n = 50    N = 5000, n = 200
            CAL     MIP        CAL     MIP         CAL     MIP
β           118     201        143     210         159     222
α            63     109         73     110          81     117
σ²           78     106         89     106          91     111
20
Observations
• MIP-based approach is efficient (and robust to informativeness)
  - As expected, the MIP-based estimate of β benefits most from the auxiliary information. However, there are non-negligible gains for the MIP-based estimates of α and σ² as well
  - These gains are most substantial under PPX sampling. However, they are still substantial when a highly informative sampling method (PPY) is used
  - Though not shown, gains decrease with increasing error in the auxiliary information
• Pseudo-likelihood + calibrated weighting does not perform well, except when used to estimate β in the SRS case. In other cases, it appears preferable to use 'standard' selection weights in the maximum pseudo-likelihood estimator
21
Why Does Calibration Have Problems?
Look at distribution of calibrated weights for one simulation of PPX
[Figure: distributions of calibrated weights. A = inverse probability weighting, YX = calibration on both Y and X, Y = calibration on Y only, X = calibration on X only]
22
What If We Have Partial Population Summary Information?
We know x̄_U but not ȳ_U (+ sampling is non-informative given X)
FIMLE = SMLE, since knowing x̄_{U−s} tells us nothing more about the conditional distribution of Y given X in the non-sampled part of the population than we already know from the sample data
We can calibrate our sample weights to x̄_U, but it provides no extra efficiency (and will typically decrease efficiency) for estimating the regression of Y on X
23
If we know ȳ_U but not x̄_U, things are a little different...
The available data score depends on the mean and variance of x̄_{U−s} given the available data (including ȳ_{U−s}). In large samples x̄_{U−s}, x̄_s, ȳ_{U−s} and ȳ_s will be approximately jointly normal, so we can write down this conditional mean and variance, which will depend on the parameters of the regression model. Using simple 'plug-in' estimates derived from the sample data then allows us to approximate the available data score and hence approximate the MLE
Calibration is straightforward – we just calibrate to ȳ_U
24
Simulation Results Where Only ȳ_U is Known (% Rel Eff)

Parameter  Sampling   N = 500, n = 20   N = 1000, n = 50   N = 5000, n = 200
           Method     CAL    MIP        CAL    MIP         CAL    MIP
β          SRS         89     85         99    121         101    134
           PPX         91    193         95    172          94    128
           PPY         91    136         97    134          97     96
α          SRS         88     47         94     66          97     80
           PPX         64    155         71    183          67    224
           PPY         67     76         79     90          87     91
σ²         SRS         93     85         97     99         100    100
           PPX         88    126         91    133          95    141
           PPY         94    104         98    106          99    111

The calibration strategy is still inefficient, while the MIP approach does give gains, though not consistently
25
Extension to Logistic Regression
Chambers and Wang (2008)
Y = zero-one variable
Population level model
\[ \pi(x_i) = \Pr( y_i = 1 \mid x_i ) = \frac{\exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)} \]
Auxiliary Information Basic assumption is that ȳ_U is known, so the non-sample total t_{U−s} of Y (i.e. the number of non-sample 'successes') is known
26
Scenario 1 Individual non-sample values of X are also known (i.e. the X-register is available)
Components of the full information score
\[ sc_{1s} = \sum_U y_i - \sum_U \pi(x_i) \]
\[ sc_{2s} = \sum_s x_i \{ y_i - \pi(x_i) \} + E_s\Big( \sum_{U-s} x_i y_i \Big) - \sum_{U-s} x_i\, \pi(x_i) \]
where E_s denotes conditioning on the available data
FIMLE is defined by calculating a saddlepoint approximation to sc_{2s}
Note FIMLE depends on individual non-sample X values
27
Scenario 2 Further auxiliary information consists only of the value of x̄_U, so the non-sample mean x̄_{U−s} of X is known, but not the individual X-values
If sampling is SRS, we can approximate Σ_U π(x_i) and Σ_{U−s} x_i π(x_i) in the population level score using a 'smearing' argument (Duan, 1983)
\[ \frac{1}{N-n}\sum_{U-s} g(x_i) = \frac{1}{N-n}\sum_{U-s} g\big\{ \bar x_{U-s} + (x_i - \bar x_{U-s}) \big\} \approx \frac{1}{n}\sum_s g\big\{ \bar x_{U-s} + (x_i - \bar x_s) \big\} \]
This is combined with the saddlepoint approximation to calculate the FIMLE
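The smearing step is simple to code; a sketch (names mine), assuming SRS so that the sample x-deviations can stand in for the non-sample ones:

```python
import numpy as np

def smear_nonsample_mean(g, x_s, xbar_r):
    """Duan-style smearing approximation to (N - n)^{-1} sum_{U-s} g(x_i):
    shift the sample x-values so they are centred at the known non-sample
    mean xbar_r, then average g over the shifted points."""
    return np.mean(g(xbar_r + (x_s - np.mean(x_s))))
```

With g the identity the approximation is exact by construction, since the shifted points average to x̄_{U−s}; for nonlinear g (such as the logistic π) it borrows the shape of the sample x-distribution.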
28
Simulation: Linear Logistic Model + SRSWOR
Percent relative efficiencies with respect to 5% trimmed RMSE of SMLE. Values of X drawn from the standard lognormal distribution. In all cases N = 5000 and n = 200

True (α, β)        (–3, 1)   (–5, 2)   (–5, 1)   (–8, 2)
α   Scenario 1       116       112       121       114
    Scenario 2       112       108       114       115
β   Scenario 1       101       102       105       110
    Scenario 2       102       104       106       116
29
Simulation: Linear Logistic Model + Case Control Sampling
Cases are sampled differentially from controls (2 strata). Values in the table are percent relative efficiencies with respect to the 5% trimmed root mean squared error of the stratified pseudo-likelihood estimator. In all cases N = 5000 and n1 = n0 = 100. Values of X drawn from the standard lognormal distribution.

True (α, β)        (–3, 1)   (–5, 2)   (–5, 1)   (–8, 2)
α   Scenario 1       108       112       144       161
    Scenario 2       106       109       120       126
β   Scenario 1       121       123       191       189
    Scenario 2       113       112       128       127
    SMLE             106       108       127       129

Note
1. The saddlepoint approximation and smearing estimator used in MIP are adjusted for case-control sampling. See Chambers and Wang (2008)
2. The SMLE of the slope coefficient (from an unweighted logistic fit) is the standard MLE approximation used in case-control sampling
30
Conclusions/Lessons
• Incorporating auxiliary information via the MIP can significantly improve efficiency, and seems to be reasonably robust to measurement errors in this information
  - gains generally increase as the 'granularity' of the auxiliary information increases
• One needs to be very careful when incorporating auxiliary information via calibrated weighting. This is appropriate IF our targets of inference are parameters of marginal distributions AND the implied linear model induced by the calibration constraints is valid AND these constraints are accurately specified
31
References
Binder, D.A. (1983). On the variances of asymptotically normal estimators from complex surveys. International Statistical Review, 51, 279-292.
Breckling, J.U., Chambers, R.L., Dorfman, A.H., Tam, S.M. and Welsh, A.H. (1994). Maximum likelihood inference from survey data. International Statistical Review, 62, 349-363.
Chambers, R. and Wang, S. (2008). Maximum likelihood logistic regression with auxiliary information. Working Paper 12-08, Centre for Statistical and Survey Methodology, University of Wollongong.
Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39, 1-37.
Dorfman, A., Chambers, R. and Wang, S. (2002). Are survey weights necessary? The maximum likelihood approach to sample survey inference. Proceedings of the 162nd Annual Meeting of the American Statistical Association, New York, August 11-15.
Duan, N. (1983). Smearing estimate: a nonparametric retransformation method. Journal of the American Statistical Association, 78, 605-610.
Godambe, V.P. and Thompson, M.E. (1986). Parameters of super populations and survey population: their relationship and estimation. International Statistical Review, 54, 37-59.
Kish, L. and Frankel, M.R. (1974). Inference from complex samples (with discussion). Journal of the Royal Statistical Society, Series B, 36, 1-37.
Orchard, T. and Woodbury, M.A. (1972). A missing information principle: theory and application. Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability, 1, 697-715.
Pfeffermann, D. (1993). The role of sampling weights when modelling survey data. International Statistical Review, 61, 317-337.