TRANSCRIPT
1
Combining Data in Inference – Calibration vs.
Application of the Missing Information Principle
Ray Chambers University of Wollongong
Presentation at the Workshop on Survey Sampling in Honour of Jean-Claude Deville, Neuchâtel, June 24-26, 2009
2
Overview
• The Missing Information Principle
• Problem 1: Linear regression with marginal population information
• Pseudo-likelihood inference
• Calibrated weighting + pseudo-likelihood for Problem 1
• Empirical comparisons
• Partial marginal information
• Problem 2: Logistic regression with marginal population information
• Empirical comparisons
3
The Missing Information Principle – A General Paradigm for Combining Data for Likelihood Inference
Likelihood-based inference using a 'messy' observed dataset d_s is equivalent to likelihood-based inference using a larger 'clean' but unobserved dataset d_U, provided the sufficient statistics defined by d_U are replaced by their expected values given d_s
Note
1. It doesn't matter what d_U is. The only requirement is that d_s (the data we have) is a subset of d_U (the data we would like to have)
2. First developed (Orchard and Woodbury, 1972) for inference with missing data; forms the basis of the EM algorithm (Dempster, Laird and Rubin, 1977), which is widely used with missing data
3. Applied to the analysis of survey data by Breckling et al. (1994)
4
The Inference Framework
• Population vector y_U generated as a 'random draw' realisation of a random vector Y_U with density parameterised by an unknown θ:
\[ Y_U \sim f(y_U; \theta) \]
• The available data are d_s = {y_resp, r_s, i_U, z_U, g(w_U)}, where
  y_resp = respondents' values of Y
  r_s = response indicators for sampled units
  i_U = sample inclusion indicators
  z_U = population values of auxiliary variables
  g(w_U) = population summary statistics
• The ideal data are d_U = {y_U, r_U, i_U, z_U, w_U} ~ f(d_U; θ), where the density f(d_U; θ) has a 'simple' structure
5
MIP Provided the ideal data include the available data, the available data score sc_s for θ is the conditional expectation, given these data, of the ideal data score sc_U for θ, i.e.
\[ sc_s = E\{sc_U \mid d_s\} = E_s(sc_U) \]
Furthermore, the available data information info_s for θ is the conditional expectation, given these data, of the ideal data information info_U for θ, minus the corresponding conditional variance of the ideal data score sc_U, i.e.
\[ info_s = E\{info_U \mid d_s\} - \mathrm{Var}\{sc_U \mid d_s\} = E_s(info_U) - \mathrm{Var}_s(sc_U) \]
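As a toy illustration (mine, not from the slides): for Y_i iid N(θ, 1) with only the first n of N values observed and no other auxiliary data, the two identities reduce to the familiar observed-data quantities:

```latex
\[
sc_U = \sum_{i=1}^{N}(y_i-\theta), \qquad info_U = N .
\]
Non-sampled values satisfy $E(y_i \mid d_s) = \theta$ under non-informative
sampling, so
\[
sc_s = E\{sc_U \mid d_s\} = \sum_{i=1}^{n}(y_i-\theta),
\qquad
info_s = N - \mathrm{Var}\{sc_U \mid d_s\} = N-(N-n) = n,
\]
which is exactly the score and information from the $n$ observed values alone.
```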
6
Linear Regression with Marginal Population Information
Motivating Scenario
Population U is such that the values yi and xi of two scalar variables, Y and X, are stored on separate registers, each of size N. A sample s of n units from one register is linked to the other via a unique common identifier, thus defining n matched (yi, xi) pairs
Aim To use these linked sample data to estimate the parameters α, β and σ² that characterise the population regression model
\[ y_i = \alpha + \beta x_i + \sigma \varepsilon_i, \qquad \varepsilon_i \sim \text{iid } N(0,1) \]
7
Assumption Y is independent of sample inclusion indicator I given X. That is, sampling method is non-informative for regression
parameters of interest
Maximum likelihood estimators for α, β and σ² that are based on the linked data only are the usual 'sample-based' MLEs
\[ \hat\beta_{smle} = \frac{\sum_s (x_i - \bar x_s)(y_i - \bar y_s)}{\sum_s (x_i - \bar x_s)^2}, \qquad \hat\alpha_{smle} = \bar y_s - \hat\beta_{smle}\,\bar x_s \]
\[ \hat\sigma^2_{smle} = n^{-1} \sum_s \big( y_i - \hat\alpha_{smle} - \hat\beta_{smle}\, x_i \big)^2 \]
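A minimal numpy sketch of these sample-based MLEs (function and variable names are mine, not from the slides):

```python
import numpy as np

def sample_mle(y, x):
    """'Sample-based' MLEs for (alpha, beta, sigma^2) in y = alpha + beta*x + sigma*e."""
    n = len(y)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))  # sum_s (x_i - xbar_s)(y_i - ybar_s)
    sxx = np.sum((x - x.mean()) ** 2)              # sum_s (x_i - xbar_s)^2
    beta = sxy / sxx
    alpha = y.mean() - beta * x.mean()
    sigma2 = np.sum((y - alpha - beta * x) ** 2) / n  # ML divisor n, not n - 2
    return alpha, beta, sigma2
```

For α and β this is ordinary least squares; only the σ² divisor differs from the usual unbiased estimator.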
8
Auxiliary Information Register summary data are available. In particular, we know the population means ȳ_U and x̄_U of Y and X...
- 'sample-based' MLEs are no longer the 'full information' MLEs for α, β and σ²…
- Can use the MIP to combine this population marginal information with the survey data to obtain full information MLEs
9
Components of the available data score function are then
\[ sc_{1s} = \sigma^{-2} \sum_U E_s\{(y_i - \alpha - \beta x_i)\} \]
\[ sc_{2s} = \sigma^{-2} \sum_U x_i\, E_s\{(y_i - \alpha - \beta x_i)\} \]
\[ sc_{3s} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_U \Big[ E_s\{(y_i - \alpha - \beta x_i)\}^2 + \mathrm{Var}_s(y_i - \alpha - \beta x_i) \Big] \]
Here a subscript of s denotes conditioning on the available data, i.e.
the sample values of Y and X and their corresponding population
means
10
If we know the population mean ȳ_U and the sample mean ȳ_s, we know the non-sample mean ȳ_{U−s}. Straightforward to then show that, for a non-sampled unit i,
\[ y_i \mid x_{U-s}, \bar y_{U-s} \;\sim\; N\!\left( \bar y_{U-s} + \beta(x_i - \bar x_{U-s}),\; \sigma^2\Big(1 - \frac{1}{N-n}\Big) \right) \]
Leads to an available data score with components
\[ sc_{1s} = \sigma^{-2}\Big\{ \sum_s (y_i - \alpha - \beta x_i) + (N-n)\big(\bar y_{U-s} - \alpha - \beta \bar x_{U-s}\big) \Big\} \]
\[ sc_{2s} = \sigma^{-2}\Big\{ \sum_s x_i (y_i - \alpha - \beta x_i) + (N-n)\,\bar x_{U-s}\big(\bar y_{U-s} - \alpha - \beta \bar x_{U-s}\big) \Big\} \]
\[ sc_{3s} = -\frac{n+1}{2\sigma^2} + \frac{1}{2\sigma^4}\Big\{ \sum_s (y_i - \alpha - \beta x_i)^2 + (N-n)\big(\bar y_{U-s} - \alpha - \beta \bar x_{U-s}\big)^2 \Big\} \]
11
Full Information MLEs
\[ \hat\beta_{fimle} = \frac{\sum_s (x_i - \bar x_s)(y_i - \bar y_s) + n\bar x_s(\bar y_s - \bar y_U) + (N-n)\bar x_{U-s}(\bar y_{U-s} - \bar y_U)}{\sum_s (x_i - \bar x_s)^2 + n\bar x_s(\bar x_s - \bar x_U) + (N-n)\bar x_{U-s}(\bar x_{U-s} - \bar x_U)} \]
\[ \hat\alpha_{fimle} = \bar y_U - \hat\beta_{fimle}\,\bar x_U \]
and
\[ \hat\sigma^2_{fimle} = \frac{1}{n+1}\Big\{ \sum_s \big( y_i - \hat\alpha_{fimle} - \hat\beta_{fimle}\, x_i \big)^2 + (N-n)\big( \bar y_{U-s} - \hat\alpha_{fimle} - \hat\beta_{fimle}\,\bar x_{U-s} \big)^2 \Big\} \]
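These closed forms can be transcribed directly; a hedged numpy sketch (names are mine; the non-sample means are recovered from the known population means):

```python
import numpy as np

def fimle(y_s, x_s, N, ybar_U, xbar_U):
    """Full-information MLEs for (alpha, beta, sigma^2) when the population
    means of Y and X are known; 'r' denotes non-sample (U - s) means."""
    n = len(y_s)
    ybar_s, xbar_s = y_s.mean(), x_s.mean()
    # Non-sample means implied by the known population means
    ybar_r = (N * ybar_U - n * ybar_s) / (N - n)
    xbar_r = (N * xbar_U - n * xbar_s) / (N - n)
    num = (np.sum((x_s - xbar_s) * (y_s - ybar_s))
           + n * xbar_s * (ybar_s - ybar_U)
           + (N - n) * xbar_r * (ybar_r - ybar_U))
    den = (np.sum((x_s - xbar_s) ** 2)
           + n * xbar_s * (xbar_s - xbar_U)
           + (N - n) * xbar_r * (xbar_r - xbar_U))
    beta = num / den
    alpha = ybar_U - beta * xbar_U
    sigma2 = (np.sum((y_s - alpha - beta * x_s) ** 2)
              + (N - n) * (ybar_r - alpha - beta * xbar_r) ** 2) / (n + 1)
    return alpha, beta, sigma2
```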
12
Note The FIML estimators are identical to the estimators defined by a weighted least squares fit to an extended sample consisting of
• the n data values in s (each with weight equal to 1)
• an additional data value (with weight equal to N – n) defined by the known non-sample means ȳ_{U−s} and x̄_{U−s}
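This equivalence is easy to check numerically: fit WLS to the n sample points (weight 1) plus a single pseudo-observation at the non-sample means with weight N − n. A sketch with illustrative names:

```python
import numpy as np

def fimle_wls(y_s, x_s, N, ybar_U, xbar_U):
    """FIMLE of (alpha, beta) via WLS on the 'extended sample': the n sample
    points with weight 1, plus one pseudo-point at the known non-sample
    means (xbar_r, ybar_r) carrying weight N - n."""
    n = len(y_s)
    ybar_r = (N * ybar_U - n * y_s.mean()) / (N - n)
    xbar_r = (N * xbar_U - n * x_s.mean()) / (N - n)
    w = np.append(np.ones(n), N - n)                      # case weights
    X = np.column_stack([np.ones(n + 1), np.append(x_s, xbar_r)])
    Y = np.append(y_s, ybar_r)
    sw = np.sqrt(w)[:, None]                              # sqrt-weight rows
    coef, *_ = np.linalg.lstsq(X * sw, Y * sw.ravel(), rcond=None)
    return coef[0], coef[1]  # alpha, beta
```

Because the weighted fit passes through the weighted mean point, which here is exactly (x̄_U, ȳ_U), the identity α̂ = ȳ_U − β̂ x̄_U holds automatically.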
13
Variances Can use the MIP information identity, but it is easier to use the WLS formulation to write down the variances of β̂_fimle and α̂_fimle
\[ \mathrm{Var}(\hat\beta_{fimle}) = \frac{\sigma^2}{n}\cdot\frac{ \bar x_s^{(2)} - (1 - nN^{-1})\big(\bar x_s^{(2)} - \bar x_{U-s}^2\big) }{ \bar x_s^{(2)} - \bar x_{U-s}^2 + Nn^{-1}\big(\bar x_{U-s}^2 - \bar x_U^2\big) } \]
\[ \mathrm{Var}(\hat\alpha_{fimle}) = \frac{ n^{-1}\sigma^2\, \bar x_s^{(2)} }{ \bar x_s^{(2)} - \bar x_s^2 + N^{-1}(N-n)\big(\bar x_s - \bar x_{U-s}\big)^2 } \]
Here x̄_s^{(2)} denotes the sample mean of the x_i².
• Var(β̂_fimle) ≤ Var(β̂_smle) – equality only if x̄_s^{(2)} = x̄_s x̄_{U−s} (very unlikely)
• Var(α̂_fimle) ≤ Var(α̂_smle) – equality only if x̄_s = x̄_{U−s} (more likely)
14
An Alternative: Pseudo-Likelihood Inference
Kish and Frankel (1974), Binder (1983), Godambe and Thompson (1986), Pfeffermann (1993)
f(y_U; θ) = probability density of population Y values
• If y_U were observed, θ would be estimated by the solution of sc_U(θ) = 0
• For any specified value of θ, sc_U defines a finite population parameter (the 'census score'), which we can estimate from the sample data
sc_w = sample-weighted estimator of sc_U
The maximum pseudo-likelihood estimator of θ is the solution to sc_w(θ) = 0
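For the linear model of Problem 1 the weighted score equations are just weighted least squares; a sketch with illustrative names, where w would typically be the inverse inclusion probabilities:

```python
import numpy as np

def pseudo_mle(y, x, w):
    """Maximum pseudo-likelihood for the normal linear model: solve the
    sample-weighted score equations, i.e. weighted least squares with
    survey weights w."""
    W = np.sum(w)
    xbar_w = np.sum(w * x) / W
    ybar_w = np.sum(w * y) / W
    beta = np.sum(w * (x - xbar_w) * (y - ybar_w)) / np.sum(w * (x - xbar_w) ** 2)
    alpha = ybar_w - beta * xbar_w
    sigma2 = np.sum(w * (y - alpha - beta * x) ** 2) / W
    return alpha, beta, sigma2
```

With equal weights this reduces to the sample-based MLEs of slide 7.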
15
Weight Calibration + Pseudo-Likelihood Assume SRSWOR. There are three calibration constraints:
- the population size N
- the population mean of X
- the population mean of Y
\[ w^{cal} = \frac{N}{n}\mathbf{1}_n + N\,[\,\mathbf{1}_n \;\; y_s \;\; x_s\,] \begin{bmatrix} \mathbf{1}_n'\mathbf{1}_n & \mathbf{1}_n' y_s & \mathbf{1}_n' x_s \\ y_s'\mathbf{1}_n & y_s' y_s & y_s' x_s \\ x_s'\mathbf{1}_n & x_s' y_s & x_s' x_s \end{bmatrix}^{-1} \begin{bmatrix} 0 \\ \bar y_U - \bar y_s \\ \bar x_U - \bar x_s \end{bmatrix} \]
\[ \hat\beta_{cal} = \Big\{ \sum_s w_i^{cal}\, x_i (x_i - \bar x_{ws}) \Big\}^{-1} \sum_s w_i^{cal}\, x_i (y_i - \bar y_{ws}) \]
\[ \hat\alpha_{cal} = \bar y_{ws} - \hat\beta_{cal}\, \bar x_{ws} \]
\[ \hat\sigma^2_{cal} = N^{-1} \sum_s w_i^{cal} \big( y_i - \hat\alpha_{cal} - \hat\beta_{cal}\, x_i \big)^2 \]
Here x̄_ws and ȳ_ws denote the calibration-weighted sample means of X and Y.
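These weights are an instance of generic linear (chi-squared distance) calibration, which can be sketched as follows (function name is mine). For the slide's setup the design weights are d_i = N/n, the benchmark columns are (1, y_i, x_i), and the targets are (N, Nȳ_U, Nx̄_U):

```python
import numpy as np

def calibrate(d, Z, totals):
    """Linear calibration: minimally adjust design weights d (chi-squared
    distance) so the calibrated totals of the columns of Z hit `totals`
    exactly, i.e. Z'w = totals."""
    lam = np.linalg.solve(Z.T @ (d[:, None] * Z), totals - Z.T @ d)
    return d * (1.0 + Z @ lam)
```

Note that calibrating on Y makes the weights depend on the survey response itself, which is the source of the instability discussed on slide 21.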
16
Model-Based Simulations
Data model y_i = 5 + x_i + ε_i with x_i ~ LN(0, 1) and ε_i ~ N(0, 1)
Sampling methods
  SRS: π_i = nN⁻¹
  PPX: π_i = nN⁻¹ x_i x̄_U⁻¹
  PPY: π_i = nN⁻¹ y_i ȳ_U⁻¹
Auxiliary data ȳ_U, x̄_U
1000 independent repetitions of population generation followed by sample selection
17
SRS % relative efficiencies with respect to 5% trimmed RMSE of unweighted SMLE

Parameter   N = 500, n = 20    N = 1000, n = 50    N = 5000, n = 200
            CAL     MIP        CAL     MIP         CAL     MIP
β           103     134        127     145         143     150
α            81     106         90     102          96     101
σ²           84     102         94     100          99     100
18
PPX % relative efficiencies with respect to 5% trimmed RMSE of π-weighted PMLE

Parameter   N = 500, n = 20    N = 1000, n = 50    N = 5000, n = 200
            CAL     MIP        CAL     MIP         CAL     MIP
β            75     477         70     502          76     623
α            27     198         26     225          28     270
σ²           58     131         70     143          82     146
19
PPY % relative efficiencies with respect to 5% trimmed RMSE of π-weighted PMLE

Parameter   N = 500, n = 20    N = 1000, n = 50    N = 5000, n = 200
            CAL     MIP        CAL     MIP         CAL     MIP
β           118     201        143     210         159     222
α            63     109         73     110          81     117
σ²           78     106         89     106          91     111
20
Observations
• MIP-based approach is efficient (and robust to informativeness)
  - As expected, the MIP-based estimate of β benefits most from the auxiliary information. However, there are non-negligible gains for the MIP-based estimates of α and σ² as well
  - These gains are most substantial under PPX sampling. However, they are still substantial when a highly informative sampling method (PPY) is used
  - Though not shown, gains decrease with increasing error in the auxiliary information
• Pseudo-likelihood + calibrated weighting does not perform well, except when used to estimate β in the SRS case. In other cases, it appears preferable to use 'standard' selection weights in the maximum pseudo-likelihood estimator
21
Why Does Calibration Have Problems?
Look at distribution of calibrated weights for one simulation of PPX
[Figure: distributions of calibrated weights. A = inverse probability weighting, YX = calibration on both Y and X, Y = calibration on Y only, X = calibration on X only]
22
What If We Have Partial Population Summary Information?
We know x̄_U but not ȳ_U (+ sampling is non-informative given X)
FIMLE = SMLE, since knowing x̄_{U−s} tells us nothing more about the conditional distribution of Y given X in the non-sampled part of the population than we already know from the sample data
We can calibrate our sample weights to x̄_U, but it provides no extra efficiency (and will typically decrease efficiency) for estimating the regression of Y on X
23
If we know ȳ_U but not x̄_U, things are a little different...
The available data score depends on the mean and variance of x̄_{U−s} given the available data (including ȳ_{U−s}). In large samples x̄_{U−s}, x̄_s, ȳ_{U−s} and ȳ_s will be approximately jointly normal, so we can write down this conditional mean and variance, which will depend on the parameters of the regression model. Using simple 'plug-in' estimates derived from the sample data then allows us to approximate the available data score and hence approximate the MLE
Calibration is straightforward – we just calibrate to ȳ_U
24
Simulation Results Where Only ȳ_U is Known (% Rel Eff)

Parameter  Sampling   N = 500, n = 20   N = 1000, n = 50   N = 5000, n = 200
           Method     CAL    MIP        CAL    MIP         CAL    MIP
β          SRS         89     85         99    121         101    134
           PPX         91    193         95    172          94    128
           PPY         91    136         97    134          97     96
α          SRS         88     47         94     66          97     80
           PPX         64    155         71    183          67    224
           PPY         67     76         79     90          87     91
σ²         SRS         93     85         97     99         100    100
           PPX         88    126         91    133          95    141
           PPY         94    104         98    106          99    111

The calibration strategy is still inefficient, while the MIP approach does give gains, though not consistently
25
Extension to Logistic Regression
Chambers and Wang (2008)
Y = zero-one variable
Population level model
\[ \pi(x_i) = \Pr( y_i = 1 \mid x_i ) = \frac{\exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)} \]
Auxiliary Information Basic assumption is that ȳ_U is known, so the non-sample total t_{U−s} of Y (i.e. the number of non-sample 'successes') is known
26
Scenario 1 Individual non-sample values of X are also known (i.e. the X-register is available)
Components of the full information score
\[ sc_{1s} = \sum_U y_i - \sum_U \pi(x_i) \]
\[ sc_{2s} = \sum_s x_i \{ y_i - \pi(x_i) \} + E_s\Big( \sum_{U-s} x_i y_i \Big) - \sum_{U-s} x_i\, \pi(x_i) \]
where E_s denotes conditioning on the available data
FIMLE is defined by calculating a saddlepoint approximation to sc_{2s}
Note FIMLE depends on individual non-sample X values
27
Scenario 2 Further auxiliary information consists only of the value of x̄_U, so the non-sample mean x̄_{U−s} of X is known, but not the individual X-values
If sampling is SRS, we can approximate Σ_U π(x_i) and Σ_{U−s} x_i π(x_i) in the population level score using a 'smearing' argument (Duan, 1983)
\[ \frac{1}{N-n}\sum_{U-s} g(x_i) = \frac{1}{N-n}\sum_{U-s} g\big\{ \bar x_{U-s} + (x_i - \bar x_{U-s}) \big\} \approx \frac{1}{n}\sum_s g\big\{ \bar x_{U-s} + (x_i - \bar x_s) \big\} \]
This is combined with the saddlepoint approximation to calculate the FIMLE
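The smearing step is simple to code; a sketch (names mine), assuming SRS so that the sample x-deviations can stand in for the non-sample ones:

```python
import numpy as np

def smear_nonsample_mean(g, x_s, xbar_r):
    """Duan-style smearing approximation to (N - n)^{-1} sum_{U-s} g(x_i):
    shift the sample x-values so they are centred at the known non-sample
    mean xbar_r, then average g over the shifted points."""
    return np.mean(g(xbar_r + (x_s - np.mean(x_s))))
```

With g the identity the approximation is exact by construction, since the shifted points average to x̄_{U−s}; for nonlinear g (such as the logistic π) it borrows the shape of the sample x-distribution.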
28
Simulation: Linear Logistic Model + SRSWOR
Percent relative efficiencies with respect to 5% trimmed RMSE of SMLE. Values of X drawn from the standard lognormal distribution. In all cases N = 5000 and n = 200

True (α, β)        (–3, 1)   (–5, 2)   (–5, 1)   (–8, 2)
α   Scenario 1       116       112       121       114
    Scenario 2       112       108       114       115
β   Scenario 1       101       102       105       110
    Scenario 2       102       104       106       116
29
Simulation: Linear Logistic Model + Case Control Sampling
Cases are sampled differentially from controls (2 strata). Values in the table are percent relative efficiencies with respect to the 5% trimmed root mean squared error of the stratified pseudo-likelihood estimator. In all cases N = 5000 and n1 = n0 = 100. Values of X drawn from the standard lognormal distribution.

True (α, β)        (–3, 1)   (–5, 2)   (–5, 1)   (–8, 2)
α   Scenario 1       108       112       144       161
    Scenario 2       106       109       120       126
β   Scenario 1       121       123       191       189
    Scenario 2       113       112       128       127
    SMLE             106       108       127       129

Note
1. The saddlepoint approximation and smearing estimator used in MIP are adjusted for case-control sampling. See Chambers and Wang (2008)
2. The SMLE of the slope coefficient (from an unweighted logistic fit) is the standard MLE approximation used in case-control sampling
30
Conclusions/Lessons
• Incorporating auxiliary information via the MIP can significantly improve efficiency, and seems to be reasonably robust to measurement errors in this information
  - gains generally increase as the 'granularity' of the auxiliary information increases
• One needs to be very careful when incorporating auxiliary information via calibrated weighting. This is appropriate IF our targets of inference are parameters of marginal distributions AND the implied linear model induced by the calibration constraints is valid AND these constraints are accurately specified
31
References
Binder, D.A. (1983). On the variances of asymptotically normal estimators from complex surveys. International Statistical Review, 51, 279-292.
Breckling, J.U., Chambers, R.L., Dorfman, A.H., Tam, S.M. and Welsh, A.H. (1994). Maximum likelihood inference from survey data. International Statistical Review, 62, 349-363.
Chambers, R. and Wang, S. (2008). Maximum likelihood logistic regression with auxiliary information. Working Paper 12-08, Centre for Statistical and Survey Methodology, University of Wollongong.
Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39, 1-37.
Dorfman, A., Chambers, R. and Wang, S. (2002). Are survey weights necessary? The maximum likelihood approach to sample survey inference. Proceedings of the 162nd Annual Meeting of the American Statistical Association, New York, August 11-15.
Duan, N. (1983). Smearing estimate: a nonparametric retransformation method. Journal of the American Statistical Association, 78, 605-610.
Godambe, V.P. and Thompson, M.E. (1986). Parameters of super populations and survey population: their relationship and estimation. International Statistical Review, 54, 37-59.
Kish, L. and Frankel, M.R. (1974). Inference from complex samples (with discussion). Journal of the Royal Statistical Society, Series B, 36, 1-37.
Orchard, T. and Woodbury, M.A. (1972). A missing information principle: theory and application. Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability, 1, 697-715.
Pfeffermann, D. (1993). The role of sampling weights when modelling survey data. International Statistical Review, 61, 317-337.