synthetic data for small area estimation in the u.s ...€¦ · synthetic small area microdata...

Synthetic Data for Small Area Estimation

in the U.S. Federal Statistical System

Joseph W. Sakshaug

Institute for Social Research, University of Michigan

Institute for Employment Research, Nürnberg, Germany

Ludwig-Maximilian University of Munich, Germany

UNECE/Eurostat Work Session on Statistical Data Confidentiality

October 27, 2011

Outline

• Background

• Small Area Estimation Programs in the United States

– Pros/Cons

• Synthetic Microdata for Small Area Estimation

• Applications

– American Community Survey

– National Health Interview Survey

• Conclusions

• Future directions

2

Background/Motivation

• Increasing demand for small area estimates (Tranmer et al.,

2005)

– states, counties, cities/towns, neighborhoods, etc.

• Small area effects can impact policy decisions and

interventions at local levels

• Microdata for small areas typically not released to

the public due to disclosure concerns

• Thus, statistical agencies are responsible for

producing the majority of small area information

3

Small Area Estimation Programs

in the United States

• SAIPE – U.S. Census Bureau

o County-level estimates on income and poverty rates

• SAHIE – U.S. Census Bureau

o County-level estimates of health insurance coverage

• National Cancer Institute

o County-level prevalence of smoking, mammography, pap

smear test

o Combines two surveys (Raghunathan et al., 2007)

�National Health Interview Survey (NHIS)

�Behaviorial Risk Factor Surveillance Survey (BRFSS)

4

Pros/Cons of Small Area Programs

• Advantages– Provides important estimates at the local level

– Sufficient for basic analytic purposes

– Often used to inform policy decisions

– Data confidentiality is maintained

• Disadvantages– No microdata available for small areas

– Customized analyses not feasible

• Variable recodes, subgroup analysis, alternative definitions of construct of interest

– Multivariate estimates usually not released

• Correlations, regression coefficients, etc.

5

Microdata for Small Geographic Areas

Two main data dissemination approaches:

1) Release microdata files for areas with > 100,000

residents (U.S. Census Bureau)

o 626 (out of 3141) counties meet this minimum

o Other counties are combined with larger counties until

threshold is met

2) Access restricted data via Research Data Centers

o Limited # of Census RDCs in U.S. (13 locations)

o Proposal and special sworn status required

6

Alternative Method:

Release Synthetic Microdata (Rubin, 1993)

• Treat unsampled portion of population as missing data

• Replace missing data with imputed (or synthetic)

values drawn from a posterior predictive distribution

• Release samples of synthetic data which comprise the

public-use microdata

• Apply standard combining rules to obtain valid

inferences (Raghunathan, Reiter, and Rubin, 2003)

• Released data need not contain any observed records

7

Previous Applications of Synthetic Data

• IAB Establishment Panel (Drechsler et al., 2008)

• SIPP/SSA/IRS (Abowd et al., 2006)

• ACS Group Quarters (Rodriguez, 2007)

• Longitudinal Business Database (Kinney & Reiter, 2008)

• Applications focus on preserving statistics about the

entire sample, but ignores small area statistics.

8

Synthetic Small Area Microdata Project

• Jointly funded by the U.S. Census Bureau and the Centers

for Disease Control and Prevention

• Project goals

• 1) Develop synthetic data generation method

– Hierarchical Bayesian model

• 2) Generate synthetic microdata for counties

– American Community Survey (Northeast region)

– National Health Interview Survey (sampled/nonsampled areas)

• 3) Compare inferences obtained from synthetic and

actual data

– Descriptive and analytic statistics

9

Selected Items

• Household items:

– Household size, income, tenure (own, mortgage,

rent), electricity payment, number of rooms

• Person items:

– Age, sex, race, ethnicity, poverty status, self-

reported health status, body mass index, smoking

status, moderate activity, hypertension

10

Average County-Level Estimates: Household

Avg. County Means Avg. County

Standard Errors

Household variables Actual Synthetic Actual Synthetic

Household size 2.12 2.12 0.02 0.01

Sampling weight 9.99 10.20 0.11 0.11

Number of bedrooms 2.88 2.82 0.01 0.01

Electricity cost/month 118.89 119.37 1.25 1.10

Number of rooms 3.23 3.18 0.02 0.02

Income 67983.89 67382.38 1067.29 692.56

Tenure: Mortgage/loan (%) 49.00 47.03 0.82 0.74

Tenure: Own free & clear (%) 31.12 30.37 0.77 0.72

Tenure: Rent (%) 19.88 22.60 0.63 0.6312

County Means: Actual vs. Synthetic

Cross-Validation Study:

Nonsampled County Means (N=63)

14

15

A) Mean Income by Household Tenure

B) Household Income > 50th, 75th, and 90th percentiles

Average County-Level Regression Estimates

16

Avg. County

Coefficients

Avg. County

Standard Errors

Linear regression of

household income on

Actual Synthetic Actual Synthetic

Intercept 24.34 24.26 1.11 1.09

Household size 1.52 1.44 0.14 0.14

Sampling weight -0.04 -0.05 0.24 0.26



Number of rooms 1.25 1.26 0.14 0.13

Tenure: Mortgage/loan Ref Ref Ref Ref

Tenure: Own free & clear -3.47 -3.05 0.37 0.34

Tenure: Rent -6.01 -6.84 0.44 0.47

Conclusions

• Synthetic data preserves small area statistics reasonably well in most cases

– Univariate/multivariate, subgroup estimates

• Modeling approach could be improved

– non-standard distributions, multinomial distributions

• Practical Strengths

– Easy to implement; doesn’t require MCMC

– Data can presumably be released to the public without restriction (needs disclosure risk analysis)

– Method could be adopted for large scale survey projects

17

Thanks for your attention!

[email protected]

18

Modeling Approach

• Extension of SRMI (Raghunathan et al., 2001)

• Hierarchical Bayesian Model

– Two levels; e.g., counties nested within states

• Fit sequential regression models within each small area,

� ��,� , � ��,�|��,� ,…,� ��,|��,�, … , ��,��• Approximate distribution of design-based parameter estimates,

� ��,�~�� ,�, ��,�• Assign proper prior to the unknown population parameter ,

��,�~�� , Σ��• Draw unknown parameter from posterior distribution,

��,�~�� ,�� Σ�� ,�� ,� � Σ�� , ��,�� Σ��

19

Model Setup

• Estimate weighted � ��,� by fitting sequential regression models

� ��,�~�� ,�, ��,� [1]

��,�~�� ,��, Σ�,� [2]

� �,�~�� ,�, ��,� [3]

��,�~�� , Ω� [4]

• Hyperparameters estimated using EM algorithm (Dempster et al., 1977)

��,�~�� ,�� Ω�� ,�� ,� � Ω�� , ��,�� Ω��

��,�~�� ,�� Σ��,�� ,�� ,� � Σ��,�� ,�� , ��,�� Σ��,��

• Models [2] and [4] used to simulate coeffs for nonsampled areas

• Simulated coefficients used as inputs for drawing synthetic values

20

Generating Synthetic Values

• Simulating a synthetic variable �� is achieved by drawing

from the posterior predictive distribution in sequential order,

� ��,�|�� , � ��,�|��,�, �� , …, � ��,|��,�, ��,�, … , ��,��, ��• For continuous variables,

��,�~� ��, ��,�, ��,�, ��,�� , ��• For binary variables,

��,�~�� 1, "̂ ��, ��,�, ��,�, … , ��,�� • Extension to count and multinomial distributions is possible

• Process is repeated � times to produce � synthetic data sets

– � = 5 (or 10) is usually sufficient to obtain valid inferences (Reiter, 2005)

• Valid inferences obtained using standard combining rules (Raghunathan, Reiter, Rubin, 2003)

21

Application 1: American Community Survey

• American Community Survey (2005-2009)

• “Small areas” defined as counties

• Northeast Region (N=217 counties)

• N = 599,450 households; 1,506,011 persons

• � $ 10 synthetic data sets

• Household-level variables

– Sampling weight, electricity cost/mo., income, household

size, # bedrooms, # total rooms, household tenure

(mortgage/loan, own free & clear, rent)

22

PSU Means: Actual vs. Synthetic (sampled)



25

27



Linear Regression of Income on:

HH Size

Sampling weight

Electricity cost/mo.

# Bedrooms

# Total rooms

Tenure: Own Free & Clear

Tenure: Rent

28

Simulation Study: CI Coverage – PUMA Means

Conditional

Unconditional

Synthetic Actual

Household size 0.99 0.98 0.98

Sampling weight 0.95 0.99 0.98

Number of bedrooms 0.89 0.93 0.98

Electricity cost/month 0.86 0.91 0.98

Number of rooms 0.97 0.98 0.98

Income 0.90 0.94 0.98

Tenure: Own free & clear 0.93 0.96 0.98

Tenure: Rent 0.94 0.96 0.98

Coverage mean 0.93 0.96 0.98

29

Application 2: National Health Interview Survey

• NHIS 2003-2005; Complex sample survey

• Treat PSUs as “small areas” nested within strata

• Generate synthetic data for both sampled PSUs and

nonsampled counties

• Incorporate PSU/county-level and stratum/state-level

covariates into hierarchical model

• N=93,606 sampled adults

• Continuous and binary items

– Age, body mass index, smoker, sex, moderate activity,

hypertension, fair/poor health rating

30

Actual vs. Synthetic (sampled/nonsampled)

PSU Means: Actual vs. Synthetic (sampled)



33

Simulation Study: CI Coverage

(Means & Regression Coefficients)Conditional Inference Unconditional Inference

CIC CIC Actual

BMI

Age

Smoker

Moderate activity

Male

Hypertension

Fair/poor health

0.99

0.91

0.99

0.99

0.99

0.99

0.99

0.99

0.99

0.99

0.99

0.99

0.99

0.99

0.97

0.98

0.98

0.98

0.98

0.97

0.97

34

Conditional Inference Unconditional Inference

Covariates CIC CIC Actual

Regression of

BMI(log) on

Intercept

Age

Smoker

Moderate activity

Male

Hypertension

Fair/poor health

0.99

0.99

0.99

0.99

0.99

0.99

0.99

0.99

0.99

0.99

0.99

0.99

0.99

0.99

0.97

0.97

0.98

0.97

0.98

0.98

0.96

Future Work

• Application to smaller areas is always desirable

– Census tracts, block groups

• Additional variable distributions

– E.g., mixed-type,

• Cross-classified tables

– add more details in public-use summary files

• Incorporate auxiliary information to improve

efficiency of synthetic data estimates

– E.g., Administrative data, external survey data (e.g., BRFSS)

• Longitudinal small area estimates (e.g., HRS, PSID)

35

PUMA Subgroup Means & Percentiles

(Synthetic vs. Actual)

37



Standard Errors of PUMA Means


PUMA Regression Coefficients


HH Income (y) = Intercept + HH Size + Sampling Weight + Bedrooms +

Electricity + Rooms + Own Free & Clear + Rent + Error

41

Standard Errors

(Household-Level Regression Coefficients)

42

Average ACS County-Level Estimates: Household

Avg. County Means Avg. County

Standard Errors

Household variables Actual Synthetic Actual Synthetic

Household size 2.12 2.12 0.02 0.01

Sampling weight 9.99 10.20 0.11 0.11



Number of rooms 3.23 3.18 0.02 0.02

Income 67983.89 67382.38 1067.29 692.56

Tenure: Mortgage/loan (%) 49.00 47.03 0.82 0.74

Tenure: Own free & clear (%) 31.12 30.37 0.77 0.72

Tenure: Rent (%) 19.88 22.60 0.63 0.6343

Average ACS County-Level Regression Estimates

44

Avg. County

Coefficients

Avg. County

Standard Errors

Linear regression of

household income on

Actual Synthetic Actual Synthetic

Intercept 24.34 24.26 1.11 1.09

Household size 1.52 1.44 0.14 0.14

Sampling weight -0.04 -0.05 0.24 0.26



Number of rooms 1.25 1.26 0.14 0.13

Tenure: Mortgage/loan Ref Ref Ref Ref

Tenure: Own free & clear -3.47 -3.05 0.37 0.34

Tenure: Rent -6.01 -6.84 0.44 0.47

Synthetic Data Inference

• Estimating a scalar quantity & is achieved using standard combining rules (Raghunathan, Reiter, and Rubin, 2003)

• Point estimate '() is obtained by averaging point estimates across the � $ *+ $ 1,2,… ,�- synthetic data sets,

'() $.' /)

/0�/�

• Variance of point estimate 2) consists of within- and between-variance components,

2) $ 1 �� 3) 4 56where 3) $ ∑ ' / 4 '() �/ � 4 1)/0� and 5̅) $ ∑ 5 /)/0� /�

• When , �9:, and � are large, inferences for scalar & can be

based on normal distributions.

– For moderate �, inferences can be based on t-distributions

45

Synthetic (samp) vs. Synthetic (nonsamp)

Propensity Score Balance Check• Actual and synthetic data sets stacked

• Fit logistic regression of belonging to actual data set

• Predicted probabilities sorted, grouped into deciles

• ;� -test of equality of synthetic data proportions across deciles

47

Mean Min Max

Estimated

probabilities "̂0.30 0.18 0.48

;� statistic 14.80 7.92 42.90

P-value 0.23 0.01 0.57

synthetic data for small area estimation in the u.s ...€¦ · synthetic small area microdata...

Documents