Sampling with Synthesis: A New Approach for
Releasing Public Use Census Microdata
Jörg Drechsler and Jerome P. Reiter∗
Abstract
Many statistical agencies disseminate samples of census microdata, i.e., data
on individual records, to the public. Before releasing the microdata, agencies
typically alter identifying or sensitive values to protect data subjects’ confiden-
tiality, for example by coarsening, perturbing, or swapping data. These stan-
dard disclosure limitation techniques distort relationships and distributional
features in the original data, especially when applied with high intensity. Fur-
thermore, it can be difficult for analysts of the masked public use data to adjust
inferences for the effects of the disclosure limitation. Motivated by these short-
comings, we propose an approach to census microdata dissemination called
sampling with synthesis. The basic idea is to replace the identifying or sensi-
tive values in the census with multiple imputations, and release samples from
these multiply-imputed populations. We demonstrate that sampling with syn-
thesis can improve the quality of public use data relative to sampling followed
∗Jorg Drechsler is research scientist, Institute for Employment Research, Department for Statisti-
cal Methods, Regensburger Straße 104, 90478 Nurnberg, Germany,(e-mail: [email protected]);
and Jerome P. Reiter is Mrs. Alexander F. Hehmeyer Associate Professor of Statistical Science, Duke
University, Durham, NC 27708-0251, (e-mail: [email protected]). This research was supported
by a grant from the National Science Foundation (NSF-MMS-0751671).
1
by standard statistical disclosure limitation; simulation results showing this
are available online as supplemental material. We derive methods for analyzing
the multiple datasets generated by sampling with synthesis. We present algo-
rithms for selecting which census values to synthesize based on considerations
of disclosure risk and data utility. We illustrate sampling with synthesis on a
population constructed with data from the U. S. Current Population Survey.
Keywords: Confidentiality, Disclosure, Multiple imputation, PUMS, Synthetic
1 INTRODUCTION
Many national statistical agencies disseminate microdata, i.e., data on individual
records, to the public. Wide dissemination of microdata facilitates advances in sci-
ence and public policy, enables students to develop skills at data analysis, and helps
ordinary citizens learn about their communities. It also can improve the quality of
agencies’ future data collection efforts via feedback from those analyzing the data.
Disseminating microdata is complicated when the data come from a census. Large
files can be unwieldy for users, and releasing census data can compromise the data
subjects’ confidentiality. Agencies address the file size issue by releasing a randomly
selected sample of records from the census. As examples, the U.S. Bureau of the Cen-
sus releases both 1% and 5% samples from the decennial census, and statistical agen-
cies across the world deposit samples from censuses with the Integrated Public Use
Microdata Series international project; sampling fractions range from .07% in India to
10% in South American countries (https://international.ipums.org/international/).
In addition to reducing file size, sampling can reduce the risks of unintended dis-
closures. The literature on data confidentiality (e.g., Duncan and Lambert, 1989;
Fienberg et al., 1997; Reiter, 2005a) indicates that malicious users—henceforth called
intruders—seeking to identify data subjects have better chances of success when they
know that their targets are in the released data, as is the case in a census (except for
coverage errors and nonresponse). With sampling, intruders no longer are guaranteed
that their targets are in the released data.
Sampling alone, however, may not sufficiently reduce disclosure risks for records
with unusual characteristics. Intruders still may be able to link such records in the
released data to records in external files by matching on variables common to both
sources. Therefore, agencies typically alter identifying or sensitive data values before
releasing the samples. For example, in its public use microdata, the U.S. Bureau of the
Census aggregates geographical information, collapses levels of categorical variables,
perturbs individuals' ages in large households, swaps data values across records, and uses
top codes for incomes, i.e., sets all incomes exceeding a threshold equal to a constant
(U. S. Bureau of the Census, 2003, p. 2-1).
Such disclosure-protected, census microdata samples are widely used by the pub-
lic. Social service agencies use census microdata samples in the allocation formulas
to fund some government programs and for general program planning (Blewett and
Davern, 2007). Survey organizations use census microdata samples to construct
post-stratification weights. Demographers, economists, political scientists, sociolo-
gists, statisticians, and many others use census public use microdata samples in their
research and teaching. All of these data users demand high quality public use files;
hence, they have a keen interest in how public use data are protected. A recent illus-
tration of this interest is the analysis done by Alexander et al. (2010), who uncovered
substantial inconsistencies between analyses done with disclosure-protected, public
use census microdata samples and those done with actual census data. Their findings
caused enough concern among data users that the Wall Street Journal ran a story on
the effects of data protection on the quality of secondary analyses (Bialek, 2010).
As resources available to intruders continue to expand, agencies using standard
disclosure limitation techniques—like coarsening, perturbation, and swapping—on
samples of census data may need to apply them with high intensities to protect
confidentiality adequately. Unfortunately, this protection has a price: the released
data can be seriously degraded for statistical analysis. Furthermore, it can be difficult
for data users to account for the effects of the disclosure limitation in inferences.
Motivated by the shortcomings of standard disclosure limitation, particularly at
high intensities, we propose an alternative approach to disseminate public use cen-
sus microdata: sampling with synthesis. The basic idea is to replace identifying or
sensitive values in the census with multiple imputations before releasing samples to
the public. Specifically, the agency (i) selects the set of values to replace in the cen-
sus with imputations, so that the released data comprise a mixture of genuine and
simulated data; (ii) determines the synthesis models with the entire census, thereby
taking advantage of all available information; (iii) repeatedly simulates replacement
values for the selected data to create multiple, disclosure-protected populations; and,
(iv) releases samples from the populations. The agency can sample independently
in each population, or it can select the same records from each population. As we
argue in Section 2 and empirically demonstrate with simulations summarized in the
online supplement, using sampling with synthesis rather than sampling with standard
disclosure limitation can improve the quality of the public release samples.
Multiple imputation has been previously suggested for protecting confidentiality
of microdata collected in random samples. This idea, also called synthetic data, was
first proposed by Rubin (1993) and Little (1993); a related approach was proposed
by Fienberg (1994). Our approach extends these methodologies. Specifically, Rubin’s
(1993) approach, now called full synthesis, involves replacing all values and releasing
independent samples from the simulated populations; our approach allows the agency
to replace only some values and, if desired, to release a common set of records from
the populations. Little’s (1993) approach, now called partial synthesis, involves re-
placing only a subset of values and releasing the disclosure-protected populations; our
approach allows agencies to release samples from these populations.
Because the released data are generated in a novel way, inferential methods for
full synthesis (Raghunathan et al., 2003) and partial synthesis (Reiter, 2003) are
not appropriate for sampling with synthesis. In Section 3, we present inferential
methods that appropriately account for the process of data generation. These are
based on adaptations of multiple imputation combining rules (Rubin, 1987; Reiter
and Raghunathan, 2007).
Key to our approach is selecting the values to be replaced with multiple imputa-
tions, i.e., the first stage of the process. We use an algorithm that, given a set of risky
records, provides a principled way for agencies to maintain acceptable confidentiality
protection with minimal synthesis. We apply this algorithm to perform sampling with
synthesis on a population constructed with the U.S. Current Population Survey; the
algorithm and the application are summarized in Section 4. We discuss issues related
to implementation in very large censuses in Section 5. We conclude with directions
for future research and other applications of sampling with synthesis in Section 6.
2 MOTIVATION FOR THE APPROACH
To motivate sampling with synthesis, we compare its features with those of standard
disclosure limitation strategies, including perturbation, swapping, aggregation, and
top coding. The discussion is in accord with the results of Winkler (2007), who showed
that standard statistical disclosure control methods fail in terms of data utility or
disclosure risk even in simple settings. In the online supplement, we present results
of simulation studies showing that sampling with synthesis has the potential to avoid
the problems documented by Winkler (2007).
Adding random noise. Agencies can disguise identifying or sensitive values by
adding some randomly selected amount to the observed values, for example a random
draw from a normal distribution with mean equal to zero. To ensure confidentiality,
agencies may need to draw from a distribution with large variance. This introduces
measurement error that, for example, stretches marginal distributions and attenuates
regression coefficients (Yancey et al., 2002); see the online supplement for illustrations
of this attenuation. In contrast, synthetic data explicitly aims to preserve marginal
and joint distributions of the data attributes through modeling. Hence, it has the
potential to avoid the pitfalls of adding random noise.
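The attenuation can be demonstrated in a few lines of simulation (a hypothetical toy example, not the supplement's study): adding noise with variance 1 to a predictor that itself has variance 1 roughly halves the OLS slope, since the slope shrinks by the factor var(x)/(var(x) + var(noise)).

```python
import random

random.seed(1)
n = 50_000
# Genuine data: y = 2*x + error, with var(x) = 1.
x = [random.gauss(0, 1) for _ in range(n)]
y = [2 * xi + random.gauss(0, 0.5) for xi in x]

def ols_slope(xs, ys):
    # Simple least-squares slope of ys on xs.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

# Masked data: add N(0, 1) noise to x before release.
x_masked = [xi + random.gauss(0, 1) for xi in x]

slope_true = ols_slope(x, y)           # close to the genuine slope of 2
slope_masked = ols_slope(x_masked, y)  # attenuated toward 2 * 1/(1+1) = 1
print(slope_true, slope_masked)
```

Synthetic data sidesteps this because the released values are draws from a model for the joint distribution rather than the genuine values plus error.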
Data swapping. Agencies can swap data values for selected records, e.g., switch
values of age, race, and sex for at-risk records with those for other records, to discour-
age users from matching, since matches may be based on incorrect data (Dalenius and
Reiss, 1982). Swapping is used extensively by government agencies. It is generally
presumed that swapping fractions are low—agencies do not reveal the rates to the
public—because swapping at high levels destroys relationships involving the swapped
and unswapped variables. However, even seemingly modest swapping can be prob-
lematic. For example, using data from the Survey of Youth in Custody, Mitra and
Reiter (2006) find that 5% random swapping of two identifying variables results in
very poor confidence interval coverage rates for regression coefficients involving those
variables. In contrast, partial synthesis of 100% of these variables results in near
nominal coverage rates with greater protection. Little et al. (2004) illustrate similar
benefits of synthesis over swapping. In the supplement, we illustrate that even 1%
swapping can result in terrible coverage rates.
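The mechanism behind these coverage failures can be seen in a toy simulation (illustrative only; not the Mitra and Reiter study). Swapping a binary identifier between randomly chosen pairs of records preserves its marginal distribution exactly, but for each swapped record the identifier becomes unrelated to that record's outcome, attenuating the association:

```python
import random

random.seed(2)
n = 20_000
# Genuine data: binary identifier x, outcome y agreeing with x 90% of the time.
x = [random.random() < 0.5 for _ in range(n)]
y = [xi if random.random() < 0.9 else not xi for xi in x]

def assoc(xs, ys):
    # Difference in mean(y) between records with x = 1 and x = 0.
    y1 = [b for a, b in zip(xs, ys) if a]
    y0 = [b for a, b in zip(xs, ys) if not a]
    return sum(y1) / len(y1) - sum(y0) / len(y0)

# Swap x between 5% of record pairs (10% of records are touched).
x_swapped = x[:]
idx = random.sample(range(n), int(0.10 * n))
for i, j in zip(idx[::2], idx[1::2]):
    x_swapped[i], x_swapped[j] = x_swapped[j], x_swapped[i]

a_true = assoc(x, y)             # about 0.8 in the genuine data
a_swapped = assoc(x_swapped, y)  # attenuated, roughly 0.9 * 0.8 = 0.72
print(a_true, a_swapped)
```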
Top coding. Monetary variables and ages are frequently reported with top codes,
and sometimes with bottom codes as well. Top or bottom coding by definition elimi-
nates detailed inferences about the distribution beyond the thresholds. Chopping off
tails also negatively impacts estimation of whole-data quantities. For example, Ken-
nickell and Lane (2006) show that commonly used top codes distort estimates of the
Gini coefficient, a key measure of income inequality. In contrast, An and Little (2007)
show that replacing large values with draws from models that respect the thresholds,
e.g., so that replacement imputations are larger than the top code, can preserve tail features.
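The Gini distortion from top coding is easy to reproduce on simulated data (a hypothetical illustration, not the Kennickell and Lane analysis): censoring at the 99th percentile erases the upper tail's contribution to measured inequality.

```python
import random

random.seed(3)
n = 10_000
# Skewed hypothetical "incomes" from a lognormal distribution.
incomes = [random.lognormvariate(10, 1) for _ in range(n)]

def gini(values):
    # Gini coefficient via the sorted-values formula.
    v = sorted(values)
    k = len(v)
    cum = sum((i + 1) * xi for i, xi in enumerate(v))
    return 2 * cum / (k * sum(v)) - (k + 1) / k

# Top code: censor everything above the 99th percentile at the threshold.
threshold = sorted(incomes)[int(0.99 * n)]
top_coded = [min(xi, threshold) for xi in incomes]

g_full = gini(incomes)     # about 0.52 for lognormal with sigma = 1
g_coded = gini(top_coded)  # smaller: tail inequality is erased
print(g_full, g_coded)
```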
Aggregation. Aggregation reduces disclosure risks by turning atypical records—
which generally are most at risk—into typical records. For example, there may be
only one person with a particular combination of demographic characteristics in a
county but many people with those characteristics in a state. Aggregation makes
analysis at finer levels difficult and often impossible, and it creates problems of eco-
logical inferences. With synthetic data, it is possible to preserve finer details. For
example, agencies can release geography at higher resolutions by intense synthesis of
demographic identifiers, yet still maintain relationships in the data. This would not
be possible with aggregation coupled with intense swapping or noise addition.
A key advantage of sampling with synthesis compared to sampling with standard
disclosure limitation is the relative ease of inference. Sampling with synthesis enables
analysts to make valid inferences using standard, complete-data statistical methods
and software. For most standard disclosure limitation methods, analysts need to
model the disclosure limitation procedure in the likelihood function (Little, 1993) or
use measurement error models (Fuller, 1993), both of which are complicated for com-
plex estimands. Thus, with sampling with synthesis, analysts can focus on modeling
the science of their problem rather than trying to account for the effects of disclosure
limitation on inferences.
Because of the potential advantages of partially synthetic data over standard
disclosure limitation, several U. S. statistical agencies have adopted the approach
to create public use products. Among these are the Survey of Consumer Finances
(Kennickell, 1997), the American Community Survey group quarters data (Hawala,
2008), the Survey of Income and Program Participation (Abowd et al., 2006), the
Longitudinal Business Database (Kinney and Reiter, 2007), the Longitudinal Em-
ployer Household Dynamics program (Abowd and Woodcock, 2004), and On The
Map (http://lehdmap.did.census.gov/). Other statistical agencies developing syn-
thetic data include Statistics New Zealand (Graham and Penny, 2005) and the Ger-
man Institute for Employment Research (Drechsler et al., 2008a,b).
3 INFERENCE FOR SAMPLING WITH SYN-
THESIS
Let D denote the data for the census of a population of N units. To implement
sampling with synthesis, the agency first replaces identifying or sensitive values in D
with multiple imputations. Imputation models and their parameters are determined
using the at-risk records in D. Imputations are done independently m times, resulting
in m synthetic populations, Dsyn = {Di : i = 1, . . . , m}. The agency then takes a
sample of n < N records from each Di. The agency can choose records independently
in each Di, which we call the different samples approach, or choose the same set of
records in each Di, which we call the same sample approach. We discuss the merits
of each approach in Section 5. For now, we assume that the agency uses simple
random sampling; we extend to stratified sampling shortly. The agency releases the
m synthetic samples, dsyn = {di : i = 1, . . . , m}, to the public.
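The four steps, and the difference between the two sampling options, can be sketched end to end on toy data. This is hypothetical Python; the "synthesis model" here is simply resampling from the empirical distribution of the census values, a stand-in for the real imputation models:

```python
import random

random.seed(4)
N, n, m = 1000, 100, 5
# Toy census: (record id, sensitive value); 10% of records flagged at risk.
census = [(i, random.gauss(50, 10)) for i in range(N)]
at_risk = set(random.sample(range(N), 100))

# Steps (ii)-(iii): determine the synthesis "model" from the entire census
# (here, its empirical distribution) and draw m synthetic populations,
# replacing only the at-risk values.
pool = [v for _, v in census]
populations = []
for _ in range(m):
    D_i = [(i, random.choice(pool)) if i in at_risk else (i, v)
           for i, v in census]
    populations.append(D_i)

# Step (iv), same sample approach: one set of record ids reused everywhere.
same_ids = set(random.sample(range(N), n))
same_samples = [[r for r in D_i if r[0] in same_ids] for D_i in populations]
# Different samples approach: records drawn independently in each population.
diff_samples = [random.sample(D_i, n) for D_i in populations]
```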
The analyst of dsyn seeks inferences about some estimand Q, such as a population
mean or regression coefficient. In each di, the analyst estimates Q with some point
estimator q and estimates the variance of q with some estimator u, where u is specified
ignoring any finite population correction factors; for example, when q is the sample
mean, u = s2/n. For i = 1, . . . , m, let qi and ui be respectively the values of q and u
in di. The following quantities are needed for inferences:
qm = Σ_{i=1}^{m} qi/m,    (1)

bm = Σ_{i=1}^{m} (qi − qm)^2/(m − 1),    (2)

um = Σ_{i=1}^{m} ui/m.    (3)
For the different samples approach, the analyst uses qm to estimate Q and Td =
bm/m to estimate the variance of qm. Inferences are based on the t-distribution,
(qm − Q) ∼ tm−1(0, Td). Derivations of these methods are in the appendix.
For the same sample approach, the analyst again uses qm to estimate Q. The
variance estimate is Ts = ((1 − n/N)δ + (1 − δ))(um − bm) + bm/m. Here, δ = 1
when the analyst uses the finite population correction factor, and δ = 0 when the
analyst does not use the finite population correction factor. Inferences are based
on the t-distribution, (qm − Q) ∼ tν(0, Ts), where ν = (m − 1)(Ts/(((1 − n/N)δ +
(1 − δ))bm − bm/m))2. Derivations of these methods are in the appendix. It is
theoretically possible that Ts < 0. When this occurs, we adopt the conservative
estimate, ((1−n/N)δ+(1−δ))um. Negative variances are unlikely when only modest
fractions of values (e.g., 10% or less) are replaced in the census and generally can be
avoided with large m (e.g., m ≥ 25).
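These rules translate directly into code. The sketch below (the helper name swsyn_inference is ours, not from the paper, and the degrees of freedom attached to the conservative fallback are our assumption) returns the point estimate, variance estimate, and degrees of freedom for either approach:

```python
def swsyn_inference(qs, us, n, N, approach="same", use_fpc=True):
    """Combining rules of Section 3 for m point estimates qs and
    variance estimates us; returns (q_bar, variance, degrees of freedom)."""
    m = len(qs)
    q_bar = sum(qs) / m                               # equation (1)
    b = sum((q - q_bar) ** 2 for q in qs) / (m - 1)   # equation (2)
    u_bar = sum(us) / m                               # equation (3)
    if approach == "different":
        return q_bar, b / m, m - 1                    # T_d with t_{m-1}
    g = (1 - n / N) if use_fpc else 1.0               # (1 - n/N)δ + (1 - δ)
    T = g * (u_bar - b) + b / m                       # T_s
    if T < 0:
        return q_bar, g * u_bar, m - 1                # conservative fallback
    nu = (m - 1) * (T / (g * b - b / m)) ** 2         # degrees of freedom ν
    return q_bar, T, nu
```

A 95% interval is then q_bar ± t_{df, 0.975} √T.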
Why are new inferential methods necessary when using sampling with synthesis?
Consider first the variance estimator developed by Raghunathan et al. (2003) for full
synthesis, Tf = bm − ((1 − n/N)δ + (1 − δ))um + bm/m. When replacing only a small
fraction of values in sampling with synthesis, bm < ((1 − n/N)δ + (1 − δ))um, so that
Tf generally would be negative if applied in sampling with synthesis contexts.
Next consider the variance estimator developed by Reiter (2003) for partial syn-
thesis, Tp = ((1 − n/N)δ + (1 − δ))um + bm/m. Obviously, Tp can be substantially
larger than Ts or Td. Reiter’s (2003) derivations presume that the records used to
estimate the imputation models are the same as those used in the analysis. However,
this is not the case for sampling with synthesis: the imputation model utilizes the
entire census but analyses are based only on available samples. As a result of this
mismatch, Tp is positively biased if applied in sampling with synthesis contexts. A
related bias was identified in the context of multiple imputation for missing data by
Schenker and Raghunathan (2007) and Reiter (2008).
One might wonder why we do not take the sample first, estimate the synthesis
model with the sample second, and then impute replacements. This strategy is equiv-
alent to partially synthetic data, acting as if the sample was the only available data
to the agency. By generating imputations from models determined with the entire
census rather than with a sample from it, the agency enables analysts’ inferences to
be more efficient than inferences with standard partial synthesis. Using the entire
census for imputation has another interesting implication: inferences based on dsyn
can be more efficient than inferences based on the original data from the sample alone.
This is because the multiple imputations pass additional information from the census
to the analysts’ inferences. This is evident in the application described in the next
section and in the simulations presented in the online supplement.
To improve estimation accuracy in the public use samples, agencies can select
records using stratified random sampling. Analysts should account for the stratifi-
cation in their inferences. For model-based analyses, this involves including func-
tions of the stratum indicators in models (Gelman et al., 2004, Chapter 7) and using
the t-distributions previously outlined in this section with δ = 0. For design-based
(survey-weighted) analyses, the t-distribution based on Td is appropriate for the dif-
ferent samples approach. For the same sample approach, design-based estimation
requires modified procedures analogous to those from traditional stratified sampling
estimation. To illustrate, let Nh be the population size and nh be the sample size
in stratum h, where h = 1, . . . , H. To estimate the census mean of some Y, let
qmh be the sample mean of Y in stratum h, and let Tsh be the value of Ts com-
puted with the records in stratum h using the finite population correction factor
(1− nh/Nh). Analysts can use qmh and Tsh in inferences for the census mean in stra-
tum h. For the entire census mean, the point estimate is qm = Σ_h (Nh/N) qmh, and
its estimated variance is Ts = Σ_h (Nh/N)^2 Tsh. Design-based estimates of functions of
means and totals, such as domain quantities or regression coefficients, are obtained
by using relevant qm to estimate the component census means and totals; estimated
variances can be determined by linearization methods. Inferences can be based on
(Q − qm) ∼ N(0, Ts). In cases where m is small (e.g., m ≤ 5) and the amount of
replaced data is large (e.g., more than 50% of some variable), analyses can be im-
proved by using (Q− qm) ∼ tνst(0, Ts); an approach for determining νst is outlined in
the appendix. For many contexts, νst is large enough that a normal distribution is
adequate for inferences.
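For the stratified same sample estimator of the census mean, the combination of per-stratum quantities is a weighted sum; a minimal sketch (helper name ours), assuming qmh and Tsh have already been computed within each stratum:

```python
def stratified_census_mean(strata):
    """Combine per-stratum estimates into an estimate of the census mean.
    `strata` holds (N_h, q_bar_h, T_s_h) for h = 1, ..., H."""
    N = sum(Nh for Nh, _, _ in strata)
    q = sum((Nh / N) * qh for Nh, qh, _ in strata)       # q_m
    T = sum((Nh / N) ** 2 * Th for Nh, _, Th in strata)  # T_s
    return q, T

# Hypothetical example: two strata of sizes 600 and 400.
q, T = stratified_census_mean([(600, 10.0, 0.4), (400, 20.0, 0.9)])
```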
Table 1: Description of variables used in the empirical study
Variable Label Range
Sex X male, female
Race R white, black, American Indian, Asian
Marital status M 7 categories, coded 1–7
Highest attained education level E 16 categories, coded 31–46
Age (years) A 0 – 90
Number of people in household H 1 – 16
Number of people in household under age 18 Y 0, 1 – 11
Household property taxes ($) P 0, 1 – 99,997
Social security payments ($) S 0, 1 – 50,000
Household income ($) I -21,011 – 768,742
4 APPLICATION ON A CONSTRUCTED POP-
ULATION
To illustrate sampling with synthesis, we use a subset of public use data from the
March 2000 U.S. Current Population Survey. The data comprise nine variables mea-
sured on 51,016 heads of households; see Table 1. These variables include demographic
data—age, race, sex, and marital status—that are key identifiers in many censuses
of individuals. They also include a mix of other variables that represent several
generic modeling challenges, including numerical and categorical variables, numerical
variables with clearly non-Gaussian distributions, and numerical variables with large
percentages of values equal to zero. Similar data are used by Reiter (2005b,c) to
illustrate and evaluate releasing synthetic microdata for probability samples.
Although these data are truly a sample, we suppose that they are a census to
illustrate the process of sampling with synthesis. We suppose that the intruder knows
age, race, marital status, and sex precisely for all records in the census; the intruder
does not know the other variables. There are 521 records with unique combinations
of age, race, marital status, and sex; and, there are 578 combinations of the four
variables with two cases. Compared to the full population, these 1089 people are
disproportionately non-white (17% white versus 86% white in the full population), with
roughly equal distribution among non-white races; have marital status other than married
civilian (13% versus 53% married civilians); and are more likely to be women (52% versus 43%).
To make the public use file, we simulate values of age, race, and marital status in
the census for these 1089 records; we leave sex unchanged. This selection ensures that,
for records with no synthesized data, intruders at best have a one in three chance of
making a correct identification based only on the four demographic variables. If one
in three is not deemed low enough, the agency can simulate data for more records.
Intruders might have access to other variables, for example property taxes, in which
case the agency may want to simulate those variables as well.
We begin the process by synthesizing age, race, and marital status for all 1089
records using the models described in Section 4.1. This results in m = 5 synthetic
populations. We then assess the disclosure risks in the synthesized populations using
the approach described in Section 4.2. To increase data usefulness, we add back real
values to these populations in ways that do not substantially increase the disclosure
risks; the algorithm for doing so is described in Section 4.3. Finally, we take a 10%
sample from the synthetic populations using the same sample approach of Section 3.
The resulting five datasets would be released to the public. We evaluate inferences
from the samples for selected estimands in Section 4.4. Similar evaluations for the
different samples approach are presented in the online supplement.
4.1 Generating the synthetic populations
We use classification and regression trees (CART) to generate synthetic data (Reiter,
2005c). CART has several advantages for synthesis: it handles diverse data types,
captures non-linear relationships and complex interactions, and runs quickly on large
datasets. Furthermore, it can be applied without much tuning.
For the synthesis of age, race, and marital status, we proceed sequentially. First,
using the 1089 records, we fit a regression tree of age on all other variables except
race and marital status. Label the age tree as Y(A). We grow Y(A) by finding the
splits that successively minimize the deviance of age in the leaves. We cease splitting
any particular leaf when the deviance of ages in that leaf is less than .0001 or when
we cannot ensure at least five records in each child leaf. These are default specifi-
cations of tuning parameters in many applications of regression trees. We did not
prune the leaves further, as experiments with further pruning worsened the quality of
the synthetic datasets without substantially improving the confidentiality protection.
In other contexts, agencies may need to prune the leaves to respect confidentiality
criteria, as discussed in Reiter (2005c).
Let L_Aw be the wth leaf in Y(A), and let Y(A)_{L_Aw} be the n_{L_Aw} values of age in
leaf L_Aw. In each L_Aw in the tree, we generate a new set of values by drawing
from Y(A)_{L_Aw} using the Bayesian bootstrap (Rubin, 1981). These sampled values are
the replacement imputations for the n_{L_Aw} units that belong to L_Aw. Repeating the
Bayesian bootstrap in each leaf of the age tree results in the ith set of synthetic ages,
Y(A)_{rep,i}. We repeat this process m = 5 times to generate five populations with synthetic
ages, {D1, . . . , D5}.
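The within-leaf draws are the only stochastic step of the synthesis. A minimal sketch of the Bayesian bootstrap (Rubin, 1981) applied to one hypothetical leaf: Dirichlet(1, . . . , 1) weights are realized as the spacings of sorted uniforms, and replacement values are drawn with those weights.

```python
import random

def bayesian_bootstrap(values, size, rng):
    """Draw `size` replacement values from `values` via the Bayesian
    bootstrap: uniform spacings give Dirichlet(1, ..., 1) weights."""
    k = len(values)
    cuts = sorted(rng.random() for _ in range(k - 1))
    bounds = [0.0] + cuts + [1.0]
    weights = [bounds[i + 1] - bounds[i] for i in range(k)]
    return rng.choices(values, weights=weights, k=size)

rng = random.Random(7)
# Hypothetical leaf: the ages of the records falling in one leaf of Y(A).
leaf_ages = [23, 24, 24, 25, 27, 28, 31]
synthetic_ages = bayesian_bootstrap(leaf_ages, len(leaf_ages), rng)
```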
We next simulate values of marital status. In each Di, we fit the marital status tree
Y(M) with all variables except race as predictors. We minimize the Gini index to determine
the recursive binary splits. This is a default criterion for many applications of CART
for categorical outcomes. We again require a minimum of five records in each leaf and
do not otherwise prune the tree. To maintain consistency with Y(A)_{rep,i}, units' leaves
in Y(M) are located using Y(A)_{rep,i}. Occasionally, some units may have combinations of
values that do not belong to one of the leaves of Y(M). For these units, we search up
the tree until we find a node that contains the combination, and treat that node as if
it were the unit's leaf. Once each unit's leaf is located, values of Y(M)_{rep,i} are generated
using the Bayesian bootstrap. Imputing races follows the same process: we fit the
tree Y(R) using all variables as predictors, place each unit in the leaves of Y(R) based
on their synthesized values of age and marital status, and sample new races using the
Bayesian bootstrap. The entire process is repeated for each Di, resulting in m = 5
synthetic populations.
All CART models are fit in R using the “tree” function. The sequence of impu-
tations is A − M − R; see Reiter (2005c) for a discussion of sequencing.
4.2 Disclosure risk
To evaluate disclosure risks, we compute probabilities of identification using methods
developed by Reiter and Mitra (2009) for partially synthetic data. We summarize
these methods below for a census; see Drechsler and Reiter (2008) for modifications
for sample data. Related approaches are described by Duncan and Lambert (1989),
Fienberg et al. (1997), and Reiter (2005a).
Suppose the intruder has a vector of information, t, on a particular target unit in
the population D. Let t0 be the unique identifier of the target, and let Dj0 be the
(not released) unique identifier for record j in Dsyn, where j = 1, . . . , N. Let S be
any information released about the simulation models.
The intruder's goal is to match unit j in Dsyn to the target when Dj0 = t0. Let
J be a random variable that equals j when Dj0 = t0 for j ∈ Dsyn. The intruder thus
seeks to calculate Pr(J = j|t, Dsyn, S) for j = 1, . . . , N. Because the intruder
does not know the actual values in Yrep, he or she should integrate over its possible
values when computing the match probabilities. Hence, for each record we compute
Pr(J = j|t, Dsyn, S) = ∫ Pr(J = j|t, Dsyn, Yrep, S) Pr(Yrep|t, Dsyn, S) dYrep.    (4)
This construction suggests a Monte Carlo approach to estimating each Pr(J =
j|t,Dsyn,S). First, sample a value of Yrep from Pr(Yrep|t,Dsyn,S). Let Ynew repre-
sent one set of simulated values. Second, compute Pr(J = j|t,Dsyn,Yrep = Ynew,S)
using exact matching, treating Ynew as the collected values. This two-step process is
iterated h times, where ideally h is large, and (4) is estimated as the average of the
resultant h values of Pr(J = j|t,Dsyn,Yrep = Ynew,S). When S has no information,
the intruder treats the simulated values as plausible draws of Yrep.
Following Reiter (2005a), we quantify disclosure risk with summaries of these
identification probabilities. It is reasonable to assume that the intruder selects as a
match for t the record j with the highest value of Pr(J = j|t,Dsyn,S), if a unique
maximum exists. We consider two risk measures: the true match rate and the false
match rate. Let cj be the number of records with the highest match probability for
the target tj, where j = 1, . . . , N; let Ij = 1 if the true match is among the cj units and
Ij = 0 otherwise. Let Kj = 1 when cjIj = 1 and Kj = 0 otherwise. The true match
rate equals Σj Kj/N. Let Fj = 1 when cj(1 − Ij) = 1 and Fj = 0 otherwise; and let
s equal the number of records with cj = 1. The false match rate equals Σj Fj/s.
In our application, we presume that intruders’ targets are among the 1089 sensitive
cases. Since these data will contain synthetic values that change across Di, the
intruder can identify the 1089 sensitive cases and ignore records with unsynthesized
data, i.e., set their probabilities to zero. Thus, we compute risks based only on the
1089 records. The true match rate among these 1089 records is 5.7%, and the false
match rate is 88%. Hence, intruders are not likely to make correct matches, and very
likely to make mistakes. In contrast, nearly 50% of these records could be identified
in the original census data without the possibility of false matches. By these risk
measures, the synthesis has substantially reduced disclosure risks. We note that the
confidentiality risks in the public use data samples will be even smaller, since the act
of sampling introduces additional uncertainty in intruders’ abilities to match.
4.3 Algorithm for selecting values to synthesize
It may be sufficient from the perspective of disclosure risk to replace only some iden-
tifying variables. For example, a person might possess a unique combination of age,
race, sex, and marital status. This person might no longer be at risk if age is not
released exactly or if marital status and race both are not released exactly. When
such cases occur, it can improve data usefulness to include genuine values in dsyn
rather than leaving all values synthesized, since including genuine data reduces the
sensitivity of inferences to the synthesis model. However, substituting genuine values
for some records could impact risks for other records.
When the agency can safely put back some record’s genuine data in multiple ways,
ideally it should select the way that allows the released data to have as much utility
as possible with acceptable disclosure risk. One approach is to select certain key
estimands and add back genuine values so that synthetic-data and genuine-data in-
ferences are as similar as possible, e.g., the two sets of confidence intervals overlap as
much as possible (Karr et al., 2006). This will result in decisions that are nearly opti-
mal for some analyses and suboptimal for others. Another possibility is to maximize
some global measure of utility, such as summaries of the distance between the dis-
tributions of the synthetic and genuine datasets (Woo et al., 2009). However, global
measures of data utility are not especially sensitive to changes in a single datum, so
that they may not be fine enough for this purpose. Perhaps the ideal approach is to
release more synthetic values for variables for which the synthesis models reflect the
relationships in the data well, and more genuine values for variables that have com-
paratively poorly fitting models. Translating this idea into practical implementation
is an area for future research.
We choose a simple utility metric: make the number of synthesized values for
each variable as similar as possible. Effectively, this treats each variable as equally
important to data analysts, which is sensible absent specific planned uses of the public
use data. It also is simple to implement and transparent to data users.
We select the values to replace using an iterative approach. For iteration k, let
r(A)k be the number of synthetic ages, r(M)k be the number of synthetic marital
statuses, and r(R)k be the number of synthetic races in any one of the synthetic
populations at the start of the iteration. Initially, r(A)1 = r(M)1 = r(R)1 = 1089.
Let Ck = (Ck[1], Ck[2], Ck[3]) be the order in which variables are considered for replacement at iteration k, where Ck[1] represents the first variable considered, Ck[2] the second, and Ck[3] the third. Initially, C1[1] is age, C1[2] is marital status, and C1[3] is race.
For iteration k, which corresponds to the kth record among the 1089 sensitive
cases, we begin by replacing Ck[1] for record k in all five Di with the record’s genuine
value of Ck[1]. We re-synthesize that record’s values of Ck[2] and Ck[3] from appropriate CART models to maintain consistency with the updated value of Ck[1]. Let Dksyn = (Dk1, . . . , Dk5) be the resulting updated synthetic populations. Next, we compute the disclosure risk measures using Dksyn. If the measures satisfy three criteria, we
keep the genuine data value of Ck[1] in each Dki; otherwise, we put back the synthetic data for that record. The three criteria are: (i) the individual record cannot be correctly identified from Dksyn, (ii) the true match rate cannot exceed 5.7%, and (iii)
the false match rate cannot dip below 88%. In essence, we do not allow genuine data
replacements when the disclosure risks increase over those in the initial synthetic pop-
ulations Dsyn. These criteria can be relaxed to allow greater substitution of genuine
values. We check the true match rate and false match rate because, even if record k
is not identifiable, replacement of its Ck[1] may make other records identifiable.
When Ck[1] is replaced with genuine data, we decrease the corresponding rk by
one for the next iteration, e.g., if age is replaced, set r(A)k+1 = r(A)k − 1. We then
consider replacing Ck[2] with genuine data. We re-synthesize Ck[3] only for record
k based on genuine values of Ck[1] and Ck[2], and recompute the disclosure risks. If
the three criteria are satisfied, we replace Ck[2] in Dksyn with genuine data, set the
corresponding rk+1 = rk − 1, and move on to the next iteration. If the criteria are
not satisfied, we try to replace Ck[3] with genuine data using similar procedures. If
we cannot replace Ck[3] as well, then only Ck[1] is replaced in Dksyn for record k. We
note that the algorithm never puts back genuine values for all three variables for any
record k.
If Ck[1] cannot be replaced with genuine data, we attempt to replace Ck[2] and, if
possible, Ck[3] in Dksyn using similar procedures. Again, we decrease the corresponding
rk+1 as appropriate. If Ck[2] cannot be replaced, we attempt to replace only Ck[3] in
Dksyn. If no values can be replaced with genuine data, we return all values for record
k to the synthetic values in Dsyn, and set rk+1 = rk for all variables.
For each successive iteration, we reset Ck+1 to match the descending order of the
rk. For example, if r(A)k = 1000, r(M)k = 1002, and r(R)k = 1001, then Ck+1[1] is
marital status, Ck+1[2] is race, and Ck+1[3] is age. In this way, variables with the
least amount of genuine data are given the first opportunity to have genuine data put
back.
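The put-back procedure above can be summarized as a short loop. The sketch below is a simplified skeleton: the three-criteria risk check and the conditional re-synthesis step are supplied by the caller as black-box functions, and the names risk_ok, resynthesize, and select_genuine_values are illustrative, not from the text.

```python
def select_genuine_values(records, variables, risk_ok, resynthesize):
    """Sketch of the iterative algorithm for putting back genuine values.

    records:      ids of the sensitive cases; one iteration per record.
    variables:    initial consideration order, e.g. ['age', 'marital', 'race'].
    risk_ok:      returns True if the three disclosure criteria hold after a
                  trial replacement (stubbed; in practice recomputes the
                  identification probabilities and match rates).
    resynthesize: redraws a record's remaining synthetic values conditional
                  on the genuine values held fixed (stubbed).
    """
    # r[v] counts synthetic values remaining for variable v (1089 in the paper)
    r = {v: len(records) for v in variables}
    replaced = {rec: [] for rec in records}
    order = list(variables)
    for rec in records:
        kept = []
        for v in order:
            trial = kept + [v]
            resynthesize(rec, trial)        # keep other variables consistent
            if risk_ok(trial):
                kept = trial
                r[v] -= 1
            if len(kept) == len(variables) - 1:
                break                       # never put back all variables
        if not kept:
            resynthesize(rec, [])           # revert record to fully synthetic
        replaced[rec] = kept
        # variables with the least genuine data are considered first next time
        order = sorted(variables, key=lambda v: -r[v])
    return replaced, r
```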
We applied this algorithm to the five synthetic populations. After all 1089 itera-
tions, synthetic values remained on the file for 951 values of age, 951 values of marital
status, and 947 values of race. The disclosure risk measures for this updated dataset
are essentially unchanged. Hence, by using the algorithm, we put back about 13%
genuine values for each variable, thereby lessening reliance on the synthesis, with no
increase in disclosure risk.
4.4 Illustration of analytic properties
As the final step of sampling with synthesis, we take a 10% random sample of the
same records from the five updated synthetic populations. Of the 5100 records, 117
records have at least one value synthesized. To illustrate the quality of inferences
attainable from the resulting samples, we compute 95% confidence intervals for the
estimands listed in Table 2 and Table 3. These estimands include quantities for the
entire population, for small sub-groups likely to be impacted by the synthesis, and
for regression models with interactions among the variables subject to synthesis. The
tables also include the population values and 95% confidence intervals based on the
samples without any data synthesis. No values of Ts were negative.
Nearly all of the 95% confidence intervals from the synthetic data cover the pop-
ulation values Q. When this does not happen, the intervals based on the samples
without synthesis also do not cover their Q, indicating that lack of coverage results
primarily from natural variability inherent in taking samples. Inferences from both
sets of data are similar, as might be expected when replacing only 2.3% of values
overall. For 34 of the 57 estimands, the point estimates from the synthetic samples are closer to their corresponding population values than are the point estimates from the samples without synthesis. This is the case even for the estimate computed with the highest fraction of replaced values (the average age for single American Indians is based on five synthetic values out of fourteen): the synthetic data point estimate
of 33.9 is closer to the population value of 33.4 than the original data estimate of 30.8.
For 47 out of 57 estimands, the confidence intervals based on sampling with synthesis
are narrower than the confidence intervals based on the samples without synthesis.
The smaller errors in the point estimates and smaller widths of the intervals result
because the imputation models are determined with the entire census.
5 IMPLEMENTATION IN LARGE CENSUSES
The example in Section 4 illustrates the issues involved in sampling with synthe-
sis; however, many censuses have far more than 1089 records at risk of disclosure.
Fitting imputation models, including the CART model used here, can be computa-
tionally challenging when the number of at-risk records is very large. In such contexts,
agencies can partition the at-risk records into subsets of manageable sizes, for exam-
ple by geography or by categories of important variables like income, and perform
the imputation tasks separately in each partition. When using stratified sampling
to create the public use files, agencies can base the partitions on the strata. Impor-
tantly, the agency only needs to fit the synthesis models on the at-risk records in each
Table 2: Point estimates and 95% confidence intervals for demographic variables using
sampling with synthesis and, for benchmarking, sampling without synthesis. Census
values are in the column labeled Q.

Estimand                          Q      Sampling w/ Synthesis   Sampling w/o Synthesis
Avg. age                          48.7   48.9 (48.5, 49.4)       48.9 (48.5, 49.4)
Avg. age single Amer. Ind.        33.4   33.9 (28.8, 39.2)       30.8 (25.8, 35.9)
Avg. educ. divorced black women   39.4   39.2 (38.2, 40.3)       39.2 (37.9, 40.4)
% Married civilian                53.3   53.5 (52.2, 54.8)       53.7 (52.3, 54.9)
% Married in armed forces         .23    .25 (.13, .37)          .24 (.11, .36)
% Married spouse not present      1.5    1.7 (1.4, 2.1)          1.8 (1.4, 2.1)
% Widowed                         10.8   10.9 (10.1, 11.7)       11.0 (10.1, 11.8)
% Divorced                        14.0   14.1 (13.2, 15.0)       14.1 (13.2, 15.0)
% Separated                       2.9    2.6 (2.2, 3.0)          2.6 (2.2, 3.0)
% Single                          17.2   16.8 (15.8, 17.8)       16.8 (15.8, 17.8)
% White                           85.7   85.8 (84.9, 86.7)       85.6 (84.7, 86.5)
% Black                           10.2   10.3 (9.5, 11.0)        10.3 (9.5, 11.1)
% Amer. Ind.                      1.2    1.2 (.9, 1.5)           1.2 (.9, 1.5)
% Asian                           2.9    2.8 (2.4, 3.2)          2.9 (2.4, 3.3)
% Male, divorced Amer. Ind.       .11    .12 (.04, .21)          .10 (.02, .18)
% Separated | Asian               2.3    3.2 (.8, 5.7)           2.7 (.2, 5.2)
% Tax > 1000 | Black              19.1   18.5 (15.4, 21.7)       18.7 (15.5, 21.2)
partition group, not the entire census.
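The partition-and-fit strategy just described amounts to a group-and-apply loop over the at-risk records. The sketch below is illustrative: fit_and_replace is a hypothetical routine that fits a synthesis model to one partition's at-risk records and returns their synthetic replacements, and the dictionary-based interface is our own choice.

```python
from collections import defaultdict

def synthesize_by_partition(census, at_risk, partition_of, fit_and_replace):
    """Fit synthesis models separately within each partition group.

    census:          dict mapping record id -> record (a dict of values).
    at_risk:         set of ids of records needing synthesis.
    partition_of:    function id -> partition key (e.g. county or stratum).
    fit_and_replace: fits a model to one partition's at-risk records and
                     returns {id: synthetic_record} (assumed supplied).
    """
    groups = defaultdict(dict)
    for rid in at_risk:
        groups[partition_of(rid)][rid] = census[rid]
    out = dict(census)
    for key, grp in groups.items():
        # only the at-risk records in this partition enter the model fit
        out.update(fit_and_replace(grp))
    return out
```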
Even with extensive partitioning, sampling with synthesis still can result in effi-
ciency gains compared to standard partial synthesis. To illustrate, suppose that some
census data are partitioned into 3000 groups, e.g., the roughly 3000 counties in the
United States. Suppose further that the agency takes 1% random samples in each
county to create the public use file. In any county’s sample, there may be relatively
small numbers of people that are considered at risk of disclosure. If the agency es-
timates synthesis models within each county using only those sampled people, the
models would be based on few observations and hence inefficient. However, if the
agency builds synthesis models within each county using all people at risk in the
county, models will be based on much larger sets of observations, leading to efficiency
gains. When the distributions of replaced data are similar across the partitioning
Table 3: Point estimates and 95% confidence intervals for coefficients in regressions
using sampling with synthesis and, for benchmarking, sampling without synthesis.
Census values are in the column labeled Q.

Estimand                             Q      Synthesis              No Synthesis

Coefficient in regression of log I on:
Intercept                            4.9    5.2 (4.9, 5.6)         5.2 (4.9, 5.6)
Black                                -.27   -.19 (-.26, -.12)      -.19 (-.26, -.12)
American Indian                      -.25   -.17 (-.34, .01)       -.10 (-.29, .09)
Asian                                -.01   -.07 (-.20, .05)       -.11 (-.24, .01)
Female                               .00    .01 (-.05, .08)        .01 (-.05, .08)
Married spouse not present           -.03   .01 (-.21, .02)        .01 (-.16, .33)
Widowed                              -.01   .07 (-.09, .22)        .10 (-.06, .25)
Divorced                             -.16   -.10 (-.20, -.01)      -.11 (-.22, -.01)
Separated                            -.24   -.09 (-.30, .11)       -.12 (-.34, .09)
Single                               -.17   -.10 (-.19, -.01)      -.10 (-.19, -.01)
Education                            .11    .10 (.10, .11)         .10 (.10, .11)
Household size > 1                   .50    .52 (.46, .59)         .52 (.46, .59)
Females married spouse not present   -.52   -.35 (-.63, -.07)      -.43 (-.75, -.11)
Widowed females                      -.31   -.36 (-.53, -.19)      -.38 (-.55, -.21)
Divorced females                     -.31   -.41 (-.54, -.28)      -.41 (-.54, -.28)
Separated females                    -.52   -.57 (-.83, -.31)      -.55 (-.82, -.28)
Single females                       -.32   -.28 (-.39, -.16)      -.27 (-.39, -.15)
Age ×10                              .43    .38 (.31, .46)         .38 (.31, .46)
Age² ×1000                           -.44   -.40 (-.48, -.33)      -.40 (-.48, -.33)
Property tax ×10000                  .37    .38 (.29, .46)         .38 (.30, .46)

Coefficient in regression of √S on:
Intercept                            79.9   85.0 (73.6, 96.5)      83.4 (71.8, 95.1)
Female                               -13.3  -15.0 (-17.9, -12.1)   -15.4 (-18.2, -12.5)
Black                                -5.9   -4.6 (-8.8, -.5)       -5.0 (-9.3, -.8)
American Indian                      -7.0   -14.3 (-27.0, -1.5)    -11.3 (-24.1, 1.5)
Asian                                -3.3   -8.8 (-16.9, -.7)      -6.0 (-13.9, 1.9)
Married spouse not present           2.1    -.9 (-9.3, 7.5)        -1.9 (-10.4, 6.5)
Widowed                              7.3    7.3 (4.2, 10.5)        7.6 (4.5, 10.8)
Divorced                             -0.9   -.69 (-5.1, 3.7)       -.53 (-5.0, 4.0)
Separated                            -5.4   -3.5 (-13.7, 6.6)      -1.0 (-12.3, 10.3)
Single                               -1.5   -1.9 (-8.2, 4.5)       -2.2 (-8.9, 4.5)
High school                          5.5    7.4 (4.6, 10.3)        7.5 (4.6, 10.4)
Some college                         6.8    7.8 (4.4, 11.1)        7.9 (4.5, 11.2)
College degree                       8.3    7.3 (3.0, 11.6)        7.4 (3.0, 11.7)
Advanced degree                      10.7   8.8 (3.6, 13.9)        8.7 (3.6, 13.9)
Age                                  .21    .14 (-.01, .29)        .16 (.01, .32)
Income regression fit using people with I > 0.
Social security regression fit using people with S > 0 and A > 54.
groups, the efficiency gains are lower compared to those from fitting one model to the
entire census. On the other hand, partitioning can reduce bias when the distributions
of replaced data are dissimilar across partitions. We note that the inferential methods
of Section 3 apply regardless of the partitioning.
Although we illustrated sampling with synthesis using the same sample approach
in Section 4, agencies also can take independent samples from each synthesized pop-
ulation. For a fixed number of samples m, the different sample approach generally
enables estimation with higher efficiency. As m increases, reductions in variance are
typically larger with the different samples approach; in fact, with modest fractions
of replacement, var(qm) goes to zero as m → ∞ for the different samples approach
but not for the same sample approach. However, m independent samples potentially
require nearly m times as much storage space as the m versions of the same sample,
which public data analysts might find unwieldy for large files with many variables.
The disclosure risk properties of the two approaches are harder to pin down than
the utility properties. On the one hand, the different samples approach releases more
units to the public, increasing the number of potential intruder targets. The chance
that intruders’ specific targets are in the released data increases, so that intruders
can attempt re-identifications with higher confidence. On the other hand, it can be
difficult for intruders to determine whether records that appear in only one
of the different samples have been synthesized. This should reduce disclosure
risks. Furthermore, when many records that have synthesized values do not appear
in all m datasets, intruders trying to identify units with approaches akin to those in
Section 4 have less information to compute the probabilities.
For either approach, agencies need to select the value of m. Increasing m generally
increases the utility of the synthetic data, but it also increases disclosure risks and
storage needs. For either approach, agencies can compute overall disclosure risks for
different m, and select the largest m that provides acceptable risks and storage costs.
The literature on multiple imputation of missing data provides additional guidance
for selecting m for the same sample approach. Accuracy gains from increasing m
are typically modest for small fractions of replaced data (e.g., 10% or less), so that
agencies can release m = 5 or m = 10 datasets to keep risks and storage costs low
without sacrificing accuracy, whereas for large fractions of replaced data accuracy
gains can be substantial with large m.
6 CONCLUDING REMARKS
The main employers of sampling with synthesis should be national statistics agen-
cies that disseminate or could disseminate census data, like the Census Bureau, the
National Agricultural Statistics Service, the National Center for Education Statistics,
and dozens of international agencies. However, the approach is not limited to national
statistics agencies. Sampling with synthesis can be used to create research samples
of confidential administrative data, e.g., state records of people receiving services. A
related application is to create research samples of large organizations, e.g., employees
at large companies or health care records from large insurers. These data sources are
effectively censuses of their respective populations. Such data sources may play an
important role in future data analyses, since new surveys are ever-more expensive to
mount. Administrative and organizational data sources have potentially high disclo-
sure risks, since large portions of the data are available to members of the public, e.g.,
state social service workers, employees, and physicians. Creating public use files of
these data will require intense data alteration. Traditional disclosure limitation techniques have little chance of generating analytically useful data in such cases, whereas
sampling with synthesis does.
The algorithm for selecting values to synthesize applies outside the sampling with
synthesis context. It could be used, for example, to implement partial synthesis
for a survey sample. The only change in the algorithm for this case is in the risk
measures: they should account for intruders’ uncertainty that the target records are
in the dataset via the methods of Drechsler and Reiter (2008).
For different orderings of the variables, and possibly different orderings of the
records, the algorithm could result in different synthetic datasets. In general, we do
not anticipate these differences to be of great consequence for data utility, since the
algorithm automatically attempts to balance the number of replacement values for
each variable being synthesized. Nonetheless, agencies may be able to obtain slightly
higher quality inferences, e.g., more substitutions of genuine values, by running the
algorithm for several orderings, and selecting the resulting synthetic datasets that
yield the highest quality.
In addition to censuses, sampling with synthesis has the potential to be useful
for large probability samples with confidential data. For example, the sample sizes
in previous decennial long-form surveys were large enough that the Census Bureau
released public use samples rather than the entire surveys; similar issues now arise
with its successor, the American Community Survey. Sampling with synthesis from
a survey requires different, and not yet developed, methods for inferences, since the
additional uncertainty from the initial stage of sampling must be accounted for.
Finally, when missing data are present, it is natural for the agency to use multiple
imputation to handle the missing data and to protect confidentiality simultaneously.
Reiter (2004) develops an approach for doing this in regular partial synthesis settings.
Extending these ideas to sampling with synthesis is a topic for future research.
Appendix
A.1 Derivation of inferences for different samples approach
The analyst of the m different samples seeks f(Q|dsyn). The two-step process for
creating dsyn suggests that
f(Q|dsyn) = ∫ f(Q|Dsyn, dsyn) f(Dsyn|dsyn) dDsyn.     (5)
As in other applications of multiple imputation, we find each component of this
integral by assuming that the analyst’s distributions are identical to those used by
the agency for creating Dsyn. We also assume that the sample sizes are large enough
to permit normal approximations for these distributions. Thus, we require only the
first two moments for each distribution, which can be derived using standard large
sample Bayesian arguments. Diffuse priors are assumed for all parameters; that is, we
presume that the information in the likelihood function dominates any information
in the analyst’s prior distribution. This is reasonable in large datasets.
Let Qi be the estimate of Q in Di; let Qm = ∑_{i=1}^m Qi/m; and let Bm = ∑_{i=1}^m (Qi − Qm)²/(m − 1). Given Qm and Bm, dsyn is irrelevant for inferences about
Q. Hence, from Reiter (2003), f(Q|Dsyn,dsyn) = N(Qm, Bm/m). The posterior vari-
ance, Bm/m, is simply the variance estimator from standard partial synthesis, Tp, for
a census, i.e., set Um = 0.
We next assume that (qi|Di) ∼ N(Qi, ui), which is reasonable for simple random
samples. Here, we need not be concerned with the presence or absence of finite
population correction factors in ui, since, as we shall show, Td does not depend on
ui. We further assume that the sampling variability in each ui is negligible, so that
ui ≈ um. Thus, (Qm|qm, um) ∼ N(qm, um/m). Using Bayesian analysis of variance
arguments, we also have
( (m − 1)bm / (Bm + um) | dsyn ) ∼ χ²_{m−1}.     (6)
From these results, we have f(Q|dsyn, Bm) = N(qm, Bm/m+um/m). We now have
to integrate this normal distribution over f(Bm|dsyn) in (6). We could evaluate this
numerically, but we desire a closed form approximation that public use data analysts
could easily apply. For large m, we have E(Bm|dsyn) ≈ bm − um. Hence, we have
var(Q|dsyn) ≈ (bm − um)/m + um/m = bm/m = Td. (7)
From (7) we approximate f(Q|dsyn) as a t distribution with m−1 degrees of freedom.
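The resulting combining rule for the different samples approach is straightforward to implement. The sketch below computes qm, Td = bm/m, and the m − 1 degrees of freedom from the m released point estimates; the function name is illustrative.

```python
from statistics import mean, variance

def different_samples_inference(q):
    """Combine point estimates q = [q1, ..., qm] from m different
    synthetic samples.  Returns (qm, Td, df), where Td = bm/m and
    inferences use a t distribution with m - 1 degrees of freedom."""
    m = len(q)
    qm = mean(q)
    bm = variance(q)   # between-sample variance, divisor m - 1
    Td = bm / m
    return qm, Td, m - 1
```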
A.2 Derivation of inferences for same sample approach
The analyst again seeks f(Q|dsyn), but now recognizing that the same records com-
prise each di. Let q∞ = lim_{m→∞} ∑_{i=1}^m qi/m; let b∞ = lim_{m→∞} ∑_{i=1}^m (qi − qm)²/(m − 1); and let v∞ = var(Q|q∞, dsyn). We write f(Q|dsyn) as
f(Q|dsyn) = ∫ f(Q|dsyn, q∞, v∞, b∞) f(q∞|dsyn, v∞, b∞) f(v∞, b∞|dsyn) dq∞ dv∞ db∞.     (8)
As before, we assume that the sample sizes are large enough to permit normal ap-
proximations for the distributions involving Q and q∞. Diffuse priors are assumed
for all parameters. We first present results ignoring finite population corrections. We
then incorporate finite population corrections.
To begin, we assume that E(qi|Qi) = Qi, where Qi is the value of Q in synthesized
population Di. This is reasonable since di is a simple random sample from Di. Since
E(Qi|Q) = Q, the E(q∞|Q) = Q. Using the definition of v∞, we therefore have
f(Q|dsyn, q∞, v∞) = N(q∞, v∞). We next assume that f(qi|q∞, b∞) = N(q∞, b∞), so
that f(q∞|dsyn, b∞) = N(qm, b∞/m). The sampling distribution of qi also implies
that
( (m − 1)bm / b∞ | dsyn ) ∼ χ²_{m−1}.     (9)
From these results, the posterior distribution of Q given the variance parameters is
f(Q|dsyn, b∞, v∞) = N(qm, v∞ + b∞/m). (10)
We now turn to f(v∞|dsyn, b∞), which we estimate using a similar approach to
the one in Reiter (2008). First, define vi = var(Q|di, q∞, b∞). Given only one dataset
di, the analyst would use ui to estimate var(Q|di, b∞). Relating ui and vi, and using
an iterated variance computation, we have
ui = E(vi|di, b∞) + var(q∞|di, b∞). (11)
Rewriting this as an expression for vi, we have E(vi|di, b∞) = ui − b∞. We assume
that the sampling distribution of vi has mean v∞, so that E(v∞|dsyn, b∞) equals
E{E(v∞|dsyn, b∞, q∞)|dsyn, b∞} = E{ ∑_i vi/m | dsyn, b∞ } = um − b∞.     (12)
Finally, we assume that the variance in the sampling distribution of vi is of lower
order than v∞. This implies negligible sampling variance in um, which typically is
reasonable in multiple imputation contexts with large sample sizes (Rubin, 1987, Ch.
3). Thus, we write f(v∞|dsyn, b∞) as a distribution concentrated at um − b∞ with
negligible variance.
To obtain f(Q|dsyn), we should replace v∞ with um−b∞ in (10), and integrate with
respect to the distribution of b∞ in (9). Although this integration can be carried out
numerically, we desire a straightforward approximation that can be easily computed
by analysts using dsyn. For large m, we can approximate Var(Q|dsyn) by substituting
bm for b∞. Thus, for inferences without finite population corrections, we have
var(Q|dsyn) ≈ um − bm + bm/m = Ts. (13)
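Equation (13) translates directly into code. The sketch below computes Ts = um − bm + bm/m from the m point estimates and within-sample variance estimates; the function name is illustrative.

```python
from statistics import mean, variance

def same_sample_variance(q, u):
    """Combining rule for m copies of the same sample (no finite
    population correction).  q = [q1, ..., qm] point estimates,
    u = [u1, ..., um] within-sample variance estimates."""
    m = len(q)
    qm = mean(q)
    um = mean(u)
    bm = variance(q)          # between-copy variance, divisor m - 1
    Ts = um - bm + bm / m     # can be negative for small m
    return qm, Ts
```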
For finite population quantities, the derivations proceed as before up to (10). At
this stage, we have to modify the estimate of v∞ to reflect the finite population infer-
ence; we do not modify b∞/m, as it appropriately measures the additional uncertainty
from using finite m. For simple random samples, the modification follows the usual
form: simply multiply the without replacement variance estimate, um − bm, by the
finite population correction factor. This can be justified by large-sample approxima-
tions to f(Q|dsyn, q∞), as we now illustrate for a simple synthesis setting.
Suppose that the agency wishes to perform sampling with synthesis on a census of
N records measured on one variable Y. Let E(Y) = µ and Var(Y) = σ². Suppose
that the agency randomly selects N1 records in the census whose data will be replaced
with synthetic values. The agency draws replacements from f(Y) for each of these
records to create one partially synthetic population Di. It repeats the draws m = ∞
times. From each partially synthetic census, the agency takes a sample of the same
n records, and it releases these records to the public. Suppose that n1 records in the
sample have synthetic values; let n2 = n − n1.
The analyst seeks inferences for the population mean of Y. Let R index the set
of records in D whose values are replaced. In any Di, this mean is Qi = (∑_{j∉R} yj + ∑_{j∈R} yj,i)/N, where yj,i is the replaced value for record j in Di. When m = ∞, the finite population quantity of interest to the analyst, Q, is Q∞ = lim_{m→∞} ∑_{i=1}^m Qi/m. Since E(yj,i) = µ for all j ∈ R, Q∞ = (∑_{j∉R} yj + N1µ)/N is the target of inference.
Of course, the analyst does not know Q and must find its posterior distribution
using only the infinite number of synthetic samples dsyn. Let S be the set of records
in the sample whose values are not replaced. Let y*_j = yj if record j ∉ R, and let y*_j = µ if record j ∈ R. Note that E(y*_j) = µ, but Var(y*_j) = τ² ≠ σ². Each qi equals the sample mean of the n1 replaced values and the n2 not replaced values, so that q∞ = (∑_{j∈S} yj + n1µ)/n = ȳ*. Let E be the set of (N − n) records that are not in the sample (but in the census). Given q∞, the analyst’s target of inference can be written as Q = (nq∞ + (N − n)ȳ*_E)/N, where ȳ*_E is the mean of y*_j for j ∈ E.
Using large-sample approximations, the posterior mean and variance are
E(Q|dsyn, q∞) = (nq∞ + (N − n)E(µ|dsyn, q∞))/N     (14)
Var(Q|dsyn, q∞) = ((N − n)/N)² (E(τ²/(N − n)|dsyn, q∞) + Var(µ|dsyn, q∞)).     (15)
From standard Bayesian theory, E(µ|dsyn, q∞) = q∞; Var(µ|dsyn, q∞) = s²/n; and E(τ²|dsyn, q∞) = s². Here, s² = (∑_{j∈S}(yj − q∞)² + n1(µ − q∞)²)/n. Hence,
E(Q|dsyn, q∞) = q∞     (16)
Var(Q|dsyn, q∞) = ((N − n)/N) s²/n,     (17)
which contains the finite population correction factor.
We next show that u∞ − b∞ ≈ s²/n for large n. Because replacements are drawn from f(Y|µ, σ²), each ui ≈ σ²/n. And b∞ ≈ (n1/n)σ²/n, because the n2 values in S cancel from (qi − q∞) for all i. Hence, u∞ − b∞ = σ²/n − (n1/n)σ²/n = (n2/n)σ²/n. Turning now to s²/n, for large n we have s² ≈ τ². Because N1/N values of y* equal µ, we have τ² = ((N − N1)/N)σ². Finally, for large n, (n2/n) ≈ (N − N1)/N. Hence, we have s²/n ≈ (n2/n)σ²/n, as desired.
Regardless of whether analysts use finite population corrections or not, inferences
for Q can be based on the t-distribution, (qm − Q) ∼ tν(0, Ts). The degrees of freedom, ν, is derived by matching the first two moments of (νTs)/{((1 − n/N)δ + (1 − δ))(um − b∞) + b∞/m} to those of a χ²_ν distribution, as follows. Let φ = b∞/bm, and let f = (1 − n/N)δ + (1 − δ). Then, dropping the ν, the random variable can be expressed as Ts/(f um + bm φ(1/m − f)). We approximate the expectation and
variance of this quantity by using first order Taylor series expansions in φ⁻¹ around its expectation, which equals one. Thus, in the expectation, we essentially substitute one for φ, resulting in an expectation equal to one. Since Var(φ⁻¹) = 2/(m − 1), we
have
Var( Ts / (f um + φ bm(1/m − f)) | dsyn ) = [T²s (bm(1/m − f))² / (f um + bm(1/m − f))⁴] (2/(m − 1))
                                          = [(bm(1/m − f))² / (f um + bm(1/m − f))²] (2/(m − 1)).     (18)
Since a mean square random variable has variance equal to 2 divided by its degrees
of freedom, we conclude that νs = (m − 1)(Ts/(bm/m − f bm))².
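The degrees of freedom νs can be computed as follows. The δ switch between the finite-population-corrected and uncorrected cases is encoded as a use_fpc flag, an implementation choice of ours; Ts is assumed to have been computed with the matching value of f.

```python
def same_sample_df(Ts, bm, m, n=None, N=None, use_fpc=False):
    """Degrees of freedom nu_s = (m - 1) * (Ts / (bm/m - f*bm))**2 for
    same-sample inference, where f = 1 - n/N with a finite population
    correction (delta = 1) and f = 1 without one (delta = 0)."""
    f = (1 - n / N) if use_fpc else 1.0
    return (m - 1) * (Ts / (bm / m - f * bm)) ** 2
```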
A.3 Degrees of freedom for the same sample approach with
stratified sampling
We now derive the degrees of freedom for a reference t-distribution in inferences for
the census mean using the same sample approach with stratified sampling. We have
not found a single formula for these degrees of freedom that works for all estimands.
However, the general approach used here can be applied to linear functions of means
and totals.
Let Wh = Nh/N , and let umh and bmh be the values of um and bm computed in
stratum h. As suggested in Section 3, the point estimate for the census mean of Y is
∑_h Wh ymh, and its estimated variance is

Tst ≈ ∑_{h=1}^H W²h Tsh = ∑_{h=1}^H W²h {(1 − nh/Nh)(umh − bmh) + bmh/m}.
We match the first two moments of (νst Tst)/{∑_{h=1}^H W²h [(1 − nh/Nh)(umh − b∞h) + b∞h/m]} to those of a χ²_{νst} distribution. Let φ = (φ1, ..., φH), with φh = b∞h/bmh, and let fh = (1 − nh/Nh). Dropping νst, the random variable can be expressed as Tst/(∑_{h=1}^H W²h [fh umh + bmh φh(1/m − fh)]). We approximate the expectation and
variance of this quantity by using a multivariate first order Taylor series expansion
in φ⁻¹ around its expectation, which equals a vector of ones. Since Var(φh⁻¹) = 2/(m − 1), we have
Var( Tst / (∑_{h=1}^H W²h [fh umh + bmh φh(1/m − fh)]) | dsyn ) ≈ (2/(m − 1)) (∑_{h=1}^H W²h bmh(1/m − fh))² / T²st.     (19)
Since a mean square random variable has variance equal to 2 divided by its degrees
of freedom, we conclude that νst = (m − 1)(Tst/(∑_{h=1}^H W²h bmh(1/m − fh)))².
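The stratified combining rules can be sketched as one function over per-stratum summaries. The argument layout (parallel lists indexed by stratum) and the function name are illustrative choices.

```python
def stratified_inference(W, um, bm, n, N, m):
    """Stratified same-sample combining rules.

    W, um, bm, n, N: per-stratum lists of weights W_h = N_h/N, within
    variances um_h, between variances bm_h, and sample/population sizes.
    Returns (Tst, nu_st) with
      Tst   = sum_h W_h^2 {(1 - n_h/N_h)(um_h - bm_h) + bm_h/m},
      nu_st = (m - 1) * (Tst / sum_h W_h^2 bm_h (1/m - f_h))**2.
    """
    H = len(W)
    f = [1 - n[h] / N[h] for h in range(H)]
    Tst = sum(W[h] ** 2 * (f[h] * (um[h] - bm[h]) + bm[h] / m)
              for h in range(H))
    denom = sum(W[h] ** 2 * bm[h] * (1 / m - f[h]) for h in range(H))
    nu_st = (m - 1) * (Tst / denom) ** 2
    return Tst, nu_st
```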
7 Supplemental materials
Simulation studies: Results from simulations showing that the inferential methods
in Section 3 of the text have good frequentist properties, and that standard
disclosure limitation techniques can have poor frequentist properties.
References
Abowd, J., Stinson, M., and Benedetto, G. (2006). Final report to the Social Security
Administration on the SIPP/SSA/IRS Public Use File Project. Tech. rep., U.S.
Census Bureau Longitudinal Employer-Household Dynamics Program. Available
at http://www.bls.census.gov/sipp/synth data.html.
Abowd, J. M. and Woodcock, S. D. (2004). Multiply-imputing confidential character-
istics and file links in longitudinal linked data. In J. Domingo-Ferrer and V. Torra,
eds., Privacy in Statistical Databases, 290–297. New York: Springer-Verlag.
Alexander, J. T., Davern, M., and Stevenson, B. (2010). Inaccurate age and sex data
in the Census PUMS files: Evidence and implications. Tech. rep., National Bureau
of Economic Research, Working Paper 15703.
An, D. and Little, R. (2007). Multiple imputation: an alternative to top coding for
statistical disclosure control. Journal of the Royal Statistical Society, Series A 170,
923–940.
Bialek, C. (2010). Census Bureau obscured personal data—too well, some say. Wall
Street Journal, February 5.
Blewett, L. A. and Davern, M. (2007). Distributing state children’s health insurance
funds: A critical review of the design and implementation of the funding formula.
Journal of Health Politics, Policy and Law 32, 415–455.
Dalenius, T. and Reiss, S. P. (1982). Data-swapping: A technique for disclosure
control. Journal of Statistical Planning and Inference 6, 73–85.
Drechsler, J., Bender, S., and Rassler, S. (2008a). Comparing fully and partially syn-
thetic datasets for statistical disclosure control in the German IAB Establishment
Panel. Transactions on Data Privacy 1, 105–130.
Drechsler, J., Dundler, A., Bender, S., Rassler, S., and Zwick, T. (2008b). A new ap-
proach for disclosure control in the IAB Establishment Panel–Multiple imputation
for a better data access. Advances in Statistical Analysis 92, 439 – 458.
Drechsler, J. and Reiter, J. P. (2008). Accounting for intruder uncertainty due to
sampling when estimating identification disclosure risks in partially synthetic data.
In J. Domingo-Ferrer and Y. Saygin, eds., Privacy in Statistical Databases (LNCS
5262), 227–238. New York: Springer-Verlag.
Duncan, G. T. and Lambert, D. (1989). The risk of disclosure for microdata. Journal
of Business and Economic Statistics 7, 207–217.
Fienberg, S. E. (1994). A radical proposal for the provision of micro-data samples and
the preservation of confidentiality. Tech. rep., Department of Statistics, Carnegie-
Mellon University.
Fienberg, S. E., Makov, U. E., and Sanil, A. P. (1997). A Bayesian approach to
data disclosure: Optimal intruder behavior for continuous data. Journal of Official
Statistics 13, 75–89.
Fuller, W. A. (1993). Masking procedures for microdata disclosure limitation. Journal
of Official Statistics 9, 383–406.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2004). Bayesian Data
Analysis. London: Chapman & Hall.
Graham, P. and Penny, R. (2005). Multiply imputed synthetic data files. Tech. rep.,
University of Otago, http://www.uoc.otago.ac.nz/departments/pubhealth/pgrahpub.htm.
Hawala, S. (2008). Producing partially synthetic data to avoid disclosure. In Pro-
ceedings of the Joint Statistical Meetings. Alexandria, VA: American Statistical
Association.
Karr, A. F., Kohnen, C. N., Oganian, A., Reiter, J. P., and Sanil, A. P. (2006). A
framework for evaluating the utility of data altered to protect confidentiality. The
American Statistician 60, 224–232.
Kennickell, A. and Lane, J. (2006). Measuring the impact of data protection tech-
niques on data utility: Evidence from the Survey of Consumer Finances. In
J. Domingo-Ferrer, ed., Privacy in Statistical Databases 2006 (Lecture Notes in
Computer Science), 291–303. New York: Springer-Verlag.
Kennickell, A. B. (1997). Multiple imputation and disclosure protection: The case of
the 1995 Survey of Consumer Finances. In W. Alvey and B. Jamerson, eds., Record
Linkage Techniques, 1997, 248–267. Washington, D.C.: National Academy Press.
Kinney, S. K. and Reiter, J. P. (2007). Making public use, synthetic files of the
Longitudinal Business Database. In Proceedings of the Joint Statistical Meetings.
Alexandria, VA: American Statistical Association.
Little, R. J. A. (1993). Statistical analysis of masked data. Journal of Official
Statistics 9, 407–426.
Little, R. J. A., Liu, F., and Raghunathan, T. E. (2004). Statistical disclosure tech-
niques based on multiple imputation. In A. Gelman and X. L. Meng, eds., Applied
Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, 141–
152. New York: John Wiley & Sons.
Mitra, R. and Reiter, J. P. (2006). Adjusting survey weights when altering identifying
design variables via synthetic data. In J. Domingo-Ferrer and L. Franconi, eds.,
Privacy in Statistical Databases, 177–188. New York: Springer-Verlag.
Raghunathan, T. E., Reiter, J. P., and Rubin, D. B. (2003). Multiple imputation for
statistical disclosure limitation. Journal of Official Statistics 19, 1–16.
Reiter, J. P. (2003). Inference for partially synthetic, public use microdata sets.
Survey Methodology 29, 181–189.
Reiter, J. P. (2004). Simultaneous use of multiple imputation for missing data and
disclosure limitation. Survey Methodology 30, 235–242.
Reiter, J. P. (2005a). Estimating identification risks in microdata. Journal of the
American Statistical Association 100, 1103–1113.
Reiter, J. P. (2005b). Releasing multiply-imputed, synthetic public use microdata:
An illustration and empirical study. Journal of the Royal Statistical Society, Series
A 168, 185–205.
Reiter, J. P. (2005c). Using CART to generate partially synthetic, public use micro-
data. Journal of Official Statistics 21, 441–462.
Reiter, J. P. (2008). Multiple imputation when records used for imputation are not
used or disseminated for analysis. Biometrika 95, 933–946.
Reiter, J. P. and Mitra, R. (2009). Estimating risks of identification disclosure in
partially synthetic data. Journal of Privacy and Confidentiality 1, 99–110.
Reiter, J. P. and Raghunathan, T. E. (2007). The multiple adaptations of multiple
imputation. Journal of the American Statistical Association 102, 1462–1471.
Rubin, D. B. (1981). The Bayesian bootstrap. The Annals of Statistics 9, 130–134.
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York:
John Wiley & Sons.
Rubin, D. B. (1993). Discussion: Statistical disclosure limitation. Journal of Official
Statistics 9, 462–468.
Schenker, N. and Raghunathan, T. E. (2007). Combining information from multiple
surveys to enhance estimation of measures of health. Statistics in Medicine 26,
1802–1811.
U. S. Bureau of the Census (2003). 2000 Census of Population and Housing, Public
Use Microdata Sample, United States: Technical documentation.
Winkler, W. E. (2007). Examples of easy-to-implement, widely used methods of
masking for which analytic properties are not justified. Tech. rep., U.S. Census
Bureau Research Report Series, No. 2007-21.
Woo, M. J., Reiter, J. P., Oganian, A., and Karr, A. F. (2009). Global measures of
data utility for microdata masked for disclosure limitation. Journal of Privacy and
Confidentiality 1, 111–124.
Yancey, W. E., Winkler, W. E., and Creecy, R. H. (2002). Disclosure risk assessment
in perturbative microdata protection. In J. Domingo-Ferrer, ed., Inference Control
in Statistical Databases, 135–152. Berlin: Springer-Verlag.