Sampling with Synthesis: A New Approach for
Releasing Public Use Census Microdata
Jörg Drechsler and Jerome P. Reiter∗
Abstract
Many statistical agencies disseminate samples of census microdata, i.e., data
on individual records, to the public. Before releasing the microdata, agencies
typically alter identifying or sensitive values to protect data subjects’ confiden-
tiality, for example by coarsening, perturbing, or swapping data. These stan-
dard disclosure limitation techniques distort relationships and distributional
features in the original data, especially when applied with high intensity. Fur-
thermore, it can be difficult for analysts of the masked public use data to adjust
inferences for the effects of the disclosure limitation. Motivated by these short-
comings, we propose an approach to census microdata dissemination called
sampling with synthesis. The basic idea is to replace the identifying or sensi-
tive values in the census with multiple imputations, and release samples from
these multiply-imputed populations. We demonstrate that sampling with syn-
thesis can improve the quality of public use data relative to sampling followed
∗Jorg Drechsler is research scientist, Institute for Employment Research, Department for Statisti-
cal Methods, Regensburger Straße 104, 90478 Nurnberg, Germany,(e-mail: [email protected]);
and Jerome P. Reiter is Mrs. Alexander F. Hehmeyer Associate Professor of Statistical Science, Duke
University, Durham, NC 27708-0251, (e-mail: [email protected]). This research was supported
by a grant from the National Science Foundation (NSF-MMS-0751671).
1
by standard statistical disclosure limitation; simulation results showing this
are available online as supplemental material. We derive methods for analyzing
the multiple datasets generated by sampling with synthesis. We present algo-
rithms for selecting which census values to synthesize based on considerations
of disclosure risk and data utility. We illustrate sampling with synthesis on a
population constructed with data from the U. S. Current Population Survey.
Keywords: Confidentiality, Disclosure, Multiple imputation, PUMS, Synthetic
1 INTRODUCTION
Many national statistical agencies disseminate microdata, i.e., data on individual
records, to the public. Wide dissemination of microdata facilitates advances in sci-
ence and public policy, enables students to develop skills at data analysis, and helps
ordinary citizens learn about their communities. It also can improve the quality of
agencies’ future data collection efforts via feedback from those analyzing the data.
Disseminating microdata is complicated when the data come from a census. Large
files can be unwieldy for users, and releasing census data can compromise the data
subjects’ confidentiality. Agencies address the file size issue by releasing a randomly
selected sample of records from the census. As examples, the U.S. Bureau of the Cen-
sus releases both 1% and 5% samples from the decennial census, and statistical agen-
cies across the world deposit samples from censuses with the Integrated Public Use
Microdata Series international project; sampling fractions range from .07% in India to
10% in South American countries (https://international.ipums.org/international/).
In addition to reducing file size, sampling can reduce the risks of unintended dis-
closures. The literature on data confidentiality (e.g., Duncan and Lambert, 1989;
Fienberg et al., 1997; Reiter, 2005a) indicates that malicious users—henceforth called
intruders—seeking to identify data subjects have better chances of success when they
know that their targets are in the released data, as is the case in a census (except for
coverage errors and nonresponse). With sampling, intruders no longer are guaranteed
that their targets are in the released data.
Sampling alone, however, may not sufficiently reduce disclosure risks for records
with unusual characteristics. Intruders still may be able to link such records in the
released data to records in external files by matching on variables common to both
sources. Therefore, agencies typically alter identifying or sensitive data values before
releasing the samples. For example, in its public use microdata, the U.S. Bureau of the
Census aggregates geographical information, collapses levels of categorical variables,
perturbs individuals' ages in large households, swaps data values across records, and uses
top codes for incomes, i.e., sets all incomes exceeding a threshold equal to a constant
(U. S. Bureau of the Census, 2003, p. 2-1).
Such disclosure-protected, census microdata samples are widely used by the pub-
lic. Social service agencies use census microdata samples in the allocation formulas
to fund some government programs and for general program planning (Blewett and
Davern, 2007). Survey organizations use census microdata samples to construct
post-stratification weights. Demographers, economists, political scientists, sociolo-
gists, statisticians, and many others use census public use microdata samples in their
research and teaching. All of these data users demand high quality public use files;
hence, they have a keen interest in how public use data are protected. A recent illus-
tration of this interest is the analysis done by Alexander et al. (2010), who uncovered
substantial inconsistencies between analyses done with disclosure-protected, public
use census microdata samples and those done with actual census data. Their findings
caused enough concern among data users that the Wall Street Journal ran a story on
the effects of data protection on the quality of secondary analyses (Bialek, 2010).
As resources available to intruders continue to expand, agencies using standard
disclosure limitation techniques—like coarsening, perturbation, and swapping—on
samples of census data may need to apply them with high intensities to protect
confidentiality adequately. Unfortunately, this protection has a price: the released
data can be seriously degraded for statistical analysis. Furthermore, it can be difficult
for data users to account for the effects of the disclosure limitation in inferences.
Motivated by the shortcomings of standard disclosure limitation, particularly at
high intensities, we propose an alternative approach to disseminate public use cen-
sus microdata: sampling with synthesis. The basic idea is to replace identifying or
sensitive values in the census with multiple imputations before releasing samples to
the public. Specifically, the agency (i) selects the set of values to replace in the cen-
sus with imputations, so that the released data comprise a mixture of genuine and
simulated data; (ii) determines the synthesis models with the entire census, thereby
taking advantage of all available information; (iii) repeatedly simulates replacement
values for the selected data to create multiple, disclosure-protected populations; and,
(iv) releases samples from the populations. The agency can sample independently
in each population, or it can select the same records from each population. As we
argue in Section 2 and empirically demonstrate with simulations summarized in the
online supplement, using sampling with synthesis rather than sampling with standard
disclosure limitation can improve the quality of the public release samples.
Multiple imputation has been previously suggested for protecting confidentiality
of microdata collected in random samples. This idea, also called synthetic data, was
first proposed by Rubin (1993) and Little (1993); a related approach was proposed
by Fienberg (1994). Our approach extends these methodologies. Specifically, Rubin’s
(1993) approach, now called full synthesis, involves replacing all values and releasing
independent samples from the simulated populations; our approach allows the agency
to replace only some values and, if desired, to release a common set of records from
the populations. Little’s (1993) approach, now called partial synthesis, involves re-
placing only a subset of values and releasing the disclosure-protected populations; our
approach allows agencies to release samples from these populations.
Because the released data are generated in a novel way, inferential methods for
full synthesis (Raghunathan et al., 2003) and partial synthesis (Reiter, 2003) are
not appropriate for sampling with synthesis. In Section 3, we present inferential
methods that appropriately account for the process of data generation. These are
based on adaptations of multiple imputation combining rules (Rubin, 1987; Reiter
and Raghunathan, 2007).
Key to our approach is selecting the values to be replaced with multiple imputa-
tions, i.e., the first stage of the process. We use an algorithm that, given a set of risky
records, provides a principled way for agencies to maintain acceptable confidentiality
protection with minimal synthesis. We apply this algorithm to perform sampling with
synthesis on a population constructed with the U.S. Current Population Survey; the
algorithm and the application are summarized in Section 4. We discuss issues related
to implementation in very large censuses in Section 5. We conclude with directions
for future research and other applications of sampling with synthesis in Section 6.
2 MOTIVATION FOR THE APPROACH
To motivate sampling with synthesis, we compare its features with those of standard
disclosure limitation strategies, including perturbation, swapping, aggregation, and
top coding. The discussion is in accord with the results of Winkler (2007), who showed
that standard statistical disclosure control methods fail in terms of data utility or
disclosure risk even in simple settings. In the online supplement, we present results
of simulation studies showing that sampling with synthesis has the potential to avoid
the problems documented by Winkler (2007).
Adding random noise. Agencies can disguise identifying or sensitive values by
adding some randomly selected amount to the observed values, for example a random
draw from a normal distribution with mean equal to zero. To ensure confidentiality,
agencies may need to draw from a distribution with large variance. This introduces
measurement error that, for example, stretches marginal distributions and attenuates
regression coefficients (Yancey et al., 2002); see the online supplement for illustrations
of this attenuation. In contrast, synthetic data explicitly aims to preserve marginal
and joint distributions of the data attributes through modeling. Hence, it has the
potential to avoid the pitfalls of adding random noise.
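The attenuation can be demonstrated in a few lines of simulation (a hypothetical toy example, not the supplement's study): adding noise with variance 1 to a predictor that itself has variance 1 roughly halves the OLS slope, since the slope shrinks by the factor var(x)/(var(x) + var(noise)).

```python
import random

random.seed(1)
n = 50_000
# Genuine data: y = 2*x + error, with var(x) = 1.
x = [random.gauss(0, 1) for _ in range(n)]
y = [2 * xi + random.gauss(0, 0.5) for xi in x]

def ols_slope(xs, ys):
    # Simple least-squares slope of ys on xs.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

# Masked data: add N(0, 1) noise to x before release.
x_masked = [xi + random.gauss(0, 1) for xi in x]

slope_true = ols_slope(x, y)           # close to the genuine slope of 2
slope_masked = ols_slope(x_masked, y)  # attenuated toward 2 * 1/(1+1) = 1
print(slope_true, slope_masked)
```

Synthetic data sidesteps this because the released values are draws from a model for the joint distribution rather than the genuine values plus error.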
Data swapping. Agencies can swap data values for selected records, e.g., switch
values of age, race, and sex for at-risk records with those for other records, to discour-
age users from matching, since matches may be based on incorrect data (Dalenius and
Reiss, 1982). Swapping is used extensively by government agencies. It is generally
presumed that swapping fractions are low—agencies do not reveal the rates to the
public—because swapping at high levels destroys relationships involving the swapped
and unswapped variables. However, even seemingly modest swapping can be prob-
lematic. For example, using data from the Survey of Youth in Custody, Mitra and
Reiter (2006) find that 5% random swapping of two identifying variables results in
very poor confidence interval coverage rates for regression coefficients involving those
variables. In contrast, partial synthesis of 100% of these variables results in near
nominal coverage rates with greater protection. Little et al. (2004) illustrate similar
benefits of synthesis over swapping. In the supplement, we illustrate that even 1%
swapping can result in terrible coverage rates.
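The mechanism behind these coverage failures can be seen in a toy simulation (illustrative only; not the Mitra and Reiter study). Swapping a binary identifier between randomly chosen pairs of records preserves its marginal distribution exactly, but for each swapped record the identifier becomes unrelated to that record's outcome, attenuating the association:

```python
import random

random.seed(2)
n = 20_000
# Genuine data: binary identifier x, outcome y agreeing with x 90% of the time.
x = [random.random() < 0.5 for _ in range(n)]
y = [xi if random.random() < 0.9 else not xi for xi in x]

def assoc(xs, ys):
    # Difference in mean(y) between records with x = 1 and x = 0.
    y1 = [b for a, b in zip(xs, ys) if a]
    y0 = [b for a, b in zip(xs, ys) if not a]
    return sum(y1) / len(y1) - sum(y0) / len(y0)

# Swap x between 5% of record pairs (10% of records are touched).
x_swapped = x[:]
idx = random.sample(range(n), int(0.10 * n))
for i, j in zip(idx[::2], idx[1::2]):
    x_swapped[i], x_swapped[j] = x_swapped[j], x_swapped[i]

a_true = assoc(x, y)             # about 0.8 in the genuine data
a_swapped = assoc(x_swapped, y)  # attenuated, roughly 0.9 * 0.8 = 0.72
print(a_true, a_swapped)
```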
Top coding. Monetary variables and ages are frequently reported with top codes,
and sometimes with bottom codes as well. Top or bottom coding by definition elimi-
nates detailed inferences about the distribution beyond the thresholds. Chopping off
tails also negatively impacts estimation of whole-data quantities. For example, Ken-
nickell and Lane (2006) show that commonly used top codes distort estimates of the
Gini coefficient, a key measure of income inequality. In contrast, An and Little (2007)
show that replacing large values with draws from models that respect the thresholds,
e.g., so that replacement imputations are larger than the top code, can preserve tail features.
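The Gini distortion from top coding is easy to reproduce on simulated data (a hypothetical illustration, not the Kennickell and Lane analysis): censoring at the 99th percentile erases the upper tail's contribution to measured inequality.

```python
import random

random.seed(3)
n = 10_000
# Skewed hypothetical "incomes" from a lognormal distribution.
incomes = [random.lognormvariate(10, 1) for _ in range(n)]

def gini(values):
    # Gini coefficient via the sorted-values formula.
    v = sorted(values)
    k = len(v)
    cum = sum((i + 1) * xi for i, xi in enumerate(v))
    return 2 * cum / (k * sum(v)) - (k + 1) / k

# Top code: censor everything above the 99th percentile at the threshold.
threshold = sorted(incomes)[int(0.99 * n)]
top_coded = [min(xi, threshold) for xi in incomes]

g_full = gini(incomes)     # about 0.52 for lognormal with sigma = 1
g_coded = gini(top_coded)  # smaller: tail inequality is erased
print(g_full, g_coded)
```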
Aggregation. Aggregation reduces disclosure risks by turning atypical records—
which generally are most at risk—into typical records. For example, there may be
only one person with a particular combination of demographic characteristics in a
county but many people with those characteristics in a state. Aggregation makes
analysis at finer levels difficult and often impossible, and it creates problems of eco-
logical inferences. With synthetic data, it is possible to preserve finer details. For
example, agencies can release geography at higher resolutions by intense synthesis of
demographic identifiers, yet still maintain relationships in the data. This would not
be possible with aggregation coupled with intense swapping or noise addition.
A key advantage of sampling with synthesis compared to sampling with standard
disclosure limitation is the relative ease of inference. Sampling with synthesis enables
analysts to make valid inferences using standard, complete-data statistical methods
and software. For most standard disclosure limitation methods, analysts need to
model the disclosure limitation procedure in the likelihood function (Little, 1993) or
use measurement error models (Fuller, 1993), both of which are complicated for com-
plex estimands. Thus, with sampling with synthesis, analysts can focus on modeling
the science of their problem rather than trying to account for the effects of disclosure
limitation on inferences.
Because of the potential advantages of partially synthetic data over standard
disclosure limitation, several U. S. statistical agencies have adopted the approach
to create public use products. Among these are the Survey of Consumer Finances
(Kennickell, 1997), the American Community Survey group quarters data (Hawala,
2008), the Survey of Income and Program Participation (Abowd et al., 2006), the
Longitudinal Business Database (Kinney and Reiter, 2007), the Longitudinal Em-
ployer Household Dynamics program (Abowd and Woodcock, 2004), and On The
Map (http://lehdmap.did.census.gov/). Other statistical agencies developing syn-
thetic data include Statistics New Zealand (Graham and Penny, 2005) and the Ger-
man Institute for Employment Research (Drechsler et al., 2008a,b).
3 INFERENCE FOR SAMPLING WITH SYN-
THESIS
Let D denote the data for the census of a population of N units. To implement
sampling with synthesis, the agency first replaces identifying or sensitive values in D
with multiple imputations. Imputation models and their parameters are determined
using the at-risk records in D. Imputations are done independently m times, resulting
in m synthetic populations, Dsyn = {Di : i = 1, . . . , m}. The agency then takes a
sample of n < N records from each Di. The agency can choose records independently
in each Di, which we call the different samples approach, or choose the same set of
records in each Di, which we call the same sample approach. We discuss the merits
of each approach in Section 5. For now, we assume that the agency uses simple
random sampling; we extend to stratified sampling shortly. The agency releases the
m synthetic samples, dsyn = {di : i = 1, . . . , m}, to the public.
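The four steps, and the difference between the two sampling options, can be sketched end to end on toy data. This is hypothetical Python; the "synthesis model" here is simply resampling from the empirical distribution of the census values, a stand-in for the real imputation models:

```python
import random

random.seed(4)
N, n, m = 1000, 100, 5
# Toy census: (record id, sensitive value); 10% of records flagged at risk.
census = [(i, random.gauss(50, 10)) for i in range(N)]
at_risk = set(random.sample(range(N), 100))

# Steps (ii)-(iii): determine the synthesis "model" from the entire census
# (here, its empirical distribution) and draw m synthetic populations,
# replacing only the at-risk values.
pool = [v for _, v in census]
populations = []
for _ in range(m):
    D_i = [(i, random.choice(pool)) if i in at_risk else (i, v)
           for i, v in census]
    populations.append(D_i)

# Step (iv), same sample approach: one set of record ids reused everywhere.
same_ids = set(random.sample(range(N), n))
same_samples = [[r for r in D_i if r[0] in same_ids] for D_i in populations]
# Different samples approach: records drawn independently in each population.
diff_samples = [random.sample(D_i, n) for D_i in populations]
```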
The analyst of dsyn seeks inferences about some estimand Q, such as a population
mean or regression coefficient. In each di, the analyst estimates Q with some point
estimator q and estimates the variance of q with some estimator u, where u is specified
ignoring any finite population correction factors; for example, when q is the sample
mean, u = s2/n. For i = 1, . . . , m, let qi and ui be respectively the values of q and u
in di. The following quantities are needed for inferences:
qm = Σ_{i=1}^{m} qi/m,    (1)

bm = Σ_{i=1}^{m} (qi − qm)^2/(m − 1),    (2)

um = Σ_{i=1}^{m} ui/m.    (3)
For the different samples approach, the analyst uses qm to estimate Q and Td =
bm/m to estimate the variance of qm. Inferences are based on the t-distribution,
(qm − Q) ∼ tm−1(0, Td). Derivations of these methods are in the appendix.
For the same sample approach, the analyst again uses qm to estimate Q. The
variance estimate is Ts = ((1 − n/N)δ + (1 − δ))(um − bm) + bm/m. Here, δ = 1
when the analyst uses the finite population correction factor, and δ = 0 when the
analyst does not use the finite population correction factor. Inferences are based
on the t-distribution, (qm − Q) ∼ tν(0, Ts), where ν = (m − 1)(Ts/(((1 − n/N)δ +
(1 − δ))bm − bm/m))2. Derivations of these methods are in the appendix. It is
theoretically possible that Ts < 0. When this occurs, we adopt the conservative
estimate, ((1−n/N)δ+(1−δ))um. Negative variances are unlikely when only modest
fractions of values (e.g., 10% or less) are replaced in the census and generally can be
avoided with large m (e.g., m ≥ 25).
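These rules translate directly into code. The sketch below (the helper name swsyn_inference is ours, not from the paper, and the degrees of freedom attached to the conservative fallback are our assumption) returns the point estimate, variance estimate, and degrees of freedom for either approach:

```python
def swsyn_inference(qs, us, n, N, approach="same", use_fpc=True):
    """Combining rules of Section 3 for m point estimates qs and
    variance estimates us; returns (q_bar, variance, degrees of freedom)."""
    m = len(qs)
    q_bar = sum(qs) / m                               # equation (1)
    b = sum((q - q_bar) ** 2 for q in qs) / (m - 1)   # equation (2)
    u_bar = sum(us) / m                               # equation (3)
    if approach == "different":
        return q_bar, b / m, m - 1                    # T_d with t_{m-1}
    g = (1 - n / N) if use_fpc else 1.0               # (1 - n/N)δ + (1 - δ)
    T = g * (u_bar - b) + b / m                       # T_s
    if T < 0:
        return q_bar, g * u_bar, m - 1                # conservative fallback
    nu = (m - 1) * (T / (g * b - b / m)) ** 2         # degrees of freedom ν
    return q_bar, T, nu
```

A 95% interval is then q_bar ± t_{df, 0.975} √T.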
Why are new inferential methods necessary when using sampling with synthesis?
Consider first the variance estimator developed by Raghunathan et al. (2003) for full
synthesis, Tf = bm − ((1 − n/N)δ + (1 − δ))um + bm/m. When replacing only a small
fraction of values in sampling with synthesis, bm < ((1 − n/N)δ + (1 − δ))um, so that
Tf generally would be negative if applied in sampling with synthesis contexts.
Next consider the variance estimator developed by Reiter (2003) for partial syn-
thesis, Tp = ((1 − n/N)δ + (1 − δ))um + bm/m. Obviously, Tp can be substantially
larger than Ts or Td. Reiter’s (2003) derivations presume that the records used to
estimate the imputation models are the same as those used in the analysis. However,
this is not the case for sampling with synthesis: the imputation model utilizes the
entire census but analyses are based only on available samples. As a result of this
mismatch, Tp is positively biased if applied in sampling with synthesis contexts. A
related bias was identified in the context of multiple imputation for missing data by
Schenker and Raghunathan (2007) and Reiter (2008).
One might wonder why we do not take the sample first, estimate the synthesis
model with the sample second, and then impute replacements. This strategy is equiv-
alent to partially synthetic data, acting as if the sample was the only available data
to the agency. By generating imputations from models determined with the entire
census rather than with a sample from it, the agency enables analysts’ inferences to
be more efficient than inferences with standard partial synthesis. Using the entire
census for imputation has another interesting implication: inferences based on dsyn
can be more efficient than inferences based on the original data from the sample alone.
This is because the multiple imputations pass additional information from the census
to the analysts’ inferences. This is evident in the application described in the next
section and in the simulations presented in the online supplement.
To improve estimation accuracy in the public use samples, agencies can select
records using stratified random sampling. Analysts should account for the stratifi-
cation in their inferences. For model-based analyses, this involves including func-
tions of the stratum indicators in models (Gelman et al., 2004, Chapter 7) and using
the t-distributions previously outlined in this section with δ = 0. For design-based
(survey-weighted) analyses, the t-distribution based on Td is appropriate for the dif-
ferent samples approach. For the same sample approach, design-based estimation
requires modified procedures analogous to those from traditional stratified sampling
estimation. To illustrate, let Nh be the population size and nh be the sample size
in stratum h, where h = 1, . . . , H. To estimate the census mean of some Y, let
qmh be the sample mean of Y in stratum h, and let Tsh be the value of Ts com-
puted with the records in stratum h using the finite population correction factor
(1− nh/Nh). Analysts can use qmh and Tsh in inferences for the census mean in stra-
tum h. For the entire census mean, the point estimate is qm = Σ_h (Nh/N) qmh, and
its estimated variance is Ts = Σ_h (Nh/N)^2 Tsh. Design-based estimates of functions of
means and totals, such as domain quantities or regression coefficients, are obtained
by using relevant qm to estimate the component census means and totals; estimated
variances can be determined by linearization methods. Inferences can be based on
(Q − qm) ∼ N(0, Ts). In cases where m is small (e.g., m ≤ 5) and the amount of
replaced data is large (e.g., more than 50% of some variable), analyses can be im-
proved by using (Q− qm) ∼ tνst(0, Ts); an approach for determining νst is outlined in
the appendix. For many contexts, νst is large enough that a normal distribution is
adequate for inferences.
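For the stratified same sample estimator of the census mean, the combination of per-stratum quantities is a weighted sum; a minimal sketch (helper name ours), assuming qmh and Tsh have already been computed within each stratum:

```python
def stratified_census_mean(strata):
    """Combine per-stratum estimates into an estimate of the census mean.
    `strata` holds (N_h, q_bar_h, T_s_h) for h = 1, ..., H."""
    N = sum(Nh for Nh, _, _ in strata)
    q = sum((Nh / N) * qh for Nh, qh, _ in strata)       # q_m
    T = sum((Nh / N) ** 2 * Th for Nh, _, Th in strata)  # T_s
    return q, T

# Hypothetical example: two strata of sizes 600 and 400.
q, T = stratified_census_mean([(600, 10.0, 0.4), (400, 20.0, 0.9)])
```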
Table 1: Description of variables used in the empirical study
Variable Label Range
Sex X male, female
Race R white, black, American Indian, Asian
Marital status M 7 categories, coded 1–7
Highest attained education level E 16 categories, coded 31–46
Age (years) A 0 – 90
Number of people in household H 1 – 16
Number of people in household under age 18 Y 0, 1 – 11
Household property taxes ($) P 0, 1 – 99,997
Social security payments ($) S 0, 1 – 50,000
Household income ($) I -21,011 – 768,742
4 APPLICATION ON A CONSTRUCTED POP-
ULATION
To illustrate sampling with synthesis, we use a subset of public use data from the
March 2000 U.S. Current Population Survey. The data comprise nine variables mea-
sured on 51,016 heads of households; see Table 1. These variables include demographic
data—age, race, sex, and marital status—that are key identifiers in many censuses
of individuals. They also include a mix of other variables that represent several
generic modeling challenges, including numerical and categorical variables, numerical
variables with clearly non-Gaussian distributions, and numerical variables with large
percentages of values equal to zero. Similar data are used by Reiter (2005b,c) to
illustrate and evaluate releasing synthetic microdata for probability samples.
Although these data are truly a sample, we suppose that they are a census to
illustrate the process of sampling with synthesis. We suppose that the intruder knows
age, race, marital status, and sex precisely for all records in the census; the intruder
does not know the other variables. There are 521 records with unique combinations
of age, race, marital status, and sex; and, there are 578 combinations of the four
variables with two cases. Compared to the full population, these 1089 people are
disproportionately non-white (17% white versus 86% white in the full population), with
roughly equal distribution among non-white races; have marital status other than married
civilian (13% versus 53% married civilians); and are more likely to be women (52% versus 43%).
To make the public use file, we simulate values of age, race, and marital status in
the census for these 1089 records; we leave sex unchanged. This selection ensures that,
for records with no synthesized data, intruders at best have a one in three chance of
making a correct identification based only on the four demographic variables. If one
in three is not deemed low enough, the agency can simulate data for more records.
Intruders might have access to other variables, for example property taxes, in which
case the agency may want to simulate those variables as well.
We begin the process by synthesizing age, race, and marital status for all 1089
records using the models described in Section 4.1. This results in m = 5 synthetic
populations. We then assess the disclosure risks in the synthesized populations using
the approach described in Section 4.2. To increase data usefulness, we add back real
values to these populations in ways that do not substantially increase the disclosure
risks; the algorithm for doing so is described in Section 4.3. Finally, we take a 10%
sample from the synthetic populations using the same sample approach of Section 3.
The resulting five datasets would be released to the public. We evaluate inferences
from the samples for selected estimands in Section 4.4. Similar evaluations for the
different samples approach are presented in the online supplement.
4.1 Generating the synthetic populations
We use classification and regression trees (CART) to generate synthetic data (Reiter,
2005c). CART has several advantages for synthesis: it handles diverse data types,
captures non-linear relationships and complex interactions, and runs quickly on large
datasets. Furthermore, it can be applied without much tuning.
For the synthesis of age, race, and marital status, we proceed sequentially. First,
using the 1089 records, we fit a regression tree of age on all other variables except
race and marital status. Label the age tree as Y(A). We grow Y(A) by finding the
splits that successively minimize the deviance of age in the leaves. We cease splitting
any particular leaf when the deviance of ages in that leaf is less than .0001 or when
we cannot ensure at least five records in each child leaf. These are default specifi-
cations of tuning parameters in many applications of regression trees. We did not
prune the leaves further, as experiments with further pruning worsened the quality of
the synthetic datasets without substantially improving the confidentiality protection.
In other contexts, agencies may need to prune the leaves to respect confidentiality
criteria, as discussed in Reiter (2005c).
Let L_Aw be the wth leaf in Y(A), and let Y(A)_{L_Aw} be the n_{L_Aw} values of age in
leaf L_Aw. In each L_Aw in the tree, we generate a new set of values by drawing
from Y(A)_{L_Aw} using the Bayesian bootstrap (Rubin, 1981). These sampled values are
the replacement imputations for the n_{L_Aw} units that belong to L_Aw. Repeating the
Bayesian bootstrap in each leaf of the age tree results in the ith set of synthetic ages,
Y(A)_{rep,i}. We repeat this process m = 5 times to generate five populations with synthetic
ages, {D1, . . . , D5}.
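The within-leaf draws are the only stochastic step of the synthesis. A minimal sketch of the Bayesian bootstrap (Rubin, 1981) applied to one hypothetical leaf: Dirichlet(1, . . . , 1) weights are realized as the spacings of sorted uniforms, and replacement values are drawn with those weights.

```python
import random

def bayesian_bootstrap(values, size, rng):
    """Draw `size` replacement values from `values` via the Bayesian
    bootstrap: uniform spacings give Dirichlet(1, ..., 1) weights."""
    k = len(values)
    cuts = sorted(rng.random() for _ in range(k - 1))
    bounds = [0.0] + cuts + [1.0]
    weights = [bounds[i + 1] - bounds[i] for i in range(k)]
    return rng.choices(values, weights=weights, k=size)

rng = random.Random(7)
# Hypothetical leaf: the ages of the records falling in one leaf of Y(A).
leaf_ages = [23, 24, 24, 25, 27, 28, 31]
synthetic_ages = bayesian_bootstrap(leaf_ages, len(leaf_ages), rng)
```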
We next simulate values of marital status. In each Di, we fit the marital status tree
Y(M) with all variables except race as predictors. We minimize the Gini index to determine
the recursive binary splits. This is a default criterion for many applications of CART
for categorical outcomes. We again require a minimum of five records in each leaf and
do not otherwise prune the tree. To maintain consistency with Y(A)_{rep,i}, units' leaves
in Y(M) are located using Y(A)_{rep,i}. Occasionally, some units may have combinations of
values that do not belong to one of the leaves of Y(M). For these units, we search up
the tree until we find a node that contains the combination, and treat that node as if
it were the unit's leaf. Once each unit's leaf is located, values of Y(M)_{rep,i} are generated
using the Bayesian bootstrap. Imputing races follows the same process: we fit the
tree Y(R) using all variables as predictors, place each unit in the leaves of Y(R) based
on their synthesized values of age and marital status, and sample new races using the
Bayesian bootstrap. The entire process is repeated for each Di, resulting in m = 5
synthetic populations.
All CART models are fit in R using the “tree” function. The sequence of impu-
tations is A − M − R; see Reiter (2005c) for a discussion of sequencing.
4.2 Disclosure risk
To evaluate disclosure risks, we compute probabilities of identification using methods
developed by Reiter and Mitra (2009) for partially synthetic data. We summarize
these methods below for a census; see Drechsler and Reiter (2008) for modifications
for sample data. Related approaches are described by Duncan and Lambert (1989),
Fienberg et al. (1997), and Reiter (2005a).
Suppose the intruder has a vector of information, t, on a particular target unit in
the population D. Let t0 be the unique identifier of the target, and let Dj0 be the
(not released) unique identifier for record j in Dsyn, where j = 1, . . . , N. Let S be
any information released about the simulation models.
The intruder's goal is to match unit j in Dsyn to the target when Dj0 = t0. Let
J be a random variable that equals j when Dj0 = t0 for j ∈ Dsyn. The intruder thus
seeks to calculate Pr(J = j|t, Dsyn, S) for j = 1, . . . , N. Because the intruder
does not know the actual values in Yrep, he or she should integrate over its possible
values when computing the match probabilities. Hence, for each record we compute
Pr(J = j|t, Dsyn, S) = ∫ Pr(J = j|t, Dsyn, Yrep, S) Pr(Yrep|t, Dsyn, S) dYrep.    (4)
This construction suggests a Monte Carlo approach to estimating each Pr(J =
j|t,Dsyn,S). First, sample a value of Yrep from Pr(Yrep|t,Dsyn,S). Let Ynew repre-
sent one set of simulated values. Second, compute Pr(J = j|t,Dsyn,Yrep = Ynew,S)
using exact matching, treating Ynew as the collected values. This two-step process is
iterated h times, where ideally h is large, and (4) is estimated as the average of the
resultant h values of Pr(J = j|t,Dsyn,Yrep = Ynew,S). When S has no information,
the intruder treats the simulated values as plausible draws of Yrep.
Following Reiter (2005a), we quantify disclosure risk with summaries of these
identification probabilities. It is reasonable to assume that the intruder selects as a
match for t the record j with the highest value of Pr(J = j|t,Dsyn,S), if a unique
maximum exists. We consider two risk measures: the true match rate and the false
match rate. Let cj be the number of records with the highest match probability for
the target tj, where j = 1, . . . , N; let Ij = 1 if the true match is among the cj units and
Ij = 0 otherwise. Let Kj = 1 when cjIj = 1 and Kj = 0 otherwise. The true match
rate equals Σj Kj/N. Let Fj = 1 when cj(1 − Ij) = 1 and Fj = 0 otherwise; and let
s equal the number of records with cj = 1. The false match rate equals Σj Fj/s.
In our application, we presume that intruders’ targets are among the 1089 sensitive
cases. Since these data will contain synthetic values that change across Di, the
intruder can identify the 1089 sensitive cases and ignore records with unsynthesized
data, i.e., set their probabilities to zero. Thus, we compute risks based only on the
1089 records. The true match rate among these 1089 records is 5.7%, and the false
match rate is 88%. Hence, intruders are not likely to make correct matches, and very
likely to make mistakes. In contrast, nearly 50% of these records could be identified
in the original census data without the possibility of false matches. By these risk
measures, the synthesis has substantially reduced disclosure risks. We note that the
confidentiality risks in the public use data samples will be even smaller, since the act
of sampling introduces additional uncertainty in intruders’ abilities to match.
4.3 Algorithm for selecting values to synthesize
It may be sufficient from the perspective of disclosure risk to replace only some iden-
tifying variables. For example, a person might possess a unique combination of age,
race, sex, and marital status. This person might no longer be at risk if age is not
released exactly or if marital status and race both are not released exactly. When
such cases occur, it can improve data usefulness to include genuine values in dsyn
rather than leaving all values synthesized, since including genuine data reduces the
sensitivity of inferences to the synthesis model. However, substituting genuine values
for some records could impact risks for other records.
When the agency can safely put back some record’s genuine data in multiple ways,
ideally it should select the way that allows the released data to have as much utility
as possible with acceptable disclosure risk. One approach is to select certain key
estimands and add back genuine values so that synthetic-data and genuine-data in-
ferences are as similar as possible, e.g., the two sets of confidence intervals overlap as
much as possible (Karr et al., 2006). This will result in decisions that are nearly opti-
mal for some analyses and suboptimal for others. Another possibility is to maximize
some global measure of utility, such as summaries of the distance between the dis-
tributions of the synthetic and genuine datasets (Woo et al., 2009). However, global
measures of data utility are not especially sensitive to changes in a single datum, so
that they may not be fine enough for this purpose. Perhaps the ideal approach is to
release more synthetic values for variables for which the synthesis models reflect the
relationships in the data well, and more genuine values for variables that have com-
paratively poorly fitting models. Translating this idea into practical implementation
is an area for future research.
We choose a simple utility metric: make the number of synthesized values for
each variable as similar as possible. Effectively, this treats each variable as equally
important to data analysts, which is sensible absent specific planned uses of the public
use data. It also is simple to implement and transparent to data users.
We select the values to replace using an iterative approach. For iteration k, let
r(A)k be the number of synthetic ages, r(M)k be the number of synthetic marital
statuses, and r(R)k be the number of synthetic races in any one of the synthetic
populations at the start of the iteration. Initially, r(A)1 = r(M)1 = r(R)1 = 1089.
Let Ck = (Ck[1], Ck[2], Ck[3]) be the order in which variables are considered for replacement at iteration k, where Ck[1] represents the first variable considered, Ck[2] the second, and Ck[3] the third. Initially, C1[1] is age, C1[2] is marital status, and C1[3] is race.
For iteration k, which corresponds to the kth record among the 1089 sensitive
cases, we begin by replacing Ck[1] for record k in all five Di with the record’s genuine
value of Ck[1]. We re-synthesize that record’s values of Ck[2] and Ck[3] from appropriate CART models to maintain consistency with the updated value of Ck[1]. Let Dksyn = (Dk1, . . . , Dk5) be the resulting updated synthetic populations. Next, we compute the disclosure risk measures using Dksyn. If the measures satisfy three criteria, we
keep the genuine data value of Ck[1] in each Dki; otherwise, we put back the synthetic data for that record. The three criteria are: (i) the individual record cannot be correctly identified from Dksyn, (ii) the true match rate cannot exceed 5.7%, and (iii)
the false match rate cannot dip below 88%. In essence, we do not allow genuine data
replacements when the disclosure risks increase over those in the initial synthetic pop-
ulations Dsyn. These criteria can be relaxed to allow greater substitution of genuine
values. We check the true match rate and false match rate because, even if record k
is not identifiable, replacement of its Ck[1] may make other records identifiable.
When Ck[1] is replaced with genuine data, we decrease the corresponding rk by
one for the next iteration, e.g., if age is replaced, set r(A)k+1 = r(A)k − 1. We then
consider replacing Ck[2] with genuine data. We re-synthesize Ck[3] only for record
k based on genuine values of Ck[1] and Ck[2], and recompute the disclosure risks. If
the three criteria are satisfied, we replace Ck[2] in Dksyn with genuine data, set the
corresponding rk+1 = rk − 1, and move on to the next iteration. If the criteria are
not satisfied, we try to replace Ck[3] with genuine data using similar procedures. If
we cannot replace Ck[3] as well, then only Ck[1] is replaced in Dksyn for record k. We
note that the algorithm never puts back genuine values for all three variables for any
record k.
If Ck[1] cannot be replaced with genuine data, we attempt to replace Ck[2] and, if
possible, Ck[3] in Dksyn using similar procedures. Again, we decrease the corresponding
rk+1 as appropriate. If Ck[2] cannot be replaced, we attempt to replace only Ck[3] in
Dksyn. If no values can be replaced with genuine data, we return all values for record
k to the synthetic values in Dsyn, and set rk+1 = rk for all variables.
For each successive iteration, we reset Ck+1 to match the descending order of the
rk. For example, if r(A)k = 1000, r(M)k = 1002, and r(R)k = 1001, then Ck+1[1] is
marital status, Ck+1[2] is race, and Ck+1[3] is age. In this way, variables with the
least amount of genuine data are given the first opportunity to have genuine data put
back.
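The put-back procedure above can be summarized as a short loop. The sketch below is a simplified skeleton: the three-criteria risk check and the conditional re-synthesis step are supplied by the caller as black-box functions, and the names risk_ok, resynthesize, and select_genuine_values are illustrative, not from the text.

```python
def select_genuine_values(records, variables, risk_ok, resynthesize):
    """Sketch of the iterative algorithm for putting back genuine values.

    records:      ids of the sensitive cases; one iteration per record.
    variables:    initial consideration order, e.g. ['age', 'marital', 'race'].
    risk_ok:      returns True if the three disclosure criteria hold after a
                  trial replacement (stubbed; in practice recomputes the
                  identification probabilities and match rates).
    resynthesize: redraws a record's remaining synthetic values conditional
                  on the genuine values held fixed (stubbed).
    """
    # r[v] counts synthetic values remaining for variable v (1089 in the paper)
    r = {v: len(records) for v in variables}
    replaced = {rec: [] for rec in records}
    order = list(variables)
    for rec in records:
        kept = []
        for v in order:
            trial = kept + [v]
            resynthesize(rec, trial)        # keep other variables consistent
            if risk_ok(trial):
                kept = trial
                r[v] -= 1
            if len(kept) == len(variables) - 1:
                break                       # never put back all variables
        if not kept:
            resynthesize(rec, [])           # revert record to fully synthetic
        replaced[rec] = kept
        # variables with the least genuine data are considered first next time
        order = sorted(variables, key=lambda v: -r[v])
    return replaced, r
```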
We applied this algorithm to the five synthetic populations. After all 1089 itera-
tions, synthetic values remained on the file for 951 values of age, 951 values of marital
status, and 947 values of race. The disclosure risk measures for this updated dataset
are essentially unchanged. Hence, by using the algorithm, we put back about 13%
genuine values for each variable, thereby lessening reliance on the synthesis, with no
increase in disclosure risk.
4.4 Illustration of analytic properties
As the final step of sampling with synthesis, we take a 10% random sample of the
same records from the five updated synthetic populations. Of the 5100 records, 117
records have at least one value synthesized. To illustrate the quality of inferences
attainable from the resulting samples, we compute 95% confidence intervals for the
estimands listed in Table 2 and Table 3. These estimands include quantities for the
entire population, for small sub-groups likely to be impacted by the synthesis, and
for regression models with interactions among the variables subject to synthesis. The
tables also include the population values and 95% confidence intervals based on the
samples without any data synthesis. No values of Ts were negative.
Nearly all of the 95% confidence intervals from the synthetic data cover the pop-
ulation values Q. When this does not happen, the intervals based on the samples
without synthesis also do not cover their Q, indicating that lack of coverage results
primarily from natural variability inherent in taking samples. Inferences from both
sets of data are similar, as might be expected when replacing only 2.3% of values
overall. For 34 of the 57 estimands, the point estimates from the synthetic samples are closer to their corresponding population values than are the point estimates from the samples without synthesis. This is the case even for the estimate computed with the highest fraction of replaced values (the average age for single American Indians is based on five synthetic values out of fourteen): the synthetic data point estimate
of 33.9 is closer to the population value of 33.4 than the original data estimate of 30.8.
For 47 out of 57 estimands, the confidence intervals based on sampling with synthesis
are narrower than the confidence intervals based on the samples without synthesis.
The smaller errors in the point estimates and smaller widths of the intervals result
because the imputation models are determined with the entire census.
5 IMPLEMENTATION IN LARGE CENSUSES
The example in Section 4 illustrates the issues involved in sampling with synthe-
sis; however, many censuses have far more than 1089 records at risk of disclosure.
Fitting imputation models, including the CART model used here, can be computa-
tionally challenging when the number of at-risk records is very large. In such contexts,
agencies can partition the at-risk records into subsets of manageable sizes, for exam-
ple by geography or by categories of important variables like income, and perform
the imputation tasks separately in each partition. When using stratified sampling
to create the public use files, agencies can base the partitions on the strata. Impor-
tantly, the agency only needs to fit the synthesis models on the at-risk records in each
Table 2: Point estimates and 95% confidence intervals for demographic variables using
sampling with synthesis and, for benchmarking, sampling without synthesis. Census
values are in the column labeled Q.

Estimand                          Q      Sampling w/ Synthesis   Sampling w/o Synthesis
Avg. age                          48.7   48.9 (48.5, 49.4)       48.9 (48.5, 49.4)
Avg. age single Amer. Ind.        33.4   33.9 (28.8, 39.2)       30.8 (25.8, 35.9)
Avg. educ. divorced black women   39.4   39.2 (38.2, 40.3)       39.2 (37.9, 40.4)
% Married civilian                53.3   53.5 (52.2, 54.8)       53.7 (52.3, 54.9)
% Married in armed forces         .23    .25 (.13, .37)          .24 (.11, .36)
% Married spouse not present      1.5    1.7 (1.4, 2.1)          1.8 (1.4, 2.1)
% Widowed                         10.8   10.9 (10.1, 11.7)       11.0 (10.1, 11.8)
% Divorced                        14.0   14.1 (13.2, 15.0)       14.1 (13.2, 15.0)
% Separated                       2.9    2.6 (2.2, 3.0)          2.6 (2.2, 3.0)
% Single                          17.2   16.8 (15.8, 17.8)       16.8 (15.8, 17.8)
% White                           85.7   85.8 (84.9, 86.7)       85.6 (84.7, 86.5)
% Black                           10.2   10.3 (9.5, 11.0)        10.3 (9.5, 11.1)
% Amer. Ind.                      1.2    1.2 (.9, 1.5)           1.2 (.9, 1.5)
% Asian                           2.9    2.8 (2.4, 3.2)          2.9 (2.4, 3.3)
% Male, divorced Amer. Ind.       .11    .12 (.04, .21)          .10 (.02, .18)
% Separated | Asian               2.3    3.2 (.8, 5.7)           2.7 (.2, 5.2)
% Tax > 1000 | Black              19.1   18.5 (15.4, 21.7)       18.7 (15.5, 21.2)
partition group, not the entire census.
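The partition-and-fit strategy just described amounts to a group-and-apply loop over the at-risk records. The sketch below is illustrative: fit_and_replace is a hypothetical routine that fits a synthesis model to one partition's at-risk records and returns their synthetic replacements, and the dictionary-based interface is our own choice.

```python
from collections import defaultdict

def synthesize_by_partition(census, at_risk, partition_of, fit_and_replace):
    """Fit synthesis models separately within each partition group.

    census:          dict mapping record id -> record (a dict of values).
    at_risk:         set of ids of records needing synthesis.
    partition_of:    function id -> partition key (e.g. county or stratum).
    fit_and_replace: fits a model to one partition's at-risk records and
                     returns {id: synthetic_record} (assumed supplied).
    """
    groups = defaultdict(dict)
    for rid in at_risk:
        groups[partition_of(rid)][rid] = census[rid]
    out = dict(census)
    for key, grp in groups.items():
        # only the at-risk records in this partition enter the model fit
        out.update(fit_and_replace(grp))
    return out
```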
Even with extensive partitioning, sampling with synthesis still can result in effi-
ciency gains compared to standard partial synthesis. To illustrate, suppose that some
census data are partitioned into 3000 groups, e.g., the roughly 3000 counties in the
United States. Suppose further that the agency takes 1% random samples in each
county to create the public use file. In any county’s sample, there may be relatively
small numbers of people that are considered at risk of disclosure. If the agency es-
timates synthesis models within each county using only those sampled people, the
models would be based on few observations and hence inefficient. However, if the
agency builds synthesis models within each county using all people at risk in the
county, models will be based on much larger sets of observations, leading to efficiency
gains. When the distributions of replaced data are similar across the partitioning
Table 3: Point estimates and 95% confidence intervals for coefficients in regressions
using sampling with synthesis and, for benchmarking, sampling without synthesis.
Census values are in the column labeled Q.

Estimand                             Q      Synthesis              No Synthesis

Coefficient in regression of log I on:
Intercept                            4.9    5.2 (4.9, 5.6)         5.2 (4.9, 5.6)
Black                                -.27   -.19 (-.26, -.12)      -.19 (-.26, -.12)
American Indian                      -.25   -.17 (-.34, .01)       -.10 (-.29, .09)
Asian                                -.01   -.07 (-.20, .05)       -.11 (-.24, .01)
Female                               .00    .01 (-.05, .08)        .01 (-.05, .08)
Married spouse not present           -.03   .01 (-.21, .02)        .01 (-.16, .33)
Widowed                              -.01   .07 (-.09, .22)        .10 (-.06, .25)
Divorced                             -.16   -.10 (-.20, -.01)      -.11 (-.22, -.01)
Separated                            -.24   -.09 (-.30, .11)       -.12 (-.34, .09)
Single                               -.17   -.10 (-.19, -.01)      -.10 (-.19, -.01)
Education                            .11    .10 (.10, .11)         .10 (.10, .11)
Household size > 1                   .50    .52 (.46, .59)         .52 (.46, .59)
Females married spouse not present   -.52   -.35 (-.63, -.07)      -.43 (-.75, -.11)
Widowed females                      -.31   -.36 (-.53, -.19)      -.38 (-.55, -.21)
Divorced females                     -.31   -.41 (-.54, -.28)      -.41 (-.54, -.28)
Separated females                    -.52   -.57 (-.83, -.31)      -.55 (-.82, -.28)
Single females                       -.32   -.28 (-.39, -.16)      -.27 (-.39, -.15)
Age ×10                              .43    .38 (.31, .46)         .38 (.31, .46)
Age² ×1000                           -.44   -.40 (-.48, -.33)      -.40 (-.48, -.33)
Property tax ×10000                  .37    .38 (.29, .46)         .38 (.30, .46)

Coefficient in regression of √S on:
Intercept                            79.9   85.0 (73.6, 96.5)      83.4 (71.8, 95.1)
Female                               -13.3  -15.0 (-17.9, -12.1)   -15.4 (-18.2, -12.5)
Black                                -5.9   -4.6 (-8.8, -.5)       -5.0 (-9.3, -.8)
American Indian                      -7.0   -14.3 (-27.0, -1.5)    -11.3 (-24.1, 1.5)
Asian                                -3.3   -8.8 (-16.9, -.7)      -6.0 (-13.9, 1.9)
Married spouse not present           2.1    -.9 (-9.3, 7.5)        -1.9 (-10.4, 6.5)
Widowed                              7.3    7.3 (4.2, 10.5)        7.6 (4.5, 10.8)
Divorced                             -0.9   -.69 (-5.1, 3.7)       -.53 (-5.0, 4.0)
Separated                            -5.4   -3.5 (-13.7, 6.6)      -1.0 (-12.3, 10.3)
Single                               -1.5   -1.9 (-8.2, 4.5)       -2.2 (-8.9, 4.5)
High school                          5.5    7.4 (4.6, 10.3)        7.5 (4.6, 10.4)
Some college                         6.8    7.8 (4.4, 11.1)        7.9 (4.5, 11.2)
College degree                       8.3    7.3 (3.0, 11.6)        7.4 (3.0, 11.7)
Advanced degree                      10.7   8.8 (3.6, 13.9)        8.7 (3.6, 13.9)
Age                                  .21    .14 (-.01, .29)        .16 (.01, .32)
Income regression fit using people with I > 0.
Social security regression fit using people with S > 0 and A > 54.
groups, the efficiency gains are lower compared to those from fitting one model to the
entire census. On the other hand, partitioning can reduce bias when the distributions
of replaced data are dissimilar across partitions. We note that the inferential methods
of Section 3 apply regardless of the partitioning.
Although we illustrated sampling with synthesis using the same sample approach
in Section 4, agencies also can take independent samples from each synthesized pop-
ulation. For a fixed number of samples m, the different sample approach generally
enables estimation with higher efficiency. As m increases, reductions in variance are
typically larger with the different samples approach; in fact, with modest fractions
of replacement, var(qm) goes to zero as m → ∞ for the different samples approach
but not for the same sample approach. However, m independent samples potentially
require nearly m times as much storage space as the m versions of the same sample,
which public data analysts might find unwieldy for large files with many variables.
The disclosure risk properties of the two approaches are harder to pin down than
the utility properties. On the one hand, the different samples approach releases more
units to the public, increasing the number of potential intruder targets. The chance
that intruders’ specific targets are in the released data increases, so that intruders
can attempt re-identifications with higher confidence. On the other hand, it can be
difficult for intruders to determine whether records that appear in only one
of the different samples have been synthesized. This should reduce disclosure
risks. Furthermore, when many records that have synthesized values do not appear
in all m datasets, intruders trying to identify units with approaches akin to those in
Section 4 have less information to compute the probabilities.
For either approach, agencies need to select the value of m. Increasing m generally
increases the utility of the synthetic data, but it also increases disclosure risks and
storage needs. For either approach, agencies can compute overall disclosure risks for
different m, and select the largest m that provides acceptable risks and storage costs.
The literature on multiple imputation of missing data provides additional guidance
for selecting m for the same sample approach. Accuracy gains from increasing m
are typically modest for small fractions of replaced data (e.g., 10% or less), so that
agencies can release m = 5 or m = 10 datasets to keep risks and storage costs low
without sacrificing accuracy, whereas for large fractions of replaced data accuracy
gains can be substantial with large m.
6 CONCLUDING REMARKS
The main employers of sampling with synthesis should be national statistics agen-
cies that disseminate or could disseminate census data, like the Census Bureau, the
National Agricultural Statistics Service, the National Center for Education Statistics,
and dozens of international agencies. However, the approach is not limited to national
statistics agencies. Sampling with synthesis can be used to create research samples
of confidential administrative data, e.g., state records of people receiving services. A
related application is to create research samples of large organizations, e.g., employees
at large companies or health care records from large insurers. These data sources are
effectively censuses of their respective populations. Such data sources may play an
important role in future data analyses, since new surveys are ever-more expensive to
mount. Administrative and organizational data sources have potentially high disclo-
sure risks, since large portions of the data are available to members of the public, e.g.,
state social service workers, employees, and physicians. Creating public use files of
these data will require intense data alteration. Traditional disclosure limitation techniques have little chance of generating analytically useful data in such cases, whereas
sampling with synthesis does.
The algorithm for selecting values to synthesize applies outside the sampling with
synthesis context. It could be used, for example, to implement partial synthesis
for a survey sample. The only change in the algorithm for this case is in the risk
measures: they should account for intruders’ uncertainty that the target records are
in the dataset via the methods of Drechsler and Reiter (2008).
For different orderings of the variables, and possibly different orderings of the
records, the algorithm could result in different synthetic datasets. In general, we do
not anticipate these differences to be of great consequence for data utility, since the
algorithm automatically attempts to balance the number of replacement values for
each variable being synthesized. Nonetheless, agencies may be able to obtain slightly
higher quality inferences, e.g., more substitutions of genuine values, by running the
algorithm for several orderings, and selecting the resulting synthetic datasets that
yield the highest quality.
In addition to censuses, sampling with synthesis has the potential to be useful
for large probability samples with confidential data. For example, the sample sizes
in previous decennial long-form surveys were large enough that the Census Bureau
released public use samples rather than the entire surveys; similar issues now arise
with its successor, the American Community Survey. Sampling with synthesis from
a survey requires different, and not yet developed, methods for inferences, since the
additional uncertainty from the initial stage of sampling must be accounted for.
Finally, when missing data are present, it is natural for the agency to use multiple
imputation to handle the missing data and to protect confidentiality simultaneously.
Reiter (2004) develops an approach for doing this in regular partial synthesis settings.
Extending these ideas to sampling with synthesis is a topic for future research.
Appendix
A.1 Derivation of inferences for different samples approach
The analyst of the m different samples seeks f(Q|dsyn). The two-step process for
creating dsyn suggests that
f(Q|dsyn) = ∫ f(Q|Dsyn, dsyn) f(Dsyn|dsyn) dDsyn.     (5)
As in other applications of multiple imputation, we find each component of this
integral by assuming that the analyst’s distributions are identical to those used by
the agency for creating Dsyn. We also assume that the sample sizes are large enough
to permit normal approximations for these distributions. Thus, we require only the
first two moments for each distribution, which can be derived using standard large
sample Bayesian arguments. Diffuse priors are assumed for all parameters; that is, we
presume that the information in the likelihood function dominates any information
in the analyst’s prior distribution. This is reasonable in large datasets.
Let Qi be the estimate of Q in Di; let Qm = ∑_{i=1}^m Qi/m; and let Bm = ∑_{i=1}^m (Qi − Qm)²/(m − 1). Given Qm and Bm, dsyn is irrelevant for inferences about
Q. Hence, from Reiter (2003), f(Q|Dsyn,dsyn) = N(Qm, Bm/m). The posterior vari-
ance, Bm/m, is simply the variance estimator from standard partial synthesis, Tp, for
a census, i.e., set Um = 0.
We next assume that (qi|Di) ∼ N(Qi, ui), which is reasonable for simple random
samples. Here, we need not be concerned with the presence or absence of finite
population correction factors in ui, since, as we shall show, Td does not depend on
ui. We further assume that the sampling variability in each ui is negligible, so that
ui ≈ um. Thus, (Qm|qm, um) ∼ N(qm, um/m). Using Bayesian analysis of variance
arguments, we also have
( (m − 1)bm / (Bm + um) | dsyn ) ∼ χ²_{m−1}.     (6)
From these results, we have f(Q|dsyn, Bm) = N(qm, Bm/m+um/m). We now have
to integrate this normal distribution over f(Bm|dsyn) in (6). We could evaluate this
numerically, but we desire a closed form approximation that public use data analysts
could easily apply. For large m, we have E(Bm|dsyn) ≈ bm − um. Hence, we have
var(Q|dsyn) ≈ (bm − um)/m + um/m = bm/m = Td. (7)
From (7) we approximate f(Q|dsyn) as a t distribution with m−1 degrees of freedom.
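The resulting combining rule for the different samples approach is straightforward to implement. The sketch below computes qm, Td = bm/m, and the m − 1 degrees of freedom from the m released point estimates; the function name is illustrative.

```python
from statistics import mean, variance

def different_samples_inference(q):
    """Combine point estimates q = [q1, ..., qm] from m different
    synthetic samples.  Returns (qm, Td, df), where Td = bm/m and
    inferences use a t distribution with m - 1 degrees of freedom."""
    m = len(q)
    qm = mean(q)
    bm = variance(q)   # between-sample variance, divisor m - 1
    Td = bm / m
    return qm, Td, m - 1
```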
A.2 Derivation of inferences for same sample approach
The analyst again seeks f(Q|dsyn), but now recognizing that the same records com-
prise each di. Let q∞ = lim_{m→∞} ∑_{i=1}^m qi/m; let b∞ = lim_{m→∞} ∑_{i=1}^m (qi − qm)²/(m − 1); and let v∞ = var(Q|q∞, dsyn). We write f(Q|dsyn) as
f(Q|dsyn) = ∫ f(Q|dsyn, q∞, v∞, b∞) f(q∞|dsyn, v∞, b∞) f(v∞, b∞|dsyn) dq∞ dv∞ db∞.     (8)
As before, we assume that the sample sizes are large enough to permit normal ap-
proximations for the distributions involving Q and q∞. Diffuse priors are assumed
for all parameters. We first present results ignoring finite population corrections. We
then incorporate finite population corrections.
To begin, we assume that E(qi|Qi) = Qi, where Qi is the value of Q in synthesized
population Di. This is reasonable since di is a simple random sample from Di. Since
E(Qi|Q) = Q, the E(q∞|Q) = Q. Using the definition of v∞, we therefore have
f(Q|dsyn, q∞, v∞) = N(q∞, v∞). We next assume that f(qi|q∞, b∞) = N(q∞, b∞), so
that f(q∞|dsyn, b∞) = N(qm, b∞/m). The sampling distribution of qi also implies
that
( (m − 1)bm / b∞ | dsyn ) ∼ χ²_{m−1}.     (9)
From these results, the posterior distribution of Q given the variance parameters is
f(Q|dsyn, b∞, v∞) = N(qm, v∞ + b∞/m). (10)
We now turn to f(v∞|dsyn, b∞), which we estimate using a similar approach to
the one in Reiter (2008). First, define vi = var(Q|di, q∞, b∞). Given only one dataset
di, the analyst would use ui to estimate var(Q|di, b∞). Relating ui and vi, and using
an iterated variance computation, we have
ui = E(vi|di, b∞) + var(q∞|di, b∞). (11)
Rewriting this as an expression for vi, we have E(vi|di, b∞) = ui − b∞. We assume
that the sampling distribution of vi has mean v∞, so that E(v∞|dsyn, b∞) equals
E{E(v∞|dsyn, b∞, q∞)|dsyn, b∞} = E{ ∑_i vi/m | dsyn, b∞ } = um − b∞.     (12)
Finally, we assume that the variance in the sampling distribution of vi is of lower
order than v∞. This implies negligible sampling variance in um, which typically is
reasonable in multiple imputation contexts with large sample sizes (Rubin, 1987, Ch.
3). Thus, we write f(v∞|dsyn, b∞) as a distribution concentrated at um − b∞ with
negligible variance.
To obtain f(Q|dsyn), we should replace v∞ with um−b∞ in (10), and integrate with
respect to the distribution of b∞ in (9). Although this integration can be carried out
numerically, we desire a straightforward approximation that can be easily computed
by analysts using dsyn. For large m, we can approximate Var(Q|dsyn) by substituting
bm for b∞. Thus, for inferences without finite population corrections, we have
var(Q|dsyn) ≈ um − bm + bm/m = Ts. (13)
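Equation (13) translates directly into code. The sketch below computes Ts = um − bm + bm/m from the m point estimates and within-sample variance estimates; the function name is illustrative.

```python
from statistics import mean, variance

def same_sample_variance(q, u):
    """Combining rule for m copies of the same sample (no finite
    population correction).  q = [q1, ..., qm] point estimates,
    u = [u1, ..., um] within-sample variance estimates."""
    m = len(q)
    qm = mean(q)
    um = mean(u)
    bm = variance(q)          # between-copy variance, divisor m - 1
    Ts = um - bm + bm / m     # can be negative for small m
    return qm, Ts
```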
For finite population quantities, the derivations proceed as before up to (10). At
this stage, we have to modify the estimate of v∞ to reflect the finite population infer-
ence; we do not modify b∞/m, as it appropriately measures the additional uncertainty
from using finite m. For simple random samples, the modification follows the usual
form: simply multiply the without replacement variance estimate, um − bm, by the
finite population correction factor. This can be justified by large-sample approxima-
tions to f(Q|dsyn, q∞), as we now illustrate for a simple synthesis setting.
Suppose that the agency wishes to perform sampling with synthesis on a census of
N records measured on one variable Y. Let E(Y) = µ and Var(Y) = σ². Suppose
that the agency randomly selects N1 records in the census whose data will be replaced
with synthetic values. The agency draws replacements from f(Y) for each of these
records to create one partially synthetic population Di. It repeats the draws m = ∞
times. From each partially synthetic census, the agency takes a sample of the same
n records, and it releases these records to the public. Suppose that n1 records in the
sample have synthetic values; let n2 = n − n1.
The analyst seeks inferences for the population mean of Y. Let R index the set
of records in D whose values are replaced. In any Di, this mean is Qi = (∑_{j∉R} yj + ∑_{j∈R} yj,i)/N, where yj,i is the replaced value for record j in Di. When m = ∞, the finite population quantity of interest to the analyst, Q, is Q∞ = lim_{m→∞} ∑_{i=1}^m Qi/m. Since E(yj,i) = µ for all j ∈ R, Q∞ = (∑_{j∉R} yj + N1µ)/N is the target of inference.
Of course, the analyst does not know Q and must find its posterior distribution
using only the infinite number of synthetic samples dsyn. Let S be the set of records
in the sample whose values are not replaced. Let y*_j = yj if record j ∉ R, and let y*_j = µ if record j ∈ R. Note that E(y*_j) = µ, but Var(y*_j) = τ² ≠ σ². Each qi equals the sample mean of the n1 replaced values and the n2 not replaced values, so that q∞ = (∑_{j∈S} yj + n1µ)/n = ȳ*. Let E be the set of (N − n) records that are not in the sample (but in the census). Given q∞, the analyst’s target of inference can be written as Q = (nq∞ + (N − n)ȳ*_E)/N, where ȳ*_E is the mean of y*_j for j ∈ E.
Using large-sample approximations, the posterior mean and variance are
E(Q|dsyn, q∞) = (nq∞ + (N − n)E(µ|dsyn, q∞))/N     (14)
Var(Q|dsyn, q∞) = ((N − n)/N)² (E(τ²/(N − n)|dsyn, q∞) + Var(µ|dsyn, q∞)).     (15)
From standard Bayesian theory, E(µ|dsyn, q∞) = q∞; Var(µ|dsyn, q∞) = s²/n; and E(τ²|dsyn, q∞) = s². Here, s² = (∑_{j∈S}(yj − q∞)² + n1(µ − q∞)²)/n. Hence,
E(Q|dsyn, q∞) = q∞     (16)
Var(Q|dsyn, q∞) = ((N − n)/N) s²/n,     (17)
which contains the finite population correction factor.
We next show that u∞ − b∞ ≈ s²/n for large n. Because replacements are drawn from f(Y|µ, σ²), each ui ≈ σ²/n. And b∞ ≈ (n1/n)σ²/n, because the n2 values in S cancel from (qi − q∞) for all i. Hence, u∞ − b∞ = σ²/n − (n1/n)σ²/n = (n2/n)σ²/n. Turning now to s²/n, for large n we have s² ≈ τ². Because N1/N values of y* equal µ, we have τ² = ((N − N1)/N)σ². Finally, for large n, (n2/n) ≈ (N − N1)/N. Hence, we have s²/n ≈ (n2/n)σ²/n, as desired.
Regardless of whether analysts use finite population corrections or not, inferences
for Q can be based on the t-distribution, (qm − Q) ∼ tν(0, Ts). The degrees of freedom, ν, is derived by matching the first two moments of (νTs)/{((1 − n/N)δ + (1 − δ))(um − b∞) + b∞/m} to those of a χ²_ν distribution, as follows. Let φ = b∞/bm, and let f = (1 − n/N)δ + (1 − δ). Then, dropping the ν, the random variable can be expressed as Ts/(f um + bm φ(1/m − f)). We approximate the expectation and
variance of this quantity by using first order Taylor series expansions in φ⁻¹ around its expectation, which equals one. Thus, in the expectation, we essentially substitute one for φ, resulting in an expectation equal to one. Since Var(φ⁻¹) = 2/(m − 1), we
have
Var( Ts / (f um + φ bm(1/m − f)) | dsyn ) = [T²s (bm(1/m − f))² / (f um + bm(1/m − f))⁴] (2/(m − 1))
                                          = [(bm(1/m − f))² / (f um + bm(1/m − f))²] (2/(m − 1)).     (18)
Since a mean square random variable has variance equal to 2 divided by its degrees
of freedom, we conclude that νs = (m − 1)(Ts/(bm/m − f bm))².
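The degrees of freedom νs can be computed as follows. The δ switch between the finite-population-corrected and uncorrected cases is encoded as a use_fpc flag, an implementation choice of ours; Ts is assumed to have been computed with the matching value of f.

```python
def same_sample_df(Ts, bm, m, n=None, N=None, use_fpc=False):
    """Degrees of freedom nu_s = (m - 1) * (Ts / (bm/m - f*bm))**2 for
    same-sample inference, where f = 1 - n/N with a finite population
    correction (delta = 1) and f = 1 without one (delta = 0)."""
    f = (1 - n / N) if use_fpc else 1.0
    return (m - 1) * (Ts / (bm / m - f * bm)) ** 2
```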
A.3 Degrees of freedom for the same sample approach with
stratified sampling
We now derive the degrees of freedom for a reference t-distribution in inferences for
the census mean using the same sample approach with stratified sampling. We have
not found a single formula for these degrees of freedom that works for all estimands.
However, the general approach used here can be applied to linear functions of means
and totals.
Let Wh = Nh/N , and let umh and bmh be the values of um and bm computed in
stratum h. As suggested in Section 3, the point estimate for the census mean of Y is
∑_h Wh ymh, and its estimated variance is

Tst ≈ ∑_{h=1}^H W²h Tsh = ∑_{h=1}^H W²h {(1 − nh/Nh)(umh − bmh) + bmh/m}.
We match the first two moments of (νst Tst)/{∑_{h=1}^H W²h [(1 − nh/Nh)(umh − b∞h) + b∞h/m]} to those of a χ²_{νst} distribution. Let φ = (φ1, ..., φH), with φh = b∞h/bmh, and let fh = (1 − nh/Nh). Dropping νst, the random variable can be expressed as Tst/(∑_{h=1}^H W²h [fh umh + bmh φh(1/m − fh)]). We approximate the expectation and
variance of this quantity by using a multivariate first order Taylor series expansion
in φ⁻¹ around its expectation, which equals a vector of ones. Since Var(φh⁻¹) = 2/(m − 1), we have
Var( Tst / (∑_{h=1}^H W²h [fh umh + bmh φh(1/m − fh)]) | dsyn ) ≈ (2/(m − 1)) (∑_{h=1}^H W²h bmh(1/m − fh))² / T²st.     (19)
Since a mean square random variable has variance equal to 2 divided by its degrees
of freedom, we conclude that νst = (m − 1)(Tst/(∑_{h=1}^H W²h bmh(1/m − fh)))².
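The stratified combining rules can be sketched as one function over per-stratum summaries. The argument layout (parallel lists indexed by stratum) and the function name are illustrative choices.

```python
def stratified_inference(W, um, bm, n, N, m):
    """Stratified same-sample combining rules.

    W, um, bm, n, N: per-stratum lists of weights W_h = N_h/N, within
    variances um_h, between variances bm_h, and sample/population sizes.
    Returns (Tst, nu_st) with
      Tst   = sum_h W_h^2 {(1 - n_h/N_h)(um_h - bm_h) + bm_h/m},
      nu_st = (m - 1) * (Tst / sum_h W_h^2 bm_h (1/m - f_h))**2.
    """
    H = len(W)
    f = [1 - n[h] / N[h] for h in range(H)]
    Tst = sum(W[h] ** 2 * (f[h] * (um[h] - bm[h]) + bm[h] / m)
              for h in range(H))
    denom = sum(W[h] ** 2 * bm[h] * (1 / m - f[h]) for h in range(H))
    nu_st = (m - 1) * (Tst / denom) ** 2
    return Tst, nu_st
```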
7 Supplemental materials
Simulation studies: Results from simulations showing that the inferential methods
in Section 3 of the text have good frequentist properties, and that standard
disclosure limitation techniques can have poor frequentist properties.
References
Abowd, J., Stinson, M., and Benedetto, G. (2006). Final report to the Social Security
Administration on the SIPP/SSA/IRS Public Use File Project. Tech. rep., U.S.
Census Bureau Longitudinal Employer-Household Dynamics Program. Available
at http://www.bls.census.gov/sipp/synth data.html.
Abowd, J. M. and Woodcock, S. D. (2004). Multiply-imputing confidential character-
istics and file links in longitudinal linked data. In J. Domingo-Ferrer and V. Torra,
eds., Privacy in Statistical Databases, 290–297. New York: Springer-Verlag.
Alexander, J. T., Davern, M., and Stevenson, B. (2010). Inaccurate age and sex data
in the Census PUMS files: Evidence and implications. Tech. rep., National Bureau
of Economic Research, Working Paper 15703.
An, D. and Little, R. (2007). Multiple imputation: an alternative to top coding for
statistical disclosure control. Journal of the Royal Statistical Society, Series A 170,
923–940.
Bialek, C. (2010). Census Bureau obscured personal data—too well, some say. Wall
Street Journal, February 5.
Blewett, L. A. and Davern, M. (2007). Distributing state children’s health insurance
funds: A critical review of the design and implementation of the funding formula.
Journal of Health Politics, Policy and Law 32, 415–455.
Dalenius, T. and Reiss, S. P. (1982). Data-swapping: A technique for disclosure
control. Journal of Statistical Planning and Inference 6, 73–85.
Drechsler, J., Bender, S., and Rassler, S. (2008a). Comparing fully and partially syn-
thetic datasets for statistical disclosure control in the German IAB Establishment
Panel. Transactions on Data Privacy 1, 105–130.
Drechsler, J., Dundler, A., Bender, S., Rassler, S., and Zwick, T. (2008b). A new ap-
proach for disclosure control in the IAB Establishment Panel–Multiple imputation
for a better data access. Advances in Statistical Analysis 92, 439 – 458.
Drechsler, J. and Reiter, J. P. (2008). Accounting for intruder uncertainty due to
sampling when estimating identification disclosure risks in partially synthetic data.
In J. Domingo-Ferrer and Y. Saygin, eds., Privacy in Statistical Databases (LNCS
5262), 227–238. New York: Springer-Verlag.
Duncan, G. T. and Lambert, D. (1989). The risk of disclosure for microdata. Journal
of Business and Economic Statistics 7, 207–217.
Fienberg, S. E. (1994). A radical proposal for the provision of micro-data samples and
the preservation of confidentiality. Tech. rep., Department of Statistics, Carnegie-
Mellon University.
Fienberg, S. E., Makov, U. E., and Sanil, A. P. (1997). A Bayesian approach to
data disclosure: Optimal intruder behavior for continuous data. Journal of Official
Statistics 13, 75–89.
Fuller, W. A. (1993). Masking procedures for microdata disclosure limitation. Journal
of Official Statistics 9, 383–406.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2004). Bayesian Data
Analysis. London: Chapman & Hall.
Graham, P. and Penny, R. (2005). Multiply imputed synthetic data files. Tech. rep.,
University of Otago, http://www.uoc.otago.ac.nz/departments/pubhealth/pgrahpub.htm.
Hawala, S. (2008). Producing partially synthetic data to avoid disclosure. In Pro-
ceedings of the Joint Statistical Meetings. Alexandria, VA: American Statistical
Association.
Karr, A. F., Kohnen, C. N., Oganian, A., Reiter, J. P., and Sanil, A. P. (2006). A
framework for evaluating the utility of data altered to protect confidentiality. The
American Statistician 60, 224–232.
Kennickell, A. and Lane, J. (2006). Measuring the impact of data protection tech-
niques on data utility: Evidence from the Survey of Consumer Finances. In
J. Domingo-Ferrer, ed., Privacy in Statistical Databases 2006 (Lecture Notes in
Computer Science), 291–303. New York: Springer-Verlag.
Kennickell, A. B. (1997). Multiple imputation and disclosure protection: The case of
the 1995 Survey of Consumer Finances. In W. Alvey and B. Jamerson, eds., Record
Linkage Techniques, 1997, 248–267. Washington, D.C.: National Academy Press.
Kinney, S. K. and Reiter, J. P. (2007). Making public use, synthetic files of the
Longitudinal Business Database. In Proceedings of the Joint Statistical Meetings.
Alexandria, VA: American Statistical Association.
Little, R. J. A. (1993). Statistical analysis of masked data. Journal of Official
Statistics 9, 407–426.
Little, R. J. A., Liu, F., and Raghunathan, T. E. (2004). Statistical disclosure tech-
niques based on multiple imputation. In A. Gelman and X. L. Meng, eds., Applied
Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, 141–
152. New York: John Wiley & Sons.
Mitra, R. and Reiter, J. P. (2006). Adjusting survey weights when altering identifying
design variables via synthetic data. In J. Domingo-Ferrer and L. Franconi, eds.,
Privacy in Statistical Databases, 177–188. New York: Springer-Verlag.
Raghunathan, T. E., Reiter, J. P., and Rubin, D. B. (2003). Multiple imputation for
statistical disclosure limitation. Journal of Official Statistics 19, 1–16.
Reiter, J. P. (2003). Inference for partially synthetic, public use microdata sets.
Survey Methodology 29, 181–189.
Reiter, J. P. (2004). Simultaneous use of multiple imputation for missing data and
disclosure limitation. Survey Methodology 30, 235–242.
Reiter, J. P. (2005a). Estimating identification risks in microdata. Journal of the
American Statistical Association 100, 1103–1113.
Reiter, J. P. (2005b). Releasing multiply-imputed, synthetic public use microdata:
An illustration and empirical study. Journal of the Royal Statistical Society, Series
A 168, 185–205.
Reiter, J. P. (2005c). Using CART to generate partially synthetic, public use micro-
data. Journal of Official Statistics 21, 441–462.
Reiter, J. P. (2008). Multiple imputation when records used for imputation are not
used or disseminated for analysis. Biometrika 95, 933–946.
Reiter, J. P. and Mitra, R. (2009). Estimating risks of identification disclosure in
partially synthetic data. Journal of Privacy and Confidentiality 1, 99–110.
Reiter, J. P. and Raghunathan, T. E. (2007). The multiple adaptations of multiple
imputation. Journal of the American Statistical Association 102, 1462–1471.
Rubin, D. B. (1981). The Bayesian bootstrap. The Annals of Statistics 9, 130–134.
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York:
John Wiley & Sons.
Rubin, D. B. (1993). Discussion: Statistical disclosure limitation. Journal of Official
Statistics 9, 462–468.
Schenker, N. and Raghunathan, T. E. (2007). Combining information from multiple
surveys to enhance estimation of measures of health. Statistics in Medicine 26,
1802–1811.
U. S. Bureau of the Census (2003). 2000 Census of Population and Housing, Public
Use Microdata Sample, United States: Technical documentation.
Winkler, W. E. (2007). Examples of easy-to-implement, widely used methods of
masking for which analytic properties are not justified. Tech. rep., U.S. Census
Bureau Research Report Series, No. 2007-21.
Woo, M. J., Reiter, J. P., Oganian, A., and Karr, A. F. (2009). Global measures of
data utility for microdata masked for disclosure limitation. Journal of Privacy and
Confidentiality 1, 111–124.
Yancey, W. E., Winkler, W. E., and Creecy, R. H. (2002). Disclosure risk assessment
in perturbative microdata protection. In J. Domingo-Ferrer, ed., Inference Control
in Statistical Databases, 135–152. Berlin: Springer-Verlag.