December 2018
Methods to assess and quantify disclosure risk and information loss under statistical disclosure control
Professor Natalie Shlomo
The University of Manchester
A contributing article to the National Statistician’s Quality Review into Privacy and Data Confidentiality Methods
Contents

1. Introduction
2. Microdata from social surveys
   2.1 Disclosure risk assessment
   2.2 Statistical disclosure control methods
       2.2.1 PRAM for categorical key variables
       2.2.2 Additive noise for continuous variables
   2.3 Data utility measures
       2.3.1 Distance metrics
       2.3.2 Impact on measures of association
       2.3.3 Impact on a regression analysis
3. Frequency tables of whole population counts
   3.1 Types of disclosure risk
   3.2 Statistical disclosure control methods
       3.2.1 Record swapping
       3.2.2 Semi-controlled random rounding
       3.2.3 Stochastic perturbation
   3.3 Disclosure risk measures based on Information Theory
   3.4 Data utility measures
4. Magnitude tables from business statistics
5. Disclosure risk–data utility confidentiality map
6. New dissemination strategies
   6.1 Safe data enclaves and remote access
   6.2 Web-based applications
       6.2.1 Flexible table generating servers
       6.2.2 Remote analysis servers
   6.3 Synthetic data
7. Statistical disclosure control: Where do we go from here?
8. References
1. Introduction
Agencies producing official and national statistics, such as a National Statistical
Institute, have an obligation to release statistical data for research purposes and to
inform policies. On the other hand, they have a legal, moral and ethical obligation to
protect the confidentiality of individuals responding to their request for data and in
many countries there are codes of practice and legislation that must be strictly adhered
to. In addition, agencies issue confidentiality guarantees to respondents prior to their
participation in surveys or censuses. The key objective is to ensure public trust in
official statistics production and ensure high response rates.
There are two general approaches under statistical disclosure control (SDC): protect
the data for release using disclosure control techniques (‘safe data’), or restrict access
to the data for example by limiting its use to only approved researchers within a secure
data environment (‘safe access’). A combination of both approaches is usually applied
when releasing statistical data.
Statistical data that are traditionally released by agencies can be divided into two major
formats: microdata and tabular data.
a) Microdata are typically from social surveys, for example, the Labour Force
Survey and the Social Survey, where sampling fractions are very small and
hence the microdata contain only a small random subset of the population.
Assuming that there is no response knowledge on who has participated in the
survey, many producers of statistics within agencies have made provisions to
release public-use microdata from social surveys, usually through data
archives. Public-use microdata also undergo variable suppression, coarsening
and aggregation before they are released (these methods are discussed in
detail in Section 2.2). Microdata from business surveys where the data are
partially collected through take-all strata and may have very sensitive skewed
outliers are typically not released. Census microdata are also generally not released,
although the Office for National Statistics (ONS) has a tradition of releasing
microdata from a small sample drawn from the census.
b) Tabular data contain either frequency counts which can be based on whole
populations such as from a census or register, or on survey data where the
weighted sample frequency counts are calculated by aggregating individual
survey weights. Tabular data can also include magnitude data which are
typically calculated from business surveys, such as total revenue according to
industry code.
In traditional outputs, we define the notion of an ‘intruder’ as someone with malicious
intent who wants to probe the statistical data and reveal sensitive information about
an individual or group of individuals. For example, in health data, an intruder might be an individual or organisation wishing to disclose sensitive information about
doctors/clinics performing abortions. Two main disclosure risks are: (1) identity
disclosure where a statistical unit can be identified based on a set of cross-classified
quasi-identifying variables, which identify individuals indirectly such as age, gender,
occupation, place of residence; (2) attribute disclosure where new information can be
learnt about an individual or a group of individuals. Disclosure risk scenarios form the
basis of possible means of disclosure, for example: the ability of an intruder to match
a dataset to an external public file based on a common set of quasi-identifying
variables; the ability of an intruder to identify uniques through visible and rare
attributes; the ability of an intruder to difference nested tables and obtain small counts;
the ability of an intruder to form coalitions.
In microdata from social surveys, the main concern is the risk of identification since
this is a prerequisite for individual attribute disclosure where many sensitive variables
such as income or health outcomes, can be revealed following an identification.
Naturally, sampling from the population provides a priori protection since an intruder
cannot be sure whether a sample unique on a set of quasi-identifiers is a population
unique.
In tabular data of whole population counts, attribute disclosure arises when there is a
row/column of zeros and only one populated cell. This leads to individual or group
attribute disclosure depending on the size of the populated cell since an intruder can
learn new attributes according to the remaining spanning variables of the table that
were not known previously. Therefore, in frequency tables containing whole population
counts, it is the zero cells that are the main cause of attribute disclosure. Frequency
tables of weighted survey counts are generally not a cause of concern due to the
ambiguity introduced by the sampling and the fact that survey weights differ across
units in the surveys due to non-response adjustments and benchmarking. In
magnitude tables arising from business surveys, disclosure risk is generally defined
by the ability of businesses to form coalitions to disclose sensitive
information about competing businesses.
In order to preserve the confidentiality of respondents, agencies must assess the
disclosure risk in statistical data and, if required, choose appropriate SDC methods to
apply to the data. Measuring disclosure risk involves assessing and evaluating
numerically the risk of re-identifying statistical units. SDC methods perturb, modify, or
summarise the data in order to prevent re-identification by a potential intruder. Higher
levels of protection through SDC methods however often negatively impact the quality
of the data. The SDC decision problem involves finding the optimal balance between
managing and minimising disclosure risk to tolerable risk thresholds depending on the
mode for accessing the data, and ensuring high utility and fit-for-purpose data where
the data will remain useful for the purpose for which it was designed.
In this article we provide more details of SDC approaches, the quantification of
disclosure risks and data utility for traditional types of statistical outputs. Section 2
presents details for microdata from social surveys and Section 3 presents details for
tabular data from whole population counts. Section 4 presents a brief description of
SDC approaches for magnitude tables arising from business surveys. We then
demonstrate the disclosure risk-data utility trade-off in Section 5. Section 6 presents
future data dissemination strategies at agencies producing national and official
statistics and we close in Section 7 with a discussion of future directions for statistical
disclosure control.
2. Microdata from social surveys
Microdata from social surveys are released only after removing direct identifying
variables, such as names, addresses, and identity numbers. As mentioned, the main
disclosure risk is individual attribute disclosure where small counts on cross-classified
quasi-identifying key variables can be used to identify an individual and confidential
information may be learnt from the remaining sensitive variables in the dataset. The
quasi-identifying key variables are those variables that are visible and traceable and
are accessible to the public as well as to potential intruders with malicious intent to
disclose information. Since the prerequisite to individual attribute disclosure is identity
disclosure, SDC methods typically aim to reduce the risk of re-identification and the
disclosure risk assessment is based on estimating a probability of identification given
a set of quasi-identifying key variables. In addition, key variables are typically
categorical variables, and may include: sex, age, occupation, place of residence,
country of birth, family structure, etc. Sensitive variables can be continuous variables,
such as income and expenditures, but can also be categorical. We define the key as
the set of combined cross-classified identifying key variables, typically presented as a
contingency table containing the counts from the survey microdata. For example, if the
identifying key variables are sex (2 categories), age group (10 categories) and years
of education (8 categories), the key will have 160 (= 2 × 10 × 8) cells following the
cross-classification of the key variables.
The disclosure risk scenario of concern at statistical agencies is the ability of a
potential intruder to match the released microdata to external sources containing the
target population based on a common key. External sources can be either in the form
of prior knowledge that the intruder might have about a specific population group, or
by having access to public files containing information about the population, such as
phone company listings, voter registration lists or even a national population register.
The agency does not generally assume that an intruder will have ‘response
knowledge’ on whether an individual is included in the survey dataset or not and
therefore relies on this extra layer of protection in their SDC strategies.
In order to protect a data set, one can either apply an SDC method on the identifying
key variables or on the sensitive variables. In the first case identification of a unit is
rendered more difficult, and the probability that a unit is identified is hence reduced. In
the second case, even if an intruder succeeds in identifying a unit by using the values
of the identifying key variables, the sensitive variables would hardly disclose any useful
information on the particular record. One can also apply SDC methods on both the
identifying and sensitive variables simultaneously. This offers more protection, but also
leads to more information loss.
Since the application of SDC methods leads to information loss, it is important to
develop quantitative information loss measures in order to assess whether the
resulting confidentialised survey microdata is fit-for-purpose according to some pre-
defined user specifications. As mentioned, survey microdata is typically released into
data archives and repositories where registered users can gain access to the data
following appropriate training and confidentiality agreements. Almost every country
has a data archive where researchers can gain access to microdata. One example is
the United Kingdom Data Service which is responsible for the dissemination of
microdata from many of the UK surveys [1]. There are other international data archives
and repositories and these do a particularly good job of archiving and making data
available for international comparisons. One example is the IPUMS archive located at
the University of Minnesota which provides census and survey data from
collaborations with 105 Statistical Agencies, national archives, and genealogical
organisations around the world. The staff ensure that the datasets are integrated across time and space, with common formatting, harmonised variable codes, archiving and documentation [2].
Researchers may obtain more detailed data that has undergone fewer SDC methods
through special licensing and on-site data enclaves, although this generally involves
a time-consuming application process and the need to travel to on-site facilities.
2.1 Disclosure risk assessment
We focus on microdata arising from social surveys. Disclosure risk for social survey
microdata is measured and quantified by estimating a probability of re-identification.
This probability is based on the notion of uniqueness on the key where a cell value in
the cross-classified identifying variables may have a value of one. Since survey
microdata is based on a sample, a cell value of one is only problematic if there is also
a cell value of one in the whole population. In other words, we need to know if the
sample unique in the key is also a population unique, or if it is an artefact of the
sampling. Based on the literature, methods for assessing disclosure risk for sample
microdata arising from social surveys can be classified into three types:
• Heuristics that identify special uniques on a set of cross-classified key variables,
i.e. sample uniques that are likely to be population uniques [3]
• Probabilistic record linkage on a set of key (matching) variables that can be used
to link the microdata to an external population file [4, 5]
• Probabilistic modelling of disclosure risk which was developed under two
approaches: a full model-based framework taking into account all of the information
available to intruders and modelling their behaviour [6] and a more simplified
approach that restricts the information that would be known to intruders [7, 8, 9].
Heuristics and record linkage suffer from the drawback that there is no framework for
obtaining consistent disclosure risk measures at both the individual record level and the overall global file level. In addition, these approaches do not take into account the
protection afforded by the sampling. Probabilistic modelling provides record-level
disclosure risk measures that can be used to target high-risk records in the microdata
for SDC methods. Global file-level disclosure risk measures are aggregated from
record-level risk measures and are essential for Microdata Review Boards.
The disclosure risk measure of re-identification can take two forms: the number of
sample uniques that are also population uniques, and the overall expected number of
correct matches for each sample unique if we were to link the sample uniques to a
population. When the population from which the sample is drawn is unknown,
probabilistic modelling can be used to estimate population parameters which form the
basis for estimating the disclosure risk measures, while also accounting for the protection
afforded by the sampling.
Quasi-identifying key variables for disclosure risk assessment are determined by a
disclosure risk scenario, i.e. assumptions about available external files and IT tools
that can be used by intruders to identify individuals in released microdata. For
example, key variables may be chosen which would enable matching the released
microdata to a publicly available file containing names and addresses. Examples of
publicly available data might include data that is freely available over the internet such
as car registrations, phone books and electoral rolls, or data that can be purchased,
such as supermarket loyalty cards and life-style datasets. Under a probabilistic
approach, disclosure risk is estimated based on the contingency table of sample
counts spanned by identifying key variables, for example place of residence, sex, age,
occupation, etc. The other variables in the file are sensitive variables.
Individual per-record risk measures in the form of a probability of re-identification are
first estimated. These per-record risk measures are then aggregated to obtain global
risk measures for the entire file.
Denote by $F_k$ the population size in cell $k$ of a table spanned by the key variables having $K$ cells, and by $f_k$ the sample size, with $\sum_{k=1}^{K} F_k = N$ and $\sum_{k=1}^{K} f_k = n$. The set of sample uniques is defined as $SU = \{k : f_k = 1\}$, since these are the potential high-risk records, i.e. population uniques.

Formally, the two global disclosure risk measures (where $I$ is the indicator function) are the following:

1. Number of sample uniques that are population uniques:

$\tau_1 = \sum_k I(f_k = 1, F_k = 1)$

2. Expected number of correct matches for sample uniques (i.e. a matching probability):

$\tau_2 = \sum_k I(f_k = 1)\, 1/F_k.$

The individual risk measure for $\tau_2$ is $1/F_k$. This is the probability that a match between a record in the microdata and a record in the population having the same values of key variables will be correct. If, for example, there are two records in the population with the same values of key variables, the probability is 0.5 that the match will be correct. Adding up these probabilities over the sample uniques gives the expected number (on average) of correctly matching a record in the microdata to the population when we allow guessing. The population frequencies $F_k$ are unknown and are estimated from the probabilistic model. The risk measures are then estimated by:

$\hat{\tau}_1 = \sum_k I(f_k = 1)\, \hat{P}(F_k = 1 \mid f_k = 1)$ and $\hat{\tau}_2 = \sum_k I(f_k = 1)\, \hat{E}(1/F_k \mid f_k = 1) \qquad (1)$
A Poisson model to estimate the disclosure risk measures has been proposed [8, 10]. This model makes the natural assumption in the contingency table literature that $F_k \sim Poisson(\lambda_k)$ for each cell $k$. A sample is drawn by Poisson or Bernoulli sampling with a sampling fraction $\pi_k$ in cell $k$: $f_k \mid F_k \sim Bin(F_k, \pi_k)$. It follows that:

$f_k \sim Poisson(\pi_k \lambda_k)$ and $F_k \mid f_k \sim f_k + Poisson(\lambda_k (1 - \pi_k)) \qquad (2)$

where the $F_k \mid f_k$ are conditionally independent.

The parameters $\{\lambda_k\}$ are estimated using log-linear modelling. The sample frequencies $f_k$ are independent Poisson distributed with a mean of $\mu_k = \pi_k \lambda_k$. A log-linear model for the $\mu_k$ is expressed as $\log(\mu_k) = x_k' \beta$, where $x_k$ is a design vector which denotes the main effects and interactions of the model for the key variables. The maximum likelihood estimator (MLE) $\hat{\beta}$ may be obtained by solving the score equations:

$\sum_k x_k \left[ f_k - \exp(x_k' \beta) \right] = 0 \qquad (3)$

The fitted values are calculated by $\hat{\mu}_k = \exp(x_k' \hat{\beta})$ and $\hat{\lambda}_k = \hat{\mu}_k / \pi_k$.

The individual disclosure risk measures for cell $k$ are:

$P(F_k = 1 \mid f_k = 1) = \exp(-\lambda_k (1 - \pi_k))$
$E(1/F_k \mid f_k = 1) = \left[ 1 - \exp(-\lambda_k (1 - \pi_k)) \right] / \left[ \lambda_k (1 - \pi_k) \right] \qquad (4)$

Plugging $\hat{\lambda}_k$ for $\lambda_k$ in (4) leads to the estimates $\hat{P}(F_k = 1 \mid f_k = 1)$ and $\hat{E}[1/F_k \mid f_k = 1]$, and then to $\hat{\tau}_1$ and $\hat{\tau}_2$ of (1). Confidence intervals for these global risk measures are also considered [11].
A method for selecting the log-linear model based on estimating and (approximately) minimising the bias $B_i$ of the risk estimates $\hat{\tau}_1$ and $\hat{\tau}_2$ has been developed [9]. The method selects the model using a forward search algorithm which minimises the standardised bias estimate $\hat{B}_i / \sqrt{\hat{v}_i}$ for $\hat{\tau}_i$, $i = 1, 2$, where the $\hat{v}_i$ are variance estimates of the $\hat{B}_i$. In addition, the estimation of disclosure risk measures under complex survey designs with stratification, clustering and survey weights is also addressed [9].
Empirical studies have shown that the probabilistic modelling approach can provide
unbiased estimates of the overall global level of disclosure risk in the microdata, but it is less accurate for the individual record level of disclosure risk. Thus care should be
taken when using record level measures of disclosure risk for targeting SDC methods
to high-risk records.
The probabilistic model assumes that there is no measurement error in the way the
data is recorded. Besides typical errors in data capture, key variables can also be
purposely misclassified as a means of masking the data. A method to estimate risk
measures that take into account measurement errors has been outlined [12]. Denoting the cross-classified key variable in the population and the microdata as $X$, and assuming that $X$ in the microdata has undergone some misclassification or perturbation error, denoted by the value $\tilde{X}$ and determined independently by a misclassification matrix $M$,

$M_{kj} = P(\tilde{X} = k \mid X = j) \qquad (5)$
a record-level disclosure risk measure of a match with a sample unique under
measurement error is:
$\frac{M_{kk}\,(1 - \pi M_{kk})^{F_k - 1}}{\sum_j F_j M_{kj}\,(1 - \pi M_{kj})^{F_j - 1}} \qquad (6)$
Under assumptions of small sampling fractions and small misclassification errors, the measure can be approximated by $M_{kk} \big/ \sum_j F_j M_{kj}$ or $M_{kk} / \tilde{F}_k$, where $\tilde{F}_k$ is the population count with $\tilde{X} = k$. Aggregating the per-record disclosure risk measures, the global risk measure is:

$\tau_2 = \sum_k I(f_k = 1)\, M_{kk} / \tilde{F}_k \qquad (7)$
Note that to calculate the measure only the diagonal of the misclassification matrix
needs to be known, i.e. the probabilities of not being perturbed. Since population
counts are generally not known, the estimate in (7) can be obtained by probabilistic modelling with log-linear models, as described above, fitted to the misclassified sample:

$\hat{\tau}_2 = \sum_k I(\tilde{f}_k = 1)\, M_{kk}\, \hat{E}\left[ 1/\tilde{F}_k \mid \tilde{f}_k = 1 \right] \qquad (8)$
2.2 Statistical disclosure control methods
Based on the disclosure risk assessment, producers of statistics within agencies must
choose appropriate SDC methods either by perturbing, modifying, or summarising the
data. The choice depends on the mode of access, requirements of the users and the
impact on quality and information loss. Choosing an optimal SDC method is an
iterative process where a balance must be found between managing disclosure risk
and preserving the utility in the microdata.
SDC methods for microdata include perturbative methods which alter the data and
non-perturbative methods which limit the amount of information released in the
microdata without actually altering the data. Examples of non-perturbative SDC
methods are: coarsening and recoding where values of variables are grouped into
broader categories (for example single years of age are grouped into age groups);
variable suppression where variables such as low-level geographies are deleted from
the microdata; and sub-sampling where a random sample is drawn from the original
microdata. This latter approach is commonly used to produce research files from
census microdata, for example, in the UK a 1% sample is drawn from the census
microdata for use by the research community. Given that sampling provides a priori protection, perturbative methods are applied less often to survey microdata, although there are cases where this is done.
For continuous variables, a common perturbative method for survey microdata is top-
coding where all values above a certain threshold receive the value of the threshold
itself. For example, any individual in survey microdata earning above £10,000 a month
will have their income amount replaced by £10,000. Another perturbative method is
adding random noise to continuous variables, such as income or expenditure. For
example, random noise is generated from a Normal distribution with a mean of zero
and a small variance for each individual in the dataset and this random noise is added
to the individual’s value of the variable. Micro-aggregation is another approach where
records are grouped together (usually using a clustering algorithm) and the values of
the continuous variable(s) are replaced by their average, or alternatively, values can
be rank-swapped within the group.
For categorical variables, the most common perturbative method used in microdata is
record swapping. In this approach, two records having similar control variables are
paired and the values of some variables are swapped, typically their geographical
variables. For example, two individuals having the same sex, age, and years of
education will be paired and their place of residence interchanged. Record swapping
is used in the United States and the United Kingdom on their census microdata as a
pre-tabular method prior to tabulation (see Section 3.2). A more general method is the
post-randomisation probability mechanism (PRAM) where categories of variables are
changed or not changed according to a prescribed probability mechanism and a
stochastic selection process [13]; it is described in more detail below. Table 1
summarises the SDC methods mentioned above. Further information on perturbative
and non-perturbative methods can be found in the literature [14, 15] and references
therein.
Table 1: SDC methods for survey microdata

Non-perturbative methods            Perturbative methods
                                    Categorical variables     Continuous variables
Coarsening/recoding variables       Record swapping           Top-coding
Variable or value suppression       PRAM                      Adding random noise
Sub-sampling                                                  Micro-aggregation
                                                              Rank swapping
Each SDC method impacts differently on the level of protection obtained in the
microdata and information loss. Two SDC methods to preserve sufficient statistics as
well as logical consistencies in the microdata are described and summarised below
[16].
2.2.1 PRAM for categorical key variables
For protecting categorical identifying variables, the post-randomisation method
(PRAM) has been proposed [13]. As a perturbative method, PRAM alters the data,
and therefore we can expect consistent records to start failing edit rules. Edit rules
describe logical relationships that have to hold true, such as “a two-year-old person
cannot be married” or “the profit and the costs of an enterprise should sum up to its
turnover”.
The process of applying PRAM is described as follows [14]:
Let $P$ be an $L \times L$ transition matrix containing conditional probabilities $p_{ij} = p(\text{perturbed category is } j \mid \text{original category is } i)$ for a categorical variable with $L$ categories, $t$ the vector of frequencies, and $v$ the vector of relative frequencies: $v = t/n$, where $n$ is the number of records in the micro-data set.

In each record of the data set, the category of the variable is changed or not changed according to the prescribed transition probabilities in the matrix $P$ and the result of a draw of a random multinomial variate $u$ with parameters $p_{ij}$ $(j = 1, \ldots, L)$. If the $j$-th category is selected, category $i$ is moved to category $j$. When $i = j$, no change occurs.
Let $t^*$ be the vector of the perturbed frequencies. $t^*$ is a random variable and $E(t^* \mid t) = tP$. Assuming that the transition probability matrix $P$ has an inverse $P^{-1}$, this can be used to obtain an unbiased moment estimator of the original data: $\hat{t} = t^* P^{-1}$.
In order to ensure that the transition probability matrix has an inverse and to control
the amount of perturbation, the matrix P is chosen to be dominant on the main
diagonal, i.e. each entry on the main diagonal is over 0.5.
The condition of invariance can be placed on the transition matrix $P$, i.e. $tP = t$. This releases the users of the perturbed file from the extra effort of obtaining unbiased moment estimates of the original data, since $t^*$ itself will be an unbiased estimate of $t$. To obtain an invariant transition matrix, a matrix $Q$ is calculated by transposing the matrix $P$, multiplying each column $j$ by $v_j$ and then normalising its rows so that each row sums to one. The invariant matrix is obtained as $R = PQ$. The invariant matrix $R$ may distort the desired probabilities on the diagonal, so a parameter $\alpha$ is defined to calculate $R^* = \alpha R + (1 - \alpha) I$, where $I$ is the identity matrix [16].

$R^*$ will also be invariant and the amount of perturbation is controlled by the value of $\alpha$. The property of invariance means that the expected values of the marginal distribution of the variable being perturbed are preserved. In order to obtain the exact marginal distribution and reduce the additional variance caused by the perturbation, a “without replacement” selection strategy for choosing values to perturb can be implemented based on the expectations calculated from the transition probabilities.
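A minimal Python sketch of this construction follows; the matrix $P$, the parameter alpha and the category codes are illustrative, and the without-replacement refinement described above is omitted:

import numpy as np

def invariant_pram_matrix(P, v, alpha):
    # Q: transpose P, multiply column j by v_j, then row-normalise.
    Q = P.T * v
    Q = Q / Q.sum(axis=1, keepdims=True)
    R = P @ Q                                    # invariant: v R = v
    return alpha * R + (1.0 - alpha) * np.eye(len(v))   # restore a dominant diagonal

def apply_pram(codes, R, rng):
    # Perturb integer category codes 0..L-1 record by record.
    return np.array([rng.choice(len(R), p=R[c]) for c in codes])

rng = np.random.default_rng(2018)
P = np.full((3, 3), 0.1) + 0.7 * np.eye(3)       # diagonal-dominant transition matrix
codes = rng.integers(0, 3, size=1000)            # hypothetical categorical variable
v = np.bincount(codes, minlength=3) / codes.size
perturbed = apply_pram(codes, invariant_pram_matrix(P, v, alpha=0.9), rng)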
As in most perturbative SDC methods, joint distributions between perturbed and
unperturbed variables are distorted, in particular for variables that are highly correlated
with each other. The perturbation can be controlled as follows:
1. Before applying PRAM, the variable to be perturbed is divided into subgroups, $g = 1, \ldots, G$. The transition (and invariant) probability matrix $R_g$ is developed for each subgroup $g$. The transition matrices for each subgroup are placed on the main
diagonal of the overall transition matrix where the off-diagonal probabilities are all
zero, i.e. the variable is only perturbed within the subgroup and the difference in
the variable between the original value and the perturbed value will not exceed a
specified level. An example of this is perturbing age within broad age bands.
2. The variable to be perturbed may be highly correlated with other variables. Those
variables should be compounded into one single variable. PRAM should be carried
out on the compounded variable. Alternatively, the perturbation of the variable is carried out within subgroups defined by the second, highly correlated variable. An
example of this is when age is perturbed within groupings defined by marital status.
The control variables in the perturbation process will minimise the amount of logical
inconsistencies defined through editing rules, but they will not eliminate all of them,
especially edit rules that are out of scope of the variables that are being perturbed.
Remaining failed edit rules need to be manually or automatically corrected through
edit and imputation processes depending on the amount and types of edit rule failures.
2.2.2 Additive noise for continuous variables
In its basic form, random noise is generated independently and identically distributed
with a positive variance and a mean of zero. The random noise is then added to the
original variable. Adding random noise will not change the mean of the variable for
large datasets but will introduce more variance. This will impact on the ability to make
statistical inferences. Researchers may have suitable methodology to correct for this
type of measurement error but it is good practice to minimise these errors through
better implementation of the method.
Additive noise should be generated within small homogeneous sub-groups (for example, percentiles of the continuous variable) in order to use a different initial perturbation variance for each sub-group. Generating noise in sub-groups also causes fewer edit failures with respect to relationships in the data. Correlated random noise can
be added to the continuous variable thereby ensuring that not only means are
preserved but also the exact variance [17, 18]. A simple method for generating
correlated random noise for a continuous variable z is summarised below [16]:
Procedure 1 (univariate): Define a parameter $\delta$ which takes a value greater than 0 and less than or equal to 1. When $\delta = 1$, we obtain the case of fully modelled synthetic data. The parameter $\delta$ controls the amount of random noise added to the variable $z$. After selecting a $\delta$, calculate $d_1 = \sqrt{1 - \delta^2}$ and $d_2 = \delta$. Now, generate random noise $\varepsilon$ independently for each record with a mean of $\frac{(1 - d_1)}{d_2} \mu$, where $\mu$ is the mean of $z$, and the original variance of the variable, $\sigma^2$. Typically, a Normal distribution is used to generate the random noise. Calculate the perturbed variable $z_i^*$ for each record $i$ in the sample microdata $(i = 1, \ldots, n)$ as a linear combination: $z_i^* = d_1 z_i + d_2 \varepsilon_i$. Note that

$E(z^*) = d_1 E(z) + d_2 \left[ \frac{1 - d_1}{d_2} \right] E(z) = E(z)$

and

$Var(z^*) = (1 - \delta^2) Var(z) + \delta^2 Var(\varepsilon) = Var(z)$

since the random noise is generated independently of the original variable $z$.
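A minimal sketch of the univariate procedure, assuming a single sub-group for brevity (in practice the noise would be generated within percentiles, as noted above) and an illustrative income variable:

import numpy as np

def add_correlated_noise(z, delta, rng):
    # z* = d1 * z + d2 * eps with d1 = sqrt(1 - delta^2) and d2 = delta;
    # eps has mean (1 - d1)/d2 * mean(z) and the standard deviation of z,
    # so the mean and variance of z are preserved in expectation.
    d1, d2 = np.sqrt(1.0 - delta**2), delta
    eps = rng.normal((1.0 - d1) / d2 * z.mean(), z.std(), size=z.size)
    return d1 * z + d2 * eps

rng = np.random.default_rng(42)
income = rng.lognormal(7.0, 0.5, size=10_000)    # hypothetical skewed variable
income_pert = add_correlated_noise(income, delta=0.1, rng=rng)
# The sample mean and variance of income_pert stay close to the originals.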
An additional problem when adding random noise is that there may be several
variables to perturb at once, and these variables may be connected through an edit
constraint of additivity. One procedure to preserve additivity would be to perturb two
of the variables and obtain the third from aggregating the perturbed variables.
However, this method will not preserve the total, mean and variance of the aggregated
variable and in general, it is not good practice to compound effects of perturbation by
aggregating perturbed variables since this causes unnecessary information loss.
Procedure 1 can also be implemented in a multivariate setting where correlated
Gaussian noise is added to the variables simultaneously [16]. The method not only
preserves the means of each of the three variables and their co-variance matrix, but
also preserves the edit constraint of additivity.
Procedure 1 (multivariate): Consider three variables $x, y$ and $z$ where $x + y = z$. This procedure generates random noise that a priori preserves additivity, and therefore combining the random noise with the original variables will also ensure additivity. In addition, means and the covariance structure are preserved. The technique is as follows:

Generate multivariate random noise $(\varepsilon_x, \varepsilon_y, \varepsilon_z)^T \sim N(\boldsymbol{\mu}^*, \Sigma)$, where the superscript $T$ denotes the transpose. In order to preserve sub-totals and limit the amount of noise, the random noise should be generated within percentiles (note that we drop the index for percentiles). The vector $\boldsymbol{\mu}^*$ contains the corrected means of each of the three variables $x, y$ and $z$ based on the noise parameter $\delta$:

$\boldsymbol{\mu}^* = (\mu_x^*, \mu_y^*, \mu_z^*)^T = \left( \frac{1 - d_1}{d_2} \mu_x,\; \frac{1 - d_1}{d_2} \mu_y,\; \frac{1 - d_1}{d_2} \mu_z \right)^T.$

The matrix $\Sigma$ is the original covariance matrix. For each separate variable, calculate the linear combination of the original variable and the random noise as previously described. For example, for record $i$: $z_i^* = d_1 z_i + d_2 \varepsilon_{z_i}$. The mean vector and the covariance matrix remain the same before and after the perturbation, and the additivity is exactly preserved.
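The following sketch implements the multivariate case for $x + y = z$, again within a single group for brevity. Drawing the noise for $(x, y)$ and setting $\varepsilon_z = \varepsilon_x + \varepsilon_y$ is one way to make the drawn noise exactly additive; under the additivity constraint this is equivalent in distribution to the joint draw from the (singular) three-variable covariance matrix:

import numpy as np

def add_multivariate_noise(X, delta, rng):
    # Columns of X are x, y, z with x + y = z. Noise for (x, y) is drawn
    # with corrected means (1 - d1)/d2 * mu and the original 2x2
    # covariance block; eps_z = eps_x + eps_y, so x* + y* = z* exactly.
    d1, d2 = np.sqrt(1.0 - delta**2), delta
    mu_star = (1.0 - d1) / d2 * X.mean(axis=0)
    E = rng.multivariate_normal(mu_star[:2], np.cov(X[:, :2], rowvar=False),
                                size=X.shape[0])
    E = np.column_stack([E, E.sum(axis=1)])      # eps_z = eps_x + eps_y
    return d1 * X + d2 * E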
2.3 Data utility measures
Obviously, SDC methods cause information loss and impact on the utility of the data.
The utility of microdata that has undergone SDC methods is based on whether the
same statistical analysis and inference can be drawn on the perturbed data compared
to the original data. Microdata is multi-purposed and used by many different types of
users with diverse reasons for analysing the data. To assess the utility in microdata,
proxy measures have been developed and include measuring distortions to
distributions and the impact on bias, variance and other statistics (Chi-squared
statistic, R2 goodness of fit, rankings, etc.). For example, some SDC methods, such
as adding random noise where the random noise is generated from a Normal
distribution with a mean of zero and a small variance, will not impact on the point
estimate of a total or an average, but will increase the variance and cause a wider
confidence interval. On the other hand, micro-aggregation will decrease the variance
and cause a narrower confidence interval. The use of such measures for assessing
utility in perturbed statistical data with empirical examples and applications has been
outlined [15, 19, 20, 21]; a brief summary of some useful proxy utility measures is as follows:
2.3.1 Distance metrics
Distance metrics are used to measure distortions to distributions in the microdata as
a result of applying SDC methods. The AAD is a distance metric based on the average
absolute difference between observed and perturbed counts in a frequency
distribution. Let D represent a frequency distribution produced from the microdata and
let $D(c)$ be the frequency in cell $c$. The average absolute distance per cell is defined as:

$AAD(D_{orig}, D_{pert}) = \sum_c \left| D_{pert}(c) - D_{orig}(c) \right| \big/ n_c \qquad (9)$

where $n_c$ is the number of cells in the distribution.
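As a small illustration, (9) is a one-line computation on two same-shaped arrays of counts:

import numpy as np

def aad(d_orig, d_pert):
    # Average absolute distance per cell between two frequency
    # distributions held as same-shaped arrays of counts, as in (9).
    d_orig, d_pert = np.asarray(d_orig, float), np.asarray(d_pert, float)
    return np.abs(d_pert - d_orig).sum() / d_orig.size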
2.3.2 Impact on measures of association
Tests for independence are often carried out on joint frequency distributions between
categorical variables that span a table calculated from the microdata. The test for
independence for a two-way table is based on the Pearson Chi-squared statistic

$\chi^2 = \sum_i \sum_j \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$

where $o_{ij}$ is the observed count and $e_{ij} = (n_{i.}\, n_{.j}) / n$ is the expected count for row $i$ and column $j$. If the row and column variables are independent, then $\chi^2$ has an asymptotic chi-square distribution with $(R-1)(C-1)$ degrees of freedom, and for large values the test rejects the null hypothesis in favour of the alternative hypothesis of association. Typically, Cramer's V is used, which is a measure of association between two categorical variables:

$CV = \sqrt{\frac{\chi^2 / n}{\min(R-1,\, C-1)}}.$

The information loss measure is the percent relative difference between the original and perturbed table:

$RCV(D_{orig}, D_{pert}) = 100 \times \frac{CV(D_{pert}) - CV(D_{orig})}{CV(D_{orig})} \qquad (10)$
For multiple dimensions, log-linear modeling is often used to examine associations. A
similar measure to (10) can be calculated by taking the relative difference in the
deviance obtained from the model based on the original and perturbed microdata.
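A hedged sketch of (10), computed directly from two-way tables of counts (expected counts are assumed to be non-zero):

import numpy as np

def cramers_v(table):
    # Cramer's V for a two-way table of counts.
    table = np.asarray(table, float)
    n = table.sum()
    e = np.outer(table.sum(axis=1), table.sum(axis=0)) / n   # expected counts
    chi2 = ((table - e) ** 2 / e).sum()
    return np.sqrt(chi2 / n / min(table.shape[0] - 1, table.shape[1] - 1))

def rcv(orig, pert):
    # Percent relative difference in Cramer's V, as in (10).
    return 100.0 * (cramers_v(pert) - cramers_v(orig)) / cramers_v(orig)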
2.3.3 Impact on a regression analysis
For continuous variables, it is useful to assess the impact on the correlation and in particular the $R^2$ of a regression (or ANOVA) analysis. For example, in an ANOVA, the test involves checking whether a continuous dependent variable has the same means across groups defined by a categorical explanatory variable. The goodness-of-fit criterion $R^2$ is based on a decomposition of the variance of the mean of the dependent variable. By perturbing the statistical data, the groupings may lose their homogeneity, the “between” variance becomes smaller, and the “within” variance becomes larger. In other words, the means within each of the groupings shrink towards the overall mean. On the other hand, the “between” variance may become artificially larger, showing more association than in the original distribution.
The utility is based on assessing differences in the means of a response variable across categories of an explanatory variable having $K$ categories. Let $\bar{y}_k$ be the mean in category $k$ and define the “between” variance of this mean by:

$BV_{orig}(\bar{y}) = \frac{1}{K-1} \sum_k (\bar{y}_k - \bar{y})^2$

where $\bar{y}$ is the overall mean. Information loss is measured by:

$BVR(\bar{y}_{pert}, \bar{y}_{orig}) = 100 \times \frac{BV_{pert}(\bar{y}) - BV_{orig}(\bar{y})}{BV_{orig}(\bar{y})} \qquad (11)$
In addition, other analyses of information loss involve comparing estimates of coefficients when applying a regression model to both the original and perturbed microdata, and comparing the coverage of confidence intervals.
3. Frequency tables of whole population counts
In this section, we focus on confidentiality protection of frequency tables of whole
population counts. This is more challenging than protecting tables from a sample
where survey-weighted counts are disseminated. Survey weights are inflation factors
assigned to each respondent in the microdata and refer to the number of individuals
in the population represented by the respondent. They take into account survey design
sampling fractions, nonresponse adjustments and benchmarking to population totals,
and hence will vary across individuals in the surveys. The fact that only survey-
weighted counts are presented in the tables means that the underlying sample size is
not known exactly and this provides an extra layer of protection in the tables. In
addition, producers of statistics within agencies generally do not assume that
response knowledge is in the public domain, although they do consider very targeted
intrusion attempts which still may be relevant for survey data when there are outliers.
Nevertheless, there is generally little confidentiality protection needed in tabular data
arising from survey microdata. Typical SDC methods for survey-weighted counts
include coarsening the variables that define the tables, for example banded age
groups and broad categories of ethnicity, and ensuring safe table design to avoid low
or zero cell values in the tables. In particular, low sample cell values are also avoided
due to large confidence intervals and low-quality estimates.
Tabular data for census counts in the form of hard-copy frequency tables have been
the norm for releasing statistical data for decades, and this remains true today. There are recently developed web-based applications that automate the extraction of certain portions of tabular data on request, such as neighbourhood or crime statistics for a specific region. One example of this type of application is the Nomis website [22]. Nomis is a
service provided by the Office for National Statistics in the United Kingdom to provide
free access to detailed and up-to-date labour market statistics from official sources.
Tabular outputs for censuses and whole population counts are pre-determined after
careful consideration of population thresholds, average cell sizes, collapsing and fixing
categories of variables spanning the tables. In spite of these efforts, SDC methods are
necessary. These techniques include pre-tabular methods, post-tabular methods and
combinations of both described in Section 3.2.
3.1 Types of disclosure risk
The main disclosure risk in a register/census context comes from small counts, i.e.
ones and twos, since these can lead to re-identification. The amount and placement
of the zeros in the table determines whether new information can be learnt about an
individual or a group of individuals. The disclosure risks defined are [14, 21]:
Individual attribute disclosure - An individual can be identified on the basis of some of the variables spanning the table, and hence a new attribute is revealed about the individual; for tabular data, this means that there is a cell count of one in a margin of the table.
Identification is a necessary pre-condition for attribute disclosure and therefore should
be avoided. In a census/register context where many tables are released, an
identification made in a lower dimensional table will lead to attribute disclosure in a
higher dimensional table.
Group attribute disclosure - If there is a row or column that contains mostly zeros and
a small number of non-zero cells, then one can learn a new attribute about a group of
individuals and also learn about the group of individuals who do not have this attribute.
This type of disclosure risk does not require individual identification but will cause harm
to a group of individuals.
Disclosure by differencing - Two tables that are nested may be subtracted from each
other resulting in a new table containing small cells and the above disclosure risk
scenarios would apply. For example, a table containing the elderly population in
private households may be subtracted from a table containing the total elderly
population, resulting in a table of the elderly in communal establishments. This table
is typically very sparse compared to the two original tables.
Disclosure by linking tables - Since many tables are disseminated from one data
source, they can be linked through common cells and common margins, thereby
increasing the chances for revealing SDC methods and original cell counts.
To protect against attribute disclosure, SDC methods should limit the risk of
identification and also introduce ambiguity into the zero counts. To avoid disclosure by
differencing, often only one set of variables and geographies are disseminated with no
possibilities for overlapping categories. To avoid disclosure by linking tables, margins
and cells of tables should be made consistent. Since census tables have traditionally
been released in hard-copy, these latter two disclosure risks from linking and
differencing tables were controlled by the agencies through strict definitions of the
variables defining the tables and no release of tables differing in a single category. In
addition, agencies often employ transparent and visible SDC methods to avoid any
perception that there may be disclosure risks in the data and resources are directed
to ensure that the public is informed about the measures taken to protect the
confidentiality of responding units.
In Table 2, we present an example census table where we discuss the disclosure risks.
Table 2: Example census table

Benefits                 16-24   25-44   45-64   65-74   75+
Benefits Claimed           17       8       2       4     1
Benefits Not Claimed       23      35      20       0     0
Total                      40      43      22       4     1
In Table 2 we see evidence of small cell values which can lead to re-identification, for
example the cell value of size 1 in ‘75+’ and ‘benefits claimed’. In addition, the cell
value of size 2 in ‘45-64’ and ‘benefits claimed’ can lead to re-identification since it is
possible for an individual in the cell to subtract him or herself out and therefore re-
identify the remaining individual. Moreover, we see evidence of attribute disclosure through the zeros in the cell values for the columns labelled ‘65-74’ and ‘75+’ under ‘benefits not claimed’. This means that we have learnt that all individuals aged 65-74 are claiming benefits, although we have not made a re-identification. However, for the single individual in the column labelled ‘75+’ we have made an identification and have also learnt that the individual is claiming a benefit.
Tables 3a and 3b show an example of two tables that are nested and may be
subtracted one from the other resulting in a new table containing small cell values.
Table 3a: Example census table of all elderly by long-term illness

Health                   65-74   75-84   85+   Total
No long-term illness       17       9     6      32
Long-term illness          23      35    20      78
Total                      40      44    26     110

Table 3b: Example census table of all elderly living in households by long-term illness

Health                   65-74   75-84   85+   Total
No long-term illness       15       9     5      29
Long-term illness          22      33    19      74
Total                      37      42    24     103
We see that both Tables 3a and 3b are seemingly without disclosure risks on their
own. However, Table 3b can be subtracted from Table 3a to obtain a very disclosive
differenced table of elderly living in nursing homes and other institutionalised care
with very small and zero cell values.
3.2 Statistical disclosure control methods
In preliminary steps, tables are carefully designed with respect to the variables that
define the tables and the definition of their categories. There are also general rules-
of-thumb regarding population thresholds and the number of dimensions allowed in
the table.
Pre-tabular methods are implemented on the microdata prior to the tabulation of the
tables. The most commonly used method is record swapping between a pair of
households/individuals matching on some control variables. As mentioned, this
method has been used for protecting census tables at the United States Bureau of the
Census and the Office for National Statistics (ONS) in the United Kingdom. In general,
agencies prefer record swapping since the method is easy to implement, all
tabulations are consistent and marginal distributions are preserved exactly on higher
aggregations of the data.
Post-tabular methods are implemented on the entries of the tables after they are
computed and typically take the form of random rounding or its extension of using a
random probability mechanism (similar to PRAM), either on the small cells of the
tables or on all entries of the tables. Within the framework of developing the SDC
software package Tau-Argus, a fully controlled rounding option has been added [23].
The procedure uses linear programming techniques to round entries up or down and
in addition ensures that all rounded entries add up to the rounded totals. However, the
controlled rounding option is not able to cope with the size, scope and magnitude of
census tabular outputs. Cell suppression is also not typically used for census outputs
due to the need to suppress same cells across a large number of tables.
Here we describe in more detail three common SDC methods which have been used
to protect whole population frequency tables, a pre-tabular SDC method of record
swapping and post-tabular SDC methods of random rounding and stochastic
perturbation. Examples of applications and case studies can be found [21, 24, 25, 26].
3.2.1 Record swapping
Record swapping is based on the exchange of values of variable(s) between similar
pairs of population units (often households). In order to minimise bias, pairs of
population units are determined within groups defined by control variables. For
example, for swapping households in a census/register context, control variables may
include: a large geographical area, household size and the age-sex distribution of
individuals in the households. In addition, record swapping can be targeted to high-
risk population units found in small cells of census tables. Typically geographical
variables related to place of residence are swapped. Swapping place of residence has the following properties: (1) it minimises bias, based on the assumption
that place of residence is independent of other target variables conditional on the
control variables; (2) it provides more protection in the tables since place of residence
is a highly visible variable which can be used to identify individuals; (3) it preserves
marginal distributions within a larger geographical area.
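A minimal pandas sketch of untargeted geography swapping within control groups follows; the column names and swap rate are illustrative, not those used by any agency, and a unique record index is assumed:

import numpy as np
import pandas as pd

def swap_geography(df, control_vars, geo_var, swap_rate, rng):
    # Within each group defined by the control variables, randomly pair
    # a fraction of records and exchange their geography codes.
    out = df.copy()
    for _, grp in df.groupby(control_vars):
        idx = rng.permutation(grp.index.to_numpy())
        n = int(len(idx) * swap_rate / 2)        # number of swapped pairs
        a, b = idx[:n], idx[n:2 * n]
        out.loc[a, geo_var] = df.loc[b, geo_var].to_numpy()
        out.loc[b, geo_var] = df.loc[a, geo_var].to_numpy()
    return out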
3.2.2 Semi-controlled random rounding
Another post-tabular SDC method for frequency tables is based on unbiased random rounding. Let $Floor(x)$ be the largest multiple $kb$ of the base $b$ such that $kb \le x$ for any value of $x$. We define the residual as $res(x) = x - Floor(x)$. For an unbiased rounding procedure, $x$ is rounded up to $Floor(x) + b$ with probability $\frac{res(x)}{b}$ and rounded down to $Floor(x)$ with probability $1 - \frac{res(x)}{b}$. If $x$ is already a multiple of $b$, it remains unchanged.

In general, each small cell is rounded independently in the table, i.e. a random uniform number $u$ between 0 and 1 is generated for each cell. If $u < \frac{res(x)}{b}$ then the entry is rounded up, otherwise it is rounded down. This ensures an unbiased rounding scheme, i.e. the expectation of the rounding perturbation is zero. However, the realisation of this stochastic process on a finite number of cells in a table will not ensure that the sum of the perturbations will exactly equal zero.
To place some control in the random rounding procedure, we use a semi-controlled
random rounding algorithm for selecting entries to round up or down as follows: First
the expected number of entries of a given res(x) that are to be rounded up is
predetermined (for the entire table or for each row/column of the table). The expected
number is rounded to the nearest integer. Based on this expected number, a random
sample of entries is selected (without replacement) and rounded up. The other entries
are rounded down. This process ensures that rounded internal cells aggregate to the
controlled rounded total.
Due to the large number of perturbations in the table, margins are typically rounded
separately from internal cells and tables are not additive. When using semi-controlled
random rounding this alleviates some of the problems of non-additivity since one of
the margins and the overall total will be controlled, i.e. the rounded internal cells
aggregate to the rounded total. Another problem with random rounding is the
consistency of the rounding across the same cells that are generated in different
tables. It is important to ensure that the cell value is always rounded consistently,
otherwise the true cell count can be learnt by generating many tables containing the
same cell and observing the perturbation patterns.
The use of microdata keys which can solve the consistency problem has been
proposed [27]. First, a random number (which they call a key) is defined for each
record in the microdata. When building a census frequency table, records in the
microdata are combined to form a cell defined by the spanning variables of the table.
When these records are combined to a cell, their keys are also aggregated. This
aggregated key serves as the seed for the rounding and therefore same cells will
always have the same seed and result in consistent rounding.
3.2.3 Stochastic perturbation
A more general method than random rounding is stochastic perturbation which
involves perturbing the internal cells of a table using a probability transition matrix and
is similar to the post-randomisation method (PRAM) that is used to perturb categorical
variables in microdata as described in Section 2.2. In this case, it is the cell counts in
a table that are perturbed [24, 27].
Here we focus on the random probability mechanism. Table 4 demonstrates one such
probability mechanism for small cell values of a table.
The probabilities define the chance that a cell value having an original value, of say 1,
will be changed to a perturbed cell value of 2 (in this case the probability is 0.05). The
probability of remaining a value of 1 is 0.80. Note that the sum of the rows is equal to
one. Depending on a random draw, the value of the cell will change or not change
according to the probabilities in the mechanism.
In the example above, if the random draw is below 0.05 then the value of 1 will be
perturbed to 0; if the random draw is between 0.05 and 0.85 then the value of 1 will
remain a 1; if the random draw is between 0.85 and 0.90 then the value of 1 will be
perturbed to 2; if the random draw is between 0.90 and 0.95 then the value of 1 will
be perturbed to 3; and finally if the random draw is above 0.95 then the value of 1 will
be perturbed to 4. Note that in this strategy, an original value of 0 is not perturbed.
Table 4: Example of probability mechanism to perturb small cell values

                         Perturbed value
Original value        0      1      2      3      4
      0               1      0      0      0      0
      1            0.05   0.80   0.05   0.05   0.05
      2            0.05   0.05   0.80   0.05   0.05
      3            0.05   0.05   0.05   0.80   0.05
      4            0.05   0.05   0.05   0.05   0.80
The probability mechanism has been modified so that the internally perturbed cell
values in a row/column will (approximately) sum to the perturbed cell value in the
respective margin [24]. This is done by a transformation of the probability mechanism
so that the frequencies of the cell values from the original table will be preserved on
the perturbed table. The transformation is described in more detail in Section 2.2
where we introduce the invariant probability matrix R. In this case t is the vector of
frequencies of the cell values and v is the vector of relative frequencies: v=t/K, where
K is the number of cells in the table. Placing the condition of invariance on the
probability transition matrix ensures that the marginal distribution of the cell values are
approximately preserved under the perturbation. As described in the random rounding
procedure, in order to obtain the exact marginal distribution, a similar strategy for
selecting the cell values to change can be carried out. For each cell value i, the
expected number of cells that need to be changed to a different value j is calculated
according to the probabilities in the transition matrix. We then randomly select (without
replacement) the expected number of cells i and carry out the change to j.
To preserve exact additivity in the table, an iterative proportional fitting algorithm can
be used to fit the margins of the table after the perturbation according to the original
margins. This results in cell values that are not integers. Exact additivity with integer
counts can be achieved for simple tables by controlled rounding to base 1 using for
example Tau-Argus [28]. Cell values can also be rounded to their nearest integers
resulting in ‘close’ additivity because of the invariance property of the transition matrix.
Finally, the use of microdata keys can also be adapted to this SDC method to ensure
consistent perturbation of same cells across different tables by fixing the seed for the
perturbation.
3.3 Disclosure risk measures based on Information Theory
Since the tabular data are based on whole population counts, disclosure risk
measurement is straightforward as the disclosure risk is observed directly. One disclosure risk measure is simply the percentage of cells containing a value of 1 or 2. In
addition, the placement of zero cell values in the table and whether they appear in a
single row or column of the table pose a risk of attribute disclosure.
Degenerate distributions in tables where rows/columns have mainly zero cell values
with only a few non-zero cell values have a high risk of attribute disclosure whereas a
row/column with a uniform distribution of cell counts would have little attribute
disclosure risk. Moreover, a row/column with large counts would have less risk of re-
identification compared to a row/column with small counts. New disclosure risk measures based on Information Theory have been proposed [29]. These measures assign a value between 0 and 1 to the level of risk caused by degenerate distributions, where a value of 1 denotes the extreme case in which a row/column contains only zero cell values apart from a single non-zero cell.
Using Information Theory, an analytical expression is the following. The entropy of the frequency vector in a table of size $K$, with population counts $F = (F_1, F_2, \ldots, F_K)$, where $\sum_{i=1}^{K} F_i = N$, is:

$$H(P) = -\sum_{i=1}^{K} \frac{F_i}{N} \log \frac{F_i}{N} = \frac{N \log N - \sum_{i=1}^{K} F_i \log F_i}{N}$$

and, to produce a disclosure risk measure between 0 and 1, the disclosure risk measure is defined as:

$$1 - \frac{H(F/N)}{\log K}. \qquad (12)$$
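A minimal sketch of the risk measure in (12):

```python
import numpy as np

def entropy_risk(F):
    """Risk measure (12): 1 - H(F/N)/log(K). Equals 1 for a degenerate
    vector with a single non-zero cell and 0 for a uniform vector."""
    F = np.asarray(F, dtype=float)
    K, N = F.size, F.sum()
    p = F[F > 0] / N                 # empty cells contribute nothing to the entropy
    H = -(p * np.log(p)).sum()
    return 1 - H / np.log(K)

print(entropy_risk([0, 0, 0, 7]))    # 1.0: all counts in one cell
print(entropy_risk([2, 2, 2, 2]))    # 0.0: uniform distribution
```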
An extended disclosure risk measure is proposed in (13), which also accounts for the overall population size of the table and the number of zeros, and is defined as a weighted average of three different terms, each term being a measure between 0 and 1:

$$R(F, w_1, w_2) = w_1 \frac{|A|}{K} + w_2 \left( 1 - \frac{N \log N - \sum_{i=1}^{K} F_i \log F_i}{N \log K} \right) + (1 - w_1 - w_2) \frac{1}{\log(N + e - 1)} \qquad (13)$$

where $A$ is the set of zeroes in the table and $|A|$ the number of zeros in the set, $K$, $N$ and $F$ are as defined above, and $w_1, w_2$ are arbitrary weights with $0 \le w_1, w_2$ and $w_1 + w_2 \le 1$.
The first term in (13) is the proportion of zeros, which is relevant for attribute disclosure, and the second term is the entropy-based measure in (12). The third term in (13) allows us to differentiate between tables of different magnitudes: as the population size N of the table gets larger, the third term tends to zero. The weights $w_1$ and $w_2$ should be chosen according to how important the agency judges each term's contribution to disclosure risk to be. Alternatively, one can avoid weights altogether by taking the L2-norm of the three terms of the risk measure in (13) as follows:
$$\frac{\|x\|_2}{\sqrt{3}} = \frac{1}{\sqrt{3}} \left( \sum_{i=1}^{3} x_i^2 \right)^{1/2}$$

where $x_i$ represents term $i$, $i = 1, 2, 3$, in (13). This provides more weight to the larger terms of the risk measure.
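The following sketch computes the three terms and both aggregations; note that the closed form of the third term, written here as 1/log(N + e - 1), is our reconstruction of the term in [29] and should be checked against that paper.

```python
import numpy as np

def risk_terms(F):
    """The three [0,1] terms of (13): proportion of zeros, the entropy-based
    measure from (12), and a term decaying with the population size N
    (the last written here as 1/log(N + e - 1), our reading of [29])."""
    F = np.asarray(F, dtype=float)
    K, N = F.size, F.sum()
    nz = F[F > 0]
    x1 = (F == 0).mean()                                               # |A| / K
    x2 = 1 - (N * np.log(N) - (nz * np.log(nz)).sum()) / (N * np.log(K))
    x3 = 1 / np.log(N + np.e - 1)
    return np.array([x1, x2, x3])

def extended_risk(F, w1, w2):
    """Weighted average of the three terms, as in (13)."""
    x = risk_terms(F)
    return w1 * x[0] + w2 * x[1] + (1 - w1 - w2) * x[2]

def l2_risk(F):
    """Weight-free alternative: L2 norm of the three terms, scaled to [0,1]."""
    x = risk_terms(F)
    return np.sqrt((x ** 2).sum() / 3)

F = [0, 0, 1, 1, 2, 16]
print(extended_risk(F, w1=0.4, w2=0.4), l2_risk(F))
```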
The risk measure in (13) has been expanded to include the case of disclosure risk
assessment based on Information Theory following the application of SDC methods
[30]. This risk measure depends on the conditional entropy H(P|Q) where Q is the
distribution following perturbation. The conditional entropy represents the amount of
information needed to recover the original distribution P given that we observe the
confidentialised distribution Q.
3.4 Data utility measures
To assess the impact on data utility for frequency tables of whole population counts,
we can use similar utility measures as described for microdata in Section 2.3 since
many of the measures are based on examining frequency distributions in the original
data versus the perturbed data. For example, the utility measures based on distance metrics between original and perturbed cell values in a table, and the impact on a Chi-square test for independence, which examines statistical associations between the variables defining the table, are particularly relevant. The aim is to ensure that the power of such statistical tests is not impacted by the perturbation and that the decisions reached under statistical inference do not change.
Continuing with the Information Theory based measures outlined above [29], utility can be measured by the Hellinger Distance (defined below), which assesses the distance between two distributions.
In the case of frequency distributions from whole-population tables, where $F = (F_1, F_2, \ldots, F_K)$ is the vector of original counts and $G = (G_1, G_2, \ldots, G_K)$ is the vector of perturbed counts, with $\sum_{i=1}^{K} F_i = N$ and $\sum_{i=1}^{K} G_i = M$, the Hellinger Distance is defined as:

$$HD(F, G) = \frac{1}{\sqrt{2}} \left\| \sqrt{F} - \sqrt{G} \right\|_2 = \sqrt{\frac{1}{2} \sum_{i=1}^{K} \left( \sqrt{F_i} - \sqrt{G_i} \right)^2} \qquad (14)$$
The Hellinger’s Distance takes into account the magnitude of the cells since the
difference between square roots of two ‘large’ numbers is smaller than the difference
between square roots of two ‘small’ numbers, even if these pairs have the same
absolute difference. Also, Hellinger’s Distance is not impacted on original cell counts
that are zero as would be the case for relative distances. The lower bound remains
zero and the upper bound of this distance of counts changes:
$$HD(F, G)^2 = \frac{1}{2} \sum_{i=1}^{K} \left( \sqrt{F_i} - \sqrt{G_i} \right)^2 \le \frac{1}{2} \sum_{i=1}^{K} \left( F_i + G_i \right) = \frac{N + M}{2},$$

so that $HD(F, G) \le \sqrt{(N + M)/2}$.
The corresponding utility measure, which is bounded by 0 and 1 where 0 represents low utility and 1 represents high utility, is:

$$1 - \frac{HD(F, G)}{\sqrt{(N + M)/2}}.$$
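Both (14) and the bounded utility measure translate directly into code; the count vectors below are invented for the example:

```python
import numpy as np

def hellinger(F, G):
    """Hellinger Distance (14) between original and perturbed count vectors."""
    F, G = np.asarray(F, float), np.asarray(G, float)
    return np.sqrt(0.5 * ((np.sqrt(F) - np.sqrt(G)) ** 2).sum())

def hellinger_utility(F, G):
    """Utility in [0,1]: 1 - HD(F,G)/sqrt((N+M)/2); 1 means no perturbation."""
    N, M = np.sum(F), np.sum(G)
    return 1 - hellinger(F, G) / np.sqrt((N + M) / 2)

F = [12, 0, 3, 1, 44]          # original counts
G = [12, 0, 4, 0, 44]          # counts after, say, rounding the small cells
print(hellinger(F, G), hellinger_utility(F, G))
```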
4. Magnitude tables from business statistics
Concerns about the disclosure of sensitive information arising from magnitude tables of business statistics date back to the 1980s. Magnitude tables are defined as
tables where the cells contain sums or averages of a continuous variable such as total
turnover, profits or revenue and the table is spanned by identifying variables, such as
region and economic activity. It was recognized that potential intruders to this type of
statistical data were other businesses that were interested in learning sensitive
commercial information about their competitors. Therefore, we assume that intruders
are competing businesses in a cell of the table and that the identity of other businesses
in the cell is known. In addition, we assume that the intruders also know the ranking
of the businesses with respect to their size. The main concern is therefore one of
attribute disclosure.
Disclosure risk measures are known as sensitivity measures and are based on
whether a contributor in the cell of a table can learn the values of the target variable
for the other contributors in the cell with sufficient precision. Since business surveys
have large sampling fractions and in particular take-all strata where all large
businesses are required to respond, we do not account for any protection afforded by
sampling.
In the general framework, a table is defined by cross-classification of categorical
variables, such as the standard industrial classification (SIC) and region. Let X denote
a generic cell, N(X) denote the number of contributors in the cell and xi denote the
value of the target variable for contributor $i$. We define the total in cell $X$ as $T(X) = \sum_{i=1}^{N(X)} x_i$.
Assume $x_i > 0$ for $i = 1, \ldots, N(X)$ and that the observations can be ordered so that $x_1 \ge x_2 \ge \cdots \ge x_{N(X)} > 0$. Assuming that an intruder is a contributor in the same cell, we wish to prevent the intruder from disclosing an $x_i$ value for another contributor $i$. A crude approach for the intruder would be to estimate $x_i$ by $T(X)$, which may be a good estimate if one business contributes a large proportion of the cell total.
A $(1, k)$ rule classifies a cell as disclosive if $x_1 > (k/100)\,T(X)$, and this rule can be generalised to the $(n, k)$ rule, which classifies a cell as disclosive if $x_1 + \cdots + x_n \ge (k/100)\,T(X)$. The generalised rule allows for the possibility that a number of businesses in a cell, say 2 businesses, form a coalition to disclose the value of the third business in a cell of size 3. In addition, a threshold rule is generally upheld and any cell with n contributors or fewer is deemed disclosive. For example, if $N(X) = 2$ we obtain exact disclosure $x_1 = T(X) - x_2$ and hence a general threshold rule of 3 is used.
Another sensitivity measure is the p% rule. The most precise estimate by the second largest contributor of the value of the largest contributor in a cell is $\hat{x}_1 = T(X) - x_2$. The percent error is $100 \times (\hat{x}_1 - x_1)/x_1 = 100 \times (T(X) - x_1 - x_2)/x_1$. Under the p% rule, the cell is disclosive if $100 \times (T(X) - x_1 - x_2)/x_1 \le p$. It is also well known that if the parameters of the sensitivity measures, such as p, n or k, are known to intruders, they can be used to disclose sensitive information, and hence the parameters are not released.
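Both sensitivity rules are simple to express in code. The parameter values and the example cell below are illustrative only; as noted above, agencies do not release the actual parameters:

```python
def nk_disclosive(x, n=1, k=85):
    """(n,k) rule: cell is disclosive if the n largest contributors
    account for at least k% of the cell total T(X)."""
    x = sorted(x, reverse=True)
    return sum(x[:n]) >= (k / 100) * sum(x)

def p_percent_disclosive(x, p=10):
    """p% rule: disclosive if the second-largest contributor can estimate
    the largest to within p%, i.e. 100*(T(X) - x1 - x2)/x1 <= p."""
    x = sorted(x, reverse=True)
    T, x1, x2 = sum(x), x[0], x[1]
    return 100 * (T - x1 - x2) / x1 <= p

cell = [900, 60, 40]           # one dominant business in a cell of three
print(nk_disclosive(cell), p_percent_disclosive(cell))   # True True
```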
To protect magnitude tables containing business statistics, table design and cell
suppression are generally used. Based on the sensitivity and threshold rules, the
disclosive cells are suppressed. These are called the primary suppressions. Then,
other cells need to be suppressed to ensure that the primary suppressions are not
revealed through the marginal totals. These are called secondary suppressions. For
a 2-dimensional table, for example, at least 2 suppressed cells are needed in each affected row and column, i.e. the cells at the vertices of a rectangle, to ensure that the primary suppressions are safe and cannot be recalculated. This is known as the hypercube method for secondary cell suppression.
To optimise secondary cell suppressions, mathematical linear programming is used in Tau-Argus, where an objective function $\sum C(X)$ is minimised [28]. For $C(X) = 1$ we minimise the total number of cells suppressed, for $C(X) = N(X)$ we minimise the number of contributors suppressed, and for $C(X) = T(X)$ we minimise the total value of the target variable suppressed. Note that, depending on the objective function, information loss measures should account for how the secondary suppressions are defined. The linear programming problem can be computationally heavy (it is NP-hard), so simplified and alternative solutions may be used. The constraints of the linear programme are the preservation of margins and ensuring non-negative values in the table [14, 23, 31].
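As an illustration of the rectangle idea (not of the linear-programming formulation used by Tau-Argus), the following sketch chooses, for one primary suppression in a 2-dimensional table, the three complementary corners that minimise the total cost C(X):

```python
import numpy as np

def hypercube_secondary(T, primary, cost=None):
    """Rectangle idea for secondary suppression in 2-D: for a primary
    suppressed cell (r, c), choose another row and column minimising the
    total cost of the three complementary corners (here C(X) = T(X))."""
    r, c = primary
    cost = T if cost is None else cost
    best, best_val = None, np.inf
    for r2 in range(T.shape[0]):
        for c2 in range(T.shape[1]):
            if r2 == r or c2 == c:
                continue
            val = cost[r, c2] + cost[r2, c] + cost[r2, c2]
            if val < best_val:
                best, best_val = [(r, c2), (r2, c), (r2, c2)], val
    return best

T = np.array([[ 5, 40, 55],
              [30, 25, 45],
              [60, 35, 10]])
print(hypercube_secondary(T, primary=(0, 0)))   # corners of a rectangle with (0,0)
```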
5. Disclosure risk–data utility confidentiality map
The disclosure risk and utility measures can be used to produce a disclosure risk-data utility confidentiality map [32]. We conceptualise the map in Figure 1.
Figure 1: Conceptualised disclosure risk-data utility confidentiality map
In the lower left-hand quadrant of the map in Figure 1, we have low disclosure risk and
low utility. In fact, not releasing data at all will have no utility although some disclosure
risk remains as information about the disclosive nature of the data is leaked by not
allowing its release. In the upper right-hand quadrant of the map in Figure 1, we have
high disclosure risk and high utility. We can see that the original data is above a
maximal tolerable risk threshold determined by the agency and hence SDC methods
need to be applied. SDC methods impact negatively on the utility of the data. Thus
SDC is an iterative process, where different SDC methods are applied with different
parameterisations, and the disclosure risk and data utility are quantified and mapped
on to the disclosure risk-data utility confidentiality map. The SDC method that is below
the risk threshold and having the highest utility is selected. Note that the data points
form a frontier on the map which allows the selection of the optimal SDC method.
A more realistic example of disclosure risk-data utility confidentiality mapping compares random and targeted data swapping procedures with random and targeted PRAM procedures on the variable Local Authority District (LAD) [12]. In this example,
the population includes N = 1,468,255 individuals from an extract of the 2001 UK
Census. We drew 1% Bernoulli samples (n = 14,683) and defined six key categorical
variables for the risk assessment: LAD (11), sex (2), age groups (24), marital status
(6), ethnicity (17), economic activity (10), where the numbers of categories of each
variable are in parentheses (K = 538,560). Both the random data swap and the random
PRAM on the LAD variable were carried out at rates of 2%, 5%, 10% and 20%. The
remaining individuals were not perturbed. For the targeted data swap and targeted
PRAM we carried out more perturbations in the ‘other’ ethnicity group compared to the
‘White British’ ethnicity group.
In Figure 2 we plot the disclosure risk-data utility confidentiality map. The points on the
map represent different candidate releases, that is, perturbation methods with different
levels of perturbation. The points are denoted T for targeted or R for random; 20 for
20%, 10 for 10%, 5 for 5% or 2 for 2%; and S for swapping or P for PRAM. The points
are plotted against the risk measure in (8) in Section 2.1 on the Y-axis and, on the X-axis, a utility measure based on a distance metric for the counts in the table of LAD by ethnicity. The distance metric is first calculated as the absolute perturbation per cell; a relative distance from the average cell count is then calculated, so that the higher the measure, the higher the utility. This is denoted the relative absolute average distance per cell (RAAD).
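The exact formula for RAAD is not reproduced here; one plausible reading of the description above, which we label as an assumption, is the following sketch:

```python
import numpy as np

def raad_utility(orig, pert):
    """Hypothetical reading of RAAD: average absolute perturbation per cell,
    taken relative to the average cell count and inverted so that higher
    values mean higher utility; the exact formula used in [12] may differ."""
    orig, pert = np.asarray(orig, float), np.asarray(pert, float)
    return 1 - np.abs(orig - pert).mean() / orig.mean()
```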
Figure 2: Disclosure risk-data utility confidentiality map on real example
From Figure 2, we see that we have approximately the same level of utility between
the targeted 10% perturbation and the random 20% perturbation with respect to the
RAAD, although we obtain lower disclosure risk with the targeted 10% perturbation.
The same applies to the targeted 5% perturbation and the random 10% perturbation,
with the targeted 5% perturbation having less disclosure risk than the random 10%
perturbation at the same level of utility. We draw a line to connect points on the
disclosure risk- data utility frontier and note that in all cases, at given levels of utility,
the targeted data swapping provides the lowest disclosure risk compared to the other
methods, although there is little difference between targeted swapping and targeted
PRAM. Targeting did not appear to impact much on the utility, and the general
conclusion here is that targeting seems useful, enabling less perturbation to be applied
and hence more utility for a given level of disclosure risk protection. Of course, this finding could vary in other settings, and producers of statistics within agencies should use a similar risk-utility approach, based on their own data, to determine their preferred SDC approach.
6. New dissemination strategies
Up until now, we have focused on traditional types of statistical data that are
disseminated by producers of statistics within agencies: tabular data and microdata.
However, with increasing demand for more open and accessible statistical data,
agencies are now considering alternative dissemination strategies. In this section, we
examine some of these strategies.
6.1 Safe data enclaves and remote access
To meet increasing demands for high resolution data, many agencies have set up data
enclaves on their premises where approved researchers can go onsite and gain
access to confidential statistical data. The secure servers within the enclave have no
connection to printers or the internet and only authorised researchers are allowed to
access them. To minimise disclosure risk, no data is to be removed from the enclave
and researchers undergo specialised training to understand the confidentiality
guidelines. Researchers are generally provided with standard software within the system, such as Stata, SAS and R, but specialised software is typically not available. All information flow is controlled and monitored. Any outputs to be taken out
of the data enclave are dropped in a folder and manually checked by experienced
confidentiality officers for disclosure risks. Examples of disclosure risks in outputs are small cell counts in tables, residual plots from regression models which may highlight outliers, and kernel density estimates with small bandwidths.
The disadvantage of the data enclave is the need to travel, sometimes long distances,
to access confidential data. In recent years, some agencies have piloted remote
access by extending the concept of the data enclave to a 'virtual' data enclave. These 'virtual' data enclaves can be set up at other government agencies, universities and even on a researcher's own laptop. Users log on to secure servers via VPN
connections to access the confidential data. All activity is logged and audited at the
keystroke level and outputs are reviewed remotely by confidentiality officers before
being sent back to the researchers via a secure file transfer protocol site. The
technology also allows users within the same research group to interact with one
another while working on the same dataset. An example of this technology is the Inter-
University Consortium for Political and Social Research (ICPSR) housed at the
University of Michigan. The ICPSR maintains access to data archives of social science
data for research and operates both a physical on-site data enclave and a ‘virtual’ data
enclave [33].
6.2 Web-based applications
In recent years, two types of web-based dissemination applications have been
considered by producers of statistics within agencies: flexible table generators and
remote analysis servers.
6.2.1 Flexible table generating servers
Driven by demand from policy makers and researchers for specialised and tailored
tables from statistical data, particularly census data, some agencies are developing
flexible table generating servers that allow users to define and generate their own
tables. The United States Census Bureau [34] and the Australian Bureau of Statistics
[35] have developed such servers for disseminating census tables. Eurostat also
provides a platform for a flexible table generating server for European census counts
[36] although this is based on a series of large uniform hyper-tables that were
produced by each member state and hence the tabulations are more limited and do
not go beyond the underlying tables. Users access the servers via the internet and
define their own table of interest from a set of pre-defined variables and categories
typically from drop down lists. The generated table undergoes a series of checks, and
if it passes the criteria, it is downloaded onto the user’s PC without the need for human
intervention.
Whilst the online flexible table generators have the same types of disclosure risks as
mentioned in Section 3.1, the disclosure risks based on disclosure by differencing and
disclosure by linking tables are now very much relevant since there are no
interventions or manual checks on what tables are produced or how many times tables
are generated. Therefore, for these types of online systems for tables, the statistical
community has recognised the need for perturbative methods to protect against
disclosures. When selecting the SDC technique to apply to the generated output table,
there are two approaches: apply SDC to the underlying data so that all tables
generated in the server are deemed safe for dissemination (pre-tabular SDC), or
produce tables directly from original data and apply the SDC technique to the final
tabular output (post-tabular SDC). Although sometimes neater and less resource intensive for data from a single source, the pre-tabular approach is problematic since a large amount of aggregation and perturbation is needed to protect the underlying data used in the server. Therefore, when generating a table through the server, the SDC is compounded and over-protects the data whilst decreasing the utility of the
table. The post-tabular approach has improved utility since the perturbation is only
carried out on the generated table. A discussion of disclosure risk and data utility for a flexible table builder can be found in [30]. This post-tabular approach is also motivated
by the computer science definition of differential privacy as discussed briefly in Section
7. Often, a combination of pre-tabular and post-tabular approaches is undertaken.
As mentioned, the design of remote table generating servers typically involves many
ad-hoc preliminary SDC checks that can easily be programmed within the system to
determine whether tables can be released or not. These SDC checks may include:
limiting the number of dimensions in the table, minimum population thresholds,
ensuring consistent and nested categories of variables to avoid disclosure by
differencing, etc. If the requested table does not meet the standards, it is not released
through the server and the user is advised to redesign the table.
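Such preliminary checks are easy to express in code. The rules and thresholds in the following sketch are invented for the example and would in practice be set by the agency:

```python
MAX_DIMS = 3
MIN_POPULATION = 50
NESTED_GEOGRAPHIES = {"region", "local_authority"}   # illustrative nested categories

def table_request_ok(dims, population_total, geography):
    """Illustrative pre-release checks for a flexible table generating server;
    the rules and thresholds here are invented for the example."""
    if len(dims) > MAX_DIMS:
        return False, "too many dimensions: redesign the table"
    if population_total < MIN_POPULATION:
        return False, "population below the minimum threshold"
    if geography not in NESTED_GEOGRAPHIES:
        return False, "geography must come from the nested classification"
    return True, "passed preliminary SDC checks"

print(table_request_ok(["age", "sex", "occupation"], 1200, "region"))
```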
For flexible table generation, the server has to quantify the disclosure risk in the
original table, apply an SDC technique and then reassess the disclosure risk.
Obviously, the disclosure risk will depend on whether the underlying data is a whole
population (census) and the zeros are real zeros, or the data are from a survey and
the zeros may be random zeros. After the table is protected, the server should also
calculate the impact on the utility by comparing the perturbed table to the original table.
Measures based on Information Theory presented in Sections 3.3 and 3.4 can be used
to assess disclosure risk and utility in a table generating server since they can be
calculated in real time. In addition, some perturbation methods for protecting census
tables are presented in Section 3.2.
As an example, in Table 5 we compare different SDC methods for a census table
defined in one region of the United Kingdom according to banded age groups,
education qualification and occupation. The table contained 2,457 cells where 62.4%
were real zeros. The underlying data in the flexible table generating server was a very
large hypercube which provided a priori protection since no units below the level of
the cells of the hypercube are disseminated. We compare three pre-tabular methods
on the hypercube: record swapping, semi-controlled random rounding and a
stochastic perturbation, and a post-tabular method of semi-controlled random
rounding applied directly to the output table. The measures are based on Information
Theory as described in Sections 3.3 and 3.4.
Table 5: Information Theory disclosure risk and utility for a generated table

                                          Disclosure Risk   Hellinger Distance
Original                                  0.318             -
Perturbed Input:
  Record Swapping                         0.282             0.988
  Semi-controlled Random Rounding         0.137             0.991
  Stochastic Perturbation                 0.239             0.995
Perturbed Output:
  Semi-controlled Random Rounding         0.135             0.993
From Table 5, it is clear that the method of record swapping when applied to the input
data did little to reduce the disclosure risk in the final output table. This was due to the
fact that small cells remain unperturbed under swapping. Record swapping also provided the lowest data utility since the geography variable that was swapped was used to select a sub-population for the table, thus increasing the information loss. From among
the input perturbation methods, the semi-controlled random rounding provided the
most protection against disclosure. The stochastic perturbation still leaves small cells
in the table and hence is not as protective as the semi-controlled random rounding.
Both methods have similar information loss. Comparing the pre-tabular and post-
tabular semi-controlled random rounding procedure, we see slightly lower disclosure
risk according to the post-tabular rounding and slightly higher data utility since the
SDC method is not compounded by aggregating rounded cells.
6.2.2 Remote analysis servers
A remote analysis server is an online system which accepts a query from the researcher, runs it within a secure environment on the underlying data and returns a confidentialised output, without the need for human intervention to manually check the
outputs for disclosure risks. Similar to flexible table generators, the queries are
submitted through a remote interface and researchers do not have direct access to
the data. The queries may include exploratory analysis, measures of association,
regression models and statistical testing. The queries can be run on the original data
or confidentialised data and may be restricted and audited depending on the level of
required protection. An example of regression modelling via a remote analysis server can be found in [37].
Figure 3: Confidential residual plot from a regression analysis on receipts for
the Sugar Canes dataset
A comparison of outputs based on the original data with two SDC approaches, outputs from confidentialised microdata and confidentialised outputs obtained from the original data via a remote analysis server, has been reported [38]. The comparison was carried
out on a dataset from the 1982 survey of the sugar cane industry in Queensland,
Australia [39]. The dataset corresponds to a sample of 338 Queensland sugar farms
and contained the following variables: region, area, harvest, receipts, costs and profits
(equal to receipts minus costs). The dataset was confidentialised by deleting large
outlier farms, coarsening the variable area and adding random noise to harvest,
receipts, costs and profits. Figure 3 shows what the residual plots would look like in a
remote analysis server where the response variable is receipts and the explanatory
variables are region, area, harvest and costs. As can be seen, the scatterplot is presented as sequential box plots and the Normal QQ plot is smoothed.
Figure 4: Univariate analysis of receipts for the Sugar Canes dataset (panels: original, confidentialised input, confidentialised output)
Figure 4 presents the comparison of the univariate analysis of receipts on the original
dataset, the confidentialised input approach and the confidentialised output approach.
6.3 Synthetic data
Basic confidential data is a fundamental product of virtually all statistical agency
programs. These lead to the publication of public-use products such as summary data,
microdata from surveys, etc. Confidential data may also be used for internal use within
data enclaves. In recent years, there has been a move to produce synthetic microdata
as public-use files which preserve some of the statistical properties of microdata. The
data elements are replaced with synthetic values sampled from an appropriate
probability model. The model is fitted to the original data and synthetic populations are produced through a posterior predictive distribution, similar to the theory of multiple imputation. Several samples are drawn from the synthetic population to take into account the uncertainty of the model and to obtain proper variance estimates. Further references and details of
generating synthetic data can be found in [40, 41]. Synthesis can also be applied to only parts of the data so that a mixture of real and synthetic data is released
[42]. One application which uses partially synthetic data is the US Census Bureau ‘On
the Map’ [43]. It is a web-based mapping and reporting application that shows where
workers are employed and where they live according to the Origin-Destination
Employment Statistics.
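As a toy illustration of the synthesis step, the sketch below fits a simple linear model and draws m synthetic copies of the response; a full implementation would, as in multiple imputation, also draw the model parameters from their posterior distribution [40, 41]:

```python
import numpy as np

def synthesize(y, X, m=5, seed=0):
    """Toy sketch: fit a linear model to the confidential response y and
    draw m synthetic copies of y from the fitted normal model. A proper
    posterior predictive approach would also propagate the uncertainty
    in the estimated parameters."""
    rng = np.random.default_rng(seed)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)        # fitted coefficients
    sigma = (y - X @ beta).std(ddof=X.shape[1])         # residual standard deviation
    return [X @ beta + rng.normal(0.0, sigma, size=len(y)) for _ in range(m)]
```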
In practice it is very difficult to capture all conditional relationships between variables
and within sub-populations. If models used in a statistical analysis are sub-models of
the model used to generate data, then the analysis of multiple synthetic samples
should give valid inferences. In addition, partially synthetic datasets may still have
disclosure risks and need to be checked prior to dissemination.
For tabular data there are also techniques to develop synthetic magnitude tables
arising from business statistics. Controlled tabular adjustment (CTA) carries out cell
suppression and replaces the suppressed cells with imputed values that guarantee
some statistical properties such as preserving the margins of the table [44]. Other
perturbative methods such as correlated noise addition can also be used.
7. Statistical disclosure control: Where do we go from here?
This article focused on an understanding of the SDC research and methods involved
in preparing traditional statistical data before their release. The main goal of SDC is to
protect the confidentiality of those whose information is in a dataset, while still
maintaining the usefulness of the data itself. In recent years, however, agencies have
been restricting access to statistical data due to their inability to cope with the large
demand for data whilst ensuring the confidentiality of statistical units. On the other
hand, government initiatives for more open and accessible data are pushing agencies
producing official and national statistics to explore new ways of disseminating
statistical data. One disclosure risk that is often overlooked in traditional statistical
outputs, and is only now coming to prominence with ongoing development into web-
based interactive data dissemination, is inferential disclosure. Inferential disclosure
risk is the ability to learn new attributes with high probability. For example, a proportion
of some characteristic is very high within a subgroup, e.g. a high proportion of those
who smoke have heart disease, or a regression model may have very high predictive
power if the dependent and explanatory variables are highly correlated, e.g. regressing
BMI on height and weight. In fact, an individual does not even have to be in the dataset for information about them to be disclosed. Another example of inferential disclosure is disclosure
by differencing frequency tables of whole population counts when multiple tabular
releases are disseminated from one data source.
Inferential disclosure risk forms the basis of the definition of differential privacy as
formulated in the computer science literature [45]. Differential privacy is a
mathematically principled method of measuring how secure a protection mechanism
is with respect to personal data disclosures. It incorporates all traditional disclosure
risks and inferential disclosure in a ‘worst-case’ scenario. The theory of differential
privacy was developed in the context of masking queries from a remote online query
system. It is now coming to the attention of agencies who are under increasing
pressures to broaden access to data and to provide better solutions for the release of
statistical data, for example through interactive web-based platforms, and therefore
are in need of stricter and more formal privacy guarantees. This has changed the landscape of how disclosure risks are defined and has led agencies to recognise the need for greater use of perturbative methods of SDC. In addition, agencies need to
recognise that the SDC parameters, e.g. variance of noise or swap rates, need to be
released to researchers so that they can account for the measurement/perturbation
error in their statistical analysis. Since differential privacy is a cryptographic method,
the parameters of the noise addition are not secret and can be released to
researchers.
A discussion of how differential privacy (as introduced in the computer science literature [45, 47, 48]) relates to the disclosure risk scenarios for survey microdata can be found in [46]. In addition, the authors investigate whether current SDC practices at agencies
producing official and national statistics for survey microdata meet the strict privacy
guarantees of differential privacy. Differential privacy for example makes no distinction
between key variables and sensitive variables, or whether the data comes from a
census or a sample. It is assumed in a worst-case scenario that an intruder has
complete information of the entire database except for one target individual and wishes
to learn about an attribute value for the target individual.
In the survey setting, there are two possible definitions of the database: the population 'database' $\mathbf{x}_U = (x_1, \ldots, x_N)$ and the sample 'database' $\mathbf{x}_s = (x_1, \ldots, x_n)$, where $N$ denotes the size of the population $U = \{1, \ldots, N\}$ and, without loss of generality, we write $s = \{1, \ldots, n\}$. The sample database might be viewed from one perspective as more realistic, since it contains the data collected by the agency, whereas the population database would include values of survey variables for non-sampled units, which are unknown to the agency. In the context of differential privacy, we use the population database $\mathbf{x}_U$ to define privacy, treat the sampling as part of the SDC mechanism and suppose that prior intruder knowledge relates to aspects of $\mathbf{x}_U$.

Let $\tilde{x}_i$ denote the cell value of unit $i$ in the microdata after SDC/sampling has been applied and let $\tilde{f}_j = \sum_{i \in s} I(\tilde{x}_i = j)$ denote the corresponding observed count in cell $j$ in the microdata. We view the released microdata as the vector of counts $\tilde{\mathbf{f}} = (\tilde{f}_1, \ldots, \tilde{f}_K)$, and $P(\tilde{\mathbf{f}} \mid \mathbf{x}_U)$ as the probability of $\tilde{\mathbf{f}}$ with respect to an SDC/sampling mechanism where $\mathbf{x}_U$ is treated as fixed. The following definition is considered [46].
Definition: $\varepsilon$-differential privacy holds if

$$\max \left| \ln \frac{P(\tilde{\mathbf{f}} \mid \mathbf{x}_U^{(1)})}{P(\tilde{\mathbf{f}} \mid \mathbf{x}_U^{(2)})} \right| \le \varepsilon \qquad (15)$$

for some $\varepsilon > 0$, where the maximum is over all pairs $(\mathbf{x}_U^{(1)}, \mathbf{x}_U^{(2)})$ which differ in only one element, and across all possible values of $\tilde{\mathbf{f}}$.
The disclosure risk in this definition is inferential disclosure: if only one value is changed in the population database, the intruder is unable to gain any new knowledge about a target individual given that all other individuals in the population are known. A further definition, $(\varepsilon, \delta)$-probabilistic differential privacy, holds if (15) applies with probability at least $1 - \delta$ for some $\varepsilon, \delta > 0$ [49]. More precisely, this definition holds if the space of possible outcomes $\tilde{\mathbf{f}}$ may be partitioned into 'good' and other (unbounded) outcomes, (15) holds whenever the outcome is good, and the probability that the outcome is good is at least $1 - \delta$. This definition is essentially the same as the notion of probabilistic differential privacy where the set of bad outcomes is referred to as the disclosure set [50].
An investigation into whether common practices for preserving the confidentiality of respondents in survey microdata at agencies producing official and national statistics, such as sampling and perturbation, are differentially private has been carried out [46]. The authors found that a non-perturbative method such as sampling is not differentially private, since an unbounded ratio in (15) occurs if, for neighbouring databases, the target individual is a population unique: $\tilde{f}_j = F_j = 1$. In that case, for a given $\tilde{\mathbf{f}}$ and any sampling scheme where some element $\tilde{f}_j$ of $\tilde{\mathbf{f}}$ may equal $F_j$ with positive probability, there exists a database $\mathbf{x}_U^{(1)}$ such that $\tilde{f}_j = F_j^{(1)} = 1$ for some $j$ and $\Pr[\tilde{\mathbf{f}} \mid \mathbf{x}_U^{(1)}] > 0$. Now if we change an element of $\mathbf{x}_U^{(1)}$ which takes the value $j$ to construct $\mathbf{x}_U^{(2)}$, for which $F_j^{(2)} = F_j^{(1)} - 1 = \tilde{f}_j - 1$, we obtain $\Pr[\tilde{\mathbf{f}} \mid \mathbf{x}_U^{(2)}] = 0$. Hence, $\varepsilon$-differential privacy does not hold for a very broad class of sampling schemes.
One reason why the disclosure implications of this finding might not be considered a cause for concern by an agency is that it is unrealistic to assume that an intruder will have precise information on all individuals in a population except for one target individual. In addition, the probability of a population unique given a sample unique is very small for typical social survey microdata, and hence we can adopt the $(\varepsilon, \delta)$-probabilistic differential privacy definition, where the probability of the set of bad outcomes is very small [46]. Further examination of whether perturbation under a misclassification mechanism, similar to the one shown in formula (5) in Section 2.1 and the stochastic mechanism shown in Table 2 in Section 3.2, is differentially private found that if all elements of the misclassification matrix M are positive, then the ratio in (15) is bounded.
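As an illustration, under the simplifying assumption that records are perturbed independently and that the released data are the perturbed records themselves, a bound on the value of epsilon in (15) can be computed directly from the misclassification matrix:

```python
import numpy as np

def epsilon_bound(M):
    """Changing one unit's true category from j to j' rescales the probability
    of each perturbed outcome k by at most M[j,k]/M[j',k]; with all elements
    positive, the log of the largest such ratio bounds epsilon in (15)."""
    M = np.asarray(M, float)
    ratios = M[:, None, :] / M[None, :, :]   # element (j, j', k) = M[j,k]/M[j',k]
    return float(np.log(ratios.max()))

M = np.array([[0.90, 0.05, 0.05],
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])
print(epsilon_bound(M))   # log(0.90/0.05), approximately 2.89
```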
One area where differential privacy is showing promise for SDC applications at
agencies producing official and national statistics is for the case of developing an
online flexible table generating server as defined in Section 6.2.1. As mentioned,
differential privacy aims to avoid inferential disclosure by ensuring that an intruder cannot make inferences about a single unit when only one of its values is changed, given that all other units in the population are known. This definition would include disclosure
by differencing and linking tables which are the main disclosure risks of concern when
developing a flexible table generator, particularly for whole population counts. The
solution for guaranteeing differential privacy is to add noise/perturbation to the outputs of the queries, i.e. the cells of the table, under specific parameterisations. The privacy guarantee is set a priori and is used to define the prescribed probability
distribution of the perturbation. Research is still ongoing for use of the differential
privacy standard in an online flexible table generating server and it has yet to be
implemented. There are promising developments for flexible table generators that are
based on a small set of fixed variables and subject to other protection guarantees
where all potential tables (and their cells) are known in advance and hence can be
perturbed a priori. This is known as a non-interactive mechanism in differential privacy,
and all subsequent and repeated queries and analyses on the protected data do not
impact on the privacy guarantee [51]. The challenges for the future of statistical disclosure control are to examine the potential of formal privacy guarantees, for example through the differential privacy standard, and to develop applications for more
open and innovative dissemination strategies at agencies producing statistics.
8. References
[1] See https://www.ukdataservice.ac.uk/.

[2] See https://www.ipums.org/ for more information.

[3] Skinner, C.J. and Elliot, M.J. (2002). A Measure of Disclosure Risk for Microdata. Journal of the Royal Statistical Society, Ser. B, 64, 855-867.

[4] Yancey, W.E., Winkler, W.E. and Creecy, R.H. (2002). Disclosure Risk Assessment in Perturbative Micro-data Protection. In: Inference Control in Statistical Databases (ed. J. Domingo-Ferrer), New York: Springer, 135-151.

[5] Domingo-Ferrer, J. and Torra, V. (2003). Disclosure Risk Assessment in Statistical Microdata Protection via Advanced Record Linkage. Statistics and Computing, Vol. 13, No. 4, 343-354.

[6] Reiter, J.P. (2005a). Estimating Risks of Identification Disclosure in Microdata. Journal of the American Statistical Association, 100, 1103-1112.

[7] Bethlehem, J., Keller, W. and Pannekoek, J. (1990). Disclosure Control of Microdata. Journal of the American Statistical Association, 85, 38-45.

[8] Elamir, E. and Skinner, C.J. (2006). Record-Level Measures of Disclosure Risk for Survey Micro-data. Journal of Official Statistics, 22, 525-539.

[9] Skinner, C.J. and Shlomo, N. (2008). Assessing Identification Risk in Survey Micro-data Using Log-linear Models. Journal of the American Statistical Association, Vol. 103, No. 483, 989-1001.

[10] Skinner, C.J. and Holmes, D. (1998). Estimating the Re-identification Risk Per Record in Microdata. Journal of Official Statistics, 14, 361-372.

[11] Rinott, Y. and Shlomo, N. (2007). Variances and Confidence Intervals for Sample Disclosure Risk Measures. 56th Session of the International Statistical Institute, Invited Paper, Lisbon, 2007.

[12] Shlomo, N. and Skinner, C.J. (2010). Assessing the Protection Provided by Misclassification-Based Disclosure Limitation Methods for Survey Microdata. Annals of Applied Statistics, Vol. 4, No. 3, 1291-1310.

[13] Gouweleeuw, J., Kooiman, P., Willenborg, L.C.R.J. and De Wolf, P.P. (1998). Post Randomisation for Statistical Disclosure Control: Theory and Implementation. Journal of Official Statistics, 14, 463-478.

[14] Willenborg, L. and De Waal, T. (2001). Elements of Statistical Disclosure Control. Lecture Notes in Statistics, 155. New York: Springer-Verlag.

[15] Domingo-Ferrer, J., Mateo-Sanz, J. and Torra, V. (2001). Comparing SDC Methods for Micro-data on the Basis of Information Loss and Disclosure Risk. ETK-NTTS Proceedings of the Conference, Crete, June 2001.

[16] Shlomo, N. and De Waal, T. (2008). Protection of Micro-data Subject to Edit Constraints Against Statistical Disclosure. Journal of Official Statistics, 24, No. 2, 1-26.

[17] Kim, J.J. (1986). A Method for Limiting Disclosure in Micro-data Based on Random Noise and Transformation. American Statistical Association, Proceedings of the Section on Survey Research Methods, 370-374.

[18] Fuller, W.A. (1993). Masking Procedures for Micro-data Disclosure Limitation. Journal of Official Statistics, 9, 383-406.

[19] Gomatam, S. and Karr, A. (2003). Distortion Measures for Categorical Data Swapping. Technical Report No. 131, National Institute of Statistical Sciences.

[20] Shlomo, N. and Young, C. (2006). Statistical Disclosure Limitation Methods Through a Risk-Utility Framework. In PSD'2006 Privacy in Statistical Databases (eds. J. Domingo-Ferrer and L. Franconi), Springer LNCS 4302, 68-81.

[21] Shlomo, N. (2007). Statistical Disclosure Limitation Methods for Census Frequency Tables. International Statistical Review, Vol. 75, No. 2, 199-217.

[22] https://www.nomisweb.co.uk/.

[23] Hundepool, A., Domingo-Ferrer, J., Franconi, L., Giessing, S., Nordholt, E.S., Spicer, K. and De Wolf, P.P. (2012). Statistical Disclosure Control. Wiley Series in Survey Methodology. John Wiley & Sons, United Kingdom.

[24] Shlomo, N. and Young, C. (2008). Invariant Post-tabular Protection of Census Frequency Counts. In PSD'2008 Privacy in Statistical Databases (eds. J. Domingo-Ferrer and Y. Saygin), Springer LNCS 5262, 77-89.

[25] Shlomo, N., Tudor, C. and Groom, P. (2010). Data Swapping for Protecting Census Tables. In PSD'2010 Privacy in Statistical Databases (eds. J. Domingo-Ferrer and E. Magkos), Springer LNCS 6344, 41-51.

[26] Shlomo, N., Antal, L. and Elliot, M. (2015). Measuring Disclosure Risk and Data Utility for Flexible Table Generators. Journal of Official Statistics, 31, 305-324.

[27] Fraser, B. and Wooton, J. (2005). A Proposed Method for Confidentialising Tabular Output to Protect Against Differencing. Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, Geneva, 9-11 November.

[28] Salazar-González, J.J., Bycroft, C. and Staggemeier, A.T. (2005). Controlled Rounding Implementation. Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, Geneva, 9-11 November.

[29] Antal, L., Shlomo, N. and Elliot, M. (2014). Measuring Disclosure Risk with Entropy in Population Based Frequency Tables. In PSD'2014 Privacy in Statistical Databases (ed. J. Domingo-Ferrer), Germany: Springer LNCS 8744, 62-78.

[30] Antal, L., Shlomo, N. and Elliot, M. (2015). Disclosure Risk Measurement with Entropy in Two-Dimensional Sample Based Frequency Tables. Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, Helsinki, October 2015. https://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.46/20150/Paper_14_Session_1_-_Univ._Manchester.pdf

[31] Duncan, G.T., Elliot, M. and Salazar-González, J.J. (2011). Statistical Confidentiality. Springer, New York.

[32] Duncan, G., Keller-McNulty, S. and Stokes, S. (2001). Disclosure Risk vs. Data Utility: the R-U Confidentiality Map. Technical Report LA-UR-01-6428, Statistical Sciences Group, Los Alamos National Laboratory, Los Alamos, N.M.

[33] https://www.icpsr.umich.edu/icpsrweb/content/ICPSR/access/restricted/enclave.html

[34] https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml

[35] http://www.abs.gov.au/websitedbs/censushome.nsf/home/tablebuilder

[36] https://ec.europa.eu/CensusHub2/query.do?step=selectHyperCube&qhc=false

[37] O'Keefe, C.M. and Good, N. (2008). A Remote Analysis Server - What Does Regression Output Look Like? In PSD'2008 Privacy in Statistical Databases (eds. J. Domingo-Ferrer and Y. Saygin), Springer LNCS 5262, 270-283.

[38] O'Keefe, C.M. and Shlomo, N. (2012). Comparison of Remote Analysis with Statistical Disclosure Control for Protecting the Confidentiality of Business Data. Transactions on Data Privacy, Vol. 5, Issue 2, 403-432.

[39] Chambers, R.L. and Dunstan, R. (1986). Estimating Distribution Functions from Survey Data. Biometrika, Vol. 73, 597-604.

[40] Raghunathan, T.E., Reiter, J. and Rubin, D. (2003). Multiple Imputation for Statistical Disclosure Limitation. Journal of Official Statistics, 19, No. 1, 1-16.

[41] Reiter, J.P. (2005b). Releasing Multiply Imputed, Synthetic Public-Use Microdata: An Illustration and Empirical Study. Journal of the Royal Statistical Society, Ser. A, Vol. 168, No. 1, 185-205.

[42] Little, R.J.A. and Liu, F. (2003). Selective Multiple Imputation of Keys for Statistical Disclosure Control in Microdata. The University of Michigan Department of Biostatistics Working Paper Series, Working Paper 6.

[43] http://onthemap.ces.census.gov/

[44] Dandekar, R.A. and Cox, L.H. (2002). Synthetic Tabular Data: An Alternative to Complementary Cell Suppression. Manuscript, Energy Information Administration, U.S. Department of Energy.

[45] Dwork, C., McSherry, F., Nissim, K. and Smith, A. (2006). Calibrating Noise to Sensitivity in Private Data Analysis. In Theory of Cryptography TCC (eds. S. Halevi and T. Rabin), Heidelberg: Springer, LNCS Vol. 3876, 265-284.

[46] Shlomo, N. and Skinner, C.J. (2012). Privacy Protection from Sampling and Perturbation in Survey Microdata. Journal of Privacy and Confidentiality, Vol. 4, Issue 1.

[47] Dinur, I. and Nissim, K. (2003). Revealing Information While Preserving Privacy. PODS 2003, 202-210.

[48] Dwork, C. and Roth, A. (2014). The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 9, 211-407.

[49] Chaudhuri, K. and Mishra, N. (2006). When Random Sampling Preserves Privacy. Proceedings of the 26th International Cryptology Conference.

[50] Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J. and Vilhuber, L. (2008). Privacy: Theory Meets Practice on the Map. In Proceedings of the 24th International Conference on Data Engineering, Cancun, Mexico, 277-286.

[51] Rinott, Y., O'Keefe, C., Shlomo, N. and Skinner, C. (2018). Confidentiality and Differential Privacy in the Dissemination of Frequency Tables. Statistical Science, Vol. 33, No. 3, 358-385.