December 2018

Methods to assess and quantify disclosure risk and information loss under statistical disclosure control

Professor Natalie Shlomo

The University of Manchester

A contributing article to the National Statistician’s Quality Review into Privacy and Data Confidentiality Methods


Contents

1. Introduction
2. Microdata from social surveys
   2.1 Disclosure risk assessment
   2.2 Statistical disclosure control methods
      2.2.1 PRAM for categorical key variables
      2.2.2 Additive noise for continuous variables
   2.3 Data utility measures
      2.3.1 Distance metrics
      2.3.2 Impact on measures of association
      2.3.3 Impact on a regression analysis
3. Frequency tables of whole population counts
   3.1 Types of disclosure risk
   3.2 Statistical disclosure control methods
      3.2.1 Record swapping
      3.2.2 Semi-controlled random rounding
      3.2.3 Stochastic perturbation
   3.3 Disclosure risk measures based on Information Theory
   3.4 Data utility measures
4. Magnitude tables from business statistics
5. Disclosure risk–data utility confidentiality map
6. New dissemination strategies
   6.1 Safe data enclaves and remote access
   6.2 Web-based applications
      6.2.1 Flexible table generating servers
      6.2.2 Remote analysis servers
   6.3 Synthetic data
7. Statistical disclosure control: Where do we go from here?
8. References


1. Introduction

Agencies producing official and national statistics, such as a National Statistical

Institute, have an obligation to release statistical data for research purposes and to

inform policies. On the other hand, they have a legal, moral and ethical obligation to

protect the confidentiality of individuals responding to their request for data and in

many countries there are codes of practice and legislation that must be strictly adhered

to. In addition, agencies issue confidentiality guarantees to respondents prior to their

participation in surveys or censuses. The key objective is to ensure public trust in

official statistics production and ensure high response rates.

There are two general approaches under statistical disclosure control (SDC): protect

the data for release using disclosure control techniques (‘safe data’), or restrict access

to the data for example by limiting its use to only approved researchers within a secure

data environment (‘safe access’). A combination of both approaches is usually applied

when releasing statistical data.

Statistical data that are traditionally released by agencies can be divided into two major

formats: microdata and tabular data.

a) Microdata are typically from social surveys, for example, the Labour Force

Survey and the Social Survey, where sampling fractions are very small and

hence the microdata contain only a small random subset of the population.

Assuming that there is no response knowledge on who has participated in the

survey, many producers of statistics within agencies have made provisions to

release public-use microdata from social surveys, usually through data

archives. Public-use microdata also undergo variable suppression, coarsening

and aggregation before they are released (these methods are discussed in

detail in Section 2.2). Microdata from business surveys where the data are

partially collected through take-all strata and may have very sensitive skewed

outliers are typically not released. Census microdata are also not released,

although the Office for National Statistics (ONS) has a tradition of releasing

microdata from a small sample drawn from the census.

b) Tabular data contain either frequency counts which can be based on whole

populations such as from a census or register, or on survey data where the

weighted sample frequency counts are calculated by aggregating individual

survey weights. Tabular data can also include magnitude data which are

typically calculated from business surveys, such as total revenue according to

industry code.

In traditional outputs, we define the notion of an ‘intruder’ as someone with malicious

intent who wants to probe the statistical data and reveal sensitive information about

an individual or group of individuals. For example, in health data, an intruder might be an individual or organisation wishing to disclose sensitive information about

doctors/clinics performing abortions. Two main disclosure risks are: (1) identity

disclosure where a statistical unit can be identified based on a set of cross-classified

quasi-identifying variables, which identify individuals indirectly such as age, gender,


occupation, place of residence; (2) attribute disclosure where new information can be

learnt about an individual or a group of individuals. Disclosure risk scenarios form the

basis of possible means of disclosure, for example: the ability of an intruder to match

a dataset to an external public file based on a common set of quasi-identifying

variables; the ability of an intruder to identify uniques through visible and rare

attributes; the ability of an intruder to difference nested tables and obtain small counts;

the ability of an intruder to form coalitions.

In microdata from social surveys, the main concern is the risk of identification since

this is a prerequisite for individual attribute disclosure where many sensitive variables

such as income or health outcomes, can be revealed following an identification.

Naturally, sampling from the population provides a priori protection since an intruder

cannot be sure whether a sample unique on a set of quasi-identifiers is a population

unique.

In tabular data of whole population counts, attribute disclosure arises when there is a

row/column of zeros and only one populated cell. This leads to individual or group

attribute disclosure depending on the size of the populated cell since an intruder can

learn new attributes according to the remaining spanning variables of the table that

were not known previously. Therefore, in frequency tables containing whole population

counts, it is the zero cells that are the main cause of attribute disclosure. Frequency

tables of weighted survey counts are generally not a cause of concern due to the

ambiguity introduced by the sampling and the fact that survey weights differ across

units in the surveys due to non-response adjustments and benchmarking. In

magnitude tables arising from business surveys, disclosure risk is generally defined

by the ability of businesses to be able to form coalitions to disclose sensitive

information about competing businesses.

In order to preserve the confidentiality of respondents, agencies must assess the

disclosure risk in statistical data and, if required, choose appropriate SDC methods to

apply to the data. Measuring disclosure risk involves assessing and evaluating

numerically the risk of re-identifying statistical units. SDC methods perturb, modify, or

summarise the data in order to prevent re-identification by a potential intruder. Higher

levels of protection through SDC methods however often negatively impact the quality

of the data. The SDC decision problem involves finding the optimal balance between

managing and minimising disclosure risk to tolerable risk thresholds depending on the

mode for accessing the data, and ensuring high utility and fit-for-purpose data where

the data will remain useful for the purpose for which it was designed.

In this article we provide more details of SDC approaches, the quantification of

disclosure risks and data utility for traditional types of statistical outputs. Section 2

presents details for microdata from social surveys and Section 3 presents details for

tabular data from whole population counts. Section 4 presents a brief description of

SDC approaches for magnitude tables arising from business surveys. We then

demonstrate the disclosure risk-data utility trade-off in Section 5. Section 6 presents

future data dissemination strategies at agencies producing national and official


statistics and we close in Section 7 with a discussion of future directions for statistical

disclosure control.

2. Microdata from social surveys

Microdata from social surveys are released only after removing direct identifying

variables, such as names, addresses, and identity numbers. As mentioned, the main

disclosure risk is individual attribute disclosure where small counts on cross-classified

quasi-identifying key variables can be used to identify an individual and confidential

information may be learnt from the remaining sensitive variables in the dataset. The

quasi-identifying key variables are those variables that are visible and traceable and

are accessible to the public as well as to potential intruders with malicious intent to

disclose information. Since the prerequisite to individual attribute disclosure is identity

disclosure, SDC methods typically aim to reduce the risk of re-identification and the

disclosure risk assessment is based on estimating a probability of identification given

a set of quasi-identifying key variables. In addition, key variables are typically

categorical variables, and may include: sex, age, occupation, place of residence,

country of birth, family structure, etc. Sensitive variables can be continuous variables,

such as income and expenditures, but can also be categorical. We define the key as

the set of combined cross-classified identifying key variables, typically presented as a

contingency table containing the counts from the survey microdata. For example, if the

identifying key variables are sex (2 categories), age group (10 categories) and years

of education (8 categories), the key will have 160 (=2 by 10 by 8) cells following the

cross-classification of the key variables.
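The key described above can be sketched in a few lines. Here the values are simulated and the variable names hypothetical, purely to illustrate building the 160-cell key and locating the cells of size one:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# simulated quasi-identifying key variables:
# sex (2 categories), age group (10 categories), years of education (8 categories)
sex = rng.integers(0, 2, n)
age = rng.integers(0, 10, n)
educ = rng.integers(0, 8, n)

# encode each record's cell in the 2 x 10 x 8 = 160-cell key
cell = sex * 80 + age * 8 + educ
cells, counts = np.unique(cell, return_counts=True)

# cells of size one are the sample uniques (potential high-risk records)
n_sample_uniques = int(np.sum(counts == 1))
```

Not all 160 cells need be populated in a sample; only the occupied cells appear in `cells`.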

The disclosure risk scenario of concern at statistical agencies is the ability of a

potential intruder to match the released microdata to external sources containing the

target population based on a common key. External sources can be either in the form

of prior knowledge that the intruder might have about a specific population group, or

by having access to public files containing information about the population, such as

phone company listings, voter registration records or even a National Population Register.

The agency does not generally assume that an intruder will have ‘response

knowledge’ on whether an individual is included in the survey dataset or not and

therefore relies on this extra layer of protection in their SDC strategies.

In order to protect a data set, one can either apply an SDC method on the identifying

key variables or on the sensitive variables. In the first case identification of a unit is

rendered more difficult, and the probability that a unit is identified is hence reduced. In

the second case, even if an intruder succeeds in identifying a unit by using the values

of the identifying key variables, the sensitive variables would hardly disclose any useful

information on the particular record. One can also apply SDC methods on both the

identifying and sensitive variables simultaneously. This offers more protection, but also

leads to more information loss.

Since the application of SDC methods leads to information loss, it is important to

develop quantitative information loss measures in order to assess whether the


resulting confidentialised survey microdata is fit-for-purpose according to some pre-

defined user specifications. As mentioned, survey microdata is typically released into

data archives and repositories where registered users can gain access to the data

following appropriate training and confidentiality agreements. Almost every country

has a data archive where researchers can gain access to microdata. One example is

the United Kingdom Data Service which is responsible for the dissemination of

microdata from many of the UK surveys [1]. There are other international data archives

and repositories and these do a particularly good job of archiving and making data

available for international comparisons. One example is the IPUMS archive located at

the University of Minnesota which provides census and survey data from

collaborations with 105 Statistical Agencies, national archives, and genealogical

organisations around the world. The staff ensures that the datasets are integrated

across time and space, with common formatting, harmonised variable codes, archiving and documentation [2].

Researchers may obtain more detailed data that has undergone fewer SDC methods through special licensing and on-site data enclaves, although this generally involves a time-consuming application process and the need to travel to on-site facilities.

2.1 Disclosure risk assessment

We focus on microdata arising from social surveys. Disclosure risk for social survey

microdata is measured and quantified by estimating a probability of re-identification.

This probability is based on the notion of uniqueness on the key where a cell value in

the cross-classified identifying variables may have a value of one. Since survey

microdata is based on a sample, a cell value of one is only problematic if there is also

a cell value of one in the whole population. In other words, we need to know if the

sample unique in the key is also a population unique, or if it is an artefact of the

sampling. Based on the literature, methods for assessing disclosure risk for sample

microdata arising from social surveys can be classified into three types:

• Heuristics that identify special uniques on a set of cross-classified key variables,

i.e. sample uniques that are likely to be population uniques [3]

• Probabilistic record linkage on a set of key (matching) variables that can be used

to link the microdata to an external population file [4, 5]

• Probabilistic modelling of disclosure risk which was developed under two

approaches: a full model-based framework taking into account all of the information

available to intruders and modelling their behaviour [6] and a more simplified

approach that restricts the information that would be known to intruders [7, 8, 9].

Heuristics and record linkage suffer from the drawback that there is no framework for

obtaining consistent disclosure risk measures at both the individual record-level, and

the overall global file-level. In addition, these approaches do not take into account the

protection afforded by the sampling. Probabilistic modelling provides record-level

disclosure risk measures that can be used to target high-risk records in the microdata


for SDC methods. Global file-level disclosure risk measures are aggregated from

record-level risk measures and are essential for Microdata Review Boards.

The disclosure risk measure of re-identification can take two forms: the number of

sample uniques that are also population uniques, and the overall expected number of

correct matches for each sample unique if we were to link the sample uniques to a

population. When the population from which the sample is drawn is unknown,

probabilistic modelling can be used to estimate population parameters which form the

basis for estimating the disclosure risk measures and also accounts for the protection

afforded by the sampling.

Quasi-identifying key variables for disclosure risk assessment are determined by a

disclosure risk scenario, i.e. assumptions about available external files and IT tools

that can be used by intruders to identify individuals in released microdata. For

example, key variables may be chosen which would enable matching the released

microdata to a publicly available file containing names and addresses. Examples of

publicly available data might include data that is freely available over the internet such

as car registrations, phone books and electoral rolls, or data that can be purchased,

such as supermarket loyalty cards and life-style datasets. Under a probabilistic

approach, disclosure risk is estimated based on the contingency table of sample

counts spanned by identifying key variables, for example place of residence, sex, age,

occupation, etc. The other variables in the file are sensitive variables.

Individual per-record risk measures in the form of a probability of re-identification are

first estimated. These per-record risk measures are then aggregated to obtain global

risk measures for the entire file.

Denoting $F_k$ the population size in cell $k$ of a table spanned by key variables having $K$ cells, $f_k$ the sample size, $\sum_{k=1}^{K} F_k = N$ and $\sum_{k=1}^{K} f_k = n$, the set of sample uniques is defined as $SU = \{k : f_k = 1\}$, since these are potential high-risk records, i.e. population uniques.

Formally, the two global disclosure risk measures (where $I$ is the indicator function) are the following:

1. Number of sample uniques that are population uniques: $\tau_1 = \sum_k I(f_k = 1, F_k = 1)$

2. Expected number of correct matches for sample uniques (i.e. a matching probability): $\tau_2 = \sum_k I(f_k = 1)\, 1/F_k$.
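When the population counts are known, the two global measures are a direct computation. A minimal sketch with invented cell counts:

```python
import numpy as np

# toy population and sample counts over the K cells of the key (invented numbers)
F = np.array([1, 3, 10, 1, 2, 40])   # population cell counts F_k
f = np.array([1, 1,  2, 1, 0,  5])   # sample cell counts f_k

su = f == 1                          # indicator of sample uniques
tau1 = int(np.sum(su & (F == 1)))    # sample uniques that are population uniques
tau2 = float(np.sum(1.0 / F[su]))    # expected number of correct matches
print(tau1, tau2)                    # tau1 = 2, tau2 = 1/1 + 1/3 + 1/1 ~ 2.33
```

Cells 1, 2 and 4 are sample uniques; of these, cells 1 and 4 are also population uniques.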


The individual risk measure for $\tau_2$ is $1/F_k$. This is the probability that a match between

a record in the microdata and a record in the population having the same values of

key variables will be correct. If for example, there are two records in the population

with the same values of key variables, the probability is 0.5 that the match will be

correct. Adding up these probabilities over the sample uniques gives the expected

number (on average) of correctly matching a record in the microdata to the population

when we allow guessing. The population frequencies $F_k$ are unknown and are estimated from the probabilistic model. The risk measures are then estimated by:

$$\hat{\tau}_1 = \sum_k I(f_k = 1)\,\hat{P}(F_k = 1 \mid f_k = 1) \quad \text{and} \quad \hat{\tau}_2 = \sum_k I(f_k = 1)\,\hat{E}(1/F_k \mid f_k = 1) \qquad (1)$$

A Poisson model to estimate disclosure risk measures has been proposed [8, 10]. This model makes the natural assumption in the contingency table literature: $F_k \sim \mathrm{Poisson}(\lambda_k)$ for each cell $k$. A sample is drawn by Poisson or Bernoulli sampling with a sampling fraction $\pi_k$ in cell $k$: $f_k \mid F_k \sim \mathrm{Bin}(F_k, \pi_k)$. It follows that:

$$f_k \sim \mathrm{Poisson}(\pi_k \lambda_k) \quad \text{and} \quad F_k \mid f_k \sim f_k + \mathrm{Poisson}(\lambda_k (1 - \pi_k)) \qquad (2)$$

where the $F_k \mid f_k$ are conditionally independent.

The parameters $\{\lambda_k\}$ are estimated using log-linear modelling. The sample frequencies $f_k$ are independent Poisson distributed with a mean of $\mu_k = \pi_k \lambda_k$. A log-linear model for the $\mu_k$ is expressed as $\log(\mu_k) = x_k'\beta$, where $x_k$ is a design vector which denotes the main effects and interactions of the model for the key variables. The maximum likelihood estimator (MLE) $\hat{\beta}$ may be obtained by solving the score equations:

$$\sum_k x_k \left[ f_k - \exp(x_k'\beta) \right] = 0 \qquad (3)$$

The fitted values are calculated by $\hat{\mu}_k = \exp(x_k'\hat{\beta})$ and $\hat{\lambda}_k = \hat{\mu}_k / \pi_k$.

Individual disclosure risk measures for cell $k$ are:

$$P(F_k = 1 \mid f_k = 1) = \exp(-\lambda_k (1 - \pi_k)) \qquad (4)$$

$$E(1/F_k \mid f_k = 1) = \left[ 1 - \exp(-\lambda_k (1 - \pi_k)) \right] / \left[ \lambda_k (1 - \pi_k) \right]$$
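As a sketch of this estimation pipeline, the following assumes a main-effects log-linear model with toy sample counts and an equal sampling fraction (all invented), solves the score equations (3) by Newton-Raphson, and aggregates the individual measures in (4) over the sample uniques:

```python
import numpy as np

def fit_loglinear_poisson(f, X, max_iter=100, tol=1e-10):
    """Solve the score equations sum_k x_k [f_k - exp(x_k' beta)] = 0
    by Newton-Raphson for a Poisson log-linear model."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (f - mu)
        info = X.T @ (mu[:, None] * X)      # Fisher information matrix
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# toy key: a 2 x 3 cross-classification with an equal sampling fraction of 0.1
f = np.array([1.0, 4.0, 9.0, 2.0, 1.0, 7.0])   # sample cell counts f_k
pi = np.full(6, 0.1)                            # sampling fractions pi_k

# design matrix for a main-effects model: intercept, row dummy, column dummies
rows = np.repeat([0, 1], 3)
cols = np.tile([0, 1, 2], 2)
X = np.column_stack([np.ones(6), rows == 1, cols == 1, cols == 2]).astype(float)

beta_hat = fit_loglinear_poisson(f, X)
mu_hat = np.exp(X @ beta_hat)     # fitted mu_k = pi_k * lambda_k
lam_hat = mu_hat / pi             # estimated lambda_k

# individual risk measures of eq. (4), evaluated at the estimates
rate = lam_hat * (1 - pi)
p_pop_unique = np.exp(-rate)              # P(F_k = 1 | f_k = 1)
e_inv_F = (1 - np.exp(-rate)) / rate      # E(1/F_k | f_k = 1)

su = f == 1                               # sample uniques
tau1_hat = p_pop_unique[su].sum()
tau2_hat = e_inv_F[su].sum()
```

For the canonical log link, Newton-Raphson coincides with Fisher scoring, so the iteration above is the usual iteratively reweighted fit.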


Plugging $\hat{\lambda}_k$ for $\lambda_k$ in (4) leads to the estimates $\hat{P}(F_k = 1 \mid f_k = 1)$ and $\hat{E}(1/F_k \mid f_k = 1)$, and then to $\hat{\tau}_1$ and $\hat{\tau}_2$ of (1). Confidence intervals for these global risk measures are also considered [11].

A method for selecting the log-linear model based on estimating and (approximately) minimising the bias $B_i$ of the risk estimates $\hat{\tau}_1$ and $\hat{\tau}_2$ has been developed [9]. The method selects the model using a forward search algorithm which minimises the standardised bias estimate $\hat{B}_i / \sqrt{\hat{v}_i}$ for $\hat{\tau}_i$, $i = 1, 2$, where the $\hat{v}_i$ are variance estimates of the $\hat{B}_i$. In addition, the estimation of disclosure risk measures under complex survey designs with stratification, clustering and survey weights is also addressed [9].

Empirical studies have shown that the probabilistic modelling approach can provide unbiased estimates of the overall global level of disclosure risk in the microdata, but these estimates are not accurate at the individual record level. Thus care should be

taken when using record level measures of disclosure risk for targeting SDC methods

to high-risk records.

The probabilistic model assumes that there is no measurement error in the way the

data is recorded. Besides typical errors in data capture, key variables can also be

purposely misclassified as a means of masking the data. A method to estimate risk

measures to take into account measurement errors has been outlined [12]. Denoting

the cross-classified key variables in the population and the microdata as X and

assuming that X in the microdata have undergone some misclassification or

perturbation error denoted by the value �� and determined independently by a

misclassification matrix M,

)|~

( jXkXPM kj (5)

a record-level disclosure risk measure of a match with a sample unique under measurement error is:

$$\theta_k = \frac{M_{kk}(1 - \pi_k)}{1 + (1 - \pi_k)\left( \sum_j M_{kj} F_j - M_{kk} \right)} \qquad (6)$$

Under assumptions of small sampling fractions and small misclassification errors, the measure can be approximated by $M_{kk} / \sum_j M_{kj} F_j$, or $M_{kk} / \tilde{F}_k$ where $\tilde{F}_k$ is the population count with $\tilde{X} = k$. Aggregating the per-record disclosure risk measures, the global risk measure is:

$$\tau_2(M) = \sum_k I(\tilde{f}_k = 1)\, M_{kk} / \tilde{F}_k \qquad (7)$$

Note that to calculate the measure only the diagonal of the misclassification matrix

needs to be known, i.e. the probabilities of not being perturbed. Since population


counts are generally not known, the estimate in (7) can be obtained by probabilistic modelling with log-linear models as described above on the misclassified sample:

$$\hat{\tau}_2(M) = \sum_k I(\tilde{f}_k = 1)\, M_{kk}\, \hat{E}(1/\tilde{F}_k \mid \tilde{f}_k = 1) \qquad (8)$$
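Since only the diagonal of the misclassification matrix is needed, the global measure in (7) is a one-line computation when the perturbed-key counts are known; the counts and probabilities below are invented for illustration:

```python
import numpy as np

# diagonal of the misclassification matrix M: probability of NOT being perturbed
M_diag = np.array([0.90, 0.90, 0.80, 0.95])

F_tilde = np.array([1, 4, 2, 10])   # population counts on the misclassified key
f_tilde = np.array([1, 1, 0, 1])    # sample counts on the misclassified key

su = f_tilde == 1
tau2_M = float(np.sum(M_diag[su] / F_tilde[su]))
print(tau2_M)   # 0.90/1 + 0.90/4 + 0.95/10 = 1.22
```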

2.2 Statistical disclosure control methods

Based on the disclosure risk assessment, producers of statistics within agencies must

choose appropriate SDC methods either by perturbing, modifying, or summarising the

data. The choice depends on the mode of access, requirements of the users and the

impact on quality and information loss. Choosing an optimal SDC method is an

iterative process where a balance must be found between managing disclosure risk

and preserving the utility in the microdata.

SDC methods for microdata include perturbative methods which alter the data and

non-perturbative methods which limit the amount of information released in the

microdata without actually altering the data. Examples of non-perturbative SDC

methods are: coarsening and recoding where values of variables are grouped into

broader categories (for example single years of age are grouped into age groups);

variable suppression where variables such as low-level geographies are deleted from

the microdata; and sub-sampling where a random sample is drawn from the original

microdata. This latter approach is commonly used to produce research files from

census microdata, for example, in the UK a 1% sample is drawn from the census

microdata for use by the research community. Given that sampling provides a priori protection, perturbative methods are less commonly applied to survey microdata, although there are some cases where this is carried out.

For continuous variables, a common perturbative method for survey microdata is top-

coding where all values above a certain threshold receive the value of the threshold

itself. For example, any individual in survey microdata earning above £10,000 a month

will have their income amount replaced by £10,000. Another perturbative method is

adding random noise to continuous variables, such as income or expenditure. For

example, random noise is generated from a Normal distribution with a mean of zero

and a small variance for each individual in the dataset and this random noise is added

to the individual’s value of the variable. Micro-aggregation is another approach where

records are grouped together (usually using a clustering algorithm) and the values of

the continuous variable(s) are replaced by their average, or alternatively, values can

be rank-swapped within the group.
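A minimal sketch of two of these perturbative methods, top-coding at the £10,000 threshold from the example above followed by additive Normal noise, applied to simulated incomes:

```python
import numpy as np

rng = np.random.default_rng(42)
income = rng.lognormal(mean=7.5, sigma=1.0, size=500)   # simulated monthly incomes

# top-coding: every value above the threshold is replaced by the threshold itself
threshold = 10_000.0
top_coded = np.minimum(income, threshold)

# additive noise: zero-mean Normal noise with a small standard deviation
noise_sd = 0.05 * income.std()
noisy = top_coded + rng.normal(0.0, noise_sd, size=income.size)
```

The noise variance is a tuning choice: larger values give more protection at the cost of more distortion in downstream estimates.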

For categorical variables, the most common perturbative method used in microdata is

record swapping. In this approach, two records having similar control variables are

paired and the values of some variables are swapped, typically their geographical

variables. For example, two individuals having the same sex, age, and years of

education will be paired and their place of residence interchanged. Record swapping


is used in the United States and the United Kingdom on their census microdata as a

pre-tabular method prior to tabulation (see Section 3.2). A more general method is the

post-randomisation probability mechanism (PRAM) where categories of variables are

changed or not changed according to a prescribed probability mechanism and a

stochastic selection process [13] and is described below in more detail. Table 1

summarises the SDC methods mentioned above. Further information on perturbative

and non-perturbative methods can be found in the literature [14, 15] and references

therein.
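The record swapping step described above can be sketched as follows; the control variables and geography are simulated, and pairing is done by simple random pairing within each control-variable combination:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

# simulated microdata: control variables and the geography to be swapped
sex = rng.integers(0, 2, n)
age_group = rng.integers(0, 5, n)
educ = rng.integers(0, 4, n)
region = rng.integers(0, 5, n)

# one code per combination of the control variables
key = sex * 20 + age_group * 4 + educ

# pair records that agree on all control variables and swap their regions
swapped = region.copy()
for k in np.unique(key):
    idx = np.flatnonzero(key == k)
    rng.shuffle(idx)
    for a, b in zip(idx[0::2], idx[1::2]):   # consecutive shuffled pairs
        swapped[a], swapped[b] = region[b], region[a]
```

Because every swap is a permutation of existing values, the marginal distribution of the geography is preserved exactly while its joint distribution with the other variables is perturbed.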

Table 1: SDC methods for survey microdata

Non-perturbative methods: coarsening/recoding variables; variable or value suppression; sub-sampling

Perturbative methods (categorical variables): record swapping; PRAM

Perturbative methods (continuous variables): top-coding; adding random noise; micro-aggregation; rank swapping

Each SDC method impacts differently on the level of protection obtained in the

microdata and information loss. Two SDC methods to preserve sufficient statistics as

well as logical consistencies in the microdata are described and summarised below

[16].

2.2.1 PRAM for categorical key variables

For protecting categorical identifying variables, the post-randomisation method

(PRAM) has been proposed [13]. As a perturbative method, PRAM alters the data,

and therefore we can expect consistent records to start failing edit rules. Edit rules

describe logical relationships that have to hold true, such as “a two-year old person

cannot be married” or “the profit and the costs of an enterprise should sum up to its

turnover”.

The process of applying PRAM is described as follows [14]:

Let $P$ be an $L \times L$ transition matrix containing conditional probabilities $p_{ij} = P(\text{perturbed category is } j \mid \text{original category is } i)$ for a categorical variable with $L$ categories, $t$ the vector of frequencies and $v = t/n$ the vector of relative frequencies, where $n$ is the number of records in the microdata set.

In each record of the data set, the category of the variable is changed or not changed according to the prescribed transition probabilities in the matrix $P$ and the result of a draw of a random multinomial variate $u$ with parameters $p_{ij}$ $(j = 1, \ldots, L)$. If the $j$-th category is selected, category $i$ is moved to category $j$. When $i = j$, no change occurs.


Let $t^*$ be the vector of the perturbed frequencies. $t^*$ is a random variable and $E(t^* \mid t) = tP$. Assuming that the transition probability matrix $P$ has an inverse $P^{-1}$, this can be used to obtain an unbiased moment estimator of the original data: $\hat{t} = t^* P^{-1}$. In order to ensure that the transition probability matrix has an inverse and to control the amount of perturbation, the matrix $P$ is chosen to be dominant on the main diagonal, i.e. each entry on the main diagonal is over 0.5.

The condition of invariance can be placed on the transition matrix $P$, i.e. $tP = t$. This releases the users of the perturbed file from the extra effort of obtaining unbiased moment estimates of the original data, since $t^*$ itself will be an unbiased estimate of $t$. To obtain an invariant transition matrix, a matrix $Q$ is calculated by transposing $P$, multiplying each column $j$ by $v_j$ and then normalising its rows so that the sum of each row equals one. The invariant matrix is obtained by $R = PQ$. The invariant matrix $R$ may distort the desired probabilities on the diagonal, so a parameter $\alpha$ is defined to calculate $R^* = \alpha R + (1 - \alpha) I$, where $I$ is the identity matrix [16].

*R will also be invariant and the amount of perturbation is controlled by the value of

. The property of invariance means that the expected values of the marginal

distribution of the variable being perturbed are preserved. In order to obtain the exact

marginal distribution and reduce the additional variance caused by the perturbation, a

“without” replacement selection strategy for choosing values to perturb can be

implemented based on the expectations calculated from the transition probabilities.
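The construction of the invariant matrix $R$ and its damped version $R^*$ described above can be sketched in a few lines of numpy. This is an illustrative sketch under the row-vector convention used here; the function name and the example matrix are our own, not from the source:

```python
import numpy as np

def invariant_pram_matrix(P, v, alpha):
    """Sketch of the invariant PRAM matrix R* = alpha*R + (1 - alpha)*I.

    P     : (L, L) row-stochastic transition matrix with dominant diagonal
    v     : (L,) relative frequencies of the original categories
    alpha : controls the amount of perturbation, 0 < alpha <= 1
    """
    # Q: transpose P, scale column j by v_j, then normalise each row to sum 1
    Q = P.T * v
    Q = Q / Q.sum(axis=1, keepdims=True)
    R = P @ Q                                  # invariant: v R = v
    return alpha * R + (1 - alpha) * np.eye(len(v))

P = np.array([[0.8, 0.2], [0.3, 0.7]])         # hypothetical dominant matrix
v = np.array([0.7, 0.3])                       # hypothetical frequencies
R_star = invariant_pram_matrix(P, v, alpha=0.5)
```

Multiplying the vector of relative frequencies by `R_star` returns the same vector, illustrating that the expected marginal distribution is unchanged by the perturbation.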

As in most perturbative SDC methods, joint distributions between perturbed and

unperturbed variables are distorted, in particular for variables that are highly correlated

with each other. The perturbation can be controlled as follows:

1. Before applying PRAM, the variable to be perturbed is divided into subgroups,

Gg ,...,1 . The transition (and invariant) probability matrix is developed for each

subgroup g, gR . The transition matrices for each subgroup are placed on the main

diagonal of the overall transition matrix where the off-diagonal probabilities are all

zero, i.e. the variable is only perturbed within the subgroup and the difference in

the variable between the original value and the perturbed value will not exceed a

specified level. An example of this is perturbing age within broad age bands.

2. The variable to be perturbed may be highly correlated with other variables. Those

variables should be compounded into one single variable. PRAM should be carried

out on the compounded variable. Alternatively, the variable to be perturbed is

carried out within subgroups defined by the second highly correlated variable. An

example of this is when age is perturbed within groupings defined by marital status.

The control variables in the perturbation process will minimise the amount of logical

inconsistencies defined through editing rules, but they will not eliminate all of them,

especially edit rules that are out of scope of the variables that are being perturbed.


Remaining failed edit rules need to be manually or automatically corrected through

edit and imputation processes depending on the amount and types of edit rule failures.

2.2.2 Additive noise for continuous variables

In its basic form, random noise is generated independently and identically distributed

with a positive variance and a mean of zero. The random noise is then added to the

original variable. Adding random noise will not change the mean of the variable for

large datasets but will introduce more variance. This will impact on the ability to make

statistical inferences. Researchers may have suitable methodology to correct for this

type of measurement error but it is good practice to minimise these errors through

better implementation of the method.

Additive noise should be generated within small homogeneous sub-groups (for example, percentiles of the continuous variable) in order to use a different initial perturbation variance for each sub-group. Generating noise in sub-groups also causes fewer edit failures with respect to relationships in the data. Correlated random noise can

be added to the continuous variable thereby ensuring that not only means are

preserved but also the exact variance [17, 18]. A simple method for generating

correlated random noise for a continuous variable z is summarised below [16]:

Procedure 1 (univariate): Define a parameter $\delta$ which takes a value greater than 0 and less than or equal to 1. When $\delta = 1$, we obtain the case of fully modeled synthetic data. The parameter $\delta$ controls the amount of random noise added to the variable $z$. After selecting a $\delta$, calculate: $d_1 = \sqrt{1 - \delta^2}$ and $d_2 = \delta$. Now, generate random noise $\varepsilon$ independently for each record with a mean of $\frac{(1 - d_1)}{d_2} E(z)$ and the original variance of the variable $\sigma^2$. Typically, a Normal distribution is used to generate the random noise. Calculate the perturbed variable $z'_i$ for each record $i$ in the sample microdata ($i = 1, \ldots, n$) as a linear combination: $z'_i = d_1 z_i + d_2 \varepsilon_i$. Note that

$E(z') = d_1 E(z) + d_2 E(\varepsilon) = d_1 E(z) + d_2 \left[ \frac{1 - d_1}{d_2} \right] E(z) = E(z)$

and

$Var(z') = (1 - \delta^2) Var(z) + \delta^2 Var(z) = Var(z)$

since the random noise is generated independently of the original variable $z$.
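A minimal numpy sketch of this univariate procedure follows; the function name is ours, and the sample mean and standard deviation stand in for the population quantities:

```python
import numpy as np

def add_correlated_noise(z, delta, rng):
    """Perturb z as z' = d1*z + d2*eps so mean and variance are preserved."""
    d1, d2 = np.sqrt(1 - delta ** 2), delta
    # Normal noise with mean ((1 - d1)/d2) * mean(z) and the original variance
    eps = rng.normal((1 - d1) / d2 * z.mean(), z.std(), size=z.size)
    return d1 * z + d2 * eps

rng = np.random.default_rng(42)
z = rng.normal(100, 10, size=200_000)          # hypothetical continuous variable
z_pert = add_correlated_noise(z, delta=0.3, rng=rng)
```

For a large file, the perturbed variable has (up to sampling error) the same mean and standard deviation as the original.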

An additional problem when adding random noise is that there may be several

variables to perturb at once, and these variables may be connected through an edit

constraint of additivity. One procedure to preserve additivity would be to perturb two

of the variables and obtain the third from aggregating the perturbed variables.

However, this method will not preserve the total, mean and variance of the aggregated

variable and in general, it is not good practice to compound effects of perturbation by

aggregating perturbed variables since this causes unnecessary information loss.


Procedure 1 can also be implemented in a multivariate setting where correlated

Gaussian noise is added to the variables simultaneously [16]. The method not only

preserves the means of each of the three variables and their co-variance matrix, but

also preserves the edit constraint of additivity.

Procedure 1 (multivariate): Consider three variables $x$, $y$ and $z$ where $x + y = z$. This procedure generates random noise that a priori preserves additivity, and therefore adding the random noise to the original variables will also ensure additivity. In addition, means and the covariance structure are preserved. The technique is as follows:

Generate multivariate random noise: $(\varepsilon_x, \varepsilon_y, \varepsilon_z)^T \sim N(\boldsymbol{\mu}_{\varepsilon}, \Sigma)$, where the superscript $T$ denotes the transpose. In order to preserve sub-totals and limit the amount of noise, the random noise should be generated within percentiles (note that we drop the index for percentiles). The vector $\boldsymbol{\mu}_{\varepsilon}$ contains the corrected means of each of the three variables $x$, $y$ and $z$ based on the noise parameter $\delta$:

$\boldsymbol{\mu}_{\varepsilon} = (\mu_{\varepsilon_x}, \mu_{\varepsilon_y}, \mu_{\varepsilon_z})^T = \left( \frac{1 - d_1}{d_2} \mu_x, \frac{1 - d_1}{d_2} \mu_y, \frac{1 - d_1}{d_2} \mu_z \right)^T.$

The matrix $\Sigma$ is the original covariance matrix. For each separate variable, calculate the linear combination of the original variable and the random noise as previously described. For example, for record $i$: $z'_i = d_1 z_i + d_2 \varepsilon_{z,i}$. The mean vector and the covariance matrix remain the same before and after the perturbation, and the additivity is exactly preserved.
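A numpy sketch of the multivariate procedure is given below. It assumes the noise is drawn with the (singular) covariance matrix of $(x, y, z)$, so that $\varepsilon_x + \varepsilon_y = \varepsilon_z$ holds for every draw and additivity carries over to the perturbed values; the function name is ours:

```python
import numpy as np

def multivariate_noise(x, y, z, delta, rng):
    """Correlated Gaussian noise for x, y, z with x + y = z."""
    d1, d2 = np.sqrt(1 - delta ** 2), delta
    data = np.vstack([x, y, z])
    mu = (1 - d1) / d2 * data.mean(axis=1)     # corrected means
    Sigma = np.cov(data)                       # original (singular) covariance
    eps = rng.multivariate_normal(mu, Sigma, size=x.size,
                                  check_valid="ignore", method="svd")
    return d1 * data + d2 * eps.T              # rows: x', y', z'

rng = np.random.default_rng(7)
x = rng.normal(50, 5, size=50_000)             # hypothetical variables
y = rng.normal(30, 4, size=50_000)
z = x + y
x_p, y_p, z_p = multivariate_noise(x, y, z, delta=0.3, rng=rng)
```

Here `x_p + y_p` equals `z_p` up to numerical precision, while means and variances are preserved up to sampling error.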

2.3. Data utility measures

Obviously, SDC methods cause information loss and impact on the utility of the data.

The utility of microdata that has undergone SDC methods is based on whether the

same statistical analysis and inference can be drawn on the perturbed data compared

to the original data. Microdata is multi-purposed and used by many different types of

users with diverse reasons for analysing the data. To assess the utility in microdata,

proxy measures have been developed and include measuring distortions to

distributions and the impact on bias, variance and other statistics (Chi-squared

statistic, R2 goodness of fit, rankings, etc.). For example, some SDC methods, such

as adding random noise where the random noise is generated from a Normal

distribution with a mean of zero and a small variance, will not impact on the point

estimate of a total or an average, but will increase the variance and cause a wider

confidence interval. On the other hand, microaggregation will decrease the variance

and cause a narrower confidence interval. The use of such measures for assessing

utility in perturbed statistical data with empirical examples and applications has been

outlined [15, 19, 20, 21], and a brief summary of some useful proxy utility measures is as follows:


2.3.1 Distance metrics

Distance metrics are used to measure distortions to distributions in the microdata as

a result of applying SDC methods. The AAD is a distance metric based on the average absolute difference between observed and perturbed counts in a frequency distribution. Let $D$ represent a frequency distribution produced from the microdata and let $D(c)$ be the frequency in cell $c$. The average absolute distance per cell is defined as:

$AAD(D^{orig}, D^{pert}) = \sum_c |D^{pert}(c) - D^{orig}(c)| / n_c$   (9)

where $n_c$ is the number of cells in the distribution.
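Measure (9) is straightforward to compute; a minimal sketch (numpy, with our own function name and made-up counts):

```python
import numpy as np

def aad(orig, pert):
    """Average absolute distance per cell between two frequency tables."""
    orig = np.asarray(orig, dtype=float)
    pert = np.asarray(pert, dtype=float)
    return np.abs(pert - orig).sum() / orig.size

example = aad([10, 5, 3, 2], [9, 6, 3, 1])     # (1 + 1 + 0 + 1) / 4
```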

2.3.2 Impact on measures of association

Tests for independence are often carried out on joint frequency distributions between

categorical variables that span a table calculated from the microdata. The test for

independence for a two-way table is based on the Pearson Chi-squared statistic

$\chi^2 = \sum_i \sum_j \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$

where $o_{ij}$ is the observed count and $e_{ij} = (n_{i.} \, n_{.j})/n$ is the expected count for row $i$ and column $j$. If the row and column variables are independent then $\chi^2$ has an asymptotic chi-square distribution with $(R-1)(C-1)$ degrees of freedom, and for large values the test rejects the null hypothesis in favor of the alternative hypothesis of association. Typically, Cramer's V is used, which is a measure of association between two categorical variables:

$CV = \sqrt{\frac{\chi^2 / n}{\min(R-1, \, C-1)}}.$

The information loss measure is the percent relative difference between the original and perturbed table:

$RCV(D^{orig}, D^{pert}) = 100 \times \frac{CV(D^{pert}) - CV(D^{orig})}{CV(D^{orig})}$   (10)
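Cramer's V and the relative measure (10) can be sketched as follows (numpy; function names and the example table are ours):

```python
import numpy as np

def cramers_v(table):
    """Cramer's V for a two-way contingency table."""
    t = np.asarray(table, dtype=float)
    n = t.sum()
    expected = np.outer(t.sum(axis=1), t.sum(axis=0)) / n
    chi2 = ((t - expected) ** 2 / expected).sum()
    return np.sqrt(chi2 / n / min(t.shape[0] - 1, t.shape[1] - 1))

def rcv(orig, pert):
    """Measure (10): percent relative difference in Cramer's V."""
    return 100 * (cramers_v(pert) - cramers_v(orig)) / cramers_v(orig)

orig = [[10, 0], [0, 10]]      # perfect association: CV = 1
```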

For multiple dimensions, log-linear modeling is often used to examine associations. A

similar measure to (10) can be calculated by taking the relative difference in the

deviance obtained from the model based on the original and perturbed microdata.

2.3.3 Impact on a regression analysis

For continuous variables, it is useful to assess the impact on the correlation and in particular the $R^2$ of a regression (or ANOVA) analysis. For example, in an ANOVA, the test involves checking whether a continuous dependent variable has the same means across groups defined by a categorical explanatory variable. The goodness of fit criterion $R^2$ is based on a decomposition of the variance of the mean of the dependent variable. By perturbing the statistical data, the groupings may lose their homogeneity, the "between" variance becomes smaller, and the "within" variance becomes larger. In other words, the means within each of the groupings shrink


towards the overall mean. On the other hand, the “between” variance may become

artificially larger showing more association than in the original distribution.

The utility is based on assessing differences in the means of a response variable

across categories of an explanatory variable having $K$ categories. Let $\bar{y}_k$ be the mean in category $k$ and define the "between" variance of this mean by:

$BV_{orig}(\bar{y}) = \frac{1}{K-1} \sum_k (\bar{y}_k - \bar{y})^2$

where $\bar{y}$ is the overall mean. Information loss is measured by:

$BVR(\bar{y}^{pert}, \bar{y}^{orig}) = 100 \times \frac{BV_{pert}(\bar{y}) - BV_{orig}(\bar{y})}{BV_{orig}(\bar{y})}$   (11)

In addition, other analyses of information loss involve comparing estimates of coefficients when applying a regression model to both the original and perturbed microdata, and comparing the coverage of confidence intervals.
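The between variance and measure (11) can be sketched directly (numpy; function names and the toy data are ours):

```python
import numpy as np

def between_variance(y, groups):
    """'Between' variance of the group means of y, as in BV(y) above."""
    y, groups = np.asarray(y, dtype=float), np.asarray(groups)
    means = np.array([y[groups == g].mean() for g in np.unique(groups)])
    return ((means - y.mean()) ** 2).sum() / (means.size - 1)

def bvr(y_orig, y_pert, groups):
    """Measure (11): percent change in the between variance."""
    bv = between_variance(y_orig, groups)
    return 100 * (between_variance(y_pert, groups) - bv) / bv

y = np.array([1.0, 1.0, 5.0, 5.0])
g = np.array([0, 0, 1, 1])             # two groups with means 1 and 5
```

With overall mean 3 and group means 1 and 5, the between variance is $((1-3)^2 + (5-3)^2)/(2-1) = 8$, and an unperturbed file gives a BVR of zero.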

3. Frequency tables of whole population counts

In this section, we focus on confidentiality protection of frequency tables of whole

population counts. This is more challenging than protecting tables from a sample

where survey-weighted counts are disseminated. Survey weights are inflation factors

assigned to each respondent in the microdata and refer to the number of individuals

in the population represented by the respondent. They take into account survey design

sampling fractions, nonresponse adjustments and benchmarking to population totals,

and hence will vary across individuals in the surveys. The fact that only survey-

weighted counts are presented in the tables means that the underlying sample size is

not known exactly and this provides an extra layer of protection in the tables. In

addition, producers of statistics within agencies generally do not assume that

response knowledge is in the public domain, although they do consider very targeted

intrusion attempts which still may be relevant for survey data when there are outliers.

Nevertheless, there is generally little confidentiality protection needed in tabular data

arising from survey microdata. Typical SDC methods for survey-weighted counts

include coarsening the variables that define the tables, for example banded age

groups and broad categories of ethnicity, and ensuring safe table design to avoid low

or zero cell values in the tables. In particular, low sample cell values are also avoided

due to large confidence intervals and low-quality estimates.

Tabular data for census counts in the form of hard-copy frequency tables have been the norm for releasing statistical data for decades, and this remains true today. There are

recently developed web-based applications to automate extraction of certain portions

of tabular data on-request, such as neighborhood or crime statistics for a specific

region. One example of this type of application is the Nomis website [22]. Nomis is a


service provided by the Office for National Statistics in the United Kingdom to provide

free access to detailed and up-to-date labour market statistics from official sources.

Tabular outputs for censuses and whole population counts are pre-determined after

careful consideration of population thresholds, average cell sizes, collapsing and fixing

categories of variables spanning the tables. In spite of these efforts, SDC methods are

necessary. These techniques include pre-tabular methods, post-tabular methods and

combinations of both described in Section 3.2.

3.1 Types of disclosure risk

The main disclosure risk in a register/census context comes from small counts, i.e.

ones and twos, since these can lead to re-identification. The amount and placement

of the zeros in the table determines whether new information can be learnt about an

individual or a group of individuals. The disclosure risks defined are [14, 21]:

Individual attribute disclosure - An individual can be identified on the basis of some of the variables spanning the table and hence a new attribute is revealed about the individual, i.e. for tabular data, this means that there is a cell count of one in a margin of the table.

Identification is a necessary pre-condition for attribute disclosure and therefore should

be avoided. In a census/register context where many tables are released, an

identification made in a lower dimensional table will lead to attribute disclosure in a

higher dimensional table.

Group attribute disclosure - If there is a row or column that contains mostly zeros and a small number of non-zero cells, then one can learn a new attribute about a group of

individuals and also learn about the group of individuals who do not have this attribute.

This type of disclosure risk does not require individual identification but will cause harm

to a group of individuals.

Disclosure by differencing - Two tables that are nested may be subtracted from each

other resulting in a new table containing small cells and the above disclosure risk

scenarios would apply. For example, a table containing the elderly population in

private households may be subtracted from a table containing the total elderly

population, resulting in a table of the elderly in communal establishments. This table

is typically very sparse compared to the two original tables.

Disclosure by linking tables - Since many tables are disseminated from one data

source, they can be linked through common cells and common margins, thereby

increasing the chances for revealing SDC methods and original cell counts.

To protect against attribute disclosure, SDC methods should limit the risk of

identification and also introduce ambiguity into the zero counts. To avoid disclosure by

differencing, often only one set of variables and geographies are disseminated with no

possibilities for overlapping categories. To avoid disclosure by linking tables, margins

and cells of tables should be made consistent. Since census tables have traditionally

been released in hard-copy, these latter two disclosure risks from linking and

differencing tables were controlled by the agencies through strict definitions of the


variables defining the tables and no release of tables differing in a single category. In

addition, agencies often employ transparent and visible SDC methods to avoid any

perception that there may be disclosure risks in the data and resources are directed

to ensure that the public is informed about the measures taken to protect the

confidentiality of responding units.

In Table 2, we present an example census table where we discuss the disclosure risks.

Table 2: Example census table

Benefits               16-24   25-44   45-64   65-74   75+

Benefits Claimed          17       8       2       4      1
Benefits Not Claimed      23      35      20       0      0

Total                     40      43      22       4      1

In Table 2 we see evidence of small cell values which can lead to re-identification, for

example the cell value of size 1 in ‘75+’ and ‘benefits claimed’. In addition, the cell

value of size 2 in ‘45-64’ and ‘benefits claimed’ can lead to re-identification since it is

possible for an individual in the cell to subtract him or herself out and therefore re-

identify the remaining individual. Moreover, we see evidence of attribute disclosure

through the zeros in the cell values for the columns labeled ‘65-74’ and ‘75+’ under

‘benefits not claimed’. This means that we have learnt that all individuals 65-74 are

claiming benefits although we have not made a re-identification. However, for the

single individual in the column labeled ‘75+’ we have made an identification and also

learnt that the individual is claiming a benefit.

Tables 3a and 3b show an example of two tables that are nested and may be

subtracted one from the other resulting in a new table containing small cell values.

Table 3a: Example census table of all elderly by long-term illness

Health                  65-74   75-84   85+   Total

No long-term illness       17       9     6      32
Long-term illness          23      35    20      78

Total                      40      44    26     110

Table 3b: Example census table of all elderly living in households by long-term

illness

Health                  65-74   75-84   85+   Total

No long-term illness       15       9     5      29
Long-term illness          22      33    19      74

Total                      37      42    24     103


We see that both Tables 3a and 3b are seemingly without disclosure risks on their

own. However, Table 3b can be subtracted from Table 3a to obtain a very disclosive

differenced table of elderly living in nursing homes and other institutionalised care

with very small and zero cell values.

3.2 Statistical disclosure control methods

In preliminary steps, tables are carefully designed with respect to the variables that

define the tables and the definition of their categories. There are also general rules-

of-thumb regarding population thresholds and the number of dimensions allowed in

the table.

Pre-tabular methods are implemented on the microdata prior to the tabulation of the

tables. The most commonly used method is record swapping between a pair of

households/individuals matching on some control variables. As mentioned, this

method has been used for protecting census tables at the United States Bureau of the

Census and the Office for National Statistics (ONS) in the United Kingdom. In general,

agencies prefer record swapping since the method is easy to implement, all

tabulations are consistent and marginal distributions are preserved exactly on higher

aggregations of the data.

Post-tabular methods are implemented on the entries of the tables after they are

computed and typically take the form of random rounding or its extension of using a

random probability mechanism (similar to PRAM), either on the small cells of the

tables or on all entries of the tables. Within the framework of developing the SDC

software package, Tau Argus, a fully controlled rounding option has been added [23].

The procedure uses linear programming techniques to round entries up or down and

in addition ensures that all rounded entries add up to the rounded totals. However, the

controlled rounding option is not able to cope with the size, scope and magnitude of

census tabular outputs. Cell suppression is also not typically used for census outputs

due to the need to suppress the same cells across a large number of tables.

Here we describe in more detail three common SDC methods which have been used

to protect whole population frequency tables, a pre-tabular SDC method of record

swapping and post-tabular SDC methods of random rounding and stochastic

perturbation. Examples of applications and case studies can be found in [21, 24, 25, 26].

3.2.1 Record swapping

Record swapping is based on the exchange of values of variable(s) between similar

pairs of population units (often households). In order to minimise bias, pairs of

population units are determined within groups defined by control variables. For

example, for swapping households in a census/register context, control variables may

include: a large geographical area, household size and the age-sex distribution of

individuals in the households. In addition, record swapping can be targeted to high-

risk population units found in small cells of census tables. Typically, geographical variables related to place of residence are swapped. Swapping place of residence has the following properties: (1) it minimises bias based on the assumption

that place of residence is independent of other target variables conditional on the

control variables; (2) it provides more protection in the tables since place of residence

is a highly visible variable which can be used to identify individuals; (3) it preserves

marginal distributions within a larger geographical area.

3.2.2 Semi-controlled random rounding

Another post-tabular method of SDC for frequency tables is based on unbiased random rounding. Let $Floor(x)$ be the largest multiple $kb$ of the base $b$ such that $kb \le x$ for any value of $x$. We define the residual as: $res(x) = x - Floor(x)$. For an unbiased rounding procedure, $x$ is rounded up to $Floor(x) + b$ with probability $\frac{res(x)}{b}$ and rounded down to $Floor(x)$ with probability $1 - \frac{res(x)}{b}$. If $x$ is already a multiple of $b$, it remains unchanged.

In general, each small cell is rounded independently in the table, i.e. a random uniform number $u$ between 0 and 1 is generated for each cell. If $u < \frac{res(x)}{b}$ then the entry is rounded up, otherwise it is rounded down. This ensures an unbiased rounding

rounded up, otherwise it is rounded down. This ensures an unbiased rounding

scheme, i.e. the expectation of the rounding perturbation is zero. However, the

realisation of this stochastic process on a finite number of cells in a table will not

ensure that the sum of the perturbations will exactly equal zero.
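Unbiased random rounding can be sketched in a few lines (numpy; the function name is ours):

```python
import numpy as np

def random_round(x, b, rng):
    """Unbiased random rounding of counts x to base b."""
    x = np.asarray(x, dtype=float)
    floor = b * np.floor(x / b)                  # largest multiple of b <= x
    res = x - floor
    up = rng.uniform(size=x.shape) < res / b     # P(round up) = res(x) / b
    return floor + b * up

rng = np.random.default_rng(1)
rounded = random_round(np.full(100_000, 4.0), b=3, rng=rng)
```

Each rounded entry is either 3 or 6, the average over many cells is close to 4 (illustrating unbiasedness), and multiples of $b$ are left unchanged because $res(x) = 0$.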

To place some control in the random rounding procedure, we use a semi-controlled

random rounding algorithm for selecting entries to round up or down as follows: First

the expected number of entries of a given res(x) that are to be rounded up is

predetermined (for the entire table or for each row/column of the table). The expected

number is rounded to the nearest integer. Based on this expected number, a random

sample of entries is selected (without replacement) and rounded up. The other entries

are rounded down. This process ensures that rounded internal cells aggregate to the

controlled rounded total.

Due to the large number of perturbations in the table, margins are typically rounded

separately from internal cells and tables are not additive. When using semi-controlled

random rounding this alleviates some of the problems of non-additivity since one of

the margins and the overall total will be controlled, i.e. the rounded internal cells

aggregate to the rounded total. Another problem with random rounding is the

consistency of the rounding across the same cells that are generated in different

tables. It is important to ensure that the cell value is always rounded consistently,

otherwise the true cell count can be learnt by generating many tables containing the

same cell and observing the perturbation patterns.


The use of microdata keys which can solve the consistency problem has been

proposed [27]. First, a random number (which they call a key) is defined for each

record in the microdata. When building a census frequency table, records in the

microdata are combined to form a cell defined by the spanning variables of the table.

When these records are combined to a cell, their keys are also aggregated. This

aggregated key serves as the seed for the rounding and therefore same cells will

always have the same seed and result in consistent rounding.
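A sketch of the idea (names are ours): each record carries a uniform random key, and the fractional part of the summed keys of the records in a cell replaces the independent uniform draw, so the same cell always rounds the same way:

```python
import numpy as np

def cell_key_round(record_keys, b):
    """Round a cell count consistently using its aggregated record key."""
    count = len(record_keys)
    cell_key = sum(record_keys) % 1.0        # aggregated key acts as the draw
    floor = b * (count // b)
    res = count - floor
    return floor + b if cell_key < res / b else floor

rng = np.random.default_rng(9)
keys = list(rng.uniform(size=7))             # one key per record in the cell
```

Calling `cell_key_round(keys, 5)` repeatedly always gives the same result for the same set of records, whereas two independent random roundings of the same cell could differ.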

3.2.3 Stochastic perturbation

A more general method than random rounding is stochastic perturbation which

involves perturbing the internal cells of a table using a probability transition matrix and

is similar to the post-randomisation method (PRAM) that is used to perturb categorical

variables in microdata as described in Section 2.2. In this case, it is the cell counts in

a table that are perturbed [24, 27].

Here we focus on the random probability mechanism. Table 4 demonstrates one such

probability mechanism for small cell values of a table.

The probabilities define the chance that a cell value having an original value, of say 1,

will be changed to a perturbed cell value of 2 (in this case the probability is 0.05). The

probability of remaining a value of 1 is 0.80. Note that the sum of the rows is equal to

one. Depending on a random draw, the value of the cell will change or not change

according to the probabilities in the mechanism.

In the example above, if the random draw is below 0.05 then the value of 1 will be

perturbed to 0; if the random draw is between 0.05 and 0.85 then the value of 1 will

remain a 1; if the random draw is between 0.85 and 0.90 then the value of 1 will be

perturbed to 2; if the random draw is between 0.90 and 0.95 then the value of 1 will

be perturbed to 3; and finally if the random draw is above 0.95 then the value of 1 will

be perturbed to 4. Note that in this strategy, an original value of 0 is not perturbed.

Table 4: Example of probability mechanism to perturb small cell values

Original Value    Perturbed Value
                     0      1      2      3      4

0                    1      0      0      0      0
1                 0.05   0.80   0.05   0.05   0.05
2                 0.05   0.05   0.80   0.05   0.05
3                 0.05   0.05   0.05   0.80   0.05
4                 0.05   0.05   0.05   0.05   0.80
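The draw described above can be sketched directly from the probability mechanism of Table 4 (numpy; the function name is ours, and only original values 0-4 are handled, as larger cell values are left unperturbed in this strategy):

```python
import numpy as np

# Rows of Table 4: P(perturbed value | original value), original values 0-4
P = np.array([
    [1.00, 0.00, 0.00, 0.00, 0.00],   # zeros are never perturbed
    [0.05, 0.80, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.80, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.80, 0.05],
    [0.05, 0.05, 0.05, 0.05, 0.80],
])

def perturb_small_cells(cells, P, rng):
    """Replace each small cell value by a draw from its transition row."""
    return np.array([rng.choice(P.shape[1], p=P[c]) for c in cells])

rng = np.random.default_rng(3)
perturbed = perturb_small_cells([0, 1, 2, 3, 4] * 100, P, rng)
```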

The probability mechanism has been modified so that the internally perturbed cell

values in a row/column will (approximately) sum to the perturbed cell value in the

respective margin [24]. This is done by a transformation of the probability mechanism

so that the frequencies of the cell values from the original table will be preserved on

the perturbed table. The transformation is described in more detail in Section 2.2


where we introduce the invariant probability matrix R. In this case t is the vector of

frequencies of the cell values and v is the vector of relative frequencies: v=t/K, where

K is the number of cells in the table. Placing the condition of invariance on the

probability transition matrix ensures that the marginal distribution of the cell values is approximately preserved under the perturbation. As described in the random rounding

procedure, in order to obtain the exact marginal distribution, a similar strategy for

selecting the cell values to change can be carried out. For each cell value i, the

expected number of cells that need to be changed to a different value j is calculated

according to the probabilities in the transition matrix. We then randomly select (without

replacement) the expected number of cells i and carry out the change to j.

To preserve exact additivity in the table, an iterative proportional fitting algorithm can

be used to fit the margins of the table after the perturbation according to the original

margins. This results in cell values that are not integers. Exact additivity with integer

counts can be achieved for simple tables by controlled rounding to base 1 using for

example Tau-Argus [28]. Cell values can also be rounded to their nearest integers

resulting in ‘close’ additivity because of the invariance property of the transition matrix.

Finally, the use of microdata keys can also be adapted to this SDC method to ensure

consistent perturbation of same cells across different tables by fixing the seed for the

perturbation.

3.3 Disclosure risk measures based on Information Theory

Since the tabular data are based on whole population counts, disclosure risk

measurement is straightforward as the disclosure risk is observed. One disclosure

risk measure is simply the percentage of cell values containing a value of 1 or a 2. In

addition, the placement of zero cell values in the table and whether they appear in a

single row or column of the table pose a risk of attribute disclosure.

Degenerate distributions in tables where rows/columns have mainly zero cell values

with only a few non-zero cell values have a high risk of attribute disclosure whereas a

row/column with a uniform distribution of cell counts would have little attribute

disclosure risk. Moreover, a row/column with large counts would have less risk of re-

identification compared to a row/column with small counts. A new disclosure risk

measure based on Information Theory has been proposed [29]. This measure assigns a value between 0 and 1 for the level of risk caused by degenerate distributions, where a value of 1 denotes the extreme case where the entire row/column has zero cell values and only one non-zero cell value.

Using Information Theory, an analytical expression is the following:

The entropy of the frequency vector in a table of size $K$, with population counts $F = (F_1, F_2, \ldots, F_K)$ where $\sum_{i=1}^{K} F_i = N$, is:

$H(P) = -\sum_{i=1}^{K} \frac{F_i}{N} \log \frac{F_i}{N} = \log N - \frac{1}{N} \sum_{i=1}^{K} F_i \log F_i$

and to produce a disclosure risk measure between 0 and 1, the disclosure risk measure is defined as:

$1 - \frac{H(P)}{\log K}.$   (12)
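Measure (12) can be sketched as follows (numpy; the function name and the toy frequency vectors are ours):

```python
import numpy as np

def entropy_risk(F):
    """Risk measure (12): 1 - H(P) / log K for population counts F."""
    F = np.asarray(F, dtype=float)
    N, K = F.sum(), F.size
    p = F[F > 0] / N                 # empty cells contribute nothing to H
    H = -(p * np.log(p)).sum()
    return 1 - H / np.log(K)

degenerate = entropy_risk([0, 0, 0, 9])   # all mass in one cell
uniform = entropy_risk([5, 5, 5, 5])      # evenly spread counts
```

The degenerate row scores 1 (maximal risk from a single non-zero cell) and the uniform row scores 0.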

An extended disclosure risk measure is proposed in (13) which also accounts for the

overall population size of the table and the number of zeros and is defined as a

weighted average of three different terms, each term being a measure between 0 and

1.

R(F, w₁, w₂) = w₁ |A|/K + w₂ [1 − (N log N − ∑_{i=1}^{K} Fᵢ log Fᵢ)/(N log K)] + (1 − w₁ − w₂) [1/log(N + e)]    (13)

where A is the set of zero cells in the table and |A| is the number of zeros in the set, K, N and F are as defined above, and w₁, w₂ are arbitrary weights with 0 ≤ w₁ + w₂ ≤ 1.

The first measure in (13) is the proportion of zeros which is relevant for attribute

disclosure. The third measure in (13) allows us to differentiate between tables with

different magnitudes. As the population size N gets larger in the table, the third

measure tends to zero. The weights w₁ and w₂ should be chosen depending on the agency's judgement of how important each of the terms is in contributing to disclosure risk. Alternatively, one can avoid weights altogether by taking the L2-norm of the three

terms of the risk measure in (13) as follows:

‖x‖₂ / √3 = (1/√3) [ ∑_{i=1}^{3} xᵢ² ]^{1/2}

where xi represents term i, i=1,2,3 in (13). This provides more weight to the larger term

of the risk measure.
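The three terms and their weight-free L2 combination can be sketched as follows. This follows the reconstruction of (13) above, so the exact form of the third (population-size) term should be treated as an assumption:

```python
import math

def risk_terms(counts):
    """The three terms of the extended risk measure, each in [0, 1]:
    proportion of zeros, the entropy term 1 - H(P)/log K, and a term
    that shrinks as the population size N grows (an assumption: taken
    here as 1/log(N + e))."""
    K = len(counts)
    N = sum(counts)
    prop_zeros = sum(1 for f in counts if f == 0) / K
    H = math.log(N) - sum(f * math.log(f) for f in counts if f > 0) / N
    entropy_term = 1.0 - H / math.log(K)
    size_term = 1.0 / math.log(N + math.e)
    return prop_zeros, entropy_term, size_term

def l2_risk(counts):
    """Weight-free alternative: L2-norm of the three terms over sqrt(3)."""
    return math.sqrt(sum(t * t for t in risk_terms(counts))) / math.sqrt(3)
```

Because each term lies in [0, 1], the L2 combination also lies in [0, 1] and is dominated by whichever term is largest.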

The risk measure in (13) has been expanded to include the case of disclosure risk

assessment based on Information Theory following the application of SDC methods

[30]. This risk measure depends on the conditional entropy H(P|Q) where Q is the

distribution following perturbation. The conditional entropy represents the amount of

information needed to recover the original distribution P given that we observe the

confidentialised distribution Q.


3.4 Data utility measures

To assess the impact on data utility for frequency tables of whole population counts,

we can use similar utility measures as described for microdata in Section 2.3 since

many of the measures are based on examining frequency distributions in the original

data versus the perturbed data. For example, the utility measures based on distance

metrics between original and perturbed cell values in a table, and the impact on a Chi-

square test for independence which examines statistical associations between

variables defining the table, are particularly relevant. The aim is to ensure that the power of such statistical tests is not impacted by the perturbation and that there is no change in decisions under statistical inference.

Continuing with the Information Theory based measures outlined above [29], utility can be measured by the Hellinger Distance (defined below), which assesses the distance between two distributions.

In the case of frequency distributions from whole-population tables, where F = (F₁, F₂, …, F_K) is the vector of original counts and G = (G₁, G₂, …, G_K) is the vector of perturbed counts, with ∑_{i=1}^{K} Fᵢ = N and ∑_{i=1}^{K} Gᵢ = M, the Hellinger Distance is defined as:

HD(F, G) = (1/√2) ‖√F − √G‖₂ = [ (1/2) ∑_{i=1}^{K} (√Fᵢ − √Gᵢ)² ]^{1/2}    (14)

The Hellinger Distance takes into account the magnitude of the cells, since the difference between the square roots of two ‘large’ numbers is smaller than the difference between the square roots of two ‘small’ numbers, even if the two pairs have the same absolute difference. Also, the Hellinger Distance is not affected by original cell counts that are zero, as would be the case for relative distance measures. The lower bound remains zero and the upper bound of this distance between count vectors is:

HD²(F, G) = (1/2) ∑_{i=1}^{K} (√Fᵢ − √Gᵢ)² ≤ (1/2) ∑_{i=1}^{K} (Fᵢ + Gᵢ) = (N + M)/2 ,

so that HD(F, G) ≤ √((N + M)/2).

The information loss measure, bounded by 0 and 1 where 0 represents low utility and 1 represents high utility, is:

1 − HD(F, G)/√((N + M)/2) .
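The Hellinger-based utility measure can be computed directly from the original and perturbed count vectors (a minimal sketch; the function names are illustrative):

```python
import math

def hellinger(F, G):
    """Hellinger Distance between two count vectors, as in (14)."""
    return math.sqrt(0.5 * sum((math.sqrt(f) - math.sqrt(g)) ** 2
                               for f, g in zip(F, G)))

def hellinger_utility(F, G):
    """Utility in [0, 1]: 1 - HD(F, G) / sqrt((N + M) / 2)."""
    bound = math.sqrt((sum(F) + sum(G)) / 2)
    return 1.0 - hellinger(F, G) / bound

original  = [10, 0, 5, 25]
perturbed = [11, 0, 4, 25]   # lightly perturbed counts
print(hellinger_utility(original, perturbed))  # close to 1: little loss
```

Note that the zero cell contributes nothing to the distance, unlike a relative distance measure which would be undefined there.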


4. Magnitude tables from business statistics

Concerns about the disclosure of sensitive information arising from magnitude tables of business statistics date back to the 1980s. Magnitude tables are defined as

tables where the cells contain sums or averages of a continuous variable such as total

turnover, profits or revenue and the table is spanned by identifying variables, such as

region and economic activity. It was recognized that potential intruders to this type of

statistical data were other businesses that were interested in learning sensitive

commercial information about their competitors. Therefore, we assume that intruders

are competing businesses in a cell of the table and that the identity of other businesses

in the cell is known. In addition, we assume that the intruders also know the ranking

of the businesses with respect to their size. The main concern is therefore one of

attribute disclosure.

Disclosure risk measures are known as sensitivity measures and are based on

whether a contributor in the cell of a table can learn the values of the target variable

for the other contributors in the cell with sufficient precision. Since business surveys

have large sampling fractions and in particular take-all strata where all large

businesses are required to respond, we do not account for any protection afforded by

sampling.

In the general framework, a table is defined by cross-classification of categorical

variables, such as the standard industrial classification (SIC) and region. Let X denote

a generic cell, N(X) denote the number of contributors in the cell and xi denote the

value of the target variable for contributor i. We define the total in cell X as T(X) = ∑ᵢ xᵢ.

Assume xᵢ > 0 for i = 1, …, N(X) and assume that the observations can be ordered so that x₁ ≥ x₂ ≥ … ≥ x_{N(X)} > 0. Assuming that an intruder is a contributor in the same cell, we wish to prevent the intruder from disclosing the xᵢ value of another contributor i. A crude approach for the intruder would be to estimate xᵢ by T(X), and that may be a good estimate if one business contributes a large proportion of the cell total.

A (1, k) rule classifies a cell as disclosive if x₁ > (k/100) T(X), and this rule can be generalised to the (n, k) rule, which classifies a cell as disclosive if x₁ + … + xₙ ≥ (k/100) T(X). The generalised rule assumes that a number of businesses in a cell, say 2 businesses, can form a coalition to disclose a value for the third business in a cell of size 3. Therefore, a threshold rule is generally upheld and any cell of size n or less is deemed disclosive. For example, if N(X) = 2 we obtain exact disclosure x₁ = T(X) − x₂, and hence a general threshold rule of 3 is used.

Another sensitivity measure is the p% rule. The most precise estimate by the second largest contributor for the value of the largest contributor in a cell is x̂₁ = T(X) − x₂. The percent error is 100 × (x̂₁ − x₁)/x₁ = 100 × (T(X) − x₁ − x₂)/x₁. Under the p% rule, the cell is disclosive if 100 × (T(X) − x₁ − x₂)/x₁ ≤ p. It is also well known that if the parameters of the sensitivity measures, such as p, n or k, are known to intruders,


they can be used to disclose sensitive information and hence the parameters are not

released.
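The (n, k) and p% rules are straightforward to operationalise; in the sketch below the parameter values (n = 2, k = 85, p = 10) are purely illustrative, since agencies keep the actual parameters secret:

```python
def nk_rule_disclosive(xs, n=2, k=85):
    """(n, k) rule: disclosive if the n largest contributors account
    for at least k% of the cell total T(X)."""
    xs = sorted(xs, reverse=True)
    return sum(xs[:n]) >= (k / 100) * sum(xs)

def p_percent_disclosive(xs, p=10):
    """p% rule: disclosive if the second-largest contributor can
    estimate the largest to within p%, i.e. 100*(T - x1 - x2)/x1 <= p."""
    xs = sorted(xs, reverse=True)
    T, x1, x2 = sum(xs), xs[0], xs[1]
    return 100 * (T - x1 - x2) / x1 <= p

cell = [900, 80, 15, 5]            # one dominant business in the cell
print(nk_rule_disclosive(cell))    # True: 900 + 80 >= 85% of 1000
print(p_percent_disclosive(cell))  # True: remainder 20 is ~2.2% of 900
```

A cell with contributions of similar magnitude would pass both rules.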

To protect magnitude tables containing business statistics, table design and cell

suppression are generally used. Based on the sensitivity and threshold rules, the

disclosive cells are suppressed. These are called the primary suppressions. Then,

other cells need to be suppressed to ensure that the primary suppressions are not

revealed through the marginal totals. These are called secondary suppressions. For

a 2-dimensional table for example, at least 2 cells in a row and column, i.e. the vertices

of a rectangle, need to be suppressed to ensure that the primary suppressions are

safe and cannot be recalculated. This is known as the hypercube method for

secondary cell suppression.

To optimise secondary cell suppressions, mathematical linear programming is used

in Tau-Argus where an objective function ∑ 𝐶(𝑋) is minimized [28]. For C(X)=1 we

minimise the total number of cells suppressed, for C(X)=N(X) we minimise the number

of contributors suppressed and for C(X)=T(X) we minimise the total value of the target

variable suppressed. Note that depending on the objective function, information loss

measures should account for how the secondary suppressions are defined. The

solution of the linear programming problem can be computationally demanding (NP-hard), so simplified and alternative solutions may be used. The constraints of the mathematical linear

programming are the preservation of margins and ensuring non-negative values in the

table [14, 23, 31].
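The need for secondary suppressions can be seen with a toy example: a single suppressed cell in a row is trivially recovered from the published margin.

```python
# Toy row with a published margin: suppressing only the disclosive
# cell (value 3) does not protect it.
row = [12, 3, 45]
row_total = sum(row)            # 60, released as the row margin
published = [12, None, 45]      # primary suppression only
recovered = row_total - sum(v for v in published if v is not None)
print(recovered)                # the suppressed value is recovered,
                                # hence secondary suppressions are required
```

This is exactly why, in a 2-dimensional table, at least the four vertices of a rectangle must be suppressed.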

5. Disclosure risk–data utility confidentiality map

The disclosure risk and utility measures can be used to produce a disclosure risk-data utility confidentiality map [32]. We conceptualise the map in Figure 1.

Figure 1: Conceptualised disclosure risk-data utility confidentiality map

In the lower left-hand quadrant of the map in Figure 1, we have low disclosure risk and

low utility. In fact, not releasing data at all will have no utility although some disclosure

risk remains as information about the disclosive nature of the data is leaked by not

allowing its release. In the upper right-hand quadrant of the map in Figure 1, we have

high disclosure risk and high utility. We can see that the original data is above a


maximal tolerable risk threshold determined by the agency and hence SDC methods

need to be applied. SDC methods impact negatively on the utility of the data. Thus

SDC is an iterative process, where different SDC methods are applied with different

parameterisations, and the disclosure risk and data utility are quantified and mapped

on to the disclosure risk-data utility confidentiality map. The SDC method that is below

the risk threshold and having the highest utility is selected. Note that the data points

form a frontier on the map which allows the selection of the optimal SDC method.

A more realistic example of disclosure risk-data utility confidentiality mapping compares random and targeted data swapping procedures with random and targeted PRAM procedures on the variable Local Authority District (LAD) [12]. In this example,

the population includes N = 1,468,255 individuals from an extract of the 2001 UK

Census. We drew 1% Bernoulli samples (n = 14,683) and defined six key categorical

variables for the risk assessment: LAD (11), sex (2), age groups (24), marital status

(6), ethnicity (17), economic activity (10), where the numbers of categories of each

variable are in parentheses (K = 538,560). Both the random data swap and the random

PRAM on the LAD variable were carried out at rates of 2%, 5%, 10% and 20%. The

remaining individuals were not perturbed. For the targeted data swap and targeted

PRAM we carried out more perturbations in the ‘other’ ethnicity group compared to the

‘White British’ ethnicity group.

In Figure 2 we plot the disclosure risk-data utility confidentiality map. The points on the

map represent different candidate releases, that is, perturbation methods with different

levels of perturbation. The points are denoted T for targeted or R for random; 20 for

20%, 10 for 10%, 5 for 5% or 2 for 2%; and S for swapping or P for PRAM. The points

are plotted against the risk measure in (8) in Section 2.1 on the Y-axis and, on the X-axis, a utility measure based on a distance metric of counts in the table defined by LAD by ethnicity. The

distance metric is first calculated as the absolute perturbation per cell. Then, a relative

distance from the average cell count is calculated so that the higher the measure, the

higher the utility. This is denoted as the relative absolute average distance per cell

(RAAD).
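The description of RAAD above leaves some freedom in the exact formula; one plausible reading (an assumption for illustration, not necessarily the measure used in [12]) is:

```python
def raad_utility(original, perturbed):
    """Hypothetical sketch of RAAD: average absolute perturbation per
    cell, taken relative to the average original cell count and
    subtracted from 1 so that higher values mean higher utility."""
    K = len(original)
    avg_abs_dist = sum(abs(o - p) for o, p in zip(original, perturbed)) / K
    avg_cell = sum(original) / K
    return 1.0 - avg_abs_dist / avg_cell
```

Under this reading, an unperturbed table scores 1 and heavier perturbation pushes the score down.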

Figure 2: Disclosure risk-data utility confidentiality map on real example

From Figure 2, we see that we have approximately the same level of utility between

the targeted 10% perturbation and the random 20% perturbation with respect to the


RAAD, although we obtain lower disclosure risk with the targeted 10% perturbation.

The same applies to the targeted 5% perturbation and the random 10% perturbation,

with the targeted 5% perturbation having less disclosure risk than the random 10%

perturbation at the same level of utility. We draw a line to connect points on the

disclosure risk- data utility frontier and note that in all cases, at given levels of utility,

the targeted data swapping provides the lowest disclosure risk compared to the other

methods, although there is little difference between targeted swapping and targeted

PRAM. Targeting did not appear to impact much on the utility, and the general

conclusion here is that targeting seems useful, enabling less perturbation to be applied

and hence more utility for a given level of disclosure risk protection. Of course, this

finding could vary in other settings, and producers of statistics within agencies should use a similar risk-utility approach, based on their own data, to determine their preferred SDC approach.

6. New dissemination strategies

Up until now, we have focused on traditional types of statistical data that are

disseminated by producers of statistics within agencies: tabular data and microdata.

However, with increasing demand for more open and accessible statistical data,

agencies are now considering alternative dissemination strategies. In this section, we

examine some of these strategies.

6.1 Safe data enclaves and remote access

To meet increasing demands for high resolution data, many agencies have set up data

enclaves on their premises where approved researchers can go onsite and gain

access to confidential statistical data. The secure servers within the enclave have no

connection to printers or the internet and only authorised researchers are allowed to

access them. To minimise disclosure risk, no data is to be removed from the enclave

and researchers undergo specialised training to understand the confidentiality

guidelines. Researchers are generally provided with standard software within the

system, such as STATA, SAS and R, but any specialised software would not be

available. All information flow is controlled and monitored. Any outputs to be taken out

of the data enclave are dropped in a folder and manually checked by experienced

confidentiality officers for disclosure risks. Examples of disclosure risks in outputs are

small cell counts in tables, residual plots from regression models which may highlight

outliers and Kernel density estimation with small band-widths.

The disadvantage of the data enclave is the need to travel, sometimes long distances,

to access confidential data. In recent years, some agencies have piloted remote

access by extending the concept of the data enclave to a ‘virtual’ data enclave. These

‘virtual’ data enclaves can be set up at other government agencies, universities and

even on a researcher’s own laptop. Users log on to secure servers via VPN

connections to access the confidential data. All activity is logged and audited at the

keystroke level and outputs are reviewed remotely by confidentiality officers before


being sent back to the researchers via a secure file transfer protocol site. The

technology also allows users within the same research group to interact with one

another while working on the same dataset. An example of this technology is the Inter-

University Consortium for Political and Social Research (ICPSR) housed at the

University of Michigan. The ICPSR maintains access to data archives of social science

data for research and operates both a physical on-site data enclave and a ‘virtual’ data

enclave [33].

6.2 Web-based applications

In recent years, two types of web-based dissemination applications have been

considered by producers of statistics within agencies: flexible table generators and

remote analysis servers.

6.2.1 Flexible table generating servers

Driven by demand from policy makers and researchers for specialised and tailored

tables from statistical data, particularly census data, some agencies are developing

flexible table generating servers that allow users to define and generate their own

tables. The United States Census Bureau [34] and the Australian Bureau of Statistics

[35] have developed such servers for disseminating census tables. Eurostat also

provides a platform for a flexible table generating server for European census counts

[36] although this is based on a series of large uniform hyper-tables that were

produced by each member state and hence the tabulations are more limited and do

not go beyond the underlying tables. Users access the servers via the internet and

define their own table of interest from a set of pre-defined variables and categories

typically from drop down lists. The generated table undergoes a series of checks, and

if it passes the criteria, it is downloaded onto the user’s PC without the need for human

intervention.

Whilst the online flexible table generators have the same types of disclosure risks as

mentioned in Section 3.1, the disclosure risks based on disclosure by differencing and

disclosure by linking tables are now very much relevant since there are no

interventions or manual checks on what tables are produced or how many times tables

are generated. Therefore, for these types of online systems for tables, the statistical

community has recognised the need for perturbative methods to protect against

disclosures. When selecting the SDC technique to apply to the generated output table,

there are two approaches: apply SDC to the underlying data so that all tables

generated in the server are deemed safe for dissemination (pre-tabular SDC), or

produce tables directly from original data and apply the SDC technique to the final

tabular output (post-tabular SDC). Although sometimes neater and less resource-intensive for data from a single source, the pre-tabular approach is problematic since

a large amount of aggregation and perturbation is needed to protect the underlying

data to be used in the server. Therefore, when generating the table through the server,

the SDC is compounded and over-protects the data whilst decreasing the utility of the


table. The post-tabular approach has improved utility since the perturbation is only

carried out on the generated table. A discussion of disclosure risk and data utility for

a flexible table builder can be found [30]. This post-tabular approach is also motivated

by the computer science definition of differential privacy as discussed briefly in Section

7. Often, a combination of pre-tabular and post-tabular approaches is undertaken.

As mentioned, the design of remote table generating servers typically involves many

ad-hoc preliminary SDC checks that can easily be programmed within the system to

determine whether tables can be released or not. These SDC checks may include:

limiting the number of dimensions in the table, minimum population thresholds,

ensuring consistent and nested categories of variables to avoid disclosure by

differencing, etc. If the requested table does not meet the standards, it is not released

through the server and the user is advised to redesign the table.

For flexible table generating, the server has to quantify the disclosure risk in the

original table, apply an SDC technique and then reassess the disclosure risk.

Obviously, the disclosure risk will depend on whether the underlying data is a whole

population (census) and the zeros are real zeros, or the data are from a survey and

the zeros may be random zeros. After the table is protected, the server should also

calculate the impact on the utility by comparing the perturbed table to the original table.

Measures based on Information Theory presented in Sections 3.3 and 3.4 can be used

to assess disclosure risk and utility in a table generating server since they can be

calculated in real time. In addition, some perturbation methods for protecting census

tables are presented in Section 3.2.

As an example, in Table 5 we compare different SDC methods for a census table

defined in one region of the United Kingdom according to banded age groups,

education qualification and occupation. The table contained 2,457 cells where 62.4%

were real zeros. The underlying data in the flexible table generating server was a very

large hypercube which provided a priori protection since no units below the level of

the cells of the hypercube are disseminated. We compare three pre-tabular methods

on the hypercube: record swapping, semi-controlled random rounding and a

stochastic perturbation, and a post-tabular method of semi-controlled random

rounding applied directly to the output table. The measures are based on Information

Theory as described in Sections 3.3 and 3.4.

Table 5: Information Theory disclosure risk and utility for a generated table

                                          Disclosure Risk   Hellinger Distance
Original                                       0.318               -
Perturbed input
  Record Swapping                              0.282             0.988
  Semi-controlled Random Rounding              0.137             0.991
  Stochastic Perturbation                      0.239             0.995
Perturbed output
  Semi-controlled Random Rounding              0.135             0.993


From Table 5, it is clear that the method of record swapping when applied to the input

data did little to reduce the disclosure risk in the final output table. This was due to the

fact that the small cells remain unperturbed in the table. Record swapping provided

the lowest data utility since the geography variable that was swapped was used to

select a sub-population of the table thus increasing the information loss. From among

the input perturbation methods, the semi-controlled random rounding provided the

most protection against disclosure. The stochastic perturbation still leaves small cells

in the table and hence is not as protective as the semi-controlled random rounding.

Both methods have similar information loss. Comparing the pre-tabular and post-

tabular semi-controlled random rounding procedure, we see slightly lower disclosure

risk according to the post-tabular rounding and slightly higher data utility since the

SDC method is not compounded by aggregating rounded cells.

6.2.2 Remote analysis servers

A remote analysis server is an online system which accepts a query from the researcher, runs it within a secure environment on the underlying data, and returns a confidentialised output without the need for human intervention to manually check the

outputs for disclosure risks. Similar to flexible table generators, the queries are

submitted through a remote interface and researchers do not have direct access to

the data. The queries may include exploratory analysis, measures of association,

regression models and statistical testing. The queries can be run on the original data

or confidentialised data and may be restricted and audited depending on the level of

required protection. An example of regression modeling via a remote analysis server

can be found [37].

Figure 3: Confidential residual plot from a regression analysis on receipts for

the Sugar Canes dataset

A comparison of outputs based on original data with two SDC approaches, outputs from confidentialised microdata and confidentialised outputs obtained from the original data via a remote analysis server, has been done [38]. The comparison was carried

out on a dataset from the 1982 survey of the sugar cane industry in Queensland,

Australia [39]. The dataset corresponds to a sample of 338 Queensland sugar farms

and contained the following variables: region, area, harvest, receipts, costs and profits

(equal to receipts minus costs). The dataset was confidentialised by deleting large

outlier farms, coarsening the variable area and adding random noise to harvest,


receipts, costs and profits. Figure 3 shows what the residual plots would look like in a

remote analysis server where the response variable is receipts and the explanatory

variables: region, area, harvests and costs. As can be seen the scatterplot is

presented as sequential box plots and the Normal QQ plot is smoothed.

Figure 4: Univariate analysis of receipts for the Sugar Canes dataset (panels: Original, Confidential Input, Confidential Output)


Figure 4 presents the comparison of the univariate analysis of receipts on the original

dataset, the confidentialised input approach and the confidentialised output approach.

6.3 Synthetic data

Basic confidential data is a fundamental product of virtually all statistical agency

programs. These lead to the publication of public-use products such as summary data,

microdata from surveys, etc. Confidential data may also be used for internal use within

data enclaves. In recent years, there has been a move to produce synthetic microdata

as public-use files which preserve some of the statistical properties of microdata. The

data elements are replaced with synthetic values sampled from an appropriate

probability model. The model is fit to the original data to produce synthetic populations

through a posterior predictive distribution similar to the theory of multiple imputation.

Several samples are drawn from the population to take into account the uncertainty of

the model and to obtain proper variance estimates. Further references and details of

generating synthetic data can be found [40, 41]. Synthesis can also be applied to only part of the data so that a mixture of real and synthetic values is released [42]. One application which uses partially synthetic data is the US Census Bureau ‘On

the Map’ [43]. It is a web-based mapping and reporting application that shows where

workers are employed and where they live according to the Origin-Destination

Employment Statistics.
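The idea of sampling replacement values from a fitted model can be sketched with a deliberately naive example. Proper synthesis would redraw the model parameters from a posterior distribution before generating each of the m datasets; that step is omitted here, so this is 'improper' synthesis for illustration only:

```python
import random
import statistics

def synthesise(values, m=3, seed=42):
    """Naive parametric synthesis sketch: fit a normal model to the
    original values and draw m synthetic datasets from it. Ignoring
    parameter uncertainty makes this 'improper' synthesis; proper
    synthesis draws (mu, sigma) from a posterior for each dataset."""
    rng = random.Random(seed)
    mu, sigma = statistics.mean(values), statistics.stdev(values)
    return [[rng.gauss(mu, sigma) for _ in values] for _ in range(m)]

incomes = [21.0, 35.5, 28.2, 44.1, 30.7]   # illustrative original values
for sample in synthesise(incomes):
    print([round(x, 1) for x in sample])
```

Drawing m > 1 synthetic datasets is what allows the between-sample variability to feed into proper variance estimates.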

In practice it is very difficult to capture all conditional relationships between variables

and within sub-populations. If models used in a statistical analysis are sub-models of

the model used to generate data, then the analysis of multiple synthetic samples

should give valid inferences. In addition, partially synthetic datasets may still have

disclosure risks and need to be checked prior to dissemination.

For tabular data there are also techniques to develop synthetic magnitude tables

arising from business statistics. Controlled tabular adjustment (CTA) carries out cell

suppression and replaces the suppressed cells with imputed values that guarantee

some statistical properties such as preserving the margins of the table [44]. Other

perturbative methods such as correlated noise addition can also be used.

7. Statistical disclosure control: Where do we go from here?

This article focused on an understanding of the SDC research and methods involved

in preparing traditional statistical data before their release. The main goal of SDC is to

protect the confidentiality of those whose information is in a dataset, while still

maintaining the usefulness of the data itself. In recent years, however, agencies have

been restricting access to statistical data due to their inability to cope with the large

demand for data whilst ensuring the confidentiality of statistical units. On the other

hand, government initiatives for more open and accessible data is pushing agencies

producing official and national statistics to explore new ways of disseminating

statistical data. One disclosure risk that is often overlooked in traditional statistical


outputs, and is only now coming to prominence with ongoing development into web-

based interactive data dissemination, is inferential disclosure. Inferential disclosure

risk is the ability to learn new attributes with high probability. For example, the proportion of some characteristic may be very high within a subgroup, e.g. a high proportion of those who smoke have heart disease; or a regression model may have very high predictive power if the dependent and explanatory variables are highly correlated, e.g. regressing BMI on height and weight. In fact, an individual does not even have to be in the dataset

in order to disclose information. Another example of inferential disclosure is disclosure

by differencing frequency tables of whole population counts when multiple tabular

releases are disseminated from one data source.

Inferential disclosure risk forms the basis of the definition of differential privacy as

formulated in the computer science literature [45]. Differential privacy is a

mathematically principled method of measuring how secure a protection mechanism

is with respect to personal data disclosures. It incorporates all traditional disclosure

risks and inferential disclosure in a ‘worst-case’ scenario. The theory of differential

privacy was developed in the context of masking queries from a remote online query

system. It is now coming to the attention of agencies who are under increasing

pressures to broaden access to data and to provide better solutions for the release of

statistical data, for example through interactive web-based platforms, and therefore

are in need of stricter and more formal privacy guarantees. This has changed the

landscape of how disclosure risks are defined and has led to agencies recognising the

need for more use of perturbative methods of SDC. In addition, agencies need to

recognise that the SDC parameters, e.g. variance of noise or swap rates, need to be

released to researchers so that they can account for the measurement/perturbation

error in their statistical analysis. Since differential privacy is a cryptographic method,

the parameters of the noise addition are not secret and can be released to

researchers.

A discussion on how differential privacy (as introduced in the computer science

literature [45, 47, 48]) relates to the disclosure risk scenarios for survey microdata can

be found [46]. In addition, they investigate whether current SDC practices at agencies

producing official and national statistics for survey microdata meet the strict privacy

guarantees of differential privacy. Differential privacy for example makes no distinction

between key variables and sensitive variables, or whether the data comes from a

census or a sample. It is assumed in a worst-case scenario that an intruder has

complete information of the entire database except for one target individual and wishes

to learn about an attribute value for the target individual.

In the survey setting, there are two possible definitions of the database: the population ‘database’ x_U = (x_1, …, x_N) and the sample ‘database’ x_s = (x_1, …, x_n), where N denotes the size of the population U = {1, …, N} and, without loss of generality, we write s = {1, …, n}. The sample database might be viewed from one perspective as more

realistic, since it contains the data collected by the agency, whereas the population

database would include values of survey variables for non-sampled units, which are


unknown to the agency. In the context of differential privacy, we use the population

database x_U to define privacy, treat the sampling as part of the SDC mechanism and suppose that prior intruder knowledge relates to aspects of x_U.

Let x̃_i denote the cell value of unit i in the microdata after SDC/sampling has been applied and let f̃_j = Σ_{i∈s} I(x̃_i = j) denote the corresponding observed count in cell j in the microdata. We view the released microdata as the vector of counts f̃ = (f̃_1, …, f̃_J), and P(f̃ | x_U) as the probability of f̃ with respect to an SDC/sampling mechanism where x_U is treated as fixed. The following definition is considered [46].

Definition - ε-differential privacy holds if:

   max | ln( P(f̃ | x_U^(1)) / P(f̃ | x_U^(2)) ) | ≤ ε        (15)

for some ε > 0, where the maximum is over all pairs (x_U^(1), x_U^(2)) which differ in only one element and across all possible values of f̃.

The disclosure risk in this definition is inferential disclosure: if only one value is changed in the population database, the intruder is unable to gain any substantial new knowledge about a target individual, even when all other individuals in the population are known.
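Definition (15) can be checked by brute force for a small mechanism. The sketch below uses a hypothetical setting (a binary attribute, N = 3, randomised response with p = 0.75 — none of these come from the paper) and enumerates every outcome to confirm that the maximum log-ratio over neighbouring populations equals the per-record bound ln(p/(1−p)):

```python
import itertools
import math
from collections import Counter

# Hypothetical toy setting (not from the paper): a binary attribute in a
# population of size N = 3, with randomised response applied independently
# to every unit, P(report = true value) = p.  The "released microdata" is
# the vector of counts f = (f_0, f_1).
p = 0.75
N = 3

def prob_of_counts(x_U):
    """P(f | x_U): distribution of the released count vector when each
    unit's value is flipped independently with probability 1 - p."""
    dist = Counter()
    for flips in itertools.product([0, 1], repeat=N):
        pr = 1.0
        released = []
        for x, flip in zip(x_U, flips):
            pr *= (1 - p) if flip else p
            released.append(x ^ flip)
        dist[(released.count(0), released.count(1))] += pr
    return dist

def max_log_ratio(x1, x2):
    """The quantity inside definition (15) for one pair of databases."""
    d1, d2 = prob_of_counts(x1), prob_of_counts(x2)
    return max(abs(math.log(d1[f] / d2[f])) for f in d1)

# Maximise over all pairs of populations differing in exactly one element.
eps = max(max_log_ratio(x, x[:i] + (1 - x[i],) + x[i + 1:])
          for x in itertools.product([0, 1], repeat=N)
          for i in range(N))
print(eps)  # equals ln(p / (1 - p)) = ln 3, the per-record bound
```

Because randomised response gives every outcome positive probability under every database, the ratio in (15) stays bounded — the same reason the misclassification mechanisms discussed below satisfy the definition.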

A further definition of (ε, δ)-probabilistic differential privacy holds if (15) applies with probability at least 1 − δ for some δ > 0 [49]. More precisely, this definition holds if the space of possible outcomes f̃ may be partitioned into ‘good’ and other (unbounded) outcomes, (15) holds when the outcome is good, and the probability that the outcome is good is at least 1 − δ. This definition is essentially the same as the notion of probabilistic differential privacy where the set of bad outcomes is referred to as the disclosure set [50].
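The partition into good and bad outcomes can be evaluated numerically once the outcome distributions of two neighbouring databases are known. A minimal sketch, with invented toy distributions purely for illustration:

```python
import math

def probabilistic_dp_delta(dist1, dist2, eps):
    """delta = probability, under dist1 = P(. | x_U^(1)), of the 'bad'
    outcomes on which the bound (15) fails for this pair of neighbouring
    databases; on the remaining 'good' outcomes (15) holds with the
    given eps."""
    delta = 0.0
    for f, p1 in dist1.items():
        p2 = dist2.get(f, 0.0)
        good = p1 > 0 and p2 > 0 and abs(math.log(p1 / p2)) <= eps
        if not good:
            delta += p1
    return delta

# Invented outcome distributions for two neighbouring databases.
d1 = {'a': 0.9, 'b': 0.1}
d2 = {'a': 0.5, 'b': 0.5}
print(probabilistic_dp_delta(d1, d2, eps=1.0))  # only 'b' violates the bound
```

Here |ln(0.9/0.5)| ≈ 0.59 ≤ 1, so ‘a’ is good, while |ln(0.1/0.5)| ≈ 1.61 > 1, so ‘b’ is bad and δ = 0.1.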

An investigation to determine whether common practices for preserving the confidentiality of respondents in survey microdata at agencies producing official and national statistics, such as sampling and perturbation, are differentially private has been carried out [46]. They found that non-perturbative methods such as sampling are not differentially private, since an unbounded ratio in (15) will occur if, for neighbouring databases, the target individual is a population unique, i.e. f̃_j = F_j = 1, where F_j denotes the population count in cell j. In that case, for a given f̃ and any sampling scheme where some element f̃_j of f̃ may equal F_j with positive probability, there exists a database x_U^(1) such that f̃_j = F_j^(1) = 1 for some j and Pr[f̃ | x_U^(1)] > 0. Now if we change the element of x_U^(1) which takes the value j to construct x_U^(2), for which F_j^(2) = F_j^(1) − 1 = f̃_j − 1 = 0, we obtain Pr[f̃ | x_U^(2)] = 0, since the sample count in cell j can no longer equal f̃_j. Hence, ε-differential privacy does not hold for a very broad class of sampling schemes.
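The counterexample can be reproduced by enumeration. The sketch below assumes Bernoulli sampling of an invented four-unit population with one population unique (the sampling fraction π = 0.5 and the cell labels are illustrative, not from the paper):

```python
import itertools
from collections import Counter

def sample_count_dist(x_U, pi=0.5, J=2):
    """P(f | x_U) under Bernoulli(pi) sampling with no perturbation:
    each unit enters the sample independently with probability pi and
    its cell value (1..J) is released unchanged; f is the vector of
    sample cell counts."""
    dist = Counter()
    for incl in itertools.product([0, 1], repeat=len(x_U)):
        pr = 1.0
        f = [0] * J
        for x, s in zip(x_U, incl):
            pr *= pi if s else (1 - pi)
            if s:
                f[x - 1] += 1
        dist[tuple(f)] += pr
    return dist

# Hypothetical neighbouring populations over two cells: in x1 the target
# is a population unique in cell 1 (F^(1) = (1, 3)); in x2 its value is
# changed to cell 2 (F^(2) = (0, 4)).
x1 = (1, 2, 2, 2)
x2 = (2, 2, 2, 2)
d1, d2 = sample_count_dist(x1), sample_count_dist(x2)

f = (1, 1)            # the unique and one other unit are sampled
print(d1[f], d2[f])   # positive vs. zero: the ratio in (15) is unbounded

# The 'bad' outcomes are exactly those in which the unique is sampled, so
# the delta of (epsilon, delta)-probabilistic differential privacy is pi.
delta = sum(pr for counts, pr in d1.items() if counts[0] == 1)
print(delta)
```

Under x1 the outcome f = (1, 1) has probability π²(1 − π)² · 3 = 0.1875, while under x2 it is impossible, so the log-ratio in (15) is infinite.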

One of the reasons why the disclosure implications of this finding might not be

considered a cause for concern by an agency is that it is unrealistic to assume that an


intruder will have precise information on all individuals in a population except for one

target individual. In addition, the probability of a population unique given a sample

unique for typical social survey microdata is very small, and hence we can adopt the (ε, δ)-probabilistic differential privacy definition where the probability of the set of bad outcomes is very small [46]. A further examination of whether perturbation under a misclassification mechanism, similar to the one shown in formula (5) in Section 2.1 and the stochastic mechanism shown in Table 2 in Section 3.2, is differentially private found that if all elements of the misclassification matrix M are positive, then the ratio in (15) will be bounded.
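This bound can be sketched for a PRAM-type mechanism: for a single changed record, the worst case of the ratio in (15) is governed by the largest contrast between entries within a row of the transition matrix. The matrix shown is illustrative, not one from the paper:

```python
import math

def misclassification_epsilon(M):
    """Per-record bound on the ratio in (15) for a PRAM-type
    misclassification mechanism with transition matrix M, where
    M[k][j] = P(released value k | true value j).  If every entry is
    positive, the bound max_{k,j,j'} |ln(M[k][j] / M[k][j'])| is finite."""
    assert all(m > 0 for row in M for m in row), "a zero entry makes (15) unbounded"
    return max(abs(math.log(row[j] / row[jp]))
               for row in M
               for j in range(len(row))
               for jp in range(len(row)))

# Illustrative 3x3 transition matrix (columns sum to 1).
M = [[0.8, 0.1, 0.1],
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8]]
print(misclassification_epsilon(M))  # ln(0.8 / 0.1) ~ 2.079
```

Stronger perturbation (rows closer to uniform) shrinks this bound towards zero, at the cost of data utility.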

One area where differential privacy is showing promise for SDC applications at

agencies producing official and national statistics is for the case of developing an

online flexible table generating server as defined in Section 6.2.1. As mentioned,

differential privacy aims to avoid inferential disclosure by ensuring that an intruder

cannot make inferences about a single unit when only one of its values is changed, given that all other units in the population are known. This definition covers disclosure by differencing and linking tables, which are the main disclosure risks of concern when

developing a flexible table generator, particularly for whole population counts. The

solution for guaranteeing differential privacy is to add noise/perturbation to the outputs of the queries, i.e. to the cells of the table, under specific parameterisations. The

privacy guarantee is set a priori and is used to define the prescribed probability

distribution of the perturbation. Research is still ongoing for use of the differential

privacy standard in an online flexible table generating server and it has yet to be

implemented. There are promising developments for flexible table generators that are

based on a small set of fixed variables and subject to other protection guarantees

where all potential tables (and their cells) are known in advance and hence can be

perturbed a priori. This is known as a non-interactive mechanism in differential privacy,

and all subsequent and repeated queries and analyses on the protected data do not

impact on the privacy guarantee [51]. The challenges for the future of statistical

disclosure control are to examine the potential of formal privacy guarantees through

for example the differential privacy standard, and to develop applications for more

open and innovative dissemination strategies at agencies producing statistics.
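As an illustration of output perturbation calibrated to a preset privacy guarantee, here is a minimal sketch of the Laplace mechanism of [45] applied to table cells. The table, ε and seed are invented, and the sensitivity of 2 reflects the ‘change one element’ definition of neighbouring databases used in (15), under which one unit moves between cells:

```python
import math
import random

def laplace(scale, rng):
    """Draw one Laplace(0, scale) variate by inverse-transform sampling."""
    u = rng.random() - 0.5          # uniform on [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def perturb_table(counts, epsilon, seed=None):
    """epsilon-differentially private release of a vector of cell counts.
    Changing one element of x_U moves one unit between cells, so the L1
    sensitivity of the count vector is 2; adding independent
    Laplace(2 / epsilon) noise to every cell then bounds the ratio in
    (15) by epsilon.  Illustrative only - a production server would also
    have to handle rounding, negative cells and repeated queries."""
    rng = random.Random(seed)
    scale = 2.0 / epsilon
    return [c + laplace(scale, rng) for c in counts]

table = [120, 31, 4, 0, 77]         # hypothetical cell counts
noisy = perturb_table(table, epsilon=1.0, seed=42)
print([round(v, 2) for v in noisy])
```

In a non-interactive release, the whole perturbed table is generated once; repeated queries against it then consume no further privacy budget, as noted above [51].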


8. References

[1] See https://www.ukdataservice.ac.uk/.
[2] See https://www.ipums.org/ for more information.
[3] Skinner, C.J. and Elliot, M.J. (2002). A Measure of Disclosure Risk for Microdata. Journal of the Royal Statistical Society, Ser. B, 64, 855-867.
[4] Yancey, W.E., Winkler, W.E. and Creecy, R.H. (2002). Disclosure Risk Assessment in Perturbative Micro-data Protection. In: Inference Control in Statistical Databases (ed. J. Domingo-Ferrer), New York: Springer, 135-151.
[5] Domingo-Ferrer, J. and Torra, V. (2003). Disclosure Risk Assessment in Statistical Microdata Protection via Advanced Record Linkage. Statistics and Computing, Vol. 13, No. 4, 343-354.
[6] Reiter, J.P. (2005a). Estimating Risks of Identification Disclosure in Microdata. Journal of the American Statistical Association, 100, 1103-1112.
[7] Bethlehem, J., Keller, W. and Pannekoek, J. (1990). Disclosure Limitation of Microdata. Journal of the American Statistical Association, 85, 38-45.
[8] Elamir, E. and Skinner, C.J. (2006). Record-Level Measures of Disclosure Risk for Survey Micro-data. Journal of Official Statistics, 22, 525-539.
[9] Skinner, C.J. and Shlomo, N. (2008). Assessing Identification Risk in Survey Micro-data Using Log-linear Models. Journal of the American Statistical Association, Vol. 103, No. 483, 989-1001.
[10] Skinner, C.J. and Holmes, D. (1998). Estimating the Re-identification Risk Per Record in Microdata. Journal of Official Statistics, 14, 361-372.
[11] Rinott, Y. and Shlomo, N. (2007). Variances and Confidence Intervals for Sample Disclosure Risk Measures. 56th Session of the International Statistical Institute, Invited Paper, Lisbon 2007.
[12] Shlomo, N. and Skinner, C.J. (2010). Assessing the Protection Provided by Misclassification-Based Disclosure Limitation Methods for Survey Microdata. Annals of Applied Statistics, Vol. 4, No. 3, 1291-1310.
[13] Gouweleeuw, J., Kooiman, P., Willenborg, L.C.R.J. and De Wolf, P.P. (1998). Post Randomisation for Statistical Disclosure Control: Theory and Implementation. Journal of Official Statistics, 14, 463-478.
[14] Willenborg, L. and De Waal, T. (2001). Elements of Statistical Disclosure Control. Lecture Notes in Statistics, 155. New York: Springer-Verlag.
[15] Domingo-Ferrer, J., Mateo-Sanz, J. and Torra, V. (2001). Comparing SDC Methods for Micro-Data on the Basis of Information Loss and Disclosure Risk. ETK-NTTS Proceedings of the Conference, Crete, June 2001.
[16] Shlomo, N. and De Waal, T. (2008). Protection of Micro-data Subject to Edit Constraints Against Statistical Disclosure. Journal of Official Statistics, 24, No. 2, 1-26.
[17] Kim, J.J. (1986). A Method for Limiting Disclosure in Micro-data Based on Random Noise and Transformation. American Statistical Association, Proceedings of the Section on Survey Research Methods, 370-374.
[18] Fuller, W.A. (1993). Masking Procedures for Micro-data Disclosure Limitation. Journal of Official Statistics, 9, 383-406.
[19] Gomatam, S. and Karr, A. (2003). Distortion Measures for Categorical Data Swapping. Technical Report No. 131, National Institute of Statistical Sciences.
[20] Shlomo, N. and Young, C. (2006). Statistical Disclosure Control Methods Through a Risk-Utility Framework. In PSD'2006 Privacy in Statistical Databases (eds. J. Domingo-Ferrer and L. Franconi), Springer LNCS 4302, 68-81.
[21] Shlomo, N. (2007). Statistical Disclosure Control Methods for Census Frequency Tables. International Statistical Review, Vol. 75, No. 2, 199-217.
[22] See https://www.nomisweb.co.uk/.
[23] Hundepool, A., Domingo-Ferrer, J., Franconi, L., Giessing, S., Nordholt, E.S., Spicer, K. and de Wolf, P.P. (2012). Statistical Disclosure Control. Wiley Series in Survey Methodology. John Wiley & Sons, United Kingdom.
[24] Shlomo, N. and Young, C. (2008). Invariant Post-tabular Protection of Census Frequency Counts. In PSD'2008 Privacy in Statistical Databases (eds. J. Domingo-Ferrer and Y. Saygin), Springer LNCS 5262, 77-89.
[25] Shlomo, N., Tudor, C. and Groom, P. (2010). Data Swapping for Protecting Census Tables. In PSD'2010 Privacy in Statistical Databases (eds. J. Domingo-Ferrer and E. Magkos), Springer LNCS 6344, 41-51.
[26] Shlomo, N., Antal, L. and Elliot, M. (2015). Measuring Disclosure Risk and Data Utility for Flexible Table Generators. Journal of Official Statistics, 31, 305-324.
[27] Fraser, B. and Wooton, J. (2005). A Proposed Method for Confidentialising Tabular Output to Protect Against Differencing. Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, Geneva, 9-11 November.
[28] Salazar-Gonzalez, J.J., Bycroft, C. and Staggemeier, A.T. (2005). Controlled Rounding Implementation. Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, Geneva, 9-11 November.
[29] Antal, L., Shlomo, N. and Elliot, M. (2014). Measuring Disclosure Risk with Entropy in Population Based Frequency Tables. In PSD'2014 Privacy in Statistical Databases (ed. J. Domingo-Ferrer), Springer LNCS 8744, 62-78.
[30] Antal, L., Shlomo, N. and Elliot, M. (2015). Disclosure Risk Measurement with Entropy in Two-Dimensional Sample Based Frequency Tables. Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, Helsinki, October 2015. https://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.46/20150/Paper_14_Session_1_-_Univ._Manchester.pdf
[31] Duncan, G.T., Elliot, M. and Salazar-Gonzalez, J.J. (2011). Statistical Confidentiality. Springer, New York.
[32] Duncan, G., Keller-McNulty, S. and Stokes, S. (2001). Disclosure Risk vs. Data Utility: the R-U Confidentiality Map. Technical Report LA-UR-01-6428, Statistical Sciences Group, Los Alamos, N.M.: Los Alamos National Laboratory.
[33] See https://www.icpsr.umich.edu/icpsrweb/content/ICPSR/access/restricted/enclave.html.
[34] See https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml.
[35] See http://www.abs.gov.au/websitedbs/censushome.nsf/home/tablebuilder.
[36] See https://ec.europa.eu/CensusHub2/query.do?step=selectHyperCube&qhc=false.
[37] O'Keefe, C.M. and Good, N. (2008). A Remote Analysis Server - What Does Regression Output Look Like? In PSD'2008 Privacy in Statistical Databases (eds. J. Domingo-Ferrer and Y. Saygin), Springer LNCS 5262, 270-283.
[38] O'Keefe, C.M. and Shlomo, N. (2012). Comparison of Remote Analysis with Statistical Disclosure Control for Protecting the Confidentiality of Business Data. Transactions on Data Privacy, Vol. 5, Issue 2, 403-432.
[39] Chambers, R.L. and Dunstan, R. (1986). Estimating Distribution Functions from Survey Data. Biometrika, Vol. 73, 597-604.
[40] Raghunathan, T.E., Reiter, J. and Rubin, D. (2003). Multiple Imputation for Statistical Disclosure Limitation. Journal of Official Statistics, 19, No. 1, 1-16.
[41] Reiter, J.P. (2005b). Releasing Multiply Imputed, Synthetic Public-Use Microdata: An Illustration and Empirical Study. Journal of the Royal Statistical Society, Ser. A, Vol. 168, No. 1, 185-205.
[42] Little, R.J.A. and Liu, F. (2003). Selective Multiple Imputation of Keys for Statistical Disclosure Control in Microdata. The University of Michigan Department of Biostatistics Working Paper Series, Working Paper 6.
[43] See http://onthemap.ces.census.gov/.
[44] Dandekar, R.A. and Cox, L.H. (2002). Synthetic Tabular Data: An Alternative to Complementary Cell Suppression. Manuscript, Energy Information Administration, U.S. Department of Energy.
[45] Dwork, C., McSherry, F., Nissim, K. and Smith, A. (2006). Calibrating Noise to Sensitivity in Private Data Analysis. In Theory of Cryptography TCC (eds. S. Halevi and T. Rabin), Heidelberg: Springer, LNCS Vol. 3876, 265-284.
[46] Shlomo, N. and Skinner, C.J. (2012). Privacy Protection from Sampling and Perturbation in Survey Microdata. Journal of Privacy and Confidentiality, Vol. 4, Issue 1.
[47] Dinur, I. and Nissim, K. (2003). Revealing Information While Preserving Privacy. PODS 2003, 202-210.
[48] Dwork, C. and Roth, A. (2014). The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 9, 211-407.
[49] Chaudhuri, K. and Mishra, N. (2006). When Random Sampling Preserves Privacy. Proceedings of the 26th International Cryptology Conference.
[50] Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J. and Vilhuber, L. (2008). Privacy: Theory Meets Practice on the Map. In Proceedings of the 24th International Conference on Data Engineering, Cancun, Mexico, 277-286.
[51] Rinott, Y., O'Keefe, C., Shlomo, N. and Skinner, C. (2018). Confidentiality and Differential Privacy in the Dissemination of Frequency Tables. Statistical Science, Vol. 33, No. 3, 358-385.