Assessing Disclosure Risk in Sample Microdata Under Misclassification

Natalie Shlomo
Southampton Statistical Sciences Research Institute, University of Southampton
[email protected]

This is joint work with Prof Chris Skinner


Post on 17-Dec-2015



Topics Covered

• Introduction and motivation

• Disclosure risk assessment for sample microdata:
  • Probabilistic modelling extended for misclassification
  • Probabilistic record linkage – linking the frameworks

• Risk-utility framework for microdata subject to misclassification

• Discussion


Introduction

• Disclosure risk scenario: ‘intruder’ attack on microdata through linking to available public data sources

• Linkage via identifying key variables common to both sources, e.g. sex, age, region, ethnicity

• Agencies limit risk of identification through statistical disclosure limitation (SDL) methods:
  • Non-perturbative – sub-sampling, recoding and collapsing categories of key variables, deleting variables
  • Perturbative – data swapping, additive noise, misclassification (PRAM) and synthetic data


Introduction

• Need to quantify the risk of identification to inform microdata release

• Probabilistic models for quantifying risk of identification based on population uniqueness on identifying key variables

• Population counts in contingency tables spanned by key variables are unknown

• Distributional assumptions are needed to draw inference from the sample for estimating population parameters

• These models assume that the microdata have not been altered, either through misclassification arising from data processes or through perturbation purposely introduced for SDL


Introduction

• Expand probabilistic modelling for quantifying risk of identification under misclassification/perturbation

• For perturbative methods of SDL, risk assessment typically based on probabilistic record linkage:
  • Conservative assessment of risk of identification
  • Assumes that intruder has access to original dataset and does not take into account protection afforded by sampling

• Fit probabilistic record linkage into the probabilistic modelling framework for categorical matching variables


Disclosure Risk Assessment

Probabilistic Modelling – no misclassification

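• Let $f = \{f_k\}$ denote a q-way frequency table which is a sample from a population table $F = \{F_k\}$, where $k = (k_1, \ldots, k_q)$ indicates a cell, $F_k$ the population count and $f_k$ the sample count in cell $k$

• Disclosure risk measures:

$$\tau_1 = \sum_k I(f_k = 1, F_k = 1), \qquad \tau_2 = \sum_k I(f_k = 1)\,\frac{1}{F_k}$$

• For unknown population counts, estimate from the conditional distribution of $F_k \mid f_k$:

$$\hat{\tau}_1 = \sum_k I(f_k = 1)\,\hat{P}(F_k = 1 \mid f_k = 1), \qquad \hat{\tau}_2 = \sum_k I(f_k = 1)\,\hat{E}\!\left(\frac{1}{F_k} \,\Big|\, f_k = 1\right)$$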
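When the population counts are known (for example in a census-based experiment), the two global measures can be computed directly: the number of sample uniques that are also population unique, and the sum of 1/F_k over sample uniques. The following is a minimal sketch; the function name `global_risk` and the toy counts are illustrative assumptions, not part of the presentation.

```python
def global_risk(sample_counts, population_counts):
    """Return (tau1, tau2) for cell counts aligned by cell index k."""
    tau1 = 0
    tau2 = 0.0
    for f_k, F_k in zip(sample_counts, population_counts):
        if f_k == 1:                      # only sample uniques contribute
            tau1 += 1 if F_k == 1 else 0  # sample unique that is population unique
            tau2 += 1.0 / F_k             # chance of a correct match in cell k
    return tau1, tau2

# Toy table with five cells (made-up counts, with F_k >= f_k in every cell).
f = [1, 0, 1, 2, 1]
F = [1, 3, 4, 10, 2]
tau1, tau2 = global_risk(f, F)
```

Three cells are sample unique here, and only the first is also population unique.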


Disclosure Risk Assessment

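• Natural assumption: $F_k \sim \mathrm{Poisson}(\lambda_k)$

• Bernoulli sampling: $f_k \mid F_k \sim \mathrm{Bin}(F_k, \pi_k)$, where $\pi_k$ is the sampling fraction in cell $k$

• It follows that $f_k \sim \mathrm{Poisson}(\pi_k \lambda_k)$ and $F_k - f_k \mid f_k \sim \mathrm{Poisson}(\lambda_k(1 - \pi_k))$, where the $F_k \mid f_k$ are conditionally independent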
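The step from the Poisson and Bernoulli sampling assumptions to the conditional distribution can be checked by factorising the joint probability. This is a short worked derivation using only the symbols $\lambda_k$ and $\pi_k$ defined on the slide; writing $F$ and $f$ for particular values of $F_k$ and $f_k$ is a notational assumption.

```latex
\begin{align*}
P(F_k = F,\, f_k = f)
  &= \frac{e^{-\lambda_k}\lambda_k^{F}}{F!}\binom{F}{f}\pi_k^{f}(1-\pi_k)^{F-f} \\
  &= \frac{e^{-\pi_k\lambda_k}(\pi_k\lambda_k)^{f}}{f!}
     \cdot
     \frac{e^{-\lambda_k(1-\pi_k)}\{\lambda_k(1-\pi_k)\}^{F-f}}{(F-f)!}.
\end{align*}
```

The first factor is the Poisson$(\pi_k\lambda_k)$ probability of $f_k = f$ and the second is the Poisson$(\lambda_k(1-\pi_k))$ probability of $F_k - f_k = F - f$, so the unsampled count $F_k - f_k$ is independent of $f_k$.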


Disclosure Risk Assessment

• Skinner and Holmes (1998) and Elamir and Skinner (2006) use log-linear models to estimate the parameters $\{\lambda_k\}$

• Sample frequencies $f_k$ are independent Poisson distributed with a mean of $\mu_k = \pi_k \lambda_k$

• Log-linear model for estimating $\{\mu_k\}$ expressed as $\log(\mu_k) = x_k'\beta$, where $x$ is the design matrix of key variables and their interactions (with row $x_k$ for cell $k$)

• MLEs calculated by solving the score equation:

$$\sum_k \left[f_k - \exp(x_k'\beta)\right] x_k = 0$$
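For the main-effects (independence) model on a two-way table the maximum likelihood fitted means have a closed form, row total times column total divided by the grand total, so the score equation can be checked without an iterative solver. The sketch below does exactly that on a made-up 2x3 table; the helper names `independence_fit` and `score` are assumptions for illustration, and larger models with interactions would need an iterative fit instead.

```python
def independence_fit(table):
    """Fitted Poisson means under the main-effects log-linear model (2-way table)."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_tot] for r in row_tot]

def score(table, fitted, design_column):
    """One component of sum_k [f_k - mu_k] x_k for an indicator column of x."""
    return sum((table[i][j] - fitted[i][j]) * design_column(i, j)
               for i in range(len(table)) for j in range(len(table[0])))

f = [[3, 1, 0],
     [2, 4, 5]]                      # made-up sample counts
mu_hat = independence_fit(f)

# Design columns: intercept, one row indicator, two column indicators.
scores = [score(f, mu_hat, lambda i, j: 1),
          score(f, mu_hat, lambda i, j: int(i == 1)),
          score(f, mu_hat, lambda i, j: int(j == 1)),
          score(f, mu_hat, lambda i, j: int(j == 2))]
```

Every entry of `scores` is zero up to rounding, which is the score equation restated as "fitted margins equal observed margins".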


Disclosure Risk Assessment

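• Fitted values calculated by: $\hat{\mu}_k = \exp(x_k'\hat{\beta})$ and $\hat{\lambda}_k = \hat{\mu}_k / \pi_k$

• Individual risk measures estimated by:

$$\hat{P}(F_k = 1 \mid f_k = 1) = \exp(-\hat{\lambda}_k(1 - \pi_k))$$

$$\hat{E}\!\left(\frac{1}{F_k} \,\Big|\, f_k = 1\right) = \frac{1 - \exp(-\hat{\lambda}_k(1 - \pi_k))}{\hat{\lambda}_k(1 - \pi_k)}$$

• Skinner and Shlomo (2009) develop goodness-of-fit criteria which minimize the bias of disclosure risk estimates, for example, for $\hat{\tau}_1$:

$$\hat{B}_1 = \sum_k \hat{\lambda}_k \exp(-\hat{\lambda}_k)(1 - \pi_k)\left\{(f_k - \hat{\mu}_k) + (1 - \pi_k)\,\frac{(f_k - \hat{\mu}_k)^2 - f_k}{2\pi_k}\right\}$$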
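Both per-record estimates depend only on the mean of the unsampled count, $\hat{\lambda}_k(1 - \pi_k)$. A minimal sketch follows, assuming a fitted $\hat{\lambda}_k$ is already available; the function name `per_record_risk` and the example values are hypothetical.

```python
import math

def per_record_risk(lambda_hat, pi):
    """Return (P_hat(F=1 | f=1), E_hat(1/F | f=1)) for one sample-unique cell."""
    nu = lambda_hat * (1.0 - pi)          # mean of F_k - f_k given f_k
    p_unique = math.exp(-nu)              # no unsampled units share the cell
    # E[1/(1+Y)] for Y ~ Poisson(nu); the limit as nu -> 0 is 1.
    e_inv = (1.0 - math.exp(-nu)) / nu if nu > 0 else 1.0
    return p_unique, e_inv

# Hypothetical cell: fitted population mean 2.0 under a 1% sample.
p_unique, e_inv = per_record_risk(lambda_hat=2.0, pi=0.01)
```

The second value is never smaller than the first, since 1/F_k equals one when the cell is population unique and stays positive otherwise.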


Disclosure Risk Assessment

• Criteria related to tests for over and under-dispersion:

• over-fitting - sample marginal counts produce too many random zeros, leading to expected cell counts too high for non-zero cells and under-estimation of risk

• under-fitting - sample marginal counts don’t take into account structural zeros, leading to expected cell counts too low for non-zero cells and over-estimation of risk

• Criteria select the model using a forward search algorithm which minimizes $\hat{B}_i/\sqrt{\hat{v}_i}$ for $\hat{\tau}_i$, $i = 1, 2$, where $\hat{v}_i$ is the variance of $\hat{B}_i$


Disclosure Risk Assessment

Example: Population of 944,793 from UK 2001 Census; SRS sample of size 9,448

Key: Area (2), Sex (2), Age (101), Marital Status (6), Ethnicity (17), Economic Activity (10) - 412,080 cells

Model Selection:

• Starting solution: main-effects log-linear model, which indicates under-fitting (minimum error statistics too large)

• Add in higher interaction terms until minimum error statistics indicate fit


Model Search Example (SRS n=9,448)

True values: $\tau_1 = 159$, $\tau_2 = 355.9$

| Model | $\hat{\tau}_1$ | $\hat{\tau}_2$ | $\hat{B}_1/\sqrt{\hat{v}_1}$ | $\hat{B}_2/\sqrt{\hat{v}_2}$ |
|---|---|---|---|---|
| Independence - I | 386.6 | 701.2 | 48.54 | 114.19 |
| All 2 way - II | 104.9 | 280.1 | -1.57 | -2.65 |
| 1: I + {a*ec} | 243.4 | 494.3 | 54.75 | 59.22 |
| 2: 1 + {a*et} | 180.1 | 411.6 | 3.07 | 9.82 |
| 3: 2 + {a*m} | 152.3 | 343.3 | 0.88 | 1.73 |
| 4: 3 + {s*ec} | 149.2 | 337.5 | 0.26 | 0.92 |
| 5a: 4 + {ar*a} | 148.5 | 337.1 | -0.01 | 0.84 |
| 5b: 4 + {s*m} | 147.7 | 335.3 | 0.02 | 0.66 |
| 6b: 5b + {ar*a} | 147.0 | 335.0 | -0.24 | 0.56 |
| 6c: 5b + {ar*m} | 148.9 | 337.1 | -0.04 | 0.72 |
| 6d: 5b + {m*ec} | 146.3 | 331.4 | -0.24 | 0.03 |
| 7c: 6c + {m*ec} | 147.5 | 333.2 | -0.34 | 0.06 |
| 7d: 6d + {ar*a} | 145.6 | 331.0 | -0.44 | -0.03 |

Area – ar, Sex – s, Age – a, Marital Status – m, Ethnicity – et, Economic Activity – ec


Model Search Example

Preferred Model: {a*ec}{a*et}{a*m}{s*ec}{ar*a}

True Global Risk: $\tau_1 = 159$, $\tau_2 = 355.9$

Estimated Global Risk: $\hat{\tau}_1 = 148.5$, $\hat{\tau}_2 = 337.1$

[Figure: scatter plot of the estimated per-record risk measure and the true risk measure, both axes on a log scale from 0.001 to 1]


Disclosure Risk Assessment

• Skinner and Shlomo (2009) address complex survey designs:
  • Sampling clusters introduce dependencies - key variables cut across clusters, so the independence assumption holds in practice
  • Stratification – include strata id in key variables to account for differential inclusion probabilities
  • Survey weights - use pseudo maximum likelihood estimation, where the score function is modified to $\sum_k [\hat{F}_k - \exp(x_k'\beta)]\, x_k = 0$ and $\pi_k$ is changed to $\hat{\pi}_k = f_k / \hat{F}_k$, where $\hat{F}_k = \sum_{i \in k} w_i$

• Partition large tables and assess partitions separately (assumes that partitioning variable has an interaction with other key variables)


Disclosure Risk Assessment Under Misclassification

• Model assumes no misclassification errors either arising from data processes or purposely introduced for SDL

• Shlomo and Skinner (2010) address misclassification errors

• Let $M_{kj} = P(\tilde{X} = k \mid X = j)$, where $X$ and $\tilde{X}$ are the cross-classified key variables: $X$ in the population (fixed) and $\tilde{X}$ in the microdata (subject to misclassification)


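Disclosure Risk Assessment Under Misclassification

• The per-record disclosure risk measure of a match of external unit B to a unique record in microdata A that has undergone misclassification (with sampling fraction $\pi$):

$$P(A = B \mid \tilde{f}_k = 1) = \frac{M_{kk}/(1 - \pi M_{kk})}{\sum_j F_j M_{kj}/(1 - \pi M_{kj})} \qquad (1)$$

• For small misclassification and small sampling fractions:

$$\frac{M_{kk}}{\sum_j F_j M_{kj}} \quad \text{or} \quad \frac{M_{kk}}{\tilde{F}_k} \qquad (2)$$

• Global measure: $\tau_2 = \sum_k I(\tilde{f}_k = 1)\, M_{kk}/\tilde{F}_k$, estimated by:

$$\hat{\tau}_2 = \sum_k I(\tilde{f}_k = 1)\, M_{kk}\, \hat{E}\!\left(\frac{1}{\tilde{F}_k} \,\Big|\, \tilde{f}_k = 1\right) \qquad (3)$$

where per-record risk: $M_{kk}\, \hat{E}(1/\tilde{F}_k \mid \tilde{f}_k = 1)$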
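A small numerical sketch can show how close approximation (2) is to formula (1) when the sampling fraction is small. It assumes the population counts and the full misclassification matrix are known, indexed as M[k][j] = P(X~ = k | X = j); the function names and the toy numbers are illustrative assumptions.

```python
def match_probability(k, F, M, pi):
    """Formula (1): P(A = B | f~_k = 1), with M[k][j] = P(X~ = k | X = j)."""
    num = M[k][k] / (1.0 - pi * M[k][k])
    den = sum(F[j] * M[k][j] / (1.0 - pi * M[k][j]) for j in range(len(F)))
    return num / den

def match_probability_approx(k, F, M):
    """Formula (2), first form: M_kk / sum_j F_j M_kj."""
    return M[k][k] / sum(F[j] * M[k][j] for j in range(len(F)))

# Toy setting: three cells, 20% misclassification spread evenly, 1% sample.
F = [5, 20, 100]
M = [[0.8, 0.1, 0.1],
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8]]
exact = match_probability(0, F, M, pi=0.01)
approx = match_probability_approx(0, F, M)
```

With no misclassification (M equal to the identity) formula (1) collapses to 1/F_k, the measure used in the no-misclassification case.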


Perturbation Methods of SDL

PRAM (Post-randomisation method)

• Probability transition matrix $M^c$ containing conditional probabilities for a category $c$: $M^c_{kj} = P(\tilde{X}^c = k \mid X^c = j)$

• Let $T$ be a vector of frequencies

• On each record, category $c$ changed or not changed according to $M^c$ and the result of a draw of a random variate $u$

• $T^*$: vector of perturbed frequencies

• Unbiased moment estimator of the original data: $\hat{T} = T^* (M^c)^{-1}$, assuming $M^c$ has an inverse (dominant on the diagonals)
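The perturbation step and the moment estimator can be sketched in a few lines. To keep the matrix algebra unambiguous the code fixes one convention explicitly, which is an assumption of this sketch: `P[j][k]` is the probability that original category j is released as k, so each row sums to one and frequency vectors multiply from the left. All names (`pram`, `solve`, `moment_estimate`) are hypothetical.

```python
import random

def pram(values, P, rng):
    """Perturb category codes; P[j][k] = P(perturbed = k | original = j)."""
    cats = range(len(P))
    return [rng.choices(cats, weights=P[v])[0] for v in values]

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def moment_estimate(t_star, P):
    """T_hat = T* P^{-1}, computed by solving P' T_hat' = T*'."""
    P_transposed = [list(col) for col in zip(*P)]
    return solve(P_transposed, t_star)

P = [[0.8, 0.1, 0.1],
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8]]
T = [500, 300, 200]                                   # made-up original frequencies
expected_t_star = [sum(T[j] * P[j][k] for j in range(3)) for k in range(3)]
T_hat = moment_estimate(expected_t_star, P)           # recovers T from E[T*]
perturbed = pram([0, 1, 2] * 10, P, random.Random(0)) # one random realisation
```

Applied to the expected perturbed frequencies the estimator returns the original vector exactly; applied to a single realisation it is unbiased but noisy.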


Perturbation Methods of SDL

PRAM (Post-randomisation method) - cont.

• Invariant PRAM - Define: $T M^c = T$ (vector of the original frequencies is an eigenvector of $M^c$)

• An invariant matrix is obtained as $R^c = M^c Q^c$, where

$$Q^c_{jk} = P(X^c = j \mid \tilde{X}^c = k) = \frac{M^c_{kj}\, P(X^c = j)}{\sum_l M^c_{kl}\, P(X^c = l)}$$

• To ensure correct non-perturbation probability on diagonal, define $R^{c*} = \alpha R^c + (1 - \alpha) I$

• Expected values of marginal distribution preserved

• Exact marginal distribution preserved using a without replacement selection strategy to select records for perturbation
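The two-stage construction of an invariant matrix can be verified numerically. The sketch below again assumes a row convention, `P[j][k]` = P(perturbed = k | original = j), and uses the original frequencies in place of the category probabilities; `invariant_pram`, `matmul` and the example numbers are illustrative assumptions.

```python
def matmul(A, B):
    """Plain matrix product for lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def invariant_pram(P, T, alpha):
    """Return R* = alpha * (P Q) + (1 - alpha) * I, which satisfies T R* = T."""
    n = len(T)
    # Expected perturbed frequency of each category k.
    t_star = [sum(T[j] * P[j][k] for j in range(n)) for k in range(n)]
    # Bayes step: Q[k][j] = P(original = j | perturbed = k).
    Q = [[T[j] * P[j][k] / t_star[k] for j in range(n)] for k in range(n)]
    R = matmul(P, Q)
    return [[alpha * R[i][j] + (1 - alpha) * (i == j) for j in range(n)]
            for i in range(n)]

P = [[0.80, 0.15, 0.05],
     [0.10, 0.80, 0.10],
     [0.05, 0.15, 0.80]]
T = [500, 300, 200]                      # made-up original frequencies
R_star = invariant_pram(P, T, alpha=0.55)
TR = [sum(T[j] * R_star[j][k] for j in range(3)) for k in range(3)]
```

`TR` equals `T` up to rounding, so the expected marginal distribution is preserved; mixing with the identity keeps that property while raising the diagonal.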


Perturbation Methods of SDL

Random Record Swapping

• Exchange values of a key variable between pairs of records

• Pairs selected within control strata to minimize bias

• Typically geographical variable is swapped within a large area:
  • Geography highly identifiable
  • Conditional independence assumption usually met (sensitive variables relatively independent of geography)

• Does not produce inconsistent records

• Marginal distributions preserved at higher geographies

• Can be targeted to high risk records


Misclassification Example

• Population of individuals from 2001 United Kingdom (UK) Census, N=1,468,255

• 1% SRS sample, n=14,683

• Six key variables: Local Authority (LAD) (11), sex (2), age groups (24), marital status (6), ethnicity (17), economic activity (10); K=538,560


Misclassification Example

• Record Swapping: LAD swapped randomly, e.g. for a 20% swap:
  Diagonal: $M^c_{kk} = 0.8$
  Off diagonal: $M^c_{kj} = 0.2 \times n_k / \sum_{l \neq k} n_l$, where $n_k$ is the number of records in the sample from LAD $k$

• PRAM: LAD misclassified, e.g. for a 20% misclassification:
  Diagonal: $M^c_{kk} = 0.8$
  Off diagonal: $M^c_{kj} = 0.02$ $(= 0.2/10)$
  Parameter: $\alpha = 0.55$


Misclassification Example

• Random 20% perturbation on LAD

• Global risk measures: expected correct matches from sample uniques (SUs)

| Global Risk Measure | PRAM | Swapping |
|---|---|---|
| True risk measure in original sample | 358.1 | 362.4 |
| Estimated naïve risk measure ignoring misclassification | 349.5 | 358.6 |
| Risk measure on non-perturbed records | 292.2 | 292.8 |
| Risk measure under misclassification (1) | 299.7 | 298.9 |
| Sample uniques | 2,779 | 2,831 |
| Approximation based on diagonals $M^c_{kk}$ (2) | 299.8 | 298.9 |
| Estimated risk measure under misclassification (3) | 283.1 | 286.8 |

Expected correct match per sample unique: PRAM: 10.8%; Record swapping: 10.6%


Misclassification Example

• Estimating individual per-record risk measures for 20% random swap based on log linear modelling (log scale):

[Figure: scatter plot of Estimated Risk Measure (3) and Risk Measure (1), log scale]

• From perspective of intruder, difficult to identify high risk (population unique) records


Information Loss Measures

• Utility measured by whether inference can be carried out on perturbed data similar to original data

• Use proxy information loss measures on distributions calculated from microdata:

• Distance Metrics:

$$AAD(D_{orig}, D_{pert}) = \sum_{r,c} |D_{pert}(r,c) - D_{orig}(r,c)| / RC$$

where $RC$ is the number of cells in the distribution. Let $D_{avg} = \sum_{r,c} D_{orig}(r,c)/RC$; then

$$RAAD(D_{orig}, D_{pert}) = 100 \times (D_{avg} - AAD(D_{orig}, D_{pert})) / D_{avg}$$

is a measure of average absolute perturbation compared to average cell size

• Also, can consider Kolmogorov-Smirnov statistic, Hellinger’s Distance and relative differences in means or variances
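The two distance metrics are simple to compute on a pair of tables. The sketch below assumes AAD is the mean absolute cell difference and RAAD is 100 × (D_avg − AAD) / D_avg, with D_avg the average cell size of the original table; the function names and the toy tables are illustrative.

```python
def aad(d_orig, d_pert):
    """Average absolute difference per cell between two same-shaped tables."""
    cells = [(o, p) for row_o, row_p in zip(d_orig, d_pert)
             for o, p in zip(row_o, row_p)]
    return sum(abs(p - o) for o, p in cells) / len(cells)

def raad(d_orig, d_pert):
    """Perturbation relative to average cell size, as a percentage."""
    n_cells = sum(len(row) for row in d_orig)
    d_avg = sum(sum(row) for row in d_orig) / n_cells
    return 100.0 * (d_avg - aad(d_orig, d_pert)) / d_avg

d_orig = [[10, 20], [30, 40]]     # made-up original table
d_pert = [[12, 18], [30, 40]]     # two records moved between cells
aad_value = aad(d_orig, d_pert)   # 1.0
raad_value = raad(d_orig, d_pert) # 96.0
```

Under this definition an unperturbed table scores 100, and the score falls as the average perturbation grows relative to the average cell size.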


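Risk-Utility Map

[Figure: risk-utility map comparing PRAM and record swapping, for random perturbation versus targeted perturbation on non-white ethnicities]

Formulas shown on the slide: per-record risk $\Pr(A = B \mid \tilde{f}_j = 1) \approx M_{jj}/\tilde{F}_j$; global risk $\sum_{j \in SU} [M_{jj}/(1 - \pi M_{jj})] / [\sum_k F_k M_{jk}/(1 - \pi M_{jk})]$; estimate based on $\hat{E}(1/\tilde{F}_j \mid \tilde{f}_j = 1)$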


Probabilistic Record Linkage

• $\tilde{X}_a$: value of vector of cross-classified identifying key variables for unit $a$ in the microdata ($a \in s_1$)

• $X_b$: corresponding value for unit $b$ in the external database ($b \in s_2$) ($s_2 \subset P$)

• Misclassification mechanism via probability matrix: $P(\tilde{X}_a = k \mid X_a = j) = M_{kj}$

• Comparison vector $\gamma(\tilde{X}_a, X_b)$ for pairs of units $(a, b) \in s_1 \times s_2$

• For subset $\tilde{s} \subset s_1 \times s_2$, partition set of pairs in $\tilde{s}$ into Matches ($M$) and Non-matches ($U$) through likelihood ratio $m(\gamma)/u(\gamma)$, where

$$m(\gamma) = P(\gamma(\tilde{X}_a, X_b) \mid (a,b) \in M), \qquad u(\gamma) = P(\gamma(\tilde{X}_a, X_b) \mid (a,b) \in U)$$


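Probabilistic Record Linkage

• $p = P((a,b) \in M)$: probability that pair is in $M$

• Probability of a correct match:

$$p_{M|\gamma} = P((a,b) \in M \mid \gamma(\tilde{X}_a, X_b)) = \frac{m(\gamma)\,p}{m(\gamma)\,p + u(\gamma)(1 - p)}$$

• Estimate parameters using previous test data or EM algorithm and assuming conditional independence:

$$m(\gamma) = P(\gamma(\tilde{X}_a, X_b) \mid (a,b) \in M) = P(\gamma_1(\tilde{X}_a, X_b) \mid (a,b) \in M) \times P(\gamma_2(\tilde{X}_a, X_b) \mid (a,b) \in M) \times \cdots \times P(\gamma_K(\tilde{X}_a, X_b) \mid (a,b) \in M)$$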
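The probability of a correct match is Bayes' rule applied to the match and non-match agreement probabilities, and under conditional independence the agreement probability for a comparison vector is a product over its components. A minimal sketch with hypothetical function names and made-up inputs:

```python
def correct_match_probability(m, u, p):
    """P(pair is a match | comparison outcome), from m(gamma), u(gamma) and prior p."""
    return m * p / (m * p + u * (1.0 - p))

def agreement_probability(component_probs):
    """m(gamma) or u(gamma) under conditional independence across K components."""
    prob = 1.0
    for q in component_probs:
        prob *= q
    return prob

# Made-up per-variable agreement probabilities among matches and non-matches.
m_gamma = agreement_probability([0.95, 0.90, 0.80])
u_gamma = agreement_probability([0.50, 0.10, 0.05])
posterior = correct_match_probability(m_gamma, u_gamma, p=0.001)
```

Because true matches are rare among all candidate pairs, the posterior can stay modest even when the likelihood ratio m/u is large.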


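Linking the Two Frameworks

• No misclassification

| | Non-match | Match | Total |
|---|---|---|---|
| Disagree | $n(N-1) - f_j(F_j - 1)$ | $n - f_j$ | $nN - f_j F_j$ |
| Agree | $f_j(F_j - 1)$ | $f_j$ | $f_j F_j$ |
| Total | $n(N-1)$ | $n$ | $nN$ |

$$m(\gamma) = f_j/n, \qquad u(\gamma) = \frac{f_j(F_j - 1)}{n(N-1)}, \qquad p = 1/N$$

$$p_{M|\gamma} = \frac{(f_j/n)(1/N)}{(f_j/n)(1/N) + [f_j(F_j - 1)/(n(N-1))](1 - 1/N)} = \frac{1}{F_j}$$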
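The reduction of the record linkage probability to 1/F_j can be confirmed with exact rational arithmetic. The sketch below plugs the agree-row counts of the table into Bayes' rule; the function name and the example counts are hypothetical.

```python
from fractions import Fraction

def linkage_probability(f_j, F_j, n, N):
    """p_{M|gamma} for cell j with no misclassification, in exact arithmetic."""
    m = Fraction(f_j, n)                           # agree among the n matched pairs
    u = Fraction(f_j * (F_j - 1), n * (N - 1))     # agree among the n(N-1) non-matches
    p = Fraction(1, N)                             # share of matched pairs
    return m * p / (m * p + u * (1 - p))

# Made-up cell: one sample record in a cell holding seven population units.
prob = linkage_probability(f_j=1, F_j=7, n=100, N=10_000)
```

The result depends only on F_j: the sample count, n and N all cancel.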


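Linking the Two Frameworks

• With misclassification, denote $\tilde{f}_j = M_{jj} f_j + \sum_{k \neq j} M_{jk} f_k$

| | Non-match | Match | Total |
|---|---|---|---|
| Disagree | $nN - n - (\tilde{f}_j F_j - M_{jj} f_j)$ | $n - M_{jj} f_j$ | $nN - \tilde{f}_j F_j$ |
| Agree | $\tilde{f}_j F_j - M_{jj} f_j$ | $M_{jj} f_j$ | $\tilde{f}_j F_j$ |
| Total | $nN - n$ | $n$ | $nN$ |

$$m(\gamma) = M_{jj} f_j / n, \qquad u(\gamma) = \frac{\tilde{f}_j F_j - M_{jj} f_j}{n(N-1)}, \qquad p = 1/N$$

$$p_{M|\gamma} = \frac{(M_{jj} f_j/n)(1/N)}{(M_{jj} f_j/n)(1/N) + [(\tilde{f}_j F_j - M_{jj} f_j)/(n(N-1))](1 - 1/N)} = \frac{M_{jj} f_j}{\tilde{f}_j F_j}$$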
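The simplification under misclassification follows from the agree row of the table. A short worked step, using only the quantities $m(\gamma) = M_{jj} f_j/n$, $u(\gamma) = (\tilde{f}_j F_j - M_{jj} f_j)/(n(N-1))$ and $p = 1/N$:

```latex
\begin{align*}
m(\gamma)\,p + u(\gamma)(1-p)
  &= \frac{M_{jj} f_j}{n}\cdot\frac{1}{N}
   + \frac{\tilde f_j F_j - M_{jj} f_j}{n(N-1)}\cdot\frac{N-1}{N}
   = \frac{\tilde f_j F_j}{nN}, \\
p_{M|\gamma}
  &= \frac{M_{jj} f_j/(nN)}{\tilde f_j F_j/(nN)}
   = \frac{M_{jj} f_j}{\tilde f_j F_j}.
\end{align*}
```

Setting $M_{jj} = 1$ and $\tilde f_j = f_j$ recovers $1/F_j$, the result without misclassification.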


Linking the Two Frameworks

• Matching 2,853 sample uniques to the population and blocking on all key variables except LAD results in 1,534,293 possible pairs

| | Non-match | Match | Total |
|---|---|---|---|
| Disagree LAD | 1,388,069 | 619 | 1,388,688 |
| Agree LAD | 143,321 | 2,234 | 145,555 |
| Total | 1,531,390 | 2,853 | 1,534,293 |

$$m(\gamma) = 0.78, \qquad u(\gamma) = 0.09, \qquad p = 0.002$$

• On average across blocks, probability of a correct match given an agreement on LAD: $p_{M|\gamma} = 0.015$
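The reported parameters can be reproduced from the four interior cell counts of the table. This sketch takes those counts as given and derives the totals from them; the variable names are illustrative.

```python
# Interior cell counts from the LAD agreement table.
disagree_nonmatch, disagree_match = 1_388_069, 619
agree_nonmatch, agree_match = 143_321, 2_234

matches = disagree_match + agree_match            # sample uniques matched
nonmatches = disagree_nonmatch + agree_nonmatch
pairs = matches + nonmatches

m = agree_match / matches          # P(agree on LAD | match)
u = agree_nonmatch / nonmatches    # P(agree on LAD | non-match)
p = matches / pairs                # P(match)

# Bayes' rule; equivalently the match share of the "Agree LAD" row.
p_correct = m * p / (m * p + u * (1 - p))
```

Rounded, these give m = 0.78, u = 0.09, p = 0.002 and a correct-match probability of 0.015 given agreement on LAD.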


Linking the Two Frameworks

• Probability of a correct match given an agreement, $p_{M|\gamma}$, for each $\gamma(\tilde{X}_a, X_b) = j$

• Compare to risk measure $M_{jj}/\tilde{F}_j$

• Summing over $p_{M|\gamma}$ gives the global disclosure risk measure of 289.5


Discussion

• Global disclosure risk measures accurately estimated for a risk-utility assessment assuming known non-misclassification probability

• Empirical evidence of connection between Fellegi and Sunter (F&S) record linkage and probabilistic modelling for estimating identification risk

• Estimation carried out through log linear modelling for the probabilistic modelling or the EM algorithm for the F&S record linkage

• Individual disclosure risk measures more difficult to estimate without knowing true population parameters in both frameworks

• From the perspective of the intruder, it is difficult to identify sample uniques that are population uniques


Discussion

• Statistical Agencies (MRP) need to:
  - Assess disclosure risk objectively
  - Set tolerable risk thresholds according to different access modes
  - Optimize and combine SDL techniques
  - Provide guidelines on how to analyze disclosure controlled datasets

• Future dissemination strategies present new challenges:
  - Synthetic data for web access prior to accessing real data
  - Online SDL techniques for flexible table generating software and remote access
  - Auditing query systems

• Bridge the Statistical and Computer Science literature on privacy preserving algorithms


Thank you for your attention