Keeping Data Confidential in an Era of No Privacy
Prof. Jerry Reiter
Department of Statistical Science
Duke University

Disclosure limitation setting
- Agency seeks to release data on individuals
- Risk of re-identification from matching to external databases
- Statistical disclosure limitation (SDL) applied to data before release

Standard approaches to disclosure limitation
- Recode variables
- Suppress data
- Swap data
- Add random noise
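
To make the four approaches concrete, here is a minimal sketch on a toy list of incomes. The function names, the income values, the cap, and the noise scale are all illustrative assumptions, not part of the talk; real SDL procedures are considerably more elaborate.

```python
import random

def top_code(values, cap):
    """Recode: replace every value above `cap` with `cap` (top-coding)."""
    return [min(v, cap) for v in values]

def suppress(values, should_hide):
    """Suppress: blank out (None) every value flagged by `should_hide`."""
    return [None if should_hide(v) else v for v in values]

def swap(values, fraction, rng):
    """Swap: exchange values between randomly chosen pairs of records."""
    out = list(values)
    k = int(len(out) * fraction) // 2 * 2  # even number of records to swap
    idx = rng.sample(range(len(out)), k)
    for a, b in zip(idx[::2], idx[1::2]):
        out[a], out[b] = out[b], out[a]
    return out

def add_noise(values, sd, rng):
    """Add random noise: perturb each value with independent Gaussian noise."""
    return [v + rng.gauss(0.0, sd) for v in values]

rng = random.Random(0)  # fixed seed so the toy example is repeatable
incomes = [21_000, 35_000, 48_000, 52_000, 67_000, 250_000]  # hypothetical data

top_coded = top_code(incomes, 100_000)
suppressed = suppress(incomes, lambda v: v > 100_000)
swapped = swap(incomes, 0.5, rng)
noisy = add_noise(incomes, 1_000.0, rng)
```

Top-coding and suppression both act on the outlying 250,000 record, while swapping leaves the set of values unchanged and only moves them between records.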

General issues with standard SDL
- Recoding
  - Loses information in tails, disables fine spatial analysis, creates ecological fallacies
- Suppression
  - Creates nonignorable missing data
  - May not be fully protective

General issues with standard SDL
- Swapping
  - Attenuates correlations
  - Protection based on perception
- Adding noise
  - Inflates variances, distorts distributions, attenuates correlations
  - May need large noise variances
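
The effect of added noise on variances and correlations can be checked with a small simulation. This is only a sketch under assumed settings (standard normal x, a linear y, noise with standard deviation 1); none of these numbers come from the talk.

```python
import math
import random

def variance(xs):
    """Sample variance with the usual n - 1 denominator."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rng = random.Random(1)
n = 5000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]      # confidential variable
y = [xi + rng.gauss(0.0, 0.5) for xi in x]       # related variable, released as is
x_noisy = [xi + rng.gauss(0.0, 1.0) for xi in x]  # x after noise addition

print("var(x):", variance(x), "var(x_noisy):", variance(x_noisy))
print("corr(x, y):", correlation(x, y), "corr(x_noisy, y):", correlation(x_noisy, y))
```

Under these assumptions the noisy variable has roughly twice the variance of the original, and its correlation with y is visibly weaker.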

Fully synthetic data
Rubin (1993, JOS): create multiple, fully synthetic datasets for public release so that:
- No unit in released data has sensitive data from actual unit in population
- Released data look like actual data
- Statistical procedures valid for original data are valid for released data

Generating fully synthetic data
- Randomly sample new units from frame (can use simple random samples)
- Impute survey variables for new units using models fit from observed data
- Repeat multiple times and release m datasets
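
The three steps can be sketched as follows, assuming a frame in which a covariate x is known for every unit and a survey variable y is observed only for the sample. The simple linear regression, the function names, and all numeric settings are assumptions made for illustration. A proper synthesizer would also draw the model parameters from their posterior distribution; this sketch plugs in point estimates to stay short.

```python
import math
import random

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x, plus residual SD."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, math.sqrt(rss / (n - 2))

def synthesize_fully(frame_x, obs_x, obs_y, n_syn, m, rng):
    """Return m synthetic datasets, each a list of (x, synthetic y) pairs."""
    intercept, slope, sigma = fit_line(obs_x, obs_y)  # model fit from observed data
    datasets = []
    for _ in range(m):
        new_x = rng.sample(frame_x, n_syn)  # simple random sample of new units
        new_y = [intercept + slope * x + rng.gauss(0.0, sigma) for x in new_x]
        datasets.append(list(zip(new_x, new_y)))
    return datasets

rng = random.Random(2)
frame_x = [rng.uniform(0.0, 10.0) for _ in range(1000)]  # x known on the whole frame
sample_idx = rng.sample(range(len(frame_x)), 200)
obs_x = [frame_x[i] for i in sample_idx]
obs_y = [2.0 + 3.0 * x + rng.gauss(0.0, 1.0) for x in obs_x]  # y collected in survey

synthetic = synthesize_fully(frame_x, obs_x, obs_y, n_syn=100, m=5, rng=rng)
```

Every released y value is a model draw, so the quality of the released data rests entirely on how well the synthesis model captures the observed relationships.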

Inferences from fully synthetic datasets
Raghunathan, Reiter, Rubin (2003, Journal of Official Statistics)
- Estimand: $Q = Q(X, Y)$
- In each synthetic dataset $d_i$:
$$q_i = Q(d_i)$$
$$u_i = U(d_i)$$

Quantities needed for inferences
$$\bar{q}_m = \sum_{i=1}^{m} q_i / m$$
$$b_m = \sum_{i=1}^{m} (q_i - \bar{q}_m)^2 / (m - 1)$$
$$\bar{u}_m = \sum_{i=1}^{m} u_i / m$$

Inferences from fully synthetic data
- Estimate of $Q$: $\bar{q}_m$
- Estimate of variance: $T_f = (1 + 1/m)\, b_m - \bar{u}_m$
- For large n, s, m, use normal based inference for $Q$: $\bar{q}_m \pm 1.96 \sqrt{T_f}$
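
As a worked illustration of these combining rules, the sketch below computes the point estimate, the variance estimate T_f = (1 + 1/m) b_m - u_bar_m, and the normal-based 95% interval from m per-dataset results. The function names and the toy numbers are assumptions. Because T_f subtracts the average within-dataset variance it can come out non-positive; this sketch simply raises an error in that case rather than applying any adjustment.

```python
import math

def combine_fully_synthetic(q, u):
    """Combine point estimates q[i] and variance estimates u[i] from m datasets."""
    m = len(q)
    q_bar = sum(q) / m                                  # estimate of Q
    b_m = sum((qi - q_bar) ** 2 for qi in q) / (m - 1)  # between-dataset variance
    u_bar = sum(u) / m                                  # average within-dataset variance
    t_f = (1 + 1 / m) * b_m - u_bar                     # variance estimate, full synthesis
    return q_bar, t_f

def normal_interval(q_bar, t):
    """95% interval q_bar +/- 1.96 * sqrt(t); requires a positive variance estimate."""
    if t <= 0:
        raise ValueError("variance estimate is not positive")
    half = 1.96 * math.sqrt(t)
    return q_bar - half, q_bar + half

# Hypothetical results from m = 5 synthetic datasets.
q = [10.0, 12.0, 11.0, 13.0, 9.0]
u = [0.5, 0.5, 0.5, 0.5, 0.5]

q_bar, t_f = combine_fully_synthetic(q, u)
low, high = normal_interval(q_bar, t_f)
```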

Advantages of full synthesis
- No sensitive data released: very high protection
- No need to decide which values to alter nor which variables are quasi-identifiers
- Potential to preserve associations, maintain geographies, release data in tails
- Analysts can use standard methods on simple random samples
- Protection does not depend on hiding the nature of the SDL from the public

Drawbacks of full synthesis
- Analysts have to deal with multiple datasets (not a serious issue)
- Quality of data highly dependent on quality of synthesis models
  - Relationships omitted in models are not in released data
  - Inaccurate distributions are passed on to analysts
  - Analysts can rediscover only what is in the synthesis models

A modification of the proposal: Partially synthetic data
Little (1993, JOS): create multiple, partially synthetic datasets for public release so that:
- Released data comprise mix of observed and synthetic values
- Released data look like actual data
- Statistical procedures valid for original data are valid for released data
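
A minimal sketch of partial synthesis, assuming that only y values above a threshold are considered sensitive and that a simple linear regression of y on x serves as the synthesis model. The threshold, the model, and every name below are illustrative assumptions; as in any short sketch, parameter uncertainty is ignored.

```python
import math
import random

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x, plus residual SD."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, math.sqrt(rss / (n - 2))

def synthesize_partially(xs, ys, is_sensitive, m, rng):
    """Return m datasets; only y values flagged as sensitive are replaced."""
    intercept, slope, sigma = fit_line(xs, ys)
    datasets = []
    for _ in range(m):
        new_y = [
            intercept + slope * x + rng.gauss(0.0, sigma) if is_sensitive(y) else y
            for x, y in zip(xs, ys)
        ]
        datasets.append(list(zip(xs, new_y)))  # collected units and x values are kept
    return datasets

rng = random.Random(3)
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]
ys = [2.0 + 3.0 * x + rng.gauss(0.0, 1.0) for x in xs]

def is_sensitive(y):
    return y > 25.0  # hypothetical rule: large values are at risk

partial = synthesize_partially(xs, ys, is_sensitive, m=3, rng=rng)
```

Each released dataset keeps the original records and the original x values, which is why matching to the observed data remains possible in a way it is not under full synthesis.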

[Figure, repeated on three slides: Observed Data and Synthetic Datasets panels, each with columns x and y]

Existing applications
- Replace sensitive values for selected units:
  - Survey of Consumer Finances
  - County-to-county migration flows (current)
- Replace values of identifiers for selected units:
  - American Community Survey group quarters
  - Tract IDs for NCI SEER cancer registry data
- Replace all values of sensitive variables:
  - Longitudinal Business Database
  - OnTheMap
  - Survey of Income and Program Participation

Inference with partially synthetic datasets (no missing data)
Reiter (2003, Survey Methodology)
- Estimand: $Q = Q(X, Y)$
- In each synthetic dataset $d_i$:
$$q_i = Q(d_i)$$
$$u_i = U(d_i)$$

Inference with partially synthetic data (no missing data)
- Estimate of $Q$: $\bar{q}_m$
- Estimate of variance: $T_p = \bar{u}_m + b_m / m$
- For large n and m, use normal based inference for $Q$: $\bar{q}_m \pm 1.96 \sqrt{T_p}$
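
For partial synthesis the variance estimate is T_p = u_bar_m + b_m / m, where u_bar_m is the average of the per-dataset variance estimates and b_m is the variance of the per-dataset point estimates. The sketch below applies it to toy numbers; the function name and inputs are assumptions. Unlike the full-synthesis variance estimate, T_p is a sum of non-negative terms.

```python
import math

def combine_partially_synthetic(q, u):
    """Point estimate, variance estimate T_p, and 95% normal interval."""
    m = len(q)
    q_bar = sum(q) / m                                  # estimate of Q
    b_m = sum((qi - q_bar) ** 2 for qi in q) / (m - 1)  # between-dataset variance
    u_bar = sum(u) / m                                  # average within-dataset variance
    t_p = u_bar + b_m / m                               # variance estimate, partial synthesis
    half = 1.96 * math.sqrt(t_p)
    return q_bar, t_p, (q_bar - half, q_bar + half)

# Hypothetical results from m = 5 partially synthetic datasets.
q = [10.0, 12.0, 11.0, 13.0, 9.0]
u = [0.5, 0.5, 0.5, 0.5, 0.5]

q_bar, t_p, (low, high) = combine_partially_synthetic(q, u)
```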

Fully synthetic
- New units sampled
- Cannot match: low disclosure risk
- Full reliance on imputation models
- Released data SRS
- May need large synthetic sample sizes or m

Partially synthetic
- Collected units used
- Matches to observed data possible
- Partial reliance on imputation models
- Original design
- Small m can be adequate for replacements

Open research questions
- Synthesis models for specific data types:
  - Data nested within households
  - Longitudinal data
  - Social network data
  - And many more…
- Record linkage with synthetic data

Guide to literature: Overviews of synthetic data
- Rubin (1993, Journal of Official Statistics)
- Little (1993, Journal of Official Statistics)
- Abowd and Woodcock (2001) in Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies
- Reiter (2004, Chance)

Guide to literature: Inferences with synthetic data
- Full synthesis: Raghunathan, Reiter, Rubin (2003, Journal of Official Statistics)
- Partial synthesis (no missing): Reiter (2003, Survey Methodology)
- Partial synthesis with missing data: Reiter (2004, Survey Methodology)
- Significance tests of multi-component hypotheses
  - Full synthesis and partial synthesis (no missing): Reiter (2005, Journal of Statistical Planning and Inference)
  - Partial synthesis with missing: Kinney and Reiter (2010, Journal of Official Statistics)
- Model selection in regression: Kinney, Reiter, and Berger (forthcoming, Journal of Privacy and Confidentiality)

Guide to literature: Generating synthetic data
- Sequential regression approaches: Abowd and Woodcock (2004) in Privacy in Statistical Databases
- Classification and regression trees: Reiter (2005, Journal of Official Statistics)
- Survey weights and partial synthesis: Mitra and Reiter (2006) in Privacy in Statistical Databases
- Bayesian networks: Young, Graham, Penny (2009, Journal of Official Statistics)
- Regression with kernel density transformations: Woodcock and Benedetto (2009, Computational Statistics and Data Analysis)
- Random forests: Caiola and Reiter (2010, Transactions on Data Privacy)
- Support vector machines: Drechsler (2010) in Privacy in Statistical Databases

Guide to literature: Disclosure risk estimation
- Record linkage for partial synthesis: Abowd, Stinson, Benedetto (2006) technical report
- Identification risks in partial synthesis
  - Reiter and Mitra (2009, Journal of Privacy and Confidentiality)
  - Drechsler and Reiter (2008) in Privacy in Statistical Databases
- Differential privacy and synthetic data: Abowd and Vilhuber (2008) in Privacy in Statistical Databases

Guide to literature: Utility of synthetic data
- Complex designs in full synthesis: Reiter (2002, Journal of Official Statistics)
- Impact of number of datasets on quality: Drechsler and Reiter (2009, Journal of Official Statistics)
- Verification servers: Reiter, Oganian, and Karr (2009, Computational Statistics and Data Analysis)

Guide to literature: Genuine applications
- Synthesis instead of topcoding: An and Little (2007, Journal of the Royal Statistical Society - A)
- Survey of Income and Program Participation linked data: www.census.gov/sipp/synth_data.html
- Longitudinal Business Database: Kinney and Reiter (2007, Proceedings of the Joint Statistical Meetings)
- American Community Survey group quarters: Hawala (2008, Proceedings of the Joint Statistical Meetings)
- OnTheMap: http://lehdmap4.did.census.gov/themap4/
- German Establishment Panel: Drechsler, Bender, and Rassler (2008, Transactions on Data Privacy)

Guide to literature: Other adaptions
- Combining two confidential datasets
  - Kohnen and Reiter (2009, Journal of the Royal Statistical Society - A)
  - Reiter (2009, International Statistical Review)
- Synthesize some variables m times and others r times (Reiter and Drechsler 2010, Statistica Sinica)
- Sampling from a census followed by synthesis of confidential data (Drechsler and Reiter 2010, Journal of the American Statistical Association)