
Post on 21-Dec-2015

Page 1:

Statistical confidentiality and privacy:
1. General considerations

* * *
Robert McCaa
Minnesota Population Center
rmccaa@umn.edu

“Inadequate use of microdata has high costs”
--Len Cook (2003, registrar general, ONS)

Page 2:

UNSD Principles and Recommendations (Rev. 1, 1997) endorse dissemination of census microdata

» §1.218: “There are a range of methods…that can be used to make such microdata available while still protecting individuals’ rights to privacy.” (Rev. 2 has a stronger statement.)

» In four decades of distributing microdata there is not a single allegation of a breach of confidentiality or privacy (includes 100% microdata stored at CELADE in Santiago, Chile).

Page 3:

Why disseminate microdata?
Julia Lane, European Statisticians Conference (2003)

» 1. Analyze more realistic questions
» 2. Develop reality-based policy
» 3. Acquire new constituencies and stakeholders
» 4. Build trust; reduce suspicions of data cooking
» 5. Replicate findings
    » a. use standards of UNSD, Eurostat, ISCO, ISCED, etc.
    » b. facilitate comparative research in time and space
» 6. Calculate marginal effects
» 7. Assess data quality
» …and much, much more…

Page 4:

Confidentializing an integrated microdata base with:
» 200+ samples of households (70+ countries)
» Containing ½ billion person records with thousands of variables
» Available to tens of thousands of licensed users regardless of country of birth, citizenship, residence or place of work
» Without a single allegation of violation of privacy or statistical confidentiality--

What’s the problem?

Page 5:

Usage: Off-site vs. on-site use (secure microdata laboratory)?
Germany RDC, 2005-8: ten-to-one

[Figure: bar chart of remote data access vs. on-site use by year, 2005-2008 (2008 covers Jan-Sept). Data labels in the chart: 677, 1107, 2271, 154, 1328, 359, 133.]

RDCs are expensive and attract few users.

Page 6:

“Statistical disclosure control methods may modify the data or the design of the statistic, or a combination of both. They will be judged sufficient when the guarantee of confidentiality can be maintained, taking account of information likely to be available to third parties, either from other sources or as previously released National Statistics outputs, against the following standard:
“It would take a disproportionate amount of time, effort and expertise for an intruder to identify a statistical unit to others, or to reveal information about that unit not already in the public domain.”
Protocols on Data Access and Confidentiality, pp. 7-8 --ONS-UK (2004)
www.statistics.gov.uk/about_ns/cop/downloads/prot_data_access_confidentiality.pdf

Page 7:

Risk assessment of household samples of UK 1991 census: attempts at matching are “fruitless”
few matches; many false positives

» After taking into account errors in the data, coding variability and changing of personal characteristics in time

» Dale and Elliott, JRSS-A (2003): “For a user of an outside database, attempting this sort of match with no opportunity for verification would prove fruitless. In the first place, the small degree of expected overlap would be a considerable deterrent to an intruder. However, if a match between the two files was attempted the large number of apparent matches would be highly confusing as an intruder would have no way of checking correct identification.”
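
The "few matches; many false positives" argument can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from Dale and Elliott; it assumes simple random sampling, and the function name, parameters and figures are hypothetical, chosen only to show how a small sampling fraction limits true overlap while look-alike records inflate apparent matches.

```python
def expected_matches(n_external, sampling_fraction, avg_lookalikes):
    """Rough expected outcome of matching an outside database against a sample.

    All three inputs are illustrative assumptions:
    n_external        -- records in the intruder's outside database
    sampling_fraction -- share of the population included in the released sample
    avg_lookalikes    -- average number of people in the population who share
                         a target's key values (the target included)
    """
    # A target can only be found if it was drawn into the sample.
    true_matches = n_external * sampling_fraction
    # Every look-alike is drawn with the same probability, so the number of
    # apparent matches grows with the number of look-alikes.
    apparent_matches = n_external * sampling_fraction * avg_lookalikes
    false_matches = apparent_matches - true_matches
    share_correct = true_matches / apparent_matches if apparent_matches else 0.0
    return true_matches, false_matches, share_correct


# Hypothetical figures: 1,000 outside records, a 2% sample, 20 look-alikes each.
true_m, false_m, share = expected_matches(1000, 0.02, 20)
print(true_m, false_m, share)
```

With these made-up inputs only about 20 of the 1,000 outside records are expected to be in the sample at all, and they are buried among roughly 380 false matches, so an intruder with no means of verification cannot tell which apparent matches are correct.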

Page 8:

[Figure: Level of Anonymization (FSO-Germany). Axes: degree of confidentiality; degree of analysis potential. Data stages: complete microdata; confidential microdata; de-facto anonymised microdata; fully anonymised microdata. Steps: delete direct identifier; anonymisation method; stronger anonymisation method.]

Trade-off between confidentiality and analysis potential: is it monotonic (as portrayed)?

Page 9:

[Figure: Level of Anonymization: not monotonic. Axes: degree of confidentiality; degree of analysis potential. Data stages: complete microdata; confidential microdata; de-facto anonymised microdata; fully anonymised microdata. Steps: delete direct identifier; anonymisation method; stronger anonymisation method. Added annotations: "& Construct sample" and the percentage labels 95%, 50%, 25%, 45%, 99%, 99.9%.]

Trade-off is not monotonic

Page 10:

Resources

» UN-ECE (2007), Managing Statistical Confidentiality & Microdata Access http://www.unece.org/stats/documents/tfcm.htm

» IHSN Tools & Guidelines, anonymization: www.surveynetwork.org

» Eurostat (1999)

Page 11:

UN-ECE (2007)

www.unece.org/stats/documents/tfcm.htm

Page 14:

IHSN www.Surveynetwork.org

1. Remove variables
    • Identifiers: name, address, low-level administrative geography
    • Sensitive: tribe, disability
2. Global recoding
    • Aggregate classes: age (5 yr groups), country of birth (continent), administrative geography, occupation (4 digit → 3), etc.
    • Top and bottom coding (continuous variables--income, size of residence, number of rooms, etc.)
3. Local suppression--sparse categories (population n < 250…2,500)
4. Data swapping (household geography)
5. Complex perturbations
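
As a rough illustration of steps 2 and 3, the sketch below shows one way global recoding, top and bottom coding, and local suppression might be written. It is not IHSN code: the function names, band width, top code and thresholds are assumptions chosen for the example, and the sparse-category check counts records in the list it is given rather than the underlying population.

```python
from collections import Counter


def recode_age(age, width=5, top=85):
    """Global recoding: group single years of age into bands, with a top code."""
    if age >= top:
        return f"{top}+"
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def truncate_occupation(code, digits=3):
    """Global recoding: keep only the leading digits of an occupation code."""
    return str(code)[:digits]


def top_bottom_code(value, bottom, top):
    """Top and bottom coding: clamp a continuous value into [bottom, top]."""
    return max(bottom, min(top, value))


def suppress_sparse(values, min_count, missing="suppressed"):
    """Local suppression: blank out categories with fewer than min_count records."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else missing for v in values]


print(recode_age(37), recode_age(90))
print(truncate_occupation(2411))
print(top_bottom_code(250000, 0, 100000))
print(suppress_sparse(["a", "a", "b"], min_count=2))
```

In practice the thresholds would follow the national convention for the census in question; the values here are placeholders.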

Page 15:

EUROSTAT statistical confidentiality standards (Thorogood, 1999) --all endorsed by IPUMS-International

» 1. Restrict access to samples

» 2. Limit geographical detail

» 3. Re-code unique categories--top and bottom

» 4. Sign non-disclosure agreement

» 5. Prohibit redistribution to third parties

» 6. Prohibit attempts to identify individuals or the making of any claim to that effect

» 7. Require users to provide copies of publications

Page 16:

EUROSTAT statistical confidentiality standards (Thorogood, 1999) --all endorsed by IPUMS-International

• 8. Construct age from birthdate, if necessary
• 9. Do not identify date of birth
• 10. Do not identify precise place of birth
• 11. Migration: timing/place not identified in detail
• 12. Identify place of residence by major civil division (pop>20k, 60k, 100k, 1 million; i.e., national convention)
• 13. Do sensitivity analysis
• 14. Do confidentiality assessment (not yet)
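
Standards 8 and 9 can be sketched together: derive age in completed years on the census reference date, then drop the date of birth from the released record. This is a minimal illustration, not Eurostat or IPUMS code; the field name `birthdate`, the record layout and the dates are assumptions.

```python
from datetime import date


def age_at_census(birthdate, census_date):
    """Completed years of age on the census reference date."""
    years = census_date.year - birthdate.year
    # Subtract a year if the birthday has not yet occurred in the census year.
    if (census_date.month, census_date.day) < (birthdate.month, birthdate.day):
        years -= 1
    return years


def release_record(record, census_date):
    """Return a copy of the record with age added and date of birth removed."""
    out = {k: v for k, v in record.items() if k != "birthdate"}
    out["age"] = age_at_census(record["birthdate"], census_date)
    return out


# Hypothetical person record and census reference date.
person = {"sex": "F", "birthdate": date(1980, 6, 15)}
print(release_record(person, date(2000, 4, 1)))
```

The input record is left untouched, so the full date of birth stays only in the confidential file.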

Page 17:

“There has been no known attempt at identification with the 1991 SARs [microdata samples of the UK]-nor in any other countries that disseminate samples of microdata”
--Elliott and Dale, Journal of the Royal Statistical Society, 1999

Countering Fear, Hysteria and Paranoia…with reason

Page 18:

ChoicePoint Data Sources and Clients. Source: Washington Post

http://www.choicepoint.com/

Why Not?
Companies want linkable data with names, addresses, ID #s, etc.

* * *
Probabilistic linking with 90% of the population missing is not good enough

Page 20:

Page 21:

“There has been no known attempt at identification with the 1991 SARs [microdata samples of the UK]-nor in any other countries that disseminate samples of microdata”
--Elliott and Dale, Journal of the Royal Statistical Society, 1999

Countering Fear, Hysteria and Paranoia…with reason

Page 22:

Please allow me to invite you to think about producing (or permitting IPUMS to produce) anonymized, integrated samples for all the censuses of your country for which microdata survive…

Thank you

* * *
Contact: rmccaa@umn.edu

this ppt is available at: www.hist.umn.edu/~rmccaa/ipums-global

See “Port of Spain workshop”