October 2011 Linda Fardell Cross Portfolio Data Integration Secretariat The secret lives of us: data confidentiality


Page 1: October 2011

October 2011

Linda Fardell
Cross Portfolio Data Integration Secretariat

The secret lives of us: data confidentiality

Page 2: October 2011

What is it & why should you care?

• It’s about obligations – legal/ethical

• Aim – protect identity and release useful data

• It’s more than removing name & address

• Trust of providers is essential to get good stats

Page 3: October 2011

Information is power

• Banker in Maryland obtained a list of patients with cancer
    • compared with list of clients with outstanding loans
    • called in the loans of clients with cancer.

Source: Data confidentiality: a review of methods for statistical disclosure limitation and methods for assessing privacy, Statist. Surv., Volume 5 (2011), 1-29.

Page 4: October 2011

Legislative obligations

• Privacy Act

• Specific legislation governing collection & use of information, e.g.
    • Social Security (Administration) Act 1999
    • Taxation Administration Act 1953

Page 5: October 2011

Other obligations

• Principles-based obligations, e.g. High Level Principles for Data Integration Involving Commonwealth Data for Statistical and Research Purposes

Page 6: October 2011

How agencies meet these obligations

• Implement procedures to address all aspects of data protection

• To ensure that identifiable information:
    • is not released publicly;
    • is available on a ‘need to know’ basis;
    • can’t be derived from disseminated data; and
    • is maintained and accessed securely.

Page 7: October 2011

Managing identification risk

• Understand your obligations

• Establish policies and procedures

• De-identify the data

• Assess potential identification risks

• Manage the risks of identification - confidentialise

• Test and evaluate to mitigate risks

• Provide safe access to data

Page 8: October 2011

Access to other information

• Keep track of all information released from the dataset.

Page 9: October 2011

When should a cell be confidentialised?

• Common confidentiality rules:
    • frequency (threshold) rule
    • cell dominance (cell concentration) rule

• Keep specific confidentiality procedures secret (e.g. the particular value chosen when applying the threshold rule)

Page 10: October 2011

Two general methods

• Data reduction

• Data modification (perturbation)

Page 11: October 2011

Example: frequency rule - 5

Before

                  Income
Age        Low    Med   High  Total
15–19       20      0      0     20
20–29       14     11      8     33
30–39        8     12      7     27
40–49        6     18     24     48
50–59        4      5     14     23
60+         12      9      7     28
Total       64     55     60    179

Page 12: October 2011

Example: cont.

After

                  Income
Age        Low    Med   High  Total
15–19       20      0      0     20
20–29       14     11      8     33
30–39        8     12      7     27
40–49     n.p.   n.p.     24     48
50–59     n.p.   n.p.     14     23
60+         12      9      7     28
Total       64     55     60    179
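The before and after tables can be reproduced with a short sketch. It assumes that "frequency rule - 5" means a non-zero count below 5 is unsafe, and that n.p. marks a suppressed (presumably "not published") cell. The function names are illustrative only. The check for recoverable cells is deliberately simple: it flags a suppressed cell that is alone in its row or column, because the published totals then reveal it by subtraction.

```python
# Counts from the "before" table: rows are age groups, columns are Low/Med/High income.
TABLE = {
    "15-19": [20, 0, 0],
    "20-29": [14, 11, 8],
    "30-39": [8, 12, 7],
    "40-49": [6, 18, 24],
    "50-59": [4, 5, 14],
    "60+": [12, 9, 7],
}
THRESHOLD = 5  # assumed reading of "frequency rule - 5": counts of 1 to 4 are unsafe


def primary_suppressions(table, threshold):
    """Return (row, column index) pairs whose non-zero count is below the threshold."""
    return [
        (age, col)
        for age, counts in table.items()
        for col, count in enumerate(counts)
        if 0 < count < threshold
    ]


def recoverable_by_subtraction(suppressed):
    """Return suppressed cells that published row or column totals give away.

    A cell is exposed when it is the only suppressed cell in its row or in its
    column, because the total minus the published cells equals its value.
    """
    exposed = []
    for age, col in suppressed:
        in_row = [cell for cell in suppressed if cell[0] == age]
        in_col = [cell for cell in suppressed if cell[1] == col]
        if len(in_row) == 1 or len(in_col) == 1:
            exposed.append((age, col))
    return exposed


primary = primary_suppressions(TABLE, THRESHOLD)
print(primary)                              # [('50-59', 0)]
print(recoverable_by_subtraction(primary))  # [('50-59', 0)]

# The "after" table hides three more cells so that every affected row and
# column contains two suppressed values.
slide_pattern = primary + [("50-59", 1), ("40-49", 0), ("40-49", 1)]
print(recoverable_by_subtraction(slide_pattern))  # []
```

Suppressing only the cell with a count of 4 would not protect it, which is why the after table also withholds three cells that are safe on their own.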

Page 13: October 2011

Alternative: concealing totals

                  Income
Age        Low Medium   High  Total
15–19       20      0      0     20
20–29       14     11      8     33
30–39        8     12      7     27
40–49        6     18     24     48
50–59     n.p.      5     14    >19
60+         12      9      7     28
Total      >60     55     60   >175
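The bounds in this table follow from subtracting the hidden cell from each total that contains it. The sketch below is a minimal illustration using the values from the earlier "before" table; the helper name `concealed` is an assumption, not part of the presentation.

```python
# Values taken from the earlier table; the hidden cell is the 50-59 / Low count.
HIDDEN_CELL = 4
ROW_TOTAL = 23        # 50-59 row
COLUMN_TOTAL = 64     # Low column
GRAND_TOTAL = 179


def concealed(total, hidden_cell):
    """Format a total as a strict lower bound that excludes the hidden cell."""
    return f">{total - hidden_cell}"


published = {
    "50-59 row total": concealed(ROW_TOTAL, HIDDEN_CELL),
    "Low column total": concealed(COLUMN_TOTAL, HIDDEN_CELL),
    "Grand total": concealed(GRAND_TOTAL, HIDDEN_CELL),
}
print(published)
# {'50-59 row total': '>19', 'Low column total': '>60', 'Grand total': '>175'}
```

Only one cell is withheld here, at the cost of publishing three totals as bounds rather than exact values.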

Page 14: October 2011

E.g. 2 – the cell dominance (n,k) rule

Widget brand   Profit ($m)
A                      150
B                       93
C                       21
D                       13
E                        8
F                        8
G                        6
H                        1
Total                  300

• Cell unsafe if combined contributions of the ‘n’ largest members of the cell represent more than ‘k’% of the total value of the cell

• n & k values are set by data custodian

• Example: (2, 75) rule
    • A & B contribute 81% of total profit, so profit needs protecting
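The (n,k) test can be written out as a small check. This is a sketch using the widget profits from the slide; the function names are illustrative, and n and k would be chosen by the data custodian.

```python
# Profit ($m) by widget brand, from the slide.
PROFITS = {"A": 150, "B": 93, "C": 21, "D": 13, "E": 8, "F": 8, "G": 6, "H": 1}


def dominance_share(contributions, n):
    """Percentage of the cell total contributed by the n largest members."""
    values = sorted(contributions, reverse=True)
    return 100 * sum(values[:n]) / sum(values)


def is_unsafe(contributions, n, k):
    """Apply the (n, k) rule: unsafe if the top n members exceed k% of the total."""
    return dominance_share(contributions, n) > k


share = dominance_share(PROFITS.values(), n=2)
print(share)                                    # 81.0
print(is_unsafe(PROFITS.values(), n=2, k=75))   # True
```

A and B together account for 243 of the 300 total, so the cell fails the (2, 75) rule, as the slide states.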

Page 15: October 2011

Data modification methods

Before rounding (RR3)

                  Income
Age        Low    Med   High  Total
15–19       20      0      0     20
20–29       14     11      8     33
30–39        8     12      7     27
40–49        6     18     24     48
50–59        4      5     14     23
60+         12      9      7     28
Total       64     55     60    179

Page 16: October 2011

Data modification methods

After rounding (RR3)

                  Income
Age        Low    Med   High  Total
15–19       21      0      0     21
20–29       15     12      9     33
30–39        9     12      6     27
40–49        6     18     24     48
50–59        3      6     15     24
60+         12      9      6     27
Total       63     54     60    180
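A minimal sketch of the rounding step, assuming RR3 denotes random rounding to base 3. The slides do not give the rounding probabilities, so the unbiased scheme used here (round up with probability remainder/3) is an assumption, and the function name is illustrative.

```python
import random

BASE = 3  # assumed meaning of "RR3": random rounding to base 3


def random_round(value, base=BASE, rng=random):
    """Round a count to a neighbouring multiple of base, chosen at random.

    Multiples of base are left unchanged. Otherwise the count moves up with
    probability remainder/base and down with the complementary probability,
    so the expected value of the result equals the original count.
    """
    remainder = value % base
    if remainder == 0:
        return value
    lower = value - remainder
    if rng.random() < remainder / base:
        return lower + base
    return lower


rng = random.Random(2011)   # seeded only to make the sketch repeatable
low_income = [20, 14, 8, 6, 4, 12]
rounded = [random_round(count, rng=rng) for count in low_income]
print(rounded)
```

Each cell, including each total, is rounded independently, so a rounded table need not add up. In the table above, the 20–29 row shows 15, 12 and 9 against a total of 33.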

Page 17: October 2011

Microdata

• Valuable resource

• 2 key types of disclosure risk:

1. spontaneous recognition

2. deliberate (malicious) attempt

Page 18: October 2011

Microdata – managing risks

• confidentialising

• deterrents

• restricting access

• educating data users about their obligations

• safe environment for access

Page 19: October 2011

Microdata – methods to assess risks

• cross-tabulation of variables;

• comparing sample data with population data to see if the unique characteristics in the sample are unique in the population; and

• acquiring knowledge of other datasets & publicly available information that could be used for list matching.
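The first two checks can be sketched with a simple cross-tabulation. All records, variables and population counts below are hypothetical, made up for illustration, and the function names are assumptions.

```python
from collections import Counter

# Hypothetical de-identified sample: (age group, region, occupation).
SAMPLE = [
    ("30-39", "Region A", "teacher"),
    ("30-39", "Region A", "teacher"),
    ("60+", "Region B", "surgeon"),
    ("20-29", "Region A", "farmer"),
]

# Hypothetical population counts for the same combinations of variables.
POPULATION = Counter({
    ("30-39", "Region A", "teacher"): 140,
    ("60+", "Region B", "surgeon"): 1,
    ("20-29", "Region A", "farmer"): 12,
})


def sample_uniques(records):
    """Cross-tabulate the variables and return combinations that occur once."""
    counts = Counter(records)
    return [combo for combo, n in counts.items() if n == 1]


def population_uniques(records, population):
    """Return sample-unique combinations that are also unique in the population."""
    return [combo for combo in sample_uniques(records) if population[combo] == 1]


print(sample_uniques(SAMPLE))
# [('60+', 'Region B', 'surgeon'), ('20-29', 'Region A', 'farmer')]
print(population_uniques(SAMPLE, POPULATION))
# [('60+', 'Region B', 'surgeon')]
```

A record that is unique in the sample but common in the population carries less risk than one that is unique in both, which is the distinction the second bullet draws.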

Page 20: October 2011

Protecting microdata

• 1st level of protection: remove direct identifiers

• Common ways to protect microdata are:

1. confidentialising; and/or

2. restricting access to the file

Page 21: October 2011

Confidentialising microdata

• Same principles as protecting aggregate data:

• limit variables

• introduce small amounts of random error (e.g. data swapping)

• combine categories (e.g. age in 5 year ranges)

• top/bottom code

• suppress particular values/records that can’t otherwise be protected.
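Three of these techniques (limiting variables, combining categories, and top/bottom coding) can be sketched on a single record. The field names and cut-off values are hypothetical and chosen only for illustration.

```python
AGE_CAP = 85          # hypothetical top-code for age
INCOME_FLOOR = 0      # hypothetical bottom-code for income
INCOME_CAP = 150_000  # hypothetical top-code for income


def age_range(age, width=5, cap=AGE_CAP):
    """Combine single years of age into 5 year ranges, top-coded at cap."""
    if age >= cap:
        return f"{cap}+"
    start = (age // width) * width
    return f"{start}-{start + width - 1}"


def top_bottom_code(value, floor, cap):
    """Clamp extreme values so that outliers no longer stand out."""
    return max(floor, min(value, cap))


def confidentialise(record):
    """Keep only the released variables, in coarsened form (name is dropped)."""
    return {
        "age_range": age_range(record["age"]),
        "income": top_bottom_code(record["income"], INCOME_FLOOR, INCOME_CAP),
    }


record = {"name": "hypothetical person", "age": 37, "income": 480_000}
print(confidentialise(record))  # {'age_range': '35-39', 'income': 150000}
```

Data swapping and record suppression are not shown; they need the whole file rather than one record at a time.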

Page 22: October 2011

Restricting access to microdata

Page 23: October 2011

What affects the risk of identification?

• motivation

• level of detail

• presence of rare characteristics

• accuracy of the data

• age of the data

• coverage of the data (completeness)

• presence of other information

Page 24: October 2011

A note on terminology…

• Confusion between de-identification and confidentialisation
