Page 1: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Technical Aspects of Data Anonymisation & Pseudonymisation

Risks, Challenges & Mitigations

Matt Lewis

Principal Consultant

Page 2: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

04/08/2023 © NCC Group 2

Agenda

¨ NCC Group – who we are and what we do

¨ Anonymisation, Pseudonymisation & Re-identification – overview of concepts

¨ Examples – when anonymisation goes wrong

¨ Pitfalls of image anonymisation and other information leakage through meta-data

¨ A risk-based approach to anonymisation

¨ Summary and advice

¨ Questions

Page 3: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

NCC Group

¨ Global information assurance specialist

¨ 15,000 customers worldwide across all sectors

¨ The Group has two complementary divisions - escrow and assurance

¨ Independence from hardware and software providers ensures we provide unbiased and impartial advice

¨ Largest penetration testing team in the world, with approximately 250 consultants

Page 4: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Me: Brief Bio

¨ Over 12 years working in Information Security

¨ Previous Employers:
  • CESG – The Information Assurance arm of GCHQ
  • Information Risk Management (IRM) plc – penetration testing
  • KPMG – Executive Advisor in the Information Protection division of IT Advisory
  • NCC Group – Principal Consultant, providing penetration testing and consultancy around all aspects of Information Security

Page 5: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Anonymisation – Overview

¨ Anonymised data should be information that does not identify any individuals, either in isolation or when cross-referenced with other data already in the public domain

¨ A careful balance is required around the level of anonymisation versus the usefulness of the resultant data

¨ Quantitative versus Qualitative – the latter is harder to anonymise in a consistent way, and requires more rigour on a ‘per record’ basis – e.g. meeting minutes

Page 6: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Pseudonymisation – Overview

¨ Information is anonymous to the receiver (e.g. researchers), but contains codes or identifiers to allow others to re-identify individuals from the pseudonymised data

¨ Universally protecting pseudonymised data whilst allowing general analysis of it is difficult – requires careful management of the ‘codes’ or ‘keys’ that uniquely identify individuals

¨ Quantitative versus Qualitative – again, the latter is harder to pseudonymise in a consistent way, and requires more rigour on a ‘per record’ basis – e.g. meeting minutes

Page 7: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Anonymisation – Techniques and Methods

¨ There are four main operations available for anonymising data

¨ Suppression, Substitution/Distortion, Generalisation, Aggregation

¨ Consider the following dataset:

Name      Sex     Birth Date  Post Code  Complaint
John      Male    02/12/1954  SE24 6TY   Pain in left eye
Daniel    Male    05/01/1984  NW1 6XD    Chest pains
Sarah     Female  04/08/1978  E17 7WE    Chest pains
Samantha  Female  03/10/1960  WC1 7RA    Back pains
James     Male    09/09/1990  NW7 5LK    Headaches

Page 8: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Anonymisation – Techniques and Methods

¨ Suppression - deleting or omitting data fields entirely

Sex     Complaint
Male    Pain in left eye
Male    Chest pains
Female  Chest pains
Female  Back pains
Male    Headaches
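As a minimal sketch in Python, suppression amounts to dropping fields before release. The `records` list copies the first two rows of the example dataset; the `suppress` helper and the choice of fields to drop are illustrative names, not something the slides prescribe.

```python
# First two rows of the example dataset, held as a list of dicts.
records = [
    {"Name": "John", "Sex": "Male", "Birth Date": "02/12/1954",
     "Post Code": "SE24 6TY", "Complaint": "Pain in left eye"},
    {"Name": "Daniel", "Sex": "Male", "Birth Date": "05/01/1984",
     "Post Code": "NW1 6XD", "Complaint": "Chest pains"},
]


def suppress(rows, fields_to_drop):
    """Return copies of the rows with the named fields removed entirely."""
    return [
        {field: value for field, value in row.items() if field not in fields_to_drop}
        for row in rows
    ]


# Hypothetical release policy: keep only Sex and Complaint.
released = suppress(records, {"Name", "Birth Date", "Post Code"})
for row in released:
    print(row)
```

The helper returns new dicts, so the source records stay intact for internal use.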

Page 9: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Anonymisation – Techniques and Methods

¨ Substitution/Distortion – e.g. replace a person’s name with a unique number – this is also an example of pseudonymisation

Name     Sex     Birth Date  Post Code  Complaint
0000001  Male    02/12/1954  SE24 6TY   Pain in left eye
0000002  Male    05/01/1984  NW1 6XD    Chest pains
0000003  Female  04/08/1978  E17 7WE    Chest pains
0000004  Female  03/10/1960  WC1 7RA    Back pains
0000005  Male    09/09/1990  NW7 5LK    Headaches
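A rough Python sketch of this substitution follows. It assumes seven-digit sequential numbers as in the table; the function name `substitute_names` and the `key_table` lookup are illustrative. The lookup is what makes this pseudonymisation rather than anonymisation: whoever holds it can reverse the mapping.

```python
# First two rows of the example dataset (remaining fields omitted for brevity).
records = [
    {"Name": "John", "Sex": "Male", "Complaint": "Pain in left eye"},
    {"Name": "Daniel", "Sex": "Male", "Complaint": "Chest pains"},
]


def substitute_names(rows):
    """Replace each name with a sequential number; also return the lookup table."""
    released, key_table = [], {}
    for number, row in enumerate(rows, start=1):
        pseudonym = f"{number:07d}"          # e.g. 1 -> "0000001"
        key_table[pseudonym] = row["Name"]   # must be stored separately and protected
        released.append({**row, "Name": pseudonym})
    return released, key_table


released, key_table = substitute_names(records)
print(released[0]["Name"])  # 0000001
```

Sequential numbers preserve the input order, so if the source rows are sorted (for example alphabetically by name) it may be worth shuffling them before numbering.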

Page 10: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Anonymisation – Techniques and Methods

¨ Generalisation - alter rather than delete identifier values to increase privacy while preserving utility

Name      Sex     Birth Year  Post Code  Complaint
John      Male    1954        SE24       Pain in left eye
Daniel    Male    1984        NW1        Chest pains
Sarah     Female  1978        E17        Chest pains
Samantha  Female  1960        WC1        Back pains
James     Male    1990        NW7        Headaches
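The generalisation shown in the table can be sketched in Python as below. It assumes dates in DD/MM/YYYY form and UK post codes written with a space between the outward and inward parts; `generalise` is an illustrative name.

```python
def generalise(row):
    """Coarsen the birth date to a year and the post code to its outward part."""
    _day, _month, year = row["Birth Date"].split("/")
    outward = row["Post Code"].split()[0]
    # Copy every other field unchanged, then add the coarsened values.
    result = {k: v for k, v in row.items() if k not in ("Birth Date", "Post Code")}
    result["Birth Year"] = year
    result["Post Code"] = outward
    return result


record = {"Name": "John", "Sex": "Male", "Birth Date": "02/12/1954",
          "Post Code": "SE24 6TY", "Complaint": "Pain in left eye"}
print(generalise(record))
```

How far to coarsen (year versus decade, outward code versus postal area) is a judgement about utility against risk rather than something the code can decide.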

Page 11: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Anonymisation – Techniques and Methods

¨ Aggregation - produce summary statistics across a dataset instead of an anonymised dataset

¨ 40% of patients complain of chest pains

¨ 60% of patients are male

¨ etc.

Name      Sex     Birth Date  Post Code  Complaint
John      Male    02/12/1954  SE24 6TY   Pain in left eye
Daniel    Male    05/01/1984  NW1 6XD    Chest pains
Sarah     Female  04/08/1978  E17 7WE    Chest pains
Samantha  Female  03/10/1960  WC1 7RA    Back pains
James     Male    09/09/1990  NW7 5LK    Headaches
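The two summary figures above can be reproduced with a short Python sketch. Only the Sex and Complaint fields of the example dataset are included, and `percentages` is an illustrative helper name.

```python
from collections import Counter

# Sex and Complaint columns of the example dataset.
records = [
    {"Sex": "Male", "Complaint": "Pain in left eye"},
    {"Sex": "Male", "Complaint": "Chest pains"},
    {"Sex": "Female", "Complaint": "Chest pains"},
    {"Sex": "Female", "Complaint": "Back pains"},
    {"Sex": "Male", "Complaint": "Headaches"},
]


def percentages(rows, field):
    """Return the share of rows (in percent) for each distinct value of a field."""
    counts = Counter(row[field] for row in rows)
    return {value: 100 * count / len(rows) for value, count in counts.items()}


print(percentages(records, "Complaint")["Chest pains"])  # 40.0
print(percentages(records, "Sex")["Male"])               # 60.0
```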

Page 12: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Anonymisation of Qualitative Data

Page 13: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Anonymisation of Qualitative Data

Pseudonymisation in qualitative data can be much more difficult

The content/themes in this meeting for example might allow for re-identification of any pseudonymised individuals

Page 14: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Re-identification – What is it?

¨ Re-identification is the act of cross-referencing anonymised data with other data sources, and using inference, deduction and correlation to identify individuals

¨ Depending on the nature of data re-identified, this might raise data protection concerns

Page 15: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Re-identification – Who does this and Why?

¨ Researchers – e.g. computer scientists, genuinely interested in the challenges of re-identification

¨ Malicious individuals use re-identification information to discriminate, harass or discredit a victim

¨ Investigative journalists

¨ Organised crime – re-identification can facilitate creation of fake identities, or be used to extort victims (if data is personal/sensitive in nature)

¨ Competitors – seeking to re-identify and publish to discredit

¨ State sponsored data mining and correlation

¨ The Internet is essentially a vast, ever-growing cross-correlation database; access to most of which is open and free to anyone…

Page 16: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Re-identification

Page 17: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Inference as a Starting Point

¨ Recall our example:

Name      Sex     Birth Date  Post Code  Complaint
John      Male    02/12/1954  SE24 6TY   Pain in left eye
Daniel    Male    05/01/1984  NW1 6XD    Chest pains
Sarah     Female  04/08/1978  E17 7WE    Chest pains
Samantha  Female  03/10/1960  WC1 7RA    Back pains
James     Male    09/09/1990  NW7 5LK    Headaches

Page 18: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Inference as a Starting Point

¨ Suppose the following anonymised aggregations are published:

60% of patients complain of chest pains

60% of patients are male

100% of Back pain sufferers live in WC1

100% of patients are over 21 years of age

100% of females suffer from chest or back pains

20% of patients suffer from Pain in the left eye

¨ From this we can infer the following table fields:

Sex, Age, Condition, Post Code

Page 19: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Inference as a Starting Point

¨ Suppose we know the sample size (i.e. 5), and the time of data publication

60% of patients are male

Sex     Birth Date  Complaint         Post Code
Male
Male
Male
Female
Female

Page 20: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Inference as a Starting Point

100% of patients are over 21 years of age

Sex     Birth Date  Complaint         Post Code
Male    <= 1992
Male    <= 1992
Male    <= 1992
Female  <= 1992
Female  <= 1992

Page 21: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Inference as a Starting Point

100% of females suffer from chest or back pains

Sex     Birth Date  Complaint         Post Code
Male    <= 1992
Male    <= 1992
Male    <= 1992
Female  <= 1992     Chest Pains
Female  <= 1992     Back Pains

Page 22: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Inference as a Starting Point

100% of Back pain sufferers live in WC1

Sex     Birth Date  Complaint         Post Code
Male    <= 1992
Male    <= 1992
Male    <= 1992
Female  <= 1992     Chest Pains
Female  <= 1992     Back Pains        WC1

Page 23: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Inference as a Starting Point

20% of patients suffer from Pain in the left eye

Sex     Birth Date  Complaint         Post Code
Male    <= 1992     Pain in left eye
Male    <= 1992
Male    <= 1992
Female  <= 1992     Chest Pains
Female  <= 1992     Back Pains        WC1

Page 24: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Inference as a Starting Point

60% of patients complain of chest pains

¨ The next step would be to correlate/cross-reference with other sources

Sex     Birth Date  Complaint         Post Code
Male    <= 1992     Pain in left eye
Male    <= 1992     Chest Pains
Male    <= 1992     Chest Pains
Female  <= 1992     Chest Pains
Female  <= 1992     Back Pains        WC1
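The kind of inference walked through above can also be mechanised. The Python sketch below brute-forces every combination of sex and complaint for five records and keeps those consistent with the published percentages. It rests on assumptions that are not in the slides: the sample size is known to be 5, the only complaints considered are the three named in the aggregates, and age and post code are left out to keep the search small.

```python
from itertools import combinations_with_replacement, product

SAMPLE_SIZE = 5
SEXES = ("Male", "Female")
# Assumption: only the complaints named in the published aggregates occur.
COMPLAINTS = ("Pain in left eye", "Chest pains", "Back pains")


def consistent(rows):
    """Check one candidate dataset against the published aggregates."""
    males = sum(1 for sex, _ in rows if sex == "Male")
    chest = sum(1 for _, complaint in rows if complaint == "Chest pains")
    eye = sum(1 for _, complaint in rows if complaint == "Pain in left eye")
    females_ok = all(complaint in ("Chest pains", "Back pains")
                     for sex, complaint in rows if sex == "Female")
    return (males * 100 == 60 * SAMPLE_SIZE       # 60% of patients are male
            and chest * 100 == 60 * SAMPLE_SIZE   # 60% complain of chest pains
            and eye * 100 == 20 * SAMPLE_SIZE     # 20% suffer pain in the left eye
            and females_ok)                       # females: chest or back pains only


cells = list(product(SEXES, COMPLAINTS))
candidates = [rows for rows in combinations_with_replacement(cells, SAMPLE_SIZE)
              if consistent(rows)]
print(len(candidates))  # 2
for rows in candidates:
    print(sorted(rows))
```

Under these assumptions only two candidate datasets survive out of 252, and the reconstruction in the table above is one of them. A search space this small is what makes aggregates over tiny samples risky.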

Page 25: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

An Example: Massachusetts Group Insurance Commission

¨ In the mid-1990s GIC released anonymised data on state employees that showed every single hospital visit

¨ The goal was to help researchers; the state spent time removing all obvious identifiers such as name, address, and Social Security number

¨ William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers

¨ Computer Science graduate Dr. Latanya Sweeney requested a copy of the data and performed re-identification research on the dataset

¨ Main anonymisation technique used: Suppression

Page 26: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

An Example: Massachusetts Group Insurance Commission

¨ Governor Weld lived in Cambridge MA: 54,000 residents, 7 post codes

¨ Electoral roll purchased for $20: contained name, address, post code, birth date, sex etc.

¨ Cross-referencing the electoral roll with the GIC anonymised data:
  • Only 6 people in Cambridge shared Weld’s birthday
  • Only 3 of these were men
  • Only 1 lived in Weld’s post code

¨ Dr. Sweeney sent the Governor’s health records (which included diagnoses and prescriptions) to his office
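The linkage behind this example is essentially a join on the fields that survived suppression. The Python sketch below illustrates the idea only: every record, name and post code in it is invented, and `link` is a hypothetical helper, not a reconstruction of Dr. Sweeney's actual method.

```python
def link(anonymised, public_register, keys=("birth_date", "sex", "post_code")):
    """Pair each anonymised record with a name when exactly one person matches."""
    index = {}
    for person in public_register:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    matches = []
    for record in anonymised:
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # a unique match means the record is re-identified
            matches.append((names[0], record))
    return matches


# Entirely hypothetical data, for illustration only.
public_register = [
    {"name": "Resident A", "birth_date": "01/01/1950", "sex": "Male", "post_code": "ZONE-1"},
    {"name": "Resident B", "birth_date": "01/01/1950", "sex": "Male", "post_code": "ZONE-2"},
    {"name": "Resident C", "birth_date": "01/01/1950", "sex": "Female", "post_code": "ZONE-1"},
]
anonymised = [
    {"birth_date": "01/01/1950", "sex": "Male", "post_code": "ZONE-1", "diagnosis": "(hypothetical)"},
    {"birth_date": "02/02/1960", "sex": "Female", "post_code": "ZONE-3", "diagnosis": "(hypothetical)"},
]

for name, record in link(anonymised, public_register):
    print(name, "->", record["diagnosis"])
```

Running the same join against your own draft release, using whatever public registers an attacker could plausibly buy, is a cheap pre-publication check.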

Page 27: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Another Example: America Online (AOL) Data Release

¨ In 2006, AOL publicly released twenty million search queries for 650,000 users of AOL’s search engine summarising three months of activity

¨ AOL suppressed username and IP address, but replaced these with unique numbers that allowed researchers to correlate different searches with a specific user (pseudonymisation)

¨ New York Times reporters Michael Barbaro and Tom Zeller performed some research around User 4417749’s identity. His/her searches had included:

  • “landscapers in Lilburn, Ga”
  • “several people with the last name Arnold”
  • “homes sold in shadow lake subdivision gwinnett county georgia”

Page 28: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Another Example: America Online (AOL) Data Release

¨ The reporters tracked down Thelma Arnold, a sixty-two-year-old widow from Lilburn, Georgia who acknowledged that she had authored the searches, including queries such as “numb fingers”, “60 single men” and “dog that urinates on everything”

¨ Main anonymisation technique used: Suppression and Substitution

Page 29: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Anonymisation of non-textual data and Metadata

¨ Anonymisation might be required of non-textual data, e.g. images and videos (obfuscating faces)

¨ This might require release of hundreds or thousands of anonymised files, rather than one large dataset

¨ Often, hidden meta-data within those files is forgotten, and can be a valuable source for individuals attempting re-identification
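In JPEG files, hidden metadata of this kind is typically carried as Exif data inside an APP1 segment near the start of the file. The Python sketch below walks the segment headers and drops APP1 segments using only the standard library. It is deliberately simplified: it assumes a well-formed file, ignores fill bytes and other metadata carriers (such as comment segments), and is exercised on a synthetic byte string rather than a real photograph. The name `strip_app1` is illustrative.

```python
import struct

SOI = b"\xff\xd8"   # start-of-image marker
APP1 = 0xE1         # segment type that carries Exif (and XMP) metadata
SOS = 0xDA          # start-of-scan: compressed pixel data follows


def strip_app1(jpeg: bytes) -> bytes:
    """Return a copy of a JPEG byte string with every APP1 segment removed."""
    if not jpeg.startswith(SOI):
        raise ValueError("not a JPEG: missing start-of-image marker")
    out = bytearray(SOI)
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            raise ValueError("unexpected byte where a marker was expected")
        marker = jpeg[pos + 1]
        if marker == SOS:
            break  # the scan data and everything after it is copied untouched
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])  # includes its own 2 bytes
        if marker != APP1:
            out += jpeg[pos:pos + 2 + length]
        pos += 2 + length
    out += jpeg[pos:]
    return bytes(out)


# Synthetic byte string for illustration only (not a viewable image).
exif_payload = b"Exif\x00\x00GPS 51.5N 0.1W"
sample = (
    SOI
    + b"\xff\xe0" + struct.pack(">H", 2 + 4) + b"JFIF"
    + b"\xff\xe1" + struct.pack(">H", 2 + len(exif_payload)) + exif_payload
    + b"\xff\xda" + b"\x00\x01\x02"
    + b"\xff\xd9"
)
cleaned = strip_app1(sample)
print(b"GPS" in sample, b"GPS" in cleaned)  # True False
```

For a real release, a maintained metadata-removal tool is the safer choice; the sketch is only meant to show where the data hides.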

Page 30: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Anonymisation of non-textual data and Metadata

A simple example – a picture of a heron visiting my garden

Suppose I obfuscate the heron to protect its identity

Page 31: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

GPS/Meta-Data in Image Files

Page 32: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Meta-Data Extraction

Page 33: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

A Risk-Based Approach

¨ There is anonymisation guidance, but there is no anonymisation formula

¨ Anonymisation is not an exact science

¨ Each data set presents a unique instance, and the choice of anonymisation operation(s) must be carefully considered in order to maintain anonymity and utility

¨ A risk-based approach is therefore the only option…

Page 34: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Risk Mitigation Advice

¨ Consult with experts before embarking on anonymisation, on the proposed approach and potential risks

¨ If there is no business case or perceived benefit in going through the process, then don’t

¨ Consider release to limited audiences – only go public if strictly required

¨ Protect the anonymisation method/formula

¨ Quantitative anonymisation can typically be automated; Qualitative anonymisation will require more manual effort

¨ Always vet anonymisations of Quantitative and Qualitative data before release; don’t just fire and forget

Page 35: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Risk Mitigation Advice

¨ Perform your own rudimentary Google searches and correlation attempts with other public data sources before publishing

¨ If in doubt, engage again with experts on the likelihood of re-identification given the derived anonymised data set

¨ Consider the quantity of released anonymised data - in practically all re-identification studies performed, researchers have been more successful with larger databases

¨ Try and remove one or more of the top 3 culprits: Post Code, Birthdate and Sex.
  • In 2000 Dr. Latanya Sweeney showed that 87% of all Americans could be uniquely identified using only these three bits of information

¨ Metadata – prior to release, ensure all meta-data in documents is removed. A number of tools exist for this, depending on the document type (e.g. Adobe Acrobat, Microsoft Office, JPEGs etc.)
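One way to gauge the effect of the "top 3 culprits" before release is to count how many records are unique on them. The Python sketch below does this for the example dataset used earlier; `unique_fraction` is an illustrative helper, and a real assessment would also need to consider external data sources.

```python
from collections import Counter

# Sex, Birth Date and Post Code columns of the example dataset.
records = [
    {"Sex": "Male", "Birth Date": "02/12/1954", "Post Code": "SE24 6TY"},
    {"Sex": "Male", "Birth Date": "05/01/1984", "Post Code": "NW1 6XD"},
    {"Sex": "Female", "Birth Date": "04/08/1978", "Post Code": "E17 7WE"},
    {"Sex": "Female", "Birth Date": "03/10/1960", "Post Code": "WC1 7RA"},
    {"Sex": "Male", "Birth Date": "09/09/1990", "Post Code": "NW7 5LK"},
]


def unique_fraction(rows, fields):
    """Fraction of rows whose combination of the given fields appears exactly once."""
    combos = Counter(tuple(row[f] for f in fields) for row in rows)
    unique = sum(1 for row in rows if combos[tuple(row[f] for f in fields)] == 1)
    return unique / len(rows)


print(unique_fraction(records, ("Sex", "Birth Date", "Post Code")))  # 1.0
print(unique_fraction(records, ("Sex",)))                            # 0.0
```

With all three fields present every record is unique; with Sex alone none is. The useful release usually sits somewhere between those extremes.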

Page 36: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Risk Mitigation Advice - Pseudonymisation

¨ Don’t choose reversible identifiers – for example, names replaced with identifiers that are based on record data
  • John Smith, 05/01/1978 -> JS05011978

¨ Be aware of potential inferences from sorted data (e.g. alphabetical ordering might provide clues for re-identification)

¨ Keep the pseudonymisation formula secret – protect it with the same controls as for encryption keys and passwords

¨ Perform pseudonymisation functions on segregated, secure environments, only copy/migrate the pseudonymised data (keep them separate)

¨ Remove all meta-data from pseudonymised data files – make sure the individual(s) performing the pseudonymisation are not referenced in the meta-data

¨ Any cryptographic hashing used as identifiers/keys should always be salted
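As a hedged sketch of hashed identifiers in Python: the example below uses HMAC-SHA256 with a random secret key, which is a keyed hash rather than a plain salted hash but plays the same role of a secret that must be known to recompute a pseudonym. The function names, the 16-character truncation and the `name|birth date` input format are all illustrative assumptions.

```python
import hashlib
import hmac
import secrets


def new_secret_key() -> bytes:
    """Generate a random key; store it like an encryption key, apart from the data."""
    return secrets.token_bytes(32)


def pseudonym(identifier: str, key: bytes) -> str:
    """Derive a stable pseudonym that cannot be recomputed without the key."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability; keep more characters for large datasets


key = new_secret_key()
print(pseudonym("John Smith|05/01/1978", key))
```

Unlike `JS05011978`, the output reveals nothing about the name or birth date, and the same person still maps to the same pseudonym across files as long as the key is unchanged. Without a secret, an attacker could simply hash every plausible name and date combination and compare.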

Page 38: 2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0

Questions?

Matt Lewis

UK Offices

Manchester - Head Office

Cheltenham

Edinburgh

Leatherhead

London

Thame

North American Offices

San Francisco

Atlanta

New York

Seattle

Australian Offices

Sydney

European Offices

Amsterdam - Netherlands

Munich – Germany

Zurich - Switzerland