2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0
Technical Aspects of Data Anonymisation & Pseudonymisation
Risks, Challenges & Mitigations
Matt Lewis
Principal Consultant
04/08/2023 © NCC Group 2
Agenda
¨ NCC Group – who we are and what we do
¨ Anonymisation, Pseudonymisation & Re-identification – overview of concepts
¨ Examples – when anonymisation goes wrong
¨ Pitfalls of image anonymisation and other information leakage through meta-data
¨ A risk-based approach to anonymisation
¨ Summary and advice
¨ Questions
NCC Group
¨ Global information assurance specialist
¨ 15,000 customers worldwide across all sectors
¨ The Group has two complementary divisions – escrow and assurance
¨ Independence from hardware and software providers ensures we provide unbiased and impartial advice
¨ Largest penetration testing team in the world, with approximately 250 consultants
Me: Brief Bio
¨ Over 12 years working in Information Security
¨ Previous Employers:
• CESG – the Information Assurance arm of GCHQ
• Information Risk Management (IRM) plc – penetration testing
• KPMG – Executive Advisor in the Information Protection division of IT Advisory
• NCC Group – Principal Consultant, providing penetration testing and consultancy around all aspects of Information Security
Anonymisation – Overview
¨ Anonymised data should be information that does not identify any individuals, either in isolation or when cross-referenced with other data already in the public domain
¨ A careful balance is required around the level of anonymisation versus the usefulness of the resultant data
¨ Quantitative versus Qualitative – the latter is harder to anonymise in a consistent way, and requires more rigour on a ‘per record’ basis – e.g. meeting minutes
Pseudonymisation – Overview
¨ Information is anonymous to the receiver (e.g. researchers), but contains codes or identifiers to allow others to re-identify individuals from the pseudonymised data
¨ Universally protecting pseudonymised data whilst allowing general analysis of it is difficult – requires careful management of the ‘codes’ or ‘keys’ that uniquely identify individuals
¨ Quantitative versus Qualitative – again, the latter is harder to pseudonymise in a consistent way, and requires more rigour on a ‘per record’ basis – e.g. meeting minutes
Anonymisation – Techniques and Methods
¨ There are four main operations available for anonymising data
¨ Suppression, Substitution/Distortion, Generalisation, Aggregation
¨ Consider the following dataset:
Name Sex Birth Date Post Code Complaint
John Male 02/12/1954 SE24 6TY Pain in left eye
Daniel Male 05/01/1984 NW1 6XD Chest pains
Sarah Female 04/08/1978 E17 7WE Chest pains
Samantha Female 03/10/1960 WC1 7RA Back pains
James Male 09/09/1990 NW7 5LK Headaches
Anonymisation – Techniques and Methods
¨ Suppression - deleting or omitting data fields entirely
Sex Complaint
Male Pain in left eye
Male Chest pains
Female Chest pains
Female Back pains
Male Headaches
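As a sketch (Python is used here and throughout purely for illustration; the dataset is the slide's own example), suppression amounts to releasing only a whitelist of fields:

```python
# Suppression: release only a whitelist of fields, deleting the rest.
# Dataset mirrors the slide's example.
patients = [
    {"name": "John",     "sex": "Male",   "birth_date": "02/12/1954", "post_code": "SE24 6TY", "complaint": "Pain in left eye"},
    {"name": "Daniel",   "sex": "Male",   "birth_date": "05/01/1984", "post_code": "NW1 6XD",  "complaint": "Chest pains"},
    {"name": "Sarah",    "sex": "Female", "birth_date": "04/08/1978", "post_code": "E17 7WE",  "complaint": "Chest pains"},
    {"name": "Samantha", "sex": "Female", "birth_date": "03/10/1960", "post_code": "WC1 7RA",  "complaint": "Back pains"},
    {"name": "James",    "sex": "Male",   "birth_date": "09/09/1990", "post_code": "NW7 5LK",  "complaint": "Headaches"},
]

def suppress(records, keep):
    """Return copies of the records containing only the whitelisted fields."""
    return [{k: r[k] for k in keep} for r in records]

suppressed = suppress(patients, keep=("sex", "complaint"))
print(suppressed[0])  # {'sex': 'Male', 'complaint': 'Pain in left eye'}
```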
Anonymisation – Techniques and Methods
¨ Substitution/Distortion – e.g. replace a person’s name with a unique number – this is also an example of pseudonymisation
Name Sex Birth Date Post Code Complaint
0000001 Male 02/12/1954 SE24 6TY Pain in left eye
0000002 Male 05/01/1984 NW1 6XD Chest pains
0000003 Female 04/08/1978 E17 7WE Chest pains
0000004 Female 03/10/1960 WC1 7RA Back pains
0000005 Male 09/09/1990 NW7 5LK Headaches
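A minimal sketch of the substitution above: the sequential codes are what make this pseudonymisation, since the code-to-name key table still exists and must be protected separately.

```python
# Substitution: replace each name with a sequential code. The code->name
# key table is retained by the data controller, not released.
patients = [
    {"name": "John",   "complaint": "Pain in left eye"},
    {"name": "Daniel", "complaint": "Chest pains"},
    {"name": "Sarah",  "complaint": "Chest pains"},
]

def pseudonymise(records):
    key_table, released = {}, []
    for i, r in enumerate(records, start=1):
        code = f"{i:07d}"              # 0000001, 0000002, ...
        key_table[code] = r["name"]
        released.append({**r, "name": code})
    return released, key_table

released, key_table = pseudonymise(patients)
print(released[0]["name"])  # 0000001
```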
Anonymisation – Techniques and Methods
¨ Generalisation - alter rather than delete identifier values to increase privacy while preserving utility
Name Sex Birth Year Post Code Complaint
John Male 1954 SE24 Pain in left eye
Daniel Male 1984 NW1 Chest pains
Sarah Female 1978 E17 Chest pains
Samantha Female 1960 WC1 Back pains
James Male 1990 NW7 Headaches
Anonymisation – Techniques and Methods
¨ Aggregation - produce summary statistics across a dataset instead of an anonymised dataset
¨ 40% of patients complain of chest pains
¨ 60% of patients are male
¨ etc.
Name Sex Birth Date Post Code Complaint
John Male 02/12/1954 SE24 6TY Pain in left eye
Daniel Male 05/01/1984 NW1 6XD Chest pains
Sarah Female 04/08/1978 E17 7WE Chest pains
Samantha Female 03/10/1960 WC1 7RA Back pains
James Male 09/09/1990 NW7 5LK Headaches
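The aggregates above can be computed as a sketch over the complaint and sex columns of the slide's dataset:

```python
# Aggregation: publish summary statistics only, never row-level data.
from collections import Counter

complaints = ["Pain in left eye", "Chest pains", "Chest pains", "Back pains", "Headaches"]
sexes = ["Male", "Male", "Female", "Female", "Male"]

def percentage(values, target):
    return 100 * Counter(values)[target] / len(values)

print(f"{percentage(complaints, 'Chest pains'):.0f}% of patients complain of chest pains")  # 40%
print(f"{percentage(sexes, 'Male'):.0f}% of patients are male")                             # 60%
```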
Anonymisation of Qualitative Data
Pseudonymisation of qualitative data can be much more difficult
The content and themes in this meeting, for example, might allow re-identification of pseudonymised individuals
Re-identification – What is it?
¨ Re-identification is the act of cross-referencing anonymised data with other data sources, and using inference, deduction and correlation to identify individuals
¨ Depending on the nature of data re-identified, this might raise data protection concerns
Re-identification – Who does this and Why?
¨ Researchers – e.g. computer scientists, genuinely interested in the challenges of re-identification
¨ Malicious individuals – use re-identification to discriminate against, harass or discredit a victim
¨ Investigative journalists
¨ Organised crime – re-identification can facilitate creation of fake identities, or be used to extort victims (if data is personal/sensitive in nature)
¨ Competitors – seeking to re-identify and publish to discredit
¨ State-sponsored data mining and correlation
¨ The Internet is essentially a vast, ever-growing cross-correlation database; access to most of which is open and free to anyone…
Re-identification
Inference as a Starting Point
¨ Recall our example:
Name Sex Birth Date Post Code Complaint
John Male 02/12/1954 SE24 6TY Pain in left eye
Daniel Male 05/01/1984 NW1 6XD Chest pains
Sarah Female 04/08/1978 E17 7WE Chest pains
Samantha Female 03/10/1960 WC1 7RA Back pains
James Male 09/09/1990 NW7 5LK Headaches
Inference as a Starting Point
¨ Suppose the following anonymised aggregations are published:
40% of patients complain of chest pains
60% of patients are male
100% of Back pain sufferers live in WC1
100% of patients are over 21 years of age
100% of females suffer from chest or back pains
20% of patients suffer from Pain in the left eye
¨ From this we can infer the following table fields:
Sex, Age, Condition, Post Code
Inference as a Starting Point
¨ Suppose we know the sample size (i.e. 5), and the time of data publication
60% of patients are male
Sex Birth Date Complaint Post Code
Male
Male
Male
Female
Female
Inference as a Starting Point
100% of patients are over 21 years of age
Sex Birth Date Complaint Post Code
Male <= 1992
Male <= 1992
Male <= 1992
Female <= 1992
Female <= 1992
Inference as a Starting Point
100% of females suffer from chest or back pains
Sex Birth Date Complaint Post Code
Male <= 1992
Male <= 1992
Male <= 1992
Female <= 1992 Chest Pains
Female <= 1992 Back Pains
Inference as a Starting Point
100% of Back pain sufferers live in WC1
Sex Birth Date Complaint Post Code
Male <= 1992
Male <= 1992
Male <= 1992
Female <= 1992 Chest Pains
Female <= 1992 Back Pains WC1
Inference as a Starting Point
20% of patients suffer from Pain in the left eye
Sex Birth Date Complaint Post Code
Male <= 1992 Pain in left eye
Male <= 1992
Male <= 1992
Female <= 1992 Chest Pains
Female <= 1992 Back Pains WC1
Inference as a Starting Point
40% of patients complain of chest pains
¨ The next step would be to correlate/cross-reference with other sources
Sex Birth Date Complaint Post Code
Male <= 1992 Pain in left eye
Male <= 1992 Chest Pains
Male <= 1992
Female <= 1992 Chest Pains
Female <= 1992 Back Pains WC1
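The stepwise inference above can be sketched in code: start from the known sample size, then apply each published aggregate as a constraint on the skeleton table.

```python
# Rebuild a skeleton table from the published aggregates, applying each
# one as a constraint, exactly as in the walkthrough above.
n = 5  # known sample size
table = [dict.fromkeys(("sex", "birth", "complaint", "post_code")) for _ in range(n)]

# "60% of patients are male"
for i, row in enumerate(table):
    row["sex"] = "Male" if i < round(0.6 * n) else "Female"

# "100% of patients are over 21 years of age" (data published in 2013)
for row in table:
    row["birth"] = "<= 1992"

# "100% of females suffer from chest or back pains"
females = [r for r in table if r["sex"] == "Female"]
females[0]["complaint"], females[1]["complaint"] = "Chest pains", "Back pains"

# "100% of back pain sufferers live in WC1"
for row in table:
    if row["complaint"] == "Back pains":
        row["post_code"] = "WC1"

# The remaining aggregates would fill further cells the same way.
print(table[4])
```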
An Example: Massachusetts Group Insurance Commission (GIC)
¨ In the mid-1990s GIC released anonymised data on state employees that showed every single hospital visit
¨ The goal was to help researchers; the state spent time removing all obvious identifiers such as name, address, and Social Security number
¨ William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers
¨ Dr. Latanya Sweeney, then a computer science graduate student, requested a copy of the data and performed re-identification research on the dataset
¨ Main anonymisation technique used: Suppression
An Example: Massachusetts Group Insurance Commission (GIC)
Governor Weld lived in Cambridge, MA: 54,000 residents, 7 post codes
Electoral roll purchased for $20: contained name, address, post code, birth date, sex etc.
Cross-referencing with the GIC anonymised data:
• Only 6 people in Cambridge shared Weld’s birthday
• Only 3 of these were men
• Only 1 lived in Weld’s post code
Dr. Sweeney sent the Governor’s health records (which included diagnoses and prescriptions) to his office
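The cross-referencing step can be sketched as a simple filter over an electoral roll using the quasi-identifiers shared with the "anonymised" health data. All records and values below are invented for illustration.

```python
# Sketch of the cross-referencing step: filter an electoral roll by the
# quasi-identifiers it shares with an "anonymised" health record.
# All records below are invented for illustration.
electoral_roll = [
    {"name": "Resident A", "birth_date": "31/07/1945", "sex": "Male",   "post_code": "02138"},
    {"name": "Resident B", "birth_date": "31/07/1945", "sex": "Male",   "post_code": "02139"},
    {"name": "Resident C", "birth_date": "31/07/1945", "sex": "Female", "post_code": "02138"},
]
health_record = {"birth_date": "31/07/1945", "sex": "Male", "post_code": "02138"}

quasi_identifiers = ("birth_date", "sex", "post_code")
candidates = [p for p in electoral_roll
              if all(p[k] == health_record[k] for k in quasi_identifiers)]
print(len(candidates))  # 1 -- a single match re-identifies the health record
```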
Another Example: America Online (AOL) Data Release
¨ In 2006, AOL publicly released twenty million search queries for 650,000 users of AOL’s search engine summarising three months of activity
¨ AOL suppressed username and IP address, but replaced these with unique numbers that allowed researchers to correlate different searches with a specific user (pseudonymisation)
¨ New York Times reporters Michael Barbaro and Tom Zeller performed some research around User 4417749’s identity. His/her searches had included:
• “landscapers in Lilburn, Ga”
• “several people with the last name Arnold”
• “homes sold in shadow lake subdivision gwinnett county georgia”
Another Example: America Online (AOL) Data Release
¨ The reporters tracked down Thelma Arnold, a sixty-two-year-old widow from Lilburn, Georgia, who acknowledged that she had authored the searches, including queries such as:
• “numb fingers”, “60 single men” and “dog that urinates on everything”
¨ Main anonymisation techniques used: Suppression and Substitution
Anonymisation of non-textual data and Metadata
¨ Anonymisation might be required of non-textual data, e.g. images and videos (obfuscating faces)
¨ This might require release of hundreds or thousands of anonymised files, rather than one large dataset
¨ Often, hidden meta-data within those files is forgotten, and can be a valuable source for individuals attempting re-identification
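To show the mechanics, here is a minimal and deliberately simplified sketch of metadata stripping: removing APP1 segments (which carry EXIF data, including any embedded GPS coordinates) from a JPEG byte stream. Real-world stripping should use a mature tool; this only illustrates the idea.

```python
# Strip APP1 (EXIF, incl. GPS) segments from a JPEG byte stream.
# Simplified sketch: assumes a well-formed file; not production code.
import struct

def strip_exif(jpeg: bytes) -> bytes:
    assert jpeg[:2] == b"\xff\xd8", "not a JPEG (missing SOI marker)"
    out, i = bytearray(b"\xff\xd8"), 2
    while i < len(jpeg):
        if jpeg[i] != 0xFF:              # start of entropy-coded image data:
            out += jpeg[i:]              # copy the remainder verbatim
            break
        marker = jpeg[i + 1]
        if marker in (0xD8, 0xD9):       # SOI/EOI carry no length field
            out += jpeg[i:i + 2]
            i += 2
        else:
            (length,) = struct.unpack(">H", jpeg[i + 2:i + 4])
            if marker != 0xE1:           # keep every segment except APP1
                out += jpeg[i:i + 2 + length]
            i += 2 + length
    return bytes(out)

# Toy JPEG: SOI + APP1("Exif...") + APP0 + EOI
toy = (b"\xff\xd8" + b"\xff\xe1\x00\x08Exif\x00\x00"
       + b"\xff\xe0\x00\x04JF" + b"\xff\xd9")
print(b"Exif" in strip_exif(toy))  # False
```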
Anonymisation of non-textual data and Metadata
A simple example – a picture of a heron visiting my garden
Suppose I obfuscate the heron to protect its identity
GPS/Meta-Data in Image Files
Meta-Data Extraction
A Risk-Based Approach
¨ There is anonymisation guidance, but there is no anonymisation formula
¨ Anonymisation is not an exact science
¨ Each data set presents a unique instance, and the choice of anonymisation operation(s) must be carefully considered in order to maintain anonymity and utility
¨ A risk-based approach is therefore the only option…
Risk Mitigation Advice
¨ Consult with experts before embarking on anonymisation, on the proposed approach and potential risks
¨ If there is no business case or perceived benefit in going through the process, then don’t
¨ Consider release to limited audiences – only go public if strictly required
¨ Protect the anonymisation method/formula
¨ Quantitative anonymisation can typically be automated; Qualitative anonymisation will require more manual effort
¨ Always vet anonymisations of Quantitative and Qualitative data before release, don’t just fire and forget
Risk Mitigation Advice
¨ Perform your own rudimentary Google searches and correlation attempts with other public data sources before publishing
¨ If in doubt, engage again with experts on the likelihood of re-identification given the derived anonymised data set
¨ Consider the quantity of released anonymised data - in practically all re-identification studies performed, researchers have been more successful with larger databases
¨ Try to remove one or more of the top 3 culprits: Post Code, Birth Date and Sex
• In 2000 Dr. Latanya Sweeney showed that 87% of all Americans could be uniquely identified using only these three pieces of information
¨ Metadata – prior to release, ensure all meta-data in documents is removed – a number of tools exist for this, depending on the document type (e.g. Adobe Acrobat, Microsoft Office, JPEGs etc.)
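The identifying power of those three fields can be checked empirically before release: count how many records are unique on just the (post code, birth date, sex) combination. A sketch over the slide's sample dataset:

```python
# How identifying are post code + birth date + sex in combination?
# Count records that are unique on exactly those three fields.
from collections import Counter

records = [  # (post code district, birth date, sex) - the slide's sample
    ("SE24", "02/12/1954", "Male"),
    ("NW1",  "05/01/1984", "Male"),
    ("E17",  "04/08/1978", "Female"),
    ("WC1",  "03/10/1960", "Female"),
    ("NW7",  "09/09/1990", "Male"),
]

counts = Counter(records)
unique = sum(1 for r in records if counts[r] == 1)
print(f"{100 * unique / len(records):.0f}% of records are unique on these three fields")  # 100%
```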
Risk Mitigation Advice - Pseudonymisation
¨ Don’t choose reversible identifiers – for example, names replaced with identifiers that are based on record data
• John Smith, 05/01/1978 -> JS05011978
¨ Be aware of potential inferences from sorted data (e.g. alphabetical ordering might provide clues for re-identification)
¨ Keep the pseudonymisation formula secret – protect it with the same controls as for encryption keys and passwords
¨ Perform pseudonymisation functions in segregated, secure environments, and only copy/migrate the pseudonymised data (keep them separate)
¨ Remove all meta-data from pseudonymised data files – make sure the individual(s) performing the pseudonymisation are not referenced in the meta-data
¨ Any cryptographic hashes used as identifiers/keys should always be salted
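One way to meet the last two points together, sketched below: a keyed HMAC subsumes plain salting, since the secret key (protected like any other credential) prevents anyone from recomputing a pseudonym code from public record data.

```python
# Keyed hashing for pseudonym codes: unlike "JS05011978"-style derived
# identifiers, the code cannot be recomputed without the secret key.
import hashlib
import hmac
import secrets

SECRET_KEY = secrets.token_bytes(32)   # protect like an encryption key

def pseudonym(name: str, birth_date: str) -> str:
    message = f"{name}|{birth_date}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()[:12]

code = pseudonym("John Smith", "05/01/1978")
print(len(code))  # 12 -- stable for the same key, unguessable without it
```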
References
¨ ICO guidance on anonymisation: http://www.ico.org.uk/for_organisations/data_protection/topic_guides/anonymisation
¨ Paper on the failures of anonymisation, Paul Ohm http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006
¨ Research/Blog on anonymisation http://33bits.org/
Questions?
Matt Lewis
[email protected]
UK Offices
Manchester - Head Office
Cheltenham
Edinburgh
Leatherhead
London
Thame
North American Offices
San Francisco
Atlanta
New York
Seattle
Australian Offices
Sydney
European Offices
Amsterdam - Netherlands
Munich – Germany
Zurich - Switzerland