2013 07-12 ncc-group_data_anonymisation_technical_aspects_v1 0
Technical Aspects of Data Anonymisation & Pseudonymisation
Risks, Challenges & Mitigations
Matt Lewis
Principal Consultant
04/08/2023 © NCC Group 2
Agenda
¨ NCC Group – who we are and what we do
¨ Anonymisation, Pseudonymisation & Re-identification – overview of concepts
¨ Examples – when anonymisation goes wrong
¨ Pitfalls of image anonymisation and other information leakage through meta-data
¨ A risk-based approach to anonymisation
¨ Summary and advice
¨ Questions
NCC Group
¨ Global information assurance specialist
¨ 15,000 customers worldwide across all sectors
¨ The Group has two complementary divisions – escrow and assurance
¨ Independence from hardware and software providers ensures we provide unbiased and impartial advice
¨ Largest penetration testing team in the world, with approximately 250 consultants
Me: Brief Bio
¨ Over 12 years working in Information Security
¨ Previous Employers:
• CESG – the Information Assurance arm of GCHQ
• Information Risk Management (IRM) plc – penetration testing
• KPMG – Executive Advisor in the Information Protection division of IT Advisory
• NCC Group – Principal Consultant, providing penetration testing and consultancy around all aspects of Information Security
Anonymisation – Overview
¨ Anonymised data should be information that does not identify any individuals, either in isolation or when cross-referenced with other data already in the public domain
¨ A careful balance is required around the level of anonymisation versus the usefulness of the resultant data
¨ Quantitative versus Qualitative – the latter is harder to anonymise in a consistent way, and requires more rigour on a ‘per record’ basis – e.g. meeting minutes
Pseudonymisation – Overview
¨ Information is anonymous to the receiver (e.g. researchers), but contains codes or identifiers to allow others to re-identify individuals from the pseudonymised data
¨ Universally protecting pseudonymised data whilst allowing general analysis of it is difficult – requires careful management of the ‘codes’ or ‘keys’ that uniquely identify individuals
¨ Quantitative versus Qualitative – again, the latter is harder to pseudonymise in a consistent way, and requires more rigour on a ‘per record’ basis – e.g. meeting minutes
Anonymisation – Techniques and Methods
¨ There are four main operations available for anonymising data
¨ Suppression, Substitution/Distortion, Generalisation, Aggregation
¨ Consider the following dataset:
Name Sex Birth Date Post Code Complaint
John Male 02/12/1954 SE24 6TY Pain in left eye
Daniel Male 05/01/1984 NW1 6XD Chest pains
Sarah Female 04/08/1978 E17 7WE Chest pains
Samantha Female 03/10/1960 WC1 7RA Back pains
James Male 09/09/1990 NW7 5LK Headaches
Anonymisation – Techniques and Methods
¨ Suppression - deleting or omitting data fields entirely
Sex Complaint
Male Pain in left eye
Male Chest pains
Female Chest pains
Female Back pains
Male Headaches
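As a sketch (Python is used here and throughout purely for illustration; the dataset is the slide's own example), suppression amounts to releasing only a whitelist of fields:

```python
# Suppression: release only a whitelist of fields, deleting the rest.
# Dataset mirrors the slide's example.
patients = [
    {"name": "John",     "sex": "Male",   "birth_date": "02/12/1954", "post_code": "SE24 6TY", "complaint": "Pain in left eye"},
    {"name": "Daniel",   "sex": "Male",   "birth_date": "05/01/1984", "post_code": "NW1 6XD",  "complaint": "Chest pains"},
    {"name": "Sarah",    "sex": "Female", "birth_date": "04/08/1978", "post_code": "E17 7WE",  "complaint": "Chest pains"},
    {"name": "Samantha", "sex": "Female", "birth_date": "03/10/1960", "post_code": "WC1 7RA",  "complaint": "Back pains"},
    {"name": "James",    "sex": "Male",   "birth_date": "09/09/1990", "post_code": "NW7 5LK",  "complaint": "Headaches"},
]

def suppress(records, keep):
    """Return copies of the records containing only the whitelisted fields."""
    return [{k: r[k] for k in keep} for r in records]

suppressed = suppress(patients, keep=("sex", "complaint"))
print(suppressed[0])  # {'sex': 'Male', 'complaint': 'Pain in left eye'}
```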
Anonymisation – Techniques and Methods
¨ Substitution/Distortion – e.g. replace a person’s name with a unique number – this is also an example of pseudonymisation
Name Sex Birth Date Post Code Complaint
0000001 Male 02/12/1954 SE24 6TY Pain in left eye
0000002 Male 05/01/1984 NW1 6XD Chest pains
0000003 Female 04/08/1978 E17 7WE Chest pains
0000004 Female 03/10/1960 WC1 7RA Back pains
0000005 Male 09/09/1990 NW7 5LK Headaches
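A minimal sketch of the substitution above: the sequential codes are what make this pseudonymisation, since the code-to-name key table still exists and must be protected separately.

```python
# Substitution: replace each name with a sequential code. The code->name
# key table is retained by the data controller, not released.
patients = [
    {"name": "John",   "complaint": "Pain in left eye"},
    {"name": "Daniel", "complaint": "Chest pains"},
    {"name": "Sarah",  "complaint": "Chest pains"},
]

def pseudonymise(records):
    key_table, released = {}, []
    for i, r in enumerate(records, start=1):
        code = f"{i:07d}"              # 0000001, 0000002, ...
        key_table[code] = r["name"]
        released.append({**r, "name": code})
    return released, key_table

released, key_table = pseudonymise(patients)
print(released[0]["name"])  # 0000001
```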
Anonymisation – Techniques and Methods
¨ Generalisation - alter rather than delete identifier values to increase privacy while preserving utility
Name Sex Birth Year Post Code Complaint
John Male 1954 SE24 Pain in left eye
Daniel Male 1984 NW1 Chest pains
Sarah Female 1978 E17 Chest pains
Samantha Female 1960 WC1 Back pains
James Male 1990 NW7 Headaches
Anonymisation – Techniques and Methods
¨ Aggregation - produce summary statistics across a dataset instead of an anonymised dataset
¨ 40% of patients complain of chest pains
¨ 60% of patients are male
¨ etc.
Name Sex Birth Date Post Code Complaint
John Male 02/12/1954 SE24 6TY Pain in left eye
Daniel Male 05/01/1984 NW1 6XD Chest pains
Sarah Female 04/08/1978 E17 7WE Chest pains
Samantha Female 03/10/1960 WC1 7RA Back pains
James Male 09/09/1990 NW7 5LK Headaches
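The aggregates above can be computed as a sketch over the complaint and sex columns of the slide's dataset:

```python
# Aggregation: publish summary statistics only, never row-level data.
from collections import Counter

complaints = ["Pain in left eye", "Chest pains", "Chest pains", "Back pains", "Headaches"]
sexes = ["Male", "Male", "Female", "Female", "Male"]

def percentage(values, target):
    return 100 * Counter(values)[target] / len(values)

print(f"{percentage(complaints, 'Chest pains'):.0f}% of patients complain of chest pains")  # 40%
print(f"{percentage(sexes, 'Male'):.0f}% of patients are male")                             # 60%
```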
Anonymisation of Qualitative Data
Pseudonymisation of qualitative data can be much more difficult
The content and themes in this meeting, for example, might allow re-identification of pseudonymised individuals
Re-identification – What is it?
¨ Re-identification is the act of cross-referencing anonymised data with other data sources, and using inference, deduction and correlation to identify individuals
¨ Depending on the nature of data re-identified, this might raise data protection concerns
Re-identification – Who does this and Why?
¨ Researchers – e.g. computer scientists, genuinely interested in the challenges of re-identification
¨ Malicious individuals – use re-identification to discriminate against, harass or discredit a victim
¨ Investigative journalists
¨ Organised crime – re-identification can facilitate creation of fake identities, or be used to extort victims (if data is personal/sensitive in nature)
¨ Competitors – seeking to re-identify and publish to discredit
¨ State-sponsored data mining and correlation
¨ The Internet is essentially a vast, ever-growing cross-correlation database; access to most of which is open and free to anyone…
Re-identification
Inference as a Starting Point
¨ Recall our example:
Name Sex Birth Date Post Code Complaint
John Male 02/12/1954 SE24 6TY Pain in left eye
Daniel Male 05/01/1984 NW1 6XD Chest pains
Sarah Female 04/08/1978 E17 7WE Chest pains
Samantha Female 03/10/1960 WC1 7RA Back pains
James Male 09/09/1990 NW7 5LK Headaches
Inference as a Starting Point
¨ Suppose the following anonymised aggregations are published:
40% of patients complain of chest pains
60% of patients are male
100% of Back pain sufferers live in WC1
100% of patients are over 21 years of age
100% of females suffer from chest or back pains
20% of patients suffer from Pain in the left eye
¨ From this we can infer the following table fields:
Sex, Age, Condition, Post Code
Inference as a Starting Point
¨ Suppose we know the sample size (i.e. 5), and the time of data publication
60% of patients are male
Sex Birth Date Complaint Post Code
Male
Male
Male
Female
Female
Inference as a Starting Point
100% of patients are over 21 years of age
Sex Birth Date Complaint Post Code
Male <= 1992
Male <= 1992
Male <= 1992
Female <= 1992
Female <= 1992
Inference as a Starting Point
100% of females suffer from chest or back pains
Sex Birth Date Complaint Post Code
Male <= 1992
Male <= 1992
Male <= 1992
Female <= 1992 Chest Pains
Female <= 1992 Back Pains
Inference as a Starting Point
100% of Back pain sufferers live in WC1
Sex Birth Date Complaint Post Code
Male <= 1992
Male <= 1992
Male <= 1992
Female <= 1992 Chest Pains
Female <= 1992 Back Pains WC1
Inference as a Starting Point
20% of patients suffer from Pain in the left eye
Sex Birth Date Complaint Post Code
Male <= 1992 Pain in left eye
Male <= 1992
Male <= 1992
Female <= 1992 Chest Pains
Female <= 1992 Back Pains WC1
Inference as a Starting Point
40% of patients complain of chest pains
¨ The next step would be to correlate/cross-reference with other sources
Sex Birth Date Complaint Post Code
Male <= 1992 Pain in left eye
Male <= 1992 Chest Pains
Male <= 1992
Female <= 1992 Chest Pains
Female <= 1992 Back Pains WC1
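The stepwise inference above can be sketched in code: start from the known sample size, then apply each published aggregate as a constraint on the skeleton table.

```python
# Rebuild a skeleton table from the published aggregates, applying each
# one as a constraint, exactly as in the walkthrough above.
n = 5  # known sample size
table = [dict.fromkeys(("sex", "birth", "complaint", "post_code")) for _ in range(n)]

# "60% of patients are male"
for i, row in enumerate(table):
    row["sex"] = "Male" if i < round(0.6 * n) else "Female"

# "100% of patients are over 21 years of age" (data published in 2013)
for row in table:
    row["birth"] = "<= 1992"

# "100% of females suffer from chest or back pains"
females = [r for r in table if r["sex"] == "Female"]
females[0]["complaint"], females[1]["complaint"] = "Chest pains", "Back pains"

# "100% of back pain sufferers live in WC1"
for row in table:
    if row["complaint"] == "Back pains":
        row["post_code"] = "WC1"

# The remaining aggregates would fill further cells the same way.
print(table[4])
```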
An Example: Massachusetts Group Insurance Commission (GIC)
¨ In the mid-1990s GIC released anonymised data on state employees that showed every single hospital visit
¨ The goal was to help researchers; the state spent time removing all obvious identifiers such as name, address, and Social Security number
¨ William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers
¨ Dr. Latanya Sweeney, then a computer science graduate student, requested a copy of the data and performed re-identification research on the dataset
¨ Main anonymisation technique used: Suppression
An Example: Massachusetts Group Insurance Commission (GIC)
Governor Weld lived in Cambridge, MA: 54,000 residents, 7 post codes
Electoral roll purchased for $20: contained name, address, post code, birth date, sex etc.
Cross-referencing with the GIC anonymised data:
• Only 6 people in Cambridge shared Weld’s birthday
• Only 3 of these were men
• Only 1 lived in Weld’s post code
Dr. Sweeney sent the Governor’s health records (which included diagnoses and prescriptions) to his office
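The cross-referencing step can be sketched as a simple filter over an electoral roll using the quasi-identifiers shared with the "anonymised" health data. All records and values below are invented for illustration.

```python
# Sketch of the cross-referencing step: filter an electoral roll by the
# quasi-identifiers it shares with an "anonymised" health record.
# All records below are invented for illustration.
electoral_roll = [
    {"name": "Resident A", "birth_date": "31/07/1945", "sex": "Male",   "post_code": "02138"},
    {"name": "Resident B", "birth_date": "31/07/1945", "sex": "Male",   "post_code": "02139"},
    {"name": "Resident C", "birth_date": "31/07/1945", "sex": "Female", "post_code": "02138"},
]
health_record = {"birth_date": "31/07/1945", "sex": "Male", "post_code": "02138"}

quasi_identifiers = ("birth_date", "sex", "post_code")
candidates = [p for p in electoral_roll
              if all(p[k] == health_record[k] for k in quasi_identifiers)]
print(len(candidates))  # 1 -- a single match re-identifies the health record
```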
Another Example: America Online (AOL) Data Release
¨ In 2006, AOL publicly released twenty million search queries for 650,000 users of AOL’s search engine summarising three months of activity
¨ AOL suppressed username and IP address, but replaced these with unique numbers that allowed researchers to correlate different searches with a specific user (pseudonymisation)
¨ New York Times reporters Michael Barbaro and Tom Zeller performed some research around User 4417749’s identity. His/her searches had included:
• “landscapers in Lilburn, Ga”
• “several people with the last name Arnold”
• “homes sold in shadow lake subdivision gwinnett county georgia”
Another Example: America Online (AOL) Data Release
¨ The reporters tracked down Thelma Arnold, a sixty-two-year-old widow from Lilburn, Georgia, who acknowledged that she had authored the searches, including queries such as:
• “numb fingers”, “60 single men” and “dog that urinates on everything”
¨ Main anonymisation techniques used: Suppression and Substitution
Anonymisation of non-textual data and Metadata
¨ Anonymisation might be required of non-textual data, e.g. images and videos (obfuscating faces)
¨ This might require release of hundreds or thousands of anonymised files, rather than one large dataset
¨ Often, hidden meta-data within those files is forgotten, and can be a valuable source for individuals attempting re-identification
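To show the mechanics, here is a minimal and deliberately simplified sketch of metadata stripping: removing APP1 segments (which carry EXIF data, including any embedded GPS coordinates) from a JPEG byte stream. Real-world stripping should use a mature tool; this only illustrates the idea.

```python
# Strip APP1 (EXIF, incl. GPS) segments from a JPEG byte stream.
# Simplified sketch: assumes a well-formed file; not production code.
import struct

def strip_exif(jpeg: bytes) -> bytes:
    assert jpeg[:2] == b"\xff\xd8", "not a JPEG (missing SOI marker)"
    out, i = bytearray(b"\xff\xd8"), 2
    while i < len(jpeg):
        if jpeg[i] != 0xFF:              # start of entropy-coded image data:
            out += jpeg[i:]              # copy the remainder verbatim
            break
        marker = jpeg[i + 1]
        if marker in (0xD8, 0xD9):       # SOI/EOI carry no length field
            out += jpeg[i:i + 2]
            i += 2
        else:
            (length,) = struct.unpack(">H", jpeg[i + 2:i + 4])
            if marker != 0xE1:           # keep every segment except APP1
                out += jpeg[i:i + 2 + length]
            i += 2 + length
    return bytes(out)

# Toy JPEG: SOI + APP1("Exif...") + APP0 + EOI
toy = (b"\xff\xd8" + b"\xff\xe1\x00\x08Exif\x00\x00"
       + b"\xff\xe0\x00\x04JF" + b"\xff\xd9")
print(b"Exif" in strip_exif(toy))  # False
```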
Anonymisation of non-textual data and Metadata
A simple example – a picture of a heron visiting my garden
Suppose I obfuscate the heron to protect its identity
GPS/Meta-Data in Image Files
Meta-Data Extraction
A Risk-Based Approach
¨ There is anonymisation guidance, but there is no anonymisation formula
¨ Anonymisation is not an exact science
¨ Each data set presents a unique instance, and the choice of anonymisation operation(s) must be carefully considered in order to maintain anonymity and utility
¨ A risk-based approach is therefore the only option…
Risk Mitigation Advice
¨ Consult with experts before embarking on anonymisation, on the proposed approach and potential risks
¨ If there is no business case or perceived benefit in going through the process, then don’t
¨ Consider release to limited audiences – only go public if strictly required
¨ Protect the anonymisation method/formula
¨ Quantitative anonymisation can typically be automated; Qualitative anonymisation will require more manual effort
¨ Always vet anonymisations of Quantitative and Qualitative data before release, don’t just fire and forget
Risk Mitigation Advice
¨ Perform your own rudimentary Google searches and correlation attempts with other public data sources before publishing
¨ If in doubt, engage again with experts on the likelihood of re-identification given the derived anonymised data set
¨ Consider the quantity of released anonymised data - in practically all re-identification studies performed, researchers have been more successful with larger databases
¨ Try to remove one or more of the top 3 culprits: Post Code, Birth Date and Sex
• In 2000 Dr. Latanya Sweeney showed that 87% of all Americans could be uniquely identified using only these three pieces of information
¨ Metadata – prior to release, ensure all meta-data in documents is removed – a number of tools exist for this, depending on the document type (e.g. Adobe Acrobat, Microsoft Office, JPEGs etc.)
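The identifying power of those three fields can be checked empirically before release: count how many records are unique on just the (post code, birth date, sex) combination. A sketch over the slide's sample dataset:

```python
# How identifying are post code + birth date + sex in combination?
# Count records that are unique on exactly those three fields.
from collections import Counter

records = [  # (post code district, birth date, sex) - the slide's sample
    ("SE24", "02/12/1954", "Male"),
    ("NW1",  "05/01/1984", "Male"),
    ("E17",  "04/08/1978", "Female"),
    ("WC1",  "03/10/1960", "Female"),
    ("NW7",  "09/09/1990", "Male"),
]

counts = Counter(records)
unique = sum(1 for r in records if counts[r] == 1)
print(f"{100 * unique / len(records):.0f}% of records are unique on these three fields")  # 100%
```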
Risk Mitigation Advice - Pseudonymisation
¨ Don’t choose reversible identifiers – for example, names replaced with identifiers that are based on record data
• John Smith, 05/01/1978 -> JS05011978
¨ Be aware of potential inferences from sorted data (e.g. alphabetical ordering might provide clues for re-identification)
¨ Keep the pseudonymisation formula secret – protect it with the same controls as for encryption keys and passwords
¨ Perform pseudonymisation functions in segregated, secure environments, and only copy/migrate the pseudonymised data (keep them separate)
¨ Remove all meta-data from pseudonymised data files – make sure the individual(s) performing the pseudonymisation are not referenced in the meta-data
¨ Any cryptographic hashes used as identifiers/keys should always be salted
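One way to meet the last two points together, sketched below: a keyed HMAC subsumes plain salting, since the secret key (protected like any other credential) prevents anyone from recomputing a pseudonym code from public record data.

```python
# Keyed hashing for pseudonym codes: unlike "JS05011978"-style derived
# identifiers, the code cannot be recomputed without the secret key.
import hashlib
import hmac
import secrets

SECRET_KEY = secrets.token_bytes(32)   # protect like an encryption key

def pseudonym(name: str, birth_date: str) -> str:
    message = f"{name}|{birth_date}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()[:12]

code = pseudonym("John Smith", "05/01/1978")
print(len(code))  # 12 -- stable for the same key, unguessable without it
```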
References
¨ ICO guidance on anonymisation: http://www.ico.org.uk/for_organisations/data_protection/topic_guides/anonymisation
¨ Paper on the failures of anonymisation, Paul Ohm http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006
¨ Research/Blog on anonymisation http://33bits.org/
Questions?
Matt Lewis
[email protected]
UK Offices
Manchester - Head Office
Cheltenham
Edinburgh
Leatherhead
London
Thame
North American Offices
San Francisco
Atlanta
New York
Seattle
Australian Offices
Sydney
European Offices
Amsterdam - Netherlands
Munich – Germany
Zurich - Switzerland