Facilitating Analytics while Protecting Individual Privacy Using Data De-identification
Khaled El Emam


DESCRIPTION

Presentation at Strata Rx 2013

TRANSCRIPT

Page 1: Facilitating Analytics while Protecting Privacy

Facilitating Analytics while Protecting Individual Privacy Using Data De-identification
Khaled El Emam

Page 2: Facilitating Analytics while Protecting Privacy

Talk Outline

- Present two case studies where we conducted an analysis of the privacy implications associated with sharing health data
- Overview of methodology and risk measurement basics
- State of Louisiana Department of Health and Hospitals and Cajun Code Fest 2013
- Mount Sinai School of Medicine Department of Preventive Medicine – World Trade Center Disaster Registry

Page 3: Facilitating Analytics while Protecting Privacy

Data Anonymization Resources

Book Signing: September 26, 2013 at 10:35am

Khaled El Emam and Luk Arbuckle

Page 4: Facilitating Analytics while Protecting Privacy

Basic Methodology

Page 5: Facilitating Analytics while Protecting Privacy

Direct and Indirect/Quasi-Identifiers

Examples of direct identifiers: Name, address, telephone number, fax number, MRN, health card number, health plan beneficiary number, license plate number, email address, photograph, biometrics, SSN, SIN, implanted device number

Examples of quasi-identifiers: sex, date of birth or age, geographic locations (such as postal codes, census geography, information about proximity to known or unique landmarks), language spoken at home, ethnic origin, total years of schooling, marital status, criminal history, total income, visible minority status, profession, event dates

Page 6: Facilitating Analytics while Protecting Privacy

Terminology

Page 7: Facilitating Analytics while Protecting Privacy

Safe Harbor

Safe Harbor Direct Identifiers and Quasi-Identifiers:

1. Names
2. ZIP codes (except the first three digits)
3. All elements of dates (except year)
4. Telephone numbers
5. Fax numbers
6. Electronic mail addresses
7. Social security numbers
8. Medical record numbers
9. Health plan beneficiary numbers
10. Account numbers
11. Certificate/license numbers
12. Vehicle identifiers and serial numbers, including license plate numbers
13. Device identifiers and serial numbers
14. Web Universal Resource Locators (URLs)
15. Internet Protocol (IP) address numbers
16. Biometric identifiers, including finger and voice prints
17. Full face photographic images and any comparable images
18. Any other unique identifying number, characteristic, or code

Actual Knowledge: the covered entity must also have no actual knowledge that the remaining information could be used to identify an individual.

Page 8: Facilitating Analytics while Protecting Privacy

Statistical Method

A person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable:

I. Applying such principles and methods, determines that the risk is "very small" that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information, and

II. Documents the methods and results of the analysis that justify such determination

Page 9: Facilitating Analytics while Protecting Privacy

Equivalence Classes - I

An equivalence class is the set of records in a table that have the same values on all the quasi-identifiers.

Page 10: Facilitating Analytics while Protecting Privacy

Equivalence Classes - II

Gender   Year of Birth (10 years)   DIN
Male     1970-1979                  2046059
Male     1980-1989                  716839
Male     1970-1979                  2241497
Female   1990-1999                  2046059
Female   1980-1989                  392537
Male     1990-1999                  363766
Male     1990-1999                  544981
Female   1980-1989                  293512
Male     1970-1979                  544981
Female   1990-1999                  596612
Male     1980-1989                  725765

Pages 11-15: Facilitating Analytics while Protecting Privacy

Equivalence Classes - III through VII (the same table repeated on each slide)
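The grouping can be checked mechanically. As a minimal sketch (not part of the deck), grouping the records from the table above by the two quasi-identifiers reproduces the equivalence classes:

```python
from collections import Counter

# Quasi-identifier values from the example table; the DIN is not a
# quasi-identifier, so it is left out of the grouping key.
records = [
    ("Male",   "1970-1979"), ("Male",   "1980-1989"),
    ("Male",   "1970-1979"), ("Female", "1990-1999"),
    ("Female", "1980-1989"), ("Male",   "1990-1999"),
    ("Male",   "1990-1999"), ("Female", "1980-1989"),
    ("Male",   "1970-1979"), ("Female", "1990-1999"),
    ("Male",   "1980-1989"),
]

# Each distinct combination of quasi-identifier values is one
# equivalence class; the count is the class size.
class_sizes = Counter(records)
for qi, size in sorted(class_sizes.items()):
    print(qi, "-> class size", size)
# 5 equivalence classes: one of size 3, four of size 2
```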

Page 16: Facilitating Analytics while Protecting Privacy

Maximum Risk

In the example data set we had 5 equivalence classes. The largest equivalence class had a size of 3, and the smallest had a size of 2. The probability of correctly re-identifying a record is 1 divided by the size of its equivalence class, so the maximum probability in this table is 0.5 (50%).

Page 17: Facilitating Analytics while Protecting Privacy

Average Risk

There were:
- Four equivalence classes of size 2
- One equivalence class of size 3

The average risk is [(8 × 1/2) + (3 × 1/3)] / 11 = 5/11

This gives an average risk of 5/11, or about 45%. This turns out to be the number of equivalence classes divided by the number of records.
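Continuing the sketch above (an illustration, not code from the talk), both measures can be verified directly:

```python
# Each record's re-identification probability is 1 / (its class size).
record_risks = [1 / class_sizes[qi] for qi in records]

max_risk = max(record_risks)                 # 0.5: the smallest class has size 2
avg_risk = sum(record_risks) / len(records)  # 5/11, about 0.45

# The average equals the number of classes divided by the number of records.
assert abs(avg_risk - len(class_sizes) / len(records)) < 1e-9
print(max_risk, avg_risk)
```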

Page 18: Facilitating Analytics while Protecting Privacy

Case Study: State of Louisiana – Cajun Code Fest

Page 19: Facilitating Analytics while Protecting Privacy

State of Louisiana

Demonstrate how the State of Louisiana used a novel approach to improve the health of its citizens by working with the Center for Business & Information Technologies (CBIT) at the University of Louisiana to provide data for Cajun Code Fest.

Discuss how providing realistic-looking and realistic-behaving de-identified Medicaid claims and immunization data enabled competitors to build applications supporting Louisiana's "Own Your Own Health" initiative – an initiative that encourages patients to make knowledgeable and informed decisions about their healthcare.

Page 20: Facilitating Analytics while Protecting Privacy

Cajun Code Fest 2.0 – April 24-26, 2013

27 hours of coding put on by the Center for Business & Information Technology at the University of Louisiana Lafayette.

Teams converged to work their innovative magic, analyzing the de-identified data set to create new healthcare solutions that allow patients to become engaged in their own health.

Page 21: Facilitating Analytics while Protecting Privacy

Why De-identified Data?

The core data that served as the basis for Cajun Code Fest had to be de-identified before it could be released to the entrants in the challenge. It would not have been possible to hold the coding challenge without properly de-identified data.

Page 22: Facilitating Analytics while Protecting Privacy

Data by the Numbers

- 200,000 unique individuals
- 6,683,337 Medicaid claims
- 6,410,969 Medicaid prescriptions
- 4,085,977 immunization records
- 29,951 providers

Page 23: Facilitating Analytics while Protecting Privacy

Data Model

Page 24: Facilitating Analytics while Protecting Privacy

Claims Summary

Page 25: Facilitating Analytics while Protecting Privacy

Long Tails & Truncation
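The chart for this slide is not in the transcript. As a hedged sketch of the general idea only (an assumption about the slide's content, not taken from it): a few patients with an unusually long tail of claims are easier to single out, so their records can be truncated to a cap:

```python
from collections import Counter

def truncate_claims(claims: list[dict], cap: int) -> list[dict]:
    """Keep at most `cap` claims per patient (hypothetical helper).

    Patients far out in the long tail of claim counts are unusually
    visible; their claims beyond the cap are dropped.
    """
    kept, seen = [], Counter()
    for claim in claims:
        pid = claim["patient_id"]  # assumed field name
        if seen[pid] < cap:
            kept.append(claim)
            seen[pid] += 1
    return kept
```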

Page 26: Facilitating Analytics while Protecting Privacy

Date Shifting – Simple Noise

Page 27: Facilitating Analytics while Protecting Privacy

Date Shifting – Fixed Shift

Page 28: Facilitating Analytics while Protecting Privacy

Date Shifting – Randomized Generalization I

Page 29: Facilitating Analytics while Protecting Privacy

Date Shifting - Randomized Generalization II
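The figures for these four date-shifting slides are not in the transcript. As a hedged sketch under the usual interpretations of the three names (simple noise perturbs each date independently and so distorts intervals; a fixed shift moves all of a patient's dates by one offset and preserves intervals; randomized generalization coarsens each date to an interval and then draws a random date within it):

```python
import random
from datetime import date, timedelta

def shift_simple_noise(d: date, max_days: int = 30) -> date:
    # Independent noise per date: intervals between events are distorted.
    return d + timedelta(days=random.randint(-max_days, max_days))

def shift_fixed(dates: list[date], max_days: int = 180) -> list[date]:
    # One random offset per patient: intervals are preserved exactly.
    offset = timedelta(days=random.randint(-max_days, max_days))
    return [d + offset for d in dates]

def shift_randomized_generalization(d: date) -> date:
    # Generalize to the month, then draw a random day within that month
    # (one plausible reading of "randomized generalization").
    first = d.replace(day=1)
    next_first = (first + timedelta(days=32)).replace(day=1)
    return first + timedelta(days=random.randrange((next_first - first).days))
```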

Page 30: Facilitating Analytics while Protecting Privacy

Geoproxy Attacks

Patients tend to visit providers and obtain prescriptions from pharmacies that are close to where they live. Can we use the provider and pharmacy location information to predict where the patient lives? This is called a geoproxy attack. We can measure the probability of a correct geoproxy attack and incorporate that into our overall risk measurement framework.
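As a rough sketch of how such a measurement might look (an illustration, not the talk's actual method, with hypothetical names throughout): predict each patient's home region from the locations they visit, then score how often the prediction is right.

```python
from collections import Counter

def geoproxy_risk(visits: dict[str, list[str]],
                  true_region: dict[str, str]) -> float:
    """Fraction of patients whose home region is guessed correctly.

    `visits` maps a patient ID to the regions (e.g., ZIP3) of the
    providers and pharmacies they visited; the adversary guesses the
    most frequently visited region.
    """
    hits = 0
    for patient, regions in visits.items():
        guess = Counter(regions).most_common(1)[0][0]
        hits += (guess == true_region[patient])
    return hits / len(visits)
```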

Page 31: Facilitating Analytics while Protecting Privacy

Geoproxy Risk on Claims Data

Page 32: Facilitating Analytics while Protecting Privacy

Case Study: Mount Sinai School of MedicineWorld Trade Center Disaster Registry

Page 33: Facilitating Analytics while Protecting Privacy

Background

Over 50,000 people are estimated to have helped with the rescue and recovery efforts after 9/11, and over 27,000 of those are captured in the WTC disaster registry created by the Clinical Center of Excellence at Mount Sinai.

Mount Sinai did a lot of publicity and outreach, working with a variety of organizations, to recruit 9/11 workers and volunteers. Those who participated have gone through comprehensive examinations including:
- Medical questionnaires
- Mental-health questionnaires
- Exposure-assessment questionnaires
- Standardized physical examinations
- Optional follow-up assessments every 12 to 18 months

Page 34: Facilitating Analytics while Protecting Privacy

Public Information

Page 35: Facilitating Analytics while Protecting Privacy

Series of Events

Page 36: Facilitating Analytics while Protecting Privacy

Series of Events

The visit date was used for questions that were specific to the date at which the visit occurred (e.g., "do you currently smoke?" would create an event for smoking at the time of the visit).

Some questions included dates that could be used directly with the quasi-identifier, and were more informative than the visit date (e.g., the answer to "when were you diagnosed with this disease?" was used to provide a date for the disease event).

Page 37: Facilitating Analytics while Protecting Privacy

Demographics

Page 38: Facilitating Analytics while Protecting Privacy

Examples of Events

Page 39: Facilitating Analytics while Protecting Privacy

Multiple Levels

Sometimes it is reasonable to assume that the adversary will not have a lot of detail about an event. For example, the adversary may know that an event has occurred but not know the exact date on which it occurred. In such a case we change the data to match the adversary's background knowledge when measuring risk, but we release the more detailed data. This makes sense given the assumption: the more detailed information that is released does not give the adversary additional useful information.

Page 40: Facilitating Analytics while Protecting Privacy

Time of Events

Ten years after the fact, however, it seems unlikely that an adversary would know the dates of a patient's events before 9/11. Often patients gave different years of diagnosis on follow-up visits because they themselves didn't remember what medical conditions they had! So instead of the date of the event, we used "pre-9/11" as a value.

We made a distinction between childhood (under 18) and adulthood (18 and over) diagnoses; these seemed like something an adversary could reasonably know.

These generalizations were done only for measuring risk, and weren't applied to the de-identified registry data.

Page 41: Facilitating Analytics while Protecting Privacy

Covering Designs

What are the quasi-identifiers when the series of events is long? Will an adversary know all of the details in that sequence? It is reasonable to assume that an adversary will only know p events – this is the power of the adversary. But which p out of m events does the adversary know? If we look at all combinations of p from m, we may end up with quite a large number of combinations of quasi-identifiers over which to measure the risk.
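As a worked illustration (assumed sizes, not the talk's numbers): with m = 20 events and adversary power p = 3 there are C(20, 3) = 1,140 subsets of quasi-identifiers to evaluate. Because risk measured on a larger block of k events is at least the risk of any p-subset it contains, a covering design whose blocks of size k cover every 3-subset gives a conservative bound with far fewer evaluations:

```python
from math import comb

m, p, k = 20, 3, 6  # assumed sizes for illustration

print(comb(m, p))   # 1140 risk measurements, one per 3-subset

# Each k-block covers comb(k, p) of the p-subsets, so a covering
# design needs at least ceil(C(m, p) / C(k, p)) blocks:
print(-(-comb(m, p) // comb(k, p)))  # ceil(1140 / 20) = 57
```

The bound is conservative because adding quasi-identifiers can only shrink equivalence classes, so the risk computed on a k-block is an upper bound for every p-subset inside it.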

Page 42: Facilitating Analytics while Protecting Privacy

Combinations of 3

Page 43: Facilitating Analytics while Protecting Privacy

Covering Design

Page 44: Facilitating Analytics while Protecting Privacy

Reduction in Computation

Page 45: Facilitating Analytics while Protecting Privacy

Contact

Khaled El Emam: [email protected], ext. 111

@PrivacyAnalytic