sachrp meeting: oct 20, 2010 measuring identifiability from a “reasonable” view bradley malin,...
TRANSCRIPT
SACHRP Meeting: Oct 20, 2010
Measuring Identifiability froma “Reasonable” View
Bradley Malin, Ph.D.Assistant Prof. of Biomedical Informatics, School of MedicineAssistant Prof. of Computer Science, School of EngineeringDirector, Health Information Privacy LaboratoryVanderbilt University
The Face that Launched a
Thousand Ships
The “Quasi-identifier”
• Hospital Discharge Data – Ethnicity– Visit date– Diagnosis– Procedure– Medication– Total charge
• Voter List – Name– Address– Date registered– Party affiliation– Date last voted
• Shared Data in Both– Zip Code– Birthdate– Gender
• Result:
Re-identification of Governor
L. Sweeney. Journal of Law, Medicine, and Ethics. 1997.
3
4Identifiability © Bradley Malin, 2010
Given
5-Digit Zip Code+ Birthdate+ Gender
63-87% of USA estimated to be unique
L. Sweeney. Uniqueness of Simple Demographics in the U.S Population. 2000.P. Golle. Revisiting the Uniqueness of US Population. ACM WPES. 2006.
Various Studies in Uniqueness
• It doesn’t take many [insert your favorite feature] to make you unique– Demographic Features (Sweeney 1997; Golle 2006; El Emam 2008)
– Diagnosis Codes (Loukides et al. 2010)
– DNA (SNPs) (Lin, Owen, & Altman 2004; Homer et al. 2008, Wang et al. 2009)
– Location Visits (Malin & Sweeney 2004, Golle & Partridge 2009)
– Movie Reviews (Narayanan & Shmatikov 2008)
– Pedigree Structure (Malin 2006)
– Search Queries (Barbaro & Zeller 2006)
– Social Network Structure (Backstrom et al. 2007, Narayanan & Shmatikov 2009)
5Identifiability © Bradley Malin, 2010
HIPAA Privacy Rule (2002)• Protected Health Information (PHI)
– explicitly linked to a particular individualor
– could reasonably be expected to allow individual identification
• Permitted Secondary Data Sharing– Limited Datasets– De-identified Data
• Safe Harbor• Expert Determination
6Identifiability © Bradley Malin, 2010
Beyond “unique”
7Identifiability © Bradley Malin, 2010
HIPAA & Secondary Sharing• Safe Harbor: Requires removal of 18 attributes
– geocodes with < 20,000 people
– All dates (except year) & ages > 89
– Any other unique identifying number, characteristic, or code• if the person holding the coded data can re-identify the patient
• Limited Dataset: Can retain dates & geocodes if recipient signs contract
Identifiability © Bradley Malin, 2010 8
Limited ReleaseSafe Harbor
Attacks onDemographics
– Consider population estimates from the U.S. Census Bureau
– They’re not perfect, but they’re a start
K. Benitez and B. Malin. Evaluating re-identification risk with respect to the HIPAA privacy policies. Journal of the American Medical Informatics Association. 2010; 17: 169-177.
9
IdentifiedRecords
Safe HarboredClinical Records
PrivateClinical Records
Limited Data Set
Clinical Records
Case Study: Tennessee
Safe Harbor(-like){Race, Gender, Year (of Birth), State}
Limited Dataset{Race, Gender, Date (of Birth), County}
Group size = 33
Identifiability © Bradley Malin, 2010 10
All U.S. States
Safe Harbor Data Set Limited Data Set
Group Size
Perc
ent I
denti
fiabl
e
0%
0.05%
0.10%
0.25%
0.30%0.35%
0.20%
0.15%
1 3 5 10
Group Size
0%
60%
80%
100%
40%
20%
1 3 5 10
11Identifiability © Bradley Malin, 2010
Policy Analysis via “Trust Differential”…
Risk (Limited Dataset)Risk (Safe Harbor)
• Uniques– Delaware’s risk increases by a factor ~1,000– Tennessee’s “ “ “ “ ~2,300– Illinois’s “ “ “ “ ~65,000
• 20,000– Delaware’s risk does not increase– Tennessee’s risk increases by a factor of ~8– Illinois’s risk increases by a factor of ~37
12
Which Leads Us To
• P. Ohm. Broken promises: Responding to the surprising failure of anonymization. UCLA Law Review. 2010; 57: 1701-1777.
13
But…There’s a Really Big But
Identifiability © Bradley Malin, 2010 14
DISTINGUISHABLE
IDENTIFIABLE
Identifiability © Bradley Malin, 2010 15
Central Dogma of Re-identification
B. Malin, M. Kantarcioglu, & C. Cassa. A survey of challenges and solutions for privacy in clinical genomics data mining. In Privacy-Aware Knowledge Discovery: Novel Applications and New Techniques. CRC Press. To appear.
De-identifiedDNA
IdentifiedData
Necessary Condition
Necessary Condition
Necessary Condition
Linking Mechanism
Identifiability © Bradley Malin, 2010 16
…We’ve Been Looking atWorst Case Scenarios...
• How would you use demographics?
• Could link to registries– Birth– Death– Marriage– Professional (Physicians, Lawyers)
• What’s in vogue?Back to voter registration databases
17
Going to the Source
• We polled all U.S. states for what voter information is collected & shared
• What fields are shared?
• Who has access?
• Who can use it?
• What’s the cost?
Safe HarboredClinical Records
IdentifiedClinical Records
Limited Data SetClinical Records
Private VersionIdentified
Voter Records
Public VersionIdentified
Voter Records
18
The eMERGE Networkelectronic Medical Records & Genomics
A consortium of biorepositories linked to electronic medical records data for conducting genomic studies
• Consortuim members (http://www.gwas.net)• Group Health of Puget Sound• Marshfield Clinic• Mayo Clinic• Northwestern University• Vanderbilt University
• Funding condition: contribute de-identified genomic and EMR-derived phenotype data to databaseof genotype and phenotype (dbGAP) at NCBI, NIH
U.S. State PolicyIL MN TN WA WI
WHO??? Registered Political Committees(ANYONE – In Person)
MN Voters Anyone Anyone Anyone
Format Disk Disk Disk Disk Disk
Cost $500 $46; “use ONLY for elections, political activities, or law enforcement”
$2500 $30 $12,500
Name
Address
Election History
Date of Birth
Date of Registration
Sex
Race
Phone Number
Identifiability © Bradley Malin, 2010 20
U.S. State PolicyIL MN TN WA WI
WHO??? Registered Political Committees(Anyone – In Person)
MN Voters Anyone Anyone Anyone
Format Disk Disk Disk Disk Disk
Cost $500 $46; “use only for elections, political activities, or law enforcement”
$2500 $30 $12500
Name X X X X X
Address X X X X X
Election History X X X X X
Date of Birth X o X X
Date of Registration X X X X
Sex X X X
Race X
Phone Number X X
Identifiability Changes!
Limited Data Set Limited Data Set Voter Reg.
22
Group Size
0%
60%
80%
100%
40%
20%
1 3 5 10
Perc
ent I
denti
fiabl
e
0%
60%
80%
100%
40%
20%
1 3 5 10
Group Size
And What About Cost?
Identifiability © Bradley Malin, 2010 23
U.S. State Limited Dataset Safe Harbor
At Risk Cost/Re-ID At Risk Cost/Re-ID
Virginia 3159764 $0 221 $0
S. Carolina 2231973 $0 1386 $0
Wisconsin 72 $174 2 $6250
W. Virginia 55 $309 1 $17000
A “Real” Study Reported Last Week!
• Challenge issued by U.S. Dept. Health & Human Services
• Provided ~15,000 Safe Harbor records to re-id team @ U. Chicago on self identified “minority” ethnicity
• Team purchased public records from commercial broker
• Could correctly identified 2 people
• Details at http://www.ehcca.com/presentations/HIPAAWest4/lafky_2.pdf
HIPAA Expert Determination(abridged)
• Certify via “generally accepted statistical and scientific principles and methods, that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by the anticipated recipient to identify the subject of the information.”
Identifiability © Bradley Malin, 2010 25
Towards an Expert Model• So far, we’ve looked at on populations (e.g., U.S. state). • Let’s shift focus to specific samples
K. Benitez, G. Loukides, and B. Malin. Beyond Safe Harbor: automatic discovery of health information de-identification policy alternatives. Proceedings of the ACM International Health Informatics Symposium. 2010: to appear.
Safe HarborCohort
PopulationCounts
(e.g. CENSUS)
RiskEstimationProcedure
Risk MitigationProcedure
StatisticalStandardCohort
PatientCohort
SafeHarbor
Procedure
Case study: Vandy QRS Cohort
27
PolicyGeneralizations
RiskGender Race Age
Safe Harbor [90 - 120] 0.909
Alternative 1 [M or F] 0.476
Alternative 2 [Asian or Other] 0.857
…
Alternative 100 [52 - 53] 0.875
Who StateState
Population Size(2000 Census)
CohortSize
Patients >89years old
Vanderbilt TN 5,689,283 2,983 12
Broader Set of Cases
• EMERGE Cohorts
Who Clinical Finding of InterestCohort
Size
Patients >89
years oldGHC Dementia 3,616 1,483
Marshfield Cataracts 2,646 269Mayo Peripheral Arterial Disease 3,412 29
Northwestern (A) Type-II Diabetes 3,383 6Vanderbilt (B) QRS Duration 2,983 12
Northwestern (B) QRS Duration 149 0Vanderbilt (A) Type-II Diabetes 2,015 18
28
B. Malin, K. Benitez, and D. Masys. Never too old for anonymity: a statistical standard for demographic data sharing. Journal of the American Medical Informatics Association. To appear.
Alternatives for eMERGE
Risk Model: UniquesAre the number of uniques expected to be greater than Safe Harbor?
DisclosurePolicy
Acceptable?
GHCMARMAYN-A V-B N-A V-B
Generalized Ethnicity (AA, White, Other)Age at 5 Year BinsGeneralized Ethnicity AND Age at 5 year binsAge at 10 Year Bins
29
Red = more risk than Safe Harbor Green = risk no worse than Safe Harbor
We Can Be More Formal
Identifiability © Bradley Malin, 2010 30
Patient
We can guarantee no record maps to less than k people in the shared data.
Cohort
Beyond Today:We Can Model the Concerns
DNA
ClinicalFeatures
Familial(Pedigrees)
Demographics
CollectionSemantics
Challenges / Recommendations
• It takes time and energy to appropriately model risks … but we CAN and SHOULD do it– We need to train experts– May require national centers of excellence– We need to support longitudinal studies, possibly across
multiple datasets (a.k.a. anonymous linking)
• We need to determine what is an acceptable level of risk– Dependent on data type– Dependent on person– Dependent on organization
Acknowledgements
• Collaborators• Vanderbilt
– Kathleen Benitez– Ellen Clayton, M.D., J.D.– Grigorios Loukides, Ph.D.– Dan Masys, M.D.– Dan Roden, M.D.
• eMERGE Teams– Group Health Cooperative– Mayo Clinic– Marshfield Clinic– Northwestern University– Vanderbilt University
• Funding• NHGRI @ NIH
– U01 HG004603 (eMERGE network)
• NLM @ NIH– R01 LM009989