anonymity through data cubes
Linked2Safety Project (FP7-ICT-2011-7 – 5.3)
A NEXT-GENERATION, SECURE LINKED DATA MEDICAL INFORMATION SPACE FOR
SEMANTICALLY-INTERCONNECTING ELECTRONIC HEALTH RECORDS AND CLINICAL TRIALS SYSTEMS
ADVANCING PATIENTS SAFETY IN CLINICAL RESEARCH
12th International Conference on Bioinformatics and Bioengineering, Larnaka
Anonymity through Data cubes
Athos Antoniades
FP7, ICT-2011 – 5.3 Page 2
Introduction
• Why Share Data?
• What are the current legal and ethical limitations?
• How have scientists shared medical data so far?
• Key Problems
• Perturbation
• Cell Suppression
The Problem
Why share data:
• Replication Testing
• Statistical Power
• Multiple Testing Problem
Legal and Ethical Issues:
• Anonymization vs Pseudonymization
• Limitations derived from the consent form signed by subjects
• Other regional, study, or subject-specific issues
How have scientists shared medical data
Contingency Table and Data Cube example
           aa    aA    AA
Case       U00   U01   U02
Control    U10   U11   U12
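A minimal sketch of how such a table could be built from subject-level records, so that only the cell counts (U00 ... U12) are shared rather than the records themselves. The function name and the sample records are invented for illustration and are not taken from the project.

```python
from collections import Counter

GENOTYPES = ("aa", "aA", "AA")
GROUPS = ("Case", "Control")

def contingency_table(records):
    """Count subjects per (group, genotype) cell.

    `records` is an iterable of (group, genotype) pairs; the result is a
    2 x 3 list of counts laid out like the table on the slide.
    """
    counts = Counter(records)
    return [[counts[(group, gt)] for gt in GENOTYPES] for group in GROUPS]

# Hypothetical subject-level data, invented for illustration only.
records = [
    ("Case", "aa"), ("Case", "aA"), ("Case", "aA"), ("Case", "AA"),
    ("Control", "aa"), ("Control", "aa"), ("Control", "aA"), ("Control", "AA"),
]

table = contingency_table(records)
for group, row in zip(GROUPS, table):
    print(group, row)
```

A data cube generalizes this idea to more than two variables, with one axis per variable.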
16 year old widow Problem
A paper that analyzes data from a specific study reports:
            Marital Status
Age      Married  Widowed  Single
0-16     0        1        50
18-24    10       5        50
25-34    40       7        40
35~      60       15       20
Categorization Differences
Paper 1 that analyzes data from a specific study reports:
            Marital Status
Age      Married  Widowed  Single
0-16     NA       NA       50
18-24    10       7        50
25-34    40       7        40
35~      60       15       20
Paper 2 that analyzes data from the same study reports:
            Marital Status
Age      Married  Widowed  Single
0-16     NA       NA       50
18-25    10       8        50
26-35    45       7        40
36~      55       14       20
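One way to read the two tables: the 18-25 bracket of Paper 2 contains the 18-24 bracket of Paper 1, so subtracting the rows isolates subjects aged exactly 25. The sketch below performs that subtraction. It assumes both papers tabulate exactly the same subjects and that neither table was perturbed; the function name is hypothetical.

```python
# Rows copied from the two tables above.
paper1_18_24 = {"Married": 10, "Widowed": 7, "Single": 50}
paper2_18_25 = {"Married": 10, "Widowed": 8, "Single": 50}

def bracket_difference(wider_row, narrower_row):
    """Subtract a narrower age bracket from a wider one, cell by cell."""
    return {status: wider_row[status] - narrower_row[status] for status in wider_row}

# Counts attributable to the single year of age that only Paper 2 includes.
age_25 = bracket_difference(paper2_18_25, paper1_18_24)
print(age_25)
```

Under those assumptions the difference leaves a single widowed 25-year-old, which is the kind of small cell that suppression is meant to hide.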
Perturbation and Cell Suppression
Original Data
            Marital Status
Age      Married  Widowed  Single
0-16     0        1        50
18-24    10       7        50
25-34    40       7        40
35~      60       15       20
Perturbation (±1) and Cell Suppression (<5)
            Marital Status
Age      Married  Widowed  Single
0-16     NA       NA       51
18-24    9        8        49
25-34    40       7        41
35~      61       14       21
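The transformation on this slide could be sketched as follows. The slide does not say how the noise is drawn or in which order the two steps run, so this sketch assumes uniform integer noise and assumes that suppression is decided on the original counts before noise is added. The function name and parameters are illustrative, not the project's API.

```python
import random

def suppress_and_perturb(table, max_noise=1, threshold=5, seed=None):
    """Return a copy of `table` with small cells suppressed and the rest perturbed.

    Cells whose original count is below `threshold` become None (shown as NA
    on the slide). Every other cell gets uniform integer noise drawn from
    [-max_noise, +max_noise] (an assumption) and is clamped at zero.
    """
    rng = random.Random(seed)
    protected = []
    for row in table:
        new_row = []
        for count in row:
            if count < threshold:
                new_row.append(None)  # suppressed cell
            else:
                noise = rng.randint(-max_noise, max_noise)
                new_row.append(max(0, count + noise))
        protected.append(new_row)
    return protected

# Original counts from the slide (rows: age brackets; columns: Married, Widowed, Single).
original = [
    [0, 1, 50],
    [10, 7, 50],
    [40, 7, 40],
    [60, 15, 20],
]

protected = suppress_and_perturb(original, max_noise=1, threshold=5, seed=0)
for row in protected:
    print(["NA" if cell is None else cell for cell in row])
```

Because the noise is random, the printed values will generally differ from the slide's example; only the suppressed cells and the ±1 bound are guaranteed.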
Evaluation
• Most common parameters tested
  Perturbation: [0], [-1,1], [-3,3], [-5,5], [-10,10]
  Cell Suppression: <0, <=1, <=3, <=5, <=10
• Standard main effect test using Chi Square
• Pearson’s Correlation Coefficient used to evaluate the deviation of each parameter combination from the original results.
• A priori defined threshold for Pearson’s correlation coefficient: <=0.95.
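A rough sketch of this evaluation, under the assumption that the quantity being correlated is the chi-square statistic of each variable, computed once on the original tables and once on the protected tables. The slides do not specify this detail, and the tables below are invented for illustration.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic for an r x c table of counts.

    Assumes every row and column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical case/control x genotype tables for three variables.
original_tables = [
    [[30, 50, 20], [20, 50, 30]],
    [[10, 40, 50], [25, 45, 30]],
    [[33, 34, 33], [30, 40, 30]],
]
# The same tables after a hypothetical +-1 perturbation.
perturbed_tables = [
    [[31, 49, 20], [19, 50, 31]],
    [[10, 41, 49], [26, 45, 29]],
    [[32, 35, 33], [30, 39, 31]],
]

original_stats = [chi_square(t) for t in original_tables]
perturbed_stats = [chi_square(t) for t in perturbed_tables]
r = pearson_r(original_stats, perturbed_stats)
print(f"correlation between original and perturbed results: {r:.3f}")
```

Repeating this for every perturbation and suppression setting gives one correlation value per parameter combination, which can then be compared against the predefined threshold.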
Evaluating Parameters with a matrix of graphs
Linked2Safety’s Data Analysis Space
Objectives:
• Design and develop the data mining techniques and the scalable infrastructure for the identification of phenotypic and genetic associations related to adverse events.
• Develop new and implement existing state-of-the-art analytical approaches for genetic data.
• Define and implement the knowledge extraction and filtering mechanisms and the knowledge base.
• Integrate the knowledge base into a lightweight decision support system (adverse events early detection mechanism).
Data Analysis Steps
Quality Control Subspace
Provides the tools for identifying and removing erroneous data or data that do not conform to the quality standards that a user might define.
Tools:
• Hardy-Weinberg Equilibrium Test
• Allele Frequency Test
• Missing Data Test
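As an illustration of the first tool, the sketch below runs the textbook chi-square version of the Hardy-Weinberg equilibrium test on genotype counts (aa, aA, AA). It is a generic formulation with one degree of freedom, not necessarily the variant implemented in the platform, and the function name is hypothetical.

```python
import math

def hardy_weinberg_test(n_aa, n_aA, n_AA):
    """Chi-square test of Hardy-Weinberg equilibrium for one biallelic marker.

    Returns (chi2, p_value). The p-value uses the chi-square distribution
    with 1 degree of freedom, whose survival function is erfc(sqrt(x / 2)).
    """
    n = n_aa + n_aA + n_AA
    p = (2 * n_AA + n_aA) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele a
    observed = (n_aa, n_aA, n_AA)
    expected = (q * q * n, 2 * p * q * n, p * p * n)
    # Skip cells with zero expectation (monomorphic marker) to avoid dividing by zero.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical genotype counts, invented for illustration.
chi2, p_value = hardy_weinberg_test(n_aa=25, n_aA=50, n_AA=25)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```

A marker whose p-value falls below a user-defined cutoff would typically be flagged for removal during quality control.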
Feature Selection Subspace
Provides the tools for removing redundant or irrelevant features from a dataset.
Tools:
• Rough Set Feature Selection
• Information Gain Feature Selection
• Chi Squared Feature Selection
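To make the information gain criterion concrete, here is a minimal sketch that scores categorical features by how much they reduce the entropy of a class label. The feature values and labels are invented, and the platform's actual implementation is not described in the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Reduction in label entropy obtained by splitting on a categorical feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical data: the genotype separates the classes, gender does not.
labels = ["case", "case", "control", "control"]
genotype = ["AA", "AA", "aa", "aa"]
gender = ["M", "F", "M", "F"]

print("genotype:", information_gain(genotype, labels))
print("gender:  ", information_gain(gender, labels))
```

Features whose gain falls below a chosen cutoff would be candidates for removal as irrelevant.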
Single Hypothesis Testing Subspace
Provides the tools for performing single hypothesis testing on a dataset and testing for associations.
Tools:
• Pearson’s Chi Square Test
• Fisher’s Exact Test
• Odds Ratio
• Binomial Logistic Regression
• Linkage Disequilibrium
• Genetic Region Based Association Testing
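As one example from this list, the odds ratio for a 2 x 2 exposure-by-outcome table can be sketched as below, together with an approximate 95% confidence interval based on the standard error of the log odds ratio (the Woolf method). The cell layout, function name, and counts are assumptions made for illustration.

```python
import math

def odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio and approximate confidence interval for a 2 x 2 table.

    Assumed layout: a = exposed cases, b = unexposed cases,
    c = exposed controls, d = unexposed controls. All four cells must be
    non-zero. z = 1.96 corresponds to a 95% interval.
    """
    estimate = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper

# Hypothetical counts, invented for illustration.
estimate, lower, upper = odds_ratio(a=20, b=10, c=10, d=20)
print(f"OR = {estimate:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

An interval that excludes 1 suggests an association between exposure and outcome at roughly the 5% level.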
Data Mining Subspace
Provides the tools for performing data mining analyses on a dataset and extracting association rules.
Tools:
• Association Rules (apriori)
• Decision Trees with Percentage Split (C4.5)
• Decision Trees with Cross Validation (C4.5)
• Random Forest with Percentage Split
• Random Forest with Cross Validation
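Association rule mining ranks candidate rules by measures such as support and confidence. The sketch below computes those two measures for a single rule; it is not an implementation of the full apriori search, and the patient records are invented for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical patient records, each a set of observed attributes.
transactions = [
    {"smoker", "hypertension"},
    {"smoker", "hypertension", "stroke"},
    {"smoker"},
    {"hypertension"},
]

rule_support = support(transactions, {"smoker", "hypertension"})
rule_confidence = confidence(transactions, {"smoker"}, {"hypertension"})
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

Apriori-style algorithms keep only the rules whose support and confidence exceed user-chosen minimums.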
Data Analysis Space Interactions
Knowledge Extraction and Filtering Mechanism
Knowledge Extraction Mechanism
• This mechanism is responsible for storing statistically significant associations and important association rules in the Linked2Safety knowledge database.
• Has two steps:
  Logging system
  Storing important knowledge
Filtering Mechanism
• This mechanism allows users to insert or delete associations and association rules.
Adverse Event Early Detection Mechanism
• Uses the knowledge in the L2S knowledge base
• Runs in the background to identify new associations and association rules
• Reruns analyses when updated datasets are available
• Creates alerts for patient profiles associated with adverse events
Linked2Safety’s Data Analysis Platform
Linked2Safety’s Data Analysis Platform Workflow Screenshot
Patterns Discovery Common Variable Selection
Overlapping non-genetic data of at least 2 data providers
Variables:
Age                                 Weight gain
Gender                              Headaches
BMI                                 Gastrointestinal symptoms
Smoking Ever                        Ophthalmological problems
Dyslipidemia                        Type of ophthalmological condition
Diabetes                            High blood pressure
Diabetes type I                     Heart conditions exist
Diabetes type II                    Type of heart condition
Anemia                              Hypertension
Depressive personality disorder     Myocardial infarction
Major depressive disorder           Stroke
Schizotypal personality disorder    Coronary heart disease
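The selection rule (variables shared by at least 2 data providers) could be sketched as below. The provider names and their variable lists are placeholders invented for illustration; the slides do not state which provider holds which variable.

```python
from collections import Counter

def common_variables(provider_variables, min_providers=2):
    """Return the variables recorded by at least `min_providers` providers."""
    counts = Counter(
        variable
        for variables in provider_variables.values()
        for variable in set(variables)  # count each provider at most once
    )
    return sorted(v for v, c in counts.items() if c >= min_providers)

# Hypothetical providers and variable lists (placeholders, not project data).
provider_variables = {
    "provider_a": ["Age", "Gender", "BMI", "Stroke"],
    "provider_b": ["Age", "Gender", "Hypertension"],
    "provider_c": ["Age", "BMI", "Anemia"],
}

shared = common_variables(provider_variables)
print(shared)
```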
Conclusion and future work on utilizing data cubes
• We were able to identify, for a given dataset, the maximum noise that can be added to the data without significantly affecting the outcomes.
• The results presented are relevant only to MASTOS; for any other dataset, the analytical approach described must be repeated to determine the maximum noise that can be added.
• Further investigation is necessary to identify the minimum parameter settings that satisfy legal and ethical requirements.
Who to Contact
Athos Antoniades
University of Cyprus
email: [email protected]