anonymity through data cubes

Linked2Safety Project (FP7-ICT-2011-7 – 5.3) A NEXT-GENERATION, SECURE LINKED DATA MEDICAL INFORMATION SPACE FOR SEMANTICALLY-INTERCONNECTING ELECTRONIC HEALTH RECORDS AND CLINICAL TRIALS SYSTEMS ADVANCING PATIENTS SAFETY IN CLINICAL RESEARCH 12 th International Conference on Bioinformatics and Bioengineering, Larnaka Anonymity through Data cubes Athos Antoniades

Post on 23-Feb-2016

Page 1: Anonymity through Data cubes

Linked2Safety Project (FP7-ICT-2011-7 – 5.3)

A NEXT-GENERATION, SECURE LINKED DATA MEDICAL INFORMATION SPACE FOR SEMANTICALLY-INTERCONNECTING ELECTRONIC HEALTH RECORDS AND CLINICAL TRIALS SYSTEMS ADVANCING PATIENTS SAFETY IN CLINICAL RESEARCH

12th International Conference on Bioinformatics and Bioengineering, Larnaka

Anonymity through Data cubes

Athos Antoniades

Page 2: Anonymity through Data cubes

FP7, ICT-2011 – 5.3 Page 2

Introduction

• Why share data?
• What are the current legal and ethical limitations?
• How have scientists shared medical data so far?
• Key problems
• Perturbation
• Cell suppression

Page 3: Anonymity through Data cubes

The Problem

Why share data:
• Replication testing
• Statistical power
• Multiple testing problem

Legal and ethical issues:
• Anonymization vs. pseudonymization
• Limitations derived from the consent form signed by subjects
• Other regional, study-specific, or subject-specific issues

Page 4: Anonymity through Data cubes

How have scientists shared medical data

Contingency table and data cube example:

          aa    aA    AA
Case      U00   U01   U02
Control   U10   U11   U12
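As a minimal sketch of how such a table could be produced, the Python snippet below aggregates subject-level records into counts. The records, the field layout, and the function name `build_cube` are hypothetical and are not taken from Linked2Safety; only the Case/Control by aa/aA/AA layout follows the slide.

```python
from collections import Counter

# Hypothetical subject-level records: (status, genotype). Not real study data.
records = [
    ("Case", "aa"), ("Case", "aA"), ("Case", "aA"), ("Case", "AA"),
    ("Control", "aa"), ("Control", "aa"), ("Control", "aA"), ("Control", "AA"),
]

def build_cube(rows):
    """Count how many subjects fall into each combination of values."""
    return Counter(rows)

cube = build_cube(records)

# Print the 2 x 3 contingency table (U00..U12 in the slide's notation).
genotypes = ["aa", "aA", "AA"]
for status in ["Case", "Control"]:
    print(status, [cube[(status, g)] for g in genotypes])
```

With more than two variables per record, the same counting step yields a data cube. Only the counts are shared, not the individual records.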

Page 5: Anonymity through Data cubes

16 year old widow Problem

A paper that analyzes data from a specific study reports:

Marital Status by Age

Age     Married   Widowed   Single
0-16    0         1         50
18-24   10        5         50
25-34   40        7         40
35~     60        15        20
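The Widowed cell for ages 0-16 has a count of 1, so anyone who knows that a widow of that age took part in the study can single that person out. A minimal sketch of flagging such cells follows. The function name `risky_cells` and the threshold of 5 are illustrative assumptions.

```python
# Counts copied from the table above: age group -> marital status -> count.
table = {
    "0-16":  {"Married": 0,  "Widowed": 1,  "Single": 50},
    "18-24": {"Married": 10, "Widowed": 5,  "Single": 50},
    "25-34": {"Married": 40, "Widowed": 7,  "Single": 40},
    "35~":   {"Married": 60, "Widowed": 15, "Single": 20},
}

def risky_cells(counts, threshold=5):
    """Return non-empty cells whose count is below the (assumed) threshold."""
    return [
        (age, status, n)
        for age, row in counts.items()
        for status, n in row.items()
        if 0 < n < threshold
    ]

print(risky_cells(table))
```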

Page 8: Anonymity through Data cubes

Categorization Differences

Paper 1 that analyzes data from a specific study reports:

Marital Status by Age

Age     Married   Widowed   Single
0-16    NA        NA        50
18-24   10        7         50
25-34   40        7         40
35~     60        15        20

Paper 2 that analyzes data from the same study reports:

Marital Status by Age

Age     Married   Widowed   Single
0-16    NA        NA        50
18-25   10        8         50
26-35   45        7         40
36~     55        14        20
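Even with the small cells suppressed, the two papers can be combined. Assuming both report exact counts for the same subjects, subtracting Paper 1's 18-24 row from Paper 2's 18-25 row leaves the counts for 25-year-olds alone, and the same reasoning on the last rows isolates the 35-year-olds. A minimal sketch, with illustrative variable names:

```python
STATUSES = ["Married", "Widowed", "Single"]

# Rows copied from the two tables above (the suppressed 0-16 row is omitted).
paper1 = {"18-24": [10, 7, 50], "25-34": [40, 7, 40], "35~": [60, 15, 20]}
paper2 = {"18-25": [10, 8, 50], "26-35": [45, 7, 40], "36~": [55, 14, 20]}

def diff(a, b):
    """Element-wise difference a - b of two count rows."""
    return [x - y for x, y in zip(a, b)]

# Ages 18-25 minus ages 18-24 leaves exactly the 25-year-olds.
age_25 = diff(paper2["18-25"], paper1["18-24"])
# Ages 35 and over minus ages 36 and over leaves exactly the 35-year-olds.
age_35 = diff(paper1["35~"], paper2["36~"])

print(dict(zip(STATUSES, age_25)))
print(dict(zip(STATUSES, age_35)))
```

Under these assumptions, one 25-year-old widow and one 35-year-old widow become visible, although neither paper reports a cell below 5.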

Page 9: Anonymity through Data cubes

Perturbation and Cell Suppression

Original Data

Marital Status by Age

Age     Married   Widowed   Single
0-16    0         1         50
18-24   10        7         50
25-34   40        7         40
35~     60        15        20

Perturbation (±1) and Cell Suppression (<5)

Marital Status by Age

Age     Married   Widowed   Single
0-16    NA        NA        51
18-24   9         8         49
25-34   40        7         41
35~     61        14        21
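A minimal sketch of the two operations follows. The slides do not state the noise distribution or the order of the steps, so this sketch assumes that suppression is decided on the original count and that noise is drawn uniformly from the integers in the given range. The function name and the `seed` parameter are illustrative.

```python
import random

def perturb_and_suppress(table, noise=1, threshold=5, seed=None):
    """Suppress cells below `threshold`, add integer noise to the rest.

    Assumptions (not stated in the slides): suppression is decided on the
    original count, and noise is uniform over the integers -noise..noise.
    Suppressed cells are returned as None (shown as NA in the slide).
    """
    rng = random.Random(seed)
    result = {}
    for row_label, row in table.items():
        new_row = []
        for count in row:
            if count < threshold:
                new_row.append(None)
            else:
                new_row.append(count + rng.randint(-noise, noise))
        result[row_label] = new_row
    return result

# Original data from the slide: columns are Married, Widowed, Single.
original = {
    "0-16":  [0, 1, 50],
    "18-24": [10, 7, 50],
    "25-34": [40, 7, 40],
    "35~":   [60, 15, 20],
}

for age, row in perturb_and_suppress(original, noise=1, threshold=5, seed=0).items():
    print(age, row)
```

Because the noise is random, the printed values will generally differ from the perturbed table on the slide while staying within ±1 of the original counts.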

Page 10: Anonymity through Data cubes

Evaluation

• Most common parameters tested:
  Perturbation: [0], [-1,1], [-3,3], [-5,5], [-10,10]
  Cell suppression: <0, <=1, <=3, <=5, <=10

• Standard main effect test using Chi Square

• Pearson’s correlation coefficient used to evaluate the deviation of each parameter combination from the original results.

• A priori defined threshold for Pearson’s correlation coefficient: <=0.95.
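The evaluation described above can be sketched as follows: compute the chi-square statistic of every table before and after perturbation, then correlate the two sets of statistics. The tables below are hypothetical, the function names are illustrative, and the uniform integer noise is an assumption; cell suppression is left out to keep the sketch short.

```python
import math
import random

def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def perturb(table, noise, rng):
    """Add uniform integer noise in [-noise, noise] to every cell (never below 0)."""
    return [[max(0, c + rng.randint(-noise, noise)) for c in row] for row in table]

# Hypothetical case/control x genotype tables, one per tested variable.
tables = [
    [[30, 50, 20], [40, 45, 15]],
    [[25, 50, 25], [26, 49, 25]],
    [[10, 40, 50], [30, 45, 25]],
    [[35, 45, 20], [20, 50, 30]],
]

rng = random.Random(0)
original_stats = [chi_square(t) for t in tables]
for noise in (0, 1, 3, 5, 10):
    noisy_stats = [chi_square(perturb(t, noise, rng)) for t in tables]
    print(noise, round(pearson(original_stats, noisy_stats), 3))
```

Each coefficient would then be compared with the a priori threshold of 0.95 to decide whether a parameter combination preserves the original results well enough.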

Page 11: Anonymity through Data cubes

Evaluating Parameters with a matrix of graphs

Page 12: Anonymity through Data cubes

Linked2Safety’s Data Analysis Space

Objectives:
• Design and develop the data mining techniques and the scalable infrastructure for the identification of phenotypic and genetic associations related to adverse events.
• Develop new and implement existing state-of-the-art analytical approaches for genetic data.
• Define and implement the knowledge extraction and filtering mechanisms and the knowledge base.
• Integrate the knowledge base into a lightweight decision support system (adverse events early detection mechanism).

Page 13: Anonymity through Data cubes

Data Analysis Steps

Page 14: Anonymity through Data cubes

Quality Control Subspace

Provides the tools for identifying and removing erroneous data or data that do not conform to the quality standards that a user might define.

Tools:
• Hardy-Weinberg Equilibrium Test
• Allele Frequency Test
• Missing Data Test
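As an illustration of the first two tools, the sketch below computes the standard chi-square statistic for deviation from Hardy-Weinberg equilibrium and the minor allele frequency from genotype counts. It is a textbook formulation under the assumption of a biallelic marker, not the Linked2Safety implementation; the function names and the counts are hypothetical.

```python
def hwe_chi_square(n_aa, n_aA, n_AA):
    """Chi-square statistic for deviation from Hardy-Weinberg equilibrium."""
    n = n_aa + n_aA + n_AA
    p = (2 * n_AA + n_aA) / (2 * n)   # frequency of allele A
    q = 1 - p                          # frequency of allele a
    expected = {"aa": q * q * n, "aA": 2 * p * q * n, "AA": p * p * n}
    observed = {"aa": n_aa, "aA": n_aA, "AA": n_AA}
    return sum(
        (observed[g] - expected[g]) ** 2 / expected[g]
        for g in expected
        if expected[g] > 0
    )

def minor_allele_frequency(n_aa, n_aA, n_AA):
    """Frequency of the rarer of the two alleles."""
    n = n_aa + n_aA + n_AA
    p = (2 * n_AA + n_aA) / (2 * n)
    return min(p, 1 - p)

# Hypothetical genotype counts for one marker.
print(hwe_chi_square(30, 40, 30))          # compare with a chi-square cutoff (1 df)
print(minor_allele_frequency(10, 30, 60))
```

A marker would typically be removed when the statistic exceeds a user-defined cutoff or when the minor allele frequency falls below a user-defined minimum.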

Page 15: Anonymity through Data cubes

Feature Selection Subspace

Provides the tools for removing redundant or irrelevant features from a dataset.

Tools:
• Rough Set Feature Selection
• Information Gain Feature Selection
• Chi Squared Feature Selection
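To illustrate information gain feature selection, the sketch below scores a feature by how much it reduces the entropy of the outcome. The data and names are hypothetical, and the project's actual implementation is not shown in the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Reduction in label entropy obtained by splitting on `feature`."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical data: adverse event (1) or not (0) for four subjects.
adverse_event = [1, 1, 0, 0]
genotype = ["aa", "aa", "AA", "AA"]   # separates the two outcomes perfectly
gender = ["F", "M", "F", "M"]         # carries no information about the outcome

print(information_gain(genotype, adverse_event))
print(information_gain(gender, adverse_event))
```

Features with a gain near zero, such as `gender` in this toy example, are candidates for removal.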

Page 16: Anonymity through Data cubes

Data Analysis Steps

Page 17: Anonymity through Data cubes

Single Hypothesis Testing Subspace

Provides the tools for performing single hypothesis testing on a dataset and testing for associations.

Tools:
• Pearson’s Chi Square Test
• Fisher’s Exact Test
• Odds Ratio
• Binomial Logistic Regression
• Linkage Disequilibrium
• Genetic Region Based Association Testing
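Two of these tools can be sketched with the standard library alone: the odds ratio of a 2 x 2 table and a two-sided Fisher's exact test based on the hypergeometric distribution. The table and function names are hypothetical, and this is a textbook formulation rather than the platform's code.

```python
from math import comb

def odds_ratio(a, b, c, d):
    """Odds ratio of a 2 x 2 table [[a, b], [c, d]]; needs b and c non-zero."""
    return (a * d) / (b * c)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        prob(x) for x in range(low, high + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )

# Hypothetical table: rows exposed / not exposed, columns event / no event.
print(odds_ratio(3, 1, 1, 3))                         # 9.0
print(round(fisher_exact_two_sided(3, 1, 1, 3), 4))   # 0.4857
```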

Page 18: Anonymity through Data cubes

Data Mining Subspace

Provides the tools for performing data mining analyses on a dataset and extracting association rules.

Tools:
• Association Rules (apriori)
• Decision Trees with Percentage Split (C4.5)
• Decision Trees with Cross Validation (C4.5)
• Random Forest with Percentage Split
• Random Forest with Cross Validation
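Association rule mining keeps rules whose support and confidence exceed user-defined minimums. The sketch below shows only these two measures on hypothetical patient profiles; it does not implement the candidate generation step of apriori, and the names are illustrative.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated probability of `consequent` given `antecedent`."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Hypothetical patient profiles, each a set of observed attributes.
profiles = [
    {"smoker", "hypertension"},
    {"smoker", "hypertension"},
    {"smoker"},
    {"hypertension"},
]

print(support({"smoker", "hypertension"}, profiles))
print(round(confidence({"smoker"}, {"hypertension"}, profiles), 3))
```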

Page 19: Anonymity through Data cubes

Data Analysis Space Interactions

Page 20: Anonymity through Data cubes

Data Analysis Steps

Page 21: Anonymity through Data cubes

Knowledge Extraction and Filtering Mechanism

Knowledge Extraction Mechanism
• This mechanism is responsible for storing statistically significant associations and important association rules in the Linked2Safety knowledge database.
• Has two steps: logging system, and storing important knowledge.

Filtering Mechanism
• This mechanism allows users to insert or delete associations and association rules.

Page 22: Anonymity through Data cubes

Adverse Event Early Detection Mechanism

• Uses the knowledge in the L2S knowledge base
• Runs in the background to identify new associations and association rules
• Reruns analyses when updated datasets are available
• Creates alerts for patient profiles associated with adverse events

Page 23: Anonymity through Data cubes

Linked2Safety’s Data Analysis Platform

Page 24: Anonymity through Data cubes

Linked2Safety’s Data Analysis Platform Workflow Screenshot

Page 25: Anonymity through Data cubes

Patterns Discovery: Common Variable Selection

Variables: overlapping non-genetic data of at least 2 data providers

Age                                  Weight gain
Gender                               Headaches
BMI                                  Gastrointestinal symptoms
Smoking Ever                         Ophthalmological problems
Dyslipidemia                         Type of ophthalmological condition
Diabetes                             High blood pressure
Diabetes type I                      Heart conditions exist
Diabetes type II                     Type of heart condition
Anemia                               Hypertension
Depressive personality disorder      Myocardial infarction
Major depressive disorder            Stroke
Schizotypal personality disorder     Coronary heart disease
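Selecting the variables that at least 2 data providers have in common can be sketched as a simple count over each provider's variable list. The provider names and their variable lists below are hypothetical; only the "at least 2 providers" rule comes from the slide.

```python
from collections import Counter

def common_variables(providers, min_providers=2):
    """Return variables reported by at least `min_providers` providers."""
    counts = Counter(
        variable
        for variables in providers.values()
        for variable in set(variables)   # count each provider at most once
    )
    return sorted(v for v, n in counts.items() if n >= min_providers)

# Hypothetical providers and the non-genetic variables each one collects.
providers = {
    "provider_a": ["Age", "Gender", "BMI", "Stroke"],
    "provider_b": ["Age", "Gender", "Diabetes"],
    "provider_c": ["Age", "BMI", "Hypertension"],
}

print(common_variables(providers))
```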

Page 26: Anonymity through Data cubes

Conclusion and future work on utilizing data cubes

For a given dataset, we were able to identify the maximum noise that can be added to the data without significantly affecting the outcomes.

The results presented are relevant only to MASTOS. For any other dataset, the analytical approach described must be repeated to determine the maximum noise that can be added.

Further investigation is necessary to identify the minimum parameter settings to satisfy legal and ethical requirements.

Page 27: Anonymity through Data cubes

Who to Contact

Athos Antoniades
University of Cyprus

email: [email protected]