anonymity through data cubes

Linked2Safety Project (FP7-ICT-2011-7 – 5.3)

A NEXT-GENERATION, SECURE LINKED DATA MEDICAL INFORMATION SPACE FOR SEMANTICALLY-INTERCONNECTING ELECTRONIC HEALTH RECORDS AND CLINICAL TRIALS SYSTEMS ADVANCING PATIENTS SAFETY IN CLINICAL RESEARCH

12th International Conference on Bioinformatics and Bioengineering, Larnaka

Anonymity through Data cubes

Athos Antoniades

FP7, ICT-2011 – 5.3

Introduction

• Why share data?
• What are the current legal and ethical limitations?
• How have scientists shared medical data so far?
• Key problems
• Perturbation
• Cell suppression


The Problem

Why share data:
• Replication testing
• Statistical power
• Multiple testing problem

Legal and ethical issues:
• Anonymization vs. pseudonymization
• Limitations derived from the consent form signed by subjects
• Other regional, study-specific, or subject-specific issues


How have scientists shared medical data: Contingency Table and Data Cube example

          aa    aA    AA
Case      U00   U01   U02
Control   U10   U11   U12
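A minimal sketch of how a table with this layout could be counted from subject-level records. The record format (a status and a genotype per subject), the function name, and the sample records are assumptions made for illustration; they are not taken from the project.

```python
from collections import Counter

GROUPS = ("Case", "Control")      # table rows
GENOTYPES = ("aa", "aA", "AA")    # table columns

def contingency_table(records):
    """Count subjects per (group, genotype) cell.

    `records` is a hypothetical list of (group, genotype) tuples,
    one per subject. Only the counts leave this function, never
    the individual records.
    """
    counts = Counter(records)
    return [[counts[(group, genotype)] for genotype in GENOTYPES]
            for group in GROUPS]

# Synthetic records, for illustration only.
records = [
    ("Case", "aa"), ("Case", "aA"), ("Case", "aA"),
    ("Control", "aa"), ("Control", "AA"), ("Control", "AA"),
]
table = contingency_table(records)
for group, row in zip(GROUPS, table):
    print(group, row)
```

A data cube extends the same idea to more than two variables: each extra variable adds one more key to the counted tuple.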


16-year-old widow problem

A paper that analyzes data from a specific study reports:

          Marital Status
Age       Married   Widowed   Single
0-16      0         1         50
18-24     10        5         50
25-34     40        7         40
35~       60        15        20


Categorization Differences

Paper 1 that analyzes data from a specific study reports:

          Marital Status
Age       Married   Widowed   Single
0-16      NA        NA        50
18-24     10        7         50
25-34     40        7         40
35~       60        15        20

Paper 2 that analyzes data from the same study reports:

          Marital Status
Age       Married   Widowed   Single
18-25     10        8         50
26-35     45        7         40
36~       55        14        20


Perturbation and Cell Suppression

Original Data

          Marital Status
Age       Married   Widowed   Single
0-16      0         1         50
18-24     10        7         50
25-34     40        7         40
35~       60        15        20

Perturbation (±1) and Cell Suppression (<5)

          Marital Status
Age       Married   Widowed   Single
18-24     9         8         49
25-34     40        7         41
35~       61        14        21
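The two operations can be sketched as follows. This is an illustrative sketch, not the project's implementation: the function name, the use of `None` for a suppressed cell, the uniform integer noise, and the choice to test suppression on the original count before adding noise are all assumptions.

```python
import random

def perturb_and_suppress(table, noise=1, threshold=5, rng=None):
    """Return a copy of `table` with small cells hidden and the rest perturbed.

    Assumptions (not stated in the slides):
    - a cell is suppressed when its ORIGINAL count is below `threshold`;
    - noise is an integer drawn uniformly from [-noise, +noise];
    - perturbed counts are clamped at zero.
    """
    rng = rng or random.Random()
    result = []
    for row in table:
        new_row = []
        for count in row:
            if count < threshold:
                new_row.append(None)  # suppressed cell
            else:
                new_row.append(max(0, count + rng.randint(-noise, noise)))
        result.append(new_row)
    return result

# Original data from the slide (rows: 0-16, 18-24, 25-34, 35~).
original = [
    [0, 1, 50],
    [10, 7, 50],
    [40, 7, 40],
    [60, 15, 20],
]
released = perturb_and_suppress(original, noise=1, threshold=5,
                                rng=random.Random(0))
for row in released:
    print(row)
```

With these settings the single widow in the youngest age group no longer appears in the released table, and every remaining count differs from the original by at most 1.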


Evaluation

• Most common parameters tested:
  Perturbation: [0], [-1,1], [-3,3], [-5,5], [-10,10]
  Cell suppression: <0, <=1, <=3, <=5, <=10

• Standard main effect test using Chi Square

• Pearson’s correlation coefficient used to evaluate the deviation of each parameter combination from the original results.

• A priori defined threshold for Pearson’s correlation coefficient: <=0.95.
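One way to read this evaluation is sketched below, under stated assumptions. The chi-square statistic is computed for a set of tables, once on the original counts and once on perturbed counts, and Pearson's correlation between the two sets of statistics measures how far the perturbed results drift. The tables are synthetic, the helper names are hypothetical, and correlating test statistics (rather than p-values) is an assumption; cell suppression is left out for brevity.

```python
import random

def chi_square(table):
    """Pearson chi-square statistic for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def perturb(table, noise, rng):
    """Add uniform integer noise in [-noise, +noise] to every cell."""
    return [[max(0, c + rng.randint(-noise, noise)) for c in row]
            for row in table]

rng = random.Random(42)
# 50 synthetic 2 x 3 case/control tables (NOT study data).
tables = [[[rng.randint(20, 200) for _ in range(3)] for _ in range(2)]
          for _ in range(50)]
original_stats = [chi_square(t) for t in tables]

correlations = {}
for noise in (0, 1, 3, 5, 10):
    noisy_stats = [chi_square(perturb(t, noise, rng)) for t in tables]
    correlations[noise] = pearson(original_stats, noisy_stats)
    print(f"perturbation [-{noise},{noise}]: r = {correlations[noise]:.3f}")
```

Each correlation can then be compared against the a priori threshold to decide whether a parameter combination is acceptable.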


Evaluating Parameters with a matrix of graphs


Linked2Safety’s Data Analysis Space

Objectives:
• Design and develop the data mining techniques and the scalable infrastructure for the identification of phenotypic and genetic associations related to adverse events.
• Develop new and implement existing state-of-the-art analytical approaches for genetic data.
• Define and implement the knowledge extraction and filtering mechanisms and the knowledge base.
• Integrate the knowledge base into a lightweight decision support system (adverse events early detection mechanism).


Data Analysis Steps


Quality Control Subspace

Provides the tools for identifying and removing erroneous data or data that do not conform to the quality standards that a user might define.

Tools:
• Hardy-Weinberg Equilibrium Test
• Allele Frequency Test
• Missing Data Test
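As an illustration of the first tool, a Hardy-Weinberg equilibrium check can be run directly on aggregate genotype counts. The sketch below uses the textbook chi-square goodness-of-fit version with one degree of freedom; the function name and the example counts are assumptions, and the platform's actual implementation may differ.

```python
import math

def hwe_chi_square(n_aa, n_aA, n_AA):
    """Chi-square test for Hardy-Weinberg equilibrium at a biallelic locus.

    Returns (statistic, p_value). The p-value uses the chi-square
    survival function for 1 degree of freedom, erfc(sqrt(x / 2)).
    """
    n = n_aa + n_aA + n_AA
    p = (2 * n_aa + n_aA) / (2 * n)   # frequency of allele a
    q = 1 - p                         # frequency of allele A
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_aA, n_AA)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Synthetic genotype counts, for illustration only.
stat_ok, p_ok = hwe_chi_square(25, 50, 25)    # exactly at equilibrium
stat_bad, p_bad = hwe_chi_square(50, 0, 50)   # no heterozygotes at all
print(stat_ok, p_ok)
print(stat_bad, p_bad)
```

A variant whose p-value falls below a user-defined cut-off would be flagged for removal during quality control.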


Feature Selection Subspace

Provides the tools for removing redundant or irrelevant features from a dataset.

Tools:
• Rough Set Feature Selection
• Information Gain Feature Selection
• Chi Squared Feature Selection
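To make one of these concrete, information gain scores a feature by how much knowing its value reduces the entropy of the outcome. The sketch below is a generic textbook version on synthetic values; the function names and data are assumptions, not the platform's code.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditional on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(group) / total * entropy(group)
                      for group in groups.values())
    return entropy(labels) - conditional

# Synthetic example: one feature that predicts the outcome, one that does not.
labels = ["case", "case", "control", "control"]
informative = ["yes", "yes", "no", "no"]
irrelevant = ["a", "b", "a", "b"]
gain_informative = information_gain(informative, labels)
gain_irrelevant = information_gain(irrelevant, labels)
print(gain_informative, gain_irrelevant)
```

Features whose gain falls below a chosen cut-off would be candidates for removal.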


Data Analysis Steps


Single Hypothesis Testing Subspace

Provides the tools for performing single hypothesis testing on a dataset and testing for associations.

Tools:
• Pearson’s Chi Square Test
• Fisher’s Exact Test
• Odds Ratio
• Binomial Logistic Regression
• Linkage Disequilibrium
• Genetic Region Based Association Testing
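As a small example of one of these tools, an odds ratio can be computed from the four cells of a 2 x 2 contingency table. The sketch below adds a 95% confidence interval using the standard log (Woolf) method; the function name, argument order, and counts are assumptions made for illustration.

```python
import math

def odds_ratio(exposed_cases, unexposed_cases,
               exposed_controls, unexposed_controls, z=1.96):
    """Odds ratio with an approximate confidence interval (Woolf method).

    All four counts must be non-zero. z = 1.96 gives a 95% interval.
    """
    a, b, c, d = (exposed_cases, unexposed_cases,
                  exposed_controls, unexposed_controls)
    ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, lower, upper

# Synthetic counts, for illustration only.
ratio, lower, upper = odds_ratio(20, 80, 10, 90)
print(f"OR = {ratio:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Because the calculation needs only cell counts, it can run on a contingency table or data cube without access to subject-level records.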


Data Mining Subspace

Provides the tools for performing data mining analyses on a dataset and extracting association rules.

Tools:
• Association Rules (Apriori)
• Decision Trees with Percentage Split (C4.5)
• Decision Trees with Cross Validation (C4.5)
• Random Forest with Percentage Split
• Random Forest with Cross Validation


Data Analysis Space Interactions


Data Analysis Steps


Knowledge Extraction and Filtering Mechanism

Knowledge Extraction Mechanism
• This mechanism is responsible for storing statistically significant associations and important association rules in the Linked2Safety knowledge database
• Has two steps:
  • Logging system
  • Storing important knowledge

Filtering Mechanism
• This mechanism allows users to insert or delete associations and association rules


Adverse Event Early Detection Mechanism

• Uses the knowledge in the L2S knowledge base
• Runs in the background to identify new associations and association rules
• Reruns analyses when updated datasets are available
• Creates alerts for patient profiles associated with adverse events


Linked2Safety’s Data Analysis Platform


Linked2Safety’s Data Analysis Platform Workflow Screenshot


Patterns Discovery Common Variable Selection

Overlapping non-genetic data of at least 2 data providers

Variables:
• Age
• Gender
• BMI
• Smoking Ever
• Dyslipidemia
• Diabetes
• Diabetes type I
• Diabetes type II
• Anemia
• Depressive personality disorder
• Major depressive disorder
• Schizotypal personality disorder
• Weight gain
• Headaches
• Gastrointestinal symptoms
• Ophthalmological problems
• Type of ophthalmological condition
• High blood pressure
• Heart conditions exist
• Type of heart condition
• Hypertension
• Myocardial infarction
• Stroke
• Coronary heart disease


Conclusion and future work on utilizing data cubes

We were able to identify for a given dataset the maximum noise that can be added to the data without significantly affecting the outcomes.

The results presented are only relevant to MASTOS; for any other dataset, the analytical approach described must be repeated to determine the maximum noise that can be added.

Further investigation is necessary to identify the minimum parameter settings to satisfy legal and ethical requirements.


Who to Contact

Athos Antoniades
University of Cyprus

email: [email protected]
