

Dept of Biomedical Engineering, Medical Informatics, Linköpings universitet, Linköping, Sweden
http://www.imt.liu.se

A Data Pre-processing Method to Increase Efficiency and Accuracy in Data Mining

Amir R Razavi, Hans Gill, Hans Åhlfeldt, Nosrat Shahsavar

Department of Biomedical Engineering, Division of Medical Informatics

Linköpings universitet, Linköping, Sweden


A Data Pre-processing Method in Data Mining

• Outline
  – Introduction
  – Dataset and variables
  – Data pre-processing
  – Data mining algorithm (DTI)
  – Result
  – Discussion


Introduction

• Abundance of data in medicine and availability of comprehensive registers

• Difficulty in analysing huge amounts of data with traditional methods

• Efficient data mining methods


Introduction

• Applying data mining methods to a breast cancer register

• Pre-processing is an essential part of knowledge discovery in databases

• Finding an efficient pre-processing approach is essential for successful data mining


Methods

• Dataset

• Data pre-processing
  – Data combination and selection
  – Cleaning data
  – Replacing missing values
  – Dimension reduction

• Decision Tree Induction (DTI)

• Performance comparison


Dataset

• 3949 female patients, 1986 to 1995, follow up to 2003

• Data from three registers: regional, tumour marker and death registers, overall more than 150 variables


Variables

Predictor set                 Outcome set ‡
Age                           Distant metastasis, first five years
Quadrant                      Distant metastasis, more than five years
Side                          Loco-regional recurrence, first five years
Tumor size *                  Loco-regional recurrence, more than five years
Lymph node involvement *
Lymph node involvement †
Periglandular growth *
Multiple tumors *
Estrogen receptor
Progesterone receptor
S-phase fraction
DNA index
DNA ploidy

* from pathology report; † N0: not palpable LN metastasis; ‡ all periods are time after diagnosis


Data Pre-processing – Data Selection

• After combining data from the different registers, important variables (predictors and outcomes) were selected in consultation with domain experts:
  – The number of predictors was reduced from more than 150
  – Four important outcomes for breast cancer were chosen


Data Pre-processing – Cleaning Data

• Cleaning the data of outliers and errors, for example:
  – Duration between diagnosis of the disease and the recurrence
  – Age
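The two checks named on this slide can be illustrated with a minimal sketch. The field names (`age`, `diagnosis_date`, `recurrence_date`) and the age bounds are hypothetical and are not taken from the registers; the slides do not say which rules were actually applied.

```python
from datetime import date

def is_plausible(record, min_age=18, max_age=110):
    """Return True if a record passes two basic sanity checks.

    The bounds and field names are illustrative assumptions only.
    """
    # Age outside a plausible range is treated as an entry error.
    if not (min_age <= record["age"] <= max_age):
        return False
    # A recurrence cannot be dated before the diagnosis.
    recurrence = record.get("recurrence_date")
    if recurrence is not None and recurrence < record["diagnosis_date"]:
        return False
    return True

def clean(records):
    """Keep only the records that pass every check."""
    return [r for r in records if is_plausible(r)]

records = [
    {"age": 54, "diagnosis_date": date(1990, 3, 1), "recurrence_date": date(1994, 6, 1)},
    {"age": 540, "diagnosis_date": date(1991, 5, 1), "recurrence_date": None},
    {"age": 61, "diagnosis_date": date(1992, 1, 1), "recurrence_date": date(1989, 1, 1)},
]
cleaned = clean(records)
```

In practice a flagged record might be corrected against the source register rather than dropped; dropping is used here only to keep the sketch short.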


Data Pre-processing - Replacing Missing Values

• EM (expectation maximization) algorithm
  – Dempster et al., 1977
  – An iterative approach that estimates the parameters of a model starting from an initial guess. Each iteration consists of two steps:
    • An expectation step that finds the distribution of the missing data based on the known values of the observed variables and the current estimate of the parameters.
    • A maximization step that re-estimates the parameters, with the missing data replaced by their expected values.
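The two steps can be sketched for one common special case. The code below assumes the rows are draws from a single multivariate normal distribution, which the slides do not state, and `em_impute` is a hypothetical name. It is an illustration of the idea, not the implementation used in the study.

```python
import numpy as np

def em_impute(X, n_iter=50, tol=1e-6):
    """Fill NaNs in X by EM under a multivariate normal model (an assumption)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    missing = np.isnan(X)
    # Initial guess: column means, and the covariance of the mean-filled data.
    mu = np.nanmean(X, axis=0)
    filled = np.where(missing, mu, X)
    sigma = np.atleast_2d(np.cov(filled, rowvar=False)) + 1e-6 * np.eye(p)
    for _ in range(n_iter):
        previous = filled.copy()
        extra = np.zeros((p, p))  # conditional covariance of the missing parts
        for i in range(n):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():
                # Nothing observed in this row: fall back to the current mean.
                filled[i] = mu
                extra += sigma
                continue
            # E-step: conditional mean and covariance of missing given observed.
            s_oo = sigma[np.ix_(o, o)]
            s_mo = sigma[np.ix_(m, o)]
            coef = np.linalg.solve(s_oo, s_mo.T).T
            filled[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
            extra[np.ix_(m, m)] += sigma[np.ix_(m, m)] - coef @ s_mo.T
        # M-step: re-estimate the mean and covariance from the completed data.
        mu = filled.mean(axis=0)
        centered = filled - mu
        sigma = (centered.T @ centered + extra) / n
        if np.max(np.abs(filled - previous)) < tol:
            break
    return filled
```

Observed entries are never changed; only the NaN positions are replaced by their conditional expectations under the current parameter estimates.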


Data Pre-processing - Dimension Reduction

• Canonical Correlation Analysis (CCA)
  – It investigates the relationship between two sets of variables.
  – A canonical correlation is the correlation of two canonical variates, one representing a set of independent variables, the other a set of dependent variables.
  – A canonical variate is a linear combination of a set of original variables.


Data Pre-processing - Dimension Reduction

– The aim is to create a number of canonical solutions, each consisting of a linear combination of one set of variables:

$$U_i = a_1 X_1 + a_2 X_2 + \dots + a_m X_m$$

and a linear combination of the other set of variables:

$$V_i = b_1 Y_1 + b_2 Y_2 + \dots + b_n Y_n$$

– The goal is to determine the coefficients (a’s and b’s) that maximize the correlation between canonical variates $U_i$ and $V_i$.
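One standard way to find such coefficients is through a QR decomposition of each centered variable set followed by a singular value decomposition. The sketch below uses that route; the function name `cca` is hypothetical, both sets are assumed to have full column rank, and the slides do not say which software or algorithm was used in the study.

```python
import numpy as np

def cca(X, Y):
    """Return canonical correlations rho and coefficient matrices A, B.

    Column i of A holds the a-coefficients of U_i, column i of B the
    b-coefficients of V_i. Assumes X and Y have full column rank.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centered sets.
    Qx, Rx = np.linalg.qr(Xc)
    Qy, Ry = np.linalg.qr(Yc)
    # The singular values of Qx^T Qy are the canonical correlations,
    # returned in decreasing order.
    W, rho, Zt = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    # Map the rotations back to coefficients on the original variables.
    A = np.linalg.solve(Rx, W)
    B = np.linalg.solve(Ry, Zt.T)
    return rho, A, B
```

The canonical variates are then `U = (X - X.mean(axis=0)) @ A` and `V = (Y - Y.mean(axis=0)) @ B`, and the correlation between column i of `U` and column i of `V` equals `rho[i]`.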


Data Pre-processing - Dimension Reduction

– To find the important variables in each set (predictors and outcomes), the magnitude of the loadings was used.

– Variables with an absolute loading greater than or equal to 0.3 were considered important and entered into the next step for data mining.

– A loading shows how much each original variable contributes to each canonical variate.
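The selection rule can be sketched as follows, taking a loading to be the correlation between an original variable and a canonical variate. The function names are hypothetical, and keeping a variable when it reaches the threshold on any variate is an assumption; the slides do not say how several variates were combined.

```python
import numpy as np

def loadings(X, U):
    """Correlation of each column of X with each canonical variate in U."""
    X = np.asarray(X, dtype=float)
    U = np.asarray(U, dtype=float)
    if U.ndim == 1:
        U = U[:, None]
    # Standardize both sides so the cross-product is a correlation matrix.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Us = (U - U.mean(axis=0)) / U.std(axis=0)
    return Xs.T @ Us / len(X)

def select_variables(names, L, threshold=0.3):
    """Keep variables whose absolute loading reaches the threshold
    on at least one variate (an assumption, see the text above)."""
    keep = (np.abs(L) >= threshold).any(axis=1)
    return [name for name, k in zip(names, keep) if k]
```

For example, `select_variables(names, loadings(X, U))` returns the names of the columns of `X` that would pass to the data mining step under the 0.3 cut-off.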


Data Pre-processing - Dimension Reduction

• Variables with their loadings


Data Mining Algorithm

• Decision Tree Induction (DTI)
  – A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a classification or decision.
  – Each internal node denotes a test on a variable, each branch stands for an outcome of the test, leaf nodes represent an outcome, and the uppermost node in a tree is the root node.
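The structure described above can be written down as a small data type. In the sketch below the variable names, thresholds and labels of the example tree are invented for illustration and are not results from the study. The sketch covers only binary tests on numeric variables, and it shows how a finished tree classifies a record rather than how the tree is induced.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    label: str  # the classification at this leaf

@dataclass
class Node:
    variable: str               # variable tested at this internal node
    threshold: float
    below: Union["Node", Leaf]        # branch taken when value < threshold
    at_or_above: Union["Node", Leaf]  # branch taken otherwise

def classify(tree, record):
    """Follow the tests from the root node down to a leaf."""
    while isinstance(tree, Node):
        if record[tree.variable] < tree.threshold:
            tree = tree.below
        else:
            tree = tree.at_or_above
    return tree.label

def count_leaves(tree):
    if isinstance(tree, Leaf):
        return 1
    return count_leaves(tree.below) + count_leaves(tree.at_or_above)

def tree_size(tree):
    """Total number of nodes, internal nodes plus leaves."""
    if isinstance(tree, Leaf):
        return 1
    return 1 + tree_size(tree.below) + tree_size(tree.at_or_above)

# Hypothetical tree: all names and numbers are made up.
example = Node(
    "tumor_size_mm", 20.0,
    Leaf("no recurrence"),
    Node("positive_lymph_nodes", 1.0, Leaf("no recurrence"), Leaf("recurrence")),
)
```

For a binary tree like this one the size is always twice the number of leaves minus one, so the two complexity measures carry the same information.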


Resulting Decision Tree


Performance comparison

• Sensitivity = $\frac{TP}{TP + FN}$

• Specificity = $\frac{TN}{TN + FP}$

• Accuracy = $\frac{TP + TN}{TP + TN + FP + FN}$

• Number of leaves and tree size

TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively.
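The three ratios are simple to compute from the four confusion-matrix counts. The function name below is hypothetical and the counts in the example are made up, not taken from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # share of actual positives found
        "specificity": tn / (tn + fp),   # share of actual negatives found
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Made-up counts for illustration.
metrics = classification_metrics(tp=8, tn=6, fp=4, fn=2)
```

With these counts the sensitivity is 8/10, the specificity is 6/10 and the accuracy is 14/20.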


Performance Comparison

• Comparing different approaches

                    Without           With replacing    With
                    pre-processing    missing values    pre-processing
Accuracy            54%               57%               67%
Sensitivity         83%               82%               80%
Specificity         41%               46%               63%
Number of leaves    137               196               14
Tree size           273               391               27


Discussion

• Effective data pre-processing is a very important step in knowledge discovery
  – Real-world data are usually:
    • Incomplete
    • Noisy
    • Inconsistent
    • Not collected for data mining


Discussion

• Replacing missing values before dimension reduction
  – Providing more information to CCA for dimension reduction

• Running CCA prior to DTI
  – Reducing the number of variables while increasing accuracy of classification
  – Considerable increase in the interpretability of DTI


Discussion

• In medical studies, pre-processing is often not done before DTI

• Proper pre-processing, including consulting with domain experts, replacing missing values and dimension reduction, prepares the data for better data mining by DTI

• Our approach increases the accuracy and interpretability of DTI


Future Work

• Increase the efficiency of knowledge discovery in medical registers.

• Validate the result of our methodology (pre-processing prior to data mining) with domain experts for the prediction of recurrence of cancer.

• Determine how to use the discovered knowledge and integrate it into the clinical workflow.

• Improve the quality of registers by adding and completing important predictors.


Thanks for your attention

[email protected]