Master of Science in Engineering in Computer Science (MSE-CS)

DIPARTIMENTO DI INGEGNERIA INFORMATICA AUTOMATICA E GESTIONALE

ANTONIO RUBERTI

Seminars in Software and Services for the Information Society

Umberto Nanni

Lara Malfatti (MD-Thesis, March 2013)

Seminars of Software and Services for the Information Society. Lara Malfatti - MD Thesis (Advisor: Umberto Nanni)

Data Mining for evaluating the risk of chemotherapy-associated thrombosis

Outline

• Problem and contextualization
• Data Mining methodologies
• Dataset preprocessing
• Attribute selection
• Classification
• Cost evaluation
• Conclusion

Venous Thrombo-Embolism (VTE)

• Its incidence rises from 0.1% in the general population to 3% in cancer patients
• It is the second leading cause of mortality in cancer patients
• Its treatment is a major cost for the National Health Service (about 8,000 € per patient)

Data set description

The dataset contains 565 instances (526 negative + 39 positive).

Each entry contains 35 variables, which can be grouped into:

1. Patient risk factors, such as age, sex, laboratory analyses and comorbid conditions (e.g. obesity)
2. Cancer risk factors, such as site and stage of the tumor
3. Treatment risk factors, such as the administration of chemotherapy or targeted therapy agents

State of the art

Terminology

• Classification process: takes an instance as input and tries to forecast whether it will be positive or negative
• Medical evaluation metrics are derived from the related confusion matrix
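
The relation between the confusion matrix and the metrics used later in the slides (Accuracy, PPV, NPV) can be sketched as follows. The function name and the example counts are illustrative assumptions, not figures from the thesis.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Return (accuracy, PPV, NPV) from the four confusion-matrix counts."""
    total = tp + fp + tn + fn
    # Accuracy: fraction of all instances that are classified correctly
    accuracy = (tp + tn) / total
    # PPV (positive predictive value): fraction of predicted positives that are real positives
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    # NPV (negative predictive value): fraction of predicted negatives that are real negatives
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return accuracy, ppv, npv


# Illustrative counts only (not results from the thesis)
acc, ppv, npv = confusion_metrics(tp=30, fp=70, tn=880, fn=20)
print(f"accuracy={acc:.2f} PPV={ppv:.2f} NPV={npv:.2f}")
```

A high NPV with a low PPV, as in this toy example, is the typical pattern when positives are rare.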

Statistical approach: Khorana’s score

This model uses 5 biological variables as predictors and classifies patients into three risk categories: low, intermediate and high risk.

Risk category      Low   Intermediate   High
Num. of patients   280   252            33

Metric     Value
Accuracy   53%
PPV        10%
NPV        96%

Pros:
• Simple and clear model
• Low cost of predictive variables

Cons:
• Too many patients classified as “intermediate risk”
• Poor performance
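
To make the first drawback concrete, the share of patients in each risk category can be computed from the counts reported above. This is a small arithmetic sketch; only the three counts come from the slides, the variable names are assumptions.

```python
# Patients per Khorana risk category, as reported in the slides
counts = {"low": 280, "intermediate": 252, "high": 33}

total = sum(counts.values())
# Fraction of the whole cohort that falls in each category
shares = {category: n / total for category, n in counts.items()}

for category, share in shares.items():
    print(f"{category:>12}: {counts[category]:3d} patients ({share:.1%})")
```

The total equals the 565 instances of the data set, and a large fraction of them ends up in the intermediate category, where the score gives little guidance.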

Challenge:

• Is it possible to find better variable combinations able to predict thrombosis through data mining?
• What is the best predictive combination in terms of cost/benefit among all the possible ones?
• Are the screening costs of these combinations sustainable by the National Health Service?

Knowledge Discovery in Health Care

WEKA

WEKA: Waikato Environment for Knowledge Analysis

• It is a free tool for data mining applications, written in Java
• It implements all the steps of the KDD workflow, from data preprocessing to the visualization of discovered patterns
• Here, attention is focused on the data preprocessing, attribute selection and learning phases

WEKA: learning phase

Learning phase: training and testing data sets must be disjoint.

An unbalanced data set causes:
• Excessive influence of the majority class on the classification model
• High global performance without forecasting a single instance of the minority class

Balanced training and testing datasets are created manually during the preprocessing phase.
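
The second effect can be illustrated with the class sizes of this data set (526 negative, 39 positive): a trivial model that always predicts the majority class reaches a high accuracy while finding no positive case. The sketch below is an illustration of that effect, not a model used in the thesis.

```python
def majority_baseline(n_negative, n_positive):
    """Accuracy and minority recall of a model that always predicts the majority class."""
    total = n_negative + n_positive
    # Every majority-class instance is classified correctly, every minority one is missed
    accuracy = max(n_negative, n_positive) / total
    minority_recall = 0.0
    return accuracy, minority_recall


# Class sizes taken from the data set description
accuracy, recall = majority_baseline(n_negative=526, n_positive=39)
print(f"accuracy={accuracy:.1%}, minority recall={recall:.0%}")
```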

Data set pre-processing: cleaning (1/3)

Create three balanced folds and combine the partial results:
• All the instances are classified exactly once
• All the training sets have the same number of positive and negative instances
• Training and testing datasets are disjoint

Extra cost: each experiment needs three runs.
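
One way to obtain folds with these properties is sketched below. It assumes that balancing is done by undersampling the negative (majority) class in each training set; the slides do not say how the folds were actually built, so the function and its parameters are hypothetical.

```python
import random


def balanced_three_fold(labels, n_folds=3, seed=0):
    """Yield (train_idx, test_idx) pairs; labels are 0 (negative) or 1 (positive).

    Assumes positives are the minority class.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    # Deal indices round-robin so each class is spread evenly over the folds
    pos_folds = [pos[k::n_folds] for k in range(n_folds)]
    neg_folds = [neg[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        # The test set is fold k: every instance is tested exactly once overall
        test = pos_folds[k] + neg_folds[k]
        train_pos = [i for j in range(n_folds) if j != k for i in pos_folds[j]]
        rest_neg = [i for j in range(n_folds) if j != k for i in neg_folds[j]]
        # Undersample the remaining negatives to match the number of positives
        train_neg = rng.sample(rest_neg, len(train_pos))
        yield train_pos + train_neg, test


# Class sizes taken from the data set description
labels = [1] * 39 + [0] * 526
folds = list(balanced_three_fold(labels))
for train, test in folds:
    print(len(train), "training instances,", len(test), "test instances")
```

The three runs are then executed separately and their confusion matrices are summed before computing the metrics.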

Data set pre-processing: cleaning (2/3)

The objective is to remove noisy instances.

• VTE normally occurs within 6 months of the beginning of chemotherapy
• The time interval is extended to 12 months to also cover asymptomatic events

Outliers are given by:
• Intrinsic probability of having a thrombotic event
• Changes in anticancer treatments

Data set preprocessing: improvements (3/3)

Unstructured numerical data are aggregated so that they do not negatively influence the classification model (see figure).

Instances with missing values are discarded because:
• Artificial values cannot correspond to real cases
• They can create problems in both the training and testing data sets
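
Both steps can be sketched in a few lines, assuming that "aggregated" means grouping raw numeric values into intervals. The attribute names, cut points and values below are hypothetical and only serve the illustration.

```python
from bisect import bisect_right


def drop_incomplete(records):
    """Keep only the records in which no attribute is missing (None)."""
    return [r for r in records if all(v is not None for v in r.values())]


def to_bin(value, cut_points, names):
    """Map a numeric value to the name of the interval it falls in.

    cut_points must be sorted; names needs one more entry than cut_points.
    """
    return names[bisect_right(cut_points, value)]


# Hypothetical attributes and values, for illustration only
records = [
    {"age": 71, "platelets": 410},
    {"age": None, "platelets": 250},  # discarded: age is missing
    {"age": 48, "platelets": 180},
]

complete = drop_incomplete(records)
for r in complete:
    r["age_group"] = to_bin(r["age"], [50, 65], ["<50", "50-64", ">=65"])
print(complete)
```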

Attribute selection (1/2)

Feature selection returns meaningful subsets of the original attributes, ignoring the ones which provide no information.

Filter methods:
• they are independent of any learning algorithm and rely only on data properties
• they can be seen as the combination of search techniques, which propose new subsets, and evaluation metrics, which rank them

WEKA provides many search methods and evaluation metrics.

Attribute selection (2/2)

GreedyStepwise: performs a greedy search through the space of attribute subsets (forward or backward), starting here from the empty set

CfsSubsetEval (correlation-based feature subset evaluation): prefers subsets with attributes highly correlated with the class but having low inter-correlation
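
The interplay of the two components can be sketched as a greedy forward search driven by the correlation-based merit k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean attribute-class correlation and r_ff the mean attribute-attribute correlation. This is a simplified stand-in, not WEKA's implementation: it uses absolute Pearson correlation (WEKA's evaluator may use a different correlation measure), and the attribute names and toy data are invented.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences (0.0 if undefined)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def cfs_merit(subset, columns, target):
    """Merit of a subset: high class correlation, low inter-correlation."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(columns[f], target)) for f in subset) / k
    pairs = list(combinations(subset, 2))
    r_ff = (sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
            if pairs else 0.0)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


def greedy_forward(columns, target, tol=1e-9):
    """Start from the empty set; add the best attribute while the merit improves."""
    selected, best = [], 0.0
    while True:
        candidates = [f for f in columns if f not in selected]
        if not candidates:
            return selected
        merit, feature = max((cfs_merit(selected + [f], columns, target), f)
                             for f in candidates)
        if merit <= best + tol:
            return selected
        selected.append(feature)
        best = merit


# Invented toy data: two informative attributes and one uninformative attribute
target = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "marker_a": [0, 0, 0, 1, 1, 1, 1, 1],
    "marker_b": [0, 0, 0, 0, 0, 1, 1, 1],
    "noise":    [0, 1, 0, 1, 0, 1, 0, 1],
}
selected = greedy_forward(columns, target)
print(sorted(selected))
```

On this toy data the search keeps the two informative attributes and stops before adding the uninformative one, because adding it lowers the merit.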

Classification

Guidelines:

• For each subset found in the previous step, some experiments are conducted using different learning algorithms
• PPV, NPV and Accuracy are compared; Khorana’s results are used as benchmarks
• A constraint is fixed: no NPV values lower than 96% are allowed

WEKA provides a variety of learning algorithms; the ones used in the experiments are:
• Bayes algorithms, Decision trees, Covering rules, Logistic regression functions and Lazy algorithms
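
The comparison rule in the guidelines can be sketched as a filter on experiment outcomes. Only the Khorana benchmark values and the 96% threshold come from the slides; the two experiment entries are invented placeholders, and the names `Result`, `admissible` and `improves_on` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Result:
    name: str
    accuracy: float
    ppv: float
    npv: float


# Benchmark values reported in the slides for Khorana's score
KHORANA = Result("Khorana score", accuracy=0.53, ppv=0.10, npv=0.96)
MIN_NPV = 0.96  # constraint stated in the guidelines


def admissible(results, min_npv=MIN_NPV):
    """Discard every experiment whose NPV falls below the constraint."""
    return [r for r in results if r.npv >= min_npv]


def improves_on(result, benchmark=KHORANA):
    """True if the experiment beats the benchmark on both PPV and accuracy."""
    return result.ppv > benchmark.ppv and result.accuracy > benchmark.accuracy


# Invented outcomes, for illustration only (not results from the thesis)
experiments = [
    Result("subset A + classifier X", accuracy=0.70, ppv=0.20, npv=0.97),
    Result("subset B + classifier Y", accuracy=0.80, ppv=0.30, npv=0.94),  # NPV too low
]
kept = [r for r in admissible(experiments) if improves_on(r)]
print([r.name for r in kept])
```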

Classification: Accuracy

All the predictive groups have better accuracy than Pure-KS

Classification: NPV

The Khorana group violates the NPV constraint: its NPV is under 96%

Classification: PPV

The WEKA and ThP groups double the PPV obtained by Pure-KS

Cost Evaluation (1/2)

Evaluation of the screening cost and possible NHS savings

Cost Evaluation (2/2)

• In all cases, the National Health Service saves money from correctly predicted thrombosis (no treatment needed) and covers the screening costs at the same time
• Augmented-KS is the best predictive combination from an economic point of view
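
The cost/benefit reasoning can be sketched as a simple balance: treatment costs avoided for correctly predicted events minus the cost of screening every patient. The treatment cost of about 8,000 € per patient is taken from the slides; the number of correctly predicted events, the per-patient screening cost and the function name are hypothetical, and the sketch assumes each correctly predicted event avoids the full treatment cost.

```python
TREATMENT_COST_EUR = 8000  # approximate VTE treatment cost per patient, from the slides


def net_saving(true_positives, patients_screened, screening_cost_per_patient,
               treatment_cost=TREATMENT_COST_EUR):
    """Avoided treatment cost minus total screening cost (positive means a saving)."""
    avoided = true_positives * treatment_cost
    screening = patients_screened * screening_cost_per_patient
    return avoided - screening


# Hypothetical inputs: 20 correctly predicted events, 565 screened patients,
# 50 EUR of screening per patient
saving = net_saving(true_positives=20, patients_screened=565,
                    screening_cost_per_patient=50)
print(f"net saving: {saving} EUR")
```

A predictive combination is sustainable under this model whenever the result is positive, that is, whenever the avoided treatments pay for the screening.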

Conclusion and future works

From the use of data mining for the study of chemotherapy-associated thrombosis:
• PPV increases by 150% with respect to the statistical approach
• The NHS saves money from correctly predicted thrombosis and covers the screening costs at the same time

Given the limited size of the dataset analyzed, better results may be reached by:
• repeating the experiments with more biological variables
• repeating the experiments with more instances in the dataset