Master of Science in Engineering in Computer Science (MSE-CS)
DIPARTIMENTO DI INGEGNERIA INFORMATICA AUTOMATICA E GESTIONALE
ANTONIO RUBERTI
Seminars in Software and Services for the Information Society
Umberto Nanni
Lara Malfatti (MD-Thesis, March 2013)
Seminars of Software and Services for the Information Society - Lara Malfatti, MD Thesis (Advisor: Umberto Nanni)
Data Mining for evaluating the risk of chemotherapy-associated thrombosis
Outline
• Problem and contextualization
• Data Mining methodologies
• Dataset preprocessing
• Attribute selection
• Classification
• Cost evaluation
• Conclusion
Venous Thrombo-Embolism (VTE)
• Its incidence rises from 0.1% in the general population to 3% in cancer patients
• It is the second leading cause of mortality in cancer patients
• Its treatment is a major cost for the National Health Service (about €8,000 per patient)
Data set description
The dataset contains 565 instances (526 negative + 39 positive).
Each entry contains 35 variables, which can be grouped into:
1. Patient risk factors: such as age, sex, laboratory analyses and comorbid conditions (e.g. obesity)
2. Cancer risk factors: such as site and stage of the tumor
3. Treatment risk factors: such as administration of chemotherapy or targeted therapy agents
State of the art
Terminology
• Classification process: takes an instance as input and tries to predict whether it will be positive or negative
• Medical evaluation metrics are derived from the related confusion matrix
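The three metrics used throughout these slides (Accuracy, PPV, NPV) can be computed from the four cells of a confusion matrix. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and were chosen only to exercise the code, they are not figures from the thesis.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Return accuracy, PPV and NPV from the four confusion-matrix counts."""
    total = tp + fp + tn + fn
    # Accuracy: fraction of all instances that were classified correctly.
    accuracy = (tp + tn) / total
    # PPV: fraction of predicted positives that are truly positive.
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    # NPV: fraction of predicted negatives that are truly negative.
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return {"accuracy": accuracy, "ppv": ppv, "npv": npv}


# Hypothetical counts (not taken from the thesis).
example = confusion_metrics(tp=20, fp=80, tn=440, fn=25)
print(example)
```

PPV and NPV are returned as NaN when the classifier never predicts the corresponding class, which matters for strongly unbalanced data.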
Statistical approach: Khorana’s score
This model uses 5 biological variables as predictors and classifies
patients into three risk categories: low, intermediate and high risk

Risk category      LOW   INTERMEDIATE   HIGH
Num. of patients   280   252            33

Metrics    Values
Accuracy   53%
PPV        10%
NPV        96%

Pros:
• Simple and clear model
• Low cost of predictive variables
Cons:
• Too many patients classified as “intermediate risk”
• Poor performance
Challenge:
• Is it possible to find, through data mining, better combinations of variables able to predict thrombosis?
• What is the best predictive combination in terms of cost/benefit among all the possible ones?
• Are the screening costs of these combinations sustainable by the National Health Service?
Knowledge Discovery in Health Care
WEKA
WEKA: Waikato Environment for Knowledge Analysis
• It is a free tool for data mining applications, written in Java
• It implements all the steps of the KDD workflow, from data preprocessing to the visualization of discovered patterns
• Attention is focused on data preprocessing, attribute selection and the learning phase
WEKA: learning phase
Learning phase: training and testing data sets must be disjoint
An unbalanced data set causes:
• Excessive influence of the majority class on the classification model
• High global performance without forecasting a single instance of the minority class
The creation of balanced training and testing datasets is conducted manually during the preprocessing phase
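The second effect can be checked with the class counts reported for this dataset (526 negative, 39 positive). The sketch below evaluates a trivial classifier that labels every instance as negative; the function name is hypothetical, and sensitivity (the fraction of positives that are detected) is a standard metric not used elsewhere in these slides.

```python
NEGATIVES, POSITIVES = 526, 39  # class counts reported for the dataset


def majority_baseline(negatives, positives):
    """Metrics of a classifier that labels every instance as negative."""
    tp, fp = 0, 0                   # it never predicts the positive class
    tn, fn = negatives, positives   # so every positive becomes a false negative
    accuracy = (tp + tn) / (negatives + positives)
    sensitivity = tp / (tp + fn)
    return accuracy, sensitivity


accuracy, sensitivity = majority_baseline(NEGATIVES, POSITIVES)
print(f"accuracy={accuracy:.1%}, sensitivity={sensitivity:.1%}")
# prints: accuracy=93.1%, sensitivity=0.0%
```

An accuracy above 93% is reached without identifying a single thrombosis case, which is why balanced training sets are built before learning.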
Data set pre-processing: cleaning (1/3)
Create three balanced folds and combine the partial results
• All the instances are classified exactly once
• All the training sets have the same number of positive and negative instances
Training and testing datasets are disjoint
Extra cost: each experiment requires three runs
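One way to satisfy the properties listed above is sketched below. The slides do not describe the exact procedure, so this is an assumption: every instance is assigned to exactly one test fold, and each training set takes the positives of the other two folds plus an equal number of negatives sampled from those same folds. The function name and the integer instance ids are hypothetical.

```python
import random


def balanced_three_fold(positives, negatives, k=3, seed=0):
    """Build k (train, test) splits with balanced training sets.

    Assumed scheme (not confirmed by the slides): negatives in the
    training part are undersampled to match the number of positives.
    """
    rng = random.Random(seed)
    pos, neg = positives[:], negatives[:]
    rng.shuffle(pos)
    rng.shuffle(neg)
    pos_folds = [pos[i::k] for i in range(k)]
    neg_folds = [neg[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        # The test fold keeps the original class distribution.
        test = pos_folds[i] + neg_folds[i]
        train_pos = [x for j in range(k) if j != i for x in pos_folds[j]]
        neg_pool = [x for j in range(k) if j != i for x in neg_folds[j]]
        # Undersample negatives so both classes have the same size.
        train_neg = rng.sample(neg_pool, len(train_pos))
        splits.append((train_pos + train_neg, test))
    return splits


# Instance ids 0..38 stand for the 39 positives, 39..564 for the 526 negatives.
splits = balanced_three_fold(list(range(39)), list(range(39, 565)))
for train, test in splits:
    print(len(train), len(test))
```

Each of the three runs trains on one split and classifies its test fold; the partial confusion matrices are then summed to obtain the combined result.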
Data set pre-processing: cleaning (2/3)
The objective is to remove noisy instances
• VTE normally occurs within 6 months from the beginning of chemotherapy
• The time interval is extended to 12 months to also cover asymptomatic events
Outliers are given by:
• Intrinsic probability of having a thrombotic event
• Changes in anticancer treatments
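A minimal sketch of this cleaning step follows. It assumes that a positive instance counts as noisy when its thrombotic event falls outside the 12-month window; the slides do not state the exact rule, and the field name `months_to_vte` is hypothetical.

```python
WINDOW_MONTHS = 12  # observation window stated on the slide


def remove_noisy(instances, window=WINDOW_MONTHS):
    """Drop instances whose VTE occurred after the observation window.

    `months_to_vte` (hypothetical field) is None for patients without
    an event, otherwise the months elapsed since chemotherapy started.
    """
    kept = []
    for inst in instances:
        months = inst["months_to_vte"]
        if months is not None and months > window:
            continue  # assumed outlier: event outside the window
        kept.append(inst)
    return kept


# Made-up records, only to show the behaviour.
sample = [
    {"id": 1, "months_to_vte": None},
    {"id": 2, "months_to_vte": 4},
    {"id": 3, "months_to_vte": 20},
]
cleaned = remove_noisy(sample)
print([inst["id"] for inst in cleaned])
```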
Data set preprocessing: improvements (3/3)
Unstructured numerical data are aggregated so that they do not negatively affect the classification model (see figure)
Instances with missing values are discarded because:
• Artificial values cannot correspond to real cases
• They can create problems in both the training and testing data sets
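Both steps can be sketched in a few lines. The record layout, attribute names, cut points and labels below are hypothetical, since the actual aggregation is only shown in the figure. The `"?"` marker is the way missing values are written in WEKA's ARFF files.

```python
import math


def is_missing(value):
    """True for None, NaN, or the ARFF missing-value marker '?'."""
    if value is None or value == "?":
        return True
    return isinstance(value, float) and math.isnan(value)


def drop_incomplete(rows):
    """Keep only the instances in which every attribute is present."""
    return [r for r in rows if not any(is_missing(v) for v in r.values())]


def aggregate(value, cut_points, labels):
    """Map a number to a category; labels has len(cut_points) + 1 entries."""
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]


# Made-up records and cut points, only to show the behaviour.
rows = [
    {"age": 60, "platelets": 300.0},
    {"age": None, "platelets": 250.0},
    {"age": 55, "platelets": "?"},
]
complete = drop_incomplete(rows)
age_group = aggregate(complete[0]["age"], [40, 65], ["young", "middle", "old"])
print(complete, age_group)
```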
Attribute selection (1/2)
Feature selection returns meaningful subsets of the original attributes, ignoring the ones which provide no information
Filter methods:
• they are independent of any learning algorithm and rely only on data properties
• they can be seen as the combination of search techniques for proposing new subsets and evaluation metrics to rank them
WEKA provides many options
Attribute selection (2/2)
GreedyStepwise: performs a greedy search (forward or backward) through the space of attribute subsets, here starting from the empty set
CfsSubsetEval (correlation-based feature subset evaluation): prefers subsets with attributes highly correlated with the class but having low inter-correlation
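The interplay between the search and the evaluator can be illustrated outside WEKA. The sketch below is not WEKA's implementation: it pairs a plain forward greedy search with the standard correlation-based merit, merit = k * r_cf / sqrt(k + k(k-1) * r_ff), where k is the subset size, r_cf the mean attribute-class correlation and r_ff the mean attribute-attribute correlation. The attribute names and correlation values are made up.

```python
from math import sqrt


def cfs_merit(subset, class_corr, feat_corr):
    """Correlation-based merit of an attribute subset."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(class_corr[f]) for f in subset) / k
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = (sum(abs(feat_corr[frozenset(p)]) for p in pairs) / len(pairs)
            if pairs else 0.0)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


def greedy_forward(features, class_corr, feat_corr):
    """Add the best attribute at each step until the merit stops improving."""
    selected, best = [], 0.0
    while True:
        candidates = [(cfs_merit(selected + [f], class_corr, feat_corr), f)
                      for f in features if f not in selected]
        if not candidates:
            break
        merit, feature = max(candidates)
        if merit <= best:
            break
        selected.append(feature)
        best = merit
    return selected, best


# Made-up correlations: "a" and "b" are both informative but redundant.
class_corr = {"a": 0.6, "b": 0.55, "c": 0.5}
feat_corr = {frozenset(("a", "b")): 0.9,
             frozenset(("a", "c")): 0.0,
             frozenset(("b", "c")): 0.0}
selected, merit = greedy_forward(["a", "b", "c"], class_corr, feat_corr)
print(selected, round(merit, 3))
```

With these values the search keeps "a" and "c" and rejects "b": although "b" correlates well with the class, its high correlation with "a" lowers the merit of any subset containing both.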
Classification
Guidelines:
• For each subset found in the previous step, experiments are conducted using different learning algorithms
• PPV, NPV and Accuracy are compared; Khorana’s results are used as benchmarks
• A constraint is fixed: no NPV values lower than 96% are allowed
WEKA provides a variety of learning algorithms; the ones used in the experiments are:
• Bayes algorithms, Decision trees, Cover rules, Logistic regression functions and Lazy algorithms
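The selection rule in these guidelines can be expressed as a simple filter. In the sketch below the benchmark values are the Khorana metrics reported earlier (Accuracy 53%, PPV 10%, NPV 96%); the candidate names and their numbers are placeholders invented for illustration, not results from the thesis, and the function names are hypothetical.

```python
NPV_MIN = 0.96  # constraint stated in the guidelines
KHORANA = {"accuracy": 0.53, "ppv": 0.10, "npv": 0.96}  # benchmark from the slides


def admissible(results, npv_min=NPV_MIN):
    """Keep only the experiments that respect the NPV constraint."""
    return {name: m for name, m in results.items() if m["npv"] >= npv_min}


def beats_benchmark(metrics, benchmark=KHORANA):
    """True when both PPV and accuracy improve on the benchmark."""
    return (metrics["ppv"] > benchmark["ppv"]
            and metrics["accuracy"] > benchmark["accuracy"])


# Placeholder numbers, NOT results from the thesis.
results = {
    "candidate_A": {"accuracy": 0.70, "ppv": 0.20, "npv": 0.97},
    "candidate_B": {"accuracy": 0.80, "ppv": 0.30, "npv": 0.94},
}
kept = admissible(results)
print({name: beats_benchmark(m) for name, m in kept.items()})
```

Here candidate_B is discarded despite its higher accuracy and PPV, because its NPV falls below the 96% threshold.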
Classification: Accuracy
All the predictive groups have better accuracy than Pure-KS
Classification: NPV
The Khorana group violates the NPV constraint: its NPV is below 96%
Classification: PPV
The WEKA and ThP groups double the PPV obtained by Pure-KS
Cost Evaluation (1/2)
Evaluation of the screening cost and potential NHS savings
Cost Evaluation (2/2)
• In all cases, the National Health Service saves money from correctly predicted thrombosis (no treatment needed) and covers the screening costs at the same time
• Augmented-KS is the best predictive combination from an economic point of view
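The economic argument reduces to simple arithmetic, sketched below under two assumptions: each correctly predicted thrombosis avoids the full treatment cost (about €8,000 per patient, as stated earlier), and screening is paid for every patient. The function name and the screening cost and counts in the example are hypothetical, since the actual figures are not given in the text.

```python
TREATMENT_COST_EUR = 8000  # approximate VTE treatment cost per patient (from the slides)


def net_saving(true_positives, screened_patients, screening_cost_per_patient,
               treatment_cost=TREATMENT_COST_EUR):
    """Avoided treatment costs minus the total screening cost, in euros."""
    avoided = true_positives * treatment_cost
    screening = screened_patients * screening_cost_per_patient
    return avoided - screening


# Hypothetical inputs: 10 correct predictions, 565 patients screened at 50 EUR each.
print(net_saving(true_positives=10, screened_patients=565,
                 screening_cost_per_patient=50))
# prints: 51750
```

A positive result means the avoided treatments pay for the screening of all patients; a predictive combination with more expensive variables needs more true positives to break even.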
Conclusion and future works
From the use of data mining for the study of chemotherapy-associated thrombosis:
• PPV increases by 150% with respect to the statistical approach
• NHS saves money from correctly predicted thrombosis and covers the screening costs at the same time
Given the limited size of the analyzed dataset, better results may be reached by:
• repeating the experiments with more biological variables
• repeating the experiments with more instances in the dataset