Feature Selection for Adverse Event Prediction
A dissertation submitted to The University of
Manchester for the degree of Master of Science by Research
in the Faculty of Engineering and Physical Sciences
2011
Elisabeta Marinoiu
School of Computer Science
Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Declaration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Copyright Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Local Feature Selection-towards personalized medicine . . . . . . . . 2
1.3 Aims and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Project Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Literature review 6
2.1 Feature Selection. Introduction . . . . . . . . . . . . . . . . . . . . . 6
2.2 Feature Selection using Information Theory . . . . . . . . . . . . . . 9
2.2.1 Overview of basic Information Theory Concepts . . . . . . . . 9
2.2.2 Relevancy. Redundancy. Relevancy in context . . . . . . . . . 11
2.2.3 Ranking criterion . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.4 Mutual Information Feature Selection Criterion . . . . . . . . 12
2.2.5 Double Input Symmetrical Relevance Criterion . . . . . . . . . 13
2.2.6 Joint Mutual Information Criterion . . . . . . . . . . . . . . . 14
2.3 Local Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.1 Natural clustering . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.2 Measuring dissimilarity of subproblems . . . . . . . . . . . . . 17
2.3.3 Local feature selection using a clustering-like approach . . . . 18
2.3.4 Local feature selection and dynamic integration of classifiers . 20
2.4 Class-Specific Feature Selection . . . . . . . . . . . . . . . . . . . . . 21
3 Data Preprocessing and Initial Experiments 25
3.1 Data Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.1 Adverse Events Data Set (not used in the experiments) . . . . 26
3.1.2 Subjects Data Set . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1.3 Concomitant Medication Data set . . . . . . . . . . . . . . . . 27
3.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.1 Converting string discrete variables into numbers . . . . . . . 27
3.2.2 Discretization of continuous variables . . . . . . . . . . . . . 27
3.2.3 Missing Values . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.4 Sparse Features and Special Cases . . . . . . . . . . . . . . . . 29
3.3 Initial Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.1 Experiment 1: Ranking features according to Mutual Infor-
mation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.2 Permutation test . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.3 Experiment 2: Local analysis of individual feature importance 33
3.3.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4 Analysis of feature importance within subsets 39
4.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 Assumptions and Limitations . . . . . . . . . . . . . . . . . . . . . . 40
4.2.1 Feature Selection Criterion . . . . . . . . . . . . . . . . . . . . 41
4.3 Identifying the most discriminant features . . . . . . . . . . . . . . . 41
4.3.1 Consistency Index for feature selection . . . . . . . . . . . . . 42
4.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.4 Local Analysis of biomarkers . . . . . . . . . . . . . . . . . . . . . . . 46
4.4.1 Description of the method . . . . . . . . . . . . . . . . . . . . 47
4.4.2 Computing the scores . . . . . . . . . . . . . . . . . . . . . . . 47
4.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4.4 Summary and conclusions . . . . . . . . . . . . . . . . . . . . 52
5 Predictive Model Building 54
5.1 Measures of assessing performance . . . . . . . . . . . . . . . . . . . . 55
5.2 Local vs. Global Analysis . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2.1 Local predictive models . . . . . . . . . . . . . . . . . . . . . 57
5.3 Model Building - Phase I . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3.1 Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . 59
5.4 Model Building - Phase II . . . . . . . . . . . . . . . . . . . . . . . . 64
5.4.1 Balancing class distribution . . . . . . . . . . . . . . . . . . . 64
5.4.2 Ensemble classifiers . . . . . . . . . . . . . . . . . . . . . . . . 65
5.4.3 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . 70
5.5 Chapter summary and Conclusions . . . . . . . . . . . . . . . . . . . 73
5.5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.5.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6 Conclusions 75
6.1 Summary of the research and conclusions . . . . . . . . . . . . . . . . 75
6.1.1 Data Preprocessing and Initial Experiments . . . . . . . . . . 75
6.1.2 Analysis of feature importance within subsets . . . . . . . . . 76
6.1.3 Predictive model building . . . . . . . . . . . . . . . . . . . . 78
6.1.4 How can the proposed techniques be transferred to new data
sets? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.1.5 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.1.6 Model Building . . . . . . . . . . . . . . . . . . . . . . . . . . 84
References 88
List of Figures
2.1 A unified view of the feature selection process [11]. . . . . . . . . . . 7
2.2 Multivariate feature selection [18] . . . . . . . . . . . . . . . . . . . . 13
2.3 Natural clustering of data with regard to the pathogens [12] - an ex-
ample of problem decomposition for a particular microbiological data
set using prior information. . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Schematic view of general wrapper approach to class-dependent fea-
ture selection [17]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1 Feature ranking according to normalized mutual information for Ap-
petite and Neutropenia . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Feature ranking according to normalized mutual information for Nail
disorder and Neuropathy . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 Feature ranking according to normalized mutual information for Ap-
petite in the subset of people who had Large Cell Carcinoma . . . . . 34
3.4 Feature ranking according to normalized mutual information for Ap-
petite in the subset of Females . . . . . . . . . . . . . . . . . . . . . . 35
3.5 Feature ranking according to normalized mutual information for Nail
disorder in the subset of Caucasian people . . . . . . . . . . . . . . . 35
3.6 Feature ranking according to normalized mutual information for Neu-
tropenia disorder in the subset of Males . . . . . . . . . . . . . . . . . 36
3.7 Feature ranking according to normalized mutual information for Ap-
petite in different clusters . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1 Schema for computing the feature scores within subsets . . . . . . . . 47
4.2 Locally important biomarkers for Appetite when splitting the data
on Body Mass Index (left) and on Number of Cycles (right). . . . . . 50
4.3 Locally important biomarkers for Neutropenia when splitting the data
on lmsite9 (metastasis in Lymph Nodes) - left and on prt25 (prior
chemotherapy with Vinorelbine) - right . . . . . . . . . . . . . . . . 50
4.4 Locally important biomarkers for Neuropathy when splitting the
data on cm163 (H2-receptor antagonists) - left and on lmsite8 (metas-
tasis in Hepatic System including Gall Bladder) - right. . . . . . . . . 51
4.5 Locally important biomarkers for Nail Disorder when splitting the
data on cm133 (combinations of penicillin) - left and on lmsite3 (metas-
tasis in Bone or Locomotor System) - right. . . . . . . . . . . . . . . 51
5.1 Schema for building a local predictive model . . . . . . . . . . . . . . 58
5.2 Appetite disorder prediction using Logistic regression. Left: Negative
Predictive value in Local vs. Global Models; Right: ROC points for
models built varying the number of features. . . . . . . . . . . . . . . 60
5.3 Neutropenia prediction using Logistic regression. Left: Negative Pre-
dictive value in Local vs. Global Models; Right: ROC points for mod-
els built varying the number of features. . . . . . . . . . . . . . . . . 60
5.4 Nail Disorder prediction using Logistic regression. Left: Negative Pre-
dictive value in Local vs. Global Models; Right: ROC points for mod-
els built varying the number of features. . . . . . . . . . . . . . . . . 61
5.5 Neuropathy prediction using Logistic regression. Left: Negative Pre-
dictive value in Local vs. Global Models; Right: ROC points for mod-
els built varying the number of features. . . . . . . . . . . . . . . . . 61
5.6 Variation of Sensitivity as the number of features increases. Left: Ad-
aboost (base classifier: Logistic Regression) for predicting Neutrope-
nia. Right: Random forest for predicting Neutropenia. . . . . . . . . 66
5.7 Neutropenia prediction using Adaboost. Left: Negative Predictive
value in Local vs. Global Models; Right: ROC points for models built
varying the number of features. . . . . . . . . . . . . . . . . . . . . . 67
5.8 Appetite prediction using Adaboost. Left: Negative Predictive value
in Local vs. Global Models; Right: ROC points for models built vary-
ing the number of features. . . . . . . . . . . . . . . . . . . . . . . . 67
5.9 Nail Disorder prediction using Adaboost. Left: Negative Predictive
value in Local vs. Global Models; Right: ROC points for models built
varying the number of features. . . . . . . . . . . . . . . . . . . . . . 68
5.10 Neuropathy prediction using Adaboost. Left: Negative Predictive
value in Local vs. Global Models; Right: ROC points for models built
varying the number of features. . . . . . . . . . . . . . . . . . . . . . 69
5.11 Neuropathy (Left) and Neutropenia (Right) prediction using SVM . . 71
5.12 Nail Disorder (Left) and Appetite (Right) prediction using SVM . . 72
List of Tables
4.1 Top 5 most discriminant features for Appetite . . . . . . . . . . . . . 44
4.2 Top 5 most discriminant features for Neutropenia . . . . . . . . . . . 45
4.3 Top 5 most discriminant features for Nail Disorder . . . . . . . . . . . 46
4.4 Top 5 most discriminant features for Neuropathy . . . . . . . . . . . 46
5.1 Global and Local performance obtained using Naïve Bayes for Ap-
petite, Neutropenia, Neuropathy and Nail disorder . . . . . . . . . . . 62
5.2 Global and Local performance obtained using Decision Trees for Ap-
petite, Neutropenia, Neuropathy and Nail disorder . . . . . . . . . . . 63
5.3 Global and Local performance obtained using Random Forest for Ap-
petite, Neutropenia, Nail Disorder and Neuropathy . . . . . . . . . . 70
Word Count: 20 548
Abstract
This document presents an investigation into applying machine learning techniques
to predict the occurrence of four adverse events (Appetite Disorder, Neutropenia,
Nail Disorder and Neuropathy) in lung cancer patients participating in a clinical
trial conducted by the pharmaceutical company AstraZeneca.
The focus of the project is to investigate the hypothesis that biomarkers show a dif-
ferent importance in different subareas of the input space and to develop techniques
that will identify what biomarkers are only locally predictive. This is a step towards
personalized medicine, which attempts to tailor the medical practices to the needs
of each patient. The first research area proposes a method for discovering the most
discriminant features based on the Kuncheva Consistency Index for feature subsets. A
discriminant feature is one that splits the original data into two subsets such that
the features that are predictive of a specific adverse event in one subset are
different from those that are predictive of the same adverse event in the other
subset. The second investigation proposes a technique for highlighting
biomarkers that are only locally important in the subsets previously identified.
The last part of the thesis develops a method for building local predictive models
and comparing their performance with the global ones. The research showed that
the only adverse event that could be predicted from the measurements provided
was Neutropenia. For this, the local models always had a better negative predictive
value than the global ones, while maintaining a similar or better sensitivity and
specificity, depending on the particular learning algorithm used. The methodology
developed during this project should be immediately transferable to new data sets.
Declaration
No portion of the work referred to in the dissertation has been submitted in support
of an application for another degree or qualification of this or any other university
or other institute of learning.
Copyright Statement
i The author of this dissertation (including any appendices and/or schedules to
this dissertation) owns certain copyright or related rights in it (the "Copyright")
and s/he has given The University of Manchester certain rights to use such
Copyright, including for administrative purposes.
ii Copies of this dissertation, either in full or in extracts and whether in hard or
electronic copy, may be made only in accordance with the Copyright, Designs
and Patents Act 1988 (as amended) and regulations issued under it or, where
appropriate, in accordance with licensing agreements which the University has
entered into. This page must form part of any such copies made.
iii The ownership of certain Copyright, patents, designs, trade marks and other in-
tellectual property (the "Intellectual Property") and any reproductions of copy-
right works in the dissertation, for example graphs and tables ("Reproductions"),
which may be described in this dissertation, may not be owned by the author
and may be owned by third parties. Such Intellectual Property and Reproduc-
tions cannot and must not be made available for use without the prior written
permission of the owner(s) of the relevant Intellectual Property and/or Repro-
ductions.
iv Further information on the conditions under which disclosure, publication and
commercialization of this dissertation, the Copyright and any Intellectual Prop-
erty and/or Reproductions described in it may take place is available in the Uni-
versity IP Policy (see http://documents.manchester.ac.uk/display.aspx?
DocID=487), in any relevant Dissertation restriction declarations deposited in
the University Library, The University Library's regulations (see http://www.
manchester.ac.uk/library/aboutus/regulations) and in The University’s
Guidance for the Presentation of Dissertations.
Acknowledgements
First of all, I would like to thank Dr. Gavin Brown for giving me the opportunity
to work on this project and for his continuous guidance and constructive feedback
throughout the dissertation. I would also like to thank Dr. Diederik Pietersma, and
the entire staff at AstraZeneca for their helpful support and for making this project
possible. Secondly, I would like to thank my parents for their moral and financial
support. I am also grateful to the Dinu Patriciu Foundation for awarding me the 'Open
Horizons’ Scholarship which helped fund my postgraduate studies.
Chapter 1
Introduction
1.1 Motivation
The development of a novel and useful drug extends over many years and involves
efforts of specialists in different domains, from medical to data engineering. One of
the most important steps is the clinical trial, in which patients (volunteers) are given
different doses of the new drug at different time intervals while any unexpected
reactions are observed. Clinical trials are used to determine the efficacy
and safety of a new product as well as provide valuable information in the early
development regarding cost effectiveness of the drug [19]. Moreover, clinical trials
can be a chance for patients to have access to the latest therapy available. On
the other hand, conducting a clinical trial can be a very costly procedure for a
company as it involves both financial resources for payment of the volunteers and
human resources for gathering information about possible adverse reactions and
monitoring the participants' health.
An adverse drug reaction is defined in [7] as an appreciably harmful or unpleasant
reaction, resulting from an intervention related to the use of a medicinal product, which
predicts hazard from future administration and warrants prevention or specific
treatment or alteration of the dosage regimen, or withdrawal of the product; such
reactions can range from minor alteration of a patient's health to death. Thus, it would
be desirable first to identify the biomarkers that influence the occurrence of an
adverse event, and then to predict in an automated
manner whether a new patient will experience a particular adverse drug reaction.
This involves gathering data by means of measuring different characteristics of a
patient and using specialized algorithms to extract the desired meaning from it,
usually by performing feature selection followed by a classification task.
Designing a classifier from a biomedical dataset is a Machine Learning task and
has been an active research field in recent years. However, the quality of the
prediction depends heavily on the quality of the data used. Nowadays the datasets
produced (from DNA sequencing or clinical measurements) can have hundreds or
thousands of attributes. This enormous quantity of information puts more pressure on
developing efficient algorithms to extract meaning from it. In general, of course,
the actual number of characteristics that are important for making a classification
is much smaller, and the rest act as noise, hindering the classifier and causing
misleading results. This is why an effective feature selection step is essential before
applying a classification procedure.
1.2 Local Feature Selection-towards personalized
medicine
Local Feature Selection attempts to formalize the intuition that each person is unique
and thus that the key characteristics to examine for one patient in order to predict
an adverse effect might differ from the characteristics that are meaningful for
another patient. An intuitive, if simplified, example: what causes asthma to appear
in a child might be different from the causes of asthma in an old man.
Features like smoking and the degree of pollution in the working environment might
be meaningful for the old man but irrelevant for the child. Here, what differentiates
the two cases is age, and it is clear that we should treat them as separate
problems and look to different medical measurements for a good prediction.
However, when there are thousands of possible characteristics to choose from, and a
certain subcategory of people is defined by a combination of them, efficient and
intelligent automated computation must be employed. Thus, the aim of a local
feature selection algorithm is to identify, for each person or subgroup of persons,
the most informative set of features that will lead to an accurate prediction. This
is in fact a step towards developing personalized medicine which intends to tailor
each medical action to the specific needs of individual patients. In our framework
(predicting adverse effects for lung cancer patients), this can be seen as determining
which features should be used for each person (or subgroup of persons) in order to
ensure better classification results.
1.3 Aims and Objectives
1.3.1 Aims
• To investigate the hypothesis that biomarkers can hold a different predictive
importance when considered within different groups of people;
• To investigate how local feature selection can be integrated in building predic-
tive models and provide a comparative analysis of the results obtained.
1.3.2 Objectives
• Identifying the degree of statistical dependence between individual features
and each of the target variables in order to gain initial insights into the rela-
tionship between the measurements and the adverse event to be predicted;
• Modeling a local feature selection procedure able to identify, for each adverse
event, the biomarkers that are only locally important, along with the subspaces
(groups of people) where they are meaningful;
• Developing a procedure for building local classification models and comparing
their performance in predicting the occurrence of an adverse event with that of the
global ones, integrating the information obtained in the previous step.
1.4 Project Outline
The structure of the document is as follows:
Literature review. The Literature review chapter starts by providing an intro-
duction to feature selection techniques. It then describes information-theoretic
measures and how they are integrated into feature selection methods. The
last part of the chapter reviews the current literature on local feature se-
lection algorithms (instance-based and class-specific).
Data preprocessing and initial experiments. This chapter provides a detailed
description of the data sets supplied for carrying out this research project, along with
the motivation behind the choices made in preprocessing them. In addition,
it analyzes the results of three initial experiments aimed at gaining early insights
into the data sets. The experiments were designed to assess the degree of
information contained in individual features relative to each of the adverse
events to be predicted. They also attempted a preliminary local analysis on
subsets obtained by performing different splits of the data and applying clustering
algorithms.
Analysis of feature importance within subsets. This chapter is structured in
two sections. The first proposes a method for identifying the splits
that generate the most distinctive subsets of data (in the sense that the features
that are predictive of an adverse event in one subset are different from those that
are predictive in another subset). Based on the results obtained at this stage, the
second part of the chapter proposes a method for highlighting which biomarkers are
only locally important, and in which groups of people.
Predictive model building. This chapter proposes a method for building local
predictive models and explores the performance of different classification algorithms
both locally and globally, in two stages. In the first, less complex and possibly
interpretable classifiers are used in order to keep the models simple; the second
employs more advanced classification algorithms together with a method for balanc-
ing the class distribution.
Conclusions. This chapter provides a summary of the research conducted, high-
lighting the conclusions drawn together with possible directions for further investi-
gation.
Chapter 2
Literature review
2.1 Feature Selection. Introduction
Feature selection is the area of Machine Learning that aims to select from the
input variables those that are most relevant and have the best predictive power in
a classification task. With the advance of techniques that generate datasets
with thousands of features (web streams, gene expression, etc.), performing feature
selection before applying a classification algorithm has become an indispensable
preprocessing step [8]. The aims of feature selection are:
• To reduce the computational costs associated with the prediction process and
lower the storage requirements;
• To reduce the prediction time;
• To help improve the model comprehensibility;
• To provide a higher accuracy of the prediction by removing irrelevant, redun-
dant or noisy features [8] [11] [3].
A general classification process that uses feature selection, introduced in [11], is
shown in Figure 2.1. Phase I represents the actual feature selection
process, while Phase II summarizes the model fitting and performance evaluation.
Figure 2.1: A unified view of the feature selection process [11].
Performing feature selection involves iterating through three steps:
1. Generating a feature subset candidate using a search strategy;
2. Evaluating and adjusting (adding/removing features) the set by means of a
selection criterion;
3. Deciding when the current set is good enough to further be used in the Model
fitting phase [11].
Model fitting consists mainly of training a chosen learning algorithm with the pre-
viously selected features and testing its performance using a testing data set that
has not been used in the training step.
Depending on the search strategy used, the techniques developed for feature selection
fall into three categories:
1. Filter approaches;
2. Wrapper approaches;
3. Embedded methods.
Filter-model feature selection algorithms are based on analyzing the intrinsic
relationship between features and target. No learning algorithm is involved in
evaluating a candidate set [8]; the evaluation is carried out using information-
theoretic measures or probabilistic approaches. Among the most important advan-
tages of filter methods is that they can be used regardless of the learning
algorithm, and that the selection procedure itself has a very simple structure (usually
forward selection or backward elimination) and is therefore easy to understand
[11]. Moreover, filters are generally faster than the other types of feature selection
methods.
On the other hand, wrapper approaches use a learning algorithm to judge the per-
formance of a candidate set of features. The most popular search method in a
wrapper approach is greedy search (Forward Selection or Backward Elimina-
tion) [9]. In Forward Selection we start with the empty set of features and,
at each step, try adding each remaining feature, evaluate the performance of the
learning algorithm, and retain the feature that yields the highest gain in
performance. In the Backward approach, we start with the full feature set and
progressively discard the features whose removal causes the smallest drop in
performance [9].
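The greedy forward step can be sketched as follows. This is a minimal illustration, not the thesis's implementation: `evaluate` stands in for whatever performance estimate the wrapper uses (e.g. cross-validated accuracy of the chosen learner), and the stopping budget `k` is a hypothetical parameter.

```python
def forward_selection(features, evaluate, k):
    """Greedy wrapper-style forward selection: start from the empty set and,
    at each step, add the feature that gives the largest gain in the score
    returned by `evaluate` (a stand-in for the learner's estimated accuracy)."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: evaluate(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy scorer: the "accuracy" is just the sum of per-feature weights.
weights = {"age": 3, "smoking": 2, "height": 1}
score = lambda subset: sum(weights[f] for f in subset)
print(forward_selection(weights, score, 2))  # ['age', 'smoking']
```

Backward elimination follows the mirror-image loop, starting from the full set and removing the feature whose deletion hurts `evaluate` least.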
Integrating the learning algorithm in the feature selection process has both positive
and negative implications. The major drawback is that now the feature selection
model is no longer independent of the learning process and as a consequence a set
of features obtained once cannot be reused with another algorithm. Moreover, as
the learning task is carried out for every modified candidate set, the whole feature
selection process is slow.
However, the important advantage of wrapper methods is that they generally lead
to higher prediction accuracy.

Embedded methods use the same idea of assessing feature usefulness via a learning
algorithm, but the difference from filter and wrapper methods is in the way feature
selection and learning interact, as there is no separation between those two steps [9].
Embedded methods are more prone to overfitting than filters; thus, if only small
amounts of data are available, filters are expected to perform better than embedded
methods, whereas embedded methods will outperform filters as the training data
increase [9].
2.2 Feature Selection using Information Theory
This section presents an overview of different attempts to design a filter feature se-
lection algorithm based on the mutual information shared between variables. These
can then be used in the process of performing local feature selection, which is in-
troduced in the next section. In all the approaches presented, a multi-class classi-
fication problem is considered. Given a set of m examples {xk, yk} (k = 1, ..., m),
where xk = (xk1, ..., xki, ..., xkn) is the k-th instance consisting of n input features
and Y = {y1, ..., yi, ..., yc} is the set of possible classes that each input instance
can belong to, the problem is to select from the n features those that are best at
predicting the true class of an unseen (testing) set of examples.
2.2.1 Overview of basic Information Theory Concepts
1. Entropy. The entropy of a random variable X measures the degree of uncer-
tainty (randomness) in the distribution of X [4]. It is defined in the following
way:
H(X) = − ∑_{x∈X} p(x) log p(x)    (1)
The entropy is maximal when all events have the same probability of occur-
rence (for example, rolling a fair die: each face has probability 1/6), as the
uncertainty is then greatest.
2. Conditional Entropy of X given Y measures the uncertainty that still re-
mains in X when we know the outcome of Y [4].
H(X|Y) = − ∑_{y∈Y} p(y) ∑_{x∈X} p(x|y) log p(x|y)    (2)
3. Mutual Information denotes the information shared between two random
variables [5] and is defined as follows:

I(X;Y) = H(X) − H(X|Y) = ∑_{x∈X} ∑_{y∈Y} p(xy) log [ p(xy) / (p(x)p(y)) ]    (3)
4. Conditional Mutual Information measures the information still shared
between the variable X and Y when the value of Z is known [4].
I(X;Y|Z) = H(X|Z) − H(X|YZ) = ∑_{z∈Z} p(z) ∑_{x∈X} ∑_{y∈Y} p(xy|z) log [ p(xy|z) / (p(x|z)p(y|z)) ]    (4)
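For discrete data, these quantities can be estimated with simple plug-in (frequency-count) estimates. The following is a minimal sketch of that idea, not the implementation used in this project:

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Plug-in estimate of H(X) = -sum_x p(x) log2 p(x) (Equation 1)."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), an equivalent form of Equation (3)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

print(entropy([0, 1, 0, 1]))                           # 1.0 (fair coin: one bit)
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0 (fully dependent)
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0 (independent)
```

Note that plug-in estimates are biased for small samples, which is one reason permutation tests (Section 3.3.2) are useful when judging whether an estimated dependence is real.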
A possible way of assessing the usefulness of a feature set in a classification problem
is to rank the features according to a defined criterion that measures the intrinsic
relation between each feature and the target [4]. In the past 20 years many different
approaches based on information theory measures have been proposed. The fol-
lowing section introduces the basic notions taken into consideration when building
a criterion and presents some of the most important filter criteria based on them,
highlighting their strengths and weaknesses and explaining why they might be of
interest in the context of the project.
2.2.2 Relevancy. Redundancy. Relevancy in context
The measures introduced above can be used to quantify the usefulness of a feature
Xk in relation to the target Y. The features selected for further use in classification
should be relevant to the target class and not redundant. Having redundant features
means adding computational burden without adding relevant information.
Relevancy. The relevancy of a single feature Xk with respect to the output class Y
is the mutual information shared between Xk and Y (Equation 3).
Redundancy. The redundancy of a feature Xk is computed with respect to the
already selected features. If we denote by S the set of the features already selected
then Xk is redundant if it has high mutual information with the elements in S.
Relevancy in context. This measure takes into account the already selected
features: the relevancy of a feature Xk to the output, when a subset S of
features has already been selected, is given by the conditional mutual information
between Xk and Y given each of the features in S [3] (Equation 4).
Using these notions, in [4] it has been shown that most heuristic criteria that attempt
to increase relevancy and at the same time lower redundancy follow a general form:
J = I(Xk;Y) − β ∑_{j∈S} I(Xj;Xk) + γ ∑_{j∈S} I(Xj;Xk|Y)   (5)
where the first term accounts for relevancy, the second for redundancy and the third for relevancy in the context of other features. The parameters β and γ control the weight placed on each of the last two terms.
2.2.3 Ranking criterion
A simple way of selecting relevant features is to rank them according to the mutual
information between each feature and the target variable in descending order and
then keep selecting features until either a certain threshold has been reached or the
performance of a classifier starts degrading. This amounts to setting β and γ in equation (5) to zero, which means the criterion measures only the individual relevancy of each feature to the output class. Although the criterion is simple to implement and understand, as well as very fast, its major drawback is that it assumes all variables are independent of each other: it does not take into account that features may be redundant or may be relevant only in the context of others [4].
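A minimal sketch of this ranking scheme (our own illustration, not code from the dissertation), scoring each discrete feature column by its empirical mutual information with the target:

```python
# Sketch of the ranking criterion: score each feature by I(X_k; Y)
# and return feature indices in descending order of score.
import math
from collections import Counter

def _H(xs):
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def mi(xs, ys):
    return _H(xs) + _H(ys) - _H(list(zip(xs, ys)))

def rank_features(columns, y):
    """columns: list of discrete feature columns; y: target variable."""
    return sorted(range(len(columns)), key=lambda k: -mi(columns[k], y))
```

A threshold on the score (or a drop in classifier accuracy) then decides how many of the top-ranked features to keep.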
2.2.4 Mutual Information Feature Selection Criterion
In order to overcome some of the problems of the ranking criterion, in [2] Battiti
proposed a criterion that attempts to avoid selection of redundant features. This is
achieved by keeping the idea of maximizing mutual information between each feature
and the class variable, but adding a penalty if the current investigated features have
a high mutual information with the already selected ones. If we denote by Xk the
feature to which we want to assign a score and by S the set of already selected features, then the Mutual Information Feature Selection Criterion can be expressed as:
J = I(Xk;Y) − β ∑_{Xj∈S} I(Xk;Xj)
The first term accounts for the information shared between feature Xk and the target Y, while the second is a summation over the information shared between Xk and the already selected features in S. The parameter β has to be chosen by the user. Even though this criterion penalizes redundancy, it fails to consider that a feature useless by itself can be useful in the context of
other features [8].
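The greedy search implied by this criterion can be sketched as follows (our own illustration under the assumption of discrete features; `beta` is the user-chosen penalty):

```python
# Sketch of Battiti's MIFS greedy search: at each step pick the feature
# maximizing I(X_k; Y) - beta * sum over selected X_j of I(X_k; X_j).
import math
from collections import Counter

def _H(xs):
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def mi(xs, ys):
    return _H(xs) + _H(ys) - _H(list(zip(xs, ys)))

def mifs(columns, y, n_select, beta=1.0):
    selected, remaining = [], list(range(len(columns)))
    while remaining and len(selected) < n_select:
        best = max(remaining, key=lambda k: mi(columns[k], y)
                   - beta * sum(mi(columns[k], columns[j]) for j in selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```

With beta = 1, an exact duplicate of an already selected feature scores zero relevancy minus full redundancy, so a weakly relevant but non-redundant feature is preferred over it.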
2.2.5 Double Input Symmetrical Relevance Criterion
A criterion is proposed in [3] that attempts to take into account the fact that sometimes a set of variables can have higher mutual information with the output class than the sum of the variables taken individually. Thus, the authors introduce the new idea of variable complementarity. The rationale behind formalizing this is also very clearly expressed in [8]. The example presented below was given in [18] and expresses in a simple, yet intuitive way the importance of taking into account variable interaction.
Figure 2.2: Multivariate feature selection [18]
The first graph from the left shows a two-class classification problem with two features, x1 and x2. If we consider only x1, we can see that there is much overlap between the classes. Considering only x2 results in even worse classification accuracy, as the two classes overlap perfectly. However, if we look at the graph considering both features together, we notice that the two classes can be separated with high accuracy. In this sense, x1 is more relevant once x2 has been considered.
In the same way, the right graph shows that two features that are useless considered individually can be very relevant when considered together. This is what the Double Input Symmetrical Relevance criterion attempts to capture: the case where variables complement each other and are more useful considered jointly than individually. The complementarity of two random variables with respect to the output class Y is defined as:
CY(Xi;Xj) = I(Xi,j;Y) − I(Xi;Y) − I(Xj;Y)
Another idea that led to the final formulation of the criterion was the authors' intuition that, when we have no knowledge about how to combine subsets of d variables, the best subset can be obtained by combining subsets of d−1 variables [3]. This heuristic was proved theoretically in the article, and the Double Input Symmetrical Relevance criterion was defined as:
XDISR = arg max_{Xi∈X−S} { ∑_{Xj∈XS} SR(Xi,j;Y) }
where SR(X;Y) = I(X;Y) / H(X,Y) is the symmetrical relevance between X and Y. The normalization term does not follow from a theoretical result but is motivated by the fact that mutual information is biased towards higher-arity features. The most important advantage of the Double Input Symmetrical Relevance criterion is that it favors selecting a variable that is complementary to an already selected one [3].
However, it does not take explicitly into account the problem of selecting redundant
features.
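The score can be sketched as follows (our own illustration): the candidate is paired with each selected feature, and the symmetrical relevance of each pair with the target is accumulated.

```python
# Sketch of the DISR score for a candidate X_i:
# sum over selected X_j of SR(X_{i,j}; Y) = I(X_{i,j}; Y) / H(X_{i,j}, Y).
import math
from collections import Counter

def _H(xs):
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def mi(xs, ys):
    return _H(xs) + _H(ys) - _H(list(zip(xs, ys)))

def disr_score(xi, y, selected_cols):
    total = 0.0
    for xj in selected_cols:
        pair = list(zip(xi, xj))          # the joint variable X_{i,j}
        total += mi(pair, y) / _H(list(zip(pair, y)))
    return total
```

On an XOR-like target the candidate has zero individual mutual information with Y, yet its pair with the selected feature scores highly, which is exactly the complementarity the criterion rewards.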
2.2.6 Joint Mutual Information Criterion
Another criterion that takes into account the complementarity of features is Joint
Mutual Information, proposed in [21]. For a candidate feature Xn, with features X1, …, Xn−1 already selected, the criterion assigns the score:
Jjmi = ∑_{k=1}^{n−1} I(XnXk;Y)
The criterion is the sum of the mutual information between the target variable Y and a joint random variable XnXk obtained by pairing the current feature under investigation with each of the already selected ones. The idea is to select a feature that carries complementary information to the ones that have already been selected.
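A compact sketch of this score (our own illustration, reusing an empirical MI estimator):

```python
# Sketch of the JMI score: pair the candidate X_n with each selected X_k
# and sum the mutual information I(X_n X_k; Y) of the pairs with the target.
import math
from collections import Counter

def _H(xs):
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def mi(xs, ys):
    return _H(xs) + _H(ys) - _H(list(zip(xs, ys)))

def jmi_score(xn, y, selected_cols):
    return sum(mi(list(zip(xn, xk)), y) for xk in selected_cols)
```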
2.3 Local Feature Selection
Instance-Based Feature Selection
The previously presented techniques can be very efficient for some problems and, when applied globally as a preprocessing step in a classification problem, can substantially improve accuracy. However, there are cases when a global attempt to select relevant features is not suitable: we should take into account that features may be important in specific regions of the space and less important in others [6]. Ignoring this aspect can lead to discarding features that, though irrelevant in most of the feature space, are very important in some small region. Alternatively, we may select features that are relevant in most of the space but still hinder the classifier in certain regions [6]. In order to deal with this problem, different solutions for identifying a heterogeneous problem and then applying local feature selection have been proposed.
2.3.1 Natural clustering
In [10] the authors explore the effects of local feature selection compared to global feature selection using natural clusters. Here, the problem of decomposing the space has been solved using experts' knowledge about the dataset (microbiological data), and the promising results obtained after applying local feature selection can serve as a motivation for finding methods to cluster the data in an automated way
to form homogeneous feature subspaces. Their study was meant to investigate the
impact of incorporating knowledge of domain experts in the preprocessing step on
classifying antibiotic resistance (sensitive, resistant, intermediate) and in particular
how the classification accuracy differs when applying local versus global dimension-
ality reduction techniques.
Though both feature extraction and feature selection techniques were applied in the above-mentioned article, only the results obtained using feature selection are presented below, as this is the focus of this research project.
The distribution of data after clustering is shown in the figure below:
Figure 2.3: Natural clustering of data with regard to the pathogens [12]: an example of problem decomposition for a particular microbiological dataset using prior information.
The techniques used for local feature selection are of a wrapper type: Forward
Feature Selection, Backward Feature Elimination and Bidirectional Search. The
evaluation of the selected features was done using a knn classifier (k = 7). The results revealed that feature selection applied locally at the second level of splitting (gram+
and gram-) improved the classification accuracy. Moreover, the total number of
features selected locally is always smaller than the number of features selected when
applying global feature selection.
Though the results are encouraging, in practice the problem is more complex: in general we do not have such knowledge about how to cluster the data, and must instead use traditional clustering techniques or other methods for decomposing heterogeneous problems. Moreover, the evaluation was done only with wrapper methods, which are computationally expensive, and with a knn classifier, which requires large memory resources as the model is the training data itself. The fact that good results were obtained only when splitting the data into two clusters and applying feature selection locally suggests that the decomposition into subproblems should be carefully analyzed, as there is a risk of overfitting. Moreover, ways of analyzing how different two subsets are in terms of feature relevance would be useful.
2.3.2 Measuring dissimilarity of subproblems
An attempt to measure the degree of dissimilarity between two given subproblems
is given in [1]. The author’s idea was to define for each subproblem a vector of
dimension f (the number of features) where the ith element is a measure of the
importance (merit) of the ith feature. The angle between these two vectors (called the Importance Profile Angle) denotes how different the two regions are as far as feature importance is concerned. Formally, the IPA is defined as:
IPA = (2/π) arccos [ (∑_{i=1}^{f} Mai Mbi) / ( (∑_{i=1}^{f} Mai²)^{1/2} (∑_{i=1}^{f} Mbi²)^{1/2} ) ]
where Ma1, Ma2, …, Maf is the merit vector for the first subproblem and Mb1, Mb2, …, Mbf the corresponding vector for the second subproblem. The IPA defined above is the normalized angle between these vectors. A threshold above which to consider that
the two problems are different should be set experimentally. In order to measure the
feature importance, the authors proposed three methods based on Gini index and
entropy. Though these measures are faster and easier to compute, they also share the great disadvantage that they only measure the correlation between a single feature and the class target, without taking into consideration feature interaction
(as discussed in the first part of this chapter)[1].
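The IPA formula above can be sketched directly (our own illustration; merit vectors are assumed to be non-negative, so the normalized angle lies in [0, 1]):

```python
# Sketch of the Importance Profile Angle between two merit vectors:
# the angle between them, normalized by pi/2 via the 2/pi factor.
import math

def ipa(ma, mb):
    dot = sum(a * b for a, b in zip(ma, mb))
    na = math.sqrt(sum(a * a for a in ma))
    nb = math.sqrt(sum(b * b for b in mb))
    cos = max(-1.0, min(1.0, dot / (na * nb)))  # guard against rounding
    return (2 / math.pi) * math.acos(cos)
```

Identical merit profiles give IPA = 0, and orthogonal profiles (features important in one subproblem are irrelevant in the other) give IPA = 1.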
The authors define IPA first for categorical features with binary values. The method involves generating a split for each feature, computing the IPA, and then choosing the split with the largest angle between merit vectors. The process is similar to building a tree, but the difference is that the aim of splitting is to obtain homogeneous subproblems, and from that point any classifier can be used together with a feature selection method. Though a method for dealing with multi-valued features is proposed, it involves computing more splitting points, which can be computationally expensive and infeasible for large datasets with thousands of features. For numerical features, the authors propose applying a discretization method first.
However, this discretization is based on a global analysis of the data and may not be suitable for heterogeneous problems [1]. Another issue that has to be taken into account when applying IPA is the stopping criterion. The authors have not proposed a specific method, but they mention as guidelines that the criterion should probably include a threshold for IPA and one for the number of instances in the subproblems [1]. Moreover, as also noted in [12], the fragmentation of the initial dataset should not go too far, in order to avoid overfitting [1]. Possible extensions of this method would be using IPA to assess the splitting obtained by other clustering techniques, and adapting it to compute the degree of dissimilarity between multiple clusters.
2.3.3 Local feature selection using a clustering-like approach
An algorithm that attempts to perform local feature selection using a clustering-like approach is proposed in [6]. The method is of a wrapper type, using a knn classification algorithm with k = 1, and aims to select different relevant features for each new instance to be classified. This is done by constructing an instance space from the initial training data in the following manner. At the beginning, the instance space
is exactly the training data, and an initial accuracy is computed. Each instance finds the
nearest example that has the same class and assumes that the features that differ
by more than one standard deviation are not relevant. These features are dropped
from the instance and the classifier is run again, comparing the current accuracy
with the accuracy obtained by keeping the features. If there is an improvement then
the features will be deleted, else they are kept and the algorithm will not attempt
to do any feature selection for this instance [6]. The algorithm is stopped when no
improvement can be achieved by performing feature selection on any instance.
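The per-instance step described above can be condensed into the following sketch (our own illustration; the accuracy-comparison loop that decides whether the drop is kept is omitted):

```python
# Condensed sketch: features of an instance that differ from its nearest
# same-class neighbour by more than one standard deviation are treated as
# candidates for local removal; the drop is retained only if classifier
# accuracy does not degrade (that check is not shown here).
def candidate_drops(instance, neighbour, stds):
    """Indices of features presumed locally irrelevant for this instance."""
    return [j for j, (a, b) in enumerate(zip(instance, neighbour))
            if abs(a - b) > stds[j]]
```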
The distance to the closest example has to be adapted to the fact that, by deleting features from instances, we obtain vectors of different sizes. Thus, the author employs normalized Euclidean distance for numeric features and a version of the distance proposed by Stanfill and Waltz [6] for symbolic ones (when a feature is discarded from an instance, this is marked by a special value *) [6].
A great disadvantage of this method, as with any other wrapper feature selection algorithm that also uses knn as a classifier, is the computational cost of running the classifier each time feature selection is attempted. Moreover, there is also a memory cost associated with retaining all instances, which may make the algorithm unsuitable for large datasets and large feature sets. The average number of features kept by this algorithm is generally higher than the number retained using Backward Sequential Selection or Forward Feature Selection, because the algorithm will drop a feature from all instances only if it is irrelevant in all subspaces (globally irrelevant). The accuracy reported using this type of local feature selection was higher than the accuracy using global feature selection algorithms; in order to verify that this difference comes from the algorithm taking into account that some features are relevant only in parts of the instance space, artificial datasets were created and used in testing.
2.3.4 Local feature selection and dynamic integration of classifiers
A further step in the field of local feature selection was done by Tsymbal et al.
in [15]. Besides proposing a method of selecting features relevant for each new
instance, the authors also investigate how the most suitable classifier can be used in
each case. The main idea behind their approach is to build classifiers on different feature subsets, store information about their predicted errors, and, for each new instance to be classified, use a meta-level classifier (weighted nearest neighbor) to determine which classifier is best. In order to restrict the set of candidate classifiers,
a decision tree is constructed on the entire dataset and for a new instance x, the
classifiers built on features that are not in the path followed by x in the tree will be
discarded.
The algorithm consists of a learning phase and an application phase. In the learning phase, candidate feature subsets are generated and cross-validation is used to estimate the error of a classifier on each feature subset. For the application phase two
versions are proposed, depending on how each classifier contributes to the final
assignment of a class. In the static version the classifier with the smallest predicted
error is chosen to make the final classification, while in the dynamic version, each
classifier has a weight and the final classification is obtained using weighted voting
[15].
The experimental results showed that with this technique a 10% accuracy improvement can be reached using less than half of the initial features. However, the datasets used for testing the proposed method are relatively small, with at most 432 instances and no more than 57 features. It would be interesting to see how the algorithm handles much larger datasets. In particular, in the feature-subset generation step a method for avoiding exhaustive enumeration should be used (the authors mention the possibility of employing heuristic methods).
2.4 Class-Specific Feature Selection
A different type of local feature selection is class-dependent feature selection. Here,
the focus changes from selecting different features for each instance to be classified,
to selecting a possibly different feature subset for each class, depending on their
discriminating properties [17][14].
An intuitive and motivating example of why such an approach would be useful is given in [17]: suppose we are dealing with medical data, for each patient we have measurements such as blood pressure, weight, cholesterol, age and height, and we have to diagnose whether the patient suffers from disease A, B or C. Assume that blood pressure above a certain limit indicates disease A, the weight tells us whether the patient has disease B, and a certain level of cholesterol means the patient suffers from disease C [17]. A general feature selection algorithm will select
blood pressure, weight and cholesterol as being important. If the patient suffers from
disease A, then the weight and the cholesterol will act as noise (analogously in the other cases) and we can end up with misleading results [17]. On the other hand, if
we know what features are relevant for each class then, we can build three classifiers,
each of them distinguishing a possible disease from all the rest [17].
Considering the project framework, that is, performing feature selection for predicting adverse events, class-specific feature selection may be important when dealing with different degrees of severity of an adverse event. For example, if a drug may produce nausea, headache and heart attack, then we might be interested primarily in the important features for predicting heart attack, for which we would like high prediction accuracy. For this scenario, class-specific feature selection might be more useful. Furthermore, we may have classes with very few examples, for which performing global feature selection would favor the richer classes.
A general wrapper approach for selection of class dependent features is proposed in
[17]. The algorithm uses the idea of one against all, meaning that it will construct
C classifiers (where C is the number of classes), with classifier i distinguishing be-
tween class i and all the others. The process is illustrated in Figure 2.4. For each
of these classifiers a hybrid feature selection method is used that combines a wrapper approach with a filter one. First, all features are ranked according to an importance measure such as the RELIEF weight measure [10] or the Class Separability Measure (CSM) [17]; any other ranking measure can be used, including those based on information theory. After that, following a forward selection search, features are added for each class in the order of ranking [17], using an SVM as classifier. The stopping condition is either the point at which validation accuracy starts to decrease, or the exhaustion of all features [17]. For classifying an instance, a heuristic method is proposed that assigns weights to the output of each model before comparing them and selecting the final output [17].
Figure 2.4: Schematic view of the general wrapper approach to class-dependent feature selection [17].
The method proposed in [17] is very flexible as it allows customizing the class-
dependent feature selection algorithm choosing what classifier to use as well as what
filter method for ranking of the features. The results reported by the authors using
RELIEF [10], CSM[17] and mRMR (Minimal-Redundancy-Maximal-Relevancy)[13]
as ranking measures and SVM as classifier are promising, as the accuracy using
this method is always higher than performing class-independent feature selection.
Moreover, the average number of features selected is smaller in the case of the proposed method. The drawback is the extra computational cost added by using a wrapper approach for each of the binary classifiers.
A similar approach is proposed in [14] where again the idea of transforming a C-
class classification problem into C binary problems is used. Unlike the previous
method, here the problem of obtaining imbalanced binary problems (one class having
considerably more examples than the other) is addressed. Their idea was to use an
oversampling method before applying a conventional feature selection method on each binary problem [14]. Moreover, the authors experimented with both filters and wrappers as feature selection methods on the subproblems, using Naïve Bayes, C4.5, MLP and knn (k = 1 and k = 3) as classifiers.
For classification of a new instance, the authors have also used a heuristic method to
select between the outputs of the C classifiers. They compared the results obtained
with no feature selection, traditional feature selection and class-specific feature selection on 15 datasets, and concluded that class-specific feature selection usually yields better results than traditional feature selection, which in turn yields better results than applying a classifier without any feature selection as a preprocessing step. Even if some of the datasets used in the experiments had a large number of instances (12960 in the largest), the number of features is relatively small (at most 64), and there is no report of how the proposed framework deals with large feature sets.
Chapter 3
Data Preprocessing and Initial Experiments
This chapter aims at giving a detailed description of the supplied data sets and
the choices made in preprocessing them in order to be able to conduct the desired
experiments in Matlab. Moreover, as the choices made in early steps will influence
the final results, the motivation behind each step is given. The initial experiments
were designed to gain first insights into the data which will guide the choices of
further possibilities of experimentation.
3.1 Data Overview
The data on which the experiments were carried out come from a clinical study conducted by the pharmaceutical company AstraZeneca and consist of 3 distinct datasets, along with explanations of the measurements recorded. A general description of each of them is given below.
3.1.1 Adverse Events Data Set (not used in the experiments)
This dataset contains information about the occurrences of adverse events among the patients included in the clinical trial, along with an internationally accepted classification of adverse events obtained from the MedDRA ontology. The set contained 6868 incidents of adverse events and 593 different types of adverse events. Though this data set itself was not used in the experiments, the 4 adverse events under study were grouped according to the ontology in order to increase the number of positive cases. The grouped adverse events were also supplied.
3.1.2 Subjects Data Set
The data set is made up of 129 measurements recorded for 613 patients, including the
variables associated with the occurrences of the 4 adverse events to be investigated.
Their distribution among the patients is listed below:
• Anorexia - occurs 186 times, with a severity over 3 in 7 cases.
• Neutropenia - occurs 219 times, with a severity over 3 in 197 cases.
• Nail disorder - occurs 67 times, with a severity over 3 in 0 cases.
• Neuropathy - occurs 71 times, with a severity over 3 in 3 cases.
One of the particularities of this data set was that it contained mixed types of
features:
• Continuous (Age, Baseline Weight, Baseline Height, Baseline Body Mass Index, Baseline Body Surface, Baseline BFGF, etc.);
• Binary discrete (Sex, Smoking status, tumor stage, location of the metastasis
(lmsite1-lmsite15), prior chemotherapy treatment, reduction of doses (aered1-aered8), etc.);
• Categorical (Race, Country, Histology Type, etc.).
3.1.3 Concomitant Medication Data set
The Concomitant Medication data set contained 390 features recorded for the 613 patients, where each feature was a binary variable representing whether or not a patient took a particular medicine that could possibly be linked to the occurrence of one of the adverse events. The number of cycles of doses a person received was also recorded, along with another binary variable indicating whether a patient had a dose reduction. This data set was used together with the Subjects data set.
3.2 Data Preprocessing
3.2.1 Converting string discrete variables into numbers
The binary variables had as possible values the strings Y and N. The variables Race, Histology Type and Country had multiple categories. All string values were assigned a numeric label starting from 0, in alphabetical order of the strings denoting the categories (e.g. for Race, there were 4 categories, Black, Caucasian, Oriental and Other, which were converted into 0, 1, 2 and 3, respectively).
3.2.2 Discretization of continuous variables
The project focuses on a mutual information framework and thus the whole data set has to be discrete. Moreover, the number of continuous variables was much smaller
(only 9) than the number of already discrete variables. The following techniques
were taken into consideration for this preprocessing step:
• Discretization using the mean value into two binary classes. One of the advantages of this technique is that the resulting categories are easy to interpret and that, since most of the variables are binary, the mutual information computed using the discretized variable will not be biased. However, the major drawback, and the main reason why this technique was not employed, is that by simple binarization we may lose important information.
• Discretization using prior knowledge (manual discretization). This
implies choosing both the number of categories and the minimum and maxi-
mum values allowed for each of them according to intuition and general infor-
mation (for example, considering 3 categories for Age: [20-40] years, [41-60]
years, over 60 years). Although this technique may seem reasonable it is highly
dependent on personal experience and restricted by the available prior knowl-
edge. Therefore, it cannot be employed for all continuous features (such as
BBFGF).
• Discretization by minimization of the information loss. This technique
uses the target variable in the process of choosing the optimal thresholds. In
order to explain it, the variable Age will be used as an example. For all other
continuous variables, the process was analogous.
First, the range of the variable was computed (20-82 years). Then, 20 initial candidate thresholds were placed at equal distances. The number of categories was chosen to be 5 (4 thresholds), as this number offered the best trade-off between minimizing information loss on the one hand (fewer categories mean more information loss) and, on the other hand, limiting both the mutual information's bias towards high-arity features and the increase in computational time. All the possible ways of placing the 4 thresholds were generated
(4845 in total) and, for each candidate discretization, the mutual information between the discretized variable and the target was computed. The final discretization was chosen as the one that maximized the dependency between the variable and the adverse event.
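The threshold search described above can be sketched as follows (our own illustration; with 20 candidates and 4 thresholds this enumerates C(20,4) = 4845 combinations, matching the count in the text):

```python
# Sketch: enumerate placements of n_thresholds cut points among equally
# spaced candidates and keep the placement maximizing I(discretized X; Y).
import math
from itertools import combinations
from collections import Counter

def _H(xs):
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def mi(xs, ys):
    return _H(xs) + _H(ys) - _H(list(zip(xs, ys)))

def discretize_max_mi(x, y, n_candidates=20, n_thresholds=4):
    lo, hi = min(x), max(x)
    step = (hi - lo) / (n_candidates + 1)
    candidates = [lo + step * (i + 1) for i in range(n_candidates)]
    return max(combinations(candidates, n_thresholds),
               key=lambda cuts: mi([sum(v >= c for c in cuts) for v in x], y))
```

Each sample's label is simply the number of thresholds it exceeds, giving n_thresholds + 1 categories.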
3.2.3 Missing Values
In order to be able to use the data points with missing values in the experiments, two methods of approximating the missing information were analyzed. The first was to compute the mean of the relevant feature over all other patients and use it to fill in the missing value. This approximation is poor, as it does not consider the particular characteristics of the patient and it uses the same value (the mean) for all patients with a missing record on a particular characteristic. The second method, and the one employed in obtaining the preprocessed data, was to find, for each data point with a missing value on the ith feature, the 10 closest data points in the sense of Hamming distance. Then, among these 10 nearest neighbors, the value that occurred most often for the ith feature was voted to fill in the missing one.
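This imputation can be sketched as follows (our own illustration; `None` marks a missing entry, and the distance ignores positions that are missing in either row):

```python
# Sketch: for a row missing feature i, find the k nearest donor rows by
# Hamming distance over the observed entries and take a majority vote
# on their values for feature i.
from collections import Counter

def hamming(a, b, skip):
    return sum(1 for j, (u, v) in enumerate(zip(a, b))
               if j != skip and u is not None and v is not None and u != v)

def impute(rows, r, i, k=10):
    donors = sorted((row for idx, row in enumerate(rows)
                     if idx != r and row[i] is not None),
                    key=lambda row: hamming(rows[r], row, i))[:k]
    return Counter(row[i] for row in donors).most_common(1)[0][0]
```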
3.2.4 Sparse Features and Special Cases
Sparse features were considered to be those with more than 50% missing values; they were not used in the experiments, as they do not carry sufficient information for a valid analysis.
Some of the features needed to be treated separately as they exhibited special prop-
erties:
• The variable Country initially had 25 values. Since the mutual information
is biased towards high arity features and the majority of features are binary,
we decided to make this feature binary as well, by grouping the countries into
European and Non-European.
• For mixed continuous and categorical data (such as variable GENERLT (EGFR
gene amplification)), a value of NO RESULT was treated as a missing value.
• For each adverse event, the features representing whether it occurred with a severity over 3 (e.g. for anorexia: aeg3p1, aeg3p5) were not included in the experiments where the target variable was that particular event. The reason behind this choice is that the variable representing the degree of severity of an adverse event is conditioned on knowing that the adverse event happened; its value can therefore be known only after we know whether the adverse event occurred.
3.3 Initial Experiments
The initial experiments were designed to help gain first insights into the data set
and in the way the features are correlated with the target variables (the 4 adverse
events). The results obtained at this step influenced the choices considered for next
steps.
3.3.1 Experiment 1: Ranking features according to Mutual
Information
This experiment aims at understanding how much information is shared between an individual biomarker and the target variable, which was taken in turn to be: Appetite, Neutropenia, Nail Disorder and Neuropathy. After computing the mutual
information between each feature and the target variable, the features were sorted
in descending order of the mutual information and the top most important were
displayed.
3.3.2 Permutation test
Computing the mutual information between a feature and the target variable involves approximating the joint probability distribution of two random variables. The accuracy of the approximation is influenced by the number of available data points as well as by the noise present in the data [20]. Therefore, in order to be able to say to what extent a feature is useful, a threshold has to be set [20]. One way of choosing this threshold is performing a permutation test, which also involves a formal hypothesis test [20].
A permutation test investigates the following question: how likely is the value obtained for a statistic θ̂i, computed over two vectors x and y of length n, if we suppose that the vectors are independent and, as a consequence, the value of the statistic should be zero [20]? This is done by estimating the distribution of the random variable θ̂ from the values obtained over permutations of the vector x (or y), and then computing the proportion of those values that are larger than θ̂i [20]. In our context, the permutation test can be employed to assess the significance of the mutual information between each feature and the variable to be predicted, and to automatically discard the features that do not pass the test.
The significance level was chosen to be 1% and, since computing the total number of permutations (n!) would add a significant computational cost, only 500 permutations were considered. In addition to this, the mutual information was normalized to lie between 0 and 1 in the following manner:

NormalizedMI(X, Y) = I(X; Y) / min(H(X), H(Y)),

where H(X) is the entropy of X.
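Both the normalization and the permutation test translate directly into code. The following is a hedged sketch, not the study's implementation: the names are illustrative and the MI estimator is a simple plug-in one:

```python
import numpy as np

def mi_bits(x, y):
    """Plug-in mutual information (bits) between two discrete vectors."""
    total = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                total += pxy * np.log2(pxy / (np.mean(x == xv) * np.mean(y == yv)))
    return total

def entropy_bits(x):
    p = np.unique(x, return_counts=True)[1] / len(x)
    return -np.sum(p * np.log2(p))

def normalized_mi(x, y):
    """I(X;Y) / min(H(X), H(Y)), bounded between 0 and 1."""
    return mi_bits(x, y) / min(entropy_bits(x), entropy_bits(y))

def permutation_test(x, y, n_perm=500, seed=None):
    """Fraction of label permutations whose MI reaches the observed MI.
    A small value means the observed dependence is unlikely under independence."""
    rng = np.random.default_rng(seed)
    observed = mi_bits(x, y)
    exceed = sum(mi_bits(x, rng.permutation(y)) >= observed for _ in range(n_perm))
    return exceed / n_perm
```

A feature would pass the test when the returned fraction falls below the chosen significance level.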
Figure 3.1: Feature ranking according to normalized mutual information for Appetite and Neutropenia.
Figure 3.2: Feature ranking according to normalized mutual information for Nail Disorder and Neuropathy.
In Figure 3.1 it can be noticed that the maximum value reached by the normalized mutual information computed for Appetite and Neutropenia is very low (0.23 for Appetite, 0.19 for Neutropenia, when the maximum possible is 1), which means that there is little predictive power in individual features as far as Appetite disorders or Neutropenia are concerned. For Nail Disorder and Neuropathy (Figure 3.2), the maximum value of the normalized mutual information is higher (>0.4). However, for all adverse events, it should be noticed that the first two features are those referring to the reduction of doses in the case of occurrence of that adverse event (the grouped or the specific one): aered7 and aered3 for Nail Disorder, aered8 and aered4 for
Neuropathy, aered2 and aered6 for Neutropenia, and aered1 and aered5 for Appetite. In the case of Nail Disorder and Neuropathy, all other features have a considerably lower value. Moreover, for these two adverse events only 5 and 4 features, respectively, passed the permutation test, which again indicates low predictive power in individual features.
3.3.3 Experiment 2: Local analysis of individual feature importance
The previous experiment revealed that the mutual information between individual features and the adverse event is small when considering the whole data set. The second experiment takes into consideration the hypothesis that the importance of features may be different in different subsets.
The experimental framework is the same as in the previous one: computing the mutual information between all the features and the target variables and then displaying the ones that passed the permutation test in descending order. The difference is that the mutual information is computed first on subsets of the data defined by splitting on particular features, and then on subsets obtained by applying clustering algorithms.
This experiment does not attempt to analyze all the possible subsets that can be
obtained by splitting the data on the categories defined by a feature, or to propose
an optimal split. The purpose is to gain more information about the structure of
the data and to offer some possible pathways to be investigated rigorously in the
next chapters. This is why the subsets analyzed are only a small selection and
were obtained by splitting the data on the following features: Sex, Race, hstltyp
(Histology type), Smoking habits and Tstage (Cancer Stage).
Some of the results are displayed below. Though the increase in the mutual information is small, the tendency to rise can still be noticed. For example, in Figure 3.1,
the maximum normalized mutual information between the features and Neutropenia was less than 0.19 in the previous experiment, while when considering only the subset of people who had Large Cell Carcinoma (Figure 3.3), the most predictive feature has a normalized mutual information higher than 0.35. Moreover, it should also be noticed that the ranking of the features differs from one set to another when considering a particular adverse event.
For subsets smaller than 20 data points, a stability check was done along with the permutation test in order to ensure the validity of the experiments. The stability check consisted of removing each data point in turn from the subset and running the permutation test over the new subset, recording at each iteration the features that passed it. Only those that passed the test in every iteration are displayed as being statistically valid.
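The leave-one-out stability check can be expressed generically. In the sketch below, `passes_test` stands in for whatever significance filter is applied (the permutation test in this study), and all names are illustrative:

```python
import numpy as np

def stability_check(X, y, passes_test):
    """Leave-one-out stability check for a small subset.

    passes_test(X, y) is any callable returning the set of feature indices
    that pass a significance test on the given data; only features that pass
    for every leave-one-out subsample are reported as stable."""
    n = len(y)
    stable = None
    for i in range(n):
        mask = np.arange(n) != i          # drop one data point in turn
        passed = passes_test(X[mask], y[mask])
        stable = passed if stable is None else stable & passed
    return stable if stable is not None else set()
```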
Figure 3.3: Feature ranking according to normalized mutual information for Appetite in the subset of people who had Large Cell Carcinoma.
Figure 3.4: Feature ranking according to normalized mutual information for Appetite in the subset of Females.
Figure 3.5: Feature ranking according to normalized mutual information for Nail Disorder in the subset of Caucasian people.
Figure 3.6: Feature ranking according to normalized mutual information for Neutropenia in the subset of Males.
The next step was to use automated techniques to cluster the data and then to compute the mutual information and rank the features accordingly within the resulting subsets. For this purpose K-means clustering was employed with k = 4. Since the data is categorical, the metric employed was the Hamming distance, which computes for two data points the number of mismatches across all features. The results are similar to those obtained on the subsets, in that the normalized mutual information slightly increases compared to that computed on the entire data set, and the same features seem to change their predictive power for an adverse event when considering different clusters.
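K-means with the Hamming distance amounts to a k-modes-style procedure, in which each centroid is the per-feature mode of its cluster. A rough sketch, assuming the categories are encoded as non-negative integers (all names are illustrative, not the study's code):

```python
import numpy as np

def hamming(a, b):
    """Number of mismatching features between two categorical vectors."""
    return np.sum(a != b)

def k_modes(X, k=4, n_iter=10, seed=0):
    """K-means-style clustering for categorical data: centroids are per-feature
    modes and the distance is the Hamming mismatch count."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: nearest centroid under Hamming distance.
        labels = np.array([np.argmin([hamming(x, c) for c in centroids])
                           for x in X])
        # Update step: per-feature mode of each cluster's members.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = [np.bincount(col).argmax() for col in members.T]
    return labels
```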
Figure 3.7: Feature ranking according to normalized mutual information for Appetite in different clusters.
One of the greatest disadvantages of applying clustering techniques is that the resulting clusters are difficult to interpret. Since the problem under investigation lies in the medical field, the project aims in a first stage at developing local techniques that preserve the meaning of the subsets analyzed. For this reason, in the following chapter, the focus will be more on analyzing subsets of patients that result from splitting the data on a particular variable with a clear associated meaning (such as Males and Females), rather than on using automated clustering methods.
3.3.4 Conclusions
These initial experiments revealed that individual features have very small mutual information with the occurrence of any of the four adverse events. Analyzing mutual information in subsets defined by splitting the data on different features, or in clusters, resulted in a small increase in the features' predictive power. Moreover, the relative ranking of the features for a particular adverse event changes between subsets. These results lead to the next steps of the project, which are considering features jointly rather than individually and proposing methods to analyze the difference in feature importance in one subset as opposed to another, as well as to automatically detect which splits generate the subsets that differ the most in terms of which features are important in predicting one of the four adverse events.
38
Chapter 4
Analysis of feature importance within subsets
This chapter aims at understanding how the importance of features varies within different subsets of data. Firstly, the choice of a feature selection criterion is motivated. Then, a method is proposed to choose the features for splitting the data in such a way that the resulting subsets are the most dissimilar as far as the top 10 predictive features are concerned. A further step is taken in order to identify the locally important biomarkers by assigning to each feature two scores, one for each of the two subsets it appears in. A heuristic is then applied to identify the features that are only locally important and change their predictive power between the subsets.
4.1 Definitions
A discriminant feature¹ is one that splits the original data into 2 subsets with the following characteristic: the features that are predictive for a specific adverse event in one subset are different from the features that are predictive for the same
¹ In some disciplines the term discriminant feature is used generically for an important feature. In this context it is employed with the different meaning given in the definition.
adverse event in the other subset.
A locally important feature is a feature that is predictive of an adverse event only in a subarea of the input space.
Two classification subproblems are considered different/dissimilar if:
• the data sets on which the classification tasks are defined come from the same initial set and were obtained by splitting it on a particular feature;
• the variable to be predicted is the same for both subsets (one of the adverse events);
• the set of features that are the most relevant in predicting the target, computed on the first subset, is different from the set of most important features computed on the second subset.
4.2 Assumptions and Limitations
In this study the local analysis is performed by dividing the initial data set into only two subsets at a time. The rationale behind this choice is that making a further split would produce subsets too small to be used for a reliable analysis.
Moreover, for the features that have multiple categories, a grouping was used in order to create only two subsets. For example, the feature Race had 4 categories (Caucasian, Black, Oriental and Others) and, since the Caucasian subset had a size of 360, the grouping was done in the following manner: Caucasians − first subset; Other Races (Black, Oriental, Others) − second subset.
4.2.1 Feature Selection Criterion
The previous chapter revealed a very small amount of information in individual features for all 4 adverse events. As a consequence, the following analysis adopts a feature selection criterion that takes into account the possibility that features may carry more information when considered jointly (that is, they are complementary to one another). The criterion adopted in this study was Joint Mutual Information [21], which is defined as:

J_jmi = Σ_{k=1}^{n−1} I(X_n, X_k; Y)

In order to select the next feature, the criterion computes the information between the target and a joint random variable, defined by pairing the candidate X_n with each currently selected feature X_k.
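A greedy forward selection driven by this criterion can be sketched as follows. This is illustrative Python, not the study's implementation; the pair (X_n, X_k) is encoded as a single joint discrete variable:

```python
import numpy as np

def mi(x, y):
    """Plug-in mutual information (bits) between two discrete vectors."""
    total = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                total += pxy * np.log2(pxy / (np.mean(x == xv) * np.mean(y == yv)))
    return total

def jmi_score(X, y, selected, candidate):
    """Sum over already-selected features X_k of I(X_candidate, X_k; Y)."""
    xc = X[:, candidate]
    score = 0.0
    for k in selected:
        joint = xc * (X[:, k].max() + 1) + X[:, k]   # pair encoded as one variable
        score += mi(joint, y)
    return score

def jmi_select(X, y, n_features):
    """Start from the highest-MI feature, then repeatedly add the candidate
    with the largest JMI score."""
    selected = [int(np.argmax([mi(X[:, j], y) for j in range(X.shape[1])]))]
    while len(selected) < n_features:
        rest = [j for j in range(X.shape[1]) if j not in selected]
        selected.append(max(rest, key=lambda j: jmi_score(X, y, selected, j)))
    return selected
```

The point of the criterion shows on an XOR-style target: neither feature is individually informative, yet JMI still selects the complementary pair.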
4.3 Identifying the most discriminant features
The first step in performing a local analysis and building local predictive models is to identify the sub-demographics present in the data that define two different problems. One way to measure the degree of heterogeneity between two subsets, and the one adopted in this research, is to analyze how the most predictive features relative to each of the 4 adverse events differ in one subset compared to another. Steps in this direction have been taken in [1], where for each possible split the angle between the 2 vectors containing the mutual information of all features within the 2 subsets was computed. A small angle indicates that the features have similar importance in the 2 subsets. The major drawback of this method in the context of the supplied data set is that it requires a merit measure that assesses features individually. As the mutual information of individual features is very small
and only a small number of them passed the permutation test, the significance of the returned results can be affected.
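For reference, the angle-based measure of [1] is straightforward to compute; a minimal sketch (the function name is mine):

```python
import numpy as np

def mi_vector_angle(mi_a, mi_b):
    """Angle in degrees between the per-feature MI vectors of two subsets.
    A small angle means the features have similar relative importance."""
    a, b = np.asarray(mi_a, float), np.asarray(mi_b, float)
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```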
4.3.1 Consistency Index for feature selection
The method proposed for computing the dissimilarity between 2 subsets of data is based on the Consistency Index for feature selection introduced in [22]. The index assesses how similar two sequences of features of the same length are, obtained at different runs of a feature selection algorithm. The formal definition of the Consistency Index for two subsets A ⊂ X and B ⊂ X such that |A| = |B| = k, where 0 < k < |X| = n and r = |A ∩ B|, is:

I_C = (r − k²/n) / (k − k²/n) = (rn − k²) / (k(n − k))   [22]
The Consistency Index satisfies the following 3 properties:
• Monotonicity. The larger the intersection of the two feature subsets, the higher the value of the index;
• Limits. The index is bounded by constants (−1 ≤ I_C ≤ 1) that do not depend on k or n. The maximum value of the Consistency Index is 1 and is reached when r = k, that is, when S1 is the same as S2. The minimum value (−1) is obtained when the intersection of S1 and S2 is the empty set (r = 0) and their size is half of the total number of available features (k = n/2);
• Correction for chance. I_C(A, B) will have a value around 0 when A and B are independently drawn, as r is expected to be around k²/n [22].
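The index itself is a direct transcription of the definition (illustrative Python; the function name is mine):

```python
def consistency_index(A, B, n):
    """Kuncheva's consistency index for two feature sets of equal size k
    drawn from n features: (r*n - k^2) / (k*(n - k)), with r = |A ∩ B|."""
    A, B = set(A), set(B)
    if len(A) != len(B):
        raise ValueError("the two feature sets must have the same size")
    k, r = len(A), len(A & B)
    return (r * n - k * k) / (k * (n - k))
```

The three properties above can be checked directly: identical sets give 1, disjoint sets of size n/2 give −1, and sets overlapping at the chance level give roughly 0.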
A major advantage of employing this index in the analysis of feature importance within subsets for the current problem is that, since it compares sets of features and not the individual mutual information carried by each attribute, it is compatible with a feature selection criterion that considers features jointly, such as JMI. Given these properties, the Consistency Index has been chosen as a measure of similarity between the sets of features obtained by running a feature selection algorithm on each of the two subpopulations defined by splitting the original data on a particular feature.
Splitting the data into subpopulations implies a considerable reduction in the size of the available data, which affects the reliability of the approximation of each feature's distribution. In an attempt to overcome this problem, the feature selection algorithm is run on 20 bootstrap samples of the original data set and, in order to reduce the variance, the mean of the Consistency Index is taken as the final value. The procedure is carried out in turn for every possible split, and the output is the list of features in ascending order of the Consistency Index computed on the two subproblems that each feature defines. The method is summarized below:
Compute 20 bootstrap samples of the initial data set
For each feature f that defines a valid split
    For each bootstrap sample
        Split the data into subpopulation A and subpopulation B
        Compute the set S1 of the 10 most predictive features using JMI for subset A
        Compute the set S2 of the 10 most predictive features using JMI for subset B
        Compute the Consistency Index (IC) for the sets S1 and S2
    Average the Consistency Index for feature f over the 20 bootstrap samples
Sort all the features in ascending order of IC and display the first 5
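The procedure above might be implemented along these lines. This is a sketch: `top_k_features` stands in for the JMI selector, the split rule is simplified to a binary comparison, and all names are mine:

```python
import numpy as np

def discriminant_score(X, y, split_col, top_k_features, n_boot=20, seed=0):
    """Average consistency index over bootstrap samples for one candidate
    split feature; lower values indicate more dissimilar subproblems.

    top_k_features(X, y) must return a fixed-size set of feature indices."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)                    # bootstrap sample
        Xb, yb = X[idx], y[idx]
        mask = Xb[:, split_col] == Xb[:, split_col].min()   # binary split
        S1 = top_k_features(Xb[mask], yb[mask])
        S2 = top_k_features(Xb[~mask], yb[~mask])
        k, r = len(S1), len(S1 & S2)
        scores.append((r * d - k * k) / (k * (d - k)))      # consistency index
    return float(np.mean(scores))
```

Sorting all valid split features by this score in ascending order then reproduces the last step of the pseudocode.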
4.3.2 Results
This section shows the most discriminant features, that is, the features that produced the most dissimilar subsets for a particular adverse event. In this context, two subsets are considered different (i.e. they define different problems that may be tackled separately) if the top 10 features important in predicting a certain adverse event are different. The results are shown for each of the 4 adverse events in turn, together with a brief explanation of what each feature name means.
Appetite Disorder
Feature | Explanation                            | Averaged Consistency Index
--------|----------------------------------------|---------------------------
BLBSAM  | Baseline Body Surface Area             | 0.4236
BLBMI   | Baseline Body Mass Index               | 0.4290
Ncycles | Number of Cycles Received during Study | 0.4500
Lmsite8 | LocAdv/Meta Site: Hepatic              | 0.4710
Age     | Age                                    | 0.4762

Table 4.1: Top 5 most discriminant features for Appetite
It can be noticed that the features which are the most significant to split the data on in order to perform a local analysis are intuitively related to Appetite Disorder/Anorexia (body mass index, body surface area, metastasis in the hepatic system). For example, the second smallest Consistency Index was achieved when the data was split on body mass index, resulting in two categories of people: light-weight (BMI ≤ 25) and heavy-weight (BMI > 25). This indicates that the features which are predictive for Appetite disorder in the subset of light-weight people are different from those that predict Appetite disorder in the subset of heavy-weight people.
Neutropenia
Feature | Explanation                       | Averaged Consistency Index
--------|-----------------------------------|---------------------------
Cm71    | Benzodiazepine derivatives        | 0.4133
Prt25   | Prior Cancer Therapy: Vinorelbine | 0.4238
Cm223   | Opium alkaloids and derivatives   | 0.4290
Lmsite8 | LocAdv/Meta Site: Hepatic         | 0.4448
Lmsite9 | LocAdv/Meta Site: Lymph Nodes     | 0.4605

Table 4.2: Top 5 most discriminant features for Neutropenia
The extent to which the results can be given a medical interpretation is limited by experience and prior knowledge of the medical terms involved. Table 4.2 shows that among the 5 most useful features to split the data on, in order to obtain more homogeneous problems for predicting Neutropenia, is the one indicating whether the patient has metastasis in the lymph nodes. ('Lymph nodes are found all through the body, and act as filters or traps for foreign particles. They are important in the proper functioning of the immune system. They are packed tightly with white blood cells.' [27])
On the other hand, Neutropenia is a disorder that affects the number of white blood cells in the blood. ('Neutropenia, [. . . ] is a hematological disorder characterized by an abnormally low number of neutrophils, the most important type of white blood cell. Neutrophils usually make up 50-70% of circulating white blood cells and serve as the primary defense against infections by destroying bacteria in the blood.' [28]). The small value of the averaged Consistency Index indicates that the features important in predicting Neutropenia for the patients who had metastasis in the lymph nodes are different from those for patients who have not experienced this condition.
The same type of results (i.e. the top 5 features that define the most dissimilar subsets) is shown in Table 4.3 and Table 4.4 for Nail Disorder and Neuropathy.
Nail Disorder
Feature | Explanation                            | Averaged Consistency Index
--------|----------------------------------------|---------------------------
Ncycles | Number of Cycles Received during Study | 0.3190
Cm113   | Any CM: Combs of penicillins incl beta | 0.3924
Lmsite3 | LocAdv/Meta Site: Bone and Locomotor   | 0.4029
Cm3     | Any CM: Acetic acid derivatives        | 0.4133
Prt19   | Prior Cancer Therapy: Paclitaxel       | 0.4290

Table 4.3: Top 5 most discriminant features for Nail Disorder
Neuropathy
Feature | Explanation                            | Averaged Consistency Index
--------|----------------------------------------|---------------------------
Cm163   | Any CM: H2-receptor antagonists        | 0.3767
Lmsite8 | LocAdv/Meta Site: Hepatic              | 0.3976
Ncycles | Number of Cycles Received during Study | 0.4081
NOORGAN | Number of Organs                       | 0.4343
Lmsite9 | LocAdv/Meta Site: Lymph Nodes          | 0.4395

Table 4.4: Top 5 most discriminant features for Neuropathy
4.4 Local Analysis of biomarkers
The previous steps indicated the splits that define subsets of data where feature importance changes the most. This section takes the analysis further and inspects which features are important in one subset and less important in the other. The purpose of this analysis is to gain a deeper understanding of which biomarkers influence the occurrence of an adverse event conditioned on the fact that a patient belongs to a specific group or has taken a certain treatment.
4.4.1 Description of the method
As in the previous step, in order to increase the validity of the results, the analysis was carried out on 20 bootstrap samples of the initial data set. The main idea was to split the data on a particular feature, run a feature selection algorithm (JMI) on each of the two resulting subsets, and record how many times a particular feature appears in the top k most important features over the 20 bootstrap samples. The method is summarized in Figure 4.1, where the split considered was Males/Females.
Figure 4.1: Schema for computing the feature scores within subsets
4.4.2 Computing the scores
Each attribute can appear at most 20 times in the top k most important features for a subset. As a consequence, the maximum score is 20 and the minimum is 0. This was normalized so that the final measure lies between 0 and 1. The value of k was iterated only between 2 and 5, as the analysis is primarily focused on identifying differences in the behavior of the biomarkers that are the most informative. For a fixed k, the features that appeared in either the first subset or the second were recorded along with their scores. If a feature appeared in only one of the subsets, then the score for the other subset was set to 0.
In order to assess whether a feature changes its behavior from one subset to another, a threshold had to be set on the two scores assigned for the two subsets. If we denote by s1 the score a feature has in the first subset and by s2 the score in the second subset, then a feature is considered to have a different importance in the two subsets in the following 2 cases:

1. min(s1, s2) = 0 and max(s1, s2) ≥ 0.5

2. (s1 > 0.7 or s2 > 0.7) and max(s1, s2)/min(s1, s2) > 1.5
The first case means that if a feature appears 50% of the time in the top k important features for one subset, but never in the top k important features for the other subset, then it is considered a biomarker that is relevant only locally. The second case accounts for the situation where neither of the scores is 0. In this case, in order to consider a biomarker only locally important, the score associated with one subset has to be at least 50% higher than the score associated with the biomarker in the other subset. Moreover, in order to avoid situations where both scores are very small but the ratio between them is higher than 1.5 (e.g. s1 = 0.1 and s2 = 0.2), another condition was set: at least one of the scores has to be higher than 0.7, which means that the feature has a high importance in that particular subset.
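The two-case heuristic translates directly into code (a minimal sketch; the function name is mine):

```python
def locally_important(s1, s2):
    """True if a feature's normalized scores in the two subsets differ enough
    to call it locally important: frequent in one subset and absent in the
    other, or one score high (> 0.7) and at least 50% larger than the other."""
    lo, hi = min(s1, s2), max(s1, s2)
    if lo == 0:
        return hi >= 0.5
    return hi > 0.7 and hi / lo > 1.5
```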
4.4.3 Results
This section presents the locally important biomarkers identified by the method described above. For each of the 4 adverse events, the analysis was carried out by splitting the data on each of the 5 features described in Tables 4.1-4.4. However, the results are displayed only for the splits obtained using 2 of the 5 features, which were considered representative for pointing out how the biomarkers change their predictive power within subsets.
Appetite Disorder
Figure 4.2 shows the biomarkers that are important only locally, in one of the subsets obtained by splitting the data on Baseline Body Mass Index and on Number of Cycles. It can be noticed that Age occurs in the top 5 predictive features for Appetite more than 70% of the time in the subset of people whose BMI is greater than 25, whereas in the subgroup of people whose BMI is less than or equal to 25, Age occurs only 30% of the time. A different behavior within the 2 subsets can be observed for the feature denoting whether a patient had a treatment involving electrolyte solutions (cm147), which is present more than 50% of the time in the top 5 predictive features for the group of heavier people (BLBMI > 25), but never for the group of lighter people. The same type of behavior is exhibited by the biomarkers BBFGF and cm229 (Osmotically acting laxatives), which are present more than 50% of the time in the top 5 predictive features for Appetite in the group of people who received more than 3 cycles, but never occur as important for those who received fewer. On the other hand, Cancer histology is more predictive of Appetite disorder in the group of people receiving 3 cycles or fewer compared to those who received more.
Figure 4.2: Locally important biomarkers for Appetite when splitting the data on Body Mass Index (left) and on Number of Cycles (right).
Neutropenia
The results obtained for Neutropenia (Figure 4.3) revealed that a particular biomarker (cm108, colony stimulating factors) can be very important for two groups of people (those who had metastasis located in the lymph nodes and those who have not had prior chemotherapy with Vinorelbine) while carrying a significantly smaller importance for the two complementary groups. Moreover, for the group of people who had Vinorelbine as a prior chemotherapy medicine, whether they also had an H2-receptor antagonist (cm163) appears in the top 4 predictive features for Neutropenia approximately 50% of the time, while for those who did not have Vinorelbine, it never occurs as important.
Figure 4.3: Locally important biomarkers for Neutropenia when splitting the data on lmsite9 (metastasis in Lymph Nodes), left, and on prt25 (prior chemotherapy with Vinorelbine), right.
Neuropathy and Nail Disorder
Analogous to the results shown for Appetite and Neutropenia are those for Neuropathy (Figure 4.4) and Nail Disorder (Figure 4.5). It can be noticed that there are biomarkers which change their predictive power strongly (e.g. in Figure 4.4, left, aered8, reduction of doses, appears 80% of the time in the top 5 predictive features for Neuropathy in the subset of people who had H2-receptor antagonists and never in the subset of people who did not), and others which show a smaller difference in the percentage of times they appear as important in the two subsets (e.g. Figure 4.5, right, the case of BLBSAM).
Figure 4.4: Locally important biomarkers for Neuropathy when splitting the data on cm163 (H2-receptor antagonists), left, and on lmsite8 (metastasis in Hepatic System including Gall Bladder), right.
Figure 4.5: Locally important biomarkers for Nail Disorder when splitting the data on cm113 (combinations of penicillins), left, and on lmsite3 (metastasis in Bone or Locomotor System), right.
4.4.4 Summary and conclusions
This chapter proposed a method for identifying the splits that create the 2 most dissimilar subsets (in the sense of having different predictive features for a particular target). A further analysis was then done to identify the biomarkers that change their importance the most when the data is restricted to one subset or the other. The main findings can be summarized as follows:
Cm229: Osmotically acting laxatives
• For No. of Cycles > 3: cm229 is in the top 4 predictive features for Appetite 60% of the time
• For No. of Cycles ≤ 3: cm229 is never in the top 2 predictive features

Cm147: Electrolyte Solutions
• For BLBMI > 25: cm147 is in the top 4 predictive features for Appetite 60% of the time
• For BLBMI ≤ 25: cm147 is never in the top 2 predictive features

Cm108: Colony Stimulating Factors
• For metastasis in lymph nodes (lmsite9): cm108 is in the top 2 predictive features for Neutropenia 100% of the time
• For no metastasis in lymph nodes (lmsite9): cm108 is in the top 2 predictive features for Neutropenia 5% of the time
• For no prior chemotherapy 25 (Vinorelbine): cm108 is in the top 4 predictive features for Neutropenia 100% of the time
• For prior chemotherapy 25 (Vinorelbine): cm108 is never in the top 4 predictive features for Neutropenia

Cm163: H2-receptor antagonists
• For no prior chemotherapy 25 (Vinorelbine): cm163 is never in the top 4 predictive features for Neutropenia
• For prior chemotherapy 25 (Vinorelbine): cm163 is in the top 4 predictive features for Neutropenia 50% of the time

BLBMI: Baseline Body Mass Index
• For metastasis in the Hepatic System (Gall Bladder included): BLBMI is in the top 5 predictive features for Neuropathy 15% of the time
• For no metastasis in the Hepatic System: BLBMI is in the top 5 predictive features for Neuropathy 80% of the time
• For people who had cm163 (H2-receptor antagonists): BLBMI is in the top 5 predictive features for Neuropathy 75% of the time
• For people who have not had cm163 (H2-receptor antagonists): BLBMI is in the top 5 predictive features for Neuropathy 20% of the time
The results confirmed that features sometimes carry different importance in different parts of the input space, and that adopting a local analysis can reveal relationships between the features and the target variable that would have been hidden by the averaging effect of a global analysis.
Chapter 5
Predictive Model Building
This chapter aims at building and assessing the performance of different predictive models for the four adverse events. The analysis was carried out both globally and locally, on different subsets of the initial data, as computed in the previous chapter. The outline of the chapter is as follows: firstly, an overview of different measures for assessing performance is presented, along with the choices made in this study. Then, the method proposed for building local classifiers is explained. The third part of the chapter focuses on the actual classifiers employed and the results obtained. This part is built in two phases: the first is concerned with keeping the construction of the classifier relatively simple in order to maintain interpretability, while the second attempts to improve the performance by using more advanced classifiers and techniques that deal with the particularities of the data set.

Since the problem under investigation involves predicting four different adverse events, this chapter has a more exploratory character, experimenting with different classifiers, different numbers of features and changing parameters, rather than focusing only on a specific setting.
5.1 Measures of assessing performance
In order to have a meaningful assessment of a classifier's performance, the way the outcome of a model is assessed has to consider the context of the problem as well as the particularities of the data set on which the classification problem is defined. In the case considered in this study, the choice of measures has to reflect the fact that the classification task is set in a medical framework and is concerned with predicting whether a person will develop an adverse event or not. This is important because the real cost of misclassifying a data point greatly depends on the true class of that data point: the cost of predicting that someone will not develop an adverse event when in fact they will is greater than the cost of predicting that a person will develop an adverse event when they will not. Moreover, for all four events the class distribution is imbalanced, with a greater number of examples for the negative class than for the positive one.
As a consequence, the measures used in this study to assess the performance of
the different predictive models are sensitivity, specificity and negative predictive
value. A short overview of each of them along with a motivation for why they are
appropriate is given below. The following notations are used:
• TP=True Positives (the number of positive examples correctly classified);
• TN=True Negatives (the number of negative examples correctly classified);
• FP=False Positives (the number of negative examples incorrectly classified as
positive);
• FN=False Negatives (the number of positive examples incorrectly classified as
negative).
Sensitivity measures how good the classifier was at predicting the positive class; that is, it reports how many of the positive cases it actually predicted as being
positive. In our context, the sensitivity will inform about the rate at which the
classifier identifies the occurrence of an adverse event.
Sensitivity = TP / (TP + FN)
Specificity measures the ability of the classifier to identify the negative class. In the
context of predicting adverse events, the specificity will inform us about how many of the negative cases the classifier correctly identified as being negative.
Specificity = TN / (TN + FP)
In the case of an imbalanced class distribution, the rare class has a smaller impact on accuracy than the prevalent class [23], and for this reason recording the performance on each of the two classes separately (sensitivity and specificity) is a more meaningful way of assessing the real performance of the classifier. For example, in the case of Neuropathy and Nail Disorder, where the positive class has only 10% of the examples, a trivial classifier that always predicts the negative class (0 - no adverse event) will have an accuracy of 90%. However, the sensitivity in this case would be 0.
The negative predictive value indicates the proportion of the persons classified as negative (i.e. predicted not to experience an adverse event) who are correctly classified. In a medical framework a high negative predictive value is desirable, as it means that when the model indicates that a person will not experience an adverse event, it is highly probable that this is a correct result.
Negative Predictive Value = TN / (TN + FN)
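The three measures can be computed directly from the confusion counts; a minimal illustrative sketch (not part of the original experiments), reusing the trivial "always negative" classifier example from above:

```python
def sensitivity(tp, fn):
    # Proportion of actual positives that were predicted positive.
    return tp / (tp + fn)

def specificity(tn, fp):
    # Proportion of actual negatives that were predicted negative.
    return tn / (tn + fp)

def negative_predictive_value(tn, fn):
    # Proportion of negative predictions that were correct.
    return tn / (tn + fn)

# A trivial "always negative" classifier on a 90%/10% data set:
# 90 true negatives, 10 false negatives, no positive predictions at all.
print(sensitivity(tp=0, fn=10))                 # 0.0 despite 90% accuracy
print(specificity(tn=90, fp=0))                 # 1.0
print(negative_predictive_value(tn=90, fn=10))  # 0.9
```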
For the visualization of the results, ROC curves will mainly be used, as they give a good representation of the classifier's performance in terms of false positive rate (1 - specificity, on the x axis) and sensitivity (on the y axis). Thus, the ideal point in ROC space is (0, 1), while a point situated on the line x = y represents the case of randomly guessing the class.
5.2 Local vs. Global Analysis
The results in the previous chapter indicated that features have different importance
in different subsets of the data. As a consequence, this chapter will not only assess the performance of classifiers in predicting each of the four adverse events globally (on the whole data set), but will also propose a method for building local models and compare their performance with the global one.
5.2.1 Local predictive models
In order to allow for a fair comparison between local and global models, all the choices made for the global one (number of features to use, classifiers, method for splitting the data into training and testing sets, etc.) were also implemented in the local models, in the following manner:
• The initial data set was split into two subsets, P1 and P2, on a particular feature (as indicated by the consistency index in the previous chapter);
• Each of the subsets and the corresponding targets were treated as a separate problem, by locally separating the training and testing sets and building two local models, one for each subset;
• The testing set on which all the measures (sensitivity, specificity, etc.) were computed was obtained by concatenating the two testing sets resulting from splitting both P1 and P2 into training and testing data. For each of the testing points, the predicted value was obtained using the corresponding classifier.
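The steps above can be sketched as follows; this is a simplified illustration, with a hypothetical binary split feature and a trivial majority-class learner standing in for the real classifiers:

```python
from collections import Counter

def fit_majority(labels):
    # Stand-in "classifier": always predicts the most common class it saw.
    majority = Counter(labels).most_common(1)[0][0]
    return lambda x: majority

def train_local_models(X, y, split_idx, train_frac=2/3):
    """Split on one feature, train one model per subset, and score each
    test point with the model of its own subset (the three steps above)."""
    models, test_X, test_y = {}, [], []
    for value in sorted({row[split_idx] for row in X}):
        # Subset: all points whose split feature equals `value`.
        sub = [(row, lab) for row, lab in zip(X, y) if row[split_idx] == value]
        cut = round(len(sub) * train_frac)
        train, test = sub[:cut], sub[cut:]
        models[value] = fit_majority([lab for _, lab in train])
        test_X += [row for row, _ in test]
        test_y += [lab for _, lab in test]
    # Concatenated local test sets, each point scored by its own model.
    preds = [models[row[split_idx]](row) for row in test_X]
    return preds, test_y

# Toy data: feature 0 is the (hypothetical) split feature, e.g. Male/Female.
X = [(0, 5), (0, 6), (0, 7), (1, 1), (1, 2), (1, 3)]
y = [0, 0, 0, 1, 1, 1]
preds, truth = train_local_models(X, y, split_idx=0)
```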
A schematic view on building the local models is shown in Figure 5.1, considering
that the split has been done on Males/Females. The analysis is analogous for any other chosen split of the initial data.
Figure 5.1: Schema for building a local predictive model
Though the local models will benefit from a more homogeneous data set, their major drawback, which can potentially influence the accuracy of the results, is that the data available for training each of the local classifiers is only about half of that available to the global model.
5.3 Model Building - Phase I
In the first phase, the choice of classifiers reflected the intention to keep the resulting models simple and easy to interpret. Moreover, the number of features selected using
JMI was varied only between 2 and 11 for the same reason. In the second phase,
those restrictions will be removed and more powerful classifiers will be assessed.
5.3.1 Logistic regression
Logistic regression is a model used to predict the probability of a class, given the current configuration of the input variables, by focusing on the relative probability (odds) of obtaining one of the two categories. The general form for computing the posterior probability of a class is:
p(C1|φ) = y(φ) = σ(w^T φ) [26]
where φ is the feature vector and σ() is the logistic sigmoid function defined as:
σ(a) = 1 / (1 + exp(−a)) [26]
The inverse of the logistic function is given by a = ln(σ / (1 − σ)). As a consequence, the natural logarithm of the odds is expressed as a linear function of the features used, and the resulting model is linear and can be given a direct interpretation.
ln( p(C1|φ) / (1 − p(C1|φ)) ) = w^T φ [26]
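The relationship between the sigmoid and the log-odds can be checked numerically; a small illustrative snippet, with made-up weights (not the coefficients fitted in the experiments):

```python
import math

def sigmoid(a):
    # Logistic sigmoid: maps a real-valued score to a probability.
    return 1.0 / (1.0 + math.exp(-a))

def logit(p):
    # Inverse of the sigmoid: the natural logarithm of the odds.
    return math.log(p / (1.0 - p))

# With weights w and feature vector phi, p(C1|phi) = sigmoid(w . phi),
# so the log-odds is linear in the features.
w, phi = [0.8, -0.5], [1.0, 2.0]
score = sum(wi * fi for wi, fi in zip(w, phi))  # w^T phi
p = sigmoid(score)
print(logit(p))  # recovers w^T phi: the log-odds is linear
```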
The results obtained are shown in Figures 5.2-5.5 for each of the 4 adverse events.
Each point in the ROC curve corresponds to one of the models obtained by varying
the number of features between 2 and 11. The splitting in training/testing data
was done by allowing 2/3 of the available set for training and 1/3 for testing. The
procedure was repeated 20 times, each time shuffling the data in order to allow the classifier to learn and be tested on different data points, and the results plotted (for the negative predictive value) are the 95% confidence intervals computed over the 20 outcomes obtained. This method was preferred to cross-validation because of the imbalanced class distribution (for example, in the case of Nail Disorder the positive class makes up 10% of the examples, and employing a 10-fold cross-validation technique may produce misleading results, as it is
highly probable that in a testing fold no or very few positive cases will be present, which will affect the computation of sensitivity and specificity). This methodology is maintained throughout the experiments.
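The evaluation protocol (20 random 2/3-1/3 splits with a normal-approximation 95% confidence interval over the repeats) can be sketched as follows; the data set and metric below are placeholders, not the thesis data:

```python
import random
import statistics

def repeated_holdout(data, labels, metric, n_repeats=20, train_frac=2/3, seed=0):
    """Shuffle, split 2/3-1/3, score `metric` on the test part, repeat,
    and return the mean with a 95% normal-approximation interval."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        idx = list(range(len(data)))
        rng.shuffle(idx)
        cut = int(len(idx) * train_frac)
        test = idx[cut:]  # the training part idx[:cut] would fit the model
        scores.append(metric([data[i] for i in test],
                             [labels[i] for i in test]))
    mean = statistics.mean(scores)
    half = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
    return mean - half, mean, mean + half

# Placeholder metric: test-set positive rate (stands in for NPV etc.).
data = list(range(100))
labels = [1 if i < 30 else 0 for i in range(100)]
lo, mean, hi = repeated_holdout(data, labels,
                                lambda xs, ys: sum(ys) / len(ys))
```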
Figure 5.2: Appetite disorder prediction using Logistic regression. Left: Negative Predictive value in Local vs. Global Models; Right: ROC points for models built varying the number of features.
Figure 5.3: Neutropenia prediction using Logistic regression. Left: Negative Predictive value in Local vs. Global Models; Right: ROC points for models built varying the number of features.
Figure 5.4: Nail Disorder prediction using Logistic regression. Left: Negative Predictive value in Local vs. Global Models; Right: ROC points for models built varying the number of features.
Figure 5.5: Neuropathy prediction using Logistic regression. Left: Negative Predictive value in Local vs. Global Models; Right: ROC points for models built varying the number of features.
The results indicate in all 4 cases a poor ability of the classifier to correctly iden-
tify the positive class (low sensitivity) both for the global and for the local model.
However, it can still be noticed that in the case of Neutropenia the sensitivity is
higher (it reaches 60% for a specificity of 70%) as compared to the other adverse events, where it rarely rises above 20%. Moreover, for all adverse events the local
model had a higher negative predictive value than the global one, while maintaining
comparable results for sensitivity and specificity, which indicates a higher degree of certainty when predicting that a person will not experience a particular adverse event. The results shown are only for splitting the data on the first feature indicated by the consistency index in the previous chapter, as the others were similar.
Under the same restrictions of attempting to keep the model interpretable, with
decision rules that are easy to explain, Naive Bayes and Decision Trees were also
applied as classifiers. However, the results obtained did not differ significantly from
those obtained using logistic regression. A summary is shown in Tables 5.1 and 5.2, where the results are averaged over the 20 shuffling rounds for 10 features.
Adverse Event   Type of Analysis   Sensitivity   Specificity   Negative Predictive Value
Appetite        Global             0.2145        0.8493        0.6578
Appetite        Local (BLBSAM)     0.2031        0.8527        0.7091
Neutropenia     Global             0.5793        0.8364        0.7424
Neutropenia     Local (cm71)       0.5829        0.8130        0.7743
Nail disorder   Global             0.1615        0.9672        0.8909
Nail disorder   Local (ncycles)    0.1369        0.9675        0.9002
Neuropathy      Global             0.1529        0.9887        0.8978
Neuropathy      Local (cm163)      0.1652        0.9601        0.9013
Table 5.1: Global and Local performance obtained using Naive Bayes for Appetite, Neutropenia, Neuropathy and Nail disorder
Adverse Event   Type of Analysis   Sensitivity   Specificity   Negative Predictive Value
Appetite        Global             0.2674        0.7870        0.6255
Appetite        Local (BLBSAM)     0.2679        0.7658        0.7017
Neutropenia     Global             0.5554        0.7954        0.7098
Neutropenia     Local (cm71)       0.5644        0.7954        0.7640
Nail disorder   Global             0.1241        0.9593        0.8686
Nail disorder   Local (ncycles)    0.1355        0.9403        0.8966
Neuropathy      Global             0.1898        0.9505        0.8600
Neuropathy      Local (cm163)      0.1706        0.9497        0.8922
Table 5.2: Global and Local performance obtained using Decision Trees for Appetite, Neutropenia, Neuropathy and Nail disorder
The results obtained for Decision Trees and Naive Bayes indicate the same poor
performance of classifiers in correctly identifying the positive class. As far as local
vs. global analysis is concerned, for sensitivity and specificity the values obtained are
comparable. However, it can be noticed that the negative predictive value is higher for the local model in all cases. Among the possible explanations for the small
sensitivity obtained are the inappropriateness of the classifiers chosen, the number
of features considered and also the highly imbalanced class distribution especially
for Nail Disorder and Neuropathy.
It can be noticed that as the proportion of positive examples increases (Appetite (30%), Neutropenia (35%)), so does the sensitivity. On the other hand, for these adverse events the results also indicate smaller values for specificity and negative predictive value. However, as the problem is concerned with predicting adverse events, the focus of a predictive model is on identifying as many as possible of the people who will develop an adverse event (high sensitivity), while keeping the number of false positives reasonable (which implies a high specificity).
5.4 Model Building - Phase II
In the second part of model building, in the attempt to improve performance and
address the possible problems identified in the first part, the constraints imposed to
maintain interpretability are removed. Therefore, the classifiers applied will make
use of more complex rules for creating the nonlinear boundaries that separate the
data. Moreover, since the class distribution is highly imbalanced, with the positive class under-represented, methods for balancing the training data will be considered.
5.4.1 Balancing class distribution
This section briefly overviews some of the most popular existing techniques for deal-
ing with the imbalanced class problem and explains the chosen method implemented
in the following experiments. The techniques for balancing class distribution can
be broadly classified into under-sampling, over-sampling and techniques that create
new data points.
• In the random undersampling technique, the size of the majority class is decreased by randomly removing points belonging to the majority class from the data set [24]. Though this technique has empirically good results, its major disadvantage is that it discards possibly important information.
• The complementary random oversample technique works by randomly increas-
ing the minority class data points simply by replicating the existing ones.
Though this technique shares one of the advantages of random undersam-
pling, which is simplicity, it has been argued that in fact it does not add any
new information to the data set as it only copies existing information [24].
• A technique that attempts to oversample the minority class by adding new information is SMOTE (Synthetic Minority Oversampling Technique) [27]. The main idea of SMOTE is to choose an existing data point from the minority
class, and find its closest n neighbors. From those, one neighbor is chosen
randomly and a new data point is created as a random point on the line
segment which joins the initial data point and its neighbor [27]. However, as
we deal with categorical data this technique is not suitable in this framework.
The oversampling method employed in the experiments also adds new information
by approximating the distribution of each feature from the available examples in the
positive class and generating a new data point of the same class which will follow
that distribution. The technique is based on the Inverse Transformation Method
for discrete random variables [25]. Its main drawback, which will be addressed in future work, is that each feature is considered independently of the others, an assumption which may not hold in practice. In order to ensure the validity of the results, the oversampling was applied only to the training data; the performance of the model was assessed on testing data that came from the original data set.
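The per-feature inverse transform idea can be sketched as follows; an illustration under the feature-independence assumption noted above, with hypothetical variable names:

```python
import random
from bisect import bisect_left

def empirical_sampler(values, rng):
    """Inverse-CDF sampler for one categorical feature, built from the
    values observed in the positive class."""
    support = sorted(set(values))
    cdf, total = [], 0.0
    for v in support:
        total += values.count(v) / len(values)
        cdf.append(total)
    def sample():
        # Inverse Transformation Method: draw u ~ U(0,1) and return the
        # first category whose cumulative probability reaches u.
        u = rng.random()
        return support[min(bisect_left(cdf, u), len(support) - 1)]
    return sample

def oversample_positives(pos_rows, n_new, seed=0):
    # Each feature is sampled independently (the assumption noted above).
    rng = random.Random(seed)
    samplers = [empirical_sampler([r[j] for r in pos_rows], rng)
                for j in range(len(pos_rows[0]))]
    return [tuple(s() for s in samplers) for _ in range(n_new)]

# Toy positive class with one numeric-coded and one categorical feature.
positives = [(0, 'A'), (1, 'A'), (1, 'B')]
synthetic = oversample_positives(positives, n_new=4)
```

Each synthetic point draws every feature from the empirical distribution of the positive class, so new (previously unseen) feature combinations can appear even though every individual value has been observed.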
5.4.2 Ensemble classifiers
AdaBoost - Overview
AdaBoost (adaptive boosting) is the most widely used form of boosting, a family of algorithms that combine multiple base classifiers to produce the final outcome of a classification task [8]. This type of algorithm was chosen here as it is known to give good results even when the performance of the base classifiers is only slightly better than random [8] (which is the case for the classifiers applied in the first phase).
In AdaBoost a series of weak classifiers is trained in sequence, each on a weighted training set where the weights are updated according to the performance of the previous classifier, in the following manner: the points that were misclassified by the previous classifier are assigned higher weights when training the next classifier in the sequence [8]. In order to output the final prediction, the
outcomes of all the classifiers will be combined through a weighted majority voting
scheme [8].
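The reweighting step described above can be sketched in isolation (one boosting round; the error, alpha and weight-update formulas follow the standard AdaBoost scheme and are an illustration, not code from the thesis):

```python
import math

def adaboost_round(weights, misclassified):
    """One AdaBoost weight update: compute the weighted error of the
    current weak learner, its vote weight alpha, and the reweighted
    (normalised) training distribution for the next learner."""
    err = sum(w for w, m in zip(weights, misclassified) if m)
    alpha = 0.5 * math.log((1 - err) / err)  # vote of this weak learner
    # Misclassified points are up-weighted, correct ones down-weighted.
    new = [w * math.exp(alpha if m else -alpha)
           for w, m in zip(weights, misclassified)]
    z = sum(new)
    return alpha, [w / z for w in new]

# Four equally weighted points; the weak learner gets one wrong.
weights = [0.25] * 4
alpha, new_weights = adaboost_round(weights, [False, False, False, True])
print(alpha > 0)                        # better than random: positive vote
print(new_weights[3] > new_weights[0])  # the error is now emphasised
```

After normalisation the misclassified points always carry half of the total weight, which is what forces the next weak learner to concentrate on them.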
Experimental design and results
In the experiments, logistic regression was used as the base classifier. The number of features was initially varied between 2 and 50 in steps of 5. However, as shown in Figure 5.6, the sensitivity slowly decreases as the number of features increases above 10. Consequently, the number of features used was, as in the first phase, between 2 and 11. The same observation can be made for the other classifiers employed in this set of experiments: increasing the number of features degrades the classifier's ability to correctly identify the positive class. The separation
between training and testing was done as in the previous section, allowing 2/3 of the
data for training and 1/3 for testing, but the class distribution in the training data
was balanced using the oversample technique described in section 5.4.1, so that the
number of the positive and negative examples would be equal. The results for each
of the four events are shown in Figures 5.7, 5.8, 5.9 and 5.10.
Figure 5.6: Variation of Sensitivity as the number of features increases. Left: AdaBoost (base classifier: Logistic Regression) for predicting Neutropenia. Right: Random Forest for predicting Neutropenia.
Figure 5.7: Neutropenia prediction using AdaBoost. Left: Negative Predictive value in Local vs. Global Models; Right: ROC points for models built varying the number of features.
Compared to the initial results in phase I, an increase in sensitivity can be noticed, associated with only a small drop in specificity. The negative predictive value has also shown a small increase, from approx. 70% (for the local model) to around 80%. In both phases the local model had a higher negative predictive value than the global one. While in the first phase there was no significant difference between local and global models as far as sensitivity and specificity are concerned, here the local model tends towards a higher sensitivity and the global one towards a higher specificity.
Figure 5.8: Appetite prediction using AdaBoost. Left: Negative Predictive value in Local vs. Global Models; Right: ROC points for models built varying the number of features.
For Appetite, no significant improvement in prediction performance can be noticed. The small increase in sensitivity, associated with a decrease in specificity, shows that the results almost follow the x = y line, which signifies random guessing. The relatively high negative predictive value can be explained by the fact that, as the distribution in the testing data has not been changed (30% positive class), a trivial classifier that always predicts the negative class will have an approx. 70% negative predictive value.
Figure 5.9: Nail Disorder prediction using AdaBoost. Left: Negative Predictive value in Local vs. Global Models; Right: ROC points for models built varying the number of features.
In the case of Nail Disorder, a higher sensitivity (40-60%) came at the cost of a larger drop in specificity (or, equivalently, an increase in the false positive rate). A difference between the global and local models can be observed only for sensitivities over 40%, where the global one performed better. However, for the negative predictive value, the local one had better results.
Figure 5.10: Neuropathy prediction using AdaBoost. Left: Negative Predictive value in Local vs. Global Models; Right: ROC points for models built varying the number of features.
The performance in predicting Neuropathy is similar to that obtained for Nail Dis-
order, that is an increase in sensitivity is only obtained for a lower specificity. The
results for all four adverse events indicate a small increase in sensitivity (as com-
pared to the performance obtained in phase I). However, these values are associated
with a smaller specificity. The best results were obtained, as in the previous phase, for Neutropenia. In terms of the comparison between local and global analysis, as far as sensitivity and specificity are concerned the performance is similar. On the other hand, the negative predictive value was again better for the local analysis for all adverse events.
Random forest
Random forest is an ensemble classifier that uses a number of decision trees built
on bootstrap samples of the data and whose final prediction is the majority vote of
the individual classifiers. The randomness comes from the fact that for each node
of a decision tree a number of splits are chosen randomly and among those the best
one is used. As the results are similar to those obtained using AdaBoost, only a summary of them is shown in Table 5.3, computed using 10 decision trees and
averaging over 20 shuffles of the data.
Adverse Event   Type of Analysis   Sensitivity   Specificity   Negative Predictive Value
Appetite        Global             0.3757        0.7007        0.6078
Appetite        Local (BLBSAM)     0.4278        0.6073        0.7131
Neutropenia     Global             0.6802        0.7335        0.7147
Neutropenia     Local (cm71)       0.6896        0.6971        0.7943
Nail disorder   Global             0.2208        0.8128        0.8157
Nail disorder   Local (ncycles)    0.3336        0.9510        0.9038
Neuropathy      Global             0.3198        0.8881        0.8176
Neuropathy      Local (cm163)      0.2135        0.8283        0.8890
Table 5.3: Global and Local performance obtained using Random Forest for Appetite, Neutropenia, Nail Disorder and Neuropathy
The most accurate performance was for Neutropenia using the local model (sensitivity 69%, specificity 70% and negative predictive value 79%). For the other adverse events, the results follow the same pattern: a low sensitivity and a higher specificity, which denotes again that even using an oversampling method the classifiers identify only a small percentage of the positive class.
5.4.3 Support Vector Machines
Support Vector Machines are a set of classifiers that construct a decision boundary (a hyperplane) or a set of decision boundaries separating the data into different classes. An SVM finds the best hyperplane, that is, the one with the maximum margin between the points in the classes. It can also be employed for non-linearly separable data sets by using the so-called kernel trick, which maps the features to a higher-dimensional space using different kernel functions. Thus, a data set that was not linearly separable in the original space may become linearly separable in the higher-dimensional one.
Experimental design and results
In the experiments the Radial Basis Function (RBF) was used as the kernel function. It is defined as:

φ(x, y) = exp(−γ |x − y|^2)

The gamma parameter (γ), which is the kernel width, was varied on a logarithmic scale between 10^−5 and 10^9 in order to understand how it affects the classifier's performance. The splitting into
training and testing data was done by considering 2/3 for training and 1/3 for test-
ing. In order to decrease the variance and allow the classifier to learn and to be
tested on different parts of the input space, the data was shuffled 20 times and the
results averaged. Each time, the number of positive cases in the training data was increased using the technique described in Section 5.4.1, so that there was an equal number of positive and negative examples. The results are shown in Figures 5.11 and 5.12. For this experiment, the number of features was fixed at 10, and each point in the ROC space represents the classifier obtained for a different gamma parameter.
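The effect of the kernel width can be seen directly from the RBF formula; a small numeric illustration (the gamma values below are for demonstration only, not the exact grid used in the experiments):

```python
import math

def rbf(x, y, gamma):
    # RBF kernel: similarity decays with squared Euclidean distance,
    # at a rate controlled by gamma (the kernel width parameter).
    dist2 = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * dist2)

x, y = (0.0, 0.0), (1.0, 1.0)
for gamma in (1e-3, 1e-1, 1e1):
    # Small gamma: almost all points look similar (smooth boundary);
    # large gamma: only very close points look similar (wiggly boundary).
    print(gamma, rbf(x, y, gamma))
```

This is why sweeping gamma traces out different points in ROC space: each value of the kernel width yields a differently shaped decision boundary, and hence a different sensitivity/specificity trade-off.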
Figure 5.11: Neuropathy (Left) and Neutropenia (Right) prediction using SVM
Figure 5.12: Nail Disorder (Left) and Appetite (Right) prediction using SVM
For different values of the gamma parameter, different points in the ROC space
were obtained. It can be noticed that generally, as we obtain a higher sensitivity,
the specificity of the model drops. Better results compared to previous ones can be
noticed for Neuropathy, while for the others there is not a significant difference.
For Neuropathy, the local model was better than the global one in the majority of
cases. Compared with the initial results from the first phase, the models obtained
using SVM and oversampling show a significant increase in sensitivity. However,
this increase came at the cost of a decrease in specificity.
For Nail Disorder, the local model has a higher sensitivity (it reaches 50%), but this increase came at the cost of a considerable reduction in specificity (60%). On the other hand, the results obtained with the global model were less sensitive to varying gamma. The best results were obtained, as with all the other classifiers used in the experiments, for the adverse event that is naturally more balanced, namely Neutropenia.
5.5 Chapter summary and Conclusions
5.5.1 Summary
This chapter investigated the performance of different classifiers in predicting the four adverse events. The experiments were conducted in two phases: in the first, in order to keep the resulting models easy to interpret, less complex classifiers were used (Logistic Regression, Decision Trees and Naive Bayes), while in the second the experiments were conducted using AdaBoost, Random Forests and Support Vector Machines along with an oversampling technique. Based on the results obtained in the previous chapter, where the most dissimilar splits were identified, a method for building local models and comparing their performance with global ones was proposed.
5.5.2 Conclusions
The results indicate that the only predictable adverse event is Neutropenia. For
all the other events (Neuropathy, Nail Disorder, Appetite), the classification results
can be summarized as having low sensitivity and a high specificity in the first phase
of model building (see Figures 5.2, 5.4, 5.5, Table 5.1 and Table 5.2). Using more powerful classifiers along with oversampling in the second phase resulted in an increase in sensitivity, accompanied by a decrease in specificity (see Figures 5.8, 5.9, 5.10 and Table 5.3).
Neutropenia
In the first phase of model building the results show that depending on the number
of features used, the sensitivity ranges between 40% and 65%, while the specificity
varies between 60% and 80% (the higher the sensitivity, the lower the specificity - see Figure 5.3).
In the second phase of model building, the predictability of Neutropenia improved in the following manner: the sensitivity increased to between 60% and 80%, while the specificity was maintained at the same level (60%-80%) - see Figure 5.7. The negative predictive value also improved, from approx. 70% (for the local model - see Figure 5.3) to 80% (Figure 5.7).
The local analysis for Neutropenia was carried out on the two subsets resulting from splitting the initial dataset on feature cm71 (Benzodiazepine derivatives). This feature was indicated as the first choice for splitting (see the chapter Analysis of feature importance within subsets, Table 4.2). While in the first phase of model building,
the performance of local models was similar to that of the global ones as far as sensi-
tivity and specificity are concerned (Figure 5.3), in the second phase using Adaboost
and Random Forest, the local model had a higher sensitivity, but a lower specificity
than the global one (Figure 5.7). In both phases, the negative predictive value was
higher for the local analysis. Using SVM in the second phase revealed that the local
model had a better performance than the global one in the majority of situations
considered by changing the gamma parameter (Figure 5.11).
The performance of a classifier in the present setting is mainly concerned with cor-
rectly identifying the people who will experience an adverse event while maintaining
a reasonable rate of false positives. This is because the real cost of allowing a person
who will experience an adverse event to take part in a clinical trial (that is mak-
ing a false negative error) is higher than the cost of preventing someone who will not experience an adverse event from taking part (a false positive error). For this
reason, the performance of predicting Neutropenia can be considered better in the
local analysis than in the global one.
Chapter 6
Conclusions
This chapter presents an overview of the research carried out in this project, along with the main conclusions drawn. In addition, it proposes possible further investigation steps that have not been covered. The main objective of the project was to investigate the hypothesis that feature importance differs in different parts of the input space and, based on the results, to propose a method for building local classification models and compare their performance with global ones. The research was carried out on data provided by the pharmaceutical company AstraZeneca and focused on predicting four adverse events: Appetite Disorder, Neutropenia, Nail Disorder and Neuropathy. The project was carried out in three main phases, as described below.
6.1 Summary of the research and conclusions
6.1.1 Data Preprocessing and Initial Experiments
This part gives a thorough description and motivation of the choices made in preprocessing the data, as these influence the final results obtained. Moreover, initial
experiments are carried out in three phases in order to gain insights in the relation-
ship between the data and the adverse events to be analyzed.
1. The first experiment was to analyze the normalized mutual information be-
tween each feature and the four adverse events.
2. The second one consisted in choosing five features (Sex, Race, hstltyp (Histology type), Smoking habits and Tstage (Cancer Stage)) to split the data on and analyzing the mutual information locally, in each of the resulting subsets.
3. The third experiment also focused on local analysis of individual feature im-
portance, but the subsets of the data were obtained by running a k-means clustering algorithm.
These initial experiments revealed a small statistical dependence between the mea-
surements and each of the four adverse events. However, a small increase could be
noticed when the mutual information was computed locally, in different subsets of
the initial data.
6.1.2 Analysis of feature importance within subsets
The second part of the project attempted to understand how the importance of each
feature in predicting a specific adverse event changes within different subsets and
identify the features that are only locally important. This part of the research is
structured into two subsections that aim at answering two questions:
1. Which are the most discriminant features? A discriminant feature is considered one that splits the original data into two subsets with the following characteristic: the features that are predictive for a specific adverse event in one subset are different from the features that are predictive for the same adverse event in the other subset.
2. Which are the biomarkers that are only locally important? A locally important biomarker is considered one that is predictive of an adverse event only in a sub-area of the input space.
Firstly, a method is proposed for discovering the features that produce the most dissimilar subsets in terms of their top 10 predictive features. The subsets are obtained by splitting the initial data on the categories each feature defines. The method is based on the Consistency Index introduced in [4], which measures the similarity between two feature sets. Then, a score is associated with each feature based on the frequency of its appearance among the top predictive features within different subsets. A threshold on this score is proposed for considering a feature only locally important.
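Assuming the usual form of the consistency index for two feature subsets of equal size k out of n features (the measure from [4]), the similarity score can be sketched as:

```python
def consistency_index(set_a, set_b, n_features):
    """Consistency index between two feature subsets of equal size k,
    out of n_features in total: corrects the raw overlap r for the
    overlap expected by chance (k^2 / n)."""
    k = len(set_a)
    assert len(set_b) == k and 0 < k < n_features
    r = len(set(set_a) & set(set_b))
    return (r * n_features - k * k) / (k * (n_features - k))

# Identical top-10 lists score 1; disjoint lists score below 0
# (less overlap than expected by chance).
a = set(range(10))
b = set(range(10, 20))
print(consistency_index(a, a, 100))  # 1.0
print(consistency_index(a, b, 100))  # negative
```

The most discriminant split feature is then the one whose two subsets yield the lowest index between their top-10 predictive feature lists.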
The results indicate that the most discriminant feature is:
• For Appetite Disorder: BLBSAM (Baseline Body surface Area)
• For Neutropenia: cm71 (Benzodiazepine derivatives)
• For Nail Disorder: ncycles (Number of Cycles during Study)
• For Neuropathy: cm163 (H2-receptor antagonists)
The second investigation revealed that there are features that change their importance within different subsets and show only local predictive power. A summary of the most significant results is given below, mentioning the biomarker, its associated score of occurrence among the top predictive features, the adverse event and the group of people for which it is important. In the complementary group, the same biomarker was either never present among the top predictive features or showed a considerably lower score of occurrence (50% smaller).
Cm229: Osmotically acting laxatives
• 60% of the time present in top 4 predictive features for Appetite in the group
of people who had more than 3 cycles of doses
Cm147: Electrolyte Solutions
• 60% of the time present in top 5 predictive features for Appetite for people
with BLBMI > 25
Cm108: Colony Stimulating factors
• 100% of the time present in top 2 predictive features for Neutropenia for people
who have metastasis in lymph nodes (lmsite9)
• 100% of the time present in top 4 predictive features for Neutropenia in the group of people who did not have prior chemotherapy with Vinorelbine (prt25)
BLBMI: Baseline Body Mass Index
• 80% of the times present in top 5 predictive features for Neuropathy for people
with no metastasis in Hepatic System (lmsite8)
• 80% of the time present in top 5 predictive features for Neuropathy for people
who had cm163 (H2-receptor antagonists)
6.1.3 Predictive model building
The last part of the project analyzes the performance of different classifiers in pre-
dicting the four adverse events. The analysis was carried out in two phases:
1. In the first phase, in order to maintain a simple, possibly interpretable model,
less complex classifiers were used: Logistic Regression, Naive Bayes and Decision Trees.
2. In the second phase, more advanced classification methods were employed (Random
Forest, AdaBoost, Support Vector Machines), together with an oversampling
technique to balance the distribution of the positive and negative examples
in the training data.
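The oversampling step can be illustrated with a minimal sketch. This is not the exact routine used in the experiments, only the underlying idea: each feature of a synthetic positive example is drawn independently from the values that feature takes in the real positive class (the independence assumption noted later in the future-work discussion):

```python
import random

def oversample_independent(positives, n_new, seed=0):
    """Generate n_new synthetic positive examples. Each feature of a synthetic
    point is drawn independently from the empirical values of that feature in
    the real positive examples (independence assumption)."""
    rng = random.Random(seed)
    n_features = len(positives[0])
    synthetic = []
    for _ in range(n_new):
        # sample each feature's value independently of the others
        point = [rng.choice([row[j] for row in positives])
                 for j in range(n_features)]
        synthetic.append(point)
    return synthetic

# Toy example: generate 4 synthetic positives from 2 real ones.
pos = [[1, 0, 1], [1, 1, 0]]
print(oversample_independent(pos, n_new=4))
```

Because each feature is sampled from its own marginal distribution, any correlation between features in the positive class is lost, which is precisely the limitation raised in the future-work section.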
In addition, based on the results obtained in the second part of the project
(Analysis of Feature Importance within Subsets), a method was proposed for building
local predictive models and comparing their performance with that of the global ones.
The local analysis was carried out for both phases considered in this chapter.
The results obtained in this part showed that the only adverse event that can be
predicted from the measurements provided is Neutropenia. The data set on which
the prediction task was carried out consists of measurements from patients that
received the placebo in the clinical trial. The poor prediction performance obtained
for Appetite Disorder, Nail Disorder and Neuropathy indicates that there is not a
strong relationship between the occurrence of these adverse events and the clinical
measures taken. Repeating the investigation on the people that actually received
the new drug and comparing the prediction performance with the current results
can reveal whether the drug has a greater influence on the occurrence of any of the
adverse events.
For Appetite Disorder, Neuropathy and Nail Disorder, although the analysis used
both simple and complex classifiers, with different numbers of features and different
parameter settings, along with an oversampling technique to artificially increase
the number of positive cases, the results indicate that there is no significant
dependence between the measured biomarkers and the occurrence of these events.
The performance of the classifiers in predicting Appetite Disorder, Neuropathy
and Nail Disorder can be summarized as low sensitivity and high specificity in
the first phase of the analysis (see Figures 5.2, 5.4 and 5.5), while in the second
phase an increase in sensitivity can be obtained, but only at the cost of a
significant decrease in specificity (see Figures 5.8-5.10).
In predicting the occurrence of Neutropenia, the simple classifiers (Logistic
Regression, Naive Bayes and Decision Trees) achieved a sensitivity between 40% and
65% with a specificity between 60% and 80% (a higher sensitivity corresponding to a
lower specificity). The variation in the results is attributed to the different
numbers of features used: between 2 and 11 (Figure 5.3 and Table 5.1). Using more
complex classifiers (AdaBoost, Random Forest) and oversampling the positive class
so that the training data contained equal numbers of positive and negative examples,
the results improved: the sensitivity increased while the specificity was maintained
at the same level (see Figure 5.7). The negative predictive value was also higher in
the second phase of model building, that is, when the complex classifiers were used.
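For reference, the quantities reported throughout this section follow the standard confusion-matrix definitions; the small helper below, written purely for illustration, makes them explicit:

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity and negative predictive value from binary
    labels (1 = adverse event occurred, 0 = no event)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)   # fraction of true events detected
    specificity = tn / (tn + fp)   # fraction of non-events correctly rejected
    npv = tn / (tn + fn)           # reliability of a negative prediction
    return sensitivity, specificity, npv

sens, spec, npv = confusion_metrics([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])
print(sens, spec, npv)  # sensitivity 0.5, specificity and NPV both 2/3
```

A high negative predictive value is what makes a "no adverse event" prediction trustworthy, which is why it is tracked alongside sensitivity and specificity here.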
The local models for predicting Neutropenia were built on the subsets obtained by
splitting the initial data on cm71 (Benzodiazepine derivatives), as computed in the
chapter Analysis of Feature Importance within Subsets. Comparing the performance
of local vs. global models, the local model had a higher negative predictive value
in all the experiments. In terms of specificity and sensitivity, the results for
the first phase of model building were similar for both approaches (the differences
are mainly associated with the different numbers of features used). In the second
phase, however, a tendency can be noticed for the local model towards higher
sensitivity, and for the global model towards higher specificity. Moreover, using
SVM with different parameter settings, the local model outperformed the global one
in the majority of cases (Figure 5.11, right).
Since the prediction task is set in a medical framework, obtaining a higher
sensitivity (without a major loss of specificity) is especially important, as it
means a larger number of people are correctly identified as suffering an adverse
event. All things considered, since the local model was either similar to or better
than the global one in terms of specificity and sensitivity, and always better for
the negative predictive value, the local analysis has an advantage over the global one.
6.1.4 How can the proposed techniques be transferred to
new data sets?
The thesis presented research into locally important biomarkers on a specific data
set. However, the methods proposed can easily be transferred to new data. This
section summarizes the main steps of the employed procedures.
Step 1: Identifying the most discriminant feature
This procedure identifies the most discriminant feature in a data set, which in this
context is the one that splits the original data into two subsets such that the
features that are predictive for a target variable in one subset differ from the
features that are predictive for the same variable in the other subset.
Begin: Define a valid split of the data in the context of the given problem
(minimum number of data points in a group, merging subcategories, etc.)
For each feature that creates a valid split
    Split the data into two subsets: A and B
    FS1 = top k predictive features after running a feature selection algorithm on A
    FS2 = top k predictive features after running a feature selection algorithm on B
    Compute the Kuncheva Consistency Index¹ for FS1 and FS2
End For
Display the feature with the lowest Kuncheva Consistency Index (the most discriminant).
End Procedure
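The procedure above might be sketched in Python as follows; `select_top_k` is a placeholder for whatever feature selection routine is used in practice (e.g. one based on JMI), and the representation of the data subsets is left abstract:

```python
def most_discriminant_feature(candidate_splits, select_top_k, k, n_features):
    """candidate_splits maps each splitting feature to its two data subsets;
    select_top_k(subset, k) is a placeholder feature selection routine that
    returns the k best feature indices on that subset."""
    scores = {}
    for feat, (subset_a, subset_b) in candidate_splits.items():
        fs_a = set(select_top_k(subset_a, k))
        fs_b = set(select_top_k(subset_b, k))
        r = len(fs_a & fs_b)  # shared top-k features
        # Kuncheva consistency index for two equal-size subsets
        scores[feat] = (r * n_features - k * k) / (k * (n_features - k))
    # The lowest consistency marks the most discriminant splitting feature.
    return min(scores, key=scores.get)
```

Any feature selection criterion can be plugged in through `select_top_k`; only the consistency comparison between the two halves is fixed.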
Depending on the number of instances available and on the constraints imposed for
a valid split, the procedure can be repeated recursively, such that A and B are in
turn treated as individual problems and split further. However, it should be noted
that the smaller the problem, the less accurate the results obtained and the higher
the risk of overfitting.
¹For data sets where individual features have a significant mutual information with
the target variable, the Importance Profile Angle proposed in [1] can also be used.
Step 2: Identifying locally important biomarkers
The second step is to identify the biomarkers that have only local importance. The
discriminant features obtained in Step 1 are used to decompose the initial problem
into subspaces for a local analysis of biomarker importance.
Begin: Compute n bootstrap samples of the initial data
For each bootstrap sample
    Split the data on a discriminant feature (see Step 1) into subgroups A and B
    Select the top k important features for A using a feature selection algorithm
    and update their frequency of occurrence, frq1
    Select the top k important features for B using a feature selection algorithm
    and update their frequency of occurrence, frq2
End For
F = the union of features that appeared as important at least once in
one of the two subsets (frq1 > 0 or frq2 > 0)
For each feature f in F
    Compute the scores of occurrence by normalizing the frequencies:
    s1 = frq1/n
    s2 = frq2/n
    If (min(s1, s2) = 0 and max(s1, s2) ≥ 0.5) or
       ((s1 > 0.7 or s2 > 0.7) and max(s1, s2)/min(s1, s2) > 1.5)²
    then f is a locally important feature.
    End If
End For
End Procedure
²The threshold may be adjusted to meet the particularities of a specific data set.
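A sketch of this procedure, under the same placeholder assumptions (a `split_fn` producing the two subgroups and a `select_top_k` feature selection routine), might look like:

```python
import random
from collections import Counter

def locally_important_features(data, split_fn, select_top_k, k, n_boot=100, seed=0):
    """Flag features whose occurrence scores differ strongly between the two
    subgroups. split_fn(sample) -> (subgroup_A, subgroup_B) and
    select_top_k(subgroup, k) -> k best feature indices are placeholders."""
    rng = random.Random(seed)
    frq1, frq2 = Counter(), Counter()
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in data]    # one bootstrap resample
        a, b = split_fn(sample)
        frq1.update(select_top_k(a, k))
        frq2.update(select_top_k(b, k))
    local = []
    for f in set(frq1) | set(frq2):                  # important at least once
        s1, s2 = frq1[f] / n_boot, frq2[f] / n_boot  # normalized scores
        hi, lo = max(s1, s2), min(s1, s2)
        # thresholds from the text; may need adjusting for a specific data set
        if (lo == 0 and hi >= 0.5) or \
           ((s1 > 0.7 or s2 > 0.7) and lo > 0 and hi / lo > 1.5):
            local.append(f)
    return local
```

The `lo > 0` guard simply avoids a division by zero in the ratio test; the case where one score is zero is already handled by the first clause.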
Step 3: Building local predictive models
The last step is the actual building of local predictive models. General guidance
on how this can be done so as to allow comparison with the global models is given
below:
• Split the data into two subsets³, A and B, on a particular discriminant feature;
• Consider each of the subsets and the corresponding target as a separate prob-
lem by splitting the training and testing sets locally and building two
local models, one for each subset;
• Concatenate the two testing sets that result from splitting both A and B into
training and testing data, and predict the class of each point using the model
built on the subset it belongs to.
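The three steps can be sketched as follows; `split_train_test` and `train` are placeholders standing in for the actual data partitioning and classifier training code:

```python
def local_predictions(subset_a, subset_b, split_train_test, train):
    """Build one model per subset and predict on the concatenation of the two
    local test sets. split_train_test(subset) -> (train_set, test_set) and
    train(train_set) -> model with a .predict(point) method are placeholders."""
    results = []
    for subset in (subset_a, subset_b):
        tr, te = split_train_test(subset)
        model = train(tr)                   # local model for this subset only
        # each test point is predicted by the model of the subset it belongs to
        results.extend((point, model.predict(point)) for point in te)
    return results
```

Because the concatenated test set covers the whole data, the resulting predictions can be scored with exactly the same metrics as a single global model, which is what makes the local/global comparison fair.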
6.1.5 Future work
There are several possible paths that can be considered for further investigation
of the data set and of the assumption that local feature selection can provide
better results than global feature selection.
Analysis of feature importance within subsets
• Integrating prior knowledge into identifying the degree of heterogeneity of the
dataset and choosing a more meaningful decomposition into subsets. A medical
opinion on how to group the patients that are more likely to have similar causes
for an adverse event could potentially improve the classification results, as it
decreases the chances that a classifier is confused by features that are only
meaningful for a certain category of people.
³An extension to more than two subsets can also be employed in an analogous manner.
• Repeating the analysis of the local biomarkers using feature selection criteria
other than JMI and investigating the stability of the obtained results.
• A more fine-grained investigation of the most suitable threshold on feature
scores for considering a biomarker only locally important.
6.1.6 Model Building
• Since the current oversampling technique assumes that the features are inde-
pendent, steps could be taken to loosen this assumption and take into account
possible dependencies between the features.
• A further analysis of other kernels, as only the RBF kernel was investigated
when applying SVM.
• Experimenting with different classifiers for the local and global models. In the
present study, the same classifier and the same feature selection method were
used for both the global and the local models.
References
[1] Apte C, Hong J, Hosking J, Lepre J, Pednault E, Rosen B. 1997. Decomposition
of heterogeneous classification problems. In Proceedings of the Second International
Symposium on Advances in Intelligent Data Analysis, Springer-Verlag, 17-28.
[2] Battiti R. 1994. Using mutual information for selecting features in supervised
neural net learning. IEEE Transactions on Neural Networks, 5(4):537-550.
[3] Bontempi G, Meyer P. 2006. On the Use of Variable Complementarity for Feature
Selection in Cancer Classification. Applications of Evolutionary Computing, 91-102.
[4] Brown G, Pocock A, Zhao M, Lujan M. 2010. Feature Selection via Conditional
Likelihood. Journal of Machine Learning Research, 1-48.
[5] Brown G. 2010. Lecture Pack 1, Machine Learning and Data Mining. The
University of Manchester.
[6] Domingos P. 1997. Context-Sensitive Feature Selection for Lazy Learners.
Artificial Intelligence Review, 11(1-5):227-253.
[7] Edwards R, Aronson JK. 2000. Adverse drug reactions: definitions, diagnosis
and management. The Lancet, vol. 356.
[8] Guyon I, Elisseeff A. 2003. An Introduction to Variable and Feature Selection.
Journal of Machine Learning Research, 3:1157-1182.
[9] Guyon I, Gunn S, Nikravesh M, Zadeh LA (Eds.). 2006. Feature Extraction:
Foundations and Applications. Springer-Verlag, Berlin Heidelberg.
[10] Kira K, Rendell L. 1992. A Practical Approach to Feature Selection. In
Proceedings of the Ninth International Workshop on Machine Learning (ML92),
Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 249-256.
[11] Liu H, Motoda H, Setiono R, Zhao Z. 2010. Feature Selection: An Ever Evolving
Frontier in Data Mining. Journal of Machine Learning Research: Workshop and
Conference Proceedings, 10:4-13.
[12] Pechenizkiy M, Tsymbal A, Puuronen S. 2006. Local Dimensionality Reduction
and Supervised Learning Within Natural Clusters for Biomedical Data Analysis.
IEEE Transactions on Information Technology in Biomedicine, 10(3):533-539.
[13] Peng H, Long F, Ding C. 2005. Feature Selection Based on Mutual Information:
Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 27(8):1226-1238.
[14] Pineda-Bautista B, Carrasco-Ochoa JA, Martinez-Trinidad J. 2010. General
framework for class-specific feature selection. Expert Systems with Applications,
38(8).
[15] Puuronen S, Tsymbal A. 2001. Local Feature Selection with Dynamic Integration
of Classifiers. Fundamenta Informaticae, 47(1-2):91-117.
[16] Puuronen S, Tsymbal A, Skrypnyk I. 2000. Advanced Local Feature Selection in
Medical Diagnostics. In Proceedings of the 13th IEEE Symposium on Computer-Based
Medical Systems (CBMS'00), IEEE Computer Society, Washington, DC, USA.
[17] Wang L, Zhou N, Chu F. 2008. A General Wrapper Approach to Selection of
Class-Dependent Features. IEEE Transactions on Neural Networks, 19(7).
[18] VideoLectures.Net. 2010. Guyon I: Introduction to feature selection. [Online]
Available at: http://videolectures.net/bootcamp07_guyon_ifs/ [Accessed: May 2011]
[19] AstraZeneca. 2011. Clinical Trials. [Online] Available at:
http://www.astrazeneca.co.uk/rnd/clinical-trials/ [Accessed: May 2011]
[20] Francois D, Wertz V, Verleysen M. 2006. The permutation test for feature
selection by mutual information. In ESANN 2006 Proceedings - European Symposium on
Artificial Neural Networks, Bruges (Belgium), 26-28 April 2006, d-side publi.,
ISBN 2-930307-06-4.
[21] Yang HH, Moody J. 1999. Feature Selection Based on Joint Mutual Information.
In Advances in Intelligent Data Analysis, Rochester, New York.
[22] Kuncheva LI. 2007. A stability index for feature selection. In Proceedings of
the 25th International Multi-Conference on Artificial Intelligence and
Applications, February 2007, 390-395.
[23] Joshi MV, Kumar V, Agarwal RC. 2001. Evaluating boosting algorithms to
classify rare classes: Comparison and improvements. In Proceedings of the First
IEEE International Conference on Data Mining (ICDM'01).
[24] Liu A, Ghosh J, Martin C. 2007. Generative Oversampling for Mining Imbalanced
Datasets. In Proceedings of the 2007 International Conference on Data Mining, Las
Vegas, Nevada, USA, CSREA Press.
[25] Devroye L. 1986. Non-Uniform Random Variate Generation. Springer-Verlag,
New York.
[26] Bishop CM. 2006. Pattern Recognition and Machine Learning (Information
Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.
[27] Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. 2002. SMOTE: Synthetic Minority
Over-sampling Technique. Journal of Artificial Intelligence Research, 16:321-357.
[28] Wikipedia. Lymph node. [Online] Available at:
http://en.wikipedia.org/wiki/Lymph_node [Accessed: 15 August 2011]
[29] Wikipedia. Neutropenia. [Online] Available at:
http://en.wikipedia.org/wiki/Neutropenia [Accessed: 15 August 2011]