A Study on Classification Algorithms for
Predicting Colon Cancer using
Gene Tissue Parameters
Aditya Tekur1, Prerna Jain2
Department of Information Technology,
SRM Institute of Science and Technology, Chennai, India. 1email: [email protected]
2email: [email protected]
ABSTRACT
Cancer is a class of diseases characterized by uncontrolled cell growth.
Computer-aided diagnosis now helps the medical field detect the
onset of cancer at an earlier stage. This paper presents a comparative study of
numerous classification models for predicting colon cancer. This would help in
identifying whether a person with the given parameters can be classified
as likely to have colon cancer or not. Current research outcomes indicate that
classification of the gene expression data set for colon cancer is
an arduous task. This study uses the gene expression data of 62 samples
of colon epithelial tissue, of which 40 are tumorous and 22 normal. The
Waikato Environment for Knowledge Analysis (WEKA 3.8) tool was used
to classify the dataset with the 10-fold cross validation
technique. Predominant genes, which are highly correlated with colon cancer,
are obtained using feature selection methods with the filter and wrapper
approaches, in order to obtain better classification accuracy. The results indicate
that Naive Bayes is the best predictor, reaching the highest accuracy rate of
93.6%, followed by Logistic, Decision Table and Hoeffding Tree with 90.3%. The
Bagging classifier had the lowest accuracy, 67.7%, among the algorithms
used in this paper.
Keywords: cancer prediction; machine learning; classification; computer aided
diagnosis.
I. INTRODUCTION
Colon cancer is a form of cancer that affects the large intestine. In many
cases, it starts with the development of small, non-cancerous clumps of cells
called adenomatous polyps. Some of these polyps go on to become
International Journal of Pure and Applied Mathematics, Special Issue, Volume 119 No. 18 2018, 2147-2166, ISSN: 1314-3395 (on-line version), url: http://www.acadpubl.eu/hub/
cancerous. These polyps can vary in size and may produce symptoms through
which detection can take place. Inherited gene mutations increase the risk
of this cancer, as they can be passed down through a family, but these
inherited genes do not always result in colon cancer.
Histopathological examination of a tissue specimen is a common method
used to locate as well as classify colon cancer. In an alternate method
for colon cancer detection, pathologists examine the various parameters that
may cause changes in the cell structure. The tissue distribution and changes in
the cell structure help in determining the abnormal region in the specimen, if any
[1].
This method of examining specimens is very tedious for the
histopathologist, besides being very expensive and subjective. Most of the
time it leads to observer variability [3].
Research is ongoing in the field of Computer Aided Diagnosis. The
rise in the use of Machine Learning algorithms has helped considerably in medical
diagnosis. Computers can assist doctors in diagnosing certain types
of diseases, which can help save a patient's life. Early prediction of diseases
like cancer will help the medical field provide better diagnosis and take
preventive measures.
In this work, we apply various classification algorithms to a data set of
gene expression levels in tissue samples to predict
the occurrence of colon cancer. The training and testing
data are formed using the 10-fold cross validation method.
The remainder of the paper is organized as follows. Section II describes the
related work that has already been done in this field. Section III deals with the
methodology. The results achieved are analyzed in Section IV. Section V
concludes the paper.
II. RELATED WORK
Cancer has been one of the deadliest diseases to have affected humanity;
more than one sixth of deaths worldwide are due to cancer. Generally, mutations
change the composition of a gene, which eventually causes the cancerous
growth of cells. If we could identify the gene whose change eventually led to
the cell turning cancerous, we could provide better treatment to cancer
patients. Hence, utilizing the gene expression
profile is a vital step towards integrating the complex genomic information
that is unique and, in many ways, customized for an individual patient.
To deliver a reliable prediction, a suitable approach is required that
can give high classification accuracy, which in turn depends on an efficient
validation strategy. The main procedure of a gene expression data classification
task includes a feature selection stage and a pattern classification stage [22].
Feature selection selects a list of genes which may be informative for the
prediction of tumours. At the prediction stage, the pattern classifier decides
the class to which the gene pattern belongs.
Oligonucleotide arrays give a snapshot of the condition of the
cell by measuring the expression levels of many genes at the same time. It is
important to create strategies for extracting useful information from the
resulting data sets. An efficient two-way clustering method was applied by
Alon et al. [16] to a set of gene expressions in 22 normal and 40 tumour colon
tissues. This uncovered broad coherent patterns that suggest
a high level of organization underlying gene expression in the tissues.
On the basis of the above selected genes, gene sets were summarized by
Zhang [23] using a recursive partitioning tree. A floating search algorithm was
used by Liu Jin Quan [24] to deal with colon cancer gene expression data. A fast
correlation-based filter algorithm was used by Yu and Liu [17], which utilized
the degree of correlation to eliminate redundancy and obtain significant genes.
The SVM-RBF-RFE algorithm, a wrapper selection technique that computes the
weight of each feature, was proposed by Yang Zhang [18]. This method was
able to identify most of the significant genes related to colon cancer. A
hybrid approach combining the filter and wrapper methods was put forward by
Xing et al. [19].
A gene selection method for cancer classification, consisting of a genetic
algorithm and an SVM, was proposed by Shutao et al. [20]. The Wilcoxon rank
sum test was used to filter out redundant genes. A final subset of
highly discriminating genes was obtained by analysing how frequently
each gene appeared in the different subsets. Shen et al. [21] put forward a
combination of particle swarm optimisation (PSO) and SVM. Informative genes
were extracted by applying PSO, and the classifier used was an SVM. In this
process, a t-test was applied to filter the data.
To improve the method of Shutao et al., the implementation of
GA/SVM was modified by combining it with the cross-validation method [22]. The
K-means technique was used by Zhang Ya [25] to extract 22
informative genes. To select the master gene, SVM was used for classification,
which reached a maximum accuracy rate of 86.4%.
III. METHODOLOGY
A. DATA SOURCE:
The Colon Cancer dataset by Alon [16], which is frequently used as a benchmark, has
been chosen to perform the comparative study of the various algorithms. The
dataset consists of 62 samples of colon epithelial cells, of which 40 are
tumorous and 22 normal. These tissue samples were collected from patients
affected by colon cancer. The “tumour” biopsies were extracted from the
tumorous part, whereas the “normal” biopsies were extracted from the healthy
part of the colon.
High density oligonucleotide arrays were used to measure the gene
expression levels in the 62 samples. 2000 of the 6000 genes were selected
based on the confidence in the measured expression levels. The raw data includes
two more files, one with the tissue data and the other with the gene names.
The dataset is available at:
http://genomicspubs.princeton.edu/oncology/affydata/index.html.
B. PREDICTION MODELS USED:
PREDICTION MODEL: Explanation
Bayes Net: Probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph.
Naïve Bayes: A classifier based on Bayes' theorem. It predicts membership probabilities for each class, such as the probability that a given record or data point belongs to a particular class.
Logistic Algorithm: Logistic regression, which uses an equation as its representation, very much like linear regression.
SGD: Stochastic gradient descent for learning various linear models.
Simple Logistic: A classifier for building linear logistic regression models.
SMO: The SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier.
Voted Perceptron: An algorithm for linear classification, which combines Rosenblatt's perceptron algorithm with the leave-one-out method.
IBk: The K-nearest neighbour algorithm, an instance-based method that can be used for classification as well as regression.
K Star: An instance-based classifier; that is, the class of a test instance is based upon the class of those training instances similar to it, as determined by some similarity function.
LWL: Non-parametric; each prediction is made by local functions that use only a subset of the data.
Adaboost M1: A general ensemble method that creates a strong classifier from a number of weak classifiers.
Attribute Selected Classifier: The dimensionality of the training and test data is reduced by attribute selection before the data is passed on to a classifier.
Bagging: Based on bootstrapping, the process of selecting samples from the original sample and using these samples for estimating various statistics or model accuracy.
Classification via Regression: For every class value, a single regression model is constructed.
Random Committee: A class for building an ensemble of classifiers with a randomizable base classifier.
Randomizable Filtered Classifier: A simple variant of the filtered classifier that instantiates the model with the classifier.
Decision Table: A class that builds and uses a simple decision table as its classifier. The minimum number of instances is 1.
JRip: Implements a propositional rule learner, Repeated Incremental Pruning to Produce Error Reduction.
Decision Stump: A machine learning model consisting of a one-level decision tree.
Hoeffding Tree: An incremental, anytime decision tree induction algorithm that is capable of learning from massive data streams, assuming that the distribution generating examples does not change over time.
J48: Generates a pruned or unpruned C4.5 decision tree [11]. A depth-first approach is used for decision tree growth.
LMT: For building 'logistic model trees', which are classification trees with logistic regression functions at the leaves.
Random Forest: Easy-to-use machine learning algorithm which is very flexible and produces great results most of the time, even without proper hyper-parameter tuning.
Random Tree: Supervised classifier; it is an ensemble learning algorithm that generates many individual learners.
REP Tree: Uses the regression tree logic and creates multiple trees in different iterations.
C. IMPLEMENTATION:
Fig 1a: Flow Chart Diagram
This section describes the prediction process, which consists of five main phases:
pre-processing (data cleansing), pre-selection, feature selection, classification
and validation.
a. Pre-processing phase
This phase is also referred to as the data cleansing step. In any data mining
application, this phase is amongst the most important steps. To understand the
dataset and prepare it for mining, exploratory data analysis was performed. The
original data, the I2000 matrix (M × N), consists of gene expression data for
2000 genes (M) over 62 samples (N). This data can be obtained in the arff format from
http://csse.szu.edu.cn/staff/zhuzx/Datasets.html. Of the 2000 genes, 92
were found to be redundant and were filtered out, leaving 1908 genes.
Since the goal of this project is to develop efficient models for the
prediction of colon cancer, a binary dependent variable representing the class of
the tissue, namely “tumour” or “normal”, is created. For each sample, it
indicates whether the sample came from a tumour or a normal biopsy.
b. Pre-selection phase
This phase uses the Info gain attribute evaluator, which evaluates the worth of
each attribute by measuring its information gain with respect to the class [22].
The search method then ranks the attributes by their individual evaluations,
from the highest information gain to the lowest. During this selection process,
the gene features considered most discriminatory are extracted. For
this paper, the top 130 genes are selected for the classification process.
In general, this phase aims at reducing the dimensionality of the dataset.
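The ranking step can be illustrated with a minimal Python sketch. It is not the WEKA implementation: the toy gene names and values are hypothetical, and each numeric feature is split at its median purely for illustration (an assumption; WEKA's evaluator may discretize numeric attributes differently).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def info_gain(values, labels, threshold):
    """Information gain of splitting one numeric feature at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder


def rank_features(columns, labels, top_k):
    """Score each feature at its (lower) median split and return the top_k names."""
    scores = {}
    for name, values in columns.items():
        ordered = sorted(values)
        median = ordered[(len(ordered) - 1) // 2]
        scores[name] = info_gain(values, labels, median)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy data: four samples, two genes.
labels = ["tumour", "tumour", "normal", "normal"]
columns = {
    "gene_a": [9.1, 8.7, 2.0, 1.5],  # separates the two classes
    "gene_b": [5.0, 1.0, 5.1, 0.9],  # carries no class information
}
ranking = rank_features(columns, labels, top_k=2)
```

With the paper's setting, `top_k` would be 130 and `columns` would hold the 1908 retained genes.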
c. Feature selection phase
This phase allows learning algorithms to operate faster and more effectively. In
order to achieve high accuracy, we search for consistent patterns to
select the predominant genes out of the large number of initial gene features. The
objective is to find an optimal relevant subset of attributes
(genes). In addition to improving accuracy, this makes it easier to obtain a
compact representation of the target class.
The classifier subset evaluator (feature selection with the wrapper approach), along with
best first search (BFS), is used to achieve this. We choose the classifier for
which we require the informative genes and perform the BFS.
If there are n attributes initially, the number of possible subsets that
can be formed is 2^n. Trying out all of them would guarantee finding the best
one, but this is infeasible for large n (for the 130 pre-selected genes,
2^130 is about 1.4 × 10^39), so a heuristic search is used instead.
The best first search starts with an empty subset and generates all
single-attribute expansions. The highest evaluated subset is chosen and is
expanded in the same manner. If expanding a subset leads to no
improvement, the search backtracks to the next best unexpanded subset and
continues from there. This way, the search is not tied to a single greedy path,
and the best subset found is returned when the search terminates [26].
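The search described above can be sketched in Python as follows. This is an illustrative sketch rather than WEKA's implementation: `evaluate` stands for whatever subset score the wrapper uses (for example cross-validated accuracy of the chosen classifier), and the `max_stale` stopping rule and the toy scoring function are assumptions introduced for this example.

```python
import heapq


def best_first_search(n_features, evaluate, max_stale=5):
    """Forward best-first search over feature subsets.

    evaluate(subset) returns a score where higher is better.
    The search stops after max_stale consecutive expansions that do not
    improve on the best score seen so far (a hypothetical stopping rule).
    """
    start = frozenset()
    best_subset, best_score = start, evaluate(start)
    # heapq is a min-heap, so scores are negated; the sorted tuple breaks ties.
    open_list = [(-best_score, (), start)]
    visited = {start}
    stale = 0
    while open_list and stale < max_stale:
        _, _, subset = heapq.heappop(open_list)  # best unexpanded subset
        improved = False
        for feature in range(n_features):
            if feature in subset:
                continue
            child = subset | {feature}
            if child in visited:
                continue
            visited.add(child)
            score = evaluate(child)
            heapq.heappush(open_list, (-score, tuple(sorted(child)), child))
            if score > best_score:
                best_subset, best_score = child, score
                improved = True
        stale = 0 if improved else stale + 1
    return best_subset, best_score


def toy_score(subset):
    """Hypothetical evaluator: features 1 and 3 help, any other feature hurts."""
    useful = {1, 3}
    return len(subset & useful) - 0.1 * len(subset - useful)


selected, selected_score = best_first_search(5, toy_score)
```

Because every unexpanded subset stays on the open list, popping the next entry after a non-improving expansion is what implements the backtracking step.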
The above procedure selects 5-8 genes, varying from classifier to classifier.
The reduced data can be saved in the arff format, which is used for the
subsequent classification process.
d. Classification phase
Classification, which is considered an instance of supervised learning,
involves identifying the class to which a new observation belongs. Our objective is to
classify the samples of the colon cancer dataset as tumorous or normal using various
classifiers such as Support vector machine (SVM), Random Forest, BayesNet etc.
The reduced data obtained in the feature selection phase, containing the
informative genes corresponding to a particular algorithm, is loaded into the
WEKA tool and passed to the classifier. In the training phase, the
learning algorithm finds patterns in the input that map the data attributes to
the target class, and outputs an ML model which captures these patterns. This
model is tested using the cross validation technique explained in the following
section.
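To make the training and prediction steps concrete, here is a minimal Gaussian Naive Bayes sketch in Python. It is only an illustration of how a learning algorithm turns labelled expression values into a model; it is not the WEKA NaiveBayes code, and the four toy samples are hypothetical.

```python
import math
from collections import defaultdict


def train_gaussian_nb(rows, labels):
    """Estimate a prior and per-feature (mean, variance) for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((x - mean) ** 2 for x in column) / len(column)
            stats.append((mean, max(var, 1e-9)))  # floor avoids division by zero
        model[label] = (len(members) / len(rows), stats)
    return model


def predict(model, row):
    """Return the class with the highest log-posterior for one sample."""
    def log_posterior(label):
        prior, stats = model[label]
        score = math.log(prior)
        for x, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


# Hypothetical toy data: two selected genes per sample.
rows = [[9.0, 1.0], [8.0, 1.5], [2.0, 6.0], [1.0, 7.0]]
labels = ["tumour", "tumour", "normal", "normal"]
model = train_gaussian_nb(rows, labels)
prediction = predict(model, [8.5, 1.2])
```

In the paper's setting, each row would hold the 5-8 genes selected for the classifier in question.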
e. Validation Phase
Researchers tend to use k-fold cross-validation to reduce the bias
associated with the random sampling of the training and hold-out
data samples when comparing the predictive accuracy of two or more methods.
The predictive models are evaluated by splitting the original samples into
training and testing data sets. The samples are divided
into k subsets (folds), each containing approximately the same number of
samples. The classification model is trained and tested k times. Each time, the
model is trained on k-1 folds and tested on the
remaining fold. This way, each fold gets a chance to act as the
validation set. Since empirical studies have found 10 to be a good value for
k, we use 10-fold cross validation.
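The fold construction can be sketched in Python as below. This is a plain, unstratified split written for illustration; WEKA's own procedure may differ (for instance in how folds are stratified and how per-fold results are combined), and `train_and_score` is a hypothetical callback supplied by the user.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, train_and_score, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, and average the k scores."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k


# With the 62 samples used here, 10 folds hold either 6 or 7 samples each.
folds = k_fold_indices(62, k=10)
fold_sizes = sorted(len(fold) for fold in folds)
```

Each sample appears in exactly one test fold, so every sample is used for testing once and for training k-1 times.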
IV. RESULT ANALYSIS
After the implementation procedure is carried out for the selected classification
algorithms, each algorithm's results are compared with those of the other
algorithms in the same class of classifier, and the algorithm with the highest
accuracy is selected from each class. The selected best algorithms are then
compared to identify the best algorithm for the detection of colon cancer.
Before the results, a brief explanation of the confusion matrix, accuracy and
sensitivity is given.
Confusion Matrix: It summarizes the predictions of a classifier against the
actual labels. A confusion matrix can be of any size, depending upon the number
of class labels. With two labels, the confusion matrix in our case is a 2x2 matrix.
TP FN
FP TN
TP-True Positive, FN-False Negative, FP-False Positive, TN-True
Negative
TP and TN denote the number of instances which have been correctly classified
as tumorous and normal respectively. FP and FN signify the number of instances
which have been wrongly classified as tumorous and normal respectively.
Accuracy: The accuracy can be calculated with the help of the formula given
below.
$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$
Sensitivity: The sensitivity can be calculated as follows.
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
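As a worked example, the short Python sketch below applies the two formulas to the NaiveBayes confusion matrix reported in Table 1a, read in the TP FN / FP TN layout with tumour as the positive class. The second function, which computes class-weighted precision and recall, is an addition made for comparison only and is not part of the paper's definitions.

```python
def accuracy_and_sensitivity(tp, fn, fp, tn):
    """Apply the two formulas above to one 2x2 confusion matrix."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    sensitivity = tp / (tp + fn)
    return accuracy, sensitivity


def weighted_precision_recall(tp, fn, fp, tn):
    """Class-weighted precision and recall (weights are the actual class sizes).

    Assumes each class is predicted at least once; otherwise a per-class
    precision would divide by zero.
    """
    n_pos, n_neg = tp + fn, fp + tn
    total = n_pos + n_neg
    precision = (n_pos * tp / (tp + fp) + n_neg * tn / (tn + fn)) / total
    recall = (n_pos * tp / n_pos + n_neg * tn / n_neg) / total
    return precision, recall


# NaiveBayes matrix from Table 1a: 39 1 / 3 19.
acc, sens = accuracy_and_sensitivity(tp=39, fn=1, fp=3, tn=19)
w_prec, w_rec = weighted_precision_recall(tp=39, fn=1, fp=3, tn=19)
```

For this matrix the formulas give an accuracy of about 0.935 and a sensitivity of 0.975, while the class-weighted precision and recall come out at about 0.936 and 0.935. The latter pair coincides with the Accuracy and Sensitivity columns of Table 1a, which suggests, although the paper does not state it, that the tabulated values may be class-weighted averages of the kind WEKA reports.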
The categories of classifiers that we have selected from the WEKA tool are
Bayes, Functions, Lazy, Meta, Rules and Trees. In the confusion matrices below:
T- Denotes Tumour
N- Denotes Non-Tumour
BAYES
Table 1a: Table of Results of Bayes Classifier
Name of Algorithm   Confusion Matrix   Correctly Classified   Accuracy   Sensitivity
                    T    N             Instances
BayesNet            38   2             90.3226 %              0.903      0.903
                    4    18
NaiveBayes          39   1             93.5484 %              0.936      0.935
                    3    19
From the results above, the Naive Bayes algorithm has higher values of
Accuracy and Sensitivity than BayesNet. NaïveBayes classifies about 93.5% of
the instances correctly, compared with about 90.3% for the BayesNet
algorithm. Hence it can be concluded that
NaïveBayes is the better algorithm in terms of accuracy.
Fig 1b: Bar Graph for Bayes Classifier
FUNCTIONS
Table 2a: Results for Function Classifier
Name of Algorithm   Confusion Matrix   Correctly Classified   Accuracy   Sensitivity
                    T    N             Instances
Logistic            38   2             90.3226 %              0.903      0.903
                    4    18
SGD                 38   2             90.3226 %              0.903      0.903
                    4    18
Simple Logistic     36   4             85.4839 %              0.854      0.855
                    5    17
SMO                 37   3             87.0968 %              0.870      0.871
                    5    17
Voted Perceptron    33   7             80.6452 %              0.812      0.806
                    5    17
In this class of classifiers, the Logistic and SGD algorithms share the highest
accuracy, 0.903. Voted Perceptron has the lowest Accuracy and Sensitivity
of the 5 algorithms. Logistic and SGD each classify
approximately 90% of the instances correctly, the highest
in this category. As the accuracies of Logistic and
SGD are the same, either could be selected. Logistic is selected.
Fig 2b: Bar Graph for Functions Classifier
LAZY
Table 3a: Table of Results for Lazy Classifier
Name of Algorithm   Confusion Matrix   Correctly Classified   Accuracy   Sensitivity
                    T    N             Instances
IBk                 35   5             80.6452 %              0.804      0.806
                    7    15
K Star              36   4             82.2581 %              0.820      0.823
                    7    15
LWL                 37   3             82.2581 %              0.823      0.823
                    8    14
The K Star and LWL algorithms have the same percentage of correctly
classified instances, approximately 82%, and the same value for
Sensitivity. The LWL algorithm is chosen over
K Star because it has a greater value of Accuracy.
Fig 3b: Bar Graph for Lazy Classifiers
META
Table 4a: Table of Result for Meta Classifiers
Name of the Algorithm              Confusion Matrix   Correctly Classified   Accuracy   Sensitivity
                                   T    N             Instances
Adaboost M1                        39   1             88.7097 %              0.893      0.887
                                   6    16
Attribute Selected Classifier      39   1             88.7097 %              0.893      0.887
                                   6    16
Bagging                            30   10            67.7419 %              0.677      0.677
                                   10   12
Classification Via Regression      36   4             83.871 %               0.837      0.839
                                   6    16
Random Committee                   29   11            69.3548 %              0.704      0.694
                                   8    14
Randomizable Filtered Classifier   29   11            69.3548 %              0.704      0.694
                                   8    14
Amongst the above algorithms, AdaBoostM1 and Attribute Selected Classifier
have the highest percentage of instances that have been classified correctly.
These two algorithms also have the highest values of Accuracy and
Sensitivity amongst all the algorithms. Bagging has the least value out of all
the algorithms. AdaBoostM1 is selected.
Fig 4b: Bar Graph for Meta Classifiers
RULES
Table 5a: Table of Result for Rules Classifier
Name of the Algorithm   Confusion Matrix   Correctly Classified   Accuracy   Sensitivity
                        T    N             Instances
Decision Table          38   2             90.3226 %              0.903      0.903
                        4    18
JRip                    36   4             83.871 %               0.837      0.839
                        6    16
From the results table, the Decision Table algorithm has higher
accuracy and sensitivity values than the JRip algorithm. Decision
Table is selected.
Fig 5b: Bar Graph for Rules Classifier
TREES
Table 6a: Table of Results for Trees Classifier
Name of the Algorithm   Confusion Matrix   Correctly Classified   Accuracy   Sensitivity
                        T    N             Instances
Decision Stump          39   1             85.4839 %              0.867      0.855
                        8    14
Hoeffding Tree          38   2             90.3226 %              0.903      0.903
                        4    18
J48                     39   1             88.7097 %              0.893      0.887
                        6    16
LMT                     36   4             85.4839 %              0.854      0.855
                        5    17
Random Forest           29   11            69.3548 %              0.704      0.694
                        8    14
Random Tree             29   11            69.3548 %              0.704      0.694
                        8    14
REP Tree                38   2             77.4194 %              0.786      0.774
                        12   10
From the above table of results, the Hoeffding Tree algorithm has the
highest accuracy and sensitivity of all the tree
algorithms. The J48 algorithm comes second with an accuracy of 0.893.
RandomForest and RandomTree have the lowest accuracy and
sensitivity values. The Hoeffding Tree algorithm is selected.
Fig 6b: Bar Chart for Trees
BEST OF ALL CLASSIFIERS:
Table 7: Table Result of the best of classifiers
Name of the Algorithm   Confusion Matrix   Correctly Classified   Accuracy   Sensitivity
                        T    N             Instances
NaiveBayes              39   1             93.5484 %              0.936      0.935
                        3    19
Logistic                38   2             90.3226 %              0.903      0.903
                        4    18
LWL                     37   3             82.2581 %              0.823      0.823
                        8    14
Adaboost M1             39   1             88.7097 %              0.893      0.887
                        6    16
Decision Table          38   2             90.3226 %              0.903      0.903
                        4    18
Hoeffding Tree          38   2             90.3226 %              0.903      0.903
                        4    18
Among the selected best algorithms, NaiveBayes is the best classifier, as it
has the highest percentage of correctly classified
instances and the highest values for accuracy and sensitivity.
Fig 7: Bar chart for Best Classifiers
V. CONCLUSION AND FUTURE WORK:
This paper studied different existing classification algorithms
for predicting the chance of colon cancer. The prediction is based on
gene expression parameters, which are used to train different classifiers to
classify a sample as tumorous or non-tumorous. The work carried out here
shows that classification algorithms such as Naïve Bayes, Logistic regression and
decision trees provide better accuracy than the other algorithms tested. In
future, this work can be extended by using Neural Networks and Deep Neural
Networks to aim for better accuracy.
REFERENCES
[1]. Madeeha Naiyar, Yousra Asim, Aqsa Shahid “Automated colon cancer
detection using structural and morphological features” 2015.
[2]. Francesco Archetti, Mauro Castelli, Ilaria Giordani, Leonardo Vanneschi
“Classification of colon tumor tissues using genetic programming” 2010.
[3]. G.D Thomas, M.F. Dixon, N.C Smeeton “Observer Variation in the
Histological Grading of Rectal Carcinoma”, Journal of Clinical Pathology,
Vol 36, no 4, pp.385-391, 1983
[4]. C. Demir and B. Yener “Automated cancer Diagnosis based on
Histopathological Images: A systematic survey” 2009.
[5]. Eibe Frank, Yong Wang, Stuart Inglis, Geoffrey Holmes, Ian H. Witten
“Using Model Trees for Classification” 1998.
[6]. http://weka.sourceforge.net/doc.dev/weka/classifiers/rules/DecisionTable.html
[7]. http://weka.sourceforge.net/doc.dev/weka/classifiers/meta/RandomCommittee.html
[8]. https://www.eecs.yorku.ca/tdb/_doc.php/userg/sw/weka/doc/weka/classifiers/rules/JRip.html
[9]. Iba, Wayne; and Langley, Pat (1992); “Induction of One-Level Decision
Trees”, in ML92: Proceedings of the Ninth International Conference on
Machine Learning, Aberdeen, Scotland, 1–3 July 1992, San Francisco, CA:
Morgan Kaufmann, pp. 233–240
[10]. http://weka.sourceforge.net/doc.dev/weka/classifiers/trees/HoeffdingTree.html
[11]. Rausheen Bal, Sangeeta Sharma “Review on Meta Classification
Algorithms using WEKA” 2016
[12]. http://weka.sourceforge.net/doc.dev/weka/classifiers/trees/LMT.html
[13]. Sumner, Marc, Eibe Frank, and Mark Hall (2005) “Speeding up logistic
model tree induction” PKDD. Springer. pp. 675–683.
[14]. Sushil Kumar Rameshpant Kalmegh “Comparative Analysis of WEKA
Data Mining Algorithm Random Forest,Random Tree and LAD Tree for
Classification of Indigenous News Data” 2015.
[15]. Sushil Kumar Kalmegh “Analysis of WEKA data mining Algorithm REP
Tree, Simple Cart and Random Tree for Classification of Indian News”
2015
[16]. Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D. and
Levine, A. J. (1999). “Broad Patterns of Gene Expression Revealed by
Clustering Analysis of Tumor and Normal Colon Tissues Probed by
Oligonucleotide Arrays”.
[17]. Yu, L. and Liu, H. (2004). “Efficient Feature Selection via Analysis of
Relevance and Redundancy”. Journal of Machine Learning Research.
[18]. Quanzhong Liu, Chihau Chen, Yang Zhang, Zhengguo Hu. “Feature
selection for support vector machines with RBF kernel” 2011.
[19]. Xing, E. P., Jordan, M. I. and Karp, R. M. (2001). “Feature Selection for
High-dimensional Genomic Microarray Data”.
[20]. Shutao Li, Xixian Wu, Xiaoyan Hu. “Gene selection using genetic
algorithm and support vectors machines” 2008.
[21]. Shen, Q., Min, W., Kong, S. W. and Xian, B. Y. (2007). “A Combination of
Modified Particle Swarm Optimization Algorithm and Support Vector
Machine for Gene Selection and Tumor Classification”.
[22]. Zuraini Ali Shah, Puteh Saad, Razib M. Othman. “Feature Selection for
Classification of Gene Expression Data”.
[23]. Zhang H, Yu C Y, et al. “Recursive Partitioning for Tumor Classification
with Gene Expression Microarray Data”.
[24]. Liu Jinjin, Lin Yinxin et al. “Informative Genes Selection for Colon
Tumor Based on Gene Expression Profiles”. Journal of
Kunming University of Science and Technology (Science and Technology),
2006.
[25]. Zhang Ya, Rao Nini et al. “A Feature Selection Method for Colon Tumor
Based on Gene Expression Profiles”. Space Medicine and Medical
Engineering, 2008.
[26]. Mark A. Hall, Lloyd A. Smith. “Practical Feature Subset Selection for
Machine Learning”.
[27]. https://www.cs.waikato.ac.nz/~ml/weka/