Journal of Analysis and Computation (JAC) (An International Peer Reviewed Journal), www.ijaconline.com, ISSN 0973-2861
Volume XIII, Issue I, January 2019
MRS. G. AMALA 1
ORANGE TOOL APPROACH FOR COMPARATIVE ANALYSIS OF
SUPERVISED LEARNING ALGORITHM IN CLASSIFICATION MINING
MRS. G. AMALA
Assistant Professor of Computer Science,
Nadar Saraswathi College of Arts and Science, Theni.
ABSTRACT
Data mining is the technique of extracting hidden predictive information from large databases. It is a
powerful technology with great potential to help every field focus on the most vital information in its
data warehouse. Data mining automates the prediction of trends and behaviors, and its speed makes it
easy for users to analyze huge amounts of data in less time. Classification is the most commonly used
data mining technique; it uses a set of pre-classified samples to create a model that can classify a
large set of data. This technique helps in deriving important information about data and metadata (data
about data). Classification is carried out by learning algorithms such as Decision Tree (DT), Support
Vector Machines (SVM), Naive Bayes (NB) and Neural Network (NN), and these methods can handle both
numerical and categorical attributes. This study is implemented in the Orange tool and applied to the
Breast Cancer dataset. It describes the performance analysis of the classification algorithms based on
the correctly and incorrectly classified instances. The comparison uses the following parameters:
Precision, Recall, F-Measure, Accuracy and Root Mean Squared Error.
Keywords: Random Forest Algorithm, Data Mining, Classification, Decision Tree, Orange Tool,
Precision, Recall.
1. INTRODUCTION
Data mining is defined as extracting knowledge from huge sets of data. In other words, data
mining is the procedure of mining knowledge from data. The information or knowledge
extracted in this way can be used for any of the following applications:
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
2. DATA MINING TECHNICS
Data mining has many techniques, depending on the task. The most important step in data
mining is to select the correct technique, based on the user's needs and the type of problem.
A wide range of approaches has to be used to improve the accuracy and cost effectiveness of
data mining. There are basically seven main data mining techniques:
Statistics
Clustering
Visualization
Decision Tree
Association Rules
Neural Networks
Classification
2.1 STATISTICS
Statistics is one of the main techniques for data mining. It is based on mathematics and
concerns the collection and description of data. Many analysts, however, do not consider
statistical techniques to be data mining techniques.
2.2 CLUSTERING
Clustering is another data mining technique. Cluster analysis is the process of
identifying data that are similar to each other. Clustering helps in understanding the
differences and similarities between the data. It is sometimes called segmentation.
2.3 VISUALIZATION
Visualization is one of the most important techniques used to discover data patterns.
This technique is used at the beginning of the data mining process.
2.4 DECISION TREE
A decision tree is a predictive model that, as the name suggests, looks like a tree. In this
technique, each branch of the tree is viewed as a classification question, and the leaves of the
tree are considered partitions of the dataset related to that particular classification. The main
uses of this technique are exploratory analysis, data pre-processing and prediction work.
Decision trees are mostly used by statisticians to find out which database is most related to
the business problem.
2.5 NEURAL NETWORK
Neural networks are another important technique in use today. This technique is most often
used in the early stages of data mining technology. Artificial neural networks emerged from
the artificial intelligence community. Neural networks are very easy to use because they are
automated to a certain extent, so the user is not expected to have much knowledge about the
work or the database.
2.6 CLASSIFICATION
Classification is the most commonly used data mining technique. It uses a set of
pre-classified samples to create a model that can classify a large set of data. This technique
helps in deriving important information about data and metadata (data about data). It is
closely related to cluster analysis and uses decision tree or neural network systems. Two main
processes are involved in this technique:
Learning – In this process the data are analyzed by a classification algorithm.
Classification – In this process the data are used to measure the precision of the
classification rules.
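The two processes can be illustrated with a minimal sketch. It assumes a 1-nearest-neighbour rule and a handful of made-up two-feature samples; the function names `learn` and `classify` and all data values are hypothetical and are not taken from this study.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two feature tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def learn(samples):
    """Learning: a 1-nearest-neighbour model simply stores the pre-classified samples."""
    return list(samples)

def classify(model, point):
    """Classification: return the label of the stored sample closest to `point`."""
    nearest = min(model, key=lambda sample: squared_distance(sample[0], point))
    return nearest[1]

# Hypothetical pre-classified samples: ((feature_1, feature_2), label)
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]
# Hypothetical held-out samples with known labels, used to check the learned rules
held_out = [((0.9, 1.1), "A"), ((4.1, 3.9), "B"), ((2.0, 2.0), "B")]

model = learn(training)
predictions = [classify(model, features) for features, _ in held_out]
correct = sum(pred == label for pred, (_, label) in zip(predictions, held_out))
print(predictions, f"{correct} of {len(held_out)} correct")
```

The third held-out sample lies closer to the "A" samples, so it is misclassified; counting such hits and misses is what the classification process measures.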
2.7 SUPERVISED LEARNING ALGORITHMS
Classification techniques can be compared on the basis of predictive accuracy, speed,
robustness, scalability and interpretability. In this study, five supervised learning
algorithms were compared:
Support Vector Machine
Naïve Bayes
Neural Network
K-Nearest Neighbor
Random Forest
3. DATA MINING TOOLS:
Data analysis is a method of performing three major tasks: cleansing, transforming and
modeling data. Various data mining tools are available to perform data visualization,
analysis and extraction. Some of these tools were compared on their parameters and features
in order to decide which one to use for the analysis.
Orange - One of the most useful tools in data mining, suited to visual programming and
exploratory data analysis. It can be scripted in Python. Orange has multiple
Weka - Weka is another data mining tool. It is written in Java and contains
visualization and analysis features.
R Tool - R is an open-source data mining tool, mainly used for statistical computing.
RapidMiner Tool - RapidMiner is mainly used in a client/server model. It performs
extraction and transformation operations.
KNIME Tool - KNIME is an open-source data mining tool.
DataMelt - DataMelt is a multiplatform framework for scientific computation, written
in Java. It is an open-source data mining tool.
FEATURES / PARAMETERS                               ORANGE  WEKA  R   RAPIDMINER  KNIME  DATAMELT
Open Source                                         1       1     1   1           1      1
Data Visualization and Analysis                     1       1     0   1           0      1
Interaction and Data Analysis                       1       1     0   1           1      1
Large Toolbox                                       1       0     0   1           1      0
Scripting Interface                                 1       1     1   0           1      1
Platform Independence                               1       1     1   1           0      0
Covering Methods                                    0       1     0   1           0      1
Parameter-Optimized Learning / Statistical Methods  0       0     0   1           0      1
Total                                               06      05    03  05          04     04
Table 1
3.1 TECHNOLOGY USED - ORANGE TOOL:
Orange is an open-source data visualization and analysis tool for data mining. It is based
on visual programming and the Python language. It has components for machine learning and
add-ons for bioinformatics and text mining. Orange consists of a canvas interface onto which
the user places widgets to create a data analysis workflow. The widgets offer basic
functionality such as reading data, showing a table, selecting features, training predictors,
comparing learning algorithms and visualizing data elements. Orange provides 99% of an
advanced analytical solution through a template-based framework that speeds delivery and
reduces errors by nearly eliminating the need to write code.
3.2 DATA COLLECTION
The dataset for this study, “Breast Cancer”, was taken from the built-in datasets of the
Orange tool. The dataset has already been preprocessed and was found to contain few missing
values and little noisy data. The results obtained are therefore expected to be more accurate,
and the classifiers to perform more efficiently. In total, 683 instances and 9 features were
used for this study. The dataset was analyzed and statistical calculations were carried out on
it to classify instances into the survived and non-survived categories.
3.3 PERFORMANCE ANALYSIS FOR SUPERVISED LEARNING ALGORITHMS
The following measurements are used to determine the best algorithm for the given dataset.
The goal is to find the classification algorithm that gives the optimum solution for the
“Breast Cancer” dataset. The dataset details are shown below:
Breast Cancer Dataset
Attributes 9
Instances 683
Total Value of Dataset 683
Table 2
3.4 CONFUSION MATRIX:
A confusion matrix is a table that can be used to measure the performance of a supervised
machine learning algorithm. Each row of the matrix represents the instances of an actual
class, and each column represents the instances of a predicted class.
The following example confusion matrix uses a two-class case, Negative and Positive.
                          PREDICTED
ACTUAL      Negative               Positive
Negative    TN (True Negative)     FP (False Positive)
Positive    FN (False Negative)    TP (True Positive)
Table 3
Accuracy:
AC = (TN + TP) / (TN + FP + FN + TP)
Accuracy is not always an adequate performance measure. Suppose we have 1000 samples, of
which 995 are negative and 5 are positive. Suppose further that we have a classifier that
labels everything it is presented with as negative. Its accuracy will be a surprising 99.5%,
even though the classifier cannot recognize a single positive sample.
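The arithmetic behind this example can be checked with a short sketch. The variable names are illustrative; the counts are the ones stated in the paragraph above.

```python
# 995 negative and 5 positive samples, and a classifier that
# labels every sample as negative.
tn, fp = 995, 0   # every negative sample is (trivially) classified correctly
fn, tp = 5, 0     # every positive sample is missed

accuracy = (tn + tp) / (tn + fp + fn + tp)
positives_found = tp / (fn + tp)   # share of positive samples that were recognised

print(f"accuracy        = {accuracy:.1%}")
print(f"positives found = {positives_found:.1%}")
```

The accuracy is high only because the negative class dominates the sample; the share of positives found is zero.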
Recall:
True Positive Rate: Recall = TP / (FN + TP)
True Negative Rate: TNR = TN / (TN + FP)
Precision:
Precision = TP / (FP + TP)
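As a minimal sketch, the accuracy, recall and precision measures can be computed from the four confusion-matrix counts as follows. The function name `classification_metrics` and the example counts are hypothetical. The F-measure is included because it appears in the results; it is computed here with the standard harmonic-mean definition, which this paper does not state explicitly.

```python
def classification_metrics(tn, fp, fn, tp):
    """Return accuracy, recall, precision and F-measure from confusion-matrix counts."""
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    recall = tp / (fn + tp)
    precision = tp / (fp + tp)
    # Assumption: F-measure as the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f_measure": f_measure,
    }

# Hypothetical counts, chosen only for illustration.
metrics = classification_metrics(tn=50, fp=10, fn=5, tp=35)
for name, value in metrics.items():
    print(f"{name:10s} {value:.4f}")
```

The sketch does not guard against empty denominators (for example, a classifier that never predicts the positive class), which a fuller version would need to handle.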
4. EXPERIMENTAL WORK AND ANALYSIS
The repository data contain 9 attributes and 683 instances. The Orange tool was applied to
the “Breast Cancer” dataset, using cross-validation to evaluate the performance of the
different supervised learning algorithms.
Test Learners
Validation method
Method: Cross-validation
Folds: 8
Target class: 2
Data
Examples: 683
Attributes: 9 (Clump, Unif_Cell_Size, Unif_Cell_Shape, Marginal_Adh, Single_Cell_Size,
Bare_Nuclei, Bland_Chromatine, Normal_Nucleoi, Mitoses)
Class: y
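To show what 8-fold cross-validation on 683 examples means in practice, the following sketch splits the example indices into folds. It is a plain, unshuffled split written for illustration; the function name `k_fold_indices` is hypothetical, and Orange's own sampling (which may shuffle or stratify) is not reproduced here.

```python
def k_fold_indices(n_examples, n_folds):
    """Yield (train_indices, test_indices) pairs for plain k-fold cross-validation."""
    indices = list(range(n_examples))
    base, extra = divmod(n_examples, n_folds)
    start = 0
    for fold in range(n_folds):
        # Spread the remainder over the first `extra` folds.
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(683, 8))
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)
```

Each example is used for testing exactly once and for training in the other seven folds, so every learner is scored on all 683 examples.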
Results
Algorithm         Classification Accuracy   Area Under ROC Curve   F-Measure   Precision   Recall
Random Forest     0.8859                    0.9860                 0.9179      0.8617      0.9820
Naive Bayes       0.9663                    0.9741                 0.9737      0.9907      0.9572
kNN               0.9590                    0.9843                 0.9687      0.9622      0.9752
Neural Network    0.9458                    0.9816                 0.9579      0.9678      0.9482
SVM               0.9693                    0.9918                 0.9762      0.9818      0.9707
Table 4
CONFUSION MATRIX: DONE BY NUMBER OF EXAMPLES
                                   Predicted
Classifier        Actual           Survived   Not Survived
Random Forest     Survived         435        9
                  Not Survived     61         178
Naive Bayes       Survived         425        19
                  Not Survived     1          238
KNN               Survived         434        10
                  Not Survived     17         222
Neural Network    Survived         422        22
                  Not Survived     13         226
SVM               Survived         431        13
                  Not Survived     7          232
Table 5 Classifiers for Confusion Matrix: Done by Number of Examples
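The counts in Table 5 can be summarised per classifier with a short sketch. It assumes that rows are actual classes and columns are predicted classes, and that "Survived" is treated as the positive class. The figures are computed from the pooled counts, so they need not match Table 4 exactly if the tool averages its scores differently.

```python
# Counts per classifier, in the order:
# (actual Survived -> predicted Survived, actual Survived -> predicted Not Survived,
#  actual Not Survived -> predicted Survived, actual Not Survived -> predicted Not Survived)
confusion = {
    "Random Forest":  (435, 9, 61, 178),
    "Naive Bayes":    (425, 19, 1, 238),
    "kNN":            (434, 10, 17, 222),
    "Neural Network": (422, 22, 13, 226),
    "SVM":            (431, 13, 7, 232),
}

summary = {}
for name, (ss, sn, ns, nn) in confusion.items():
    total = ss + sn + ns + nn
    summary[name] = {
        "total": total,
        "correct": ss + nn,                  # diagonal of the confusion matrix
        "accuracy": (ss + nn) / total,
        "recall_survived": ss / (ss + sn),   # assumes "Survived" is the positive class
    }

for name, row in summary.items():
    print(f"{name:15s} total={row['total']} correct={row['correct']} "
          f"accuracy={row['accuracy']:.4f} recall={row['recall_survived']:.4f}")
```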
Chart 1
Figure 1: ROC Analysis
Figure 2: Calibration Plot
Table 5 shows the confusion matrices for the five algorithms, mapping the actual values against
the predicted values for each algorithm.
Chart 1 gives a pictorial comparison of the accuracy of the algorithms. SVM had the highest
accuracy of all.
[Bar chart: Classification Accuracy for Random Forest, Naive Bayes, kNN, Neural Network and SVM; vertical axis from 0.84 to 0.98]
ALGORITHM         CLASSIFICATION ACCURACY   ERROR RATE %
RANDOM FOREST     0.8859                    11.41
NAIVE BAYES       0.9663                    3.37
KNN               0.959                     4.1
NEURAL NETWORK    0.9458                    5.42
SVM               0.9693                    3.07
Table 6: Error Rate Calculation (Misclassification Error Rate = 1 - Accuracy)
Table 6 gives the error rate calculated with the error rate formula. The results show that
Support Vector Machine has the minimum error rate and Random Forest has the highest.
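The error rates in Table 6 follow directly from the reported accuracies. A minimal sketch of the calculation, using the accuracy values from the table (variable names are illustrative):

```python
# Classification accuracy per algorithm, as reported in Table 6.
accuracy = {
    "Random Forest": 0.8859,
    "Naive Bayes": 0.9663,
    "kNN": 0.959,
    "Neural Network": 0.9458,
    "SVM": 0.9693,
}

# Misclassification error rate = 1 - accuracy, expressed as a percentage.
error_rate_percent = {name: round((1 - acc) * 100, 2) for name, acc in accuracy.items()}

lowest = min(error_rate_percent, key=error_rate_percent.get)
highest = max(error_rate_percent, key=error_rate_percent.get)

for name, rate in error_rate_percent.items():
    print(f"{name:15s} {rate:.2f}%")
print("lowest error rate: ", lowest)
print("highest error rate:", highest)
```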
5. CONCLUSION
This study analyzed and examined the Random Forest, Naïve Bayes, KNN, Neural Network and
Support Vector Machine methods using 683 samples from the Breast Cancer dataset. The
observations were noted and discussed. The discussion found that the supervised method
Support Vector Machine had the maximum accuracy and the minimum error rate. It was also
noticed that the running time of Random Forest is high. When compared with the other
algorithms, the precision, recall and F-measure values were elevated. The accuracy measures
of the classifiers and the performance measures of classification make the results easier to
understand. The results were classified into two categories, survived and non-survived.