orange tool approach for comparative analysis of ... · 2/7/2019 · rapid miner tool - rapid...

Journal of Analysis and Computation (JAC) (An International Peer Reviewed Journal), www.ijaconline.com, ISSN 0973-2861

Volume XIII, Issue I, January 2019

MRS. G. AMALA 1

ORANGE TOOL APPROACH FOR COMPARATIVE ANALYSIS OF SUPERVISED LEARNING ALGORITHM IN CLASSIFICATION MINING

MRS. G. AMALA

Assistant Professor of Computer Science,

Nadar Saraswathi College of Arts and Science, Theni.

ABSTRACT

Data mining is the technique of extracting hidden predictive information from large databases. It is a powerful technology with great potential to help every field focus on the most vital information in its data warehouse. Data mining automates the prediction of trends and behaviors, and its speed lets users analyze huge amounts of data in less time. Classification is the most commonly used data mining technique; it uses a set of pre-classified samples to create a model that can classify a large set of data. This technique helps in deriving important information about data and metadata (data about data). Classification is carried out by learning algorithms such as Decision Tree (DT), Support Vector Machine (SVM), Naive Bayes (NB) and Neural Network (NN), and these methods can handle both numerical and categorical attributes. This study is implemented in the Orange tool and applied to the Breast Cancer dataset. It describes a performance analysis of classification algorithms based on the correctly and incorrectly classified instances. The comparison uses parameters such as Precision, Recall, F-Measure, Accuracy and Root Mean Squared Error.

Keywords: Random Forest Algorithm, Data mining, Classification, Decision tree, Orange Tool, Precision, Recall.

1. INTRODUCTION

Data mining is defined as extracting information from huge sets of data. In other words, data mining is the procedure of mining knowledge from data. The information or knowledge so extracted can be used for any of the following applications:

Market Analysis

Fraud Detection

Customer Retention

Production Control


Science Exploration

2. DATA MINING TECHNIQUES

Data mining offers many techniques, depending on the task. The most important step in data mining is to choose the correct technique based on the user's needs and the type of problem. A wide range of approaches has to be considered to balance the accuracy and cost effectiveness of the data mining techniques. There are basically seven main data mining techniques:

Statistics

Clustering

Visualization

Decision Tree

Association Rules

Neural Networks

Classification

2.1 STATISTICS

Statistics is one of the main techniques for data mining. It is based on mathematics and is concerned with the collection and description of data. Many analysts, however, do not consider statistical techniques to be data mining techniques.

2.2 CLUSTERING

Clustering is one of the data mining techniques. Cluster analysis is the process of identifying data that are similar to each other. It helps in understanding the differences and similarities between the data. This is sometimes called segmentation.

2.3 VISUALIZATION

Visualization is one of the most important techniques used to discover data patterns. This technique is used at the beginning of the data mining process.

2.4 DECISION TREE

A decision tree is a predictive model that, as its name suggests, looks like a tree. In this technique, each branch of the tree is viewed as a classification question, and the leaves of the tree are considered partitions of the dataset related to that particular classification. The technique is used for exploratory analysis, data pre-processing and prediction. Decision trees are mostly used by statisticians to find out which database is most related to the business problem.

2.5 NEURAL NETWORK

Neural networks are another important technique in use today. This technique is most often used in the early stages of data mining. Artificial neural networks grew out of the artificial intelligence community. Neural networks are very easy to use because they are automated to a certain extent, so the user is not expected to have much knowledge about the work or the database.

2.6 CLASSIFICATION

Classification is the most commonly used data mining technique. It uses a set of pre-classified samples to create a model that can classify a large set of data. This technique helps in deriving important information about data and metadata (data about data). It is closely related to cluster analysis and uses decision tree or neural network systems. There are two main processes involved in this technique:

Learning - In this process the training data are analyzed by the classification algorithm.

Classification - In this process test data are used to measure the precision of the classification rules.
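The two processes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy samples, the labels "A" and "B", and the choice of a 1-nearest-neighbour rule are hypothetical and are not taken from the study.

```python
import math

def learn(training_samples):
    """Learning phase: for 1-nearest-neighbour the 'model' is just the stored samples."""
    return list(training_samples)

def classify(model, features):
    """Classification phase: return the label of the closest stored sample."""
    nearest = min(model, key=lambda sample: math.dist(sample[0], features))
    return nearest[1]

# Hypothetical pre-classified samples: (feature vector, class label).
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((4.8, 5.3), "B")]
# Hypothetical held-out samples used to measure how well the learned rules classify.
held_out = [((0.9, 1.1), "A"), ((5.1, 4.9), "B"), ((3.2, 3.2), "B")]

model = learn(training)
predictions = [classify(model, features) for features, _ in held_out]
correct = sum(pred == label for pred, (_, label) in zip(predictions, held_out))
print(predictions, correct / len(held_out))
```

Keeping the held-out samples separate from the training samples is what makes the measured share of correct predictions meaningful.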

2.7 SUPERVISED LEARNING ALGORITHMS

Classification techniques can be compared on the basis of predictive accuracy, speed, robustness, scalability and interpretability. In this study, five supervised learning algorithms were compared:

Support Vector Machine

Naïve Bayes

Neural Network

K – Nearest Neighbor





Random Forest

3. DATA MINING TOOLS:

Data analysis is a method of performing three major tasks: cleansing, transforming and modeling data. Various data mining tools are available to perform data visualization, analysis and extraction. Some of these tools were compared on a set of parameters and features in order to decide which one to use for the analysis.

Orange - One of the most useful data mining tools, suited to visual programming and exploratory data analysis. It can be scripted in Python. Orange has multiple

Weka - Weka is another data mining tool. It is written in Java and provides visualization and analysis.

R Tool - R is also a data mining tool and it is open source. It is mainly used for statistical computing.

Rapid Miner tool - Rapid Miner is mainly used in a client/server model. It performs extraction and transformation operations.

Knime Tool - Knime is an open source data mining tool.

Data Melt - Data Melt is a multiplatform framework for scientific computation, written in Java. It is an open source data mining tool.

FEATURES / PARAMETERS FOR DATA MINING TOOL            ORANGE   WEKA   R    RAPID MINER   KNIME   DATA MELT
Open Source                                           1        1      1    1             1       1
Data Visualization and Analysis                       1        1      0    1             0       1
Interaction and Data Analysis                         1        1      0    1             1       1
Large Toolbox                                         1        0      0    1             1       0
Scripting Interface                                   1        1      1    0             1       1
Platform Independence                                 1        1      1    1             0       0
Covering Methods                                      0        1      0    1             0       1
Parameter-Optimized Learning / Statistical Methods    0        0      0    1             0       1
Total                                                 06       05     03   05            04      04

TABLE 1

3.1 TECHNOLOGY USED - ORANGE TOOL:

Orange is an open source data visualization and analysis tool for data mining. It is based on visual programming and the Python language. It has components for machine learning and add-ons for bioinformatics and text mining. Orange consists of a canvas interface onto which the user places widgets to create a data analysis workflow. The widgets offer basic functionality such as reading data, showing a data table, selecting features, training predictors, comparing learning algorithms and visualizing data elements. Orange provides 99% of an advanced analytical solution through a template-based framework that speeds delivery and reduces errors by nearly eliminating the need to write code.

3.2 DATA COLLECTION

The dataset for this study, “Breast Cancer”, was taken from the built-in datasets of the Orange tool. The dataset has already been preprocessed and contains few missing values and little noisy data. The results obtained should therefore be more accurate, and the classifiers should perform more efficiently. In total, 683 instances and 9 features were used for this study. The dataset was analyzed and statistical calculations were carried out on it to classify the instances into the survived and non-survived categories.

3.3 PERFORMANCE ANALYSIS FOR SUPERVISED LEARNING ALGORITHMS

The following measurements are used to identify the best algorithm for the given dataset. The goal is to find the classification algorithm that gives the optimum solution for the “Breast Cancer” dataset. The dataset details are shown below:

Breast Cancer Dataset

Attributes 9

Instances 683

Total Value of Dataset 683

Table 2

3.4 CONFUSION MATRIX:





A confusion matrix is a table that can be used to measure the performance of a supervised machine learning algorithm. Each row of the matrix represents the instances of an actual class, and each column represents the instances of a predicted class.

The following example confusion matrix is for a two-class case, Negative and Positive.

                       PREDICTED
ACTUAL         Negative               Positive
Negative       TN (True Negative)     FP (False Positive)
Positive       FN (False Negative)    TP (True Positive)

Table 3

Accuracy:

$$AC = \frac{TN + TP}{TN + FP + FN + TP}$$

Accuracy is not always an adequate performance measure. Assume we have 1000 samples, of which 995 are negative and 5 are positive. Assume further that we have a classifier that labels whatever it is presented with as negative. The accuracy will be a surprising 99.5%, even though the classifier could not recognize any positive samples.

Recall:

True Positive Rate: $$recall = \frac{TP}{FN + TP}$$

True Negative Rate: $$TNR = \frac{TN}{TN + FP}$$

Precision:

$$Precision = \frac{TP}{FP + TP}$$
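The accuracy, recall and precision formulas can be checked against the 1000-sample example given earlier. The sketch below is a minimal illustration. The function names are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here, not something stated in the paper.

```python
def accuracy(tn, fp, fn, tp):
    """Share of all samples that were classified correctly."""
    return (tn + tp) / (tn + fp + fn + tp)

def recall(fn, tp):
    """True positive rate; 0.0 is returned by convention when there are no positives."""
    return tp / (fn + tp) if (fn + tp) else 0.0

def precision(fp, tp):
    """Share of predicted positives that are truly positive; 0.0 if nothing was predicted positive."""
    return tp / (fp + tp) if (fp + tp) else 0.0

# A classifier that labels every one of the 995 negative and 5 positive samples as negative.
tn, fp, fn, tp = 995, 0, 5, 0

acc = accuracy(tn, fp, fn, tp)
rec = recall(fn, tp)
prec = precision(fp, tp)
print(f"accuracy={acc:.3f} recall={rec:.3f} precision={prec:.3f}")
```

The accuracy comes out at 99.5% while recall is zero, which is why accuracy is reported together with recall and precision.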

4. EXPERIMENTAL WORK AND ANALYSIS

The repository data contain 9 attributes and 683 instances. The Orange tool was applied to the “Breast Cancer” dataset, using cross-validation to evaluate the performance of the different supervised learning algorithms.
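To make the cross-validation procedure concrete, the following sketch splits 683 example indices into the 8 folds reported for this experiment. It is an assumption-laden illustration in plain Python: the helper `k_fold_indices` and the seed are hypothetical, and Orange's own sampling (for instance any stratification it applies) is not reproduced here.

```python
import random

def k_fold_indices(n_examples, n_folds, seed=0):
    """Shuffle the example indices and deal them into n_folds nearly equal folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every n_folds-th shuffled index, starting at position i.
    return [indices[i::n_folds] for i in range(n_folds)]

folds = k_fold_indices(683, 8)
fold_sizes = [len(fold) for fold in folds]

split_sizes = []
for test_indices in folds:
    held_out = set(test_indices)
    train_indices = [i for i in range(683) if i not in held_out]
    # A learner would be trained on train_indices and scored on test_indices here.
    split_sizes.append((len(train_indices), len(test_indices)))

print(fold_sizes)
print(split_sizes[0], split_sizes[-1])
```

Each example is used for testing exactly once and for training in the other seven rounds, so every one of the 683 instances contributes to the reported scores.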




Test Learners

Validation method
  Method: Cross-validation
  Folds: 8
  Target class: 2

Data
  Examples: 683
  Attributes: 9 (Clump, Unif_Cell_Size, Unif_Cell_Shape, Marginal_Adh, Single_Cell_Size, Bare_Nuclei, Bland_Chromatine, Normal_Nucleoi, Mitoses)
  Class: y

Results

Classifier        Classification Accuracy   Area Under ROC Curve   F-Measure   Precision   Recall
Random Forest     0.8859                    0.9860                 0.9179      0.8617      0.9820
Naive Bayes       0.9663                    0.9741                 0.9737      0.9907      0.9572
kNN               0.9590                    0.9843                 0.9687      0.9622      0.9752
Neural Network    0.9458                    0.9816                 0.9579      0.9678      0.9482
SVM               0.9693                    0.9918                 0.9762      0.9818      0.9707

Table 4
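The F-Measure column of Table 4 can be related to the Precision and Recall columns. The paper does not state the formula, so the sketch below assumes the standard definition of the F-measure as the harmonic mean of precision and recall, and recomputes it from the reported values.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition, assumed here)."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F-measure) copied from Table 4.
table4 = {
    "Random Forest": (0.8617, 0.9820, 0.9179),
    "Naive Bayes": (0.9907, 0.9572, 0.9737),
    "kNN": (0.9622, 0.9752, 0.9687),
    "Neural Network": (0.9678, 0.9482, 0.9579),
    "SVM": (0.9818, 0.9707, 0.9762),
}

recomputed = {name: f_measure(p, r) for name, (p, r, _) in table4.items()}
for name, (_, _, reported) in table4.items():
    print(f"{name}: recomputed={recomputed[name]:.4f} reported={reported:.4f}")
```

Under that assumption the recomputed values agree with the reported F-Measure column to four decimal places.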

CONFUSION MATRIX: DONE BY NUMBER OF EXAMPLES

Classifier        Actual          Predicted Survived   Predicted Not Survived
Random Forest     Survived        435                  9
                  Not Survived    61                   178
Naive Bayes       Survived        425                  19
                  Not Survived    1                    238
KNN               Survived        434                  10
                  Not Survived    17                   222
Neural Network    Survived        422                  22
                  Not Survived    13                   226
SVM               Survived        431                  13
                  Not Survived    7                    232

Table 5 Classifiers for Confusion Matrix : Done by Number of Examples
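The counts in Table 5 can be turned into summary figures with a short script. This sketch treats "Survived" as the positive class, which is an assumption, and it pools the counts over all examples. The pooled values need not coincide exactly with the scores in Table 4 if those were averaged per fold, which the paper does not specify.

```python
# Counts from Table 5, in the order:
# (actual Survived predicted Survived, actual Survived predicted Not Survived,
#  actual Not Survived predicted Survived, actual Not Survived predicted Not Survived)
table5 = {
    "Random Forest": (435, 9, 61, 178),
    "Naive Bayes": (425, 19, 1, 238),
    "kNN": (434, 10, 17, 222),
    "Neural Network": (422, 22, 13, 226),
    "SVM": (431, 13, 7, 232),
}

summary = {}
for name, (tp, fn, fp, tn) in table5.items():
    total = tp + fn + fp + tn
    summary[name] = {
        "total": total,
        "misclassified": fn + fp,
        "pooled_accuracy": (tp + tn) / total,
        "recall_survived": tp / (tp + fn),
    }

for name, row in summary.items():
    print(name, row["total"], row["misclassified"],
          round(row["pooled_accuracy"], 4), round(row["recall_survived"], 4))
```

Every matrix sums to the 683 examples of the dataset, which is a quick consistency check on the extracted counts.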





Chart 1

Figure 1: ROC Analysis

Figure 2: Calibration Plot

Table 5 shows the confusion matrices for the five algorithms, mapping the actual and predicted values for each algorithm.

Chart 1 gives a pictorial comparison of the accuracy of the algorithms. SVM had the highest accuracy compared with the others.

[Bar chart: Classification Accuracy of Random Forest, Naive Bayes, kNN, Neural Network and SVM; vertical axis from 0.84 to 0.98]

CLASSIFIER        CLASSIFICATION ACCURACY   ERROR RATE %
RANDOM FOREST     0.8859                    11.41
NAIVE BAYES       0.9663                    3.37
KNN               0.9590                    4.10
NEURAL NETWORK    0.9458                    5.42
SVM               0.9693                    3.07

Table 6: Error Rate Calculation- (Misclassification Error Rate =1 – Accuracy)

Table 6 shows the error rate calculated with the error rate formula. The results show that Support Vector Machine has the minimum error rate and Random Forest has the highest error rate.
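The error rates in Table 6 follow directly from the formula Misclassification Error Rate = 1 - Accuracy. A minimal sketch of the calculation, using the accuracy values reported in the paper (the variable names are illustrative):

```python
# Classification accuracy per classifier, as reported in Table 6.
accuracy = {
    "Random Forest": 0.8859,
    "Naive Bayes": 0.9663,
    "kNN": 0.9590,
    "Neural Network": 0.9458,
    "SVM": 0.9693,
}

# Misclassification error rate in percent: (1 - accuracy) * 100, rounded to two decimals.
error_rate_percent = {name: round((1 - acc) * 100, 2) for name, acc in accuracy.items()}

lowest = min(error_rate_percent, key=error_rate_percent.get)
highest = max(error_rate_percent, key=error_rate_percent.get)

for name, rate in error_rate_percent.items():
    print(f"{name}: {rate:.2f}%")
print("lowest:", lowest, "highest:", highest)
```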

5. CONCLUSION

This study analyzed and examined the Random Forest, Naïve Bayes, kNN, Neural Network and Support Vector Machine methods using 683 samples from the Breast Cancer dataset. The observations were noted and discussed. The discussion found that the supervised method Support Vector Machine had the maximum accuracy and the minimum error rate. It was also noticed that the running time of Random Forest is high. When compared with the other algorithms, the precision, recall and F-measure values were elevated. The accuracy measures of the classifiers and the performance measures of classification make the results easy to understand. The results are classified into two categories, survived and non-survived.

