A Comparative Study of Single-step and Multi-step Data Mining Tools


difficult to extract knowledge from the given dataset. On the other hand, in a multi-step tool the data mining tasks of clustering, classification and visualization are unified, so the tool behaves like a single-step tool and provides the knowledge as output.

The rest of the paper is organized as follows: section 2 describes the data mining tools, section 3 compares the tools, section 4 discusses the results and section 5 draws the conclusion.

2. Data Mining Tools

In this section we discuss the single-step data mining tools, namely ODM and MS SQL Server, and a multi-step data mining tool called UDMTool.

2.1 Oracle Data Mining (ODM)

The architecture of ODM is based on the Cross Industry Standard Process for Data Mining (CRISP-DM) model

which was founded in 1997 and funded by the European Commission. The main idea was to define an industry

standard for data mining [9]. The CRISP-DM process is shown below:

Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment

There are six steps in the CRISP-DM process model. The ODM implements and supports the last three steps of the CRISP-DM model. The main components of the ODM are shown below:

Data Source → Modeling → Evaluation and Deployment

Data mining is an iterative process that continues after a solution is deployed. The lessons learned during the process can trigger new business questions, and any change in the data can require new models. Subsequent data mining processes benefit from the experience of previous ones. The remaining steps are supported by a combination of the ODM and the Oracle database, especially in the context of an Oracle data warehouse; the facilities of the Oracle database can be very useful during data understanding and data preparation. The ODM integrates data mining with the Oracle database and exposes it through the following interfaces: the Java interface, the PL/SQL interface, automated data mining, the data mining SQL functions and the graphical interfaces. The ODM supports export and import of data mining models in native format between Oracle databases or schemas, which provides a way to move models [9][10][13]. The workflow of ODM is illustrated in figure 1.

Figure 1. The Workflow of the ODM

Figure 1 depicts the workflow of the ODM. The data source is the dataset, explore data is a view of the dataset, and the model is one of the data mining models such as clustering, classification, association and feature extraction. These are the required components for mining in the ODM. The next phase is to apply the model to the dataset and finally to store the results in a separate table for further processing. The user can build a model with only two components, the data source and the model; the rest of the components are there to assist the user.

2.2 MS SQL Server

The MS SQL Server also uses the Cross Industry Standard Process for Data Mining (CRISP-DM) model.

Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment

Data mining is a process that involves the interaction of multiple components. In MS SQL Server one can access sources of data in a SQL Server database, or any other data source, to use for training, testing or prediction; define the data mining structures and models by using Business Intelligence Development Studio or Visual Studio 2008; and manage the data mining objects and create predictions and queries by using SQL Server Management Studio. After the solution is complete, it is deployed to an instance of Analysis Services. The main components of the MS SQL Server are shown below:

JOURNAL OF COMPUTING, VOLUME 4, ISSUE 10, OCTOBER 2012, ISSN (Online) 2151-9617

https://sites.google.com/site/journalofcomputing

WWW.JOURNALOFCOMPUTING.ORG 27

2012 Journal of Computing Press, NY, USA, ISSN 2151-9617



Data Source → Data Mining Structure → Data Mining Models → Deployment

In MS SQL Server, data mining can be done quickly and easily on relational data tables or on any other data source that has been defined as an Analysis Services data source view. MS SQL Server 2008 Analysis Services also provides the ability to separate the data into training and testing datasets. A data mining structure is a logical data structure that defines the data domain from which mining models are built. A single mining structure can support multiple mining models that share the same domain. The data mining structure can also be partitioned into a training and a test dataset, and this partitioning can be done automatically when the structure is defined. A data mining model represents a combination of data, a data mining algorithm, and a collection of parameter and filter settings that affect the data used and how the data is processed. The ultimate goal of data mining development is to create a model that can be used by end users [12][14].
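The partitioning of a mining structure into training and test data can be pictured as a simple holdout split. The following is a minimal sketch in Python, not the Analysis Services implementation; the function name `holdout_split`, the 30% test fraction and the fixed seed are assumptions made for illustration, and the 233 cases merely mirror the population size reported in Table 2.

```python
import random


def holdout_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of `rows` and split it into (training, testing) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    test_size = int(round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


# Hypothetical case identifiers standing in for the rows of a dataset.
cases = list(range(233))
training, testing = holdout_split(cases)
print(len(training), len(testing))
```

Every case lands in exactly one of the two lists, so a model trained on `training` is never evaluated on a case it has already seen.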

2.3 The Unified Data Mining Tool (UDMTool)

The Unified Data Mining Tool (UDMTool) is a next-generation solution based on the Unified Data Mining Theory (UDMT), which is a unified way of architecting and building software solutions by integrating different data mining tasks. The foundation of the UDMTool is that knowledge can only be obtained if the data mining processes of clustering, classification and visualization are unified, i.e. the knowledge can be extracted from a given dataset only after it has passed through all of these data mining processes. This is illustrated in equation (1).

$$\text{Knowledge} = \text{Clustering} \cup \text{Classification} \cup \text{Visualization} \qquad (1)$$

It can be written as in equation (2).

$$K = A \cup B \cup C \qquad (2)$$

where $A$ is the clustering, $B$ is the classification, $C$ is the visualization and $K$ is the knowledge.

The architecture of the UDMTool is based on the unified data mining process (UDMP) as illustrated in figure 2.

Figure 2. The Unified Data Mining Process

The first three processes in figure 2 are data gathering, data cleansing and dataset preparation. The next process unifies the clustering, classification and visualization processes of data mining, called the unified data mining processes (UDMP), and is followed by the output, which is the knowledge. The user evaluates and interprets the knowledge according to his/her business rules. The dataset is the only required input; the knowledge is produced as the final output of the UDMP. In contrast to ad-hoc data mining models, the UDMP selects the appropriate data mining algorithms automatically, depending on the nature and the value of the given dataset.

Figure 3 depicts the architecture of the UDMTool.







Figure 3. The Architecture of the UDMTool

The UDMTool is a multiagent system (MAS). The dataset is the required input; there are many types of datasets, such as numeric, categorical, multimedia and text. The first agent takes the dataset and computes the value of the Akaike Information Criterion (AIC), a model selection criterion; the second agent creates the appropriate vertical partitions of the dataset; and the third agent computes the logarithm of the complexities O of the data mining algorithms deployed in the UDMTool. The fourth agent feeds the vertical partitions of the dataset to the UDMP, which is itself a MAS with one agent for clustering, a second for classification and a third for visualization. These agents are cascaded, i.e. the output of the first agent is the input of the second agent and the output of the second agent is the input of the third. The appropriate data mining algorithms for clustering, classification and visualization are selected through the AIC value of the given dataset; the process is completed by an agent which maps the AIC value to the logarithmic value of the complexities O of the data mining algorithms. The function of the UDMTool is demonstrated in figure 4.

Figure 4. The Function of the UDMTool
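The cascade of agents, in which the output of clustering feeds classification and the output of classification feeds visualization, can be sketched as a chain of three functions. This is only an illustration of the data flow, not the UDMTool code: a fixed threshold stands in for the clustering algorithm, a value-range rule stands in for the classifier, a text bar stands in for the 2D graphs, and all function names are hypothetical.

```python
def cluster_agent(values, threshold):
    """Clustering stand-in: split values into cluster 0 (< threshold) and cluster 1."""
    clusters = {}
    for value in values:
        clusters.setdefault(0 if value < threshold else 1, []).append(value)
    return clusters


def classify_agent(clusters):
    """Classification stand-in: attach an if-then rule (value range) to each cluster."""
    classified = {}
    for cid, members in clusters.items():
        rule = f"IF {min(members)} <= x <= {max(members)} THEN cluster {cid}"
        classified[cid] = {"members": members, "rule": rule}
    return classified


def visualize_agent(classified):
    """Visualization stand-in: one text bar per classified cluster."""
    lines = []
    for cid in sorted(classified):
        info = classified[cid]
        lines.append(f"cluster {cid} | {'*' * len(info['members'])} | {info['rule']}")
    return "\n".join(lines)


def udmp(values, threshold):
    """Cascade: each agent consumes only the output of the previous agent."""
    return visualize_agent(classify_agent(cluster_agent(values, threshold)))


print(udmp([1, 2, 9, 10], threshold=5))
```

The only input is the dataset (plus the stand-in threshold); the caller never handles the intermediate clusters or rules, which is the sense in which the three processes are unified.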

A well-prepared dataset is the input of this framework. First, an intelligent agent computes the value of the model selection criterion AIC, which is used to select the appropriate data mining algorithm. A MAS called the UDMP is based on the UDMT. Finally, the knowledge is extracted, which is either accepted or rejected. The relationship between the dataset and the selection criterion is one-to-one, i.e. one dataset has one model selection value, and the relationship between the dataset and the vertical partitions is one-to-many, i.e. more than one partition is created for one dataset. The relationship between the selection criterion and the UDMP is one-to-one, i.e. one model selection value gives one data mining algorithm, and finally the relationship between the vertical partitions and the UDMP is many-to-many, i.e. many partitioned datasets are inputs for the UDMP and only one result is produced as knowledge.
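For reference, the AIC of a fitted model with k parameters and maximized likelihood L is AIC = 2k - 2 ln L. The sketch below computes it for a single numeric column under a Gaussian model and then picks the algorithm whose log complexity is nearest to that value. The text does not spell out how the AIC value is mapped to the complexities, so the nearest-value rule, the base-10 logarithm, the Gaussian model, and the algorithm names and operation counts are all assumptions made for illustration.

```python
import math


def gaussian_log_likelihood(values):
    """Maximized log-likelihood of a normal distribution fitted to `values`."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n  # maximum-likelihood estimate
    if variance == 0:
        raise ValueError("values must not all be identical")
    return -0.5 * n * (math.log(2 * math.pi * variance) + 1)


def aic(log_likelihood, num_params):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * num_params - 2 * log_likelihood


def nearest_algorithm(aic_value, operation_counts):
    """Hypothetical mapping: algorithm whose log10(operation count) is closest to the AIC."""
    return min(
        operation_counts,
        key=lambda name: abs(math.log10(operation_counts[name]) - aic_value),
    )


column = [1.0, 2.0, 3.0, 4.0, 5.0]
# A Gaussian has two free parameters: mean and variance.
score = aic(gaussian_log_likelihood(column), num_params=2)

# Hypothetical operation counts for two unnamed algorithms.
candidates = {"algorithm_a": 1e3, "algorithm_b": 1e20}
print(round(score, 3), nearest_algorithm(score, candidates))
```

Because one dataset yields one AIC value and the rule returns one name, the sketch reproduces the one-to-one relationships described above.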

3. A Comparison of ODM, MS SQL Server and UDMTool

A comparison is drawn between ODM, MS SQL Server and UDMTool in table 1.

Table 1. A Comparison of ODM, MS SQL Server and UDMTool

Columns: ODM | MS SQL Server | UDMTool

Row 1
ODM: It is not a magic wand. The user has to manually select an appropriate data mining algorithm from the available pool and, if the required results are not obtained from the selected algorithm, choose another one. In this suite one algorithm serves one data mining task, e.g. k-means for clustering, but the produced clusters present only the groups of the data; they are not knowledge and do not serve any purpose to the user. In order to extract the features or patterns from the given dataset, one has to combine or unify different algorithms manually, one by one, and only at the end are the desired results obtained.
MS SQL Server: It is not a magic wand. The user has to combine the different data mining algorithms provided by MS on an ad-hoc basis in order to find the solution of the problem. MS SQL Server does not provide any facility which shows that a given combination of algorithms will produce better results for the problem. It provides a facility to view the cluster profiles, which helps the user to select a cluster for further processing.
UDMTool: It is a magic wand. The tool is based on the Unified Data Mining Theory (UDMT). There is no need to select any data mining algorithm; the tool automatically selects suitable algorithms according to the nature of the data and produces the knowledge in the form of 2D graphs. The processes for the extraction of knowledge from the given datasets are unified, which makes it easier for the user to produce the required results.

Row 2
ODM: There is no need to prepare a dataset for mining. It supports already created databases. It also provides a training facility for a dataset.
MS SQL Server: There is no need to prepare a dataset for mining. It supports already created databases. It also provides a training facility for a dataset.
UDMTool: The user has to prepare the dataset in the form of a text or data file. The tool does not support any databases.

Row 3
ODM: The Java implementation interface supports only numeric datasets, and the DBMS_DATA_MINING interface supports categorical and numeric data.
MS SQL Server: The suite of MS algorithms supports numeric and categorical datasets.
UDMTool: The tool supports only numeric datasets because all the programs are implemented in Java.

Row 4
ODM: The user has to set parameters for each algorithm in order to produce useful patterns from the dataset. If no parameter is set, the default values are taken automatically by the algorithm, i.e. the algorithms are not optimized according to the requirements of the given dataset.
MS SQL Server: The user has to set parameters for each algorithm in order to produce useful patterns from the dataset. If no parameter is set, the default values are taken automatically by the algorithm, i.e. the algorithms are not optimized according to the requirements of the given dataset. The number of algorithm parameters in MS SQL Server is greater than in ODM.
UDMTool: The algorithms are optimized in this tool. Therefore, there is no need to set default parameters.

Row 5
ODM: Supports only a limited number of algorithms for each of the data mining tasks like clustering and classification. ODM does not provide visualization of the data; for this purpose the user has to export the results to other visualization tools such as MS Excel.
MS SQL Server: Supports only a limited number of algorithms for each of the data mining tasks like clustering and classification. The results of MS SQL Server can be opened in MS Excel using add-ins, which can be regarded as a separate data visualization facility.
UDMTool: There is no such limit in the tool; the user can add further required algorithms. The tool directly provides the visualization of the dataset, which helps the user to draw conclusions and extract knowledge.

Row 6
ODM: It provides support for model evaluation using BIC, export and import, comparison and cross validation only in the Java implementation interface. Some of the mentioned facilities are not supported by the other implementations of ODM.
MS SQL Server: Testing the accuracy of mining models is performed through the Mining Accuracy Chart, which plots a lift chart showing the performance of different models under different algorithms.
UDMTool: It provides support only for model evaluation and selection using AIC. If the user wants to import/export any result, copy/paste can be used.

Row 7
ODM: ODM implements data mining through Java objects in function settings and algorithm settings.
MS SQL Server: MS SQL Server uses Data Mining Extensions (DMX), which extend SQL commands.
UDMTool: UDMTool implements data mining algorithms through intelligent agents, developed in Java.

Row 8
ODM: A graphical user interface is provided by ODM.
MS SQL Server: An IDE is provided by MS SQL Server. The Mining Model Wizards make it easier for the user to choose among the different data sources and MS algorithms provided, and in this way the system becomes user friendly.
UDMTool: A graphical user interface is provided by UDMTool.

Row 9
ODM: There is no such limit in ODM, but if the user is applying the Java language then there may be some constraints.
MS SQL Server: There is no such limit in MS SQL Server.
UDMTool: The UDMTool supports: number of parameters = 23, number of attributes = 211, sample size = 12000.

It is evident from table 1 that in ODM and MS SQL Server the selection of algorithms is made on an ad-hoc basis. Although both data mining suites provide statistical information about the dataset, this information is not sufficient to extract the knowledge from the given dataset. The data mining processes of clustering, classification and visualization are carried out individually in ODM and MS SQL Server and there is no relation between these processes; therefore, it is difficult to extract the knowledge. On the other hand, the proposed UDMTool unifies all the required data mining processes to extract the knowledge, and the selection of the data mining algorithm(s) in each process is made through the value of the model selection criterion AIC and the complexities O of the data mining algorithm(s).

4. Results and Discussion

MS SQL Server, ODM and the UDMTool are tested on a variety of datasets: Diabetes, a medical dataset; Breast Cancer, a medical dataset; Iris, an agriculture dataset; Sales, an accounts dataset; and Cars, a vehicle dataset. We present the results for Breast Cancer. The attributes of the Breast Cancer dataset are: Clump Thickness (CT), Uniformity of Cell Size (UCS), Uniformity of Cell Shape (UCSh), Marginal Adhesion (MAdh), Single Epithelial Cell Size (SECS), Bare Nuclei (BNu), Bland Chromatin (BCh), Normal Nucleoli (NNu), Mitoses and Class (benign, malignant) [19].

Case 1: The Results of MS SQL Server

1. The Result of MS Clustering Algorithm

Figure 5. The Diagram of the Clusters of the Breastcancer dataset

We apply the MS clustering data mining algorithm, which is similar to the k-means clustering algorithm. Figure 5 shows the 10 clusters of the given dataset without the predictable variable. The solid lines show a strong relation between clusters and the thin lines show a weak relation. As is evident from figure 5, there is a strong relation between cluster 1 and clusters 7 and 3 and the other clusters. On the other hand, there is a weak relation between clusters 1 and 10, clusters 2 and 9, clusters 2 and 6, and clusters 5 and 6. From figure 5 one can only visualize the structure of the clusters and their relations; it is still difficult to produce useful information. The population, i.e. the number of records in each cluster, becomes visible when the cursor is placed on the cluster.







The MS clustering algorithm produces 10 clusters by default. If the user wants to choose a different number, this can only be done by programming the MS clustering algorithm; the wizards offer no option to select the number of clusters. Why the algorithm produces 10 clusters for every dataset is an open issue with the MS clustering algorithm. The algorithm uses either horizontal or vertical partitioning. All clustering data mining algorithms are unsupervised machine learning algorithms; therefore, there is no need to specify the predicted or target variable in the dataset. The next tables show the extra features available in MS SQL Server 2005.
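To make the role of the cluster count concrete, the following is a minimal one-dimensional sketch of the generic k-means (Lloyd) procedure. It is not the MS clustering implementation; the function name, the sample values and the starting centroids are assumptions for illustration. The number of clusters is simply the number of starting centroids, and no target variable is supplied.

```python
def k_means_1d(values, initial_centroids, max_iterations=100):
    """Generic Lloyd iteration on scalar values; k = len(initial_centroids)."""
    centroids = list(initial_centroids)
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: each value joins its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            for v in values
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [v for v, a in zip(values, assignments) if a == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # converged: the centroids no longer move
        centroids = new_centroids
    return centroids, assignments


# Hypothetical attribute values with two visible groups.
values = [1.0, 2.0, 1.5, 9.0, 10.0, 8.5]
centroids, assignments = k_means_1d(values, initial_centroids=[0.0, 5.0])
print(centroids, assignments)
```

Passing ten starting centroids instead of two would force ten clusters regardless of how many groups the data actually contains, which is the concern raised above about a fixed default.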

Table 2. Clusters Profile

| Variable | States | Population (All), size 233 | Cluster 1, size 84 | Cluster 2, size 36 | Cluster 3, size 27 | Cluster 4, size 24 | Cluster 7, size 18 | Cluster 5, size 14 | Cluster 8, size 12 | Cluster 6, size 11 | Cluster 9, size 5 | Cluster 10, size 2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| B Ch | 3.27+/-2.37 | 3.27+/-2.37 | 1.90+/-0.79 | 5.76+/-2.15 | 1.78+/-0.80 | 6.55+/-2.31 | 2.19+/-1.01 | 5.03+/-2.01 | 1.78+/-0.84 | 3.31+/-2.34 | 2.41+/-0.80 | 4.00+/-1.41 |
| B Nu | 3.22+/-3.40 | 3.22+/-3.40 | 1.04+/-0.19 | 6.53+/-3.16 | 1.00 | 7.52+/-3.33 | 1.15+/-0.37 | 8.62+/-2.51 | 2.62+/-1.35 | 3.23+/-2.17 | 1.16+/-0.39 | 2.00 |
| Class: benign | benign | 164 | 1.000 | 0.124 | 1.000 | 0.000 | 1.000 | 0.069 | 1.000 | 0.990 | 1.000 | 1.000 |
| Class: malignant | malignant | 69 | 0.000 | 0.876 | 0.000 | 1.000 | 0.000 | 0.931 | 0.000 | 0.010 | 0.000 | 0.000 |
| Class: missing | missing | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| CT | 4.15+/-2.75 | 4.15+/-2.75 | 2.39+/-1.40 | 6.09+/-2.33 | 3.37+/-1.66 | 7.41+/-2.32 | 2.88+/-1.70 | 8.85+/-1.26 | 2.65+/-1.65 | 3.49+/-1.78 | 3.90+/-1.13 | 3.00+/-2.83 |
| M Adh | 2.63+/-2.65 | 2.63+/-2.65 | 1.00 | 4.88+/-2.49 | 1.67+/-0.83 | 6.47+/-3.02 | 1.00+/-0.02 | 4.76+/-3.21 | 2.98+/-2.90 | 1.26+/-0.63 | 2.07+/-1.22 | 2.00 |
| Mitoses | 1.52+/-1.61 | 1.52+/-1.61 | 1.00 | 1.00 | 1.00 | 4.72+/-3.02 | 1.00 | 1.93+/-0.61 | 1.60+/-1.89 | 1.00 | 1.00 | 2.00 |
| N Nuc | 2.65+/-2.83 | 2.65+/-2.83 | 1.00 | 6.04+/-3.24 | 1.13+/-0.34 | 6.46+/-3.11 | 1.74+/-0.77 | 4.69+/-2.24 | 1.00 | 1.00 | 1.87+/-0.36 | 2.50+/-0.71 |
| SECS | 3.03+/-2.08 | 3.03+/-2.08 | 1.93+/-0.37 | 4.89+/-2.05 | 2.00 | 6.70+/-2.60 | 2.00 | 3.22+/-1.05 | 2.00+/-1.03 | 2.29+/-0.82 | 2.64+/-0.94 | 2.00 |
| UC Sh | 2.91+/-2.81 | 2.91+/-2.81 | 1.00 | 6.14+/-2.27 | 1.93+/-0.92 | 7.66+/-2.54 | 1.40+/-0.74 | 4.19+/-1.75 | 1.10+/-0.32 | 2.47+/-1.63 | 1.19+/-0.43 | 1.50+/-0.71 |
| UCS | 2.81+/-2.86 | 2.81+/-2.86 | 1.00 | 5.89+/-2.52 | 1.11+/-0.32 | 8.02+/-2.31 | 2.01+/-0.86 | 3.94+/-1.64 | 1.00 | 1.88+/-1.08 | 1.31+/-0.50 | 2.50+/-0.71 |

Table 2 gives the profile of each cluster over all the attributes of the given dataset. The table also shows the size of each cluster, i.e. the number of records per cluster. The attribute Class has only two values, benign and malignant, and all the other attributes have integer values in the given dataset, but the MS clustering algorithm shows two values for each attribute, which may confuse the user. The value of each attribute varies from cluster to cluster. The interpretation of table 2 is somewhat difficult.

Table 3. Clusters Characterizing

Variables Values Probability

Class benign Probability = 70.386%

Class malignant Probability = 29.614%

B Nu 3.2 - 5.5 Probability = 24.980%

B Ch 3.3 - 4.9 Probability = 24.980%

UC Sh 2.9 - 4.8 Probability = 24.980%

SECS 1.6 - 3.0 Probability = 24.980%

CT 4.2 - 6.0 Probability = 24.980%

CT 2.3 - 4.2 Probability = 24.980%

N Nuc 2.7 - 4.6 Probability = 24.980%

B Ch 1.7 - 3.3 Probability = 24.980%

UCS 2.8 - 4.7 Probability = 24.980%







Table 3 characterizes the clusters: it lists each attribute/variable, its value range in different clusters and the probability of the variable. The value and the probability of the variables/attributes SECS, MAdh, UCSh, UCS, CT and BNu are high in some clusters as compared to the rest of the variables/attributes.
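The Class probabilities in Table 3 can be cross-checked against the counts in Table 2. The short Python check below uses only numbers taken from those two tables; the variable names are illustrative.

```python
# Cluster sizes as listed in the header of Table 2 (cluster number -> records).
cluster_sizes = {1: 84, 2: 36, 3: 27, 4: 24, 7: 18, 5: 14, 8: 12, 6: 11, 9: 5, 10: 2}

# Class counts for the whole population, from the Class row of Table 2.
class_counts = {"benign": 164, "malignant": 69}

population = sum(cluster_sizes.values())

# Share of each class in percent, rounded to three decimals as in Table 3.
class_share = {
    label: round(100 * count / population, 3)
    for label, count in class_counts.items()
}

print(population)   # 233
print(class_share)  # {'benign': 70.386, 'malignant': 29.614}
```

The ten cluster sizes add up to the stated population of 233, and the two class shares reproduce the 70.386% and 29.614% reported in Table 3, so these rows restate the class distribution of the dataset rather than a property of any one cluster.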

Table 4. Cluster Discrimination

Variables Values Favors Cluster 1 Favors Complement of Cluster 1

UCS 1.0 Score = 0.000

UC Sh 1.0 Score = 0.069

N Nuc 1.0 Score = 0.288

M Adh 1.0 Score = 0.324

Mitoses 1.0 Score = 3.879

B Nu 1.0 1.5 Score = 28.032

UC Sh 1.0 10.0 Score = 51.050

UCS 1.0 10.0 Score = 52.941

M Adh 1.0 10.0 Score = 53.583

N Nuc 1.0 10.0 Score = 54.846

B Nu 1.5 10.0 Score = 59.424

SECS 1.3 2.5 Score = 61.108

Mitoses 1.0 10.0 Score = 64.856

SECS 2.5 10.0 Score = 76.031

Class benign Score = 79.337

Class malignant Score = 79.337

B Ch 1.0 2.8 Score = 80.075

B Ch 2.8 10.0 Score = 85.118

CT 1.0 3.3 Score = 90.353

CT 3.3 10.0 Score = 91.269

SECS 1.0 1.3 Score = 96.887

Table 4 shows the cluster discrimination; only the results for cluster 1 are shown. For each variable and value range the table gives a score that favors either cluster 1 or the complement of cluster 1. The results for the remaining clusters can be displayed similarly. These are the three options available after applying the MS clustering algorithm.

2. The Results of MS Decision Tree Algorithm

Figure 6. The Decision Tree of the Breastcancer dataset







We apply the MS Decision Tree algorithm, which is the ID3 data mining algorithm, to the Breastcancer dataset. Figure 6 depicts the structure of the decision tree. In our proposed UDMTool we produce rules instead of a tree. In MS SQL Server, in order to get the decision rules, one has to apply the MS Association Rules algorithm.

3. The Results of MS Association Rules

Table 5. The Association Rules

Support Size Itemset

196 1 Mitoses < 1.1818008626

164 1 Class = benign

160 2 Class = benign, Mitoses < 1.1818008626

154 1 SECS < 2.352245034

148 2 SECS < 2.352245034, Mitoses < 1.1818008626

147 1 N Nuc < 1.4350025798

145 2 SECS < 2.352245034, Class = benign

144 2 N Nuc < 1.4350025798, Mitoses < 1.1818008626

142 3 SECS < 2.352245034, Class = benign, Mitoses < 1.1818008626

141 2 N Nuc < 1.4350025798, Class = benign

140 3 N Nuc < 1.4350025798, Class = benign, Mitoses < 1.1818008626

140 1 UCS < 1.6782988738

Table 5 shows the association rules of the Breastcancer dataset. We show only the itemsets with the highest support values; otherwise the MS Association Rules algorithm produces a long list, which can confuse the user about how to select a specific value and obtain the required results. It is important to note that the MS Association algorithm has to be applied in order to get the rules, because the decision tree in MS SQL Server does not produce decision rules. The proposed UDMTool uses the C4.5 data mining algorithm for classification and produces only a few rules in if-then-else form, which makes it easy for the user to take a decision.
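The support counts in Table 5 are enough to derive the usual rule measures by hand. The sketch below computes confidence and lift for the rule "Mitoses < 1.18 implies Class = benign" using the standard textbook definitions. The supports are copied from Table 5; the total of 233 records is taken from Table 2 on the assumption that both tables describe the same data, and the function names are illustrative.

```python
def confidence(support_both, support_antecedent):
    """Fraction of antecedent records that also satisfy the consequent."""
    return support_both / support_antecedent


def lift(support_both, support_antecedent, support_consequent, total):
    """Observed joint frequency divided by the frequency expected under independence."""
    joint = support_both / total
    return joint / ((support_antecedent / total) * (support_consequent / total))


TOTAL = 233             # population size from Table 2 (assumed to apply here)
MITOSES_LOW = 196       # support of {Mitoses < 1.1818008626}
BENIGN = 164            # support of {Class = benign}
BENIGN_AND_LOW = 160    # support of {Class = benign, Mitoses < 1.1818008626}

rule_confidence = confidence(BENIGN_AND_LOW, MITOSES_LOW)
rule_lift = lift(BENIGN_AND_LOW, MITOSES_LOW, BENIGN, TOTAL)

print(round(rule_confidence, 3), round(rule_lift, 2))
```

A lift only slightly above 1 indicates that low Mitoses by itself adds little beyond the base rate of benign cases, which illustrates why a long list of high-support itemsets is hard to turn into decisions.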

Case 2: The Results of ODM 11g2

In ODM, unlike in MS SQL Server, there is no option to save the results of each data mining process; therefore, the results are saved using print screen. Figure 7 depicts the workflow of the clustering model; the other data mining models such as classification, association and feature selection are applied similarly.

Figure 7. The Workflow of the Clustering Model

The ODM provides the user with a visual workflow for each model. Figure 7 shows the workflow of the clustering model. The data source, which is an Oracle table or a dataset, is the required component. The second component is explore data, which is basically a view of the dataset; we consider it an optional component. The last component is a model, which is one of the data mining processes such as clustering, classification, association and feature selection from the list provided by ODM. The user can apply only one model at a time, which is why we refer to ODM as a single-step tool. A link is created between the data source and explore data, and between the data source and the model. Finally, the model is built: the ODM applies all the available data mining algorithms in the model, and the user can compare the results of all algorithms and also view the results of a particular data mining algorithm. The user can also store the results in a separate table.

1. The Enhanced k-means Clustering Algorithm

Figure 8. The Results of the K-means Clustering Model

We apply the enhanced k-means clustering algorithm of ODM. The algorithm uses the top-down, or divisive, technique of hierarchical clustering. ODM provides an option to set the parameters of the algorithm; if the parameters are not set, ODM uses the defaults. We test the dataset with the default parameters. The ODM creates the clusters in a tree structure; the clusters are shown in figure 8. The ODM also characterizes each cluster, giving the centroids and the cluster rules separately, which helps the user to understand the cluster better. In this way we assume that the ODM is unifying the clustering and classification processes.







Figure 9. The Results of the K-means Clustering Model with Centroid

Figure 9 shows the values of the centroids of a cluster. The centroid values play no role in the extraction of knowledge from a dataset.

Figure 10. The Results of the K-means Clustering Model with Cluster Rules

Figure 10 shows the rules of a cluster, which is a task of the classification data mining process. The rules of a cluster, also known as decision rules, play a vital role in the extraction of knowledge from a dataset. Our proposed UDMTool, on the other hand, provides the decision rules of each cluster in the next step by using C4.5, a classification data mining algorithm. The user can apply these decision rules in simple queries for further validation of the results.
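As an illustration of applying a decision rule in a simple query, the sketch below turns the conditions of one if-then rule into a parameterized SQL WHERE clause and runs it against an in-memory SQLite table. The rule, the thresholds (rounded from the itemset bounds in Table 5), the table and column names and the four sample rows are all hypothetical; they are not output of UDMTool or ODM.

```python
import sqlite3


def rule_to_where(conditions):
    """Build a parameterized WHERE clause from (column, operator, value) triples."""
    allowed_operators = {"<", "<=", ">", ">=", "="}
    clauses, params = [], []
    for column, operator, value in conditions:
        # Reject anything that is not a plain identifier or a known operator.
        if operator not in allowed_operators or not column.isidentifier():
            raise ValueError(f"unsupported condition: {column} {operator}")
        clauses.append(f"{column} {operator} ?")
        params.append(value)
    return " AND ".join(clauses), params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE breast_cancer (secs REAL, mitoses REAL, class TEXT)")
conn.executemany(
    "INSERT INTO breast_cancer VALUES (?, ?, ?)",
    [(2.0, 1.0, "benign"), (6.0, 4.0, "malignant"),
     (1.0, 1.0, "benign"), (2.0, 3.0, "malignant")],
)

# Hypothetical rule: IF secs < 2.35 AND mitoses < 1.18 THEN class = benign
where, params = rule_to_where([("secs", "<", 2.35), ("mitoses", "<", 1.18)])
matches = conn.execute(
    f"SELECT COUNT(*) FROM breast_cancer WHERE {where}", params
).fetchone()[0]
print(where, matches)
```

Comparing the records selected by the rule with their stored class labels is one simple way to validate a rule against the data.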

2. The Results of Classification using the Decision Tree Algorithm

Figure 11. The Decision Tree Algorithm with Decision Rules

We apply the decision tree algorithm from the classification model of ODM; the results are shown in figure 11. The algorithm creates a tree structure of clusters and characterizes each cluster in the form of rules, surrogates and target values. Furthermore, the number of clusters produced by the decision tree algorithm differs from that of the enhanced k-means clustering algorithm. The decision rules help the user to understand the cluster better. In this way we assume that the ODM is unifying the clustering and classification processes.

Figure 12. The Decision Tree Algorithm with Surrogates

Figure 12 shows the surrogates of a cluster. The surrogate values play no role in the extraction of knowledge from a dataset.

Figure 13. The Decision Tree Algorithm with Target Values

Figure 13 shows the target values in a cluster. The percentage of the target values varies from cluster to cluster. We can say that the target values play no role in the extraction of knowledge from a dataset.

Remark: After applying the clustering and classification models of ODM, it is difficult for the user to select the right model, because in both models the clusters are created first and then the rules of each cluster are produced, yet the output of the two models is not the same. In UDMTool the first process is clustering, followed by classification and visualization; therefore, there is no such problem in a multi-step tool. We can say the results of the clustering model are accurate because in the data mining process model the clusters are created first and then the rest of the processes are applied to extract the useful information and knowledge.







Case 3: The Results of UDMTool

The UDMTool produces the 2D scatter graphs as the final output(s) of the Breastcancer dataset which can be

interpreted as knowledge.

Figure 14 The Graph between UCSh and MAdh attributes of Breastcancer dataset

The graph in figure 14 can be divided into two regions: in the first region the values of the attributes Uniformity of Cell Shape and Marginal Adhesion vary, and in the second region they are constant. The outcome of this graph is that if the values of the attributes vary then the patient has the malignant class of breast cancer, and if they are constant the patient has the benign class.

Figure 15 The Graph between BCh and Mitoses attributes of Breastcancer dataset

The values of the attributes Mitoses and Bland Chromatin are almost constant throughout the graph in figure 15. The graph can be divided into two main regions: the values of the attributes Bland Chromatin and Mitoses vary in the first region and remain constant in the second region. The outcome of this graph is that if the values of the attributes vary then the patient has the malignant class of breast cancer, otherwise the benign class.

Table 6 below summarizes the results of the data mining processes of clustering, classification and visualization using MS SQL Server, ODM and UDMTool.

Table 6. Summary of the output

Columns: Data Mining Process | MS SQL Server | ODM | UDMTool

Clustering
MS SQL Server:
1. Uses the MS Clustering algorithm and creates 10 clusters by default (the number of clusters is not optimized).
2. Provides further characterization of each cluster, such as the population per cluster and the probability of each input variable.
3. Provides the bindings (weak or strong) among clusters.
Remark: The cluster population and probability alone are not sufficient to extract knowledge.
ODM:
1. Uses the K-means Clustering algorithm and creates 10 clusters by default (the number of clusters is not optimized) in a hierarchical structure.
2. Provides further characterization of each cluster, such as centroids and cluster rules.
Remark: Cluster rules are basically an output of the classification data mining process. In this way the ODM unifies the clustering and classification data mining processes, which is a step forward towards knowledge extraction.
UDMTool:
1. Uses the K-means Clustering algorithm and creates 2 clusters according to the target values of the input variable of the given dataset, because the number of clusters is optimized.
2. Provides further characterization of each cluster, such as the population per cluster.

Classification
MS SQL Server:
1. Uses the MS Decision Tree algorithm and creates a horizontal tree of the whole dataset. The tree has 8 nodes in total.
2. The rules of the dataset can be created by another algorithm, MS Association. The list of rules is very long, sometimes misleading, and confuses the user in the selection of the important and best rules. In this way the user has to apply two data mining algorithms to obtain the decision rules.
Remark: The nodes of the tree do not reflect the knowledge.
ODM:
1. Uses the Decision Tree algorithm and creates a hierarchical tree of the whole dataset. The tree has 7 nodes in total.
2. Provides further characterization of each node by surrogates, decision rules and the percentage of the target value in each node.
3. The decision rules are in if-then-else form, which can be deployed in a simple query.
Sometimes it looks as if there is no difference between the clustering and classification models in ODM; the only difference is in the characterization. The decision rules vary from cluster to cluster.
Remark: There is still confusion in the selection of the results of these two data mining processes in ODM.
UDMTool:
1. Uses the output(s) of the clustering process as input, applies the C4.5 (Decision Tree) algorithm and classifies each cluster by providing the decision rules as output.
2. The number of decision rules varies from cluster to cluster. The list of decision rules is not as long as in MS SQL Server. This is also referred to as the characterization of classified clusters.
3. The output of this process is in if-then-else form, as in ODM, which can be deployed in a simple query.

Visualization
MS SQL Server:
No such model/algorithm is provided, although MS SQL Server provides a GUI in each process of data mining. The user can save the results and use MS Excel as a visualization tool.
Remark: The data mining processes are not unified; rather, each process is carried out individually, and therefore it is difficult to extract the knowledge.
ODM:
No such model/algorithm is provided, although ODM provides a GUI in each process of data mining. The user can save the results and use MS Excel as a visualization tool.
Remark: The data mining processes clustering and classification are unified, which is a step forward in knowledge extraction.
UDMTool:
Provides the 2D graphs of each classified cluster, which helps the user to visualize and then interpret the results and finally extract the knowledge.
Remark: The data mining processes are unified, which makes it easier for the user to extract the knowledge.

Conclusion A single-step data mining tool

where the selection of algorithms is

A single-step (up to some

extent a multi-step) data

A multi-step data mining

tool where the selection of






15/16

ad-hoc and difficult to extract

knowledge.

mining tool where the

selection of algorithms is ad-

hoc and the knowledge

extraction is ease as compared

to MS SQL Server.

algorithms is automatic

(based on the value of the

dataset) and the knowledge

extraction is very simple.
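Table 6 notes that decision rules in (if-then-else) form can be deployed in a simple query. The following is a minimal sketch of that idea only, not the output of any of the three tools: the attribute names (`cell_size`, `cell_shape`), the thresholds, the class labels and the table name `breast_cancer` are all hypothetical.

```python
# Hypothetical decision rules: (condition, predicted class).
# Attribute names, thresholds and labels are illustrative assumptions,
# not rules produced by ODM or UDMTool.
RULES = [
    ("cell_size <= 2 AND cell_shape <= 2", "benign"),
    ("cell_size > 2 AND cell_shape > 4", "malignant"),
]
DEFAULT_LABEL = "unknown"  # plays the role of the final "else" branch


def rules_to_case(rules, default):
    """Render a list of if-then rules as one SQL CASE expression."""
    lines = ["CASE"]
    for condition, label in rules:
        lines.append(f"  WHEN {condition} THEN '{label}'")
    lines.append(f"  ELSE '{default}'")
    lines.append("END")
    return "\n".join(lines)


def build_query(table, rules, default):
    """Wrap the CASE expression in a simple SELECT statement."""
    case_expr = rules_to_case(rules, default)
    return f"SELECT *,\n{case_expr} AS predicted_class\nFROM {table};"


query = build_query("breast_cancer", RULES, DEFAULT_LABEL)
print(query)
```

Each rule becomes one WHEN branch, so a short rule list of the kind reported per cluster stays readable as a single query, whereas a very long rule list would not.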

We tested the Breastcancer dataset, a medical dataset, on three different data mining tools: MS SQL Server, ODM and UDMTool. The obtained results differ although we used the same data mining algorithm in each data mining process. Firstly, in MS SQL Server and ODM some of the inputs of the data mining algorithms are not optimized, whereas UDMTool uses optimized algorithms. Secondly, the data mining processes clustering, classification and visualization are carried out individually in MS SQL Server and there is no relation between them; therefore, it is difficult to extract the knowledge. In ODM, clustering and classification are unified, which helps the user to extract the knowledge. In UDMTool, the data mining processes are unified: the output of clustering is the input of classification and the output of classification is the input of visualization, which provides the user with knowledge.
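The chaining described above, where each process consumes the output of the previous one, can be sketched as follows. This is an illustration under stated assumptions and not UDMTool's implementation: the toy dataset is invented, taking the number of clusters from the number of distinct target values is one reading of Table 6, a single-attribute threshold rule (a decision stump) stands in for C4.5, and a text bar stands in for the 2D graphs.

```python
from collections import Counter

# Invented toy dataset: (feature vector, target value). Not real data.
DATA = [
    ((1.0, 1.0), "benign"), ((1.0, 2.0), "benign"), ((2.0, 1.0), "benign"),
    ((3.0, 1.0), "malignant"), ((3.0, 2.0), "malignant"),
    ((8.0, 8.0), "benign"), ((8.0, 9.0), "benign"),
    ((9.0, 8.0), "malignant"), ((9.0, 9.0), "malignant"),
]


def kmeans(points, k, iterations=20):
    """Plain k-means, started from the first k distinct points."""
    centroids = list(dict.fromkeys(points))[:k]
    assign = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared distance).
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def best_rule(rows):
    """Best single-attribute threshold rule (a stump, not full C4.5)."""
    best = None
    for j in range(len(rows[0][0])):
        for t in sorted({x[j] for x, _ in rows}):
            left = [y for x, y in rows if x[j] <= t]
            right = [y for x, y in rows if x[j] > t]
            if not left or not right:
                continue  # not a real split
            l_lab, l_hit = Counter(left).most_common(1)[0]
            r_lab, r_hit = Counter(right).most_common(1)[0]
            if best is None or l_hit + r_hit > best[0]:
                best = (l_hit + r_hit, f"IF x{j} <= {t} THEN {l_lab} ELSE {r_lab}")
    if best is None:  # no usable split: fall back to the majority label
        return f"ALWAYS {Counter(y for _, y in rows).most_common(1)[0][0]}"
    return best[1]


def pipeline(data):
    """Clustering -> classification of each cluster -> text 'visualization'."""
    points = [x for x, _ in data]
    k = len({y for _, y in data})      # assumption: one cluster per target value
    assign = kmeans(points, k)         # step 1: clustering
    report = []
    for c in range(k):
        rows = [row for row, a in zip(data, assign) if a == c]
        if not rows:
            continue
        rule = best_rule(rows)         # step 2: rules for this cluster
        bar = "#" * len(rows)          # step 3: stand-in for a 2D graph
        report.append((c, len(rows), rule, bar))
    return report


for cluster_id, size, rule, bar in pipeline(DATA):
    print(f"cluster {cluster_id} ({size} rows) {bar}: {rule}")
```

The point of the sketch is the data flow: `pipeline` never asks the user to pick an algorithm or to hand results from one step to the next.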

5. Conclusion

The conclusion is that in MS SQL Server the selection of the data mining algorithms, which are also called the MS data algorithms, is easy, but the choice of the algorithm depends on the user, not on the data. The user has to select different algorithms at each step of the data mining process to obtain the knowledge, which is the primary goal of data mining. In a single-step data mining tool like MS SQL Server, if one algorithm does not provide the required results, the user has to choose another one. In ODM the processes of clustering and classification are unified, i.e. if the user applies the clustering algorithm it automatically produces the rules of each cluster. Similarly, if the user chooses the classification algorithm it first produces the clusters and then the rules of each cluster. This is to some extent a step towards a multi-step knowledge extraction process. But again the choice of the algorithm depends on the user, not on the data. ODM provides a workflow facility, which is helpful for the user. It is evident from the above results that it is still difficult for the user to extract knowledge from ODM, although the tool provides a lot of statistical information about the given dataset. We conclude that no single algorithm can produce the knowledge. Knowledge extraction is therefore not possible in single-step data mining tools like MS SQL Server and ODM, because it is a multi-step process, for which our proposed UDMTool is the ultimate choice.

Another issue in the single-step tools is that the data mining algorithm selected for a particular task takes the whole dataset and produces the results. The produced results are not the inputs of other data mining tasks; therefore, it is difficult to extract knowledge from the given dataset. This is because in single-step tools each data mining task is carried out individually instead of the tasks being unified. Unification is only possible if the output of one data mining task is the input of the next, i.e. the output of the clustering task must be the input of the classification process, which is not possible in the single-step tools. One possible solution is that the user saves the results of the first step in a separate dataset and then applies the newly created dataset as input to the next step, which we believe is a lengthy process because saving the results and preparing a dataset is not very simple. It is evident from the results of both single-step data mining tools, MS SQL Server and ODM, that no single algorithm can produce the knowledge, because knowledge extraction is a multi-step process, and our proposed UDMTool is the ultimate choice.
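The manual workaround mentioned above, saving the results of the first step in a separate dataset and loading it as input to the next step, might look like the sketch below. The rows, the column names and the in-memory buffer (standing in for a file on disk) are hypothetical; in a real single-step tool the user would also have to export the results from the tool in the first place.

```python
import csv
import io

# Hypothetical output of a clustering step: original rows plus a cluster id.
clustered_rows = [
    {"cell_size": 1, "cell_shape": 1, "target": "benign", "cluster": 0},
    {"cell_size": 2, "cell_shape": 1, "target": "benign", "cluster": 0},
    {"cell_size": 8, "cell_shape": 9, "target": "malignant", "cluster": 1},
]


def save_dataset(rows, handle):
    """Write the clustering results as a new CSV dataset."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


def load_cluster(handle, cluster_id):
    """Read the saved dataset back and keep the rows of one cluster."""
    reader = csv.DictReader(handle)
    return [row for row in reader if int(row["cluster"]) == cluster_id]


buffer = io.StringIO()  # stands in for a file on disk
save_dataset(clustered_rows, buffer)
buffer.seek(0)
next_step_input = load_cluster(buffer, 0)
print(len(next_step_input), "rows passed to the classification step")
```

Even in this small form the user has to manage the intermediate dataset, its columns and its types by hand (CSV values come back as strings), which is the overhead a unified tool removes.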

Acknowledgement

The authors are thankful to The Islamia University of Bahawalpur, Pakistan for providing financial assistance to

carry out this research activity under HEC project 6467/F II.

References

[1] Berry, M.J., Data Mining Techniques: For Marketing, Sales and Customer Relationship Management, Hoboken, NJ, USA: John Wiley & Sons Incorporated, pp. 35, 2004.

[2] Skrypnik, Irina, Terziyan, Vagan, Puuronen, Seppo, and Tsymbal, Alexey, Learning Feature Selection for Medical Databases, CBMS 1999.

[3] Peng, Y., Kou, G., Shi, Y., Chen, Z., A Descriptive Framework for the Field of Data Mining and Knowledge Discovery, International Journal of Information Technology and Decision Making, Vol. 7, Issue 4, pp. 639-682, 2008.







[4] Grossman, Robert, Kasif, Simon, Moore, Reagan, Rocke, David and Ullman, Jeff, Data Mining Research: Opportunities and Challenges, A Report of three NSF Workshops on Mining Large, Massive, and Distributed Data, (Draft 8.4.5) January 21, 1998.

[5] Yang, Qiang, Wu, Xindong, 10 Challenging Problems in Data Mining Research, International Journal of Information Technology & Decision Making, Vol. 5, No. 4, pp. 597-604, 2006.

[6] Wu, Xindong, Kumar, Vipin, Quinlan, J. Ross, et al., Top 10 algorithms in data mining, Survey Paper, Knowl Inf Syst (2008) 14:1-37, 2008.

[7] Das, Somenath, "Unified data mining engine as a system of patterns", Master's Theses, Paper 3440, http://scholarworks.sjsu.edu/etd_theses/3440, 2007.

[8] Singh, Shivanshu K., Eranti, Vijay Kumer, Fayad, M.E., Focus Group on Unified Data Mining Engine (UDME 2010): Addressing Challenges, Focus Group Proposal, 2010.

[9] CRISP-DM 1.0, Step-by-step data mining guide, at URL: http://www.crisp-dm.org/CRISPWP-0800.pdf

[10] Oracle Data Mining Concepts 10g Release 2 (10.2), at URL: http://docs.oracle.com/html/B14339_01/5dmtasks.htm

[11] US Census Bureau, Iris, Diabetes, Vote and Breast datasets, at URL: www.sgi.com/tech/mlc/db, visited 2009.

[12] Web site of Microsoft, at URL: http://msdn.microsoft.com/en-us/library/bb510508(v=sql.105).aspx

[13] Oracle Data Mining (ODM) Concepts, 10g Release 1 (10.1), Part Number B10698-01, at URL: http://docs.oracle.com/cd/B12037_01/datamine.101/b10698/, 2003.

[14] Utley, Craig, Introduction to SQL Server 2005 Data Mining, at URL: http://msdn.microsoft.com/en-us/library/ms345131(v=sql.90).aspx, 2005.



