A Comparative Study of Single-step and Multi-step Data Mining Tools


    difficult to extract knowledge from the given dataset. On the other hand, in a multi-step tool the data mining tasks clustering, classification and visualization are unified, so the tool looks like a single-step tool and provides the knowledge as output.

    The rest of the paper is organized as follows: section 2 describes the data mining tools, section 3 compares the tools, section 4 discusses the results and section 5 draws the conclusion.

    2. Data Mining Tools

    In this section we discuss the single-step data mining tools, namely ODM and MS SQL Server, and a multi-step data mining tool called UDMTool.

    2.1 Oracle Data Mining (ODM)

    The architecture of ODM is based on the Cross Industry Standard Process for Data Mining (CRISP-DM) model, which was founded in 1997 and funded by the European Commission. The main idea was to define an industry standard for data mining [9]. The CRISP-DM process is shown below:

    Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment

    There are six steps in the CRISP-DM process model. The ODM implements and supports the last three steps of the CRISP-DM model. The main components of the ODM are shown below:

    Data Source → Modeling → Evaluation and Deployment

    Data mining is an iterative process; it continues after a solution is deployed. The lessons learned during the process can trigger new business questions, and any change in the data can require new models. Subsequent data mining processes benefit from the experience of previous ones. The remaining steps are supported by a combination of the ODM and the Oracle database, especially in the context of an Oracle data warehouse. The facilities of the Oracle database can be very useful during data understanding and data preparation. The ODM integrates data mining with the Oracle database and exposes it through several interfaces, namely the Java interface, the PL/SQL interface, automated data mining, the data mining SQL functions and the graphical interfaces. The ODM supports export and import of data mining models in native format between Oracle databases or schemas, which provides a way to move models [9][10][13]. The workflow of the ODM is illustrated in figure 1.

    Figure 1. The Workflow of the ODM

    Figure 1 depicts the workflow of the ODM. The data source is the dataset, explore data is a view of the dataset, and the selection of model chooses among data mining models such as clustering, classification, association and feature extraction. These are the components used for mining in the ODM. The next phase is to apply the model to the dataset and finally store the results in a separate table for further processing. The user can build a model with only two components, the data source and the model; the remaining components are there to assist the user.

    2.2 MS SQL Server

    The MS SQL Server also uses the Cross Industry Standard Process for Data Mining (CRISP-DM) model.

    Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment

    Data mining is a process that involves the interaction of multiple components. In MS SQL Server one can access sources of data in a SQL Server database or any other data source for training, testing or prediction; define the data mining structures and models by using Business Intelligence Development Studio or Visual Studio 2008; and manage the data mining objects and create predictions and queries by using SQL Server Management Studio. After the solution is complete, it is deployed to an instance of Analysis Services. The main components of the MS SQL Server are shown below:

    JOURNAL OF COMPUTING, VOLUME 4, ISSUE 10, OCTOBER 2012, ISSN (Online) 2151-9617

    https://sites.google.com/site/journalofcomputing

    WWW.JOURNALOFCOMPUTING.ORG 27

    2012 Journal of Computing Press, NY, USA, ISSN 2151-9617


    Data Source → Data Mining Structure → Data Mining Models → Deployment

    In MS SQL Server, data mining can be done quickly and easily on relational data tables or on any other data source that has been defined as an Analysis Services data source view. MS SQL Server 2008 Analysis Services also provides the ability to separate the data into training and testing datasets. A data mining structure is a logical data structure that defines the data domain from which mining models are built. A single mining structure can support multiple mining models that share the same domain. The data mining structure can also be partitioned into a training and a test dataset, and this partitioning can be done automatically when the structure is defined. A data mining model represents a combination of data, a data mining algorithm, and a collection of parameter and filter settings that affect which data is used and how it is processed. The ultimate goal of data mining development is to create a model that can be used by end users [12][14].
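The automatic partitioning into training and test data described above can be pictured with a minimal sketch. This is not Analysis Services code: the function name `holdout_split`, the 30% test fraction and the fixed seed are assumptions made purely for illustration.

```python
import random


def holdout_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows and return (training, testing) lists."""
    if not 0.0 <= test_fraction <= 1.0:
        raise ValueError("test_fraction must lie in [0, 1]")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Every record lands in exactly one of the two lists, so a model trained on the first list can be evaluated on records it has never seen.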

    2.3 The Unified Data Mining Tool (UDMTool)

    The Unified Data Mining Tool (UDMTool) is a next-generation solution based on the Unified Data Mining Theory (UDMT), a unified way of architecting and building software solutions by integrating different data mining tasks. The foundation of the UDMTool is that knowledge can only be obtained if the data mining processes clustering, classification and visualization are unified, i.e. knowledge can be extracted from a given dataset only after it has passed through all of these processes. This is illustrated in equation (1).

    $\text{Knowledge} = \text{Clustering} \cup \text{Classification} \cup \text{Visualization}$ (1)

    It can be written as in equation (2).

    $K = A \cup B \cup C$ (2)

    where A is the clustering, B is the classification, C is the visualization and K is the knowledge.

    The architecture of the UDMTool is based on the unified data mining process (UDMP) as illustrated in figure 2.

    Figure 2. The Unified Data Mining Process

    The first three processes in figure 2 are data gathering, data cleansing and dataset preparation. The next process unifies the clustering, classification and visualization processes of data mining, called the unified data mining processes (UDMP), followed by the output, which is the knowledge. The user evaluates and interprets the knowledge according to his/her business rules. The dataset is the only required input; the knowledge is produced as the final output of the UDMP. In contrast to ad-hoc data mining models, in the UDMP the appropriate data mining algorithms are selected automatically depending on the nature and the value of the given dataset. Figure 3 depicts the architecture of the UDMTool.


    Figure 3. The Architecture of the UDMTool

    The UDMTool is a multiagent system (MAS). The dataset is the required input; there are many types of datasets, such as numeric, categorical, multimedia and text. The first agent takes the dataset and computes the value of the Akaike Information Criterion (AIC), a model selection criterion; the second agent creates the appropriate vertical partitions of the dataset; and the third agent computes the logarithm of the complexities O of the data mining algorithms deployed in the UDMTool. The fourth agent feeds the vertical partitions of the dataset to the UDMP, which is itself a MAS with one agent for clustering, a second for classification and a third for visualization. These agents are cascaded, i.e. the output of the first agent is the input of the second, and the output of the second is the input of the third. The appropriate data mining algorithms for clustering, classification and visualization are selected through the AIC value of the given dataset; the process is completed by an agent which maps the AIC value to the logarithmic value of the complexities O of the data mining algorithms. The function of the UDMTool is demonstrated in figure 4.
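The paper does not state how the AIC value is computed or how it is mapped to the algorithm complexities, so the following is only a sketch of the criterion in its standard form, AIC = 2k - 2 ln(L), where k is the number of fitted parameters and L is the maximized likelihood. The function names and the candidate dictionary are hypothetical and are not part of the UDMTool.

```python
def aic(log_likelihood, num_params):
    """Standard Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * num_params - 2 * log_likelihood


def select_by_aic(candidates):
    """Return the name of the candidate with the lowest AIC.

    candidates maps a name to a (log_likelihood, num_params) pair.
    """
    return min(candidates, key=lambda name: aic(*candidates[name]))
```

A lower AIC is preferred: a candidate has to improve the log-likelihood by more than one unit per extra parameter to be selected over a simpler one.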

    Figure 4. The Function of the UDMTool

    A well-prepared dataset is the input of this framework. First, an intelligent agent computes the value of the model selection criterion AIC, which is used to select the appropriate data mining algorithm. A MAS called the UDMP is based on the UDMT. Finally, the knowledge is extracted, which is either accepted or rejected. The relationship between the dataset and the selection criterion is one-to-one, i.e. one dataset has one value for model selection, and the relationship between the dataset and the vertical partitions is one-to-many, i.e. more than one partition is created for one dataset. The relationship between the selection criterion and the UDMP is one-to-one, i.e. one value of the selection criterion gives one data mining algorithm, and finally the relationship between the vertical partitions and the UDMP is many-to-many, i.e. many partitioned datasets are inputs for the UDMP and only one result is produced as knowledge.
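The vertical partitioning and the cascade of agents can be sketched as below. The paper does not publish the agents' code, so all names are hypothetical, and the three `toy_*` stages are deliberately trivial stand-ins for the clustering, classification and visualization agents, not the algorithms the UDMTool actually uses.

```python
def vertical_partitions(rows, column_groups):
    """Split a row-oriented table into column-wise partitions."""
    return [[tuple(row[i] for i in group) for row in rows]
            for group in column_groups]


def cascade(data, stages):
    """Feed the output of each stage into the next one."""
    result = data
    for stage in stages:
        result = stage(result)
    return result


def toy_cluster(partition):
    # Stand-in clustering agent: split rows around the mean of the first column.
    mean = sum(row[0] for row in partition) / len(partition)
    clusters = {"low": [], "high": []}
    for row in partition:
        clusters["low" if row[0] <= mean else "high"].append(row)
    return clusters


def toy_classify(clusters):
    # Stand-in classification agent: describe each non-empty cluster by a value range.
    return {name: (min(r[0] for r in rows), max(r[0] for r in rows))
            for name, rows in clusters.items() if rows}


def toy_visualize(rules):
    # Stand-in visualization agent: render each range rule as one text line.
    return [f"{name}: {lo} <= x <= {hi}"
            for name, (lo, hi) in sorted(rules.items())]
```

For example, `cascade(partition, [toy_cluster, toy_classify, toy_visualize])` runs one vertical partition through all three stages in order, mirroring the statement that the output of one agent is the input of the next.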

    3. A Comparison of ODM, MS SQL Server and UDMTool

    A comparison is drawn between ODM, MS SQL Server and UDMTool in table 1.

    Table 1. A Comparison of ODM, MS SQL Server and UDMTool

    Each numbered row below compares the three tools on one point.

    (1) ODM: It is not a magic wand. The user has to manually select an appropriate data mining algorithm from the available pool, and if the selected algorithm does not produce the required results, another one has to be chosen. In this suite one algorithm serves one data mining task, e.g. k-means for clustering, but the produced clusters present only the groups of the data; they are not knowledge and do not serve any purpose to the user. In order to extract a feature or pattern from the given dataset, one has to combine or unify different algorithms manually, one by one, before the desired results are obtained.
        MS SQL Server: It is not a magic wand. The user has to combine the different data mining algorithms provided by MS on an ad-hoc basis in order to find the solution of the problem. MS SQL Server does not provide any facility which shows that a given combination of algorithms will produce better results for the problem. It provides a facility to view the cluster profiles, which helps the user to select the cluster for further processing.
        UDMTool: It is a magic wand. The tool is based on the Unified Data Mining Theory (UDMT). There is no need to select any data mining algorithm; the tool automatically selects suitable algorithms according to the nature of the data and produces the knowledge in the form of 2D graphs. The processes for the extraction of knowledge from the given datasets are unified, which makes it easier for the user to produce the required results.

    (2) ODM: There is no need to prepare a dataset for mining. It supports already created databases. It also provides the training facility of a dataset.
        MS SQL Server: There is no need to prepare a dataset for mining. It supports already created databases. It also provides the training facility of a dataset.
        UDMTool: The user has to prepare the dataset in the form of a text or data file. The tool does not support any databases.

    (3) ODM: The Java Implementation Interface supports only numeric datasets, and the DBMS_DATA_Mining Interface supports categorical and numeric data.
        MS SQL Server: The suite of MS algorithms supports numeric and categorical datasets.
        UDMTool: The tool supports only numeric datasets because all the programs are implemented in Java.

    (4) ODM: The user has to set parameters for each algorithm in order to produce useful patterns from the dataset. If no parameter is set, the algorithm automatically takes the default values, i.e. the algorithms are not optimized according to the requirements of the given dataset.
        MS SQL Server: The user has to set parameters for each algorithm in order to produce useful patterns from the dataset. If no parameter is set, the algorithm automatically takes the default values, i.e. the algorithms are not optimized according to the requirements of the given dataset. The algorithms in MS SQL Server have more parameters than those in ODM.
        UDMTool: The algorithms are optimized in this tool. Therefore, there is no need to set default parameters.

    (5) ODM: Supports only a limited number of algorithms for each of the data mining tasks, such as clustering and classification. ODM does not provide visualization of the data; for this purpose the user has to export the results to other visualization tools such as MS Excel.
        MS SQL Server: Supports only a limited number of algorithms for each of the data mining tasks, such as clustering and classification. The results of MS SQL Server can be opened in MS Excel using add-ins, which we regard as a separate data visualization facility.
        UDMTool: There is no such limit in the tool; the user can add further algorithms as required. The tool directly provides the visualization of the dataset, which helps the user to draw conclusions and extract knowledge.

    (6) ODM: It provides support for model evaluation using BIC, export and import, comparison and cross validation only in the Java Implementation Interface. Some of the mentioned facilities are not supported by the other implementations of ODM.
        MS SQL Server: Testing the accuracy of mining models is performed through the Mining Accuracy Chart, which plots a Lift Chart showing the performance of different models under different algorithms.
        UDMTool: It provides support only for model evaluation and selection using AIC. If the user wants to import or export any result, copy and paste can be used.

    (7) ODM: ODM implements data mining through Java objects in function settings and algorithm settings.
        MS SQL Server: MS SQL Server uses Data Mining Extensions (DMX), which extend SQL commands.
        UDMTool: UDMTool implements data mining algorithms through Intelligent Agents, developed in Java.

    (8) ODM: A Graphical User Interface is provided by ODM.
        MS SQL Server: An IDE is provided by MS SQL Server. Mining Model Wizards help the user to choose among the different data sources and options provided, e.g. the different MS algorithms, and in this way the system becomes user friendly.
        UDMTool: A Graphical User Interface is provided by UDMTool.

    (9) ODM: There is no such limit in ODM, but if the user is applying the Java language then there may be some constraints.
        MS SQL Server: There is no such limit in MS SQL Server.
        UDMTool: The UDMTool supports: Number of parameters = 23; Number of Attributes = 211; The Sample Size = 12000.

    Table 1 shows that in ODM and MS SQL Server the selection of algorithms is made on an ad-hoc basis. Although both data mining suites provide statistical information about the dataset, this information is not sufficient to extract knowledge from it. The data mining processes clustering, classification and visualization are carried out individually in ODM and MS SQL Server, with no relation between them, so it is difficult to extract knowledge. The proposed UDMTool, on the other hand, unifies all the data mining processes required to extract knowledge, and the data mining algorithm(s) in each process are selected through the value of the model selection criterion AIC and the complexities O of the data mining algorithm(s).

    4. Results and Discussion

    The MS SQL Server, ODM and the UDMTool are tested on a variety of datasets: Diabetes, a medical dataset; Breast Cancer, a medical dataset; Iris, an agriculture dataset; Sales, an accounts dataset; and Cars, a vehicle dataset. We present the results for Breast Cancer. The attributes of the Breast Cancer dataset are: Clump Thickness (CT), Uniformity of Cell Size (UCS), Uniformity of Cell Shape (UCSh), Marginal Adhesion (MAdh), Single Epithelial Cell Size (SECS), Bare Nuclei (BNu), Bland Chromatin (BCh), Normal Nucleoli (NNu), Mitoses and Class (benign, malignant) [19].

    Case 1: The Results of MS SQL Server

    1. The Result of MS Clustering Algorithm

    Figure 5. The Diagram of the Clusters of the Breastcancer dataset

    We apply the MS clustering data mining algorithm, which is similar to the k-means clustering algorithm. Figure 5 shows the 10 clusters of the given dataset without the predictable variable. The solid lines show a strong relation between clusters and the thin lines show a weak relation. As figure 5 shows, there is a strong relation among cluster 1, cluster 7, cluster 3 and the other clusters. On the other hand, there is a weak relation between cluster 1 and 10, cluster 2 and 9, cluster 2 and 6, and cluster 5 and 6. From figure 5 one can only visualize the structure of the clusters and their relations; it is still difficult to produce useful information. The population of each cluster, i.e. its number of records, becomes visible when the cursor is placed on the cluster.


    The MS clustering algorithm produces 10 clusters by default. A user who wants a different number can set it only by programming the MS clustering algorithm; the wizards offer no option for selecting the number of clusters. Why the algorithm produces 10 clusters for every dataset remains an open issue with the MS clustering algorithm. The algorithm uses either horizontal or vertical partitioning. All clustering data mining algorithms are unsupervised machine learning algorithms, so there is no need to specify a predicted or target variable in the dataset. The following tables are extra features available in MS SQL Server 2005.

    Table 2. Clusters Profile

    Variable  States     Pop. (All) Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 7  Cluster 5  Cluster 8  Cluster 6  Cluster 9  Cluster 10
    Size                 233        84         36         27         24         18         14         12         11         5          2
    B Ch      3.27±2.37  3.27±2.37  1.90±0.79  5.76±2.15  1.78±0.80  6.55±2.31  2.19±1.01  5.03±2.01  1.78±0.84  3.31±2.34  2.41±0.80  4.00±1.41
    B Nu      3.22±3.40  3.22±3.40  1.04±0.19  6.53±3.16  1.00       7.52±3.33  1.15±0.37  8.62±2.51  2.62±1.35  3.23±2.17  1.16±0.39  2.00
    Class     benign     164        1.000      0.124      1.000      0.000      1.000      0.069      1.000      0.990      1.000      1.000
    Class     malignant  69         0.000      0.876      0.000      1.000      0.000      0.931      0.000      0.010      0.000      0.000
    Class     missing    0          0.000      0.000      0.000      0.000      0.000      0.000      0.000      0.000      0.000      0.000
    CT        4.15±2.75  4.15±2.75  2.39±1.40  6.09±2.33  3.37±1.66  7.41±2.32  2.88±1.70  8.85±1.26  2.65±1.65  3.49±1.78  3.90±1.13  3.00±2.83
    M Adh     2.63±2.65  2.63±2.65  1.00       4.88±2.49  1.67±0.83  6.47±3.02  1.00±0.02  4.76±3.21  2.98±2.90  1.26±0.63  2.07±1.22  2.00
    Mitoses   1.52±1.61  1.52±1.61  1.00       1.00       1.00       4.72±3.02  1.00       1.93±0.61  1.60±1.89  1.00       1.00       2.00
    N Nuc     2.65±2.83  2.65±2.83  1.00       6.04±3.24  1.13±0.34  6.46±3.11  1.74±0.77  4.69±2.24  1.00       1.00       1.87±0.36  2.50±0.71
    SECS      3.03±2.08  3.03±2.08  1.93±0.37  4.89±2.05  2.00       6.70±2.60  2.00       3.22±1.05  2.00±1.03  2.29±0.82  2.64±0.94  2.00
    UC Sh     2.91±2.81  2.91±2.81  1.00       6.14±2.27  1.93±0.92  7.66±2.54  1.40±0.74  4.19±1.75  1.10±0.32  2.47±1.63  1.19±0.43  1.50±0.71
    UCS       2.81±2.86  2.81±2.86  1.00       5.89±2.52  1.11±0.32  8.02±2.31  2.01±0.86  3.94±1.64  1.00       1.88±1.08  1.31±0.50  2.50±0.71

    Table 2 gives the profile of each cluster over all the attributes of the given dataset. The table also shows the size of each cluster, i.e. the number of records per cluster. The attribute Class has only two values, benign and malignant, and all the other attributes have integer values in the given dataset, but the MS clustering algorithm shows two possible values for each attribute, which may confuse the user. The value of each attribute varies from cluster to cluster. Table 2 is somewhat difficult to interpret.
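The numeric cells of Table 2 read as a mean followed by a deviation. A minimal sketch of such a profile is given below, assuming the deviation is the sample standard deviation. That assumption is not stated in the paper; it is merely consistent with the two-record Cluster 10, where two hypothetical CT values of 1 and 5 would give a mean of 3.00 and a deviation of 2.83, the figures the table reports. The function name is illustrative.

```python
from statistics import mean, stdev


def cluster_profile(values_by_cluster):
    """Map each cluster name to (mean, sample standard deviation), rounded to 2 places.

    Clusters with fewer than two values get a deviation of None.
    """
    profile = {}
    for name, values in values_by_cluster.items():
        spread = round(stdev(values), 2) if len(values) > 1 else None
        profile[name] = (round(mean(values), 2), spread)
    return profile
```

Calling `cluster_profile({"Cluster 10": [1, 5]})` reproduces that mean and deviation pair for the assumed values.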

    Table 3. Clusters Characterizing

    Variables Values Probability

    Class benign Probability = 70.386%

    Class malignant Probability = 29.614%

    B Nu 3.2 - 5.5 Probability = 24.980%

    B Ch 3.3 - 4.9 Probability = 24.980%

    UC Sh 2.9 - 4.8 Probability = 24.980%

    SECS 1.6 - 3.0 Probability = 24.980%

    CT 4.2 - 6.0 Probability = 24.980%

    CT 2.3 - 4.2 Probability = 24.980%

    N Nuc 2.7 - 4.6 Probability = 24.980%

    B Ch 1.7 - 3.3 Probability = 24.980%

    UCS 2.8 - 4.7 Probability = 24.980%


    Table 3 gives the cluster characteristics: the attribute (variable), its value in different clusters and the probability of the variable. The value and the probability of the attributes SECS, MAdh, UCSh, UCS, CT and BNu are high in some clusters compared with the rest of the attributes.
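The two Class probabilities in Table 3 can be checked against the class counts in Table 2 (164 benign and 69 malignant out of 233 records). The helper below is an illustrative sketch, not part of any of the tools.

```python
def class_probabilities(counts):
    """Convert class counts into percentages rounded to three decimals."""
    total = sum(counts.values())
    return {label: round(100.0 * n / total, 3) for label, n in counts.items()}


# Class counts as reported for the whole population in Table 2.
TABLE_2_CLASS_COUNTS = {"benign": 164, "malignant": 69}
```

`class_probabilities(TABLE_2_CLASS_COUNTS)` gives 70.386 for benign and 29.614 for malignant, matching the first two rows of Table 3.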

    Table 4. Cluster Discrimination

    Variables Values Favors Cluster 1 Favors Complement of Cluster 1

    UCS 1.0 Score = 0.000

    UC Sh 1.0 Score = 0.069

    N Nuc 1.0 Score = 0.288

    M Adh 1.0 Score = 0.324

    Mitoses 1.0 Score = 3.879

    B Nu 1.0 - 1.5 Score = 28.032

    UC Sh 1.0 - 10.0 Score = 51.050

    UCS 1.0 - 10.0 Score = 52.941

    M Adh 1.0 - 10.0 Score = 53.583

    N Nuc 1.0 - 10.0 Score = 54.846

    B Nu 1.5 - 10.0 Score = 59.424

    SECS 1.3 - 2.5 Score = 61.108

    Mitoses 1.0 - 10.0 Score = 64.856

    SECS 2.5 - 10.0 Score = 76.031

    Class benign Score = 79.337

    Class malignant Score = 79.337

    B Ch 1.0 - 2.8 Score = 80.075

    B Ch 2.8 - 10.0 Score = 85.118

    CT 1.0 - 3.3 Score = 90.353

    CT 3.3 - 10.0 Score = 91.269

    SECS 1.0 - 1.3 Score = 96.887

    Table 4 shows the cluster discrimination; only the results for cluster 1 are shown. The table lists the values that favor cluster 1 and those that favor its complement. The results for the remaining clusters can be displayed similarly. These are the three options available after applying the MS clustering algorithm.

    2. The Results of MS Decision Tree Algorithm

    Figure 6. The Decision Tree of the Breastcancer dataset


    We apply the MS Decision Tree algorithm, which is the ID3 data mining algorithm, to the Breast Cancer dataset. Figure 6 depicts the structure of the decision tree. Our proposed UDMTool produces rules instead of a tree. In MS SQL Server, one has to apply MS Association Rules in order to get the decision rules.

    3. The Results of MS Association Rules

    Table 5. The Association Rules

    Support Size Itemset

    196 1 Mitoses < 1.1818008626

    164 1 Class = benign

    160 2 Class = benign, Mitoses < 1.1818008626

    154 1 SECS < 2.352245034

    148 2 SECS < 2.352245034, Mitoses < 1.1818008626

    147 1 N Nuc < 1.4350025798

    145 2 SECS < 2.352245034, Class = benign

    144 2 N Nuc < 1.4350025798, Mitoses < 1.1818008626

    142 3 SECS < 2.352245034, Class = benign, Mitoses < 1.1818008626

    141 2 N Nuc < 1.4350025798, Class = benign

    140 3 N Nuc < 1.4350025798, Class = benign, Mitoses < 1.1818008626

    140 1 UCS < 1.6782988738

    Table 5 shows the association rules of the Breast Cancer dataset. We show only the itemsets with the highest support values; the MS Association Rules algorithm otherwise produces a long list, which can leave the user unsure how to select specific values and get the required results. It is important to note that the MS Association algorithm has to be applied to obtain rules, because the decision tree in MS SQL Server does not produce decision rules. The proposed UDMTool uses the C4.5 data mining algorithm for classification and produces only a few rules in if-then-else form, which make it easy for the user to take decisions.
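Table 5 lists itemsets with their support counts, not rules with a direction. Under the standard definition, the confidence of a rule A => B is support(A and B) / support(A), so the table's counts are enough to derive it. The sketch below applies that definition to two rows of Table 5; the function and constant names are illustrative.

```python
def confidence(support_itemset, support_antecedent):
    """Confidence of A => B: support(A and B) divided by support(A)."""
    if support_antecedent <= 0:
        raise ValueError("antecedent support must be positive")
    return support_itemset / support_antecedent


# Support counts taken from Table 5.
SUPPORT_MITOSES_LOW = 196             # Mitoses < 1.1818008626
SUPPORT_BENIGN = 164                  # Class = benign
SUPPORT_BENIGN_AND_MITOSES_LOW = 160  # both together
```

With these counts, the rule "Mitoses < 1.18 => Class = benign" has a confidence of about 0.816, while the reverse rule "Class = benign => Mitoses < 1.18" reaches about 0.976. The same itemset therefore supports two rules of quite different strength.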

    Case 2: The Results of ODM 11g2

    Unlike MS SQL Server, ODM offers no option to save the results of each data mining process, so the results are saved using print screen. Figure 7 depicts the workflow of the clustering model; the other data mining models, such as classification, association and feature selection, are applied similarly.

    Figure 7. The Workflow of the Clustering Model

    The ODM provides the user with a visual workflow for each model. Figure 7 shows the workflow of the clustering model. The data source, which is an Oracle table or a dataset, is the required component; the other component is explore data, which is basically a view of the dataset and which we consider optional; and the last


    component is a model, which is one of the data mining processes listed by ODM, such as clustering, classification, association and feature selection. The user can apply only one model at a time, which is why we refer to ODM as a single-step tool. A link is created between the data source and explore data, and between the data source and the model. Finally, the model is built: the ODM applies all the available data mining algorithms in the model, and the user can compare the results of all algorithms and also view the results of a particular algorithm. The user can also store the results in a separate table.

    1. The Enhanced k-means Clustering Algorithm

    Figure 8. The Results of the K-means Clustering Model

    We apply the enhanced k-means clustering algorithm of ODM. The algorithm uses the top-down, or divisive, technique of hierarchical clustering. ODM offers an option to set the parameters of the algorithm; if they are not set, ODM uses the defaults. We test the dataset with the default parameters. The ODM creates the clusters in a tree structure, shown in figure 8. The ODM also characterizes each cluster, giving the centroids and the cluster rules separately, which gives the user a better understanding of the cluster. In this way we assume that the ODM is unifying the clustering and classification processes.
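The top-down idea can be illustrated with a generic bisecting k-means on one-dimensional values. This is not ODM's implementation, whose internals the paper does not describe; the choice to always split the largest cluster, the min/max seeding and the function names are all assumptions made for the sketch.

```python
def two_means(values, iterations=20):
    """Split 1-D values into two groups with Lloyd's algorithm, seeded at min and max."""
    low_center, high_center = min(values), max(values)
    for _ in range(iterations):
        low = [v for v in values if abs(v - low_center) <= abs(v - high_center)]
        high = [v for v in values if abs(v - low_center) > abs(v - high_center)]
        if not high:  # all values are identical, nothing to split
            break
        low_center = sum(low) / len(low)
        high_center = sum(high) / len(high)
    return low, high


def bisecting_kmeans(values, k):
    """Divisive clustering: keep splitting the largest cluster until k clusters exist."""
    clusters = [list(values)]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()
        low, high = two_means(largest)
        if not high:  # the largest cluster cannot be split further
            clusters.append(low)
            break
        clusters.extend([low, high])
    return sorted(clusters, key=min)
```

Each split adds one level to a binary tree of clusters, which is the kind of tree structure the text describes.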


    Figure 9. The Results of the K-means Clustering Model with Centroid

    Figure 9 shows the centroid values of a cluster. The centroid values play no role in the knowledge extraction from a dataset.

    Figure 10. The Results of the K-means Clustering Model with Cluster Rules

    Figure 10 shows the rules of a cluster, which is a task of the classification data mining process. The rules of a cluster, also known as decision rules, play a vital role in the knowledge extraction from a dataset. Our proposed UDMTool, on the other hand, provides the decision rules of each cluster in the next step by using C4.5, a classification data mining algorithm. The user can apply these decision rules in simple queries for further validation of the results.
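Applying a decision rule as a simple query could look like the sketch below. The rule, its thresholds and the four records are made up for illustration; they are not rules produced by the UDMTool or ODM, and the records are not taken from the Breast Cancer dataset.

```python
def rule_matches(record, conditions):
    """True if the record satisfies every (attribute, upper_bound) condition."""
    return all(record[attr] <= bound for attr, bound in conditions)


def rule_accuracy(records, conditions, predicted_class):
    """Share of covered records whose Class equals predicted_class (None if none are covered)."""
    covered = [r for r in records if rule_matches(r, conditions)]
    if not covered:
        return None
    return sum(r["Class"] == predicted_class for r in covered) / len(covered)


# Hypothetical rule: IF UCS <= 2 AND BNu <= 3 THEN Class = benign
RULE = [("UCS", 2), ("BNu", 3)]

# Made-up records, for illustration only.
RECORDS = [
    {"UCS": 1, "BNu": 1, "Class": "benign"},
    {"UCS": 2, "BNu": 3, "Class": "benign"},
    {"UCS": 1, "BNu": 2, "Class": "malignant"},
    {"UCS": 8, "BNu": 10, "Class": "malignant"},
]
```

`rule_accuracy(RECORDS, RULE, "benign")` returns the fraction of rule-covered records that really are benign. The same check could be written as a WHERE clause over a database table.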

    2. The Results of Classification using the Decision Tree Algorithm

    Figure 11. The Decision Tree Algorithm with Decision Rules

    We apply the decision tree algorithm from the classification model of ODM; the results are shown in figure 11.
    The algorithm creates a tree structure of clusters and characterizes each cluster in the form of rules,
    surrogates and target values. Furthermore, the number of clusters produced by the decision tree algorithm
    differs from the number produced by the enhanced k-means clustering algorithm. The decision rules help the
    user to understand each cluster better. On this basis we assume that ODM unifies the clustering and
    classification processes.

    Figure 12. The Decision Tree Algorithm with Surrogates

    Figure 12 shows the surrogates of a cluster. The surrogate values play no role in the knowledge extraction
    from a dataset.

    Figure 13. The Decision Tree Algorithm with Target Values

    Figure 13 shows the target values in a cluster. The percentage of the target values varies from cluster to
    cluster. We can say that the target values play no role in the knowledge extraction from a dataset.
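    The per-cluster percentage of target values that figure 13 reports is a simple count-based statistic. A minimal
    sketch of how it could be computed is given below; the function name and the sample (cluster id, class label)
    pairs are made up for the illustration and are not the output of ODM.

```python
from collections import Counter, defaultdict

def target_percentages(assignments):
    """Map each cluster id to {target value: percentage of that cluster's records}."""
    counts = defaultdict(Counter)
    for cluster_id, target in assignments:
        counts[cluster_id][target] += 1
    return {
        cluster_id: {t: 100.0 * n / sum(c.values()) for t, n in c.items()}
        for cluster_id, c in counts.items()
    }

# Made-up (cluster id, class label) pairs.
sample = (
    [(0, "benign")] * 9 + [(0, "malignant")]
    + [(1, "malignant")] * 3 + [(1, "benign")]
)
shares = target_percentages(sample)
for cluster_id, share in shares.items():
    print(cluster_id, share)
```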

    Remark: After applying the clustering and classification models of ODM, it is difficult for the user to select
    the right model, because both models first create the clusters and then produce the rules of each cluster, yet
    their outputs are not the same. In UDMTool the first process is clustering, followed by classification and
    visualization; therefore, there is no such problem in a multi-step tool. We can say that the results of the
    clustering model are accurate, because in the data mining process model the clusters are created first and
    then the rest of the processes are applied to extract the useful information and knowledge.


    Case 3: The Results of UDMTool

    The UDMTool produces 2D scatter graphs as the final output(s) for the Breastcancer dataset, which can be
    interpreted as knowledge.

    Figure 14. The Graph between UCSh and MAdh attributes of Breastcancer dataset

    The graph in figure 14 can be divided into two regions: in the first region the values of the attributes
    Uniformity of Cell Shape and Marginal Adhesion vary, and in the second region they are constant. The outcome
    of this graph is that if the values of the attributes are variable then the patient has the malignant class of
    breast cancer, and if they are constant the patient has the benign class of breast cancer.

    Figure 15. The Graph between BCh and Mitoses attributes of Breastcancer dataset

    The values of the attributes Mitoses and Bland Chromatin are almost constant throughout the graph in figure 15.
    The graph can be divided into two main regions: the values of Bland Chromatin and Mitoses vary in the first
    region and remain constant in the second. The outcome of this graph is that if the values of the attributes
    are variable then the patient has the malignant class of breast cancer; otherwise, for constant values, the
    patient has the benign class of breast cancer.

    Table 6 below summarizes the results of the data mining processes clustering, classification and visualization
    using MS SQL Server, ODM and UDMTool.

    Table 6. Summary of the output

    Data Mining Process: Clustering

        MS SQL Server
        1. Uses the MS Clustering algorithm and creates 10 clusters by default (the number of clusters is not
           optimized).
        2. Provides the further characterization of each cluster, such as the population per cluster and the
           probability of each input variable.
        3. Provides the bindings (weak or strong) among clusters.
        Remark: The population and probability of the clusters alone are not sufficient to extract knowledge.

        ODM
        1. Uses the K-means Clustering algorithm and creates 10 clusters by default (the number of clusters is
           not optimized) in a hierarchical structure.
        2. Provides the further characterization of each cluster, such as centroids and cluster rules.
        Remark: Cluster rules are basically an output of the classification data mining process. In this way ODM
        unifies the clustering and classification data mining processes, which is a step forward towards
        knowledge extraction.

        UDMTool
        1. Uses the K-means Clustering algorithm and creates 2 clusters according to the target values of the
           input variable of the given dataset, because the number of clusters is optimized.
        2. Provides the further characterization of each cluster, such as the population per cluster.

    Data Mining Process: Classification

        MS SQL Server
        1. Uses the MS Decision Tree algorithm and creates a horizontal tree of the whole dataset. The tree has
           8 nodes in total.
        2. The rules of the dataset can be created by another algorithm, MS Association. The list of rules is
           very long and sometimes misleading, and it confuses the user in the selection of the important and
           best rules. In this way the user has to apply two data mining algorithms to obtain the decision rules.
        Remark: The nodes of the tree do not reflect the knowledge.

        ODM
        1. Uses the Decision Tree algorithm and creates a hierarchical tree of the whole dataset. The tree has
           7 nodes in total.
        2. Provides the further characterization of each node by surrogates, decision rules and the percentage
           of target values in each node.
        3. The decision rules are in the (if-then-else) form, which can be deployed in a simple query.
        Sometimes it looks as if there is no difference between the clustering and classification models in ODM;
        the only difference is in the characterization. The decision rules vary from cluster to cluster.
        Remark: There is still confusion in the selection of the results of these two data mining processes in
        ODM.

        UDMTool
        1. Uses the output(s) of the clustering process as input, applies the C4.5 (Decision Tree) algorithm and
           classifies each cluster by providing the decision rules as output.
        2. The number of decision rules varies from cluster to cluster. The list of decision rules is not as
           long as in MS SQL Server. This is also referred to as the characterization of classified clusters.
        3. The output of this process is in the (if-then-else) form, as in ODM, which can be deployed in a
           simple query.

    Data Mining Process: Visualization

        MS SQL Server
        No visualization model/algorithm is provided, although MS SQL Server provides a GUI in each process of
        data mining. The user can save the results and use MS Excel as a visualization tool.
        Remark: The data mining processes are not unified; each process is carried out individually, therefore
        it is difficult to extract the knowledge.

        ODM
        No visualization model/algorithm is provided, although ODM provides a GUI in each process of data
        mining. The user can save the results and use MS Excel as a visualization tool.
        Remark: The data mining processes clustering and classification are unified, which is a step forward in
        knowledge extraction.

        UDMTool
        Provides the 2D graphs of each classified cluster, which help the user to visualize and then interpret
        the results and finally extract the knowledge.
        Remark: The data mining processes are unified, which makes it easier for the user to extract the
        knowledge.

    Conclusion

        MS SQL Server
        A single-step data mining tool where the selection of algorithms is ad-hoc and it is difficult to
        extract knowledge.

        ODM
        A single-step (to some extent a multi-step) data mining tool where the selection of algorithms is ad-hoc
        and the knowledge extraction is easier than in MS SQL Server.

        UDMTool
        A multi-step data mining tool where the selection of algorithms is automatic (based on the value of the
        dataset) and the knowledge extraction is very simple.

    We test Breastcancer, a medical dataset, on three different data mining tools: MS SQL Server, ODM and UDMTool.
    The obtained results are different although we used the same data mining algorithm in each process of data
    mining. Firstly, in MS SQL Server and ODM some of the inputs of the data mining algorithms are not optimized,
    whereas UDMTool uses optimized algorithms. Secondly, the data mining processes clustering, classification and
    visualization are carried out individually in MS SQL Server and there is no relation between them; therefore,
    it is difficult to extract the knowledge. In ODM, clustering and classification are unified, which helps the
    user to extract the knowledge. In UDMTool the data mining processes are unified: the output of clustering is
    the input of classification and the output of classification is the input of visualization, which provides the
    user with knowledge.
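    The chaining described above, where each process consumes the output of the previous one, can be sketched as a
    plain function pipeline. The three steps below are deliberately toy stand-ins (a mean-based split instead of
    k-means, a value-range description instead of C4.5, and a list of plot points instead of a drawn graph); all
    names and records are hypothetical and do not reproduce UDMTool.

```python
from functools import reduce

def run_pipeline(dataset, steps):
    """Feed the dataset through the steps in order; each step consumes the previous output."""
    return reduce(lambda data, step: step(data), steps, dataset)

def cluster(records):
    """Toy clustering: split records at the mean of the first attribute (not k-means)."""
    cut = sum(r[0] for r in records) / len(records)
    return {
        "low": [r for r in records if r[0] <= cut],
        "high": [r for r in records if r[0] > cut],
    }

def classify(clusters):
    """Toy classification: describe each cluster by the range of its first attribute (not C4.5)."""
    return {
        name: {
            "rule": f"{min(r[0] for r in rs)} <= a0 <= {max(r[0] for r in rs)}",
            "records": rs,
        }
        for name, rs in clusters.items()
        if rs
    }

def visualize(classified):
    """Toy visualization: return the (x, y) points a 2D scatter graph would draw per cluster."""
    return {name: [(r[0], r[1]) for r in info["records"]] for name, info in classified.items()}

# Made-up two-attribute records.
records = [(1, 1), (2, 1), (1, 2), (8, 7), (9, 4)]
plots = run_pipeline(records, [cluster, classify, visualize])
print(plots)
```

    Because every step takes exactly what the previous step returns, the user supplies the dataset once and never
    has to save and reload intermediate results by hand.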

    5. Conclusion

    The conclusion is that in MS SQL Server the selection of the data mining algorithms, also called the MS data
    algorithms, is easy, but the choice of the algorithm depends on the user and not on the data. The user has to
    select different algorithms at each step of the data mining process to obtain the knowledge, which is the
    primary goal of data mining. In a single-step data mining tool like MS SQL Server, if one algorithm does not
    provide the required results, the user has to choose another one. In ODM the processes of clustering and
    classification are unified, i.e. if the user applies the clustering algorithm it automatically produces the
    rules of each cluster. Similarly, if the user chooses the classification algorithm it first produces the
    clusters and then the rules of each cluster. This is a step towards a multi-step knowledge extraction process,
    but again the choice of the algorithm depends on the user and not on the data. ODM provides a workflow
    facility, which is helpful for the user. It is clear from the above results that it is difficult for the user
    to extract knowledge from ODM, although the tool provides a lot of statistical information about the given
    dataset. We conclude that no single algorithm can produce the knowledge in single-step data mining tools like
    MS SQL Server and ODM, because knowledge extraction is a multi-step process, and our proposed UDMTool is the
    better choice.

    Another issue in the single-step tools is that the data mining algorithm selected for a particular task takes
    the whole dataset and produces the results. The produced results are not the inputs of other data mining
    tasks; therefore, it is difficult to extract knowledge from the given dataset. This is because in single-step
    tools each data mining task is carried out individually instead of the tasks being unified. Unification is
    only possible if the output of one data mining task is the input of the next task, i.e. the output of the
    clustering task must be the input of the classification process, which is not possible in the single-step
    tools. One possible workaround is that the user saves the results of the first step in a separate dataset and
    then applies the newly created dataset as input to the next step, which we believe is a lengthy process
    because saving the results and preparing a dataset is not very simple. It is clear from the results of both
    single-step data mining tools, MS SQL Server and ODM, that no single algorithm can produce the knowledge,
    because knowledge extraction is a multi-step process, and our proposed UDMTool is the better choice.
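    The manual workaround mentioned above, saving the results of the first step as a separate dataset and loading
    it as input to the next step, might look roughly like the sketch below. The column names, the records and the
    cluster labels are made up, and an in-memory buffer stands in for a file on disk.

```python
import csv
import io

def save_clustered(records, labels, handle):
    """Write each record with its cluster label appended as the last column."""
    writer = csv.writer(handle)
    writer.writerow(["ucsh", "madh", "cluster"])  # hypothetical column names
    for record, label in zip(records, labels):
        writer.writerow([*record, label])

def load_cluster(handle, wanted):
    """Read the saved dataset back and keep only the rows of one cluster."""
    reader = csv.DictReader(handle)
    return [
        (int(row["ucsh"]), int(row["madh"]))
        for row in reader
        if int(row["cluster"]) == wanted
    ]

records = [(1, 1), (2, 1), (8, 7), (9, 4)]
labels = [0, 0, 1, 1]  # output of a first (clustering) step, made up here

buffer = io.StringIO()  # stands in for a file on disk
save_clustered(records, labels, buffer)
buffer.seek(0)
next_input = load_cluster(buffer, wanted=1)
print(next_input)
```

    Even in this small form the user has to decide on a file layout, write it, and parse it again for every
    cluster, which is the extra effort that a unified multi-step tool avoids.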

    Acknowledgement

    The authors are thankful to The Islamia University of Bahawalpur, Pakistan for providing financial assistance to

    carry out this research activity under HEC project 6467/F II.

    References

    [1] Berry, M.J., Data Mining Techniques: For Marketing, Sales and Customer Relationship Management,

    Hoboken, NJ, USA: John Wiley & Sons Incorporated, pp. 35, 2004.

    [2] Skrypnik, Irina, Terziyan, Vagan, Puuronen, Seppo, and Tsymbal, Alexey, Learning Feature Selection for
    Medical Databases, CBMS 1999.

    [3] Peng, Y., Kou, G., Shi, Y., Chen, Z., A Descriptive Framework for the Field of Data Mining and Knowledge
    Discovery, International Journal of Information Technology and Decision Making, Vol. 7, Issue 4, pp. 639-682,
    2008.


    [4] Grossman, Robert, Kasif, Simon, Moore, Reagan, Rocke, David and Ullman, Jeff, Data Mining Research:
    Opportunities and Challenges, A Report of three NSF Workshops on Mining Large, Massive, and Distributed
    Data, (Draft 8.4.5) January 21, 1998.

    [5] Yang, Qiang, Wu, Xindong, 10 Challenging Problems in Data Mining Research, International Journal of
    Information Technology & Decision Making, Vol. 5, No. 4, pp. 597-604, 2006.

    [6] Wu, Xindong, Kumar, Vipin, Quinlan, J. Ross, et al., Top 10 algorithms in data mining, Survey Paper,
    Knowl Inf Syst (2008) 14:1-37, 2008.

    [7] Das, Somenath, "Unified data mining engine as a system of patterns", Master's Theses, Paper 3440,
    http://scholarworks.sjsu.edu/etd_theses/3440, 2007.

    [8] Singh, Shivanshu K., Eranti, Vijay Kumer, Fayad, M.E., Focus Group on Unified Data Mining Engine (UDME
    2010): Addressing Challenges, Focus Group Proposal, 2010.

    [9] CRISP-DM 1.0, Step-by-step data mining guide, at URL: http://www.crisp-dm.org/CRISPWP-0800.pdf

    [10] Oracle Data Mining Concepts 10g Release 2 (10.2), at URL:
    http://docs.oracle.com/html/B14339_01/5dmtasks.htm

    [11] US Census Bureau. Iris, Diabetes, Vote and Breast datasets, at URL: www.sgi.com/tech/mlc/db, visited 2009.

    [12] Microsoft website, at URL: http://msdn.microsoft.com/en-us/library/bb510508(v=sql.105).aspx

    [13] Oracle Data Mining (ODM) Concepts, 10g Release 1 (10.1), Part Number B10698-01, at URL:
    http://docs.oracle.com/cd/B12037_01/datamine.101/b10698/, 2003.

    [14] Utley, Craig, Introduction to SQL Server 2005 Data Mining, at URL:
    http://msdn.microsoft.com/en-us/library/ms345131(v=sql.90).aspx, 2005.
