Expert Systems with Applications 39 (2012) 11335–11348

Evaluation of an integrated Knowledge Discovery and Data Mining process model

Sumana Sharma *, Kweku-Muata Osei-Bryson, George M. Kasper

Virginia Commonwealth University, School of Business, 301 W Main St., Richmond, VA 23220, United States

Keywords: Evaluation; Knowledge Discovery and Data Mining (KDDM) process models; CRISP-DM; IKDDM; Analytical testing

0957-4174/$ - see front matter © 2012 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2012.02.044

* Corresponding author. Tel.: +1 804 519 8085. E-mail addresses: [email protected] (S. Sharma), [email protected] (K.-M. Osei-Bryson), [email protected] (G.M. Kasper).

Abstract

Data Mining projects are implemented by following the knowledge discovery process. This process is highly complex and iterative in nature and comprises several phases, starting with business understanding, followed by data understanding, data preparation, modeling, evaluation and deployment or implementation. Each phase comprises several tasks. Knowledge Discovery and Data Mining (KDDM) process models are meant to provide prescriptive guidance for the execution of the end-to-end knowledge discovery process, i.e. such models prescribe how exactly each of the tasks in a Data Mining project can be implemented. Given this role, the quality of the process model used affects the effectiveness and efficiency with which the knowledge discovery process can be implemented, and therefore the outcome of the overall Data Mining project. This paper presents the results of a rigorous evaluation of the Integrated Knowledge Discovery and Data Mining (IKDDM) process model and compares it to the CRISP-DM process model. Results of statistical tests confirm that the IKDDM leads to more effective and efficient implementation of the knowledge discovery process.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

Today, data driven decision making is considered the cornerstone of modern organizational strategy. It involves the mining of large volumes of data in the quest for discovering nuggets of knowledge. In recent years Data Mining practitioners and researchers (e.g. CRISP-DM; Cios, Teresinska, Konieczna, Potocka, & Sharma, 2000; Kurgan & Musilek, 2006; Shearer, 2000) have recognized the need for formal Data Mining process models that prescribe the journey from data to knowledge. Kurgan and Musilek (2006) noted with regard to Data Mining: "Before any attempt can be made to perform the extraction of this useful knowledge, an overall approach that describes how to extract knowledge needs to be established". The Knowledge Discovery and Data Mining (KDDM) process is a multiphase process that includes: business understanding (also sometimes referred to as domain understanding), data understanding, data preparation, modeling, evaluation and deployment or implementation phases (see Fig. 1). The KDDM process is highly iterative and complex, as each phase involves multiple tasks, and there are numerous intra-phase and inter-phase dependencies between the various tasks of the process.

Several KDDM process models have been proposed by researchers and practitioners. Examples include Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (1996), Cabena, Hadjinian, Stadler, and Verhees (1998), Cios et al. (2000), CRISP-DM (2003), and Berry and Linoff (1997). In a poll conducted by KDNuggets, 42% of the respondents chose CRISP-DM as the main methodology used by them for Data Mining (KDNuggets, 2007).

Sharma and Osei-Bryson (2010) identified some significant limitations in existing KDDM process models and presented an integrated KDDM (IKDDM) process model to address these limitations. Since a KDDM process model is a design artifact, it should be subjected to formal evaluation, as such an evaluation provides essential feedback which can then be used to refine the given artifact. It should be noted that to date there have been no published research studies on the formal evaluation of any of the KDDM process models. In this paper, we follow the methodology of Hevner, March, Park, and Ram (2004) to present the results of the formal evaluation of the static qualities of the IKDDM process model. We also compare the performance of the IKDDM process model with that of the CRISP-DM process model.

The rest of the paper is organized as follows: Section 2 provides an overview of the KDDM process and includes a discussion of several serious limitations of previously proposed KDDM process models; Section 3 describes the measurement instrument used for comparing the quality of the IKDDM process model versus the CRISP-DM process model; Section 4 presents our evaluation methodology and the statistical results of the analytical testing; and Section 5 presents a discussion of significant findings.

2. Overview of the KDDM process

Knowledge Discovery and Data Mining or KDDM process models serve the purpose of a roadmap or guide that provides prescriptive guidance on how each task in the end-to-end process can be implemented. They can be regarded as a reference guide or manual that describes 'what' tasks should be executed in the context of a Data Mining project and 'how' they should be executed.

[Fig. 1. Typical view of KDDM process models: a cycle of business understanding, data understanding, data preparation, modeling, evaluation and deployment phases centered on the data.]

2.1. Overview of the phases of the KDDM process

Table 1 provides a brief description of the different phases of the KDDM process, as presented by various process models.

2.2. Limitations of previous KDDM models

Given their prescriptive role and the fact that one essentially relies on a chosen KDDM process model to execute a Data Mining project, it is apparent that the quality of the model used to implement the knowledge discovery process has a strong effect on the effectiveness and efficiency with which the relevant Data Mining project can be executed, as well as on the outcome of the project. Our previous detailed review (Sharma & Osei-Bryson, 2010) of previously proposed KDDM process models revealed that they suffer from several significant limitations, which are discussed below.

2.2.1. Checklist oriented description and lack of tool support

While all KDDM process models acknowledge the complexity of the KDDM process, they still describe the complicated KDDM process in terms of a list of steps or tasks. While the steps outlined may be valid (such as formulate a Data Mining objective, choose an appropriate modeling algorithm, evaluate the modeling results, etc.), they are at best a broad guideline, and do not provide assistance on "how to" execute the tasks. The issue of lack of tool support by KDDM models has been previously identified in the literature (Charest, Delisle, Cervantes, & Shen, 2006). It is important to note that this problem is especially compounded in the case of a process model such as CRISP-DM, which outlines a total of 288 activities to be executed in the context of a Data Mining project. Without support in the form of 'how' to execute this long list of activities, it is likely that many of these tasks will be completely overlooked or not adequately implemented when the Data Mining project is implemented.

Given that a KDDM process requires a user to make numerous decisions (Fayyad et al., 1996), it is necessary that the process models be complemented by support in the form of appropriate tools and techniques for carrying out the various tasks. Charest et al. (2006) note that existing process models 'only provide general directives; however what a non-specialist really needs are explanations, heuristics and recommendations on how to effectively carry out the particular steps of the methodology'. Lack of decision support for tasks can result in certain tasks not being executed during the knowledge discovery process. Given the numerous task–task dependencies, each task helps drive other tasks (its output may be used as input by one or more tasks). Therefore not executing a task can quickly translate into either not implementing or not effectively implementing the succeeding tasks in the model.

2.2.2. Fragmented design

The limitation outlined above leads to another issue in the form of the fragmented design of the existing KDDM process models. In other words, the process models do not capture or highlight the important dependencies that exist in a typical KDDM process. By dependencies we mean the interrelationships between the various steps, or between the various phases and tasks (of the same and different phases) of a KDDM project. The dependency which is most obvious from Fig. 1 is the phase–phase dependency resulting from the ordering of phases proposed by the KDDM process model.

Table 1
Phases of the Knowledge Discovery and Data Mining (KDDM) process.

Business understanding: This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a Data Mining problem definition and a preliminary plan designed to achieve the objectives.

Data understanding: This phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data or to detect interesting subsets to form hypotheses for hidden information.

Data preparation: This phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record and attribute selection as well as transformation and cleaning of data for modeling tools.

Modeling (Data Mining): In this phase, various modeling techniques (e.g. decision tree, regression, clustering) are selected and applied and their parameters are calibrated. The CRISP-DM documentation points out that typically there are several techniques for the same Data Mining problem type. Some techniques have specific requirements on the form of data and therefore stepping back to the data preparation phase is often necessary.

Evaluation: This phase of the project consists of thoroughly evaluating the model and reviewing the steps executed to construct the model to be certain that it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not perhaps been sufficiently considered. At the end of this phase, a decision on the use of the Data Mining results should be reached.

Deployment: Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. According to the CRISP-DM process model, depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable Data Mining process across the enterprise.

That the KDDM models recommend executing the business understanding phase ahead of the data understanding phase suggests that the data understanding phase must be utilizing the output of the business understanding phase. These dependencies are critical, as they cannot be reversed without detrimental effects or even the inability to execute a particular phase. Further, it is important to consider that a phase really comprises various tasks. Therefore, the output of a phase is really the output of the diverse array of tasks that lie within it. Clearly, a task-level view of a process model should explicate and highlight these dependencies. These dependencies are not obvious from the phase-level view of the knowledge discovery process presented by existing process models.

2.2.3. Absence of an integrated view

Identification of task–task dependencies (between tasks of the same phase and of different phases) is the first step towards building an integrated process model. The importance of such a model has been acknowledged in the literature (Brachman & Anand, 1996; Kurgan & Musilek, 2006). The integrated process model can also subsequently be used to enable the semi-automation (Kurgan & Musilek, 2006) or automation of some of the well understood tasks of the process. There is a general understanding that it is only the task of implementation of Data Mining methods (the modeling phase) which can be automated (Berry & Linoff, 2000). Recently, however, researchers have also attempted to automate certain other tasks, such as selection of appropriate modeling techniques or algorithms (Bernstein, Provost, & Hill, 2005), which were once performed manually by the human user. Clearly, the same opportunity lies in the other phases of the knowledge discovery process, where certain tasks could be semi-automated if not completely automated to increase the overall efficiency and effectiveness of the knowledge discovery process.

2.2.4. Conspicuous lack of support for tasks of the business understanding phase

Review of published Data Mining case studies (refer to Berry & Linoff, 2000) reveals that the business understanding phase of KDDM projects is often implemented in an ad hoc manner. Hardly any published Data Mining case studies actually provide a detailed description of how this phase was formally implemented. We believe that the reason for such an unstructured approach is the general lack of support on 'how' the tasks of this phase can be implemented.

This issue has been highlighted and somewhat addressed by Pyle (2003), who describes how real world business problems (to be addressed through Data Mining) can be modeled. While the author has not based his approach on any particular DM methodology, he discusses various tools to carry out many (but not all) of the activities prescribed under the BU phase of the CRISP-DM methodology. However, these are only presented in a linear fashion, with the description of each activity followed by a brief description of a proposed tool. The overall framework, which consists of nested sequences of action boxes, discovery boxes, technique boxes and example boxes, is complicated to navigate, and may appear to be cumbersome or even cost prohibitive to actors involved in carrying out the critical business understanding phase.

The description of the user guide portion of the CRISP-DM methodology (CRISP-DM, 2003) also purports to provide detailed advice about "how" to execute the KDDM activities outlined in the model. The only applicable tool mentioned in this phase is the use of an organizational chart to "identify divisions, manager's names and responsibilities, etc." Clearly, organizations also need support for the diverse array of other activities associated with this important phase. Besides, the usefulness of organizational charts, a primarily static entity, for identifying organizational actors and their interrelationships can also be debated. Formally implementing the business understanding phase is just as important as implementing the modeling phase or any other phase of the Data Mining project (Sharma & Osei-Bryson, 2008). Perhaps the business understanding phase is even somewhat more important than the other phases, given that a number of decisions about the other phases (such as modeling, data preparation, data understanding, and evaluation) are made, or ideally should be made, during the BU phase (see Fig. 2).

2.3. IKDDM: overcoming the limitations of existing KDDM models

The various limitations plaguing existing KDDM models motivated the design of a new KDDM process model in the form of the Integrated Knowledge Discovery and Data Mining process model, or IKDDM (Sharma, 2008). A preliminary version of the IKDDM model was also presented in Sharma and Osei-Bryson (2010). All the identified limitations of previously proposed KDDM process models were used as design requirements in creating this new KDDM process model. The design requirements are summarized in Table 2.

[Fig. 2. Relationship between the business understanding phase and other phases of the KDDM process (Sharma & Osei-Bryson, 2008). The figure lists the tasks of each phase (business understanding: determine business objectives, assess situation, determine data mining goals, produce a project plan; data understanding: collect initial data, describe data, explore data, verify data quality; data preparation: select, clean, construct, integrate and format data; modeling: select modeling technique, generate test design, build model, assess model; evaluation: evaluate results, review process, determine next steps; deployment: plan deployment, plan monitoring and assistance, produce final report, review report) and shows that the output of the business understanding phase determines the relevant data, whether the target variable should be discretized, the applicable modeling techniques, the evaluation criteria, and the project plan for the remaining phases, including deployment.]

Table 2
Design requirements for the IKDDM model.

Issue with existing KDDM process models (as-is situation): Description of the KDDM process in a checklist manner and lack of tool support. Design requirement for the IKDDM model (to-be situation): Present a user-oriented coherent description of the KDDM process and prescribe approaches for offering decision support towards all tasks in all phases described in the integrated KDDM model.

Issue: Fragmented view of the KDDM process. Design requirement: Develop an integrated view of the KDDM process by explicating the various phase–phase and task–task dependencies.

Issue: The fragmented view acts as a hindrance to building an integrated process model and "semi-automating" tasks. Design requirement: Leverage the dependencies explicated in the integrated process model to drive semi-automation of tasks, wherever possible.

Issue: Visible lack of support for the execution of tasks of the business understanding phase, the foundational phase of a KDDM process. Design requirement: Provide support for tasks of this foundational phase and use them as a basis for developing the integrated model.

2.3.1. Development of an integrated view

The IKDDM model presents an integrated view of the KDDM process which is of much higher granularity than the phase-level view of previously proposed process models. The IKDDM model was designed by explicating the numerous dependencies that exist between the various tasks of the KDDM process. Some of the dependencies can be regarded as intra-phase dependencies, as they exist between tasks of the same phase. For example, there is a dependency between the Data Mining objective and the business objective of the business understanding phase, as the former utilizes the latter as its input. Other dependencies can be classified as inter-phase dependencies, as they exist between tasks of different phases. For example, there is a dependency between the Data Mining success criteria (a task within the business understanding phase) and the evaluation of modeling results (a task within the evaluation phase), as the latter utilizes the former as its input. The IKDDM model presents a detailed task-level overview of each of the phases in addition to an integrated view comprising all the phases. See Figs. 3 and 4.

3. Measurement instrument used for assessing the quality of IKDDM and CRISP-DM process models

A design artifact can be evaluated through the following design evaluation methods (Hevner et al., 2004):

• Observational (through case studies and field studies).
• Analytical (through static analysis, architecture analysis, optimization and dynamic analysis).
• Experimental (through controlled experiments and simulation).
• Testing (through functional or black box and structural or white box testing).
• Descriptive (through informed arguments and scenario construction).

[Fig. 3. Business understanding phase of the IKDDM model. The flowchart links tasks such as determine business objective, determine business requirements, determine Data Mining objective and Data Mining success criteria, identify data sources, estimate data collection, operating and implementation costs, approximate tool and personnel related costs, perform cost benefit analysis, obtain the sponsor's go/no-go decision, and develop the project plan, with dependencies passed to the modeling and evaluation phases.]

In this paper we present the results of the analytical evaluation of the IKDDM versus the CRISP-DM process model. Specifically, we conducted a static analysis of the quality of the two KDDM process models as perceived by expert and naïve users of Data Mining. A static analysis helps in evaluating a design artifact on the basis of static or desired qualities. We used the evaluation criteria for assessing the quality of conceptual models proposed by Maes and Poels (2006) to judge the quality of the IKDDM and CRISP-DM process models. A conceptual model defines user requirements and is used for designing information systems. The artifact in the form of the integrated KDDM process model can also be regarded as a conceptual model which could ultimately be used to design an information system to implement the KDDM process. Hence it is reasonable to evaluate it according to guidelines for assessing the quality of conceptual models.

The instrument proposed by Maes and Poels (2006) describes static qualities of a conceptual model along four different dimensions. They regard conceptual model quality as the totality of the features and characteristics of a conceptual model that bear on its ability to satisfy stated or implied needs (Scheer and Hars, 1992). Maes and Poels's (2006) model is based on Seddon's (1997) variant of DeLone and McLean's (1992) Information Systems Success Model.

[Fig. 4. Evaluation phase of the IKDDM model. The flowchart assesses candidate models against thresholds for the Data Mining success criteria (DMSC) and business success criteria (BSC), both dependencies with the business understanding phase. Models rejected for technical or business reasons are documented and thresholds may be revised; approved models receive composite scores via value functions, ties are resolved by rank ordering models and criteria by weight or by creating a two-stage model, and the recommended final model is submitted to the domain expert for consideration, with feedback routed back to the modeling and business understanding phases.]

The model incorporates the same dimensions as Seddon's model (perceived ease of use, perceived usefulness, and user satisfaction) but replaces the Information Quality dimension of the original model with a Perceived semantic quality construct. Maes and Poels (2006) contend that the Information Quality of a conceptual model for users is the perceived semantic quality of the model, such as how valid and complete it is with respect to (their perception of) the problem domain. Validity means that all information conveyed by the model is correct and relevant to the problem, whereas completeness entails that the model contains all information about the domain that is considered correct and relevant. In Table 3 we present the measurement instrument for assessing conceptual model quality as proposed by Maes and Poels (2006). The language has been modified to use 'KDDM process model' instead of the term 'conceptual model' that was part of the original instrument.

Table 3
Measurement instrument for conceptual model quality (Maes and Poels, 2006).

Perceived ease of use (PEOU)
PEOU1: It was easy for me to understand what the KDDM model was trying to model.
PEOU2: Using the KDDM model was often frustrating.
PEOU3: Overall, the KDDM model was easy to use.
PEOU4: Learning how to read the KDDM model was easy.

Perceived usefulness (PU)
PU1: Overall, I think the KDDM model would be an improvement to a textual description of the KDDM process.
PU2: Overall, I found the KDDM model useful for understanding the process modeled.
PU3: Overall, I think the KDDM model improves my performance when understanding the process modeled.

User satisfaction (US)
US1: The KDDM model adequately met the information needs that I was asked to support.
US2: The KDDM model was not efficient in providing the information I needed.
US3: The KDDM model was effective in providing the information I needed.
US4: Overall, I am satisfied with the KDDM model for providing the information I needed.

Perceived semantic quality (PSQ)
PSQ1: The KDDM model represents the KDDM process correctly.
PSQ2: The KDDM model is a realistic representation of the KDDM process.
PSQ3: The KDDM model contains contradicting elements.
PSQ4: All the elements in the KDDM model are relevant for the representation of the KDDM process.
PSQ5: The KDDM model gives a complete representation of the KDDM process.

4. Evaluation methodology: results of analytical testing of IKDDM and CRISP-DM

Analytical testing comprises the examination of the structure of an artifact with respect to static qualities such as ease of use, complexity, and usability (Hevner et al., 2004). Prior to soliciting user input for analytical testing, the artifact, i.e. the IKDDM process model, must be made available to users for experimentation and use in executing Data Mining tasks. Since our simultaneous goal was also to compare the performance and static qualities of the IKDDM model with those of a leading competing artifact, the CRISP-DM process model, we adopted the following methodology for performing the analytical testing:

1. Identified and recruited 42 study participants and randomly divided them into two groups.

2. Presented one group of users with a test questionnaire, which included Data Mining tasks posed as multiple choice questions, and provided them with the documentation of the CRISP-DM process model to assist in answering the questions (i.e. in executing tasks of a Data Mining project). Presented the second group of users with the same test questionnaire but with the documentation of the IKDDM process model to assist in answering the questions.

3. After the completion of the test questionnaire, recorded each participant's perception of the static qualities of the artifact they used (i.e. the CRISP-DM or the IKDDM process model) through a set of survey questions (refer to Table 3).

4. Recorded each participant's gender, role/designation, number of years of experience in Data Mining, and time taken to complete the test. A numeric id was used to link the responder's test to the survey. No identifying details, such as the name of the participant or the name of the organization with which the individual is affiliated, were recorded.

5. Tested for statistical differences in the quality of the two models, as perceived by the users. The independent means t-test as well as the Mann–Whitney procedure was used to test the differences between the two groups (IKDDM versus CRISP-DM).

4.1. Statistical tests for evaluating the results of analytical testing

4.1.1. Independent means t-test for comparing performance of the IKDDM model versus the CRISP-DM model

One of the goals of the evaluation was to compare the performance of the group that used the CRISP-DM model to answer the test questionnaire with that of the group that used the IKDDM model to answer the same test questionnaire. The performance of the two groups was used as a proxy for the effectiveness of the model used by them for answering the test. The results for each group were computed by assigning a score of 2 points for every correct answer and 0 points for every incorrect answer.

The performance of the two groups (each with N = 21) was compared using an independent means t-test to determine if there was any statistical difference between the two groups. The SPSS software v. 15 was used for conducting the test. When conducting an independent means t-test, the null hypothesis states that the "experimental manipulation has no effect on the subjects and therefore we expect the sample means to be identical or very similar" (Field, 2000). If the null hypothesis is rejected, then we can conclude that the two sample means differ because of the experimental manipulation imposed on each sample.
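To make the scoring and comparison concrete, the sketch below scores hypothetical multiple-choice responses and runs an independent means t-test with scipy rather than SPSS. The 15-item, 2-points-per-item test is an assumption for illustration (consistent with the 30-point accuracy rates reported later in the paper); the answer key and responses are made up, not the study's data:

```python
# Sketch: score each participant's test (2 points per correct answer,
# 0 per incorrect answer) and compare groups with an independent t-test.
# The answer key and responses are hypothetical, not the study's items.
import numpy as np
from scipy import stats

key = np.array([2, 1, 3, 0, 2, 1, 0, 3, 2, 1, 3, 0, 1, 2, 3])  # 15 items

def score(responses):
    return 2 * np.sum(np.asarray(responses) == key)   # max score 30

rng = np.random.default_rng(1)
crisp_responses = rng.integers(0, 4, size=(21, 15))   # hypothetical answers
ikddm_responses = rng.integers(0, 4, size=(21, 15))

crisp_scores = np.array([score(r) for r in crisp_responses])
ikddm_scores = np.array([score(r) for r in ikddm_responses])

t_stat, p_value = stats.ttest_ind(crisp_scores, ikddm_scores)
print(f"t = {t_stat:.3f}, two-tailed p = {p_value:.4f}")
```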

4.1.2. Mann–Whitney test for comparing differences in the groups' perception of the static qualities of the KDDM process models

As stated earlier, the static qualities of the KDDM process model employed by the users to execute the Data Mining tasks (in the test questionnaire) were assessed through a set of survey questions with 7-point Likert-scale options, ranging from Strongly Agree to Strongly Disagree. The goal of the evaluation was to determine any difference in users' perception of the static qualities (such as perceived usefulness, ease of use, etc.) of the CRISP-DM process model versus the IKDDM process model. Our rationale for choosing a non-parametric test such as Mann–Whitney stems from the fact that Likert-scale data violates the assumptions of parametric tests, which require that the underlying data be interval or ratio in nature. A non-parametric test (sometimes referred to as an assumption-free test) makes no assumptions about the distribution of the data on which it is used. It can be used for testing differences between two conditions when different subjects have been used in each condition.
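A minimal sketch of this comparison with scipy (the summed Likert survey scores below are hypothetical, not the study's data):

```python
# Sketch of the Mann-Whitney U test on summed Likert-scale survey scores;
# one total score per participant, 21 per group. Values are hypothetical.
import numpy as np
from scipy import stats

crisp_survey = np.array([52, 48, 55, 60, 47, 51, 58, 49, 50, 53,
                         46, 57, 54, 44, 59, 45, 56, 43, 61, 42, 41])
ikddm_survey = np.array([88, 92, 85, 90, 95, 87, 91, 84, 93, 86,
                         89, 94, 83, 96, 82, 97, 81, 98, 80, 99, 78])

u_stat, p_value = stats.mannwhitneyu(crisp_survey, ikddm_survey,
                                     alternative="two-sided")
print(f"U = {u_stat:.1f}, two-tailed p = {p_value:.5f}")
```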

4.1.3. Pilot test of the test questionnaire and survey

Prior to conducting the actual evaluation, a pilot test of the test questionnaire and survey was conducted. Four users with expertise in Data Mining participated in the pilot test. The average Data Mining experience of these users was 4 years. On the basis of feedback received from the users, the test questionnaire was slightly revised, and a final version was created for use in the actual evaluation.

4.1.4. Assessment of the artifact by users with experience in Data Mining

Following the approach described earlier, the artifact, i.e. a KDDM process model and its associated documentation, was made available to the study participants (N = 42). They were asked to use the artifact by applying it to execute the various tasks related to typical Data Mining projects. IRB (Institutional Review Board) approval was sought prior to conducting this study (Ref Number HM 11636). The IRB (also called the Ethics Committee in some countries) is charged with reviewing all research protocols involving humans to ensure compliance with federal, state and local regulations. Based on the IRB guidelines, each participant was presented with a consent form prior to soliciting their input through the test and the survey. The 42 participants were randomly assigned to two groups, CRISP-DM (N = 21) or IKDDM (N = 21), and were asked to use the documentation of the assigned KDDM process model to answer the Data Mining tasks. Hereafter the two groups are referred to as CRISP-DMeval and IKDDMeval respectively.

The following information was recorded for each participant:

• Date on which data was collected from the individual.
• Participant's gender.
• Participant's role/title.
• Participant's number of years of Data Mining experience.
• Start time for the test.
• End time for the test.

The start and end times for the test were used to estimate the total time taken by the participants to answer the test. A summary of the participants' profile based on gender, years of Data Mining experience and the time taken by participants is given in Table 4.

Table 4
Summary of participants' profile.

Gender distribution: CRISPeval (N = 21): 28.5% females, 71.4% males. IKDDMeval (N = 21): 23.8% females, 76.1% males.
Average years of Data Mining experience: CRISPeval: 2.5 years. IKDDMeval: 2.6 years.
Average time taken to answer the test: CRISPeval: 36.52 min. IKDDMeval: 31.38 min.

[Fig. 5. Path model showing loadings for reflective constructs (PEOU, US, PU) and weights for formative construct (PSQ).]

4.2. Assessment of validity of the measurement instrument

Unlike Maes and Poels (2006), our goal was not to test any structural model or hypotheses after validating the instrument. Nevertheless, it is important to assess the validity of the measurement instrument (refer to Table 3) used to assess users' perception of the quality of the artifact, and whether the results appear to be in line with recommendations. We conducted the validity assessments in SmartPLS (Ringle, Wende, & Will, 2005). Following Maes and Poels, we conducted separate validity assessments for the reflectively modeled (PEOU, PU, US) and formatively modeled (PSQ) constructs (see Fig. 5). In the SmartPLS software, results of path analysis include factor loadings for reflective constructs and weights for formative constructs.

4.2.1. Validity assessments of reflective constructs: perceived ease of use, perceived usefulness, user satisfaction

The results obtained from testing the measurement model (see Table 5) provide evidence of the robustness of the measures, as indicated by their internal consistency reliabilities (indexed by the composite reliabilities). The composite reliabilities of the measures range from 0.917 to 0.972. All of these reliabilities exceed the recommended threshold of 0.70 suggested by Nunnally (1978). The reliability can also be confirmed through the values of Cronbach's alpha, ranging from 0.878 to 0.968, which exceed the minimum threshold of 0.7 (Pedhazur & Schmelkin, 1991). These are shown in Table 5. Also, the average variances extracted (AVE) for the measurement constructs range from 0.737 to 0.914. Consistent with the recommendation of Fornell and Larcker (1981), the AVE for each measure well exceeds the lower bound threshold value of 0.50.
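For readers unfamiliar with how these statistics are derived, the sketch below computes composite reliability, AVE and Cronbach's alpha from standardized loadings and raw item scores, under the usual formulas for standardized reflective indicators; the loadings and scores are hypothetical, not the study's SmartPLS output:

```python
# Sketch: composite reliability (CR), average variance extracted (AVE)
# and Cronbach's alpha for one reflective construct. Inputs are hypothetical.
import numpy as np

def composite_reliability(loadings):
    # CR = (sum lambda)^2 / ((sum lambda)^2 + sum(1 - lambda^2))
    lam = np.asarray(loadings)
    error_var = (1.0 - lam ** 2).sum()   # error variance, standardized items
    return lam.sum() ** 2 / (lam.sum() ** 2 + error_var)

def average_variance_extracted(loadings):
    lam = np.asarray(loadings)
    return float(np.mean(lam ** 2))      # mean squared loading

def cronbach_alpha(items):
    # items: participants x items matrix of raw scores
    X = np.asarray(items, dtype=float)
    k = X.shape[1]
    return k / (k - 1) * (1 - X.var(axis=0, ddof=1).sum()
                          / X.sum(axis=1).var(ddof=1))

loadings = [0.90, 0.72, 0.94, 0.87]                      # hypothetical
scores = np.array([[7, 6, 7, 7], [5, 5, 6, 5],
                   [6, 5, 6, 6], [4, 3, 4, 4]])          # hypothetical
print(composite_reliability(loadings))       # compare against 0.70
print(average_variance_extracted(loadings))  # compare against 0.50
print(cronbach_alpha(scores))                # compare against 0.70
```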

4.2.2. Factor loadings

Finally, to complete the psychometric assessment of our measurement model, discriminant validity was examined.

Table 5
Results of validity assessments.

       AVE    Composite reliability   R square   Cronbach's alpha
PEOU   0.73   0.91                    0.00       0.87
PU     0.84   0.94                    0.85       0.90
US     0.91   0.97                    0.86       0.96

Table 7
Factor correlations matrix.

       PEOU     PU       US
PEOU   1
PU     0.917    1
US     0.8987   0.8969   1

Table 8
Assessment of discriminant validity (diagonal elements are the square roots of the AVEs).

       PEOU    PU      US
PEOU   0.858   0       0
PU     0.917   0.916   0
US     0.898   0.896   0.956

Discriminant validity refers to the extent to which the items proposed to measure a given construct differ from the items intended to measure other constructs in the same model. A cross-loading check indicated that all items loaded higher on the construct they were supposed to measure than on any other construct (refer to Table 6). A common rule of thumb to indicate convergent validity is that all items should load greater than 0.7 on their own construct, and should load more highly on their respective construct than on the other constructs (Yoo & Alavi, 2001). Furthermore, each item's factor loading on its respective construct was highly significant (p < 0.01). This was true for the items of all reflective constructs. Another means of assessing discriminant validity is using the factor correlations and the AVE. Evidence of discriminant validity is found if the square root of the AVE is greater than the factor correlations. The factor correlations matrix (Table 7) is a symmetric matrix with 1 along the diagonal (the correlation of a factor with itself is 1).

The method for conducting the analysis of discriminant validity consists of replacing the diagonal elements with the square root of the AVE, and assessing whether this value is greater than the factor's correlations with the other factors (see Table 8). In this case, it can be seen that discriminant validity holds for all factors except PU (perceived usefulness), because the square root of the AVE of PU (0.916) is essentially the same as the correlation between PU and PEOU (0.917) (see Table 8). From this analysis it appears that these two factors are not distinct; however, the cross-loadings confirm discriminant validity.
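The Fornell-Larcker step can be expressed compactly. The sketch below shows the mechanics of the check using hypothetical AVE values and factor correlations (deliberately not the study's numbers):

```python
# Sketch of the Fornell-Larcker discriminant validity check: the square
# root of each construct's AVE should exceed its correlations with the
# other constructs. AVEs and correlations here are hypothetical.
import numpy as np

constructs = ["A", "B", "C"]
ave = np.array([0.78, 0.84, 0.91])                 # hypothetical AVEs
corr = np.array([[1.00, 0.70, 0.65],
                 [0.70, 1.00, 0.72],
                 [0.65, 0.72, 1.00]])              # hypothetical correlations

sqrt_ave = np.sqrt(ave)
for i, name in enumerate(constructs):
    others = np.delete(corr[i], i)       # correlations with other factors
    print(f"{name}: sqrt(AVE) = {sqrt_ave[i]:.3f}, "
          f"max correlation = {others.max():.2f}, "
          f"passes: {sqrt_ave[i] > others.max()}")
```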

4.3. Validity assessments of formative construct: perceived semantic quality

Because of the formative structure of the PSQ construct, traditional validity assessments cannot be used (Diamantopoulos & Winklhofer, 2001). Observed correlations among the items may not be meaningful (Diamantopoulos & Winklhofer, 2001) and, as a consequence, assessment of internal consistency and convergent validity becomes irrelevant (Chin, 1998; Hulland, 1999). The PSQ measure can be considered valid if the PSQ indicator coefficients are significantly different from zero (Diamantopoulos & Winklhofer, 2001). This can be determined by running a bootstrapping procedure in SmartPLS. The output of the path model shows the values of the t-statistic for all paths and coefficients (see Fig. 6). The result of the PLS analysis indicates that not all PSQ indicators have a coefficient significantly different from zero, i.e. t > 2.086 (see Table 9). Such indicators should be deleted from the model if a structural model is to be tested.

Table 6
Factor cross-loadings.

        PEOU     PU       US
PEOU1   0.8975   0.8514   0.7683
PEOU2   0.7201   0.6137   0.5426
PEOU3   0.9351   0.8813   0.9152
PEOU4   0.8658   0.7719   0.8087
PU1     0.889    0.9363   0.8816
PU2     0.7761   0.8738   0.729
PU3     0.8504   0.938    0.8456
US1     0.8606   0.8822   0.9676
US2     0.8153   0.8054   0.9522
US3     0.8481   0.8644   0.9511
US4     0.9101   0.8763   0.9545

In this sample, the coefficients of PSQ1, PSQ3, and PSQ4 turned out to be significantly different from zero, but those of PSQ2 and PSQ5 did not. On the basis of these results it appears that only PSQ1, PSQ3, and PSQ4 are relevant formative indicators of perceived semantic quality.
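The bootstrapping logic can be sketched as follows. This is a simplified stand-in for the SmartPLS procedure, not a reimplementation of it: indicator weights are estimated here by ordinary least squares on a construct proxy score, participants are resampled with replacement, and each t-statistic is the original weight divided by its bootstrap standard error. All data are randomly generated:

```python
# Sketch: bootstrap t-statistics for formative indicator weights.
# OLS regression stands in for the PLS weight estimation; data are made up.
import numpy as np

rng = np.random.default_rng(42)
n = 42
psq = rng.integers(1, 8, size=(n, 5)).astype(float)        # 7-point items
construct = psq @ np.array([0.25, 0.0, 0.55, 0.2, 0.1])    # proxy score
construct += rng.normal(scale=0.5, size=n)                 # noise

def fit_weights(X, y):
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta[1:]                                        # drop intercept

w_full = fit_weights(psq, construct)
boot = np.empty((500, 5))
for b in range(500):
    idx = rng.integers(0, n, size=n)                       # resample rows
    boot[b] = fit_weights(psq[idx], construct[idx])

t_stats = w_full / boot.std(axis=0, ddof=1)
for i, (w, t) in enumerate(zip(w_full, t_stats), start=1):
    print(f"PSQ{i}: weight = {w:.3f}, t = {t:.2f}, "
          f"significant: {abs(t) > 2.086}")
```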

5. Evaluation results

5.1. Independent means t-test to assess differences based on gender distribution, years of Data Mining experience, and time taken

We also ran an independent means t-test to assess whether there were any differences between the two groups based on gender distribution, years of Data Mining experience or time taken to answer the test. No statistical differences were found between the two groups on these measures and the corresponding p values were non-significant (p values for years of experience, gender, and time taken were 0.804, 0.733 and 0.266, respectively).

5.2. Results of independent means t-test: analysis of performance of CRISP-DMeval versus IKDDMeval on the test questionnaire

The performance of the participants in the two groups (CRISP-DM versus IKDDM) was measured by calculating the accuracy of their responses. An independent means t-test was used to determine the statistical difference in performance between the two groups. SPSS v. 15 was used for running the t-test.

Parametric tests assume that the variances in experimental groups are roughly equal. Levene's test tests the hypothesis that the variances in the two groups are equal, i.e. that the difference between the variances is zero (Field, 2000). If Levene's test is significant, then the null hypothesis is rejected and we conclude that the variances are significantly different. If, however, Levene's test is non-significant, then the difference in variances is not significantly different from zero and the assumption of equal variances is tenable. For our data, Levene's test is not significant (p = 0.107, which is greater than 0.05), so we should read the test statistics in the row labeled 'equal variances assumed' (Table 10).
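The same gate can be reproduced outside SPSS; a sketch (with hypothetical score vectors) that runs Levene's test and then picks the pooled or Welch variant of the t-test accordingly:

```python
# Sketch: Levene's test as a gate for the equal-variances assumption,
# followed by the matching t-test variant. Scores are hypothetical.
import numpy as np
from scipy import stats

crisp_scores = np.array([10, 12, 8, 14, 10, 6, 12, 10, 8, 14, 12,
                         10, 8, 6, 12, 14, 10, 8, 12, 10, 6])
ikddm_scores = np.array([28, 26, 30, 24, 28, 26, 30, 28, 24, 26, 28,
                         30, 26, 24, 28, 26, 30, 28, 26, 24, 28])

lev_stat, lev_p = stats.levene(crisp_scores, ikddm_scores)
equal_var = lev_p > 0.05       # non-significant -> equal variances assumed
t_stat, p_value = stats.ttest_ind(crisp_scores, ikddm_scores,
                                  equal_var=equal_var)
print(f"Levene p = {lev_p:.3f}, equal variances assumed: {equal_var}")
print(f"t = {t_stat:.3f}, two-tailed p = {p_value:.3g}")
```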

Having established the assumption of homogeneity of variances, we can look at the t-test itself. SPSS produces an exact significance value for t, and we are interested in whether this value is less than or greater than 0.05. In this case the two-tailed value of p is .000, which is much smaller than 0.05, and therefore we can conclude that there was a highly significant difference (p < 0.001) between the performance of the group that used the IKDDM model to execute Data Mining tasks and the group that used the CRISP model to execute the same set of tasks.

[Fig. 6. Output of bootstrapping: t-statistics for indicator coefficients and paths.]

Table 9
Weights and t-values for formative indicators.

             Weight   t-Statistic   Significant?
PSQ1 → PSQ   0.2451   2.3228        Significant
PSQ2 → PSQ   0.0123   0.1205        Non-significant
PSQ3 → PSQ   0.5723   5.373         Significant
PSQ4 → PSQ   0.1935   2.1812        Significant
PSQ5 → PSQ   0.1304   1.2334        Non-significant


The sample for both the IKDDM and CRISP groups included a few naïve users; specifically, the CRISP group had 5 naïve users whereas the IKDDM group had 6 naïve users. Given the small number of naïve users, their performance cannot be separately assessed through a procedure like the independent means t-test, so we instead compare their mean accuracy rates to gain insight into their relative performance. The mean accuracy rate of naïve users in the CRISP group turned out to be 9.6/30, whereas the mean accuracy rate of naïve users in the IKDDM group turned out to be 28/30.

5.2.1. Discussion of results of independent means t-test

The results of the independent means t-test confirm that the IKDDM group outperformed the CRISP group on the test, which asked users to utilize the process model assigned to them to execute Data Mining tasks.

Table 10
Independent samples test (test score).

Levene's test for equality of variances: F = 2.726, Sig. = .107.

t-test for equality of means:
                              t         df       Sig. (2-tailed)   Mean difference   Std. error difference   95% CI of the difference
Equal variances assumed       -12.955   40       .000              -13.905           1.073                   [-16.074, -11.736]
Equal variances not assumed   -12.955   36.681   .000              -13.905           1.073                   [-16.080, -11.729]

Since the tasks were formulated as multiple choice questions with only one correct answer, the performance of users in both groups could be estimated using the accuracy of their responses. The performance provides insight into the effectiveness and efficiency offered by the IKDDM model over the CRISP model.

Given the significant difference between the mean accuracy rate of naïve users in the CRISP group (9.6/30) and the IKDDM group (28/30), we can conclude that the IKDDM model was equally effective in supporting the information needs of both naïve and experienced users, and allowed for more effective and efficient implementation of tasks by both types of users.

5.3. Results of Mann–Whitney test

5.3.1. Analysis of perceptions of the static qualities of the process model for CRISP-DMeval versus IKDDMeval: using the Mann–Whitney test

The Mann–Whitney test works by looking at differences in the ranked positions of scores in the different groups. The first part of the output, shown in the ranks table (Table 11), shows the average and total ranks for each condition. The group with the lowest mean rank is also the group with the greatest number of lower scores in it. In the context of this study, the group with the lowest mean rank is the group that was assigned to use the CRISP process model.
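The rank mechanics can be reproduced directly; a sketch that pools hypothetical survey scores, ranks them, and reports the mean and summed ranks per group, mirroring the layout of Table 11:

```python
# Sketch of the ranking step behind the Mann-Whitney test: pool both
# groups' scores, assign ranks, and compare mean ranks. Data are made up.
import numpy as np
from scipy import stats

crisp_survey = np.array([52, 48, 55, 60, 47, 51, 58, 49, 50, 53, 46,
                         57, 54, 44, 59, 45, 56, 43, 61, 42, 41])
ikddm_survey = np.array([88, 92, 85, 90, 95, 87, 91, 84, 93, 86, 89,
                         94, 83, 96, 82, 97, 81, 98, 80, 99, 78])

ranks = stats.rankdata(np.concatenate([crisp_survey, ikddm_survey]))
crisp_ranks, ikddm_ranks = ranks[:21], ranks[21:]
print(f"CRISP: mean rank = {crisp_ranks.mean():.2f}, "
      f"sum of ranks = {crisp_ranks.sum():.1f}")
print(f"IKDDM: mean rank = {ikddm_ranks.mean():.2f}, "
      f"sum of ranks = {ikddm_ranks.sum():.1f}")
```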


Table 11
Ranks table for Mann–Whitney test (N = 42).

               Group   N    Mean rank   Sum of ranks
Survey score   CRISP   21   11.76       247.00
               IKDDM   21   31.24       656.00

Table 12
Test statistics for Mann–Whitney test (N = 42).

                         Survey score
Mann–Whitney U           16.000
Wilcoxon W               247.000
Z                        -5.146
Asymp. sig. (2-tailed)   .000

Table 13
Ranks table for Mann–Whitney tests (comparing groups on perceived ease of use, user satisfaction, perceived usefulness and perceived semantic quality).

       Group   N    Mean rank   Sum of ranks
PEOU   CRISP   21   12.02       252.50
       IKDDM   21   30.98       650.50
US     CRISP   21   12.33       259.00
       IKDDM   21   30.67       644.00
PU     CRISP   21   11.52       242.00
       IKDDM   21   31.48       661.00
PSQ    CRISP   21   13.40       281.50
       IKDDM   21   29.60       621.50

Table 14
Test statistics for Mann–Whitney tests (comparing groups on perceived ease of use, user satisfaction, perceived usefulness and perceived semantic quality).

       Mann–Whitney U   Wilcoxon W   Z        Asymp. sig. (2-tailed)
PEOU   21.500           252.500      -5.015   .000
US     28.000           259.000      -4.860   .000
PU     11.000           242.000      -5.294   .000
PSQ    50.500           281.500      -4.319   .000

Table 15
Objects and their defining characteristics.

Customers: wireless internet customers; customers with tenure >1; customers acquired through a marketing channel; most loyal customers.
Suppliers: suppliers for the eastern region; suppliers of small moving parts; suppliers of part X.
Products: co-selling products; products from a particular line (baby care or feminine products).
Employees: internal hires; part time employees; full time employees; contract employees; employees with tenure >5.
Transactions: transactions that occurred in the last week/month/year; transactions valued at >$250.

It can be seen that the IKDDM model (group 2) fared significantly better than the CRISP model in terms of users' perception of the quality of the process model.

Table 12 shows the actual test statistics for the Mann–Whitney test. The SPSS output has a column for the dependent variable (here, the survey score), and rows showing the value of the Mann–Whitney U statistic, Wilcoxon's W statistic, and the associated z approximation. The table also contains the significance value of the test, which gives the two-tailed probability that the magnitude of the test statistic is a chance result. For this test, the Mann–Whitney test is highly significant (p < 0.0001) for the survey scores of the two groups. The values of the mean rankings indicate that the quality of the IKDDM process model was rated significantly higher than the quality of the CRISP process model. This conclusion is reached by noting that for the survey scores representing model quality, the average rank is higher in the IKDDM group (31.24) than in the CRISP group (11.76).

5.4. Results of Mann–Whitney tests to assess differences between groups on individual constructs

The Mann–Whitney test was also used to assess whether there were differences between the two groups (CRISP versus IKDDM) when the four constructs (perceived ease of use, user satisfaction, perceived usefulness, and perceived semantic quality) were analyzed separately. The earlier test established that a significant difference existed between the groups on the combined survey score, but did not convey whether this was true for each construct as well. The test was set up the same way, except that the scores on the items of each of the four constructs were summed separately for the two groups and the differences examined. The results have been interpreted in the same manner as those in the previous section.
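A sketch of this per-construct setup (hypothetical 7-point Likert responses; the item counts per construct follow Table 3):

```python
# Sketch: sum each participant's item scores within a construct and run
# a separate Mann-Whitney U test per construct. Responses are made up.
import numpy as np
from scipy import stats

items_per_construct = {"PEOU": 4, "PU": 3, "US": 4, "PSQ": 5}  # from Table 3
rng = np.random.default_rng(7)

for construct, k in items_per_construct.items():
    crisp = rng.integers(1, 5, size=(21, k)).sum(axis=1)   # hypothetical
    ikddm = rng.integers(4, 8, size=(21, k)).sum(axis=1)   # hypothetical
    u, p = stats.mannwhitneyu(crisp, ikddm, alternative="two-sided")
    print(f"{construct}: U = {u:.1f}, two-tailed p = {p:.5f}")
```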

5.4.1. Results for perceived ease of use

The Mann–Whitney test is highly significant (p < 0.0001) for the perceived ease of use scores of the two groups (refer to Table 14). The values of the mean rankings (refer to Table 13) indicate that the perceived ease of use of the IKDDM process model was rated significantly higher than the perceived ease of use of the CRISP process model. This conclusion is reached by noting that for the survey scores representing perceived ease of use, the mean rank is higher in the IKDDM group (30.98) than in the CRISP group (12.02).

5.4.2. Results for user satisfaction

The Mann–Whitney test is highly significant (p < 0.0001) for the user satisfaction scores of the two groups (refer to Table 14). The values of the mean rankings (refer to Table 13) indicate that user satisfaction with the IKDDM process model was rated significantly higher than user satisfaction with the CRISP process model. This conclusion is reached by noting that for the survey scores representing user satisfaction, the mean rank is higher in the IKDDM group (30.67) than in the CRISP group (12.33).

5.4.3. Results for perceived usefulness

The Mann–Whitney test is highly significant (p < 0.0001) for the perceived usefulness scores of the two groups (Table 14). The values of the mean rankings (Table 13) indicate that the perceived usefulness of the IKDDM process model was rated significantly higher than the perceived usefulness of the CRISP process model. This conclusion is reached by noting that for the survey scores representing perceived usefulness, the average rank is higher in the IKDDM group (31.48) than in the CRISP group (11.52).


Table 16
Data Mining success criteria for classification trees provided by Data Mining software (SAS EM, Clementine).

Measure | Source for calculating measure | SAS EM 4.3 | SPSS Clementine 12.0
Accuracy | Test misclassification rate | Implicit (calculate using 1 - test misclassification rate) | Explicit (modeling results)
Confusion matrix | | Implicit | Implicit
Lift or gains index | Visual inspection of lift chart up to a particular decile | Explicit-visual | Explicit-visual
 | Visual inspection of lift chart at a particular decile | Explicit-visual | Explicit-visual
Lift | Value can be estimated through analysis of lift chart | Implicit (calculate using tree/exact) | Explicit (modeling results)
Profit and loss | Profit and loss matrix | Explicit (modeling results) | Explicit (also provides additional measures)
Simplicity | User defined | Implicit (calculate using number of leaves and/or minimum rule length) | Implicit (calculate using number of leaves)
Stability | User defined | Implicit (calculate using a coarse measure such as Min[ACCV/ACCT, ACCT/ACCV], where ACCV is the accuracy on validation data and ACCT is the accuracy on training data) | Implicit (models are by default built with generality; for assessing stability, validate against a hold-out sample)
ROC curve | Plot of 1-specificity on x-axis and sensitivity on y-axis | Explicit-visual (visual inspection of the chart must be used to employ ROC as an evaluation measure) | Explicit-visual
Area under ROC curve or AUC | Area calculated using trapezoidal rule | No | Explicit (empirical ROC curve and non-parametric estimate of the area under the empirical ROC curve and its 95% CI)
KS statistic (Kolmogorov–Smirnov) | Maximum KS value | No | No
Average squared error | Modeling results | Explicit | No
Sensitivity | Confusion matrix | Implicit (calculate using TP/[TP + FN], where TP is the number of true positives and FN the number of false negatives) | Implicit (calculate using TP/[TP + FN])
Specificity | Confusion matrix | Implicit (calculate using TN/[FP + TN], where TN is the number of true negatives and FP the number of false positives) | Implicit (calculate using TN/[FP + TN])
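
To make the "implicit" entries in Table 16 concrete, the following Python sketch derives accuracy, sensitivity, specificity, the coarse stability ratio, and a trapezoidal-rule AUC. The function and variable names are our own, and all counts and ROC points are invented for illustration.

    # Illustrative sketch only: plain Python, no external dependencies.
    def classification_criteria(tp, fn, fp, tn, acc_train, acc_valid):
        """Derive Table 16's implicit criteria from a binary confusion matrix."""
        total = tp + fn + fp + tn
        accuracy = (tp + tn) / total        # equals 1 - test misclassification rate
        sensitivity = tp / (tp + fn)        # share of actual positives recovered
        specificity = tn / (fp + tn)        # share of actual negatives recovered
        # Coarse stability ratio from Table 16: Min[ACCV/ACCT, ACCT/ACCV].
        stability = min(acc_valid / acc_train, acc_train / acc_valid)
        return {"accuracy": accuracy, "sensitivity": sensitivity,
                "specificity": specificity, "stability": stability}

    def auc_trapezoidal(roc_points):
        """Area under an ROC curve by the trapezoidal rule.

        roc_points: (1 - specificity, sensitivity) pairs sorted by the x value.
        """
        area = 0.0
        for (x1, y1), (x2, y2) in zip(roc_points, roc_points[1:]):
            area += (x2 - x1) * (y1 + y2) / 2.0
        return area

    # Invented confusion-matrix counts, accuracies, and ROC points.
    print(classification_criteria(tp=80, fn=20, fp=10, tn=90,
                                  acc_train=0.88, acc_valid=0.85))
    print(auc_trapezoidal([(0.0, 0.0), (0.1, 0.6), (0.3, 0.85), (1.0, 1.0)]))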

Table 17
Data Mining success criteria for association rules.

Measure | Source for calculating measure | SAS EM 4.3 | SPSS Clementine 12.0
Lift | Ratio of confidence to the prior probability of having the consequent | Explicit (modeling results) | Explicit (modeling results)
Excess | Lift - 1 | Implicit (calculate using lift - 1) | Implicit (calculate using lift - 1)
Simplicity | Length of rule | Implicit (calculate using length of rule) | Implicit (calculate using length of rule)
Support | Proportion of IDs for which the entire rule, antecedents, and consequents are true | Explicit (modeling results) | Explicit (modeling results)
Confidence | Ratio of rule support to antecedent support | Explicit (modeling results) | Explicit (modeling results)
Interest factor | Ratio between the joint probability of two variables with respect to their expected probabilities under the independence assumption | No | No
Monetary value | Profitability of a rule | Explicit (modeling results) | Explicit (modeling results)
Deployability | % of training data that satisfies the conditions of the antecedent but does not satisfy the consequent | No | Explicit (modeling results)
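
The following Python sketch illustrates how the criteria in Table 17 are computed for a single hypothetical rule A -> B over an invented toy transaction list; the variable names and data are our own, purely for illustration.

    # Illustrative sketch only: plain Python, no external dependencies.
    transactions = [{"A", "B"}, {"A", "B", "C"}, {"A"}, {"B"}, {"A", "B"}, {"C"}]
    antecedent, consequent = {"A"}, {"B"}

    n = len(transactions)
    n_ante = sum(antecedent <= t for t in transactions)   # antecedent true
    n_rule = sum((antecedent | consequent) <= t for t in transactions)
    n_cons = sum(consequent <= t for t in transactions)   # consequent true

    support = n_rule / n                  # proportion satisfying the whole rule
    confidence = n_rule / n_ante          # rule support / antecedent support
    lift = confidence / (n_cons / n)      # confidence / prior of the consequent
    excess = lift - 1
    deployability = (n_ante - n_rule) / n # antecedent holds, consequent does not

    print(f"support={support:.2f} confidence={confidence:.2f} "
          f"lift={lift:.2f} excess={excess:.2f} deployability={deployability:.2f}")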



5.4.4. Results for perceived semantic quality

The Mann–Whitney test is highly significant (p < 0.0001) for the perceived semantic quality scores of the two groups (Table 13). The mean rankings (Table 14) indicate that the perceived semantic quality of the IKDDM process model was rated significantly higher than that of the CRISP process model. This conclusion is reached by noting that for the survey scores representing semantic quality, the mean rank is higher in the IKDDM group (29.60) than in the CRISP group (13.40).

6. Discussion

The results of the Mann–Whitney test on the overall survey scores representing the quality of the process models indicate that a significant difference existed between the CRISP and IKDDM models. The test results clearly indicate that the IKDDM model outperformed the CRISP model by a highly significant margin (p < 0.0001). This is an important result and signifies that users rated the effectiveness and efficiency of the IKDDM model as much higher than that of the CRISP model. The results of the Mann–Whitney test across the four constructs also indicated that the IKDDM group and the CRISP group differed significantly in their perceptions of ease of use, usefulness, semantic quality, and level of user satisfaction with the model they employed to execute tasks in Data Mining. The IKDDM group reported significantly higher levels of perceived ease of use, perceived usefulness, semantic quality, and user satisfaction compared to the CRISP group.

The results confirm that the IKDDM model is more effective and efficient than the CRISP model in executing tasks of the KDDM process. The limitations of existing KDDM process models identified in this research (such as the use of only a checklist approach, and the lack of explicit support for the execution of tasks) are clearly also perceived as problematic by Data Mining users.

In keeping with the essence of design science research, the present design of the artifact can only be regarded as a "satisfactory solution" (Simon, 1996). However, the initial testing of IKDDM against CRISP (a leading model and the most detailed of those previously proposed) has generated promising results. These can be regarded as a measure of the significance of the designed artifact and its contribution to the existing knowledge base. To the best of our knowledge, this is the first study to conduct a rigorous formal evaluation of KDDM process models. More such studies are needed, as they help to objectively assess the quality of such models and provide important directives for improving these critical process models.

Appendix A. Extract from test instrument

Consider the case of a telecommunications services firm called ABC Global. The firm is facing the issue of losing its existing customers to its competitors. On further analysis, the firm finds that it is the customers who have been with the firm for more than 2 years (i.e. whose tenure is >2 years) who are most likely to leave (or churn). At present, 7% of the customers are churning away, and this is resulting in a loss of $1 million for the company. The company wishes to bring this rate of churn down to 3% over the financial year 2008–2009.

Question 1: Which of the following statements of business objectives reflects the business objective of the Data Mining project being pursued by ABC Global?

(A) Reduce churn rate of existing customers to 4% by 2009.
(B) Reduce churn rate of customers with tenure >2 to 3% over 2008–2009.
(C) Predict the probability to churn of customers with tenure >2 over 2008–2009.
(D) Increase profits by reducing churn rate of customers with tenure >2 to 4% over 2008–2009.

Question 8: Which of the following Data Mining success criteria applies to both classification problems and association rules?

(A) Area under ROC curve.
(B) KS (Kolmogorov–Smirnov) statistic.
(C) Support.
(D) Lift.

Question 12: How are the modeling parameters depth and breadth of a decision tree related to the accuracy and efficiency of the tree?

(A) Modeling parameter breadth is related to the accuracy of the tree, whereas depth is related to the efficiency of the tree.
(B) Modeling parameter depth is related to the accuracy of the tree, whereas breadth is related to the efficiency of the tree.
(C) Both depth and breadth are related to the accuracy of the tree, but neither is related to the efficiency of the tree.
(D) There is no relation between the modeling parameters depth and breadth and the accuracy and efficiency of the tree.

Appendix B. CRISP-DM documentation extracts provided for Questions 1, 8, and 12

Question 1

Output: Business objectives

Describe the customer's primary objective, from a business perspective, in the Data Mining project. In addition to the primary business objective, there are typically a large number of related business questions that the customer would like to address. For example, the primary business goal might be to keep current customers by predicting when they are prone to move to a competitor, while secondary business objectives might be to determine whether lower fees affect only one particular segment of customers.

Activities:

- Informally describe the problem which is supposed to be solved with Data Mining
- Specify all business questions as precisely as possible
- Specify any other business requirements (e.g., the business does not want to lose any customers)
- Specify expected benefits in business terms

Question 8

Output: Data Mining success criteria

Define the criteria for a successful outcome to the project in technical terms, for example a certain level of predictive accuracy or a propensity-to-purchase profile with a given degree of "lift". As with business success criteria, it may be necessary to describe these in subjective terms, in which case the person or persons making the subjective judgment should be identified.

Activities:

- Specify criteria for model assessment (e.g., model accuracy, performance and complexity)
- Define benchmarks for evaluation criteria
- Specify criteria which address subjective assessment criteria (e.g., model explainability and the data and marketing insight provided by the model)

Question 12

Output: Parameter settings

With any modeling tool, there are often a large number of parameters that can be adjusted. List the parameters and their chosen values, along with the rationale for the choice.

Activities:

- Set initial parameters
- Document reasons for choosing those values

Appendix C. IKDDM documentation extracts for Questions 1, 8, and 12

Question 1: Setting up Business Objectives

Consider the following steps to formulate a business objective:

Page 14: Evaluation of an integrated Knowledge Discovery and Data Mining process model

11348 S. Sharma et al. / Expert Systems with Applications 39 (2012) 11335–11348

C.1. Step 1: Select purpose

Purpose: the purpose signifies the motivation behind formulating the objective, or why the objective is being formulated. In the context of Data Mining projects, the purpose can be of the following five types:

1. Increase/Improve.
2. Decrease/Reduce.
3. Identify.
4. Understand.
5. Determine (Hypothesis Testing).

C.2. Step 2: select object of study and its defining characteristic

Object Name and Defining Characteristic: the object is the entity under study. Examples of objects include: (1) Customers, (2) Suppliers, (3) Products, (4) Employees, (5) Transactions, etc.

In selecting the object it is important to provide further qualifying information in the form of the defining characteristic of the object. For instance, if the object is chosen as simply 'customers', it may not be clear which customers of the firm are of interest, and the resultant Data Mining endeavor may be based on the entire customer base of the firm. However, the results of Data Mining so obtained are likely to be diluted, as it is well known that different types of customers behave differently. So when specifying the object, we must augment it by adding more information (see examples of various types of objects and their defining characteristics in Table 15).

C.3. Step 3: select focus variable (the variable of interest)

Focus: the focus is the variable or quality attribute of the entity under study, i.e. what is being studied through the Data Mining project. The focus of a Data Mining project can be on a tangible or quantitatively measurable behavior, or on an intangible attribute. Below we provide examples of both types.

Quantitative focus: such a focus variable can be measured in terms of a %, rate, amount, etc. For example, the churn rate or loss rate of a CUSTOMER [OBJECT].

Qualitative focus: such a focus variable cannot be measured in terms of a %, rate, amount, etc. For example, the factors affecting motivation of EMPLOYEES [OBJECT].

C.4. Step 4: formulate preliminary business objective using the PURPOSE, OBJECT, and FOCUS variable selected earlier

For example, the preliminary business objective can be: Increase (PURPOSE) the approval rate (FOCUS) of sub-prime customers (OBJECT AND DEFINING CHARACTERISTIC).

C.5. Step 5: finalize business objective by

- Adding information about the time frame over which the objective must be achieved.
- Adding information about the delta change, if the focus variable is quantitative.

For example, the business objective can be refined as: Increase (PURPOSE) the approval rate (FOCUS) of sub-prime customers (OBJECT AND DEFINING CHARACTERISTIC) by 4% (DELTA CHANGE IN FOCUS VARIABLE) over 2009–2010 (TIME FRAME).

This statement can be regarded as the FINAL statement of the business objective.
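
As a convenience, the five components above can be captured in a small template. The following Python sketch is our own illustration; the class and field names are invented, not part of the IKDDM specification.

    # Illustrative sketch only: assembles an objective statement from the
    # five IKDDM components described above.
    from dataclasses import dataclass

    @dataclass
    class BusinessObjective:
        purpose: str            # e.g. Increase, Reduce, Identify, ...
        focus: str              # variable of interest
        obj: str                # object of study with defining characteristic
        delta: str = ""         # delta change, if the focus is quantitative
        time_frame: str = ""    # period over which the objective must be achieved

        def statement(self) -> str:
            parts = [self.purpose, self.focus, "of", self.obj]
            if self.delta:
                parts += ["by", self.delta]
            if self.time_frame:
                parts += ["over", self.time_frame]
            return " ".join(parts)

    goal = BusinessObjective("Increase", "the approval rate",
                             "sub-prime customers",
                             delta="4%", time_frame="2009-2010")
    print(goal.statement())
    # -> Increase the approval rate of sub-prime customers by 4% over 2009-2010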

Question 8: See Tables 16 and 17.

Question 12:

Relationship between depth of a tree and efficiency of a classification tree: the average number of layers from the root to the terminal nodes is referred to as the average depth of the tree. In general, the average depth of the tree will reflect the weight given to efficiency.

Relationship between breadth of a tree and accuracy of a classification tree: the average number of internal nodes in each level of the tree is referred to as the average breadth of the tree. In general, the average breadth of the tree will reflect the relative weight given to classifier accuracy.
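
The following Python sketch illustrates one way these two quantities might be computed for a fitted scikit-learn classification tree; the traversal logic and the random training data are our own illustrative assumptions, not part of the IKDDM documentation.

    # Illustrative sketch only; requires numpy and scikit-learn.
    from collections import Counter
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Random, purely illustrative training data.
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(200, 4)), rng.integers(0, 2, size=200)
    clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

    tree = clf.tree_
    left, right = tree.children_left, tree.children_right

    leaf_depths, internal_per_level = [], Counter()
    stack = [(0, 0)]                      # (node id, depth); the root is node 0
    while stack:
        node, depth = stack.pop()
        if left[node] == -1:              # leaf (terminal) node
            leaf_depths.append(depth)
        else:                             # internal node
            internal_per_level[depth] += 1
            stack.append((left[node], depth + 1))
            stack.append((right[node], depth + 1))

    avg_depth = sum(leaf_depths) / len(leaf_depths)
    avg_breadth = sum(internal_per_level.values()) / len(internal_per_level)
    print(f"average depth = {avg_depth:.2f}, average breadth = {avg_breadth:.2f}")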


References

Bernstein, A., Provost, F., & Hill, S. (2005). Toward intelligent assistance for a data mining process: An ontology-based approach for cost-sensitive classification. IEEE Transactions on Knowledge and Data Engineering, 17(4), 503–518.

Berry, M., & Linoff, G. (1997). Data mining techniques for marketing, sales and customer support. John Wiley and Sons.

Berry, M., & Linoff, G. (2000). Mastering data mining: The art and science of customer relationship management. John Wiley and Sons.

Brachman, R., & Anand, T. (1996). The process of knowledge discovery in databases: A human-centered approach. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy (Eds.), Advances in knowledge discovery and data mining (pp. 36–57). Menlo Park: AAAI Press.

Cabena, P., Hadjinian, P., Stadler, R., & Verhees, J. (1998). Discovering data mining: From concepts to implementation. Prentice Hall.

Charest, M., Delisle, S., Cervantes, O., & Shen, Y. (2006). Intelligent data mining assistance via CBR and ontologies. In Proceedings of the 17th international conference on database and expert systems applications (DEXA'06).

Chin, W. (1998). The partial least squares approach for structural equation modeling. In G. A. Marcoulides (Ed.), Modern methods for business research (pp. 295–336). Mahwah, NJ: Lawrence Erlbaum Associates.

Cios, K., Teresinska, A., Konieczna, S., Potocka, J., & Sharma, S. (2000). Diagnosing myocardial perfusion from SPECT bull's-eye maps – A knowledge discovery approach. IEEE Engineering in Medicine and Biology Magazine, Special Issue on Medical Data Mining and Knowledge Discovery, 19(4), 17–25.

CRISP-DM (2003). CRoss Industry Standard Process for Data Mining 1.0: Step by step data mining guide. <http://www.crisp-dm.org/> Retrieved 01.10.10.

DeLone, W. H., & McLean, E. R. (1992). Information systems success: The quest for the dependent variable. Information Systems Research, 3(1), 60–95.

Diamantopoulos, A., & Winklhofer, H. (2001). Index construction with formative indicators: An alternative to scale development. Journal of Marketing Research, 38(2), 269–277.

Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (Eds.). (1996). Advances in knowledge discovery and data mining. MIT Press.

Field, A. (2000). Discovering statistics using SPSS for Windows. Sage Publications.

Fornell, C., & Larcker, D. (1981). Evaluating structural equation models with unobservable variables and measurement error. Journal of Marketing Research, 18, 39–50.

Hevner, A. R., March, S. T., Park, J., & Ram, S. (2004). Design science in information systems research. MIS Quarterly, 28(1), 75–105.

Hulland, J. (1999). Use of partial least squares (PLS) in strategic management research: A review of four recent studies. Strategic Management Journal, 20(2), 195–204.

KDNuggets (2007). Poll: Data mining methodology. <http://www.kdnuggets.com/polls/2007/data_mining_methodology.htm> Retrieved 01.02.10.

Kurgan, L. A., & Musilek, P. (2006). A survey of knowledge discovery and data mining process models. The Knowledge Engineering Review, 21(1), 1–24.

Maes, A., & Poels, G. (2006). Evaluating quality of conceptual models based on user perceptions. In 25th international conference on conceptual modeling, Tucson, AZ.

Nunnally, J. C. (1978). Psychometric theory. New York: McGraw Hill.

Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Lawrence Erlbaum Associates.

Pyle, D. (2003). Business modeling and data mining. Morgan Kaufmann Publishers.

Ringle, C. M., Wende, S., & Will, A. (2005). SMART-PLS. Hamburg, Germany: University of Hamburg.

Scheer, A.-W., & Hars, A. (1992). Extending data modeling to cover the whole enterprise. Communications of the ACM, 35(9), 166–172.

Seddon, P. B. (1997). A respecification and extension of the DeLone and McLean model of IS success. Information Systems Research, 8(3), 240–253.

Sharma, S., & Osei-Bryson, K.-M. (2008). Implementation of business understanding phase of data mining projects. Expert Systems with Applications, 36(2), 4114–4124.

Sharma, S., & Osei-Bryson, K.-M. (2010). Toward an integrated knowledge discovery and data mining process model. Knowledge Engineering Review, 25(1), 49–67.

Shearer, C. (2000). The CRISP-DM methodology: The new blueprint for data mining. Journal of Data Warehousing, 5(4), 13–22.

Simon, H. A. (1996). The sciences of the artificial. Cambridge, MA: MIT Press.

Yoo, Y., & Alavi, M. (2001). Media and group cohesion: Relative influences on social presence, task participation, and group consensus. MIS Quarterly, 25, 371–390.