Data Mining CMPT 455/826 - Week 10, Day 2 (Jan-Apr 2009, w10d2)

Page 1:

Data Mining

CMPT 455/826 - Week 10, Day 2

Jan-Apr 2009

Page 2:

A Methodology for Evaluating and Selecting Data Mining Software

(based on Collier)

Page 3:

Evaluation Categories

• Performance – the ability to handle a variety of data sources in an efficient manner

• Functionality – the inclusion of a variety of capabilities, techniques, and methodologies for data mining

• Usability – accommodation of different levels and types of users without loss of functionality or usefulness

• Ancillary Task Support – allows the user to perform the variety of data cleansing, manipulation, transformation, visualization, and other tasks that support data mining

Page 4:

Methodology

1. Tool pre-screening

2. Identify Additional Selection Criteria

3. Weight Selection Criteria

4. Tool Scoring

5. Score Evaluation

6. Tool Selection
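
As a concrete illustration of steps 3 to 5, the weighting, scoring, and score-evaluation steps reduce to a weighted sum per candidate tool. The sketch below uses the four evaluation categories from the earlier slide as criteria; the weights, tool names, and ratings are invented for illustration and do not come from Collier.

# Weighted scoring of candidate tools, using the four evaluation categories as criteria.
weights = {"performance": 0.3, "functionality": 0.3, "usability": 0.2, "ancillary_task_support": 0.2}

# Step 4: ratings (1-5 scale) assigned to each tool during tool scoring.
ratings = {
    "Tool A": {"performance": 4, "functionality": 5, "usability": 3, "ancillary_task_support": 4},
    "Tool B": {"performance": 5, "functionality": 3, "usability": 4, "ancillary_task_support": 3},
}

# Step 5: compute weighted scores and rank the tools (step 6 then picks the top one).
scores = {tool: sum(weights[c] * r[c] for c in weights) for tool, r in ratings.items()}
for tool, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tool}: {score:.2f}")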

Page 5:

Introduction to Data Mining

from Discovering Knowledge in Data:

An Introduction to Data Mining

by Daniel T. Larose

Page 6:

Need For Human Direction Of Data Mining

• Many software vendors market their analytical software as plug-and-play, out-of-the-box applications that will provide solutions to otherwise intractable problems without the need for human supervision or interaction. Some early definitions of data mining followed this focus on automation.

• Humans need to be actively involved at every phase of the data mining process.

• Rather than asking where humans fit into data mining, we should instead inquire about how we may design data mining into the very human process of problem solving.

Page 7:

CRISP–DM: The Six Phases

1. Business understanding phase

2. Data understanding phase

3. Data preparation phase

4. Modeling phase

5. Evaluation phase

6. Deployment phase

Page 8:

CRISP–DM: Modeling phase

a) Select and apply appropriate modeling techniques

b) Calibrate model settings to optimize results

c) Remember that often, several different techniques may be used for the same data mining problem

d) If necessary, loop back to the data preparation phase to bring the form of the data into line with the specific requirements of a particular data mining technique
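
A rough sketch of (a), (b), and (c): fit more than one technique to the same problem and calibrate each one's settings with cross-validated grid search. This assumes scikit-learn and uses a synthetic dataset as a stand-in for the prepared data; it illustrates the idea rather than prescribing particular tools.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in for data coming out of the data preparation phase.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# (a)/(c): several candidate techniques for the same problem,
# (b): each with settings to calibrate via cross-validated grid search.
candidates = {
    "decision_tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [3, 5, None]}),
    "logistic_regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
}

for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5)
    search.fit(X, y)
    print(name, search.best_params_, round(search.best_score_, 3))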

Page 9:

CRISP–DM: Evaluation phase

1. Evaluate the model or models delivered in the modeling phase for quality and effectiveness before deploying them for use in the field.

2. Determine whether the model in fact achieves the objectives set for it in the first phase.

3. Establish whether some important facet of the business or research problem has not been accounted for sufficiently.

4. Come to a decision regarding the use of the data mining results.
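
As a small sketch of steps 1, 2, and 4, a delivered model can be scored on held-out data and the result compared against the objective agreed in the business understanding phase. The synthetic data, the holdout split, and the 80% accuracy target below are assumptions for illustration only.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the prepared data and the model delivered by the modeling phase.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

business_target = 0.80   # objective set in the business understanding phase (assumed value)
accuracy = accuracy_score(y_test, model.predict(X_test))

# Step 4: decide whether the results justify deployment or another iteration.
decision = "deploy" if accuracy >= business_target else "revisit earlier phases"
print(f"accuracy={accuracy:.3f} -> {decision}")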

Page 10:

Fallacies Of Data Mining

• Fallacy 1. There are data mining tools that we can turn loose on our data repositories and use to find answers to our problems.

• Reality. There are no automatic data mining tools that will solve your problems mechanically "while you wait." Rather, data mining is a process, as we have seen above. CRISP–DM is one method for fitting the data mining process into the overall business or research plan of action.

Page 11:

Fallacies Of Data Mining

• Fallacy 2. The data mining process is autonomous, requiring little or no human oversight.

• Reality. As we saw above, the data mining process requires significant human interactivity at each stage. Even after the model is deployed, the introduction of new data often requires an updating of the model. Continuous quality monitoring and other evaluative measures must be assessed by human analysts.

Page 12:

Fallacies Of Data Mining

• Fallacy 3. Data mining pays for itself quite quickly.

• Reality. The return rates vary, depending on the startup costs, analysis personnel costs, data warehousing preparation costs, and so on.

Page 13:

Fallacies Of Data Mining

• Fallacy 4. Data mining software packages are intuitive and easy to use.

• Reality. Again, ease of use varies. However, data analysts must combine subject matter knowledge with an analytical mind and a familiarity with the overall business or research model.

Page 14:

Fallacies Of Data Mining

• Fallacy 5. Data mining will identify the causes of our business or research problems.

• Reality. The knowledge discovery process will help you to uncover patterns of behavior. Again, it is up to humans to identify the causes.

Page 15:

Fallacies Of Data Mining

• Fallacy 6. Data mining will clean up a messy database automatically.

• Reality. Well, not automatically. As a preliminary phase in the data mining process, data preparation often deals with data that has not been examined or used in years. Therefore, organizations beginning a new data mining operation will often be confronted with the problem of data that has been lying around for years, is stale, and needs considerable updating.

Page 16:

What Tasks Can Data Mining Accomplish?

• Description
• Estimation
• Prediction
• Classification
• Clustering
• Association

Page 17:

Interestingness Measures for Data Mining: A Survey

(based on Geng and Hamilton)

Page 18:

Interestingness measures

• are intended for selecting and ranking patterns
  – according to their potential interest to the user
  – regardless of the kind of patterns being mined

• Interestingness is a broad concept that emphasizes
  – conciseness, coverage, reliability, peculiarity, diversity, novelty, surprisingness, utility, and actionability

• These nine criteria can be further categorized into
  – objective, subjective, and semantics-based

• A concise pattern or set of patterns
  – is relatively easy to understand and remember and thus is added more easily to the user’s knowledge (set of beliefs)
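
Objective measures are computed from the data alone. Three classic examples for association rules are support, confidence, and lift, which can be computed directly from transaction counts; the toy transactions below are invented for illustration.

# Objective interestingness of the rule {bread} -> {butter} on toy transactions.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"bread", "milk"},
]

n = len(transactions)
antecedent, consequent = {"bread"}, {"butter"}

both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
ante = sum(1 for t in transactions if antecedent <= t)
cons = sum(1 for t in transactions if consequent <= t)

support = both / n              # P(A and B)
confidence = both / ante        # P(B | A)
lift = confidence / (cons / n)  # P(B | A) / P(B)

print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")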

Page 19:

Towards comprehensive support for organizational mining

(based on Song)

Page 20:

Process mining

• requires the availability of an event log

  – Most process mining work has focused on control-flow discovery
    • i.e., constructing a process model based on an event log

  – Process mining techniques assume that it is possible to sequentially record events such that
    • each event refers to an activity (i.e., a well-defined step in the process)
    • and is related to a particular case (i.e., a process instance)

  – Some mining techniques use additional information, such as
    • the performer or originator of the event (i.e., the person/resource executing or initiating the activity),
    • the timestamp of the event, or
    • data elements recorded with the event (e.g., the size of an order)
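
A minimal sketch of what one recorded event might look like, assuming a plain Python representation; the field names and sample values are illustrative, not taken from the paper.

from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class Event:
    case_id: str                        # the process instance (case) the event belongs to
    activity: str                       # a well-defined step in the process
    originator: Optional[str] = None    # person/resource executing or initiating the activity
    timestamp: Optional[datetime] = None
    data: dict = field(default_factory=dict)   # extra data elements, e.g. the size of an order

log = [
    Event("order-1001", "register order", "Alice", datetime(2009, 3, 2, 9, 15), {"order_size": 12}),
    Event("order-1001", "check stock", "Bob", datetime(2009, 3, 2, 10, 40)),
    Event("order-1002", "register order", "Alice", datetime(2009, 3, 2, 11, 5), {"order_size": 3}),
]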

Page 21:

Event logs

• An interesting class of information systems that produce event logs is the so-called Process-Aware Information Systems (PAISs)

• These systems provide very detailed information about the activities that have been executed

Page 22:

Organizational mining

• Discovery – aims at constructing a model that reflects the current situation

  – the organizational model represents the current organizational structure, and

  – the social network shows the communication structure in an organization
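
One common way to derive such a social network from an event log is the handover-of-work idea: count how often one performer's activity is directly followed by another performer's activity within the same case. The sketch below assumes per-case performer sequences with invented data and simplifies the metric rather than reproducing the paper's exact definition.

from collections import Counter

# Performers per case, in the order their activities were executed (invented data).
cases = {
    "order-1001": ["Alice", "Bob", "Carol"],
    "order-1002": ["Alice", "Carol"],
    "order-1003": ["Bob", "Carol", "Carol"],
}

# Count handovers of work: performer a directly followed by performer b in the same case.
handover = Counter()
for performers in cases.values():
    for a, b in zip(performers, performers[1:]):
        if a != b:                # ignore repeated work by the same person in this sketch
            handover[(a, b)] += 1

for (a, b), count in handover.most_common():
    print(f"{a} -> {b}: {count}")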

Page 23:

Organizational mining

• Conformance checking

  – examines whether the modeled behaviour matches the observed behaviour

  – involves two dimensions of conformance measures in the control-flow perspective
    • fitness – the degree of association between the log traces and the execution paths specified by the process model
    • appropriateness – the degree of accuracy with which the process model describes the observed behaviour
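
The slide does not define fitness quantitatively. One common token-replay formulation, from Rozinat and van der Aalst's conformance-checking work rather than from this slide, is

\[
f = \frac{1}{2}\left(1 - \frac{\sum_i n_i\, m_i}{\sum_i n_i\, c_i}\right)
  + \frac{1}{2}\left(1 - \frac{\sum_i n_i\, r_i}{\sum_i n_i\, p_i}\right)
\]

where, for each distinct log trace i, n_i is its frequency in the log and m_i, c_i, r_i, p_i are the tokens missing, consumed, remaining, and produced when the trace is replayed on the process model; a fitness of 1 means every trace replays perfectly.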

Page 24:

Organizational mining

• Extension – aims at enriching an existing model

  • by extending the model through the projection of information extracted from the logs onto the initial model
