introduction to data mining - steinbach.pdf
TRANSCRIPT
-
7/23/2019 Introduction to data mining - Steinbach.pdf
1/790
PANG-NING TAN
Michigan State University

MICHAEL STEINBACH
University of Minnesota

VIPIN KUMAR
University of Minnesota and Army High Performance Computing Research Center
Boston  San Francisco  New York  London  Toronto  Sydney  Tokyo  Singapore  Madrid  Mexico City  Munich  Paris  Cape Town  Hong Kong  Montreal
If you purchased this book within the United States or Canada you should be aware that it has been wrongfully imported without the approval of the Publisher or the Author.
Acquisitions Editor  Matt Goldstein
Project Editor  Katherine Harutunian
Production Supervisor  Marilyn Lloyd
Production Services  Paul C. Anagnostopoulos of Windfall Software
Marketing Manager  Michelle Brown
Copyeditor  Kathy Smith
Proofreader  Jennifer McClain
Technical Illustration  George Nichols
Cover Design Supervisor  Joyce Cosentino Wells
Cover Design  Night & Day Design
Cover Image  © 2005 Rob Casey/Brand X Pictures
Prepress and Manufacturing  Caroline Fell
Printer  Hamilton Printing
Access the latest information about Addison-Wesley titles from our World Wide Web site: http://www.aw-bc.com/computing
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial caps or all caps.
The programs and applications presented in this book have been included for their instructional value. They have been tested with care, but are not guaranteed for any particular purpose. The publisher does not offer any warranties or representations, nor does it accept any liabilities with respect to the programs or applications.
Copyright © 2006 by Pearson Education, Inc.

For information on obtaining permission for use of material in this work, please submit a written request to Pearson Education, Inc., Rights and Contract Department, 75 Arlington Street, Suite 300, Boston, MA 02116 or fax your request to (617) 848-7047.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or any other media embodiments now known or hereafter to become known, without the prior written permission of the publisher. Printed in the United States of America.

ISBN 0-321-42052-7

2 3 4 5 6 7 8 9 10-HAM-08 07 06
our families
Preface

Advances in data generation and collection are producing data sets of massive size in commerce and a variety of scientific disciplines. Data warehouses store details of the sales and operations of businesses, Earth-orbiting satellites beam high-resolution images and sensor data back to Earth, and genomics experiments generate sequence, structural, and functional data for an increasing number of organisms. The ease with which data can now be gathered and stored has created a new attitude toward data analysis: Gather whatever data you can whenever and wherever possible. It has become an article of faith that the gathered data will have value, either for the purpose that initially motivated its collection or for purposes not yet envisioned.
The field of data mining grew out of the limitations of current data analysis techniques in handling the challenges posed by these new types of data sets. Data mining does not replace other areas of data analysis, but rather takes them as the foundation for much of its work. While some areas of data mining, such as association analysis, are unique to the field, other areas, such as clustering, classification, and anomaly detection, build upon a long history of work on these topics in other fields. Indeed, the willingness of data mining researchers to draw upon existing techniques has contributed to the strength and breadth of the field, as well as to its rapid growth.
Another strength of the field has been its emphasis on collaboration with researchers in other areas. The challenges of analyzing new types of data cannot be met by simply applying data analysis techniques in isolation from those who understand the data and the domain in which it resides. Often, skill in building multidisciplinary teams has been as responsible for the success of data mining projects as the creation of new and innovative algorithms. Just as, historically, many developments in statistics were driven by the needs of agriculture, industry, medicine, and business, many of the developments in data mining are being driven by the needs of those same fields.
This book began as a set of notes and lecture slides for a data mining course that has been offered at the University of Minnesota since Spring 1998 to upper-division undergraduate and graduate students. Presentation slides
and exercises developed in these offerings grew with time and served as a basis for the book. A survey of clustering techniques in data mining, originally written in preparation for research in the area, served as a starting point for one of the chapters in the book. Over time, the clustering chapter was joined by chapters on data, classification, association analysis, and anomaly detection. The book in its current form has been class tested at the home institutions of the authors (the University of Minnesota and Michigan State University), as well as several other universities.
A number of data mining books appeared in the meantime, but were not completely satisfactory for our students: primarily graduate and undergraduate students in computer science, but including students from industry and a wide variety of other disciplines. Their mathematical and computer backgrounds varied considerably, but they shared a common goal: to learn about data mining as directly as possible in order to quickly apply it to problems in their own domains. Thus, texts with extensive mathematical or statistical prerequisites were unappealing to many of them, as were texts that required a substantial database background. The book that evolved in response to these students' needs focuses as directly as possible on the key concepts of data mining by illustrating them with examples, simple descriptions of key algorithms, and exercises.
Overview  Specifically, this book provides a comprehensive introduction to data mining and is designed to be accessible and useful to students, instructors, researchers, and professionals. Areas covered include data preprocessing, visualization, predictive modeling, association analysis, clustering, and anomaly detection. The goal is to present fundamental concepts and algorithms for each topic, thus providing the reader with the necessary background for the application of data mining to real problems. In addition, this book also provides a starting point for those readers who are interested in pursuing research in data mining or related fields.
The book covers five main topics: data, classification, association analysis, clustering, and anomaly detection. Except for anomaly detection, each of these areas is covered in a pair of chapters. For classification, association analysis, and clustering, the introductory chapter covers basic concepts, representative algorithms, and evaluation techniques, while the more advanced chapter discusses advanced concepts and algorithms. The objective is to provide the reader with a sound understanding of the foundations of data mining, while still covering many important advanced topics. Because of this approach, the book is useful both as a learning tool and as a reference.
To help the readers better understand the concepts that have been presented, we provide an extensive set of examples, figures, and exercises. Bibliographic notes are included at the end of each chapter for readers who are interested in more advanced topics, historically important papers, and recent trends. The book also contains a comprehensive subject and author index.
To the Instructor  As a textbook, this book is suitable for a wide range of students at the advanced undergraduate or graduate level. Since students come to this subject with diverse backgrounds that may not include extensive knowledge of statistics or databases, our book requires minimal prerequisites: no database knowledge is needed and we assume only a modest background in statistics or mathematics. To this end, the book was designed to be as self-contained as possible. Necessary material from statistics, linear algebra, and machine learning is either integrated into the body of the text, or for some advanced topics, covered in the appendices.
Since the chapters covering major data mining topics are self-contained, the order in which topics can be covered is quite flexible. The core material is covered in Chapters 2, 4, 6, 8, and 10. Although the introductory data chapter (2) should be covered first, the basic classification, association analysis, and clustering chapters (4, 6, and 8, respectively) can be covered in any order. Because of the relationship of anomaly detection (10) to classification (4) and clustering (8), these chapters should precede Chapter 10. Various topics can be selected from the advanced classification, association analysis, and clustering chapters (5, 7, and 9, respectively) to fit the schedule and interests of the instructor and students. We also advise that the lectures be augmented by projects or practical exercises in data mining. Although they are time consuming, such hands-on assignments greatly enhance the value of the course.
Support Materials  The supplements for the book are available at Addison-Wesley's Web site www.aw.com/cssupport. Support materials available to all readers of this book include:

- PowerPoint lecture slides
- Suggestions for student projects
- Data mining resources such as data mining algorithms and data sets
- On-line tutorials that give step-by-step examples for selected data mining techniques described in the book using actual data sets and data analysis software
Additional support materials, including solutions to exercises, are available only to instructors adopting this textbook for classroom use. Please contact your school's Addison-Wesley representative for information on obtaining access to this material. Comments and suggestions, as well as reports of errors, can be sent to the authors through [email protected].
Acknowledgments  Many people contributed to this book. We begin by acknowledging our families, to whom this book is dedicated. Without their patience and support, this project would have been impossible.
We would like to thank the current and former students of our data mining groups at the University of Minnesota and Michigan State for their contributions. Eui-Hong (Sam) Han and Mahesh Joshi helped with the initial data mining classes. Some of the exercises and presentation slides that they created can be found in the book and its accompanying slides. Students in our data mining groups who provided comments on drafts of the book or who contributed in other ways include Shyam Boriah, Haibin Cheng, Varun Chandola, Eric Eilertson, Levent Ertoz, Jing Gao, Rohit Gupta, Sridhar Iyer, Jung-Eun Lee, Benjamin Mayer, Aysel Ozgur, Uygar Oztekin, Gaurav Pandey, Kashif Riaz, Jerry Scripps, Gyorgy Simon, Hui Xiong, Jieping Ye, and Pusheng Zhang. We would also like to thank the students of our data mining classes at the University of Minnesota and Michigan State University who worked with early drafts of the book and provided invaluable feedback. We specifically note the helpful suggestions of Bernardo Craemer, Arifin Ruslim, Jamshid Vayghan, and Yu Wei.
Joydeep Ghosh (University of Texas) and Sanjay Ranka (University of Florida) class tested early versions of the book. We also received many useful suggestions directly from the following UT students: Pankaj Adhikari, Rajiv Bhatia, Frederic Bosche, Arindam Chakraborty, Meghana Deodhar, Chris Everson, David Gardner, Saad Godil, Todd Hay, Clint Jones, Ajay Joshi, Joonsoo Lee, Yue Luo, Anuj Nanavati, Tyler Olsen, Sunyoung Park, Aashish Phansalkar, Geoff Prewett, Michael Ryoo, Daryl Shannon, and Mei Yang.
Ronald Kostoff (ONR) read an early version of the clustering chapter and offered numerous suggestions. George Karypis provided invaluable LaTeX assistance in creating an author index. Irene Moulitsas also provided assistance with LaTeX and reviewed some of the appendices. Musetta Steinbach was very helpful in finding errors in the figures.
We would like to acknowledge our colleagues at the University of Minnesota and Michigan State who have helped create a positive environment for data mining research. They include Dan Boley, Joyce Chai, Anil Jain, Ravi
Janardan, Rong Jin, George Karypis, Haesun Park, William F. Punch, Shashi Shekhar, and Jaideep Srivastava. The collaborators on our many data mining projects, who also have our gratitude, include Ramesh Agrawal, Steve Cannon, Piet C. de Groen, Fran Hill, Yongdae Kim, Steve Klooster, Kerry Long, Nihar Mahapatra, Chris Potter, Jonathan Shapiro, Kevin Silverstein, Nevin Young, and Zhi-Li Zhang.
The departments of Computer Science and Engineering at the University of Minnesota and Michigan State University provided computing resources and a supportive environment for this project. ARDA, ARL, ARO, DOE, NASA, and NSF provided research support for Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. In particular, Kamal Abdali, Dick Brackney, Jagdish Chandra, Joe Coughlan, Michael Coyle, Stephen Davis, Frederica Darema, Richard Hirsch, Chandrika Kamath, Raju Namburu, N. Radhakrishnan, James Sidoran, Bhavani Thuraisingham, Walt Tiernin, Maria Zemankova, and Xiaodong Zhang have been supportive of our research in data mining and high-performance computing.
It was a pleasure working with the helpful staff at Pearson Education. In particular, we would like to thank Michelle Brown, Matt Goldstein, Katherine Harutunian, Marilyn Lloyd, Kathy Smith, and Joyce Wells. We would also like to thank George Nichols, who helped with the art work, and Paul Anagnostopoulos, who provided LaTeX support.
We are grateful to the following Pearson reviewers: Chien-Chung Chan (University of Akron), Zhengxin Chen (University of Nebraska at Omaha), Chris Clifton (Purdue University), Joydeep Ghosh (University of Texas, Austin), Nazli Goharian (Illinois Institute of Technology), J. Michael Hardin (University of Alabama), James Hearne (Western Washington University), Hillol Kargupta (University of Maryland, Baltimore County and Agnik, LLC), Eamonn Keogh (University of California-Riverside), Bing Liu (University of Illinois at Chicago), Mariofanna Milanova (University of Arkansas at Little Rock), Srinivasan Parthasarathy (Ohio State University), Zbigniew W. Ras (University of North Carolina at Charlotte), Xintao Wu (University of North Carolina at Charlotte), and Mohammed J. Zaki (Rensselaer Polytechnic Institute).
Contents

Preface

1 Introduction
  1.1 What Is Data Mining?
  1.2 Motivating Challenges
  1.3 The Origins of Data Mining
  1.4 Data Mining Tasks
  1.5 Scope and Organization of the Book
  1.6 Bibliographic Notes
  1.7 Exercises

2 Data
  2.1 Types of Data
    2.1.1 Attributes and Measurement
    2.1.2 Types of Data Sets
  2.2 Data Quality
    2.2.1 Measurement and Data Collection Issues
    2.2.2 Issues Related to Applications
  2.3 Data Preprocessing
    2.3.1 Aggregation
    2.3.2 Sampling
    2.3.3 Dimensionality Reduction
    2.3.4 Feature Subset Selection
    2.3.5 Feature Creation
    2.3.6 Discretization and Binarization
    2.3.7 Variable Transformation
  2.4 Measures of Similarity and Dissimilarity
    2.4.1 Basics
    2.4.2 Similarity and Dissimilarity between Simple Attributes
    2.4.3 Dissimilarities between Data Objects
    2.4.4 Similarities between Data Objects
    2.4.5 Examples of Proximity Measures
    2.4.6 Issues in Proximity Calculation
    2.4.7 Selecting the Right Proximity Measure
  2.5 Bibliographic Notes
  2.6 Exercises

3 Exploring Data
  3.1 The Iris Data Set
  3.2 Summary Statistics
    3.2.1 Frequencies and the Mode
    3.2.2 Percentiles
    3.2.3 Measures of Location: Mean and Median
    3.2.4 Measures of Spread: Range and Variance
    3.2.5 Multivariate Summary Statistics
    3.2.6 Other Ways to Summarize the Data
  3.3 Visualization
    3.3.1 Motivations for Visualization
    3.3.2 General Concepts
    3.3.3 Techniques
    3.3.4 Visualizing Higher-Dimensional Data
    3.3.5 Do's and Don'ts
  3.4 OLAP and Multidimensional Data Analysis
    3.4.1 Representing Iris Data as a Multidimensional Array
    3.4.2 Multidimensional Data: The General Case
    3.4.3 Analyzing Multidimensional Data
    3.4.4 Final Comments on Multidimensional Data Analysis
  3.5 Bibliographic Notes
  3.6 Exercises

4 Classification: Basic Concepts, Decision Trees, and Model Evaluation
  4.1 Preliminaries
  4.2 General Approach to Solving a Classification Problem
  4.3 Decision Tree Induction
    4.3.1 How a Decision Tree Works
    4.3.2 How to Build a Decision Tree
    4.3.3 Methods for Expressing Attribute Test Conditions
    4.3.4 Measures for Selecting the Best Split
    4.3.5 Algorithm for Decision Tree Induction
    4.3.6 An Example: Web Robot Detection
    4.3.7 Characteristics of Decision Tree Induction
  4.4 Model Overfitting
    4.4.1 Overfitting Due to Presence of Noise
    4.4.2 Overfitting Due to Lack of Representative Samples
    4.4.3 Overfitting and the Multiple Comparison Procedure
    4.4.4 Estimation of Generalization Errors
    4.4.5 Handling Overfitting in Decision Tree Induction
  4.5 Evaluating the Performance of a Classifier
    4.5.1 Holdout Method
    4.5.2 Random Subsampling
    4.5.3 Cross-Validation
    4.5.4 Bootstrap
  4.6 Methods for Comparing Classifiers
    4.6.1 Estimating a Confidence Interval for Accuracy
    4.6.2 Comparing the Performance of Two Models
    4.6.3 Comparing the Performance of Two Classifiers
  4.7 Bibliographic Notes
  4.8 Exercises

5 Classification: Alternative Techniques
  5.1 Rule-Based Classifier
    5.1.1 How a Rule-Based Classifier Works
    5.1.2 Rule-Ordering Schemes
    5.1.3 How to Build a Rule-Based Classifier
    5.1.4 Direct Methods for Rule Extraction
    5.1.5 Indirect Methods for Rule Extraction
    5.1.6 Characteristics of Rule-Based Classifiers
  5.2 Nearest-Neighbor Classifiers
    5.2.1 Algorithm
    5.2.2 Characteristics of Nearest-Neighbor Classifiers
  5.3 Bayesian Classifiers
    5.3.1 Bayes Theorem
    5.3.2 Using the Bayes Theorem for Classification
    5.3.3 Naive Bayes Classifier
    5.3.4 Bayes Error Rate
    5.3.5 Bayesian Belief Networks
  5.4 Artificial Neural Network (ANN)
    5.4.1 Perceptron
    5.4.2 Multilayer Artificial Neural Network
    5.4.3 Characteristics of ANN
  5.5 Support Vector Machine (SVM)
    5.5.1 Maximum Margin Hyperplanes
    5.5.2 Linear SVM: Separable Case
    5.5.3 Linear SVM: Nonseparable Case
    5.5.4 Nonlinear SVM
    5.5.5 Characteristics of SVM
  5.6 Ensemble Methods
    5.6.1 Rationale for Ensemble Method
    5.6.2 Methods for Constructing an Ensemble Classifier
    5.6.3 Bias-Variance Decomposition
    5.6.4 Bagging
    5.6.5 Boosting
    5.6.6 Random Forests
    5.6.7 Empirical Comparison among Ensemble Methods
  5.7 Class Imbalance Problem
    5.7.1 Alternative Metrics
    5.7.2 The Receiver Operating Characteristic Curve
    5.7.3 Cost-Sensitive Learning
    5.7.4 Sampling-Based Approaches
  5.8 Multiclass Problem
  5.9 Bibliographic Notes
  5.10 Exercises
6 Association Analysis: Basic Concepts and Algorithms
  6.1 Problem Definition
  6.2 Frequent Itemset Generation
    6.2.1 The Apriori Principle
    6.2.2 Frequent Itemset Generation in the Apriori Algorithm
    6.2.3 Candidate Generation and Pruning
    6.2.4 Support Counting
    6.2.5 Computational Complexity
  6.3 Rule Generation
    6.3.1 Confidence-Based Pruning
    6.3.2 Rule Generation in Apriori Algorithm
    6.3.3 An Example: Congressional Voting Records
  6.4 Compact Representation of Frequent Itemsets
    6.4.1 Maximal Frequent Itemsets
    6.4.2 Closed Frequent Itemsets
  6.5 Alternative Methods for Generating Frequent Itemsets
  6.6 FP-Growth Algorithm
    6.6.1 FP-Tree Representation
    6.6.2 Frequent Itemset Generation in FP-Growth Algorithm
  6.7 Evaluation of Association Patterns
    6.7.1 Objective Measures of Interestingness
    6.7.2 Measures beyond Pairs of Binary Variables
    6.7.3 Simpson's Paradox
  6.8 Effect of Skewed Support Distribution
  6.9 Bibliographic Notes
  6.10 Exercises

7 Association Analysis: Advanced Concepts
  7.1 Handling Categorical Attributes
  7.2 Handling Continuous Attributes
    7.2.1 Discretization-Based Methods
    7.2.2 Statistics-Based Methods
    7.2.3 Non-discretization Methods
  7.3 Handling a Concept Hierarchy
  7.4 Sequential Patterns
    7.4.1 Problem Formulation
    7.4.2 Sequential Pattern Discovery
    7.4.3 Timing Constraints
    7.4.4 Alternative Counting Schemes
  7.5 Subgraph Patterns
    7.5.1 Graphs and Subgraphs
    7.5.2 Frequent Subgraph Mining
    7.5.3 Apriori-like Method
    7.5.4 Candidate Generation
    7.5.5 Candidate Pruning
    7.5.6 Support Counting
  7.6 Infrequent Patterns
    7.6.1 Negative Patterns
    7.6.2 Negatively Correlated Patterns
    7.6.3 Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns
    7.6.4 Techniques for Mining Interesting Infrequent Patterns
    7.6.5 Techniques Based on Mining Negative Patterns
    7.6.6 Techniques Based on Support Expectation
  7.7 Bibliographic Notes
  7.8 Exercises
8 Cluster Analysis: Basic Concepts and Algorithms
  8.1 Overview
    8.1.1 What Is Cluster Analysis?
    8.1.2 Different Types of Clusterings
    8.1.3 Different Types of Clusters
  8.2 K-means
    8.2.1 The Basic K-means Algorithm
    8.2.2 K-means: Additional Issues
    8.2.3 Bisecting K-means
    8.2.4 K-means and Different Types of Clusters
    8.2.5 Strengths and Weaknesses
    8.2.6 K-means as an Optimization Problem
  8.3 Agglomerative Hierarchical Clustering
    8.3.1 Basic Agglomerative Hierarchical Clustering Algorithm
    8.3.2 Specific Techniques
    8.3.3 The Lance-Williams Formula for Cluster Proximity
    8.3.4 Key Issues in Hierarchical Clustering
    8.3.5 Strengths and Weaknesses
  8.4 DBSCAN
    8.4.1 Traditional Density: Center-Based Approach
    8.4.2 The DBSCAN Algorithm
    8.4.3 Strengths and Weaknesses
  8.5 Cluster Evaluation
    8.5.1 Overview
    8.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation
    8.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix
    8.5.4 Unsupervised Evaluation of Hierarchical Clustering
    8.5.5 Determining the Correct Number of Clusters
    8.5.6 Clustering Tendency
    8.5.7 Supervised Measures of Cluster Validity
    8.5.8 Assessing the Significance of Cluster Validity Measures
  8.6 Bibliographic Notes
  8.7 Exercises

9 Cluster Analysis: Additional Issues and Algorithms
  9.1 Characteristics of Data, Clusters, and Clustering Algorithms
    9.1.1 Example: Comparing K-means and DBSCAN
    9.1.2 Data Characteristics
    9.1.3 Cluster Characteristics
    9.1.4 General Characteristics of Clustering Algorithms
  9.2 Prototype-Based Clustering
    9.2.1 Fuzzy Clustering
    9.2.2 Clustering Using Mixture Models
    9.2.3 Self-Organizing Maps (SOM)
  9.3 Density-Based Clustering
    9.3.1 Grid-Based Clustering
    9.3.2 Subspace Clustering
    9.3.3 DENCLUE: A Kernel-Based Scheme for Density-Based Clustering
  9.4 Graph-Based Clustering
    9.4.1 Sparsification
    9.4.2 Minimum Spanning Tree (MST) Clustering
    9.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS
    9.4.4 Chameleon: Hierarchical Clustering with Dynamic Modeling
    9.4.5 Shared Nearest Neighbor Similarity
    9.4.6 The Jarvis-Patrick Clustering Algorithm
    9.4.7 SNN Density
    9.4.8 SNN Density-Based Clustering
  9.5 Scalable Clustering Algorithms
    9.5.1 Scalability: General Issues and Approaches
    9.5.2 BIRCH
    9.5.3 CURE
  9.6 Which Clustering Algorithm?
  9.7 Bibliographic Notes
  9.8 Exercises
10 Anomaly Detection
  10.1 Preliminaries
    10.1.1 Causes of Anomalies
    10.1.2 Approaches to Anomaly Detection
    10.1.3 The Use of Class Labels
    10.1.4 Issues
  10.2 Statistical Approaches
    10.2.1 Detecting Outliers in a Univariate Normal Distribution
    10.2.2 Outliers in a Multivariate Normal Distribution
    10.2.3 A Mixture Model Approach for Anomaly Detection
    10.2.4 Strengths and Weaknesses
  10.3 Proximity-Based Outlier Detection
    10.3.1 Strengths and Weaknesses
  10.4 Density-Based Outlier Detection
    10.4.1 Detection of Outliers Using Relative Density
    10.4.2 Strengths and Weaknesses
  10.5 Clustering-Based Techniques
    10.5.1 Assessing the Extent to Which an Object Belongs to a Cluster
    10.5.2 Impact of Outliers on the Initial Clustering
    10.5.3 The Number of Clusters to Use
    10.5.4 Strengths and Weaknesses
  10.6 Bibliographic Notes
  10.7 Exercises

Appendix A Linear Algebra
  A.1 Vectors
    A.1.1 Definition
    A.1.2 Vector Addition and Multiplication by a Scalar
    A.1.3 Vector Spaces
    A.1.4 The Dot Product, Orthogonality, and Orthogonal Projections
    A.1.5 Vectors and Data Analysis
  A.2 Matrices
    A.2.1 Matrices: Definitions
    A.2.2 Matrices: Addition and Multiplication by a Scalar
    A.2.3 Matrices: Multiplication
    A.2.4 Linear Transformations and Inverse Matrices
    A.2.5 Eigenvalue and Singular Value Decomposition
    A.2.6 Matrices and Data Analysis
  A.3 Bibliographic Notes

Appendix B Dimensionality Reduction
  B.1 PCA and SVD
    B.1.1 Principal Components Analysis (PCA)
    B.1.2 SVD
  B.2 Other Dimensionality Reduction Techniques
    B.2.1 Factor Analysis
    B.2.2 Locally Linear Embedding (LLE)
    B.2.3 Multidimensional Scaling, FastMap, and ISOMAP
    B.2.4 Common Issues
  B.3 Bibliographic Notes

Appendix C Probability and Statistics
  C.1 Probability
    C.1.1 Expected Values
  C.2 Statistics
    C.2.1 Point Estimation
    C.2.2 Central Limit Theorem
    C.2.3 Interval Estimation
  C.3 Hypothesis Testing

Appendix D Regression
  D.1 Preliminaries
  D.2 Simple Linear Regression
    D.2.1 Least Square Method
    D.2.2 Analyzing Regression Errors
    D.2.3 Analyzing Goodness of Fit
  D.3 Multivariate Linear Regression
  D.4 Alternative Least-Square Regression Methods

Appendix E Optimization
  E.1 Unconstrained Optimization
    E.1.1 Numerical Methods
  E.2 Constrained Optimization
    E.2.1 Equality Constraints
    E.2.2 Inequality Constraints

Author Index
Subject Index
Copyright Permissions
1 Introduction
Rapid advances in data collection and storage technology have enabled organizations to accumulate vast amounts of data. However, extracting useful information has proven extremely challenging. Often, traditional data analysis tools and techniques cannot be used because of the massive size of a data set. Sometimes, the non-traditional nature of the data means that traditional approaches cannot be applied even if the data set is relatively small. In other situations, the questions that need to be answered cannot be addressed using existing data analysis techniques, and thus, new methods need to be developed.
Data mining is a technology that blends traditional data analysis methods with sophisticated algorithms for processing large volumes of data. It has also opened up exciting opportunities for exploring and analyzing new types of data and for analyzing old types of data in new ways. In this introductory chapter, we present an overview of data mining and outline the key topics to be covered in this book. We start with a description of some well-known applications that require new techniques for data analysis.
Business  Point-of-sale data collection (bar code scanners, radio frequency identification (RFID), and smart card technology) has allowed retailers to collect up-to-the-minute data about customer purchases at the checkout counters of their stores. Retailers can utilize this information, along with other business-critical data such as Web logs from e-commerce Web sites and customer service records from call centers, to help them better understand the needs of their customers and make more informed business decisions.
Data mining techniques can be used to support a wide range of business intelligence applications such as customer profiling, targeted marketing, workflow management, store layout, and fraud detection. It can also help retailers
answer important business questions such as "Who are the most profitable customers?" "What products can be cross-sold or up-sold?" and "What is the revenue outlook of the company for next year?" Some of these questions motivated the creation of association analysis (Chapters 6 and 7), a new data analysis technique.
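The kind of pattern association analysis looks for can be illustrated with a minimal brute-force sketch. The transactions and the 60% minimum-support threshold below are hypothetical, and this exhaustive pair count only stands in for the far more efficient algorithms of Chapter 6.

```python
from itertools import combinations

# Hypothetical market-basket transactions (one set of items per checkout).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

# Brute force: report every pair of items bought together in >= 60% of baskets.
items = sorted(set().union(*transactions))
frequent_pairs = [
    pair for pair in combinations(items, 2)
    if support(pair, transactions) >= 0.6
]
print(frequent_pairs)
# [('beer', 'diapers'), ('bread', 'diapers'), ('bread', 'milk'), ('diapers', 'milk')]
```

A rule such as "customers who buy diapers also tend to buy beer" would then be derived from a frequent pair like ('beer', 'diapers'), which here occurs in 3 of the 5 baskets.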
Medicine, Science, and Engineering  Researchers in medicine, science, and engineering are rapidly accumulating data that is key to important new discoveries. For example, as an important step toward improving our understanding of the Earth's climate system, NASA has deployed a series of Earth-orbiting satellites that continuously generate global observations of the land surface, oceans, and atmosphere. However, because of the size and spatio-temporal nature of the data, traditional methods are often not suitable for analyzing these data sets. Techniques developed in data mining can aid Earth scientists in answering questions such as "What is the relationship of the frequency and intensity of ecosystem disturbances such as droughts and hurricanes to global warming?" "How is land surface precipitation and temperature affected by ocean surface temperature?" and "How well can we predict the beginning and end of the growing season for a region?"
As another example, researchers in molecular biology hope to use the large amounts of genomic data currently being gathered to better understand the structure and function of genes. In the past, traditional methods in molecular biology allowed scientists to study only a few genes at a time in a given experiment. Recent breakthroughs in microarray technology have enabled scientists to compare the behavior of thousands of genes under various situations. Such comparisons can help determine the function of each gene and perhaps isolate the genes responsible for certain diseases. However, the noisy and high-dimensional nature of data requires new types of data analysis. In addition to analyzing gene array data, data mining can also be used to address other important biological challenges such as protein structure prediction, multiple sequence alignment, the modeling of biochemical pathways, and phylogenetics.
1.1 What Is Data Mining?

Data mining is the process of automatically discovering useful information in large data repositories. Data mining techniques are deployed to scour large databases in order to find novel and useful patterns that might otherwise remain unknown. They also provide capabilities to predict the outcome of a
future observation, such as predicting whether a newly arrived customer will spend more than $100 at a department store.

Not all information discovery tasks are considered to be data mining. For example, looking up individual records using a database management system or finding particular Web pages via a query to an Internet search engine are tasks related to the area of information retrieval. Although such tasks are important and may involve the use of sophisticated algorithms and data structures, they rely on traditional computer science techniques and obvious features of the data to create index structures for efficiently organizing and retrieving information. Nonetheless, data mining techniques have been used to enhance information retrieval systems.
Data Mining and Knowledge Discovery

Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall process of converting raw data into useful information, as shown in Figure 1.1. This process consists of a series of transformation steps, from data preprocessing to postprocessing of data mining results.

Figure 1.1. The process of knowledge discovery in databases (KDD).
The input data can be stored in a variety of formats (flat files, spreadsheets, or relational tables) and may reside in a centralized data repository or be distributed across multiple sites. The purpose of preprocessing is to transform the raw input data into an appropriate format for subsequent analysis. The steps involved in data preprocessing include fusing data from multiple sources, cleaning data to remove noise and duplicate observations, and selecting records and features that are relevant to the data mining task at hand. Because of the many ways data can be collected and stored, data
preprocessing is perhaps the most laborious and time-consuming step in the overall knowledge discovery process.
"Closing the loop" is the phrase often used to refer to the process of integrating data mining results into decision support systems. For example, in business applications, the insights offered by data mining results can be integrated with campaign management tools so that effective marketing promotions can be conducted and tested. Such integration requires a postprocessing step that ensures that only valid and useful results are incorporated into the decision support system. An example of postprocessing is visualization (see Chapter 3), which allows analysts to explore the data and the data mining results from a variety of viewpoints. Statistical measures or hypothesis testing methods can also be applied during postprocessing to eliminate spurious data mining results.
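The statistical screening step just described can be sketched as a multiple-testing correction: when thousands of candidate patterns are each tested for significance, the threshold must be tightened to avoid keeping spurious results. Bonferroni is used here as one common choice (the text does not prescribe a specific method), and the p-values are assumed to come from an earlier significance test on each pattern.

```python
# Illustrative sketch of a postprocessing filter for spurious patterns.
# Bonferroni correction is an assumption here, chosen for simplicity;
# the p-values are assumed to be produced by an earlier significance test.

def bonferroni_filter(p_values, alpha=0.05):
    """Return the indices of patterns whose p-value survives the
    Bonferroni-adjusted threshold alpha / (number of patterns tested)."""
    m = len(p_values)
    threshold = alpha / m
    return [i for i, p in enumerate(p_values) if p < threshold]

# Four candidate patterns were tested; with the adjusted threshold
# 0.05 / 4 = 0.0125, only the first survives.
print(bonferroni_filter([0.001, 0.04, 0.2, 0.8]))  # [0]
```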
1.2 Motivating Challenges

As mentioned earlier, traditional data analysis techniques have often encountered practical difficulties in meeting the challenges posed by new data sets. The following are some of the specific challenges that motivated the development of data mining.
Scalability  Because of advances in data generation and collection, data sets with sizes of gigabytes, terabytes, or even petabytes are becoming common. If data mining algorithms are to handle these massive data sets, then they must be scalable. Many data mining algorithms employ special search strategies to handle exponential search problems. Scalability may also require the implementation of novel data structures to access individual records in an efficient manner. For instance, out-of-core algorithms may be necessary when processing data sets that cannot fit into main memory. Scalability can also be improved by using sampling or developing parallel and distributed algorithms.
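The out-of-core idea mentioned above can be sketched as a one-pass computation that never holds the full data set in memory. This is a minimal illustration, not a technique from the book; the function name and the single-value-per-line file format in the comment are assumptions.

```python
# Sketch of an out-of-core computation: a running mean over values that
# arrive one at a time, so the data set never has to fit in main memory.

def streaming_mean(values):
    """Compute the mean of an iterable in a single pass with O(1) memory."""
    count, total = 0, 0.0
    for v in values:
        count += 1
        total += v
    return total / count if count else float("nan")

# In practice `values` would be a generator reading from disk, e.g.
#   values = (float(line) for line in open("huge_measurements.txt"))
# so only one record is in memory at a time.
print(streaming_mean([2.0, 4.0, 6.0, 8.0]))  # 5.0
```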
High Dimensionality  It is now common to encounter data sets with hundreds or thousands of attributes instead of the handful common a few decades ago. In bioinformatics, progress in microarray technology has produced gene expression data involving thousands of features. Data sets with temporal or spatial components also tend to have high dimensionality. For example, consider a data set that contains measurements of temperature at various locations. If the temperature measurements are taken repeatedly for an extended period, the number of dimensions (features) increases in proportion to
the number of measurements taken. Traditional data analysis techniques that were developed for low-dimensional data often do not work well for such high-dimensional data. Also, for some data analysis algorithms, the computational complexity increases rapidly as the dimensionality (the number of features) increases.
Heterogeneous and Complex Data  Traditional data analysis methods often deal with data sets containing attributes of the same type, either continuous or categorical. As the role of data mining in business, science, medicine, and other fields has grown, so has the need for techniques that can handle heterogeneous attributes. Recent years have also seen the emergence of more complex data objects. Examples of such non-traditional types of data include collections of Web pages containing semi-structured text and hyperlinks; DNA data with sequential and three-dimensional structure; and climate data that consists of time series measurements (temperature, pressure, etc.) at various locations on the Earth's surface. Techniques developed for mining such complex objects should take into consideration relationships in the data, such as temporal and spatial autocorrelation, graph connectivity, and parent-child relationships between the elements in semi-structured text and XML documents.
Data Ownership and Distribution  Sometimes, the data needed for an analysis is not stored in one location or owned by one organization. Instead, the data is geographically distributed among resources belonging to multiple entities. This requires the development of distributed data mining techniques. The key challenges faced by distributed data mining algorithms include (1) how to reduce the amount of communication needed to perform the distributed computation, (2) how to effectively consolidate the data mining results obtained from multiple sources, and (3) how to address data security issues.
Non-traditional Analysis  The traditional statistical approach is based on a hypothesize-and-test paradigm. In other words, a hypothesis is proposed, an experiment is designed to gather the data, and then the data is analyzed with respect to the hypothesis. Unfortunately, this process is extremely labor-intensive. Current data analysis tasks often require the generation and evaluation of thousands of hypotheses, and consequently, the development of some data mining techniques has been motivated by the desire to automate the process of hypothesis generation and evaluation. Furthermore, the data sets analyzed in data mining are typically not the result of a carefully designed
experiment and often represent opportunistic samples of the data, rather than random samples. Also, the data sets frequently involve non-traditional types of data and data distributions.
1.3 The Origins of Data Mining

Brought together by the goal of meeting the challenges of the previous section, researchers from different disciplines began to focus on developing more efficient and scalable tools that could handle diverse types of data. This work, which culminated in the field of data mining, built upon the methodology and algorithms that researchers had previously used. In particular, data mining draws upon ideas, such as (1) sampling, estimation, and hypothesis testing from statistics and (2) search algorithms, modeling techniques, and learning theories from artificial intelligence, pattern recognition, and machine learning. Data mining has also been quick to adopt ideas from other areas, including optimization, evolutionary computing, information theory, signal processing, visualization, and information retrieval.
A number of other areas also play key supporting roles. In particular, database systems are needed to provide support for efficient storage, indexing, and query processing. Techniques from high performance (parallel) computing are often important in addressing the massive size of some data sets. Distributed techniques can also help address the issue of size and are essential when the data cannot be gathered in one location. Figure 1.2 shows the relationship of data mining to other areas.

Figure 1.2. Data mining as a confluence of many disciplines.
1.4 Data Mining Tasks

Data mining tasks are generally divided into two major categories:

Predictive tasks. The objective of these tasks is to predict the value of a particular attribute based on the values of other attributes. The attribute to be predicted is commonly known as the target or dependent variable, while the attributes used for making the prediction are known as the explanatory or independent variables.

Descriptive tasks. Here, the objective is to derive patterns (correlations, trends, clusters, trajectories, and anomalies) that summarize the underlying relationships in data. Descriptive data mining tasks are often exploratory in nature and frequently require postprocessing techniques to validate and explain the results.
Figure 1.3 illustrates four of the core data mining tasks that are described in the remainder of this book.

Figure 1.3. Four of the core data mining tasks.
Predictive modeling refers to the task of building a model for the target variable as a function of the explanatory variables. There are two types of predictive modeling tasks: classification, which is used for discrete target variables, and regression, which is used for continuous target variables. For example, predicting whether a Web user will make a purchase at an online bookstore is a classification task because the target variable is binary-valued. On the other hand, forecasting the future price of a stock is a regression task because price is a continuous-valued attribute. The goal of both tasks is to learn a model that minimizes the error between the predicted and true values of the target variable. Predictive modeling can be used to identify customers that will respond to a marketing campaign, predict disturbances in the Earth's ecosystem, or judge whether a patient has a particular disease based on the results of medical tests.
Example 1.1 (Predicting the Type of a Flower). Consider the task of predicting the species of a flower based on the characteristics of the flower. In particular, consider classifying an Iris flower as to whether it belongs to one of the following three Iris species: Setosa, Versicolour, or Virginica. To perform this task, we need a data set containing the characteristics of various flowers of these three species. A data set with this type of information is the well-known Iris data set from the UCI Machine Learning Repository at http://www.ics.uci.edu/~mlearn. In addition to the species of a flower, this data set contains four other attributes: sepal width, sepal length, petal length, and petal width. (The Iris data set and its attributes are described further in Section 3.1.) Figure 1.4 shows a plot of petal width versus petal length for the 150 flowers in the Iris data set. Petal width is broken into the categories low, medium, and high, which correspond to the intervals [0, 0.75), [0.75, 1.75), [1.75, ∞), respectively. Also, petal length is broken into the categories low, medium, and high, which correspond to the intervals [0, 2.5), [2.5, 5), [5, ∞), respectively. Based on these categories of petal width and length, the following rules can be derived:

Petal width low and petal length low implies Setosa.
Petal width medium and petal length medium implies Versicolour.
Petal width high and petal length high implies Virginica.

While these rules do not classify all the flowers, they do a good (but not perfect) job of classifying most of the flowers. Note that flowers from the Setosa species are well separated from the Versicolour and Virginica species with respect to petal width and length, but the latter two species overlap somewhat with respect to these attributes. ■
Figure 1.4. Petal width versus petal length for 150 Iris flowers (Setosa, Versicolour, and Virginica).
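The three rules of Example 1.1 can be sketched as a tiny rule-based classifier. The interval cut points come from the text; the function names and the sample measurements below are illustrative assumptions, not rows from the actual Iris data set.

```python
# Illustrative sketch of the rule set from Example 1.1. The cut points
# 0.75/1.75 (petal width) and 2.5/5.0 (petal length) are the intervals
# given in the text; the test measurements below are made up.

def discretize(value, cut_low, cut_high):
    """Map a measurement to the category 'low', 'medium', or 'high'."""
    if value < cut_low:
        return "low"
    if value < cut_high:
        return "medium"
    return "high"

def classify_iris(petal_width, petal_length):
    width = discretize(petal_width, 0.75, 1.75)
    length = discretize(petal_length, 2.5, 5.0)
    if width == "low" and length == "low":
        return "Setosa"
    if width == "medium" and length == "medium":
        return "Versicolour"
    if width == "high" and length == "high":
        return "Virginica"
    return "unclassified"  # the rules do not cover every flower

print(classify_iris(0.2, 1.4))   # Setosa
print(classify_iris(1.3, 4.5))   # Versicolour
print(classify_iris(2.1, 5.8))   # Virginica
```

The fall-through return mirrors the text's observation that the rules do a good but not perfect job: flowers whose width and length categories disagree are left unclassified.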
Association analysis is used to discover patterns that describe strongly associated features in the data. The discovered patterns are typically represented in the form of implication rules or feature subsets. Because of the exponential size of its search space, the goal of association analysis is to extract the most interesting patterns in an efficient manner. Useful applications of association analysis include finding groups of genes that have related functionality, identifying Web pages that are accessed together, or understanding the relationships between different elements of Earth's climate system.
Example 1.2 (Market Basket Analysis). The transactions shown in Table 1.1 illustrate point-of-sale data collected at the checkout counters of a grocery store. Association analysis can be applied to find items that are frequently bought together by customers. For example, we may discover the rule {Diapers} → {Milk}, which suggests that customers who buy diapers also tend to buy milk. This type of rule can be used to identify potential cross-selling opportunities among related items. ■
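Using the transactions of Table 1.1, the rule {Diapers} → {Milk} from Example 1.2 can be scored with the support and confidence measures that Chapter 6 defines formally. This is an illustrative sketch; the function names are ours, and the transaction data is copied from the table.

```python
# Sketch of support and confidence for the rule {Diapers} -> {Milk},
# evaluated on the ten transactions of Table 1.1.

transactions = [
    {"Bread", "Butter", "Diapers", "Milk"},
    {"Coffee", "Sugar", "Cookies", "Salmon"},
    {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
    {"Bread", "Butter", "Salmon", "Chicken"},
    {"Eggs", "Bread", "Butter"},
    {"Salmon", "Diapers", "Milk"},
    {"Bread", "Tea", "Sugar", "Eggs"},
    {"Coffee", "Sugar", "Chicken", "Eggs"},
    {"Bread", "Diapers", "Milk", "Salt"},
    {"Tea", "Eggs", "Cookies", "Diapers", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"Diapers", "Milk"}))       # 0.5: half the baskets have both
print(confidence({"Diapers"}, {"Milk"}))  # 1.0: every diaper buyer also bought milk
```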
Cluster analysis seeks to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other
Table 1.1. Market basket data.

Transaction ID   Items
1                {Bread, Butter, Diapers, Milk}
2                {Coffee, Sugar, Cookies, Salmon}
3                {Bread, Butter, Coffee, Diapers, Milk, Eggs}
4                {Bread, Butter, Salmon, Chicken}
5                {Eggs, Bread, Butter}
6                {Salmon, Diapers, Milk}
7                {Bread, Tea, Sugar, Eggs}
8                {Coffee, Sugar, Chicken, Eggs}
9                {Bread, Diapers, Milk, Salt}
10               {Tea, Eggs, Cookies, Diapers, Milk}
than observations that belong to other clusters. Clustering has been used to group sets of related customers, find areas of the ocean that have a significant impact on the Earth's climate, and compress data.
Example 1.3 (Document Clustering). The collection of news articles shown in Table 1.2 can be grouped based on their respective topics. Each article is represented as a set of word-frequency pairs (w, c), where w is a word and c is the number of times the word appears in the article. There are two natural clusters in the data set. The first cluster consists of the first four articles, which correspond to news about the economy, while the second cluster contains the last four articles, which correspond to news about health care. A good clustering algorithm should be able to identify these two clusters based on the similarity between words that appear in the articles.
Table 1.2. Collection of news articles.

Article   Words
1         dollar: 1, industry: 4, country: 2, loan: 3, deal: 2, government: 2
2         machinery: 2, labor: 3, market: 4, industry: 2, work: 3, country: 1
3         job: 5, inflation: 3, rise: 2, jobless: 2, market: 3, country: 2, index: 3
4         domestic: 3, forecast: 2, gain: 1, market: 2, sale: 3, price: 2
5         patient: 4, symptom: 2, drug: 3, health: 2, clinic: 2, doctor: 2
6         pharmaceutical: 2, company: 3, drug: 2, vaccine: 1, flu: 3
7         death: 2, cancer: 4, drug: 3, public: 4, health: 3, director: 2
8         medical: 2, cost: 3, increase: 2, patient: 2, health: 3, care: 1
Anomaly detection is the task of identifying observations whose characteristics are significantly different from the rest of the data. Such observations are known as anomalies or outliers. The goal of an anomaly detection algorithm is to discover the real anomalies and avoid falsely labeling normal objects as anomalous. In other words, a good anomaly detector must have a high detection rate and a low false alarm rate. Applications of anomaly detection include the detection of fraud, network intrusions, unusual patterns of disease, and ecosystem disturbances.
Example 1.4 (Credit Card Fraud Detection). A credit card company records the transactions made by every credit card holder, along with personal information such as credit limit, age, annual income, and address. Since the number of fraudulent cases is relatively small compared to the number of legitimate transactions, anomaly detection techniques can be applied to build a profile of legitimate transactions for the users. When a new transaction arrives, it is compared against the profile of the user. If the characteristics of the transaction are very different from the previously created profile, then the transaction is flagged as potentially fraudulent. ■
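The profile-based check of Example 1.4 can be sketched as follows. This is a minimal illustration under stated assumptions: the profile is just the mean and standard deviation of past transaction amounts, the 3-standard-deviation cutoff is hypothetical, and real systems use far richer profiles than amount alone.

```python
from statistics import mean, stdev

# Sketch of comparing a new transaction against a user's profile.
# Profile = (mean, stdev) of historical amounts; both the history and
# the cutoff below are made-up illustrative values.

def is_anomalous(history, new_amount, cutoff=3.0):
    """Flag a transaction whose amount lies more than `cutoff` standard
    deviations from the user's historical mean."""
    mu, sigma = mean(history), stdev(history)
    return abs(new_amount - mu) > cutoff * sigma

history = [25.0, 40.0, 31.0, 28.0, 35.0, 30.0]  # typical past purchases
print(is_anomalous(history, 32.0))    # False: consistent with the profile
print(is_anomalous(history, 450.0))   # True: flagged as potentially fraudulent
```

A low cutoff raises the detection rate but also the false alarm rate; tuning this trade-off is exactly the detection-rate versus false-alarm balance described above.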
1.5 Scope and Organization of the Book

This book introduces the major principles and techniques used in data mining from an algorithmic perspective. A study of these principles and techniques is essential for developing a better understanding of how data mining technology can be applied to various kinds of data. This book also serves as a starting point for readers who are interested in doing research in this field.
We begin the technical discussion of this book with a chapter on data (Chapter 2), which discusses the basic types of data, data quality, preprocessing techniques, and measures of similarity and dissimilarity. Although this material can be covered quickly, it provides an essential foundation for data analysis.
Chapter 3, on data exploration, discusses summary statistics, visualization techniques, and On-Line Analytical Processing (OLAP). These techniques provide the means for quickly gaining insight into a data set.
Chapters 4 and 5 cover classification. Chapter 4 provides a foundation by discussing decision tree classifiers and several issues that are important to all classification: overfitting, performance evaluation, and the comparison of different classification models. Using this foundation, Chapter 5 describes a number of other important classification techniques: rule-based systems, nearest-neighbor classifiers, Bayesian classifiers, artificial neural networks, support vector machines, and ensemble classifiers, which are collections of classifiers. The multiclass and imbalanced class problems are also discussed. These topics can be covered independently.
Association analysis is explored in Chapters 6 and 7. Chapter 6 describes the basics of association analysis: frequent itemsets, association rules, and some of the algorithms used to generate them. Specific types of frequent itemsets (maximal, closed, and hyperclique) that are important for data mining are also discussed, and the chapter concludes with a discussion of evaluation measures for association analysis. Chapter 7 considers a variety of more advanced topics, including how association analysis can be applied to categorical and continuous data or to data that has a concept hierarchy. (A concept hierarchy is a hierarchical categorization of objects, e.g., store items, clothing, shoes, sneakers.) This chapter also describes how association analysis can be extended to find sequential patterns (patterns involving order), patterns in graphs, and negative relationships (if one item is present, then the other is not).
Cluster analysis is discussed in Chapters 8 and 9. Chapter 8 first describes the different types of clusters and then presents three specific clustering techniques: K-means, agglomerative hierarchical clustering, and DBSCAN. This is followed by a discussion of techniques for validating the results of a clustering algorithm. Additional clustering concepts and techniques are explored in Chapter 9, including fuzzy and probabilistic clustering, Self-Organizing Maps (SOM), graph-based clustering, and density-based clustering. There is also a discussion of scalability issues and factors to consider when selecting a clustering algorithm.
The last chapter, Chapter 10, is on anomaly detection. After some basic definitions, several different types of anomaly detection are considered: statistical, distance-based, density-based, and clustering-based. Appendices A through E give a brief review of important topics that are used in portions of the book: linear algebra, dimensionality reduction, statistics, regression, and optimization.
The subject of data mining, while relatively young compared to statistics or machine learning, is already too large to cover in a single book. Selected references to topics that are only briefly covered, such as data quality, are provided in the bibliographic notes of the appropriate chapter. References to topics not covered in this book, such as data mining for streams and privacy-preserving data mining, are provided in the bibliographic notes of this chapter.
1.6 Bibliographic Notes

The topic of data mining has inspired many textbooks. Introductory textbooks include those by Dunham [10], Han and Kamber [21], Hand et al. [23], and Roiger and Geatz [36]. Data mining books with a stronger emphasis on business applications include the works by Berry and Linoff [2], Pyle [34], and Parr Rud [33]. Books with an emphasis on statistical learning include those by Cherkassky and Mulier [6], and Hastie et al. [24]. Some books with an emphasis on machine learning or pattern recognition are those by Duda et al. [9], Kantardzic [25], Mitchell [31], Webb [41], and Witten and Frank [42]. There are also some more specialized books: Chakrabarti [4] (Web mining), Fayyad et al. [13] (collection of early articles on data mining), Fayyad et al. [11] (visualization), Grossman et al. [18] (science and engineering), Kargupta and Chan [26] (distributed data mining), Wang et al. [40] (bioinformatics), and Zaki and Ho [44] (parallel data mining).
There are several conferences related to data mining. Some of the main conferences dedicated to this field include the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), the IEEE International Conference on Data Mining (ICDM), the SIAM International Conference on Data Mining (SDM), the European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), and the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Data mining papers can also be found in other major conferences such as the ACM SIGMOD/PODS conference, the International Conference on Very Large Data Bases (VLDB), the Conference on Information and Knowledge Management (CIKM), the International Conference on Data Engineering (ICDE), the International Conference on Machine Learning (ICML), and the National Conference on Artificial Intelligence (AAAI).
Journal publications on data mining include IEEE Transactions on Knowledge and Data Engineering, Data Mining and Knowledge Discovery, Knowledge and Information Systems, Intelligent Data Analysis, Information Systems, and the Journal of Intelligent Information Systems.
There have been a number of general articles on data mining that define the field or its relationship to other fields, particularly statistics. Fayyad et al. [12] describe data mining and how it fits into the total knowledge discovery process. Chen et al. [5] give a database perspective on data mining. Ramakrishnan and Grama [35] provide a general discussion of data mining and present several viewpoints. Hand [22] describes how data mining differs from statistics, as does Friedman [14]. Lambert [29] explores the use of statistics for large data sets and provides some comments on the respective roles of data mining and statistics.
Glymour et al. [16] consider the lessons that statistics may have for data mining. Smyth et al. [38] describe how the evolution of data mining is being driven by new types of data and applications, such as those involving streams, graphs, and text. Emerging applications in data mining are considered by Han et al. [20], and Smyth [37] describes some research challenges in data mining. A discussion of how developments in data mining research can be turned into practical tools is given by Wu et al. [43]. Data mining standards are the