introduction to data mining - steinbach.pdf
TRANSCRIPT
-
7/23/2019 Introduction to data mining - Steinbach.pdf
1/790
PANG-NING TAN
Michigan State University

MICHAEL STEINBACH
University of Minnesota

VIPIN KUMAR
University of Minnesota and Army High Performance Computing Research Center
Boston  San Francisco  New York  London  Toronto  Sydney  Tokyo  Singapore  Madrid  Mexico City  Munich  Paris  Cape Town  Hong Kong  Montreal
If you purchased this book within the United States or Canada you should be aware that it has been wrongfully imported without the approval of the Publisher or the Author.
Acquisitions Editor  Matt Goldstein
Project Editor  Katherine Harutunian
Production Supervisor  Marilyn Lloyd
Production Services  Paul C. Anagnostopoulos of Windfall Software
Marketing Manager  Michelle Brown
Copyeditor  Kathy Smith
Proofreader  Jennifer McClain
Technical Illustration  George Nichols
Cover Design Supervisor  Joyce Cosentino Wells
Cover Design  Night & Day Design
Cover Image  © 2005 Rob Casey/Brand X Pictures
Prepress and Manufacturing  Caroline Fell
Printer  Hamilton Printing
Access the latest information about Addison-Wesley titles from our World Wide Web site: http://www.aw-bc.com/computing
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial caps or all caps.
The programs and applications presented in this book have been included for their instructional value. They have been tested with care, but are not guaranteed for any particular purpose. The publisher does not offer any warranties or representations, nor does it accept any liabilities with respect to the programs or applications.
Copyright © 2006 by Pearson Education, Inc.

For information on obtaining permission for use of material in this work, please submit a written request to Pearson Education, Inc., Rights and Contract Department, 75 Arlington Street, Suite 300, Boston, MA 02116 or fax your request to (617) 848-7047.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or any other media embodiments now known or hereafter to become known, without the prior written permission of the publisher. Printed in the United States of America.

ISBN 0-321-42052-7

2 3 4 5 6 7 8 9 10-HAM-08 07 06
our families
Preface

Advances in data generation and collection are producing data sets of massive size in commerce and a variety of scientific disciplines. Data warehouses store details of the sales and operations of businesses, Earth-orbiting satellites beam high-resolution images and sensor data back to Earth, and genomics experiments generate sequence, structural, and functional data for an increasing number of organisms. The ease with which data can now be gathered and stored has created a new attitude toward data analysis: Gather whatever data you can whenever and wherever possible. It has become an article of faith that the gathered data will have value, either for the purpose that initially motivated its collection or for purposes not yet envisioned.
The field of data mining grew out of the limitations of current data analysis techniques in handling the challenges posed by these new types of data sets. Data mining does not replace other areas of data analysis, but rather takes them as the foundation for much of its work. While some areas of data mining, such as association analysis, are unique to the field, other areas, such as clustering, classification, and anomaly detection, build upon a long history of work on these topics in other fields. Indeed, the willingness of data mining researchers to draw upon existing techniques has contributed to the strength and breadth of the field, as well as to its rapid growth.
Another strength of the field has been its emphasis on collaboration with researchers in other areas. The challenges of analyzing new types of data cannot be met by simply applying data analysis techniques in isolation from those who understand the data and the domain in which it resides. Often, skill in building multidisciplinary teams has been as responsible for the success of data mining projects as the creation of new and innovative algorithms. Just as, historically, many developments in statistics were driven by the needs of agriculture, industry, medicine, and business, many of the developments in data mining are being driven by the needs of those same fields.
This book began as a set of notes and lecture slides for a data mining course that has been offered at the University of Minnesota since Spring 1998 to upper-division undergraduate and graduate students. Presentation slides
and exercises developed in these offerings grew with time and served as a basis for the book. A survey of clustering techniques in data mining, originally written in preparation for research in the area, served as a starting point for one of the chapters in the book. Over time, the clustering chapter was joined by chapters on data, classification, association analysis, and anomaly detection. The book in its current form has been class tested at the home institutions of the authors (the University of Minnesota and Michigan State University), as well as several other universities.
A number of data mining books appeared in the meantime, but were not completely satisfactory for our students: primarily graduate and undergraduate students in computer science, but including students from industry and a wide variety of other disciplines. Their mathematical and computer backgrounds varied considerably, but they shared a common goal: to learn about data mining as directly as possible in order to quickly apply it to problems in their own domains. Thus, texts with extensive mathematical or statistical prerequisites were unappealing to many of them, as were texts that required a substantial database background. The book that evolved in response to these students' needs focuses as directly as possible on the key concepts of data mining by illustrating them with examples, simple descriptions of key algorithms, and exercises.
Overview  Specifically, this book provides a comprehensive introduction to data mining and is designed to be accessible and useful to students, instructors, researchers, and professionals. Areas covered include data preprocessing, visualization, predictive modeling, association analysis, clustering, and anomaly detection. The goal is to present fundamental concepts and algorithms for each topic, thus providing the reader with the necessary background for the application of data mining to real problems. In addition, this book also provides a starting point for those readers who are interested in pursuing research in data mining or related fields.
The book covers five main topics: data, classification, association analysis, clustering, and anomaly detection. Except for anomaly detection, each of these areas is covered in a pair of chapters. For classification, association analysis, and clustering, the introductory chapter covers basic concepts, representative algorithms, and evaluation techniques, while the more advanced chapter discusses advanced concepts and algorithms. The objective is to provide the reader with a sound understanding of the foundations of data mining, while still covering many important advanced topics. Because of this approach, the book is useful both as a learning tool and as a reference.
To help the readers better understand the concepts that have been presented, we provide an extensive set of examples, figures, and exercises. Bibliographic notes are included at the end of each chapter for readers who are interested in more advanced topics, historically important papers, and recent trends. The book also contains a comprehensive subject and author index.
To the Instructor  As a textbook, this book is suitable for a wide range of students at the advanced undergraduate or graduate level. Since students come to this subject with diverse backgrounds that may not include extensive knowledge of statistics or databases, our book requires minimal prerequisites: no database knowledge is needed and we assume only a modest background in statistics or mathematics. To this end, the book was designed to be as self-contained as possible. Necessary material from statistics, linear algebra, and machine learning is either integrated into the body of the text, or for some advanced topics, covered in the appendices.
Since the chapters covering major data mining topics are self-contained, the order in which topics can be covered is quite flexible. The core material is covered in Chapters 2, 4, 6, 8, and 10. Although the introductory data chapter (2) should be covered first, the basic classification, association analysis, and clustering chapters (4, 6, and 8, respectively) can be covered in any order. Because of the relationship of anomaly detection (10) to classification (4) and clustering (8), these chapters should precede Chapter 10. Various topics can be selected from the advanced classification, association analysis, and clustering chapters (5, 7, and 9, respectively) to fit the schedule and interests of the instructor and students. We also advise that the lectures be augmented by projects or practical exercises in data mining. Although they are time consuming, such hands-on assignments greatly enhance the value of the course.
Support Materials  The supplements for the book are available at Addison-Wesley's Web site www.aw.com/cssupport. Support materials available to all readers of this book include:

- PowerPoint lecture slides
- Suggestions for student projects
- Data mining resources such as data mining algorithms and data sets
- On-line tutorials that give step-by-step examples for selected data mining techniques described in the book using actual data sets and data analysis software
Additional support materials, including solutions to exercises, are available only to instructors adopting this textbook for classroom use. Please contact your school's Addison-Wesley representative for information on obtaining access to this material. Comments and suggestions, as well as reports of errors, can be sent to the authors through [email protected].
Acknowledgments  Many people contributed to this book. We begin by acknowledging our families, to whom this book is dedicated. Without their patience and support, this project would have been impossible.
We would like to thank the current and former students of our data mining groups at the University of Minnesota and Michigan State for their contributions. Eui-Hong (Sam) Han and Mahesh Joshi helped with the initial data mining classes. Some of the exercises and presentation slides that they created can be found in the book and its accompanying slides. Students in our data mining groups who provided comments on drafts of the book or who contributed in other ways include Shyam Boriah, Haibin Cheng, Varun Chandola, Eric Eilertson, Levent Ertoz, Jing Gao, Rohit Gupta, Sridhar Iyer, Jung-Eun Lee, Benjamin Mayer, Aysel Ozgur, Uygar Oztekin, Gaurav Pandey, Kashif Riaz, Jerry Scripps, Gyorgy Simon, Hui Xiong, Jieping Ye, and Pusheng Zhang. We would also like to thank the students of our data mining classes at the University of Minnesota and Michigan State University who worked with early drafts of the book and provided invaluable feedback. We specifically note the helpful suggestions of Bernardo Craemer, Arifin Ruslim, Jamshid Vayghan, and Yu Wei.
Joydeep Ghosh (University of Texas) and Sanjay Ranka (University of Florida) class tested early versions of the book. We also received many useful suggestions directly from the following UT students: Pankaj Adhikari, Rajiv Bhatia, Frederic Bosche, Arindam Chakraborty, Meghana Deodhar, Chris Everson, David Gardner, Saad Godil, Todd Hay, Clint Jones, Ajay Joshi, Joonsoo Lee, Yue Luo, Anuj Nanavati, Tyler Olsen, Sunyoung Park, Aashish Phansalkar, Geoff Prewett, Michael Ryoo, Daryl Shannon, and Mei Yang.
Ronald Kostoff (ONR) read an early version of the clustering chapter and offered numerous suggestions. George Karypis provided invaluable LaTeX assistance in creating an author index. Irene Moulitsas also provided assistance with LaTeX and reviewed some of the appendices. Musetta Steinbach was very helpful in finding errors in the figures.
We would like to acknowledge our colleagues at the University of Minnesota and Michigan State who have helped create a positive environment for data mining research. They include Dan Boley, Joyce Chai, Anil Jain, Ravi
Janardan, Rong Jin, George Karypis, Haesun Park, William F. Punch, Shashi Shekhar, and Jaideep Srivastava. The collaborators on our many data mining projects, who also have our gratitude, include Ramesh Agrawal, Steve Cannon, Piet C. de Groen, Fran Hill, Yongdae Kim, Steve Klooster, Kerry Long, Nihar Mahapatra, Chris Potter, Jonathan Shapiro, Kevin Silverstein, Nevin Young, and Zhi-Li Zhang.
The departments of Computer Science and Engineering at the University of Minnesota and Michigan State University provided computing resources and a supportive environment for this project. ARDA, ARL, ARO, DOE, NASA, and NSF provided research support for Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. In particular, Kamal Abdali, Dick Brackney, Jagdish Chandra, Joe Coughlan, Michael Coyle, Stephen Davis, Frederica Darema, Richard Hirsch, Chandrika Kamath, Raju Namburu, N. Radhakrishnan, James Sidoran, Bhavani Thuraisingham, Walt Tiernin, Maria Zemankova, and Xiaodong Zhang have been supportive of our research in data mining and high-performance computing.
It was a pleasure working with the helpful staff at Pearson Education. In particular, we would like to thank Michelle Brown, Matt Goldstein, Katherine Harutunian, Marilyn Lloyd, Kathy Smith, and Joyce Wells. We would also like to thank George Nichols, who helped with the art work, and Paul Anagnostopoulos, who provided LaTeX support.
We are grateful to the following Pearson reviewers: Chien-Chung Chan (University of Akron), Zhengxin Chen (University of Nebraska at Omaha), Chris Clifton (Purdue University), Joydeep Ghosh (University of Texas, Austin), Nazli Goharian (Illinois Institute of Technology), J. Michael Hardin (University of Alabama), James Hearne (Western Washington University), Hillol Kargupta (University of Maryland, Baltimore County and Agnik, LLC), Eamonn Keogh (University of California-Riverside), Bing Liu (University of Illinois at Chicago), Mariofanna Milanova (University of Arkansas at Little Rock), Srinivasan Parthasarathy (Ohio State University), Zbigniew W. Ras (University of North Carolina at Charlotte), Xintao Wu (University of North Carolina at Charlotte), and Mohammed J. Zaki (Rensselaer Polytechnic Institute).
Contents

Preface

1 Introduction
  1.1 What Is Data Mining?
  1.2 Motivating Challenges
  1.3 The Origins of Data Mining
  1.4 Data Mining Tasks
  1.5 Scope and Organization of the Book
  1.6 Bibliographic Notes
  1.7 Exercises

2 Data
  2.1 Types of Data
    2.1.1 Attributes and Measurement
    2.1.2 Types of Data Sets
  2.2 Data Quality
    2.2.1 Measurement and Data Collection Issues
    2.2.2 Issues Related to Applications
  2.3 Data Preprocessing
    2.3.1 Aggregation
    2.3.2 Sampling
    2.3.3 Dimensionality Reduction
    2.3.4 Feature Subset Selection
    2.3.5 Feature Creation
    2.3.6 Discretization and Binarization
    2.3.7 Variable Transformation
  2.4 Measures of Similarity and Dissimilarity
    2.4.1 Basics
    2.4.2 Similarity and Dissimilarity between Simple Attributes
    2.4.3 Dissimilarities between Data Objects
    2.4.4 Similarities between Data Objects
    2.4.5 Examples of Proximity Measures
    2.4.6 Issues in Proximity Calculation
    2.4.7 Selecting the Right Proximity Measure
  2.5 Bibliographic Notes
  2.6 Exercises

3 Exploring Data
  3.1 The Iris Data Set
  3.2 Summary Statistics
    3.2.1 Frequencies and the Mode
    3.2.2 Percentiles
    3.2.3 Measures of Location: Mean and Median
    3.2.4 Measures of Spread: Range and Variance
    3.2.5 Multivariate Summary Statistics
    3.2.6 Other Ways to Summarize the Data
  3.3 Visualization
    3.3.1 Motivations for Visualization
    3.3.2 General Concepts
    3.3.3 Techniques
    3.3.4 Visualizing Higher-Dimensional Data
    3.3.5 Do's and Don'ts
  3.4 OLAP and Multidimensional Data Analysis
    3.4.1 Representing Iris Data as a Multidimensional Array
    3.4.2 Multidimensional Data: The General Case
    3.4.3 Analyzing Multidimensional Data
    3.4.4 Final Comments on Multidimensional Data Analysis
  3.5 Bibliographic Notes
  3.6 Exercises

4 Classification: Basic Concepts, Decision Trees, and Model Evaluation
  4.1 Preliminaries
  4.2 General Approach to Solving a Classification Problem
  4.3 Decision Tree Induction
    4.3.1 How a Decision Tree Works
    4.3.2 How to Build a Decision Tree
    4.3.3 Methods for Expressing Attribute Test Conditions
    4.3.4 Measures for Selecting the Best Split
    4.3.5 Algorithm for Decision Tree Induction
    4.3.6 An Example: Web Robot Detection
    4.3.7 Characteristics of Decision Tree Induction
  4.4 Model Overfitting
    4.4.1 Overfitting Due to Presence of Noise
    4.4.2 Overfitting Due to Lack of Representative Samples
    4.4.3 Overfitting and the Multiple Comparison Procedure
    4.4.4 Estimation of Generalization Errors
    4.4.5 Handling Overfitting in Decision Tree Induction
  4.5 Evaluating the Performance of a Classifier
    4.5.1 Holdout Method
    4.5.2 Random Subsampling
    4.5.3 Cross-Validation
    4.5.4 Bootstrap
  4.6 Methods for Comparing Classifiers
    4.6.1 Estimating a Confidence Interval for Accuracy
    4.6.2 Comparing the Performance of Two Models
    4.6.3 Comparing the Performance of Two Classifiers
  4.7 Bibliographic Notes
  4.8 Exercises

5 Classification: Alternative Techniques
  5.1 Rule-Based Classifier
    5.1.1 How a Rule-Based Classifier Works
    5.1.2 Rule-Ordering Schemes
    5.1.3 How to Build a Rule-Based Classifier
    5.1.4 Direct Methods for Rule Extraction
    5.1.5 Indirect Methods for Rule Extraction
    5.1.6 Characteristics of Rule-Based Classifiers
  5.2 Nearest-Neighbor Classifiers
    5.2.1 Algorithm
    5.2.2 Characteristics of Nearest-Neighbor Classifiers
  5.3 Bayesian Classifiers
    5.3.1 Bayes Theorem
    5.3.2 Using the Bayes Theorem for Classification
    5.3.3 Naive Bayes Classifier
    5.3.4 Bayes Error Rate
    5.3.5 Bayesian Belief Networks
  5.4 Artificial Neural Network (ANN)
    5.4.1 Perceptron
    5.4.2 Multilayer Artificial Neural Network
    5.4.3 Characteristics of ANN
  5.5 Support Vector Machine (SVM)
    5.5.1 Maximum Margin Hyperplanes
    5.5.2 Linear SVM: Separable Case
    5.5.3 Linear SVM: Nonseparable Case
    5.5.4 Nonlinear SVM
    5.5.5 Characteristics of SVM
  5.6 Ensemble Methods
    5.6.1 Rationale for Ensemble Method
    5.6.2 Methods for Constructing an Ensemble Classifier
    5.6.3 Bias-Variance Decomposition
    5.6.4 Bagging
    5.6.5 Boosting
    5.6.6 Random Forests
    5.6.7 Empirical Comparison among Ensemble Methods
  5.7 Class Imbalance Problem
    5.7.1 Alternative Metrics
    5.7.2 The Receiver Operating Characteristic Curve
    5.7.3 Cost-Sensitive Learning
    5.7.4 Sampling-Based Approaches
  5.8 Multiclass Problem
  5.9 Bibliographic Notes
  5.10 Exercises
6 Association Analysis: Basic Concepts and Algorithms
  6.1 Problem Definition
  6.2 Frequent Itemset Generation
    6.2.1 The Apriori Principle
    6.2.2 Frequent Itemset Generation in the Apriori Algorithm
    6.2.3 Candidate Generation and Pruning
    6.2.4 Support Counting
    6.2.5 Computational Complexity
  6.3 Rule Generation
    6.3.1 Confidence-Based Pruning
    6.3.2 Rule Generation in Apriori Algorithm
    6.3.3 An Example: Congressional Voting Records
  6.4 Compact Representation of Frequent Itemsets
    6.4.1 Maximal Frequent Itemsets
    6.4.2 Closed Frequent Itemsets
  6.5 Alternative Methods for Generating Frequent Itemsets
  6.6 FP-Growth Algorithm
    6.6.1 FP-Tree Representation
    6.6.2 Frequent Itemset Generation in FP-Growth Algorithm
  6.7 Evaluation of Association Patterns
    6.7.1 Objective Measures of Interestingness
    6.7.2 Measures beyond Pairs of Binary Variables
    6.7.3 Simpson's Paradox
  6.8 Effect of Skewed Support Distribution
  6.9 Bibliographic Notes
  6.10 Exercises

7 Association Analysis: Advanced Concepts
  7.1 Handling Categorical Attributes
  7.2 Handling Continuous Attributes
    7.2.1 Discretization-Based Methods
    7.2.2 Statistics-Based Methods
    7.2.3 Non-discretization Methods
  7.3 Handling a Concept Hierarchy
  7.4 Sequential Patterns
    7.4.1 Problem Formulation
    7.4.2 Sequential Pattern Discovery
    7.4.3 Timing Constraints
    7.4.4 Alternative Counting Schemes
  7.5 Subgraph Patterns
    7.5.1 Graphs and Subgraphs
    7.5.2 Frequent Subgraph Mining
    7.5.3 Apriori-like Method
    7.5.4 Candidate Generation
    7.5.5 Candidate Pruning
    7.5.6 Support Counting
  7.6 Infrequent Patterns
    7.6.1 Negative Patterns
    7.6.2 Negatively Correlated Patterns
    7.6.3 Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns
    7.6.4 Techniques for Mining Interesting Infrequent Patterns
    7.6.5 Techniques Based on Mining Negative Patterns
    7.6.6 Techniques Based on Support Expectation
  7.7 Bibliographic Notes
  7.8 Exercises
8 Cluster Analysis: Basic Concepts and Algorithms
  8.1 Overview
    8.1.1 What Is Cluster Analysis?
    8.1.2 Different Types of Clusterings
    8.1.3 Different Types of Clusters
  8.2 K-means
    8.2.1 The Basic K-means Algorithm
    8.2.2 K-means: Additional Issues
    8.2.3 Bisecting K-means
    8.2.4 K-means and Different Types of Clusters
    8.2.5 Strengths and Weaknesses
    8.2.6 K-means as an Optimization Problem
  8.3 Agglomerative Hierarchical Clustering
    8.3.1 Basic Agglomerative Hierarchical Clustering Algorithm
    8.3.2 Specific Techniques
    8.3.3 The Lance-Williams Formula for Cluster Proximity
    8.3.4 Key Issues in Hierarchical Clustering
    8.3.5 Strengths and Weaknesses
  8.4 DBSCAN
    8.4.1 Traditional Density: Center-Based Approach
    8.4.2 The DBSCAN Algorithm
    8.4.3 Strengths and Weaknesses
  8.5 Cluster Evaluation
    8.5.1 Overview
    8.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation
    8.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix
    8.5.4 Unsupervised Evaluation of Hierarchical Clustering
    8.5.5 Determining the Correct Number of Clusters
    8.5.6 Clustering Tendency
    8.5.7 Supervised Measures of Cluster Validity
    8.5.8 Assessing the Significance of Cluster Validity Measures
  8.6 Bibliographic Notes
  8.7 Exercises

9 Cluster Analysis: Additional Issues and Algorithms
  9.1 Characteristics of Data, Clusters, and Clustering Algorithms
    9.1.1 Example: Comparing K-means and DBSCAN
    9.1.2 Data Characteristics
    9.1.3 Cluster Characteristics
    9.1.4 General Characteristics of Clustering Algorithms
  9.2 Prototype-Based Clustering
    9.2.1 Fuzzy Clustering
    9.2.2 Clustering Using Mixture Models
    9.2.3 Self-Organizing Maps (SOM)
  9.3 Density-Based Clustering
    9.3.1 Grid-Based Clustering
    9.3.2 Subspace Clustering
    9.3.3 DENCLUE: A Kernel-Based Scheme for Density-Based Clustering
  9.4 Graph-Based Clustering
    9.4.1 Sparsification
    9.4.2 Minimum Spanning Tree (MST) Clustering
    9.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS
    9.4.4 Chameleon: Hierarchical Clustering with Dynamic Modeling
    9.4.5 Shared Nearest Neighbor Similarity
    9.4.6 The Jarvis-Patrick Clustering Algorithm
    9.4.7 SNN Density
    9.4.8 SNN Density-Based Clustering
  9.5 Scalable Clustering Algorithms
    9.5.1 Scalability: General Issues and Approaches
    9.5.2 BIRCH
    9.5.3 CURE
  9.6 Which Clustering Algorithm?
  9.7 Bibliographic Notes
  9.8 Exercises
10 Anomaly Detection
  10.1 Preliminaries
    10.1.1 Causes of Anomalies
    10.1.2 Approaches to Anomaly Detection
    10.1.3 The Use of Class Labels
    10.1.4 Issues
  10.2 Statistical Approaches
    10.2.1 Detecting Outliers in a Univariate Normal Distribution
    10.2.2 Outliers in a Multivariate Normal Distribution
    10.2.3 A Mixture Model Approach for Anomaly Detection
    10.2.4 Strengths and Weaknesses
  10.3 Proximity-Based Outlier Detection
    10.3.1 Strengths and Weaknesses
  10.4 Density-Based Outlier Detection
    10.4.1 Detection of Outliers Using Relative Density
    10.4.2 Strengths and Weaknesses
  10.5 Clustering-Based Techniques
    10.5.1 Assessing the Extent to Which an Object Belongs to a Cluster
    10.5.2 Impact of Outliers on the Initial Clustering
    10.5.3 The Number of Clusters to Use
    10.5.4 Strengths and Weaknesses
  10.6 Bibliographic Notes
  10.7 Exercises

Appendix A Linear Algebra
  A.1 Vectors
    A.1.1 Definition
    A.1.2 Vector Addition and Multiplication by a Scalar
    A.1.3 Vector Spaces
    A.1.4 The Dot Product, Orthogonality, and Orthogonal Projections
    A.1.5 Vectors and Data Analysis
  A.2 Matrices
    A.2.1 Matrices: Definitions
    A.2.2 Matrices: Addition and Multiplication by a Scalar
    A.2.3 Matrices: Multiplication
    A.2.4 Linear Transformations and Inverse Matrices
    A.2.5 Eigenvalue and Singular Value Decomposition
    A.2.6 Matrices and Data Analysis
  A.3 Bibliographic Notes

Appendix B Dimensionality Reduction
  B.1 PCA and SVD
    B.1.1 Principal Components Analysis (PCA)
    B.1.2 SVD
  B.2 Other Dimensionality Reduction Techniques
    B.2.1 Factor Analysis
    B.2.2 Locally Linear Embedding (LLE)
    B.2.3 Multidimensional Scaling, FastMap, and ISOMAP
    B.2.4 Common Issues
  B.3 Bibliographic Notes

Appendix C Probability and Statistics
  C.1 Probability
    C.1.1 Expected Values
  C.2 Statistics
    C.2.1 Point Estimation
    C.2.2 Central Limit Theorem
    C.2.3 Interval Estimation
  C.3 Hypothesis Testing

Appendix D Regression
  D.1 Preliminaries
  D.2 Simple Linear Regression
    D.2.1 Least Square Method
    D.2.2 Analyzing Regression Errors
    D.2.3 Analyzing Goodness of Fit
  D.3 Multivariate Linear Regression
  D.4 Alternative Least-Square Regression Methods

Appendix E Optimization
  E.1 Unconstrained Optimization
    E.1.1 Numerical Methods
  E.2 Constrained Optimization
    E.2.1 Equality Constraints
    E.2.2 Inequality Constraints

Author Index
Subject Index
Copyright Permissions
1 Introduction
Rapid advances in data collection and storage technology have enabled organizations to accumulate vast amounts of data. However, extracting useful information has proven extremely challenging. Often, traditional data analysis tools and techniques cannot be used because of the massive size of a data set. Sometimes, the non-traditional nature of the data means that traditional approaches cannot be applied even if the data set is relatively small. In other situations, the questions that need to be answered cannot be addressed using existing data analysis techniques, and thus, new methods need to be developed.
Data mining is a technology that blends traditional data analysis methods with sophisticated algorithms for processing large volumes of data. It has also opened up exciting opportunities for exploring and analyzing new types of data and for analyzing old types of data in new ways. In this introductory chapter, we present an overview of data mining and outline the key topics to be covered in this book. We start with a description of some well-known applications that require new techniques for data analysis.
Business  Point-of-sale data collection (bar code scanners, radio frequency identification (RFID), and smart card technology) has allowed retailers to collect up-to-the-minute data about customer purchases at the checkout counters of their stores. Retailers can utilize this information, along with other business-critical data such as Web logs from e-commerce Web sites and customer service records from call centers, to help them better understand the needs of their customers and make more informed business decisions.
Data mining techniques can be used to support a wide range of business intelligence applications such as customer profiling, targeted marketing, workflow management, store layout, and fraud detection. It can also help retailers
answer important business questions such as "Who are the most profitable customers?" "What products can be cross-sold or up-sold?" and "What is the revenue outlook of the company for next year?" Some of these questions motivated the creation of association analysis (Chapters 6 and 7), a new data analysis technique.
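The kind of pattern association analysis looks for can be illustrated with a minimal brute-force sketch. The transactions and the 60% minimum-support threshold below are hypothetical, and this exhaustive pair count only stands in for the far more efficient algorithms of Chapter 6.

```python
from itertools import combinations

# Hypothetical market-basket transactions (one set of items per checkout).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

# Brute force: report every pair of items bought together in >= 60% of baskets.
items = sorted(set().union(*transactions))
frequent_pairs = [
    pair for pair in combinations(items, 2)
    if support(pair, transactions) >= 0.6
]
print(frequent_pairs)
# [('beer', 'diapers'), ('bread', 'diapers'), ('bread', 'milk'), ('diapers', 'milk')]
```

A rule such as "customers who buy diapers also tend to buy beer" would then be derived from a frequent pair like ('beer', 'diapers'), which here occurs in 3 of the 5 baskets.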
Medicine, Science, and Engineering  Researchers in medicine, science, and engineering are rapidly accumulating data that is key to important new discoveries. For example, as an important step toward improving our understanding of the Earth's climate system, NASA has deployed a series of Earth-orbiting satellites that continuously generate global observations of the land surface, oceans, and atmosphere. However, because of the size and spatio-temporal nature of the data, traditional methods are often not suitable for analyzing these data sets. Techniques developed in data mining can aid Earth scientists in answering questions such as "What is the relationship of the frequency and intensity of ecosystem disturbances such as droughts and hurricanes to global warming?" "How is land surface precipitation and temperature affected by ocean surface temperature?" and "How well can we predict the beginning and end of the growing season for a region?"
As another example, researchers in molecular biology hope to use the large amounts of genomic data currently being gathered to better understand the structure and function of genes. In the past, traditional methods in molecular biology allowed scientists to study only a few genes at a time in a given experiment. Recent breakthroughs in microarray technology have enabled scientists to compare the behavior of thousands of genes under various situations. Such comparisons can help determine the function of each gene and perhaps isolate the genes responsible for certain diseases. However, the noisy and high-dimensional nature of data requires new types of data analysis. In addition to analyzing gene array data, data mining can also be used to address other important biological challenges such as protein structure prediction, multiple sequence alignment, the modeling of biochemical pathways, and phylogenetics.
1.1 What Is Data Mining?

Data mining is the process of automatically discovering useful information in large data repositories. Data mining techniques are deployed to scour large databases in order to find novel and useful patterns that might otherwise remain unknown. They also provide capabilities to predict the outcome of a
future observation, such as predicting whether a newly arrived customer will spend more than $100 at a department store.

Not all information discovery tasks are considered to be data mining. For example, looking up individual records using a database management system or finding particular Web pages via a query to an Internet search engine are tasks related to the area of information retrieval. Although such tasks are important and may involve the use of sophisticated algorithms and data structures, they rely on traditional computer science techniques and obvious features of the data to create index structures for efficiently organizing and retrieving information. Nonetheless, data mining techniques have been used to enhance information retrieval systems.
Data Mining and Knowledge Discovery

Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall process of converting raw data into useful information, as shown in Figure 1.1. This process consists of a series of transformation steps, from data preprocessing to postprocessing of data mining results.

Figure 1.1. The process of knowledge discovery in databases (KDD).
The input data can be stored in a variety of formats (flat files, spreadsheets, or relational tables) and may reside in a centralized data repository or be distributed across multiple sites. The purpose of preprocessing is to transform the raw input data into an appropriate format for subsequent analysis. The steps involved in data preprocessing include fusing data from multiple sources, cleaning data to remove noise and duplicate observations, and selecting records and features that are relevant to the data mining task at hand. Because of the many ways data can be collected and stored, data
preprocessing is perhaps the most laborious and time-consuming step in the overall knowledge discovery process.
"Closing the loop" is the phrase often used to refer to the process of integrating data mining results into decision support systems. For example, in business applications, the insights offered by data mining results can be integrated with campaign management tools so that effective marketing promotions can be conducted and tested. Such integration requires a postprocessing step that ensures that only valid and useful results are incorporated into the decision support system. An example of postprocessing is visualization (see Chapter 3), which allows analysts to explore the data and the data mining results from a variety of viewpoints. Statistical measures or hypothesis testing methods can also be applied during postprocessing to eliminate spurious data mining results.
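The statistical screening step just described can be sketched as a multiple-testing correction: when thousands of candidate patterns are each tested for significance, the threshold must be tightened to avoid keeping spurious results. Bonferroni is used here as one common choice (the text does not prescribe a specific method), and the p-values are assumed to come from an earlier significance test on each pattern.

```python
# Illustrative sketch of a postprocessing filter for spurious patterns.
# Bonferroni correction is an assumption here, chosen for simplicity;
# the p-values are assumed to be produced by an earlier significance test.

def bonferroni_filter(p_values, alpha=0.05):
    """Return the indices of patterns whose p-value survives the
    Bonferroni-adjusted threshold alpha / (number of patterns tested)."""
    m = len(p_values)
    threshold = alpha / m
    return [i for i, p in enumerate(p_values) if p < threshold]

# Four candidate patterns were tested; with the adjusted threshold
# 0.05 / 4 = 0.0125, only the first survives.
print(bonferroni_filter([0.001, 0.04, 0.2, 0.8]))  # [0]
```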
1.2 Motivating Challenges

As mentioned earlier, traditional data analysis techniques have often encountered practical difficulties in meeting the challenges posed by new data sets. The following are some of the specific challenges that motivated the development of data mining.
Scalability  Because of advances in data generation and collection, data sets with sizes of gigabytes, terabytes, or even petabytes are becoming common. If data mining algorithms are to handle these massive data sets, then they must be scalable. Many data mining algorithms employ special search strategies to handle exponential search problems. Scalability may also require the implementation of novel data structures to access individual records in an efficient manner. For instance, out-of-core algorithms may be necessary when processing data sets that cannot fit into main memory. Scalability can also be improved by using sampling or developing parallel and distributed algorithms.
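The out-of-core idea mentioned above can be sketched as a one-pass computation that never holds the full data set in memory. This is a minimal illustration, not a technique from the book; the function name and the single-value-per-line file format in the comment are assumptions.

```python
# Sketch of an out-of-core computation: a running mean over values that
# arrive one at a time, so the data set never has to fit in main memory.

def streaming_mean(values):
    """Compute the mean of an iterable in a single pass with O(1) memory."""
    count, total = 0, 0.0
    for v in values:
        count += 1
        total += v
    return total / count if count else float("nan")

# In practice `values` would be a generator reading from disk, e.g.
#   values = (float(line) for line in open("huge_measurements.txt"))
# so only one record is in memory at a time.
print(streaming_mean([2.0, 4.0, 6.0, 8.0]))  # 5.0
```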
High Dimensionality  It is now common to encounter data sets with hundreds or thousands of attributes instead of the handful common a few decades ago. In bioinformatics, progress in microarray technology has produced gene expression data involving thousands of features. Data sets with temporal or spatial components also tend to have high dimensionality. For example, consider a data set that contains measurements of temperature at various locations. If the temperature measurements are taken repeatedly for an extended period, the number of dimensions (features) increases in proportion to
the number of measurements taken. Traditional data analysis techniques that were developed for low-dimensional data often do not work well for such high-dimensional data. Also, for some data analysis algorithms, the computational complexity increases rapidly as the dimensionality (the number of features) increases.
Heterogeneous and Complex Data  Traditional data analysis methods often deal with data sets containing attributes of the same type, either continuous or categorical. As the role of data mining in business, science, medicine, and other fields has grown, so has the need for techniques that can handle heterogeneous attributes. Recent years have also seen the emergence of more complex data objects. Examples of such non-traditional types of data include collections of Web pages containing semi-structured text and hyperlinks; DNA data with sequential and three-dimensional structure; and climate data that consists of time series measurements (temperature, pressure, etc.) at various locations on the Earth's surface. Techniques developed for mining such complex objects should take into consideration relationships in the data, such as temporal and spatial autocorrelation, graph connectivity, and parent-child relationships between the elements in semi-structured text and XML documents.
Data Ownership and Distribution  Sometimes, the data needed for an analysis is not stored in one location or owned by one organization. Instead, the data is geographically distributed among resources belonging to multiple entities. This requires the development of distributed data mining techniques. The key challenges faced by distributed data mining algorithms include (1) how to reduce the amount of communication needed to perform the distributed computation, (2) how to effectively consolidate the data mining results obtained from multiple sources, and (3) how to address data security issues.
Non-traditional Analysis  The traditional statistical approach is based on a hypothesize-and-test paradigm. In other words, a hypothesis is proposed, an experiment is designed to gather the data, and then the data is analyzed with respect to the hypothesis. Unfortunately, this process is extremely labor-intensive. Current data analysis tasks often require the generation and evaluation of thousands of hypotheses, and consequently, the development of some data mining techniques has been motivated by the desire to automate the process of hypothesis generation and evaluation. Furthermore, the data sets analyzed in data mining are typically not the result of a carefully designed
experiment and often represent opportunistic samples of the data, rather than random samples. Also, the data sets frequently involve non-traditional types of data and data distributions.
1.3 The Origins of Data Mining

Brought together by the goal of meeting the challenges of the previous section, researchers from different disciplines began to focus on developing more efficient and scalable tools that could handle diverse types of data. This work, which culminated in the field of data mining, built upon the methodology and algorithms that researchers had previously used. In particular, data mining draws upon ideas, such as (1) sampling, estimation, and hypothesis testing from statistics and (2) search algorithms, modeling techniques, and learning theories from artificial intelligence, pattern recognition, and machine learning. Data mining has also been quick to adopt ideas from other areas, including optimization, evolutionary computing, information theory, signal processing, visualization, and information retrieval.
A number of other areas also play key supporting roles. In particular, database systems are needed to provide support for efficient storage, indexing, and query processing. Techniques from high performance (parallel) computing are often important in addressing the massive size of some data sets. Distributed techniques can also help address the issue of size and are essential when the data cannot be gathered in one location. Figure 1.2 shows the relationship of data mining to other areas.

Figure 1.2. Data mining as a confluence of many disciplines.
1.4 Data Mining Tasks

Data mining tasks are generally divided into two major categories:

Predictive tasks. The objective of these tasks is to predict the value of a particular attribute based on the values of other attributes. The attribute to be predicted is commonly known as the target or dependent variable, while the attributes used for making the prediction are known as the explanatory or independent variables.

Descriptive tasks. Here, the objective is to derive patterns (correlations, trends, clusters, trajectories, and anomalies) that summarize the underlying relationships in data. Descriptive data mining tasks are often exploratory in nature and frequently require postprocessing techniques to validate and explain the results.
Figure 1.3 illustrates four of the core data mining tasks that are described in the remainder of this book.

Figure 1.3. Four of the core data mining tasks.
Predictive modeling refers to the task of building a model for the target variable as a function of the explanatory variables. There are two types of predictive modeling tasks: classification, which is used for discrete target variables, and regression, which is used for continuous target variables. For example, predicting whether a Web user will make a purchase at an online bookstore is a classification task because the target variable is binary-valued. On the other hand, forecasting the future price of a stock is a regression task because price is a continuous-valued attribute. The goal of both tasks is to learn a model that minimizes the error between the predicted and true values of the target variable. Predictive modeling can be used to identify customers that will respond to a marketing campaign, predict disturbances in the Earth's ecosystem, or judge whether a patient has a particular disease based on the results of medical tests.
Example 1.1 (Predicting the Type of a Flower). Consider the task of predicting the species of a flower based on the characteristics of the flower. In particular, consider classifying an Iris flower as to whether it belongs to one of the following three Iris species: Setosa, Versicolour, or Virginica. To perform this task, we need a data set containing the characteristics of various flowers of these three species. A data set with this type of information is the well-known Iris data set from the UCI Machine Learning Repository at http://www.ics.uci.edu/~mlearn. In addition to the species of a flower, this data set contains four other attributes: sepal width, sepal length, petal length, and petal width. (The Iris data set and its attributes are described further in Section 3.1.) Figure 1.4 shows a plot of petal width versus petal length for the 150 flowers in the Iris data set. Petal width is broken into the categories low, medium, and high, which correspond to the intervals [0, 0.75), [0.75, 1.75), [1.75, ∞), respectively. Also, petal length is broken into the categories low, medium, and high, which correspond to the intervals [0, 2.5), [2.5, 5), [5, ∞), respectively. Based on these categories of petal width and length, the following rules can be derived:

Petal width low and petal length low implies Setosa.
Petal width medium and petal length medium implies Versicolour.
Petal width high and petal length high implies Virginica.

While these rules do not classify all the flowers, they do a good (but not perfect) job of classifying most of the flowers. Note that flowers from the Setosa species are well separated from the Versicolour and Virginica species with respect to petal width and length, but the latter two species overlap somewhat with respect to these attributes. ■
Figure 1.4. Petal width versus petal length for 150 Iris flowers (Setosa, Versicolour, and Virginica).
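The three rules of Example 1.1 can be sketched as a tiny rule-based classifier. The interval cut points come from the text; the function names and the sample measurements below are illustrative assumptions, not rows from the actual Iris data set.

```python
# Illustrative sketch of the rule set from Example 1.1. The cut points
# 0.75/1.75 (petal width) and 2.5/5.0 (petal length) are the intervals
# given in the text; the test measurements below are made up.

def discretize(value, cut_low, cut_high):
    """Map a measurement to the category 'low', 'medium', or 'high'."""
    if value < cut_low:
        return "low"
    if value < cut_high:
        return "medium"
    return "high"

def classify_iris(petal_width, petal_length):
    width = discretize(petal_width, 0.75, 1.75)
    length = discretize(petal_length, 2.5, 5.0)
    if width == "low" and length == "low":
        return "Setosa"
    if width == "medium" and length == "medium":
        return "Versicolour"
    if width == "high" and length == "high":
        return "Virginica"
    return "unclassified"  # the rules do not cover every flower

print(classify_iris(0.2, 1.4))   # Setosa
print(classify_iris(1.3, 4.5))   # Versicolour
print(classify_iris(2.1, 5.8))   # Virginica
```

The fall-through return mirrors the text's observation that the rules do a good but not perfect job: flowers whose width and length categories disagree are left unclassified.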
Association analysis is used to discover patterns that describe strongly associated features in the data. The discovered patterns are typically represented in the form of implication rules or feature subsets. Because of the exponential size of its search space, the goal of association analysis is to extract the most interesting patterns in an efficient manner. Useful applications of association analysis include finding groups of genes that have related functionality, identifying Web pages that are accessed together, or understanding the relationships between different elements of Earth's climate system.
Example 1.2 (Market Basket Analysis). The transactions shown in Table 1.1 illustrate point-of-sale data collected at the checkout counters of a grocery store. Association analysis can be applied to find items that are frequently bought together by customers. For example, we may discover the rule {Diapers} → {Milk}, which suggests that customers who buy diapers also tend to buy milk. This type of rule can be used to identify potential cross-selling opportunities among related items. ■
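Using the transactions of Table 1.1, the rule {Diapers} → {Milk} from Example 1.2 can be scored with the support and confidence measures that Chapter 6 defines formally. This is an illustrative sketch; the function names are ours, and the transaction data is copied from the table.

```python
# Sketch of support and confidence for the rule {Diapers} -> {Milk},
# evaluated on the ten transactions of Table 1.1.

transactions = [
    {"Bread", "Butter", "Diapers", "Milk"},
    {"Coffee", "Sugar", "Cookies", "Salmon"},
    {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
    {"Bread", "Butter", "Salmon", "Chicken"},
    {"Eggs", "Bread", "Butter"},
    {"Salmon", "Diapers", "Milk"},
    {"Bread", "Tea", "Sugar", "Eggs"},
    {"Coffee", "Sugar", "Chicken", "Eggs"},
    {"Bread", "Diapers", "Milk", "Salt"},
    {"Tea", "Eggs", "Cookies", "Diapers", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"Diapers", "Milk"}))       # 0.5: half the baskets have both
print(confidence({"Diapers"}, {"Milk"}))  # 1.0: every diaper buyer also bought milk
```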
Cluster analysis seeks to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other
Table 1.1. Market basket data.

Transaction ID   Items
1                {Bread, Butter, Diapers, Milk}
2                {Coffee, Sugar, Cookies, Salmon}
3                {Bread, Butter, Coffee, Diapers, Milk, Eggs}
4                {Bread, Butter, Salmon, Chicken}
5                {Eggs, Bread, Butter}
6                {Salmon, Diapers, Milk}
7                {Bread, Tea, Sugar, Eggs}
8                {Coffee, Sugar, Chicken, Eggs}
9                {Bread, Diapers, Milk, Salt}
10               {Tea, Eggs, Cookies, Diapers, Milk}
than observations that belong to other clusters. Clustering has been used to group sets of related customers, find areas of the ocean that have a significant impact on the Earth's climate, and compress data.
Example 1.3 (Document Clustering). The collection of news articles shown in Table 1.2 can be grouped based on their respective topics. Each article is represented as a set of word-frequency pairs (w, c), where w is a word and c is the number of times the word appears in the article. There are two natural clusters in the data set. The first cluster consists of the first four articles, which correspond to news about the economy, while the second cluster contains the last four articles, which correspond to news about health care. A good clustering algorithm should be able to identify these two clusters based on the similarity between words that appear in the articles.
Table 1.2. Collection of news articles.

Article   Words
1         dollar: 1, industry: 4, country: 2, loan: 3, deal: 2, government: 2
2         machinery: 2, labor: 3, market: 4, industry: 2, work: 3, country: 1
3         job: 5, inflation: 3, rise: 2, jobless: 2, market: 3, country: 2, index: 3
4         domestic: 3, forecast: 2, gain: 1, market: 2, sale: 3, price: 2
5         patient: 4, symptom: 2, drug: 3, health: 2, clinic: 2, doctor: 2
6         pharmaceutical: 2, company: 3, drug: 2, vaccine: 1, flu: 3
7         death: 2, cancer: 4, drug: 3, public: 4, health: 3, director: 2
8         medical: 2, cost: 3, increase: 2, patient: 2, health: 3, care: 1
Anomaly detection is the task of identifying observations whose characteristics are significantly different from the rest of the data. Such observations are known as anomalies or outliers. The goal of an anomaly detection algorithm is to discover the real anomalies and avoid falsely labeling normal objects as anomalous. In other words, a good anomaly detector must have a high detection rate and a low false alarm rate. Applications of anomaly detection include the detection of fraud, network intrusions, unusual patterns of disease, and ecosystem disturbances.
Example 1.4 (Credit Card Fraud Detection). A credit card company records the transactions made by every credit card holder, along with personal information such as credit limit, age, annual income, and address. Since the number of fraudulent cases is relatively small compared to the number of legitimate transactions, anomaly detection techniques can be applied to build a profile of legitimate transactions for the users. When a new transaction arrives, it is compared against the profile of the user. If the characteristics of the transaction are very different from the previously created profile, then the transaction is flagged as potentially fraudulent. ■
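The profile-based check of Example 1.4 can be sketched as follows. This is a minimal illustration under stated assumptions: the profile is just the mean and standard deviation of past transaction amounts, the 3-standard-deviation cutoff is hypothetical, and real systems use far richer profiles than amount alone.

```python
from statistics import mean, stdev

# Sketch of comparing a new transaction against a user's profile.
# Profile = (mean, stdev) of historical amounts; both the history and
# the cutoff below are made-up illustrative values.

def is_anomalous(history, new_amount, cutoff=3.0):
    """Flag a transaction whose amount lies more than `cutoff` standard
    deviations from the user's historical mean."""
    mu, sigma = mean(history), stdev(history)
    return abs(new_amount - mu) > cutoff * sigma

history = [25.0, 40.0, 31.0, 28.0, 35.0, 30.0]  # typical past purchases
print(is_anomalous(history, 32.0))    # False: consistent with the profile
print(is_anomalous(history, 450.0))   # True: flagged as potentially fraudulent
```

A low cutoff raises the detection rate but also the false alarm rate; tuning this trade-off is exactly the detection-rate versus false-alarm balance described above.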
1.5 Scope and Organization of the Book

This book introduces the major principles and techniques used in data mining from an algorithmic perspective. A study of these principles and techniques is essential for developing a better understanding of how data mining technology can be applied to various kinds of data. This book also serves as a starting point for readers who are interested in doing research in this field.
We begin the technical discussion of this book with a chapter on data (Chapter 2), which discusses the basic types of data, data quality, preprocessing techniques, and measures of similarity and dissimilarity. Although this material can be covered quickly, it provides an essential foundation for data analysis.
Chapter 3, on data exploration, discusses summary statistics, visualization techniques, and On-Line Analytical Processing (OLAP). These techniques provide the means for quickly gaining insight into a data set.
Chapters 4 and 5 cover classification. Chapter 4 provides a foundation by discussing decision tree classifiers and several issues that are important to all classification: overfitting, performance evaluation, and the comparison of different classification models. Using this foundation, Chapter 5 describes a number of other important classification techniques: rule-based systems, nearest-neighbor classifiers, Bayesian classifiers, artificial neural networks, support vector machines, and ensemble classifiers, which are collections of classifiers. The multiclass and imbalanced class problems are also discussed. These topics can be covered independently.
Association analysis is explored in Chapters 6 and 7. Chapter 6 describes the basics of association analysis: frequent itemsets, association rules, and some of the algorithms used to generate them. Specific types of frequent itemsets (maximal, closed, and hyperclique) that are important for data mining are also discussed, and the chapter concludes with a discussion of evaluation measures for association analysis. Chapter 7 considers a variety of more advanced topics, including how association analysis can be applied to categorical and continuous data or to data that has a concept hierarchy. (A concept hierarchy is a hierarchical categorization of objects, e.g., store items, clothing, shoes, sneakers.) This chapter also describes how association analysis can be extended to find sequential patterns (patterns involving order), patterns in graphs, and negative relationships (if one item is present, then the other is not).
Cluster analysis is discussed in Chapters 8 and 9. Chapter 8 first describes the different types of clusters and then presents three specific clustering techniques: K-means, agglomerative hierarchical clustering, and DBSCAN. This is followed by a discussion of techniques for validating the results of a clustering algorithm. Additional clustering concepts and techniques are explored in Chapter 9, including fuzzy and probabilistic clustering, Self-Organizing Maps (SOM), graph-based clustering, and density-based clustering. There is also a discussion of scalability issues and factors to consider when selecting a clustering algorithm.
The last chapter, Chapter 10, is on anomaly detection. After some basic definitions, several different types of anomaly detection are considered: statistical, distance-based, density-based, and clustering-based. Appendices A through E give a brief review of important topics that are used in portions of the book: linear algebra, dimensionality reduction, statistics, regression, and optimization.
The subject of data mining, while relatively young compared to statistics or machine learning, is already too large to cover in a single book. Selected references to topics that are only briefly covered, such as data quality, are provided in the bibliographic notes of the appropriate chapter. References to topics not covered in this book, such as data mining for streams and privacy-preserving data mining, are provided in the bibliographic notes of this chapter.
1.6 Bibliographic Notes

The topic of data mining has inspired many textbooks. Introductory textbooks include those by Dunham [10], Han and Kamber [21], Hand et al. [23], and Roiger and Geatz [36]. Data mining books with a stronger emphasis on business applications include the works by Berry and Linoff [2], Pyle [34], and Parr Rud [33]. Books with an emphasis on statistical learning include those by Cherkassky and Mulier [6], and Hastie et al. [24]. Some books with an emphasis on machine learning or pattern recognition are those by Duda et al. [9], Kantardzic [25], Mitchell [31], Webb [41], and Witten and Frank [42]. There are also some more specialized books: Chakrabarti [4] (Web mining), Fayyad et al. [13] (collection of early articles on data mining), Fayyad et al. [11] (visualization), Grossman et al. [18] (science and engineering), Kargupta and Chan [26] (distributed data mining), Wang et al. [40] (bioinformatics), and Zaki and Ho [44] (parallel data mining).
There are several conferences related to data mining. Some of the main conferences dedicated to this field include the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), the IEEE International Conference on Data Mining (ICDM), the SIAM International Conference on Data Mining (SDM), the European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), and the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Data mining papers can also be found in other major conferences such as the ACM SIGMOD/PODS conference, the International Conference on Very Large Data Bases (VLDB), the Conference on Information and Knowledge Management (CIKM), the International Conference on Data Engineering (ICDE), the International Conference on Machine Learning (ICML), and the National Conference on Artificial Intelligence (AAAI).
Journal publications on data mining include IEEE Transactions on Knowledge and Data Engineering, Data Mining and Knowledge Discovery, Knowledge and Information Systems, Intelligent Data Analysis, Information Systems, and the Journal of Intelligent Information Systems.
There have been a number of general articles on data mining that define the field or its relationship to other fields, particularly statistics. Fayyad et al. [12] describe data mining and how it fits into the total knowledge discovery process. Chen et al. [5] give a database perspective on data mining. Ramakrishnan and Grama [35] provide a general discussion of data mining and present several viewpoints. Hand [22] describes how data mining differs from statistics, as does Friedman [14]. Lambert [29] explores the use of statistics for large data sets and provides some comments on the respective roles of data mining and statistics.
Glymour et al. [16] consider the lessons that statistics may have for data mining. Smyth et al. [38] describe how the evolution of data mining is being driven by new types of data and applications, such as those involving streams, graphs, and text. Emerging applications in data mining are considered by Han et al. [20], and Smyth [37] describes some research challenges in data mining. A discussion of how developments in data mining research can be turned into practical tools is given by Wu et al. [43]. Data mining standards are the