introduction to data mining - steinbach.pdf

792

Upload: si12nisha

Post on 16-Feb-2018

219 views

Category:

Documents


0 download

TRANSCRIPT

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    1/790

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    2/790

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    3/790

    \ ( (

    PANG.N NG

    TAN

    Mich igan ta teUnivers i t y

    MICHAEL STEINBACH

    Univers i t yf M innesota

    VI

    PI

    N KUMAR

    Univers i t yf M innesota

    and

    ArmyHighPer formance

    Comput ing

    esearch enter

    +f.f_l

    crf.rfh.

    .W

    if f

    aqtY

    6l$

    t .T.R.C.

    i'&'ufe61ttt1/.

    Y

    \

    t.\

    $t,/,1'

    n,5

    \.

    7\

    V

    '48

    Boston San Francisco

    NewYork

    London Toronto Sydney

    Tokyo

    Singapore Madrid

    MexicoCity

    Munich Paris

    CapeTown

    HongKong

    Montreal

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    4/790

    G.R

    r+6,q

    If

    you purchased

    this book

    within

    the United States

    or Canada

    you

    should be aware that it has been

    wrongfirlly imported

    without

    the approval

    of the Publishel

    or the

    Author.

    3

    6

    {)gq*

    3

    AcquisitionsEditor

    Matt

    Goldstein

    ProjectEditor

    Katherine Harutunian

    Production

    Supervisor Marilyn

    Lloyd

    Production Services Paul C. Anagnostopoulosof Windfall Software

    Marketing

    Manager

    Michelle

    Brown

    Copyeditor

    Kathy

    Smith

    Proofreader

    IenniferMcClain

    Technicallllustration

    GeorgeNichols

    Cover

    Design

    Supervisor

    Joyce

    Cosentino

    Wells

    Cover Design

    Night & Day Design

    Cover Image

    @ 2005

    Rob Casey/Brand

    X

    pictures

    hepress and

    Manufacturing

    Caroline

    Fell

    Printer

    HamiltonPrinting

    Access

    the latest information

    about

    Addison-Wesley

    titles from

    our

    iWorld

    Wide

    Web site:

    http /www. aw-bc.com/computing

    Many

    of the

    designations

    used by

    manufacturers

    and sellers

    to

    distiriguish their products

    are

    claimed

    as trademarks.

    where

    those

    designations

    appear n

    this book, and

    Addison-

    Wesley was

    aware

    of a

    trademark claim,

    the designations

    have

    been

    printed

    in initial

    caps

    or all caps.

    The

    programs

    and

    applications presented

    n

    this book

    have

    been ncl,[rded

    or

    their

    instructional value.

    They

    have been

    tested

    with

    care,

    but are not guatanteed

    or

    any

    particular

    purpose.

    The

    publisher

    does

    not

    offer

    any

    warranties

    or representations,

    nor

    does

    it accept any

    liabilities

    with respect

    o

    the programs

    or applications.

    Copyright

    @

    2006

    by

    Pearson Education,

    Inc.

    For information

    on

    obtaining

    permission

    for

    use

    of material

    in this

    work,

    please

    submit

    a

    written

    request o

    PearsonEducation,

    Inc.,

    Rights

    and Contract

    Department,

    75

    Arlington

    Street,

    Suite 300,

    Boston,

    MA02II6

    or fax your

    request

    o

    (617)

    g4g-j047.

    All rights

    reserved.

    No

    part

    of

    this publication

    may be

    reproduced,

    stored

    n

    a retrieval

    system,

    or transmitted,

    in any form

    or

    by any

    means,

    electronic,

    mechanical,

    photocopying,

    recording,

    or any other

    media

    embodiments

    now

    known

    or hereafter

    o become

    known,

    without

    the

    prior

    written

    permission

    of the publisher.

    printed

    in the

    united

    States

    of

    America.

    lsBN

    0-321-42052-7

    2 3 4 5 67

    8 9 10-HAM-O8

    7 06

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    5/790

    our

    famili,es

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    6/790

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    7/790

    Preface

    Advances in

    data

    generation

    and

    collection are

    producing

    data

    sets of mas-

    sive

    size

    in

    commerce and

    a variety of scientific disciplines. Data warehouses

    store details

    of the sales

    and operations of businesses, arth-orbiting satellites

    beam high-resolution

    mages

    and sensordata back to Earth, and

    genomics

    ex-

    periments

    generate

    sequence,

    tructural, and

    functional

    data for an

    increasing

    number

    of organisms.

    The ease with

    which

    data can now be

    gathered

    and

    stored has

    created a new

    attitude toward data analysis:

    Gather

    whatever

    data

    you

    can whenever

    and

    wherever

    possible.

    It has become

    an article of

    faith

    that the

    gathered

    data

    will have value, either

    for

    the

    purpose

    that initially

    motivated

    its collection

    or

    for

    purposes

    not

    yet

    envisioned.

    The field

    of data mining

    grew

    out of the limitations of current data anal-

    ysis

    techniques n handling

    the challenges

    posedl

    by

    these

    new

    types

    of data

    sets. Data mining

    does not replace

    other areas

    of data

    analysis,

    but rather

    takes them

    as the

    foundation

    for much

    of

    its work.

    While some areas

    of data

    mining, such as association

    analysis, are unique

    to the field, other areas,such

    as clustering, classification,

    and

    anomaly detection, build

    upon

    a long history

    of work on these topics in other fields. Indeed, the willingness of data mining

    researchers

    o draw

    upon existing techniques has contributed to the strength

    and

    breadth of the

    field,

    as well as to

    its rapid

    growth.

    Another

    strength of

    the

    field has

    been

    its emphasis

    on collaboration

    with

    researchers

    in

    other areas. The

    challenges

    of

    analyzing

    new types of data

    cannot be met by simply

    applying data

    analysis techniques in isolation

    from

    those who

    understand the

    data and the domain

    in which

    it

    resides.

    Often,

    skill

    in

    building

    multidisciplinary

    teams has been as

    responsible

    or the success

    f

    data mining

    projects

    as

    the creation of

    new and

    innovative

    algorithms. Just

    as, historically, many

    developments

    n

    statistics

    were

    driven by the

    needs of

    agriculture, industry, medicine, and business,rxranyof the developments n

    data mining are being

    driven

    by

    the

    needsof those same fields.

    This

    book began

    as a set of notes and

    lecture

    slides

    for

    a data

    mining

    course

    hat

    has

    been offered

    at the University

    of Minnesota

    since Spring

    1998

    to upper-division undergraduate

    and

    graduate

    Students.

    Presentation

    slides

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    8/790

    viii

    Preface

    and

    exercises

    developed in

    these

    offerings

    grew

    with

    time and served

    as a basis

    for the book. A

    survey

    of clustering

    techniques

    in data mining,

    originally

    written

    in

    preparation

    for

    research

    in the

    area,

    served as a starting

    point

    for one of the chapters in the book.

    Over time,

    the

    clustering chapter

    was

    joined

    by

    chapters on data,

    classification,

    association analysis, and anomaly

    detection. The book in its current form has been class tested at the home

    institutions of the

    authors-the University of Minnesota

    and Michigan

    State

    University-as

    well as several other universities.

    A number

    of data mining books appeared in

    the

    meantime,

    but were

    not

    completely satisfactory for our students

    primarily graduate

    and undergrad-

    uate

    students in

    computer

    science,

    but including students

    from industry and

    a wide

    variety

    of other disciplines. Their mathematical and computer back-

    grounds

    varied considerably, but they

    shared a common

    goal:

    to

    learn

    about

    data

    mining

    as directly

    as

    possible

    in

    order to

    quickly

    apply it to

    problems

    in

    their

    own domains. Thus, texts with

    extensive

    mathematical

    or statistical

    prerequisiteswere unappealing to many of them, as were texts that required a

    substantial

    database

    background.

    The

    book that evolved

    n response

    o these

    students

    needs ocuses

    as

    directly

    as

    possible

    on the

    key

    conceptsof data min-

    ing by

    illustrating

    them with examples,

    simple descriptions

    of key algorithms,

    and

    exercises.

    Overview Specifically,

    his

    book

    provides

    a comprehensive

    ntroduction

    to

    data

    mining and is

    designed

    o be

    accessible nd useful

    o

    students,

    nstructors,

    researchers,and

    professionals.

    Areas

    covered nclude

    data

    preprocessing,

    vi-

    sualization,

    predictive

    modeling, association

    analysis,

    clustering, and anomaly

    detection. The

    goal

    is

    to

    present

    fundamental

    concepts

    and

    algorithms

    for

    each

    topic,

    thus

    providing

    the

    reader

    with the

    necessarybackground for

    the

    application

    of data mining to real

    problems.

    In addition, this book also

    pro-

    vides

    a

    starting

    point

    for

    those

    readers

    who

    are

    interested

    n

    pursuing

    research

    in

    data

    mining or related

    fields.

    The

    book covers ive main topics: data,

    classification,

    associationanalysis,

    clustering,

    and anomaly

    detection.

    Except

    for anomaly

    detection,

    eachof these

    areas

    is covered in

    a

    pair

    of

    chapters.

    For classification, association analysis,

    and

    clustering,

    the introductory chapter

    coversbasic

    concepts, representative

    algorithms, and evaluation techniques,while the more advanced chapter dis-

    cussesadvanced concepts

    and

    algorithms.

    The objective

    is

    to

    provide

    the

    reader with

    a sound understanding of the foundations of data mining, while

    still covering many important advanced

    opics. Becauseof this approach, the

    book

    is useful

    both as a learning tool and

    as a

    reference.

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    9/790

    Preface

    ix

    To

    help

    the readers

    better understand the concepts that

    have

    been

    pre-

    sented,

    we

    provide

    an

    extensive set of examples, figures, and exercises.

    Bib-

    Iiographic

    notes

    are included

    at the end of each chapter for readers who

    are

    interested

    in more

    advanced

    opics, historically

    important

    papers,

    and

    recent

    trends.

    The

    book also

    contains a comprehensivesubject and author

    index.

    To

    the

    Instructor

    As

    a textbook, this book

    is

    suitable

    for

    a

    wide range

    of students

    at

    the advanced undergraduate or

    graduate

    level.

    Since

    students

    come

    to

    this subject

    with diverse backgrounds hat

    may not include

    extensive

    knowledge

    of statistics

    or databases,our book requires

    minimal

    prerequisites-

    no database

    knowledge is needed

    and we assume only a

    modest

    background

    in

    statistics

    or mathematics. To

    this end, the book was designed to

    be as

    self-contained

    as

    possible.

    Necessary

    material

    from

    statistics,

    linear

    algebra,

    and machine

    earning

    is either integrated into

    the

    body of the text, or for

    some

    advanced

    opics,

    covered

    n

    the appendices.

    Since

    the chapters

    covering major data mining topics are self-contained,

    the

    order

    in which topics

    can be covered s

    quite

    flexible. The

    core

    material

    is

    covered

    in

    Chapters 2, 4,

    6, 8, and

    10. Although the introductory

    data

    chapter (2)

    should

    be covered irst, the basic classification, association

    analy-

    sis, and

    clustering chapters

    (4,

    6, and 8,

    respectively) can be covered

    n any

    order.

    Because

    of the relationship of

    anomaly

    detection

    (10)

    to classification

    (4)

    and

    clustering

    (8),

    these

    chapters should

    precede

    Chapter

    10. Various

    topics

    can be

    selected rom the advanced classification, association analysis,

    and

    clustering

    chapters

    (5,

    7,

    and

    9,

    respectively) to fit the scheduleand in-

    terests

    of the

    instructor

    and students. We also

    advise that the lectures

    be

    augmented

    by

    projects

    or

    practical

    exercises

    n data

    mining. Although

    they

    are time

    consuming,

    such hands-on assignments

    greatly

    enhance he

    value

    of

    the

    course.

    Support

    Materials

    The supplements or the book are available at

    Addison-

    Wesley's

    Website

    www.aw.con/cssupport.

    Support

    materials available

    to all

    readers

    of this

    book

    include

    PowerPoint lecture

    slides

    Suggestions or student projects

    Data mining resources

    such as data mining algorithms and

    data

    sets

    On-line tutorials that

    give

    step-by-stepexamples

    or selecteddata

    mining

    techniquesdescribed n

    the book using actual

    data sets and data

    analysis

    software

    o

    o

    o

    o

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    10/790

    x Preface

    Additional

    support materials,

    including

    solutions

    to exercises,

    re available

    only to instructors

    adopting

    this textbook for classroom use.

    Please contact

    your

    school's

    Addison-Wesley representative or information

    on

    obtaining ac-

    cess o this

    material.

    Comments

    and

    suggestions,as well as

    reports of errors,

    can be sent

    to the

    authors through d [email protected].

    Acknowledgments Many

    people

    contributed to this book.

    We begin by

    acknowledging

    our families to whom

    this book is dedicated.

    Without their

    patience

    and support, this

    project

    would have

    been

    impossible.

    We

    would

    like

    to thank the current

    and

    former

    students of our

    data

    mining

    groups

    at the University

    of Minnesota

    and Michigan

    State

    for

    their contribu-

    tions. Eui-Hong

    (Sam)

    Han

    and Mahesh

    Joshi

    helped

    with the

    initial data

    min-

    ing classes.

    Some

    ofthe exercises nd

    presentation

    slides hat they

    created

    can

    be found in the book and its accompanying

    slides. Students

    in our data

    min-

    ing

    groups

    who

    provided

    comments

    on drafts of the book or who

    contributed

    in

    other ways

    include

    Shyam Boriah, Haibin

    Cheng, Varun Chandola,

    Eric

    Eilertson,

    Levent

    Ertoz,

    Jing Gao,

    Rohit Gupta, Sridhar

    Iyer,

    Jung-Eun

    Lee,

    Benjamin Mayer,

    Aysel

    Ozgur, Uygar Oztekin, Gaurav

    Pandey,

    Kashif

    Riaz,

    Jerry Scripps, Gyorgy Simon,

    Hui

    Xiong,

    Jieping

    Ye, and Pusheng

    Zhang. We

    would also like to thank the students

    of our data

    mining

    classes

    at the Univer-

    sity of Minnesota

    and Michigan

    State

    University

    who worked with

    early

    drafbs

    of

    the book and

    provided

    invaluable feedback.

    We

    specifically note

    the helpful

    suggestionsof

    Bernardo

    Craemer, Arifin

    Ruslim, Jamshid Vayghan,

    and

    Yu

    Wei.

    Joydeep Ghosh (University of Texas) and Sanjay Ranka (University of

    Florida)

    class

    ested early versions

    of the book.

    We

    also received

    many useful

    suggestions

    directly

    from

    the following

    UT students: Pankaj Adhikari, Ra-

    jiv

    Bhatia, Fbederic

    Bosche, Arindam

    Chakraborty,

    Meghana Deodhar, Chris

    Everson, David Gardner,

    Saad Godil, Todd Hay,

    Clint Jones,

    Ajay

    Joshi,

    Joonsoo Lee, Yue

    Luo, Anuj Nanavati,

    Tyler

    Olsen, Sunyoung

    Park,

    Aashish

    Phansalkar,

    Geoff

    Prewett, Michael

    Ryoo,

    Daryl

    Shannon, and

    Mei

    Yang.

    Ronald

    Kostoff

    (ONR)

    read

    an early

    version of the clustering

    chapter and

    offered

    numerous suggestions.

    George Karypis

    provided

    invaluable

    IATEX as-

    sistance in creating

    an

    author index. Irene Moulitsas

    also

    provided

    assistance

    with IATEX and reviewed some of the appendices. Musetta Steinbach was very

    helpful in

    finding

    errors in

    the

    figures.

    We

    would like to acknowledge

    our colleagues at the

    University

    of Min-

    nesota

    and Michigan State who have helped

    create a

    positive

    environment

    for

    data mining research. They include

    Dan Boley,

    Joyce Chai,

    Anil Jain, Ravi

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    11/790

    Preface

    xi

    Janardan,

    Rong Jin, GeorgeKarypis, Haesun Park, William

    F.

    Punch, Shashi

    Shekhar,

    and

    Jaideep Srivastava. The collaborators on our

    many

    data

    mining

    projects,

    who

    also have

    our

    gratitude,

    include Ramesh

    Agrawal,

    Steve

    Can-

    non, Piet

    C. de

    Groen,

    FYanHill, Yongdae Kim,

    Steve

    Klooster, Kerry Long,

    Nihar

    Mahapatra,

    Chris

    Potter,

    Jonathan

    Shapiro,

    Kevin

    Silverstein,

    Nevin

    Young, and Zhi-Li Zhang.

    The

    departments of

    Computer Scienceand

    Engineering at the

    University of

    Minnesota

    and Michigan

    State

    University

    provided

    computing

    resources

    and

    a supportive

    environment for

    this

    project.

    ARDA, ARL, ARO, DOE, NASA,

    and NSF

    provided

    research

    support

    for Pang-Ning Tan, Michael

    Steinbach,

    and

    Vipin

    Kumar. In

    particular,

    Kamal Abdali, Dick Brackney,

    Jagdish Chan-

    dra,

    Joe

    Coughlan, Michael

    Coyle, Stephen

    Davis, Flederica Darema, Richard

    Hirsch,

    Chandrika Kamath,

    Raju

    Namburu, N. Radhakrishnan, James

    Sido-

    ran,

    Bhavani

    Thuraisingham,

    Walt Tiernin, Maria

    Zemankova,and Xiaodong

    Zhanghave

    been supportive

    of our research

    n

    data

    mining and high-performance

    computing.

    It

    was

    a

    pleasure

    working with the helpful staff at

    Pearson Education.

    In

    particular,

    we would like

    to thank

    Michelle Brown, Matt

    Goldstein,

    Katherine

    Harutunian,

    Marilyn Lloyd, Kathy

    Smith,

    and

    Joyce

    Wells. We

    would

    also

    like

    to

    thank

    George Nichols, who helped

    with

    the art

    work and Paul

    Anag-

    nostopoulos,

    who

    provided

    I4.T[X support.

    We

    are

    grateful

    to the following

    Pearson

    reviewers:

    Chien-Chung Chan

    (University

    of

    Akron), Zhengxin

    Chen

    (University

    of Nebraska

    at Omaha), Chris Clifton

    (Purdue

    University),

    Joy-

    deep

    Ghosh

    (University

    of

    Texas, Austin), Nazli

    Goharian

    (Illinois

    Institute

    of Technology),

    J. Michael Hardin

    (University

    of

    Alabama),

    James Hearne

    (Western

    Washington

    University),

    Hillol Kargupta

    (University

    of Maryland,

    Baltimore

    County and

    Agnik, LLC), Eamonn Keogh

    (University

    of California-

    Riverside),

    Bing Liu

    (University

    of Illinois at

    Chicago),

    Mariofanna Milanova

    (University

    of

    Arkansas

    at Little

    Rock), Srinivasan

    Parthasarathy

    (Ohio

    State

    University),

    Zbigniew

    W. Ras

    (University

    of

    North

    Carolina

    at Charlotte),

    Xintao

    Wu

    (University

    of

    North

    Carolina at Charlotte),

    and

    Mohammed

    J.

    Zaki

    (Rensselaer

    Polvtechnic Institute).

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    12/790

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    13/790

    Gontents

    Preface

    Introduction

    1

    1.1

    What

    Is

    Data Mining?

    2

    7.2 Motivating

    Challenges

    4

    1.3 The

    Origins of Data

    Mining

    6

    1.4 Data Mining Tasks 7

    1.5

    Scope and

    Organization

    of the

    Book

    11

    1.6 Bibliographic Notes

    13

    vll

    t.7 Exercises

    Data

    16

    19

    2.I Types of Data

    22

    2.1.I Attributes

    and

    Measurement

    23

    2.L.2 Types

    of

    Data

    Sets

    .

    29

    2.2 Data

    Quality

    36

    2.2.I Measurement and Data Collection Issues 37

    2.2.2 Issues

    Related

    to

    Applications

    2.3 Data Preprocessing

    2.3.L Aggregation

    2.3.2

    Sampling

    2.3.3 Dimensionality

    Reduction

    2.3.4 Feature

    Subset

    Selection

    2.3.5 Feature

    Creation

    2.3.6 Discretization

    and

    Binarization

    2.3:7

    Variable

    Tlansformation

    .

    2.4 Measures

    of Similarity and

    Dissimilarity

    . . .

    2.4.L Basics

    2.4.2

    Similarity and

    Dissimilarity between Simple

    Attributes .

    2.4.3 Dissimilarities between Data Objects

    .

    2.4.4

    Similarities between

    Data

    Objects

    43

    44

    45

    47

    50

    52

    55

    57

    63

    65

    66

    67

    69

    72

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    14/790

    xiv

    Contents

    2.4.5

    Examples of

    Proximity

    Measures

    2.4.6 Issues

    n

    Proximity

    Calculation

    2.4.7

    Selecting

    he

    Right

    Proximity

    Measure

    2.5 BibliographicNotes

    2.6

    Exercises

    Exploring Data

    3.i The Iris Data

    Se t

    3.2 Summary

    Statistics

    3.2.L

    Frequencies

    and the Mode

    3.2.2

    Percentiles

    3.2.3 Measuresof Location:

    Mean and Median

    3.2.4 Measuresof Spread:

    Range and Variance

    3.2.5

    Multivariate

    Summary

    Statistics

    3.2.6 Other

    Ways

    to

    Summarize

    the

    Data

    3.3 Visualization

    3.3.1 Motivations for

    Visualization

    3.3.2 General Concepts

    3.3.3

    Techniques

    3.3.4

    Visualizing

    Higher-Dimensional

    Data

    .

    3.3.5

    Do's

    and

    Don'ts

    3.4

    OLAP

    and Multidimensional

    Data Analysis

    3.4.I

    Representing

    ris

    Data

    as a Multidimensional

    Array

    3.4.2

    Multidimensional

    Data: The

    General Case

    3.4.3 Analyzing

    Multidimensional

    Data

    3.4.4 Final Comments on Multidimensional Data Analysis

    Bibliographic

    Notes

    Exercises

    Classification:

    Basic Concepts,

    Decision Tlees,

    and Model

    Evaluation

    4.1 Preliminaries

    4.2

    General Approach to Solving

    a Classification

    Problem

    4.3 Decision

    Tlee Induction

    4.3.1 How

    a Decision Tlee

    Works

    4.3.2 How to Build a Decision TYee

    4.3.3 Methods

    for

    Expressing

    Attribute Test Conditions

    4.3.4 Measures or

    Selecting

    he Best

    Split

    .

    4.3.5

    Algorithm for Decision

    Tlee Induction

    4.3.6 An

    Examole: Web

    Robot Detection

    3.5

    3.6

    73

    80

    83

    84

    88

    97

    98

    98

    99

    100

    101

    102

    704

    105

    105

    105

    106

    110

    724

    130

    131

    131

    133

    135

    139

    139

    747

    L45

    746

    748

    150

    150

    151

    155

    158

    164

    166

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    15/790

    Contents

    xv

    4.3.7

    Characteristics of

    Decision Tlee Induction

    4.4 Model

    Overfitting

    4.4.L

    Overfitting

    Due

    to

    Presenceof Noise

    4.4.2

    Overfitting Due to Lack of Representative

    Samples

    4.4.3

    Overfitting and the

    Multiple

    Comparison

    Procedure

    4.4.4 Estimation of Generalization Errors

    4.4.5

    Handling

    Overfitting

    in Decision

    Tlee Induction

    4.5

    Evaluating

    the Performance of a

    Classifier

    4.5.I

    Holdout Method

    4.5.2

    Random Subsampling

    . . .

    4.5.3

    Cross-Validation

    4.5.4

    Bootstrap

    4.6 Methods

    for

    Comparing Classifiers

    4.6.L

    Estimating a

    Confidence

    nterval for Accuracy

    4.6.2

    Comparing the

    Performance of Two

    Models .

    4.6.3 Comparing the Performance of Two Classifiers

    4.7

    BibliographicNotes

    4.8

    Exercises

    5 Classification:

    Alternative Techniques

    5.1 Rule-Based

    Classifier

    5.1.1 How

    a Rule-Based Classifier

    Works

    5.1.2

    Rule-Ordering Schemes

    5.1.3 How

    to

    Build a

    Rule-Based

    Classifier

    5.1.4 Direct Methods for

    Rule

    Extraction

    5.1.5 Indirect Methods for

    Rule

    Extraction

    5.1.6

    Characteristics of Rule-Based Classifiers

    5.2

    Nearest-Neighbor

    classifiers

    5.2.L Algorithm

    5.2.2

    Characteristics

    of Nearest-Neighbor Classifiers

    5.3

    Bayesian

    Classifiers

    5.3.1

    Bayes Theorem

    5.3.2

    Using

    the Bayes Theorem

    for

    Classification

    5.3.3 Naive

    Bayes Classifier

    5.3.4 Bayes

    Error Rate

    5.3.5 Bayesian Belief Networks

    5.4

    Artificial

    Neural Network

    (ANN)

    5.4.I Perceptron

    5.4.2

    Multilayer

    Artificial Neural Network

    5.4.3

    Characteristics

    of ANN

    168

    172

    L75

    L77

    178

    179

    184

    186

    186

    187

    187

    188

    188

    189

    191

    192

    193

    198

    207

    207

    209

    2 I I

    2r2

    2r3

    22L

    223

    223

    225

    226

    227

    228

    229

    23L

    238

    240

    246

    247

    25r

    255

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    16/790

    xvi

    Contents

    5.5

    Support Vector

    Machine

    (SVM)

    5.5.1 Maximum Margin Hyperplanes

    5.5.2 Linear

    SVM: SeparableCase

    5.5.3 Linear

    SVM:

    Nonseparable

    Case

    5.5.4

    Nonlinear

    SVM

    .

    5.5.5 Characteristics of SVM

    Ensemble

    Methods

    5.6.1

    Rationale

    for Ensemble

    Method

    5.6.2 Methods

    for

    Constructing an Ensemble Classifier

    5.6.3 Bias-Variance Decomposition

    5.6.4 Bagging

    5.6.5

    Boosting

    5.6.6 Random

    Forests

    5.6.7 Empirical

    Comparison among Ensemble Methods

    Class

    Imbalance Problem

    5.7.1 Alternative Metrics

    5.7.2

    The

    Receiver

    Operating

    Characteristic Curve

    5.7.3 Cost-Sensitive

    Learning

    . .

    5.7.4

    Sampling-Based

    Approaches

    .

    Multiclass Problem

    Bibliographic

    Notes

    Exercises

    5 .6

    o . t

    256

    256

    259

    266

    270

    276

    276

    277

    278

    28r

    283

    285

    290

    294

    294

    295

    298

    302

    305

    306

    309

    315

    c .6

    5.9

    5.10

    Association

    Analysis: Basic

    Concepts and

    Algorithms

    327

    6.1

    Problem

    Definition

    .

    328

    6.2 Flequent Itemset Generation

    332

    6.2.I The Apri,ori

    Principle

    333

    6.2.2 Fbequent temset

    Generation in

    the

    Apri,ori, Algorithm .

    335

    6.2.3

    Candidate

    Generation

    and Pruning .

    . .

    338

    6.2.4

    Support

    Counting 342

    6.2.5

    Computational Complexity

    345

    6.3 Rule

    Generatiorr

    349

    6.3.1

    Confidence-Based

    Pruning

    350

    6.3.2

    Rule Generation

    in

    Apri,ori, Algorithm

    350

    6.3.3

    An

    Example: Congressional

    Voting Records

    352

    6.4

    Compact Representation of Fbequent temsets

    353

    6.4.7

    Maximal

    Flequent Itemsets

    354

    6.4.2 Closed Frequent Itemsets

    355

    6.5 Alternative

    Methods

    for

    Generating

    Frequent Itemsets

    359

    6.6 FP-Growth

    Alsorithm

    363

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    17/790

    Contents

    xvii

    6.6.1

    FP-tee

    Representation

    6.6.2

    Frequent

    Itemset Generation

    in FP-Growth Algorithm

    .

    6.7

    Evaluation

    of

    Association

    Patterns

    6.7.l Objective

    Measuresof

    Interestingness

    6.7.2

    Measures

    beyond

    Pairs

    of

    Binary Variables

    6.7.3 Simpson'sParadox

    6.8

    Effect of

    Skewed

    Support

    Distribution

    6.9

    Bibliographic

    Notes

    363

    366

    370

    37r

    382

    384

    386

    390

    404

    4L5

    415

    4t8

    418

    422

    424

    426

    429

    429

    431

    436

    439

    442

    443

    444

    447

    448

    453

    457

    457

    458

    458

    460

    461

    463

    465

    469

    473

    6.10

    Exercises

    7 Association Analysis:

    Advanced

    7.I Handling

    Categorical Attributes

    7.2

    Handling

    Continuous Attributes

    Concepts

    7.2.I Discretization-Based

    Methods

    7.2.2 Statistics-BasedMethods

    7.2.3 Non-discretizalion Methods

    Handling a

    Concept Hierarchy

    Seouential

    Patterns

    7.4.7

    Problem

    Formulation

    7.4.2 Sequential Pattern Discovery

    7.4.3

    Timing

    Constraints

    7.4.4

    Alternative

    Counting Schemes

    7.5 Subgraph

    Patterns

    7.5.1 Graphs

    and Subgraphs

    .

    7.5.2

    Frequent

    Subgraph

    Mining

    7.5.3 Apri,od-like Method

    7.5.4 Candidate

    Generation

    7.5.5 Candidate Pruning

    7.5.6

    Support

    Counting

    7.6

    Infrequent

    Patterns

    7.6.7

    Negative

    Patterns

    7.6.2

    Negatively

    Correlated

    Patterns

    7.6.3 Comparisons among Infrequent

    Patterns, Negative

    Pat-

    terns,

    and Negatively Correlated

    Patterns

    7.6.4

    Techniques

    or Mining Interesting

    Infrequent Patterns

    7.6.5

    Techniques

    Based on Mining

    Negative Patterns

    7.6.6 Techniques

    Based on Support

    Expectation .

    7.7

    Bibliographic

    Notes

    7.8 Exercises

    7.3

    7.4

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    18/790

    xviii Contents

    Cluster Analysis:

    Basic Concepts

    and Algorithms

    8.1 Overview

    8.1.1 What Is

    Cluster

    Analysis?

    8.I.2 Different

    Types of Clusterings .

    8.1.3 Different Types of

    Clusters

    8.2 K-means

    8.2.7

    The

    Basic K-means Algorithm

    8.2.2 K-means: Additional Issues

    8.2.3 Bisecting K-means

    8.2.4

    K-means

    and

    Different

    Types

    of

    Clusters

    8.2.5

    Strengths and

    Weaknesses

    8.2.6 K-means

    as

    an

    Optimization Problem

    8.3

    Agglomerative

    Hierarchical

    Clustering

    8.3.1 Basic Agglomerative

    Hierarchical

    Clustering

    Algorithm

    8.3.2

    Specific

    Techniques

    8.3.3 The Lance-Williams Formula for Cluster Proximity .

    8.3.4 Key Issues

    n

    Hierarchical

    Clustering

    .

    8.3.5

    Strengths and Weaknesses

    DBSCAN

    8.4.1 Tladitional Density:

    Center-Based

    Approach

    8.4.2

    The

    DBSCAN Algorithm

    8.4.3 Strengths

    and

    Weaknesses

    Cluster Evaluation

    8.5.1

    Overview

    8.5.2 Unsupervised

    Cluster Evaluation Using

    Cohesion and

    Separation

    8.5.3

    Unsupervised Cluster Evaluation Using the Proximity

    Matrix

    8.5.4

    Unsupervised

    Evaluation

    of

    Hierarchical

    Clustering

    .

    8.5.5 Determining the

    Correct Number of Clusters

    8.5.6

    Clustering Tendency

    8.5.7

    Supervised Measures

    of Cluster Validity

    8.5.8

    Assessing

    he Significance

    of Cluster

    Validity Measures

    8.4

    8.5

    487

    490

    490

    49r

    493

    496

    497

    506

    508

    510

    510

    513

    515

    516

    518

    524

    524

    526

    526

    527

    528

    530

    532

    533

    536

    542

    544

    546

    547

    548

    553

    ooo

    559

    8.6

    Bibliograph

    8.7

    Exercises

    ic Notes

    Cluster Analysis:

    Additional Issues

    and Algorithms 569

    9.1

    Characteristics

    of Data, Clusters,

    and Clustering

    Algorithms .

    570

    9.1.1 Example:

    Comparing

    K-means

    and

    DBSCAN . . . . . .

    570

    9.1.2 Data

    Characteristics 577

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    19/790

    Contents

    xix

    9.1.3 Cluster

    Characteristics

    .

    573

    9.L.4 General

    Characteristics

    f Clustering

    Algorithms

    575

    9.2

    Prototype-Based

    lustering 577

    9.2.1

    F\zzy

    Clustering

    577

    9.2.2 Clustering

    Using

    Mixture Models 583

    9.2.3 Self-Organizingaps(SOM) 594

    9.3

    Density-Based

    lustering

    600

    9.3.1

    Grid-Based

    lustering

    601

    9.3.2 Subspace

    lustering

    604

    9.3.3

    DENCLUE:

    A Kernel-Based

    cheme

    or

    Density-Based

    Clustering 608

    9.4 Graph-Based

    lustering

    612

    9.4.1 Sparsification 613

    9.4.2

    Minimum

    Spanning

    Tlee

    (MST)

    Clustering

    . . .

    674

    9.4.3 OPOSSUM:

    Optimal

    Partitioning

    of Sparse

    Similarities

    UsingMETIS 616

    9.4.4 Chameleon: Hierarchical

    Clustering

    with

    Dynamic

    Modeling

    9.4.5 Shared Nearest Neighbor

    Similarity

    9.4.6 The

    Jarvis-Patrick Clustering

    Algorithm

    9.4.7 SNN Density

    9.4.8 SNN Density-Based

    Clustering

    9.5 Scalable Clustering Algorithms

    9.5.1 Scalability:

    General

    Issuesand

    Approaches

    9.5.2

    BIRCH

    9.5.3 CURE

    9.6 Which Clustering Algorithm?

    9.7 Bibliographic

    Notes

    9.8

    Exercises

    616

    622

    625

    627

    629

    630

    630

    633

    635

    639

    643

    647

    10

    Anomaly Detection 651

    10.1 Preliminaries

    653

    10.1.1

    Causesof Anomalies

    653

    10.1.2 Approaches

    o

    Anomaly Detection 654

    10.1.3 The

    Use

    of

    Class

    Labels

    655

    10.1.4 ssues 656

    10.2

    Statistical

    Approaches 658

    t0.2.7 Detecting

    Outliers

    in a

    Univariate

    Normal Distribution 659

    10.2.2

    Out l iersinaMult ivariateNormalDistr ibut ion

    . . . . .

    661

    10.2.3 A Mixture Model Approach for

    Anomaly

    Detection. 662

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    20/790

    xx

    Contents

    10.2.4

    Strengths

    and Weaknesses

    10.3 Proximity-Based

    Outlier Detection

    10.3.1

    Strengths

    and

    Weaknesses

    10.4

    Density-Based

    Outlier

    Detection

    10.4.1

    Detection of

    Outliers

    Using Relative

    Density

    70.4.2 Strengths and Weaknesses

    10.5

    Clustering-Based

    Techniques

    10.5.1

    Assessing he Extent

    to

    Which an

    Object Belongs to

    a

    Cluster

    10.5.2 Impact

    of

    Outliers on

    the

    Initial

    Clustering

    10.5.3

    The Number of

    Clusters o

    Use

    10.5.4

    Strengths

    and

    Weaknesses

    665

    666

    666

    668

    669

    670

    67L

    672

    674

    674

    674

    675

    680

    685

    b6i)

    10.6

    Bibliograph

    10.7

    Exercises

    ic Notes

    Appendix A Linear Algebra

    A.1

    Vectors

    A.1.1

    Definition

    685

    4.I.2

    Vector Addition and Multiplication by a

    Scalar 685

    A.1.3

    Vector

    Spaces

    687

    4.7.4 The

    Dot Product,

    Orthogonality, and Orthogonal

    Projections

    688

    A.1.5

    Vectors and

    Data

    Analysis

    690

    42

    Matrices

    691

    A.2.1 Matrices:

    Definitions

    691

    A-2.2 Matrices: Addition and Multiplication by a Scalar 692

    4.2.3 Matrices:

    Multiplication

    693

    4.2.4

    Linear tansformations

    and Inverse

    Matrices

    695

    4.2.5 Eigenvalue

    and Singular

    Value Decomposition

    .

    697

    4.2.6

    Matrices

    and

    Data

    Analysis

    699

    A.3

    Bibliographic

    Notes

    700

    Appendix B Dimensionality

    Reduction

    7OL

    8.1

    PCA

    and

    SVD 70I

    B.1.1 Principal

    Components

    Analysis (PCA)

    70L

    8.7.2 SVD .

    706

    8.2 Other

    Dimensionality

    Reduction Techniques

    708

    8.2.I

    Factor Analysis

    708

    8.2.2 Locally

    Linear

    Embedding

    (LLE)

    .

    770

    8.2.3 Multidimensional

    Scaling, FastMap, and

    ISOMAP

    7I2

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    21/790

    Contents xxi

    8.2.4

    Common Issues

    B.3 Bibliographic

    Notes

    Appendix

    C Probability and Statistics

    C.1

    Probability

    C.1.1 Expected Values

    C.2 Statistics

    C.2.L Point Estimation

    C.2.2

    Central

    Limit Theorem

    C.2.3

    Interval Estimation

    C.3 Hypothesis Testing

    Appendix

    D Regression

    D.1 Preliminaries

    D.2

    Simple

    Linear

    Regression

    D.2.L Least SquareMethod

    D.2.2 Analyzing

    Regression

    Errors

    D.2.3 Analyzing

    Goodnessof

    Fi t

    D.3

    Multivariate

    Linear

    Regression

    D.4

    Alternative Least-Square

    Regression

    Methods

    Appendix

    E Optimization

    E.1

    Unconstrained

    Optimizafion

    E.1.1 Numerical Methods

    8.2

    Constrained

    Optimization

    E.2.I Equality Constraints

    8.2.2 Inequality

    Constraints

    Author

    Index

    Subject

    Index

    Copyright

    Permissions

    715

    7L6

    7L9

    7L9

    722

    723

    724

    724

    725

    726

    739

    739

    742

    746

    746

    747

    750

    758

    769

    729

    729

    730

    731

    733

    735

    736

    737

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    22/790

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    23/790

    1

    Introduction

    Rapid advances in data

    collection and storage

    technology have enabled or-

    ganizations

    to accumulate vast

    amounts

    of data. However, extracting useful

    information has

    proven

    extremely challenging. Often, traditional data analy-

    sis tools

    and techniques

    cannot be used because

    of the massive size of a data

    set.

    Sometimes, he non-traditional nature of the data

    means

    that traditional

    approachescannot

    be applied even f the data set

    is relatively small. In other

    situations,

    the

    questions

    hat

    need

    to be answered

    cannot be addressedusing

    existing

    data analysis techniques, and thus, new methods

    need

    to

    be

    devel-

    oped.

    Data mining

    is a technology

    that blends traditional

    data analysis

    methods

    with

    sophisticated algorithms for

    processing

    arge volumes of data.

    It has also

    opened

    up exciting opportunities for exploring and analyzing

    new types of

    data and for analyzing old types of data in new ways. In this introductory

    chapter, we

    present

    an overview of data mining and

    outline the key topics

    to

    be covered n this

    book. We start

    with

    a

    description of some

    well-known

    applications

    that require new techniques or data analysis .

    Business Point-of-sale

    data collection

    (bar

    code scanners,

    radio frequency

    identification

    (RFID),

    and smart card

    technology) have allowed

    retailers to

    collect up-to-the-minute

    data about customer

    purchases

    at the checkout coun-

    ters

    of their stores.

    Retailers can utilize

    this information, along

    with other

    business-critical

    data such as Web logs from e-commerce Web

    sites and cus-

    tomer service records from call centers, to help them better understand the

    needs

    of

    their

    customers and make more informed business

    decisions.

    Data mining

    techniques can be used to support

    a wide

    range

    of business

    intelligence

    applications such as customer

    profiling,

    targeted

    marketing, work-

    flow management,

    store layout,

    and

    fraud

    detection.

    It

    can also

    help retailers

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    24/790

    2

    Chapter 1 lntroduction

    answer important

    business

    questions

    such as

    "Who

    are the most

    profitable

    customers?"

    "What

    products

    can

    be cross-soldor up-sold?" and

    "What

    is

    the

    revenue

    outlook of

    the company

    for

    next

    year?))

    Some

    of

    these

    questions

    mo-

    tivated the creation of

    association analvsis

    (Chapters

    6

    and

    7),

    a new data

    analysis

    technique.

    Medicine,

    Science, and Engineering

    Researchers

    n medicine, science,

    and

    engineering are rapidly accumulating

    data that is key to important new

    discoveries. For

    example, as an important

    step toward

    improving

    our under-

    standing of the Earth's

    climate system, NASA has

    deployed a seriesof

    Earth-

    orbiting satellites

    that continuously

    generate global

    observations of the

    Iand

    surface, oceans, and

    atmosphere. However, because of the

    size and spatio-

    temporal nature

    of the data, traditional

    methods are often not suitable for

    analyzing these

    data sets. Techniques

    developed

    n

    data mining can aid Earth

    scientists

    in

    answering

    questions

    such as

    "What

    is the relationship

    between

    the frequency

    and intensity

    of ecosystem disturbances such

    as droughts and

    hurricanes o

    global

    warming?"

    "How

    is land

    surface

    precipitation

    and temper-

    ature

    affected by ocean surface

    temperature?" and

    "How

    well can we

    predict

    the beginning

    and end of the

    growing

    season

    or

    a

    region?"

    As

    another example, researchers

    n molecular biology hope

    to

    use

    the

    large

    amounts

    of

    genomic

    data currently

    being

    gathered

    to better understand the

    structure and function

    of

    genes.

    In the

    past,

    traditional

    methods in

    molecu-

    lar

    biology allowed

    scientists to study

    only a few

    genes

    at a time in a

    given

    experiment.

    Recent

    breakthroughs

    in microarray technology have

    enabled sci-

    entists to

    compare he behavior

    of thousands of

    genes

    under var ious situations.

    Such comparisons can help

    determine the function of each

    gene

    and

    perhaps

    isolate

    the

    genes

    esponsible or

    certain

    diseases.

    However,

    he noisy and high-

    dimensional nature

    of data requires new

    types of data analysis. In addition

    to

    analyzing

    gene

    array

    data, data mining

    can also be used to address other

    important

    biological challenges

    such as

    protein

    structure

    prediction,

    multiple

    sequence lignment,

    the

    modeling

    of

    biochemical

    pathways,

    and

    phylogenetics.

    1.1 What Is

    Data Mining?

    Data mining is the processof automatically discovering useful information in

    large

    data

    repositories.

    Data mining

    techniques are deployed

    to scour

    large

    databases in

    order to find novel

    and useful

    patterns

    that might

    otherwise

    remain

    unknown. They

    also

    provide

    capabilities to

    predict

    the outcome of a

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    25/790

    1.1

    What

    Is

    Data Mining?

    3

    future

    observation,

    such

    as

    predicting

    whether

    a

    newly

    arrived.

    customer

    will

    spend

    more

    than

    $100

    at a department

    store.

    Not

    all information

    discovery

    asks are considered

    o be

    data mining.

    For

    example,

    Iooking

    up individual

    records

    using

    a database managemenr

    sysrem

    or

    finding

    particular

    Web

    pages

    via a

    query

    to

    an

    Internet

    search

    engine

    are

    tasks related to the area of information retrieval. Although such tasks are

    important

    and may

    involve

    the

    use of

    the sophisticated

    algorithms

    and

    data

    structures,

    they rely

    on traditional

    computer

    science echniques

    and obvious

    features

    of the

    data

    to

    create index

    structures

    for

    efficiently

    organizing

    and

    retrieving

    information.

    Nonetheless,

    data mining

    techniques

    have been

    used

    to

    enhance

    nformation

    retrieval

    systems.

    Data

    Mining

    and

    Knowledge

    Discovery

    Data

    mining

    is

    an integral

    part

    of knowledge

    discovery

    in

    databases

    (KDD), which is the overall process of converting raw data into useful in-

    formation,

    as shown

    in

    Figure

    1.1.

    This

    process

    consists of

    a series

    of trans-

    formation

    steps,

    from

    data

    preprocessing

    o

    postprocessing

    of data mining

    results.

    Information

    Figure

    1. The

    process

    f

    knowledge

    iscovery

    n

    databases

    KDD).

    The

    input

    data

    can be stored

    in a variety

    of

    formats

    (flat

    files,

    spread-

    sheets,

    or relational

    tables) and

    may

    reside

    in a centralized

    data repository

    or

    be

    distributed

    across multiple

    sites. The purpose

    of

    preprocessing

    is

    to transform the raw input data into an appropriate format for subsequent

    analysis.

    The

    steps

    involved

    in

    data

    preprocessing

    nclude

    fusing

    data

    from

    multiple

    sources,

    cleaning

    data

    to

    remove

    noise

    and

    duplicate

    observations,

    and

    selecting

    records

    and features

    that

    are

    relevant

    to the

    data mining

    task

    at

    hand.

    Because

    of

    the many

    ways

    data can

    be collected

    and stored,

    data

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    26/790

    4 Chapter 1

    Introduction

    preprocessing

    s

    perhaps

    the

    most

    laborious and

    time-consuming

    step

    in the

    overall

    knowledge

    discovery

    process.

    ,,Closing

    the

    loop" is the

    phrase

    often used

    to refer to the

    process

    of

    in-

    tegrating data

    mining results

    into

    decision support

    systems.

    For example,

    in business applications,

    the

    insights offered

    by data

    mining

    results can

    be

    integrated with campaign management tools so that effective marketing pro-

    motions

    can be

    conducted

    and

    tested.

    Such

    integration

    requires

    a

    postpro-

    cessing step that

    ensures

    hat

    only valid and

    useful

    results

    are

    incorporated

    into

    the decision

    support system.

    An example

    of

    postprocessing s visualiza-

    tion

    (see

    Chapter

    3), which

    allows

    analysts to

    explore

    the data

    and

    the data

    mining

    results from a variety

    of viewpoints.

    Statistical

    measures

    or

    hypoth-

    esis

    testing methods

    can also

    be applied

    during

    postprocessing o eliminate

    spurious

    data

    mining

    results.

    L.2 Motivating Challenges

    As

    mentioned earlier,

    traditional

    data analysis

    techniques

    have

    often encoun-

    tered

    practical

    difficulties

    in meeting the

    challenges

    posed

    by

    new data

    sets.

    The

    following are

    some of the specific

    challenges

    hat motivated

    the develop-

    ment

    of data mining.

    Scalability

    Because

    of advances

    n

    data

    generation and collection,

    data sets

    with

    sizes of

    gigabytes,

    terabytes,

    or even

    petabytes are becoming

    common.

    If

    data mining

    algorithms are

    to

    handle

    these

    massive

    data

    sets, then they

    must be scalable. Many data mining algorithms employ special searchstrate-

    gies

    to handle exponential

    search

    problems.

    Scalability

    may also

    require the

    implementation

    of novel data structures

    to access

    ndividual

    records

    n

    an

    ef-

    ficient

    manner.

    For instance,

    out-of-core algorithms

    may be

    necessary

    when

    processing

    data

    sets that cannot

    fit into main memory.

    Scalability

    can also

    be

    improved

    by using

    sampling

    or developing

    parallel

    and

    distributed

    algorithms.

    High

    Dimensionality

    It

    is now common

    to

    encounter data

    sets with

    hun-

    dreds

    or thousands

    of

    attributes

    instead of the

    handful common

    a

    few decades

    ago.

    In bioinformatics,

    progress n microarray

    technology

    has

    produced

    gene

    expression

    data

    involving thousands

    of

    features.

    Data sets

    with temporal

    or

    spatial

    components

    also tend

    to

    have high

    dimensionality.

    For example,

    consider

    a data

    set that contains

    measurements

    of temperature

    at

    various

    locations.

    If

    the

    temperature

    measurements

    are

    taken

    repeatedly

    for an ex-

    tended

    period,

    the

    number of dimensions

    (features)

    ncreases

    n

    proportion

    to

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    27/790

    L.2 Motivating

    Challenges

    5

    the number

    of measurements

    aken.

    Tladitional

    data

    analysis echniques

    hat

    were

    developed

    for low-dimensional

    data

    often do not work

    well for

    such high-

    dimensional

    data. Also,

    for

    some data

    analysis

    algorithms,

    the computational

    complexity

    increases

    apidly

    as

    the dimensionality (the

    number

    of features)

    increases.

    Heterogeneous

    and

    Complex

    Data TYaditional

    data analysis

    methods

    often

    deal with

    data sets

    containing

    attributes

    of the same

    ype, either

    contin-

    uous or

    categorical.

    As

    the role

    of data

    mining in

    business,science,

    medicine,

    and

    other

    flelds

    has

    grown,

    so has the need

    for techniques

    that

    can handle

    heterogeneous

    ttributes.

    Recent

    years

    have

    also seen he

    emergence

    of more

    complex

    data

    objects.

    Examples

    of such non-traditional

    types of data

    include

    collections

    of

    Web

    pages

    containing

    semi-structured

    text and hyperlinks;

    DNA

    data

    with

    sequential

    and three-dimensional

    structure;

    and climate

    data that

    consists

    of

    time

    seriesmeasurements

    temperature, pressure,

    etc.) at various

    locations

    on

    the Earth's

    surface. Techniques

    developed

    or mining

    such com-

    plex

    objects

    should

    take into

    consideration

    relationships

    in the

    data,

    such as

    temporal

    and

    spatial

    autocorrelation, graph

    connectivity,

    and

    parent-child

    re-

    lationships

    between

    he

    elements

    n semi-structured

    ext

    and

    XML

    documents.

    Data

    ownership

    and Distribution

    Sometimes, the data needed

    for an

    analysis

    is

    not

    stored in

    one location

    or owned by one

    organization.

    Instead,

    the

    data is geographically

    distributed

    among

    resources

    belonging

    to multiple

    entities.

    This

    requires

    the

    development

    of distributed

    data mining

    techniques.

    Among the key challenges aced by distributed data mining algorithms in-

    clude (1)

    how

    to reduce

    the amount

    of

    communication needed

    o

    perform

    the

    distributed

    computatior,

    (2)

    how

    to effectively

    consolidate the

    data mining

    results

    obtained

    from

    multiple

    sources,

    and

    (3)

    how

    to address

    data

    security

    issues.

    Non-traditional

    Analysis

    The

    traditional

    statistical approach

    s

    based on

    a hypothesize-and-test

    paradigm.

    In

    other words, a hypothesis

    is

    proposed,

    an

    experiment

    is

    designed

    o

    gather

    the data,

    and then the data

    is

    analyzed

    with respect

    o the hypothesis.

    Unfortunately,

    this

    process

    s

    extremely

    labor-

    intensive. Current data analysis tasks

    often require

    the

    generation

    and evalu-

    ation

    of thousands

    of hypotheses,

    and consequently,

    he development

    of

    some

    data

    mining

    techniques

    has

    been motivated

    by

    the desire to

    automate

    the

    process

    of

    hypothesis generation

    and evaluation.

    Furthermore,

    the data

    sets

    analyzed

    in

    data

    mining

    are typically

    not the result

    of a carefully

    designed

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    28/790

    6 Chapter

    1 Introduction

    experiment

    and

    often

    representopportunistic samples

    of

    the data,

    rather than

    random samples.

    Also, the data sets

    frequently

    involve

    non-traditional

    types

    of data and

    data distributions.

    1.3 The Origins of Data Mining

    Brought

    together

    by the

    goal

    of

    meeting the challenges

    of

    the

    previous

    sec-

    tion,

    researchers

    rom

    different

    disciplines began

    to

    focus on developing

    more

    efficient

    and

    scalable ools

    that could handle diverse

    ypes

    of data.

    This work,

    which culminated

    in the

    field

    of data

    mining, built

    upon the

    methodology

    and

    algorithms

    that researchers

    had

    previously

    used.

    In

    particular,

    data

    mining

    draws

    upon

    ideas,

    such

    as

    (1)

    sampling, estimation,

    and

    hypothesis

    testing

    from statistics

    and

    (2)

    search algorithms,

    modeling

    techniques,

    and

    learning

    theories

    from artificial

    intelligence,

    pattern

    recognition,

    and

    machine

    earning.

    Data mining has also been quick to adopt ideas from other areas, ncluding

    optimization,

    evolutionary

    computing,

    information theory,

    signal

    processing,

    visualization,

    and

    information

    retrieval.

    A number

    of other

    areas also

    play

    key supporting

    roles.

    In

    particular,

    database

    systems

    are

    needed to

    provide

    support

    for efficient

    storage,

    index-

    ing, and

    query

    processing.

    Techniques

    rom high

    performance (parallel)

    com-

    puting

    are

    often important

    in addressing he

    massive size of

    some

    data

    sets.

    Distributed

    techniques

    can also

    help

    address

    he issueof

    size and are

    essential

    when the

    data

    cannot be

    gathered

    in one location.

    Figure

    1.2 shows he

    relationship of data

    mining to other

    areas.

    Figure

    .2.Data iningsa conlluencef

    manyisciplines.

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    29/790

    Data

    Mining Tasks

    7

    1.4

    Data

    Mining

    Tasks

    Data

    mining

    tasks

    are

    generally

    divided

    into two

    major

    categories:

    Predictive

    tasks.

    The

    objective

    of these

    asks

    s

    to

    predict

    the value

    of a

    par-

    ticular

    attribute

    based

    on the values

    of other

    attributes. The attribute

    to

    be

    predicted

    is

    commonly

    known

    as the target

    or dependent vari-

    able,

    while

    the

    attributes

    used for

    making the

    prediction

    are known

    as

    the explanatory

    or

    independent

    variables.

    Descriptive

    tasks. Here,

    the objective is

    to derive

    patterns

    (correlations,

    trends,

    clusters,

    trajectories,

    and anomalies)

    that

    summarize the

    un-

    derlying

    relationships

    in

    data. Descriptive

    data mining tasks

    are often

    exploratory

    in nature

    and frequently

    require

    postprocessing

    echniques

    to validate

    and

    explain

    the results.

    Figure 1.3 illustrates four of the core data mining tasks that are described

    in

    the

    remainder

    of

    this book.

    Four

    f

    he

    core atamining

    asks.

    L.4

    I

    Figure.3.

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    30/790

    8 Chapter

    1

    Introduction

    Predictive modeling

    refers to the task of

    building a

    model

    for the

    target

    variable as a function

    of the

    explanatory variables.

    There

    are two

    types of

    predictive

    modeling

    tasks: classification,

    which

    is used

    for

    discrete target

    variables,

    and regression,

    which

    is

    used

    for continuous

    target

    variables.

    For

    example,

    predicting

    whether

    a Web user

    will

    make a

    purchase

    at an

    online

    bookstore is a classification ask because he target variable is binary-valued.

    On the other

    hand,

    forecasting

    the future

    price

    of

    a stock

    is a

    regression task

    because

    price

    is a continuous-valued

    attribute.

    The

    goal

    of

    both tasks

    is to

    learn

    a

    model that

    minimizes

    the error between

    the

    predicted

    and

    true

    values

    of the target

    variable.

    Predictive

    modeling

    can

    be used to

    identify

    customers

    that

    will respond

    to a

    marketing campaign,

    predict

    disturbances

    n the

    Earth's

    ecosystem,or

    judge

    whether

    a

    patient

    has a

    particular

    disease

    based on

    the

    results of

    medical

    tests.

    Example 1.1

    (Predicting

    the

    Type of

    a

    Flower). Consider

    he task of

    predicting a speciesof flower based on the characteristics of the flower. In

    particular,

    consider

    classifying

    an

    Iris flower

    as to

    whether it belongs

    to one

    of the

    following

    three Iris species: Setosa,

    Versicolour,

    or

    Virginica.

    To

    per-

    form this task,

    we need a data

    set containing

    the

    characteristics

    of

    various

    flowers of these

    three species.

    A

    data set

    with

    this type of

    information

    is

    the well-known

    Iris data set from

    the

    UCI

    Machine

    Learning

    Repository

    at

    http:

    /hrurw.ics.uci.edu/-mlearn.

    In

    addition

    to the species

    of a flower,

    this

    data set contains

    four

    other

    attributes:

    sepal

    width, sepal

    length,

    petal

    length,

    and

    petal

    width.

    (The Iris data set and

    its attributes

    are described

    further in

    Section

    3.1.)

    Figure

    1.4

    shows a

    plot

    of

    petal

    width

    versus

    petal

    length for the 150 flowers in the Iris data set. Petal width is broken into the

    categories

    ow,

    med'ium, and hi'gh,

    which correspond

    o the

    intervals

    [0'

    0.75),

    [0.75,

    1.75),

    [1.75,

    oo),

    respectively.

    Also,

    petal

    length

    s broken

    nto categories

    low, med,'ium, nd

    hi,gh,which

    correspond o the

    intervals

    [0'

    2.5),

    [2.5,5),

    [5 '

    oo),

    respectively.

    Based on these

    categories

    of

    petal

    width

    and length,

    the

    following rules

    can be derived:

    Petal

    width low and

    petal

    length

    low implies

    Setosa.

    Petal width medium

    and

    petal

    length

    medium implies

    Versicolour.

    Petal width

    high and

    petal

    length

    high implies Virginica.

    While these rules do not classify all the flowers, they do a good (but not

    perfect)

    job

    of

    classifying

    most of the

    flowers.

    Note that

    flowers

    from the

    Setosa

    species

    are

    well separated

    from the

    Versicolour

    and

    Virginica

    species

    with

    respect to

    petal

    width and

    length,

    but

    the

    latter two

    species

    overlap

    somewhat with

    respect to these

    attributes.

    I

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    31/790

    r

    Setosa

    .

    Versicolour

    o

    Virginica

    L.4

    Data Mining

    Tasks

    I

    l - - - - a - - f o - - - - - - - i

    l a

    o

    r

    , f t f o o t o

    a i

    :

    o o

    o I

    I

    '

    t 0 f 0 a o

    0?oo

    r

    a

    a r f I

    ? 1 . 7 5

    E

    ()

    r

    1 . 5

    E

    =

    (t'

    ()

    ( L l

    0_l _.

    o_

    o

    . t

    a a r O

    . .4.

    a?o

    o a a a a

    a aaaaaaa

    a a a a

    a

    aa

    a a a a a

    I

    I

    l l t l l

    l l t

    I

    I l l l l t t

    I

    I t

    1 2 2 . 5 3 4 5 (

    PetalLength

    cm)

    Figure .4.

    Petal

    idth ersus

    etal

    engthor

    1

    50

    ris lowers,

    Association

    analysis

    is used

    to discover

    patterns

    that describe strongly

    as-

    sociated

    eatures

    n the

    data.

    The

    discovered

    patterns

    are

    typically represented

    in

    the

    form

    of

    implication

    rules

    or feature

    subsets.

    Because

    of the exponential

    size of

    its

    search

    space, he goal

    of association

    analysis s

    to extract

    the most

    interesting patterns

    in

    an efficient

    manner.

    Useful applications

    of association

    analysis

    include

    finding

    groups

    of

    genes

    hat have

    related functionality,

    identi-

    fying

    Web

    pages

    hat

    are accessed

    ogether, or understanding

    the

    relationships

    between

    different

    elements

    of

    Earth's

    climate

    system.

    Example

    1.2 (Market

    Basket

    Analysis).

    The

    transactions

    shown

    in

    Ta-

    ble 1.1

    illustrate

    point-of-sale

    data

    collected at

    the checkout counters

    of a

    grocery

    store. Association

    analysis

    can be applied

    to find items

    that are

    fre-

    quently

    bought

    together

    by

    customers. For

    example,

    we may discover

    the

    rule

    {Diapers}

    -----*

    {lt:.ft},

    which suggests

    hat

    customers who buy diapers

    also tend to buy milk. This type of rule can be used to identify potential

    cross-selling

    opportunities

    among related

    items.

    I

    Cluster

    analysis

    seeks

    o find

    groups

    of closely related

    observationsso

    that

    observations

    hat

    belong to

    the same cluster are

    more

    similar to each

    other

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    32/790

    10

    Chapter

    1

    Introduction

    Table

    1. Marketasket

    ata.

    Tlansaction

    ID Items

    1

    2

    3

    4

    r

    o

    7

    8

    9

    10

    {Bread,

    Butter,

    Diapers,

    Milk}

    {Coffee,

    Sugar, Cookies,

    Sakoon}

    {Bread,

    Butter, Coffee,

    Diapers,

    Milk,

    Eggs}

    {Bread, Butter, Salmon,Chicken}

    {fgg",

    Bread, Butter}

    {Salmon,

    Diapers, Milk}

    {Bread,

    Tea, Sugar, Eggs}

    {Coffee,

    Sugar, Chicken,

    Eggs}

    {Bread,

    Diapers, Mi1k,

    Salt}

    {Tea,

    Eggs,

    Cookies,

    Diapers,

    Milk}

    than

    observations

    hat belong

    to other clusters.

    Clustering

    has been used

    to

    group

    sets of

    related

    customers,

    ind areas of

    the ocean

    hat

    have a significant

    impact on the Earth's climate, and compressdata.

    Example

    1.3

    (Document

    Clustering).

    The collection

    of

    news

    articles

    shown

    in Table 1.2

    can be

    grouped

    based on

    their

    respective topics.

    Each

    article

    is represented

    as

    a set

    of

    word-frequency

    pairs (r,

    "),

    where tu

    is

    a

    word

    and

    c is the number

    of times the

    word appears

    in the article.

    There are

    two

    natural

    clusters

    in the data

    set.

    The first cluster

    consists

    of the first

    four ar-

    ticles,

    which correspond

    to

    news about the economy,

    while

    the second

    cluster

    contains

    the

    last four articles,

    which correspond

    to

    news

    about

    health care.

    A

    good

    clustering algorithm

    should

    be able to

    identify

    these

    two clusters

    based

    on the similarity between words that appear

    in the articles.

    Table .2.

    Collectionf news

    rticles.

    Article

    Words

    I

    2

    .)

    A

    r

    J

    o

    7

    8

    dollar:

    1, industry:

    4, country: 2,

    loan: 3,

    deal: 2,

    government: 2

    machinery:

    2, labor: 3,

    market:

    4, industry:

    2, work: 3,

    country:

    1

    job:

    5,

    inflation: 3, rise:

    2,

    jobless:

    2,

    market:

    3,

    country:

    2,

    index: 3

    domestic:

    3, forecast:

    2,

    gain:

    1, market:

    2, sale: 3,

    price:

    2

    patient: 4,

    symptom:

    2, drug: 3,

    health:

    2, clinic:

    2, doctor:

    2

    pharmaceutical:2,

    company: 3,

    drug:

    2,vaccine:1,

    flu:

    3

    death: 2, cancer: 4, drug: 3, public: 4, health: 3, director:

    2

    medical:

    2,

    cost: 3,

    increase:

    2,

    patient:

    2, health: 3,

    care:

    1

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    33/790

    1.5

    Scope

    and

    Organization of

    the Book 11

    Anomaly

    detection

    is

    the task

    of

    identifying

    observationswhose

    character-

    istics

    are significantly

    different

    from the rest

    of

    the data. Such

    observations

    are known

    as

    anomalies

    or outliers.

    The

    goal

    of an anomaly

    detection al-

    gorithm

    is to

    discover

    the

    real

    anomalies

    and

    avoid

    falsely labeling

    normal

    objects

    as anomalous.

    In

    other

    words, a

    good

    anomaly

    detector must have

    a high detection rate and a low false alarm rate. Applications of anomaly

    detection include

    the

    detection

    of fraud, network

    intrusions,

    unusual

    patterns

    of disease,

    and

    ecosystem

    disturbances.

    Example

    1.4

    (Credit

    Card

    trYaud Detection).

    A

    credit

    card company

    records

    he

    transactions

    made

    by every

    credit card holder,

    along with

    personal

    information

    such

    as credit limit,

    age, annual income,

    and address.

    since the

    number

    of

    fraudulent

    cases is

    relatively

    small

    compared to the

    number of

    legitimate

    transactions,

    anomaly

    detection techniques

    can be applied

    to build

    a

    profile

    of legitimate

    transactions

    for

    the users.

    When a

    new

    transaction

    arrives, it is compared against the profile of the user. If the characteristicsof

    the

    transaction

    are very different

    from

    the

    previously

    created

    profile,

    then the

    transaction is

    flagged

    as

    potentially

    fraudulent.

    I

    1.5

    Scope

    and

    Organization

    of

    the

    Book

    This

    book introduces

    the major principles

    and

    techniquesused n

    data mining

    from

    an

    algorithmic perspective.

    A

    study of these

    principles

    and

    techniques

    s

    essential or

    developing

    a better

    understanding

    of how data mining

    technology

    can

    be applied

    to various kinds

    of data. This

    book also serves

    as a starting

    point for readerswho are interested in doing research n this field.

    We begin

    the

    technical discussion

    of this

    book

    with

    a chapter

    on data

    (Chapter

    2),

    which discusses

    he basic types

    of data, data

    quality,

    prepro-

    cessing

    techniques,

    and measures

    of similarity

    and dissimilarity.

    Although

    this material

    can be

    covered

    quickly,

    it

    provides

    an essential

    foundation for

    data analysis.

    Chapter 3, on

    data exploration,

    discusses ummary

    statistics,

    visualization

    techniques,

    and

    On-Line Analytical

    Processing

    (OLAP).

    These

    techniques

    provide

    the means

    for

    quickly

    gaining

    insight into

    a data

    set.

    Chapters 4

    and

    5

    cover

    classification.

    Chapter 4

    provides

    a foundation

    by

    discussing

    decision

    tree classifiers

    and several issues

    that

    are

    important

    to all classification:

    overfitting, performance

    evaluation,

    and the

    comparison

    of different

    classification

    models.

    Using this

    foundation,

    Chapter

    5

    describes

    a

    number

    of

    other important

    classification techniques: rule-based

    systems,

    nearest-neighbor

    lassifiers,

    Bayesian

    classifiers,

    artificial neural

    networks, sup-

    port

    vector machines,

    and ensemble

    classifiers,

    which are collections

    of classi-

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    34/790

    2 Chapter

    1

    lntroduction

    fiers.

    The multiclass and

    imbalanced class

    problems are also

    discussed.

    These

    topics

    can be covered

    ndependently.

    Association

    analysis

    s explored

    in

    Chapters

    6 and 7.

    Chapter 6

    describes

    the

    basics of association

    analysis:

    frequent

    itemsets, association

    rules,

    and

    some

    of the algorithms

    used to

    generate

    them.

    Specific

    types of

    frequent

    itemsets-maximal, closed,and hyperclique-that are important for data min-

    ing are

    also discussed,

    and the

    chapter concludes

    with

    a discussion

    of

    evalua-

    tion

    measures or

    association analysis. Chapter

    7 considers

    a

    variety of

    more

    advanced

    opics, including

    how

    association

    analysis

    can be

    applied to categor-

    ical and

    continuous

    data or to data that

    has a concept

    hierarchy.

    (A

    concept

    hierarchy

    is

    a

    hierarchical categorization

    of objects,

    e.g., store

    items, clothing,

    shoes,

    sneakers.)

    This

    chapter

    also describes

    how

    association

    analysis can

    be

    extended

    to find sequential

    patterns (patterns involving

    order),

    patterns in

    graphs,

    and

    negative

    relationships

    (if

    one item

    is

    present,

    then

    the other

    is

    not).

    Cluster

    analysis

    s discussed

    n

    Chapters 8

    and

    9. Chapter

    8 first describes

    the different

    types

    of clusters and

    then

    presents

    hree specific

    clustering

    tech-

    niques:

    K-means, agglomerative

    hierarchical

    clustering, and

    DBSCAN.

    This

    is followed

    by a discussion

    of techniques

    for validating

    the

    results

    of a cluster-

    ing algorithm.

    Additional clustering

    concepts

    and techniques

    are explored

    in

    Chapter

    9,

    including

    fiszzy and

    probabilistic

    clustering, Self-Organizing

    Maps

    (SOM),

    graph-based

    clustering,

    and density-based

    clustering.

    There is

    also a

    discussion

    of scalability

    issues and

    factors

    to

    consider when

    selecting a

    clus-

    tering algorithm.

    The last chapter, Chapter 10,

    is on anomaly

    detection.

    After

    some

    basic

    definitions,

    several

    different types

    of anomaly

    detection are

    considered:

    sta-

    tistical,

    distance-based,

    density-based,

    and

    clustering-based.

    Appendices

    A

    through

    E

    give

    a brief

    review

    of

    important topics

    that are

    used

    in

    portions

    of

    the

    book:

    linear

    algebra,

    dimensionality

    reduction, statistics,

    regression,

    and

    optimization.

    The subject of

    data

    mining, while relatively

    young

    compared

    to statistics

    or machine

    learning,

    is already

    too large to cover

    in a single

    book. Selected

    references

    to topics

    that are only

    briefly covered,

    such as

    data

    quality' are

    provided

    in the bibliographic

    notes of the appropriate

    chapter. References

    o

    topics not covered n this book, such as data mining for streams and privacy-

    preserving

    data mining, are

    provided

    in

    the bibliographic

    notes

    of this chapter.

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    35/790

    Bibliographic Notes

    13

    1.6

    Bibliographic

    Notes

    The

    topic

    of data

    mining has

    inspired many

    textbooks. Introductory

    text-

    books include

    those by Dunham

    [10],

    Han

    and Kamber

    l2L),

    Hand

    et al.

    [23],

    and

    Roiger and

    Geatz

    [36].

    Data mining

    books with a

    stronger emphasis on

    businessapplications include the works by Berry and Linoff [2], Pyle [34],and

    Parr

    Rud

    [33].

    Books with

    an emphasis

    on statistical learning include

    those

    by

    Cherkassky

    and Mulier

    [6],

    and Hastie

    et al.

    124].

    Some books with an

    emphasis

    on machine

    learning

    or

    pattern

    recognition

    are those by Duda

    et

    al.

    [9],

    Kantardzic

    [25],

    Mitchell

    [31],

    Webb

    [41],

    and

    Witten and

    F]ank

    [42].

    There

    are

    also some more

    specialized

    books:

    Chakrabarti

    [a]

    (web

    mining),

    Fayyad

    et al.

    [13]

    (collection

    of early

    articles on data mining), Fayyad

    et al.

    111]

    visualization),

    Grossman et al.

    [18]

    (science

    and engineering), Kargupta

    and

    Chan

    [26]

    (distributed

    data mining),

    Wang

    et al.

    [a0]

    (bioinformatics),

    and Zaki

    and Ho

    [44]

    (parallel

    data mining).

    There are several conferences elated to data mining. Some of the main

    conferences

    dedicated

    to this field include

    the ACM

    SIGKDD

    International

    Conferenceon Knowledge

    Discovery

    and

    Data

    Mining

    (KDD),

    the IEEE In-

    ternational

    Conferenceon Data

    Mining

    (ICDM),

    the SIAM

    International

    Con-

    ference

    on Data

    Mining

    (SDM),

    the

    European

    Conference

    on Principles

    and

    Practice

    of Knowledge

    Discovery

    in Databases (PKDD),

    and the Pacific-Asia

    Conference

    on Knowledge

    Discovery

    and Data Mining

    (PAKDD).

    Data min-

    ing

    papers

    can

    also be found in

    other major

    conferencessuch as the ACM

    SIGMOD/PODS

    conference,

    he International

    Conferenceon Very Large Data

    Bases (VLDB),

    the

    Conferenceon Information

    and

    Knowledge Management

    (CIKM),

    the International

    Conferenceon Data Engineering

    (ICDE),

    the

    In-

    ternational

    Conferenceon Machine

    Learning

    (ICML),

    and the National

    Con-

    ference

    on Artificial Intelligence

    (AAAI).

    Journal

    publications

    on data mining inc lude

    IEEE Transact'ions n Knowl-

    edge

    and Data Engi,neering,

    Data

    Mi,ning and Knowledge Discouery,

    Knowl-

    edge

    and Information

    Systems,

    Intelli,gent

    Data Analysi,s, nforrnati,on

    Sys-

    tems,

    and lhe

    Journal of Intelligent

    Informati,on

    Systems.

    There

    have

    been a number

    of

    general

    articles on data

    mining

    that define he

    field

    or its relationship

    to

    other

    fields,

    particularly

    statistics.

    Fayyad

    et al.

    [12]

    describe

    data mining

    and how it

    fits into the total knowledgediscovery

    process.

    Chen et

    al.

    [5]

    give

    a database

    perspective

    on data mining.

    Ramakrishnan

    and

    Grama

    [35]

    provide

    a

    general

    discussion

    of data

    mining

    and

    present

    several

    viewpoints.

    Hand

    [22]

    describeshow

    data mining differs from

    statistics, as does

    Fliedman

    lf

    4].

    Lambert

    [29]

    explores

    he useof statistics for large data

    sets and

    provides

    some

    comments

    on the

    respective

    roles of data mining

    and statistics.

    1_.6

  • 7/23/2019 Introduction to data mining - Steinbach.pdf

    36/790

    L4 Chapter

    1 Introduction

    Glymour

    et al.

    116]

    consider

    the lessons hat

    statistics

    may

    have

    for

    data

    mining. Smyth

    et

    aL

    [38]

    describe

    how the evolution

    of

    data

    mining

    is

    being

    driven by

    new types

    of data and

    applications,

    such as those

    involving

    streams,

    graphs,

    and text.

    Emerging applications

    in data

    mining are considered

    by

    Han

    et

    al.

    [20]

    and Smyth

    [37]

    describes

    some

    research

    challenges

    n

    data

    mining.

    A discussionof how developments n data mining researchcan be turned into

    practical

    tools

    is

    given

    by

    Wu

    et al.

    [43].

    Data

    mining standards

    are

    the