
CS/EngMt/CpEng 404

Data Mining &

Knowledge Discovery

Dan St. Clair

Lect 1 – Intro. To Data Mining & Data Warehouses

2002 by D. C. St. Clair CS 404 Data Mining & Knowledge Discovery 3

Information Age Produces Large Amounts of Data

• Data collected on almost everything

• WWW rich data resource

• Data warehouses required to hold data


The problem:

How do we turn information into useful knowledge?

Solution:

Data mining & knowledge discovery


Data Mining & Knowledge Discovery

This class provides

• Tools & techniques for producing useful knowledge from information

• Experience in using these tools


Data Mining & Knowledge Discovery in CS 404

• We will study
– Data warehouses
– Classification & association rule miners (C4.5)
– Neural networks (BP, SOM)
– Classical tools
  • Correlation
  • Regression
  • Clustering

• We will do several projects requiring mining knowledge from “real” data


CS 404 Class Information

Prerequisites:

CS 347 (Artificial Intelligence) or CS 304 (Database Systems)

and Stat 215

Texts:
• Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
• Quinlan, J., C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.


CS 404 Class Information

Reference: (This or a similar Matlab reference is recommended.)

Hanselman, D. and Littlefield, B., Mastering Matlab 6: A Comprehensive Tutorial and Reference, Prentice Hall, 2001.

Software:
• C4.5 – provided to class w/o charge
• Matlab – can purchase from Mathworks or can log in to UMR
• Microsoft Excel (provided on UMR CLC computers)


Who am I?

• Professor and Chair UMR Computer Science Dept.

• Research areas -- data mining, machine intelligence, neural networks
– diagnostics
– pattern recognition & analysis
– intelligent graphics
– system monitoring & assessment
– data mining

• “Applied” experience
– Union Pacific Technologies Intelligent Systems Advisor

– Visiting Principal Scientist McDonnell Douglas Research Laboratories

– NASA’s Johnson Space Center

– Defense: Navy, Army, and Air Force

– Co-founder & former Chief Scientist of intelligent software systems company


Even More CS 404 Class Information

Han, one of the authors of the data mining text, has a web page at:

www.cs.sfu.ca/~han/DM_Book.html

which contains several interesting things, including:

1. A list of errata for the data mining book

2. A set of slides he uses in the data mining course he teaches. [I will be using some of these slides in my lectures.]

You may want to check these out.

Topics to Be Covered in Lecture 1
Intro. to Data Mining & Knowledge Discovery

• Intro. to CS 404 (we just finished this)
• What is Data Mining & KD?
• Data sources
• Data mining tasks
• Data warehousing (Ch. 2)
• Multidimensional data models & schema

Topics to Be Covered in Lecture 1
Intro. to Data Mining & Knowledge Discovery

• Intro. to CS 404
• What is Data Mining & KD?
• Data sources
• Data mining tasks
• Data warehousing (Ch. 2)
• Multidimensional data models & schema


Data -- Information -- Knowledge

The set of values:

12345 1000.00 SA 67890 2846.92 CK

has no meaning. It is data but it is NOT information.

Information: Information is the result of organizing data into meaningful quantities.

The following relational table helps turn the data into information since it associates meaning with the data:

Account Number   Balance   Type
12345            1000.00   SA
67890            2846.92   CK

A database is a “structured” collection of data stored and operated on within a management environment known as a Database Management System (DBMS) or database system. The DBMS helps to transform data into information.

Knowledge can be created from information.
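A minimal Python sketch of the data-to-information step above. The raw values and the three column meanings come from the slide; the field names `account_number`, `balance`, and `type` and the helper `to_records` are illustrative assumptions, not part of the course material.

```python
# Raw values from the slide: a flat sequence with no attached meaning.
raw = ["12345", "1000.00", "SA", "67890", "2846.92", "CK"]

# Column meanings from the slide's relational table (names are assumed).
columns = ("account_number", "balance", "type")

def to_records(values, names):
    """Group a flat list of values into dictionaries keyed by column name."""
    width = len(names)
    return [dict(zip(names, values[i:i + width]))
            for i in range(0, len(values), width)]

accounts = to_records(raw, columns)

# Once each value carries a column name, questions about the data can be asked.
total_balance = sum(float(row["balance"]) for row in accounts)

print(accounts)
print(f"total balance: {total_balance:.2f}")
```

The same six values are present before and after; only the association of each value with a named column changes.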


What Is Data Mining?
How Does It Differ From Existing Database Technologies?

Data Sources: Databases, data warehouses, Internet

Decision Support Systems
Tools for asking questions & doing analyses when you know what you want to ask and where you are going. (Ex. OLAP tools)

Data Mining
Process of discovering knowledge (meaningful new correlations, patterns, and trends) in data by sifting through large amounts of data (100M-10G) using pattern recognition as well as statistical and mathematical techniques.


Other Names Used in Conjunction With Data Mining

• Knowledge discovery (mining) in databases (KDD)
• Knowledge extraction
• Data/pattern analysis
• Data archeology
• Data dredging
• Information harvesting
• What is not data mining
– (Deductive) query processing
– Expert systems or small ML/statistical programs

Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.


Data Mining

Example

Potential-Customer*
Person        Age   Sex   Income      Customer
Ann Smith     32    F     10,000      yes
Joan Gray     53    F     1,000,000   yes
Mary Blythe   27    F     20,000      no
Jane Brown    55    F     20,000      yes
Bob Smith     30    M     100,000     yes
Jack Brown    50    M     200,000     yes

Married-To
Husband      Wife
Bob Smith    Ann Smith
Jack Brown   Jane Brown

Knowledge Within A Relation

IF Income(Person) ≥ 100,000 THEN Potential-Customer(Person)

IF Sex(Person) = F AND Age(Person) ≥ 32 THEN Potential-Customer(Person)

Knowledge From Multiple Relations

IF Married-To(Person,Spouse) AND Income(Person) ≥ 100,000 THEN Potential-Customer(Spouse)

IF Married-To(Person,Spouse) AND Potential-Customer(Person) THEN Potential-Customer(Spouse)

* Dzeroski, Saso, Inductive Logic Programming and Knowledge Discovery in Databases, Advances in Knowledge Discovery and Data Mining, Ed. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy, AAAI Press, 1996, pp. 117-152.
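The rules on this slide can be checked against the two relations with a short Python sketch. The table rows are copied from the slide. Reading each comparison as "greater than or equal to" is an assumption (it is the reading that agrees with every row of the table), and the function and variable names are illustrative only.

```python
# Rows copied from the Potential-Customer table on the slide.
potential_customer = [
    {"person": "Ann Smith",   "age": 32, "sex": "F", "income": 10_000,    "customer": True},
    {"person": "Joan Gray",   "age": 53, "sex": "F", "income": 1_000_000, "customer": True},
    {"person": "Mary Blythe", "age": 27, "sex": "F", "income": 20_000,    "customer": False},
    {"person": "Jane Brown",  "age": 55, "sex": "F", "income": 20_000,    "customer": True},
    {"person": "Bob Smith",   "age": 30, "sex": "M", "income": 100_000,   "customer": True},
    {"person": "Jack Brown",  "age": 50, "sex": "M", "income": 200_000,   "customer": True},
]
# Married-To(Husband, Wife), read here as Married-To(Person, Spouse).
married_to = [("Bob Smith", "Ann Smith"), ("Jack Brown", "Jane Brown")]

by_name = {row["person"]: row for row in potential_customer}
actual = {name for name, row in by_name.items() if row["customer"]}

# Knowledge within a relation (comparison operators assumed to be >=).
def rule_income(row):
    return row["income"] >= 100_000

def rule_sex_age(row):
    return row["sex"] == "F" and row["age"] >= 32

within_relation = {row["person"] for row in potential_customer
                   if rule_income(row) or rule_sex_age(row)}

# Knowledge from multiple relations: both rules need the Married-To relation.
def rule_spouse_income():
    return {spouse for person, spouse in married_to
            if by_name[person]["income"] >= 100_000}

def rule_spouse_customer():
    return {spouse for person, spouse in married_to
            if by_name[person]["customer"]}

across_relations = rule_spouse_income() | rule_spouse_customer()

print(sorted(within_relation))
print(sorted(across_relations))
```

On this tiny table the two single-relation rules together flag exactly the people marked "yes", while the multi-relation rules reach Ann Smith and Jane Brown only through their spouses.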


Simple Concept Learning -- Example

“Routine”, “well-understood” chemistry experiment performed numerous times.

• Expected result occurred about half the time
• Unexpected result occurred remainder of the time

Numerous repetitions of experiment produced similar results

Careful analysis determined:

• One result produced when setup was in sunlight

• Second result produced when setup was in shade

Careful investigation showed:

Experiment sensitive to ultraviolet radiation

Result:

Patented method for determining presence of ultraviolet radiation


The Knowledge Discovery Process

Data Sources -> Selection -> Target Data -> Preprocessing -> Preprocessed Data -> Transformation -> Transformed Data -> Data Mining -> Patterns / Models -> Interpretation/Evaluation -> Knowledge

Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.

Topics to Be Covered in Lecture 1
Intro. to Data Mining & Knowledge Discovery

• Intro. to CS 404
• What is Data Mining & KD?
• Data sources
• Data mining tasks
• Data warehousing (Ch. 2)
• Multidimensional data models & schema


Data Sources

• Relational Databases
• Data Warehouses
• WWW
• Audio
• Video
• Printed Materials
• ...


Relational Databases


Multidimensional Data Cube

Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000

Topics to Be Covered in Lecture 1
Intro. to Data Mining & Knowledge Discovery

• Intro. to CS 404
• What is Data Mining & KD?
• Data sources
• Data mining tasks
• Data warehousing (Ch. 2)
• Multidimensional data models & schema


Data Mining Tasks

• Predictive
– Perform inference on current data

• Descriptive (KDD)
– Characterize general properties of data

Notes:
– A measure of certainty or “belief” must be associated with each pattern
– “Interesting” patterns must be identified


Kinds of Data Patterns to Be “Mined”

• Concept/class description

• Association analyses

• Classification & prediction

• Cluster analysis

• Outlier analysis


Concept/class Descriptions

Example 1

Produce a description summarizing characteristics of customers who purchase diapers

• Objective: produce a description of those in the target class
• Characterizes class/concept

Example 2

What properties distinguish diaper buyers from other store customers?

• Discriminates class/concept
• Leads to other questions
– What else do they buy?
– When do they purchase these items?


Association Analysis

Assoc. Anal. -- discovery of association relationships between attribute-value conditions.

Such relationships may be expressed in many ways. One common way is through association rules.

A1 ^ ... ^ Am => B1 ^ ... ^ Bn, written X => Y


Association Rules

Example

age (X, “20..29”) ^ income (X, “20K..29K”) => buys (X, “CD changer”)

[support = 2%, confidence = 60%]

Support: % of data instances satisfying all three components of the rule

Confidence: % of the data instances satisfying the hypothesis (the left-hand side) for which the conclusion is also satisfied, i.e., predicted correctly
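Support and confidence for a rule of this form can be computed as in the sketch below. The ten customer rows are hypothetical, invented only so the arithmetic is easy to follow (they do not reproduce the 2% and 60% on the slide), and the helper name `support_and_confidence` is an assumption.

```python
def support_and_confidence(rows, antecedent, consequent):
    """Support = share of all rows matching both sides of the rule;
    confidence = share of antecedent-matching rows that also match the consequent."""
    matches_if = [r for r in rows if antecedent(r)]
    matches_both = [r for r in matches_if if consequent(r)]
    support = len(matches_both) / len(rows) if rows else 0.0
    confidence = len(matches_both) / len(matches_if) if matches_if else 0.0
    return support, confidence

# Hypothetical data: (age, income in dollars, bought a CD changer?)
customers = [
    {"age": 22, "income": 25_000, "cd_changer": True},
    {"age": 27, "income": 21_000, "cd_changer": True},
    {"age": 29, "income": 29_000, "cd_changer": True},
    {"age": 24, "income": 20_000, "cd_changer": False},
    {"age": 20, "income": 28_000, "cd_changer": False},
    {"age": 35, "income": 25_000, "cd_changer": True},
    {"age": 25, "income": 40_000, "cd_changer": False},
    {"age": 45, "income": 60_000, "cd_changer": False},
    {"age": 31, "income": 22_000, "cd_changer": False},
    {"age": 23, "income": 15_000, "cd_changer": True},
]

# age(X, "20..29") ^ income(X, "20K..29K") => buys(X, "CD changer")
def lhs(r):
    return 20 <= r["age"] <= 29 and 20_000 <= r["income"] <= 29_999

def rhs(r):
    return r["cd_changer"]

support, confidence = support_and_confidence(customers, lhs, rhs)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```

Five of the ten hypothetical rows satisfy the left-hand side and three of those five also bought a CD changer, so the denominators differ: support divides by all rows, confidence only by the rows the rule applies to.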


Classification & Prediction

[Figure: scatter plot with axes Income and Debt; data points are marked x and o, with a fitted line labeled “Regression Line”]

Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.
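A regression line like the one in this figure can be fitted with ordinary least squares. The sketch below is a plain-Python illustration; the income and debt values are hypothetical (the figure's actual data points are not available), and `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical values, both in thousands of dollars.
income = [20, 40, 60, 80]
debt = [5, 9, 13, 17]

slope, intercept = fit_line(income, debt)

# Prediction: estimate debt for an income not in the data.
predicted_debt = slope * 50 + intercept
print(f"debt = {slope:.2f} * income + {intercept:.2f}")
print(f"predicted debt at income 50: {predicted_debt:.2f}")
```

The fitted line is then used for prediction: a new income value is plugged into the line to estimate the corresponding debt.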


Classification (nonlinear)

[Figure: scatter plot with axes Income and Debt; points marked x and o, with regions labeled “Loan” and “No Loan”]

Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.


Cluster Analysis

[Figure: scatter plot with axes Income and Debt; all points are marked +]

Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.


Some Major Data Mining Issues

• Mining methodologies

• User interaction

• Performance (accuracy, robustness)

• Heterogeneous databases

• Interestingness


Topics to Be Covered in Lecture 1
Intro. to Data Mining & Knowledge Discovery

• Intro. to CS 404
• What is Data Mining & KD?
• Data sources
• Data mining tasks
• Data warehousing (Ch. 2)
• Multidimensional data models & schema


The Knowledge Discovery Process

Data Sources -> Selection -> Target Data -> Preprocessing -> Preprocessed Data -> Transformation -> Transformed Data -> Data Mining -> Patterns / Models -> Interpretation/Evaluation -> Knowledge

We’ll start here!

Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.


Chapter 2: Data Warehousing and OLAP Technology for Data Mining

• What is a data warehouse?

• A multi-dimensional data model

• Data warehouse architecture

• Data warehouse implementation

• From data warehousing to data mining


What Is a Data Warehouse?

DWs provide architectures and tools to support the systematic
– organization,
– understanding, and
– use of data.

Note: DWs may consist of data from numerous sources including business, scientific, as well as engineering data.


Features of a Data Warehouse

• Subject-oriented -- organized around major subjects
• Integrated -- integrates multiple heterogeneous data sources
– Relational databases
– Flat files
– On-line transaction records
• Consistency is enforced
• Time-variant -- data stored to provide historical data
• Nonvolatile
– Physically separate from operational environment
– Operations on data: initial loading & retrieval


OLTP vs. OLAP

                     OLTP                                  OLAP
users                clerk, IT professional                knowledge worker
function             day to day operations                 decision support
DB design            application-oriented                  subject-oriented
data                 current, up-to-date, detailed,        historical, summarized, multidimensional,
                     flat relational, isolated             integrated, consolidated
usage                repetitive                            ad-hoc
access               read/write, index/hash on prim. key   lots of scans
unit of work         short, simple transaction             complex query
# records accessed   tens                                  millions
# users              thousands                             hundreds
DB size              100MB-GB                              100GB-TB
metric               transaction throughput                query throughput, response

Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

Topics to Be Covered in Lecture 1
Intro. to Data Mining & Knowledge Discovery

• Intro. to CS 404
• What is Data Mining & KD?
• Data sources
• Data mining tasks
• Data warehousing (Ch. 2)
• Multidimensional data models & schema


Multidimensional Data Models

All figure references in this lecture are to the text: Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

Figure 2.1 3-D data cube AllElectronics sales data


4-D Data Cube of AllElectronics Sales Data


Figure 2.2 4-D data cube AllElectronics sales data


Fig. 2.3 A Lattice of Cuboids

0-D (apex) cuboid: all

1-D cuboids: time; item; location; supplier

2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier

3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier

4-D (base) cuboid: time, item, location, supplier
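The lattice in Fig. 2.3 is the set of all subsets of the four dimensions, grouped by size. A short Python sketch (the function name `cuboid_lattice` is an illustrative assumption) enumerates it:

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Map each level k to the list of k-dimension cuboids (subsets of size k)."""
    return {k: list(combinations(dimensions, k))
            for k in range(len(dimensions) + 1)}

dims = ("time", "item", "location", "supplier")
lattice = cuboid_lattice(dims)

for level, cuboids in lattice.items():
    # The empty subset is the apex cuboid, shown as "all" in the figure.
    names = [",".join(c) if c else "all" for c in cuboids]
    print(f"{level}-D: {names}")

total_cuboids = sum(len(cuboids) for cuboids in lattice.values())
print("total cuboids:", total_cuboids)
```

With n dimensions and no concept hierarchies there are 2^n cuboids, so the four dimensions here give 16, split 1, 4, 6, 4, 1 across the five levels.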


Conceptual Modeling of Data Warehouses

• Modeling data warehouses: dimensions & measures
– Star schema: A fact table in the middle connected to a set of dimension tables
– Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
– Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.


Fig. 2.4 Example of Star Schema

Sales Fact Table
  Keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales

Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_state, country

Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.


Fig. 2.5 Example of Snowflake Schema

Sales Fact Table
  Keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales

Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_key
    supplier: supplier_key, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city_key
    city: city_key, city, province_or_state, country

Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.


Fig 2.6 Example of Fact Constellation

Sales Fact Table
  Keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table
  Keys: time_key, item_key, shipper_key, from_location, to_location
  Measures: dollars_cost, units_shipped

Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_state, country
  shipper: shipper_key, shipper_name, location_key, shipper_type

Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.


A Data Mining Query Language, DMQL: Language Primitives

• Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]: <measure_list>

• Dimension Definition (Dimension Table)
define dimension <dimension_name> as (<attribute_or_subdimension_list>)

• Special Case (Shared Dimension Tables)
– First time as “cube definition”
– define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>


Defining a Star Schema in DMQL

define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city, province_or_state, country)
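DMQL is not something that can be run directly here, so the sketch below approximates the sales_star definition with Python's built-in sqlite3 module. It is an illustration under stated assumptions, not the textbook's implementation: only the time and item dimensions are included to keep it short, the table names `time_dim`, `item_dim`, and `sales_fact` are invented, and every data row is hypothetical. The three measures follow the DMQL definition (sum, avg, and count(*) over sales_in_dollars).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One fact table referencing two dimension tables (a reduced star schema).
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, day_of_week TEXT,
                       month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE sales_fact (time_key INTEGER REFERENCES time_dim(time_key),
                         item_key INTEGER REFERENCES item_dim(item_key),
                         sales_in_dollars REAL);
""")

# Hypothetical rows, for illustration only.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 5, "Mon", 1, "Q1", 2002),
    (2, 9, "Tue", 4, "Q2", 2002),
])
conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?, ?, ?)", [
    (1, "TV-A", "BrandX", "TV", "wholesale"),
    (2, "PC-B", "BrandY", "computer", "retail"),
])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", [
    (1, 1, 400.0), (1, 1, 600.0), (1, 2, 1000.0), (2, 1, 500.0),
])

# The cube's measures, aggregated over the (quarter, item type) cuboid.
rows = conn.execute("""
    SELECT t.quarter, i.type,
           SUM(f.sales_in_dollars) AS dollars_sold,
           AVG(f.sales_in_dollars) AS avg_sales,
           COUNT(*)                AS units_sold
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN item_dim AS i ON i.item_key = f.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()

for row in rows:
    print(row)
```

Grouping by different dimension attributes (for example year alone, or brand and quarter) yields other cuboids of the same cube; the fact table and its joins stay the same.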


Program Completed

University of Missouri-Rolla

Copyright 2001 Curators of University of Missouri