CS/EngMt/CpEng 404
Data Mining &
Knowledge Discovery
Dan St. Clair
Lect 1 – Intro. To Data Mining & Data Warehouses
2002 by D. C. St. Clair CS 404 Data Mining & Knowledge Discovery 3
Information Age Produces Large Amounts of Data
• Data collected on almost everything
• WWW rich data resource
• Data warehouses required to hold data
The problem:
How do we turn information into useful knowledge?
Solution:
Data mining & knowledge discovery
Data Mining & Knowledge Discovery
This class provides
• Tools & techniques for producing useful knowledge from information
• Experience in using these tools
Data Mining & Knowledge Discovery in CS 404
• We will study
– Data warehouses
– Classification & association rule miners (C4.5)
– Neural networks (BP, SOM)
– Classical tools
  • Correlation
  • Regression
  • Clustering
• We will do several projects requiring mining knowledge from “real” data
CS 404 Class Information
Prerequisites:
CS 347 (Artificial Intelligence) or CS 304 (Database Systems)
and Stat 215
Texts:
• Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
• Quinlan, J., C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
CS 404 Class Information
Reference: (This or a similar Matlab reference is recommended.)
Hanselman, D. and Littlefield, B., Mastering Matlab 6: A Comprehensive Tutorial and Reference, Prentice Hall, 2001.
Software:
• C4.5 – provided to class w/o charge
• Matlab – can purchase from Mathworks or can log in to UMR
• Microsoft Excel (provided on UMR CLC computers)
CS 404 Class Information (Cont'd)
Instructor: D.C. St. Clair, Ph.D.
325 Computer Science
Phone: (573) 341-6352
Fax: (573) 341-4501
e-mail: [email protected]
Class web page:
www.umr.edu/~stclair or http://web.umr.edu/~stclair/class/classfiles/cs404_fs02/
Things you will find on the class web page:
• Syllabus
• Schedule
• Homework assignments
• Lecture notes
Who am I?
• Professor and Chair, UMR Computer Science Dept.
• Research area -- Data mining, machine intelligence, neural networks
– diagnostics
– pattern recognition & analysis
– intelligent graphics
– system monitoring & assessment
– data mining
• “Applied” experience
– Union Pacific Technologies Intelligent Systems Advisor
– Visiting Principal Scientist, McDonnell Douglas Research Laboratories
– NASA’s Johnson Space Center
– Defense: Navy, Army, and Air Force
– Co-founder & former Chief Scientist of intelligent software systems company
Even More CS 404 Class Information
Han, one of the authors of the data mining text, has a web page at:
www.cs.sfu.ca/~han/DM_Book.html
which contains several interesting things including:
1. A list of errata for the data mining book
2. A set of slides he uses in the data mining course he teaches. [I will be using some of these slides in my lectures.]
You may want to check these out.
Topics to Be Covered in Lecture 1
Intro. to Data Mining & Knowledge Discovery
• Intro. to CS 404 (We just finished this.)
• What is Data Mining & KD?
• Data sources
• Data mining tasks
• Data warehousing (Ch. 2)
• Multidimensional data models & schema
Data -- Information -- Knowledge
The set of values:
12345 1000.00 SA
67890 2846.92 CK
has no meaning. It is data but it is NOT information.
Information: Information is the result of organizing data into meaningful quantities.
The following relational table helps turn the data into information since it associates meaning with the data:
Account Number   Balance   Type
12345            1000.00   SA
67890            2846.92   CK
A database is a “structured” collection of data stored and operated on within a management environment known as a Database Management System (DBMS) or database system. The DBMS helps to transform data into information.
Knowledge can be created from information.
What Is Data Mining? How Does It Differ From Existing Database Technologies?
Data Sources: Databases, data warehouses, Internet
Decision Support Systems: Tools for asking questions & doing analyses when you know what you want to ask and where you are going. (Ex. OLAP tools)
Data Mining: Process of discovering knowledge (meaningful new correlations, patterns, and trends) in data by sifting through large amounts of data (100M-10G) using pattern recognition as well as statistical and mathematical techniques.
Other Names Used in Conjunction With Data Mining
• Knowledge discovery (mining) in databases (KDD)
• Knowledge extraction
• Data/pattern analysis
• Data archeology
• Data dredging
• Information harvesting
• What is not data mining
– (Deductive) query processing
– Expert systems or small ML/statistical programs
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
Data Mining Example

Potential-Customer*
Person        Age   Sex   Income      Customer
Ann Smith     32    F     10,000      yes
Joan Gray     53    F     1,000,000   yes
Mary Blythe   27    F     20,000      no
Jane Brown    55    F     20,000      yes
Bob Smith     30    M     100,000     yes
Jack Brown    50    M     200,000     yes

Married-To
Husband      Wife
Bob Smith    Ann Smith
Jack Brown   Jane Brown

Knowledge Within A Relation
IF Income(Person) ≥ 100,000 THEN Potential-Customer(Person)
IF Sex(Person) = F AND Age(Person) ≥ 32 THEN Potential-Customer(Person)

Knowledge From Multiple Relations
IF Married-To(Person,Spouse) AND Income(Person) ≥ 100,000 THEN Potential-Customer(Spouse)
IF Married-To(Person,Spouse) AND Potential-Customer(Person) THEN Potential-Customer(Spouse)

* Dzeroski, Saso, Inductive Logic Programming and Knowledge Discovery in Databases, Advances in Knowledge Discovery and Data Mining, Ed. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy, AAAI Press, 1996, pp. 117-152.
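As a rough illustration, the rules on this slide can be checked against the two relations in a few lines of Python. This is only a sketch under stated assumptions: the comparison operators are read as "greater than or equal" (the only reading consistent with the Customer column), Married-To is treated as symmetric, and the function names `rule_income`, `rule_sex_age`, `rule_spouse_income`, and `predict` are made up for the example.

```python
# The two relations from the slide, as plain Python data.
potential_customer = [
    {"person": "Ann Smith", "age": 32, "sex": "F", "income": 10_000, "customer": True},
    {"person": "Joan Gray", "age": 53, "sex": "F", "income": 1_000_000, "customer": True},
    {"person": "Mary Blythe", "age": 27, "sex": "F", "income": 20_000, "customer": False},
    {"person": "Jane Brown", "age": 55, "sex": "F", "income": 20_000, "customer": True},
    {"person": "Bob Smith", "age": 30, "sex": "M", "income": 100_000, "customer": True},
    {"person": "Jack Brown", "age": 50, "sex": "M", "income": 200_000, "customer": True},
]
married_to = [("Bob Smith", "Ann Smith"), ("Jack Brown", "Jane Brown")]  # (husband, wife)

by_name = {p["person"]: p for p in potential_customer}


def rule_income(p):
    # Knowledge within a relation: income threshold (">=" is an assumption).
    return p["income"] >= 100_000


def rule_sex_age(p):
    # Knowledge within a relation: sex and age (">=" is an assumption).
    return p["sex"] == "F" and p["age"] >= 32


def rule_spouse_income(name):
    # Knowledge from multiple relations: the spouse's income qualifies the person.
    for husband, wife in married_to:
        if wife == name and by_name[husband]["income"] >= 100_000:
            return True
        if husband == name and by_name[wife]["income"] >= 100_000:
            return True
    return False


def predict(p):
    return rule_income(p) or rule_sex_age(p) or rule_spouse_income(p["person"])


predictions = {p["person"]: predict(p) for p in potential_customer}
for p in potential_customer:
    print(p["person"], "predicted:", predictions[p["person"]], "actual:", p["customer"])
```

On this six-row table the rules reproduce the Customer column exactly, which says nothing about how they would generalize to new people.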
Simple Concept Learning -- Example
“Routine”, “well-understood” chemistry experiment performed numerous times.
• Expected result occurred about half the time
• Unexpected result occurred remainder of the time
Numerous repetitions of experiment produced similar results
Careful analysis determined:
• One result produced when setup was in sunlight
• Second result produced when setup was in shade
Careful investigation showed:
Experiment sensitive to ultraviolet radiation
Result:
Patented method for determining presence of ultraviolet radiation
The Knowledge Discovery Process
[Figure: Data Sources -> (Selection) -> Target Data -> (Preprocessing) -> Preprocessed Data -> (Transformation) -> Transformed Data -> (Data Mining) -> Patterns/Models -> (Interpretation/Evaluation) -> Knowledge]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.
Data Sources
• Relational Databases
• Data Warehouses
• WWW
• Audio
• Video
• Printed Materials
• ...
Multidimensional Data Cube
[Figure: multidimensional data cube]
Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000
Data Mining Tasks
• Predictive
– Perform inference on current data
• Descriptive (KDD)
– Characterize general properties of data
Notes:
– A measure of certainty or “belief” must be associated with each pattern
– “Interesting” patterns must be identified
Kinds of Data Patterns to Be “Mined”
• Concept/class description
• Association analyses
• Classification & prediction
• Cluster analysis
• Outlier analysis
Concept/class Descriptions
Example 1
Produce a description summarizing characteristics of customers who purchase diapers
• Objective: produce a description of those in the target class
• Characterizes class/concept
Example 2
What properties distinguish diaper buyers from other store customers?
• Discriminates class/concept
• Leads to other questions
– What else do they buy?
– When do they purchase these items?
Association Analysis
Assoc. Anal. -- discovery of association relationships between attribute-value conditions.
Such relationships may be expressed in many ways. One common way is through association rules.
$A_1 \wedge \dots \wedge A_m \Rightarrow B_1 \wedge \dots \wedge B_n$, written compactly as $X \Rightarrow Y$
Association Rules
Example
age (X, “20 .. 29”) ^ income (X, “20K..29K”) => buys (X, “CD changer”)
[support = 2%, confidence = 60%]
support: % of data instances satisfying all three components of rule
confidence: % of data instances satisfying the hypothesis for which the conclusion is also predicted correctly
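To make the two measures concrete, here is a minimal sketch that computes support and confidence for the age/income rule over a tiny, entirely hypothetical set of customer records. The data, the tuple layout, and the helper name `antecedent` are assumptions made for illustration; the slide's 2% and 60% figures are not reproduced here.

```python
# Hypothetical records (not from the lecture):
# (age in years, income in thousands of dollars, bought a CD changer?)
records = [
    (23, 25, True),
    (27, 22, True),
    (29, 28, True),
    (21, 20, False),
    (25, 29, False),
    (35, 25, True),
    (24, 45, False),
    (52, 60, False),
    (41, 31, True),
    (19, 12, False),
]


def antecedent(age, income_k):
    # Hypothesis of the rule: age in 20..29 AND income in 20K..29K.
    return 20 <= age <= 29 and 20 <= income_k <= 29


n_total = len(records)
n_antecedent = sum(1 for age, inc, _ in records if antecedent(age, inc))
n_rule = sum(1 for age, inc, buys in records if antecedent(age, inc) and buys)

# support: fraction of ALL instances satisfying hypothesis and conclusion.
support = n_rule / n_total
# confidence: fraction of instances satisfying the hypothesis that also
# satisfy the conclusion.
confidence = n_rule / n_antecedent

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```

Note the different denominators: support divides by all records, while confidence divides only by the records that satisfy the hypothesis.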
Classification & Prediction
[Figure: scatter plot with axes Income and Debt; data points marked x and o, with a regression line drawn through them]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.
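The slide does not say how the regression line is obtained. One standard choice is an ordinary least-squares fit, sketched below on hypothetical income/debt values (in thousands). The function name `fit_line` and the data are assumptions for illustration, not taken from the figure.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample covariance and variance, up to a common factor that cancels.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: income and debt in thousands of dollars.
income = [20, 30, 40, 50, 60]
debt = [12, 15, 21, 27, 30]

slope, intercept = fit_line(income, debt)
predicted_debt = slope * 45 + intercept  # prediction for an unseen income

print(f"debt = {slope:.2f} * income + {intercept:.2f}")
print(f"predicted debt at income 45: {predicted_debt:.1f}")
```

The fitted line can then be used for prediction: plugging in a new income value yields an estimated debt.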
Classification (nonlinear)
[Figure: scatter plot with axes Income and Debt; data points marked x and o, separated by a nonlinear boundary into Loan and No Loan regions]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.
Cluster Analysis
[Figure: scatter plot with axes Income and Debt; unlabeled data points marked +]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.
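The slide does not name a particular clustering algorithm. As one common example, the sketch below runs a bare-bones k-means on hypothetical (income, debt) points: no class labels are given, and the groups emerge from distances alone. The function name `kmeans`, the data, and the choice of initial centroids are all assumptions for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Very small k-means: alternate assignment and centroid update."""
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical unlabeled (income, debt) points, in thousands.
points = [(20, 5), (25, 8), (22, 6), (70, 40), (75, 45), (72, 42)]

# Initial centroids picked by hand for reproducibility (an assumption).
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])

for c, cl in zip(centroids, clusters):
    print("centroid", c, "members", cl)
```

In practice the number of clusters and the initial centroids strongly influence the result; here both are fixed by hand to keep the example deterministic.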
Some Major Data Mining Issues
• Mining methodologies
• User interaction
• Performance (accuracy, robustness)
• Heterogeneous databases
• Interestingness
The Knowledge Discovery Process
[Figure: the knowledge discovery process diagram shown earlier (Data Sources -> Target Data -> Preprocessed Data -> Transformed Data -> Patterns/Models -> Knowledge), annotated “We’ll start here!”]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.
Chapter 2: Data Warehousing and OLAP Technology for Data Mining
• What is a data warehouse?
• A multi-dimensional data model
• Data warehouse architecture
• Data warehouse implementation
• From data warehousing to data mining
What Is a Data Warehouse?
DWs provide architectures and tools to support the systematic
– organization,
– understanding, and
– use of data.
Note: DWs may consist of data from numerous sources, including business, scientific, and engineering data.
Features of a Data Warehouse
• Subject-oriented -- organized around major subjects
• Integrated -- integrates multiple heterogeneous data sources
– Relational databases
– Flat files
– On-line transaction records
• Consistency is enforced
• Time-variant -- data stored to provide historical data
• Nonvolatile
– Physically separate from operational environment
– Operations on data: initial loading & retrieval
OLTP vs. OLAP
                    OLTP                              OLAP
users               clerk, IT professional            knowledge worker
function            day to day operations             decision support
DB design           application-oriented              subject-oriented
data                current, up-to-date, detailed,    historical, summarized, multidimensional,
                    flat relational, isolated         integrated, consolidated
usage               repetitive                        ad-hoc
access              read/write,                       lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction         complex query
# records accessed  tens                              millions
# users             thousands                         hundreds
DB size             100MB-GB                          100GB-TB
metric              transaction throughput            query throughput, response
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
Multidimensional Data Models
All figure references in this lecture are to the text: Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
Figure 2.1 3-D data cube of AllElectronics sales data
4-D Data Cube of AllElectronics Sales Data
Figure 2.2 4-D data cube of AllElectronics sales data
Fig. 2.3 A Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
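The lattice is simply every subset of the four dimensions, grouped by size. A minimal sketch using Python's standard `itertools.combinations` enumerates it; the variable names `dimensions`, `lattice`, and `total` are illustrative only.

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# Level k of the lattice holds every k-dimensional cuboid, i.e. every
# way of choosing k of the dimensions to group by.
lattice = {
    k: list(combinations(dimensions, k))
    for k in range(len(dimensions) + 1)
}

for k, cuboids in lattice.items():
    # The empty combination is the apex cuboid, labeled "all" in the figure.
    names = [", ".join(c) if c else "all" for c in cuboids]
    print(f"{k}-D cuboids ({len(cuboids)}): {names}")

total = sum(len(cuboids) for cuboids in lattice.values())
print("total cuboids:", total)
```

With n dimensions (and no concept hierarchies) there are 2^n cuboids, so the four dimensions here give 16, which is why materializing every cuboid quickly becomes expensive as dimensions are added.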
Conceptual Modeling of Data Warehouses
• Modeling data warehouses: dimensions & measures
– Star schema: A fact table in the middle connected to a set of dimension tables
– Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
– Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
Fig. 2.4 Example of Star Schema
Sales Fact Table
  Keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_state, country
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
Fig. 2.5 Example of Snowflake Schema
Sales Fact Table
  Keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_key
    supplier: supplier_key, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city_key
    city: city_key, city, province_or_state, country
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
Fig 2.6 Example of Fact Constellation
Sales Fact Table
  Keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table
  Keys: time_key, item_key, shipper_key, from_location, to_location
  Measures: dollars_cost, units_shipped
Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_state, country
  shipper: shipper_key, shipper_name, location_key, shipper_type
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
A Data Mining Query Language, DMQL: Language Primitives
• Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]: <measure_list>
• Dimension Definition (Dimension Table)
define dimension <dimension_name> as (<attribute_or_subdimension_list>)
• Special Case (Shared Dimension Tables)
– First time as “cube definition”
– define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
Defining a Star Schema in DMQL
define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
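DMQL is a language proposed in the Han & Kamber text rather than something most systems execute directly. As a hedged illustration, the same star-schema idea can be tried with Python's built-in `sqlite3` module. The sketch below keeps only two of the four dimensions, and the table names `dim_time`, `dim_item`, and `sales_fact`, along with all sample rows, are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One fact table in the middle, referencing two dimension tables by key.
conn.executescript("""
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES dim_time(time_key),
    item_key INTEGER REFERENCES dim_item(item_key),
    sales_in_dollars REAL
);
""")

# Hypothetical sample rows.
conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                 [(1, "Q1", 2002), (2, "Q2", 2002)])
conn.executemany("INSERT INTO dim_item VALUES (?, ?, ?)",
                 [(10, "TV", "Acme"), (11, "CD changer", "Acme")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                 [(1, 10, 400.0), (1, 10, 600.0), (1, 11, 150.0), (2, 11, 250.0)])

# The three measures from the DMQL cube definition, computed for the
# (quarter, item) cuboid by joining the fact table to its dimensions.
rows = conn.execute("""
SELECT t.quarter, i.item_name,
       SUM(f.sales_in_dollars) AS dollars_sold,
       AVG(f.sales_in_dollars) AS avg_sales,
       COUNT(*)                AS units_sold
FROM sales_fact AS f
JOIN dim_time AS t ON t.time_key = f.time_key
JOIN dim_item AS i ON i.item_key = f.item_key
GROUP BY t.quarter, i.item_name
ORDER BY t.quarter, i.item_name
""").fetchall()

for row in rows:
    print(row)
```

Changing the `GROUP BY` columns selects a different cuboid of the same cube; grouping by nothing at all gives the apex ("all") cuboid.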