dbm630 lecture01

DBM630: Data Mining and Data Warehousing MS.IT. Rangsit University Lecture 1 Introduction to Data Mining and Data Warehousing 1 by Kritsada Sriphaew (sriphaew.k AT gmail.com) Text: Data Mining: Concepts and Techniques, By Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers (2006). ISBN: 978-1558609013 Semester 2/2011

Upload: aj-kritsada-sriphaew

Post on 24-May-2015

TRANSCRIPT

Page 1: Dbm630 Lecture01

DBM630: Data Mining and

Data Warehousing

MS.IT. Rangsit University

Lecture 1

Introduction to Data Mining and Data Warehousing

by Kritsada Sriphaew (sriphaew.k AT gmail.com)

Text: Data Mining: Concepts and Techniques, By Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers (2006). ISBN: 978-1558609013

Semester 2/2011

Page 2: Dbm630 Lecture01

Administrative Matters

Data Mining and Data Warehousing by Kritsada Sriphaew 2

Course Syllabus

Lecture Notes & Assignments & Quizzes

Course Communication: announcements, discussion, lecture notes, etc.

Page: http://www.facebook.com/pages/Data-mining-MSIT-RSU/

Page 3: Dbm630 Lecture01

How will we be evaluated?

Assessment Tasks

To pass: at least 60% of the overall score.

Tasks                                   % Score
Quizzes (approx. 2 times)               20
Assignment (Discussion/Demonstration)   20
Final                                   60

Page 4: Dbm630 Lecture01

Text Books

Mandatory Book

Data Mining: Concepts and Techniques
By Jiawei Han and Micheline Kamber
Morgan Kaufmann Publishers (2006), Second Edition
ISBN-10: 1558609016, ISBN-13: 978-1558609013

Supplementary Book

Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
By Ian H. Witten and Eibe Frank
Morgan Kaufmann Publishers (2005), 2nd Edition
ISBN-10: 0120884070, ISBN-13: 978-0120884070

Page 5: Dbm630 Lecture01

Course Description (What will we learn?)

Introduction to data warehousing. Characteristics of data warehousing, drawbacks and benefits of data warehousing, architecture of data warehousing, internal data structure for data warehousing, data integration, creating high quality data, data mart, online analytical processing (OLAP). Introduction to data mining, types of data for mining, architecture of typical data mining system, data preprocessing, association rule mining, classification and prediction, clustering, data mining applications, current trends in data mining, text mining, web mining, including tools for data mining analysis such as WEKA, SAS, etc.

Thai course description (translated): Fundamental concepts of data warehousing; characteristics of data warehouses; advantages and disadvantages of data warehouses; data warehouse architecture; internal data storage structure of data warehouses; data integration; creating high-quality data; data marts; online analytical processing. Fundamental concepts of data mining; types of data for mining; architecture of data mining systems; data preparation; association rule mining; classification and prediction; clustering; mining complex data; data mining applications; current trends in data mining; text mining; web mining; including the use of tools for data mining analysis such as WEKA and SAS.

Page 6: Dbm630 Lecture01

Course Schedule (tentative)

Week  Date    Topics
1     8 JAN   Introduction to Data Mining and Data Warehousing
2     15 JAN  Data Warehouse and OLAP Technology – I
3     22 JAN  Data Warehouse and OLAP Technology – II
4     29 JAN  Data Mining Concepts and Data Preparation
5     5 FEB   Association Rule Mining
6     12 FEB  Classification Model: Decision Tree, Classification Rules
7     19 FEB  Classification Model: Naïve Bayes
8     26 FEB  Prediction Model: Regression
9     4 MAR   Clustering
10    11 MAR  Data Mining Application: Text Mining, Web Mining, Social Network Analysis
11    18 MAR  Introduction to Data Mining Tool: WEKA
12    25 MAR  Tutorials
Final

Page 7: Dbm630 Lecture01

Prerequisites

Basic Database Concepts

Basic Statistics:

Probability, Sampling, Logic, Linear Regression, …

Algorithms:

Basic Data Structures, Dynamic Programming, ...

We will provide some background, but the class will be fast-paced, so it helps to have some of the basics in advance.

Page 8: Dbm630 Lecture01

Introduction

Motivation: Why mine data?

KDD: Knowledge Discovery in Databases

What is Data Mining?

Data Mining: on What kind of Data?

Data Mining Tasks

Data Mining Applications

Page 9: Dbm630 Lecture01

Evolution of Database Technology

1960s:

Data collection, database creation, IMS and network DBMS

1970s:

Relational data model, relational DBMS implementation

1980s:

RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s—2000s:

Data mining and data warehousing, multimedia databases, and Web databases

Page 10: Dbm630 Lecture01

Large Data Sets: A Motivation

There is often information “hidden” in the data that is not readily evident.

Human analysts take weeks to discover useful information.

Much of the data is never analyzed at all.

How do you explore millions of records, tens or hundreds of fields, and find patterns?

Page 11: Dbm630 Lecture01

KDD Process (Knowledge Discovery in Databases)

Adapted from: U. Fayyad et al. (1996), “From Data Mining to Knowledge Discovery: An Overview,” in Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

[Figure: KDD process. Data -> (Selection) -> Target Data -> (Preprocessing) -> Preprocessed Data -> (Data Mining) -> Patterns -> (Interpretation/Evaluation) -> Knowledge]
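The stages of the KDD process can be sketched as a chain of small functions. This is only an illustrative sketch: the records, the field names, and the `min_count` threshold are hypothetical, and the "mining" step is a trivial frequency count standing in for a real algorithm.

```python
from collections import Counter

# Hypothetical raw records: (customer_id, age, item). None marks a missing value.
raw_data = [
    (1, 34, "soda"),
    (2, None, "bananas"),
    (3, 27, "soda"),
    (4, 45, "detergent"),
    (5, 31, "soda"),
]

def select(records):
    """Selection: keep only the fields relevant to the question (age, item)."""
    return [(age, item) for _, age, item in records]

def preprocess(records):
    """Preprocessing: drop records that contain a missing value."""
    return [r for r in records if None not in r]

def mine(records):
    """Data mining: count how often each item occurs (a trivial 'pattern')."""
    return Counter(item for _, item in records)

def evaluate(patterns, min_count=2):
    """Interpretation/evaluation: keep only patterns that meet a threshold."""
    return {item: n for item, n in patterns.items() if n >= min_count}

target = select(raw_data)      # Data -> Target Data
clean = preprocess(target)     # Target Data -> Preprocessed Data
patterns = mine(clean)         # Preprocessed Data -> Patterns
knowledge = evaluate(patterns) # Patterns -> Knowledge
print(knowledge)
```

Each function corresponds to one arrow in the figure, so a different technique can replace any one stage without changing the others.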

Page 12: Dbm630 Lecture01

Knowledge Discovery

Page 13: Dbm630 Lecture01

Business Intelligence (BI) vs. Data Mining

Business intelligence is a term for the processes, techniques, and tools that support business decision-making using information technology.

[Figure: BI pyramid. The potential to support business decisions increases toward the top.]

Layers, top to bottom:

Making Decisions

Data Presentation: Visualization Techniques

Data Mining: Knowledge Discovery

Data Exploration: OLAP; Statistical Analysis, Querying and Reporting

Data Warehouses / Data Marts

Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA

Page 14: Dbm630 Lecture01

Terminology

Data Mining: A step in the knowledge discovery process consisting of particular algorithms (methods) that, under some acceptable objective, produce a particular enumeration of patterns (models) over the data.

Knowledge Discovery Process: The process of using data mining methods (algorithms) to extract (identify) what is deemed knowledge according to the specifications of measures and thresholds, using a database along with any necessary preprocessing or transformations.

Page 15: Dbm630 Lecture01

Other definitions of Data Mining

Non‐trivial extraction of implicit, previously unknown and useful information from data

Automatic or semi-automatic process for analyzing large databases to find patterns that are:

valid: hold on new data with some certainty

novel: non‐obvious to the system

useful: should be possible to act on the item

understandable: humans should be able to interpret the pattern

Page 16: Dbm630 Lecture01

Origins of Data Mining

Overlaps with various fields, but focuses on:

Scalability

Algorithm and Architecture

Automation to handle large data

Page 17: Dbm630 Lecture01

Data Mining: on What kind of Data?

Relational Databases

Data Warehouses

Transactional Databases

Advanced Database Systems: Object-Relational; Spatial and Temporal; Time-Series; Multimedia; Text; Heterogeneous, Legacy, and Distributed; WWW

[Figure: example data sources. A GeneFilter Comparison Report (yeast ORF name, gene name, chromosome, raw and normalized intensities for two filters, difference, and ratio), shown alongside images labeled "Structure - 3D Anatomy", "Function - 1D Signal", and "Metadata - Annotation".]

Page 18: Dbm630 Lecture01

Data Mining Tasks

Classification

Clustering

Association Rule Mining

Sequential Pattern Discovery

Regression

Anomaly Detection

Page 19: Dbm630 Lecture01

Ex: Classifying Galaxies

Page 20: Dbm630 Lecture01

Ex: Market Basket Analysis

Where should detergents be placed in the store to maximize their sales?

Are window cleaning products purchased when detergents and orange juice are bought together?

How are the demographics of the neighborhood affecting what customers are buying?

Is soda typically purchased with bananas? Does the brand of soda make a difference?
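A question such as "Is soda typically purchased with bananas?" is usually answered with the support and confidence measures from association rule mining. The sketch below computes both on a handful of made-up baskets; the transactions and the helper names `support` and `confidence` are hypothetical and chosen only for illustration.

```python
# Hypothetical transactions; each set holds the items in one basket.
transactions = [
    {"soda", "bananas"},
    {"soda", "bananas", "detergent"},
    {"soda", "chips"},
    {"bananas", "orange juice"},
    {"detergent", "orange juice", "window cleaner"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A).

    Assumes the antecedent occurs in at least one basket.
    """
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

# How often do soda and bananas appear together?
sup = support({"soda", "bananas"}, transactions)

# Of the baskets with bananas, what fraction also contain soda?
conf = confidence({"bananas"}, {"soda"}, transactions)

print(f"support = {sup:.2f}, confidence = {conf:.2f}")
```

A rule like "bananas => soda" is normally reported only when both values exceed thresholds chosen by the analyst.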

Page 21: Dbm630 Lecture01

Ex: Anomaly Detection

Detect significant deviations from normal behavior

Applications:

Credit Card Fraud Detection

Network Intrusion Detection
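One simple way to make "significant deviation from normal behavior" concrete is to flag values that lie far from the mean, measured in standard deviations. The sketch below is a minimal illustration of that idea only; the transaction amounts and the threshold of 2.0 are assumptions, and real fraud or intrusion detection systems use considerably richer models.

```python
from statistics import mean, stdev

def find_anomalies(values, threshold=2.0):
    """Return the values lying more than `threshold` sample standard
    deviations away from the mean of `values`."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

# Hypothetical card transaction amounts; one is far larger than the rest.
amounts = [20, 25, 22, 19, 24, 21, 23, 500]

anomalies = find_anomalies(amounts)
print(anomalies)
```

With very small samples a single extreme value inflates the standard deviation itself, so a lower threshold is used here than the common rule of thumb of 3.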

Page 22: Dbm630 Lecture01

Some Success Stories

Network intrusion detection using a combination of sequential rule discovery and classification trees on 4 GB of DARPA data

Outperformed the (manual) knowledge engineering approach

http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides a good, detailed description of the entire process

Major US bank: customer attrition prediction

Segment customers based on financial behavior: 3 segments

Build attrition models for each of the 3 segments

40-50% of attritions were predicted, a factor-of-18 increase

Targeted credit marketing: major US banks

Find customer segments based on 13 months of credit balances

Build another response model based on surveys

Response increased 4 times (2%)

Page 23: Dbm630 Lecture01

How You’ll Benefit

Confidently discuss the role and applicability of data warehousing and data mining to business/organization problems

Gain background knowledge for further exploration in your thesis, independent study, or career projects, since data mining methods (for extracting knowledge from data) are useful in almost every field.

Page 24: Dbm630 Lecture01

Assignment

Assignments will aim to test your detailed knowledge and understanding of the topics, as well as your critical thinking and research ability. Assignments may include tasks involving: writing detailed designs; reading research papers; learning and using specialist software/hardware.

Assessment: the assignment will be worth 20% of the total course assessment.

Page 25: Dbm630 Lecture01

PreTest

1. Select only one of the following items to fill in each blank.

(a) Characterization/Discrimination

(b) Classification

(c) Numeric Prediction

(d) Clustering

(e) Association Analysis

(f) Trend Analysis

Which function matches each of the following tasks?

______(1) To estimate the price of stock A next month.

______(2) To display the proportion of products sold, by type.

______(3) To know which products are likely to be sold with which other products.

______(4) To group customers into sets of similar customers based on their features.

______(5) To find the value of an experiment when a substance is tested.

______(6) To predict whether a customer is likely to be a good customer.

2. Assume that we want to design a model to forecast tomorrow’s SET index. Please suggest the details of the model that we should construct, and recommend the input and output of the model.