Data Warehousing & Data Mining. Lecturer: Dr. Bo Yuan. E-mail: [email protected]

Upload: butest

Post on 11-May-2015


TRANSCRIPT

Page 1: PowerPoint Template


Data Warehousing &

Data Mining

Lecturer: Dr. Bo Yuan

E-mail: [email protected]

Page 2: PowerPoint Template

Welcome

2

Page 3: PowerPoint Template

Mining? Warehousing?

3

Page 4: PowerPoint Template

Data Rich, Information Poor

4

Page 5: PowerPoint Template

Heterogeneous Data

5

Page 6: PowerPoint Template

The Value of Data

6

Page 7: PowerPoint Template

Data Integration & Analysis

7

Page 8: PowerPoint Template

From Data To Intelligence

8

[Figure labels: Decision Models, Data Mining, Preprocessing, Database; Decision Support, Knowledge, Information, Data]

Page 9: PowerPoint Template

Business Intelligence

9

Page 10: PowerPoint Template

Related Areas

10

Data Mining

Page 11: PowerPoint Template

Is DM really important?

Q: Your job sounds extremely interesting. What jobs would you recommend to a young person with an interest, and maybe a bachelors degree, in economics?

A: If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis. So my recommendation is to take lots of courses about how to manipulate and analyze data: databases, machine learning, econometrics, statistics, visualization, and so on.

An interview with Google Chief Economist Hal Varian from the New York Times

11

Page 12: PowerPoint Template

It is all about data …

12

Financial Institutions

Healthcare

Telecommunication

Consulting Companies

Government

Bioinformatics

WWW

Retail

Page 14: PowerPoint Template

Aims & Objectives

Course Aims
• To gain a good understanding of popular data mining techniques.
• To gain experience in implementing and using data mining methods.
• To gain an appreciation for the basic principles of data warehousing.

Learning Objectives
• Able to implement and apply data mining techniques to solve problems.
• Understand the main issues and core problems in data mining.
• Understand the relationship between data mining and other fields.
• Appreciate data mining research ideas and practice.
• Get familiar with academic writing and presentation.

Graduate Attributes
• In-depth knowledge of the field of study
• Effective communication
• Independence and teamwork
• Critical judgment

14

Page 15: PowerPoint Template

Learning Activities

Week 1: Introduction

Week 2: Principles of Data Warehousing (ETL, OLAP, Metadata)

Week 3: Data Preprocessing

Week 4 – Week 7: Data Mining (Foundations): Bayesian Classifiers, Decision Trees, Neural Networks, Regression, Clustering, Support Vector Machines, Association Rules

Week 8: Field Study

Week 9 – Week 11: Data Mining (Advanced): Semi-supervised Learning, Active Learning, Ensemble Learning, Evolutionary Computation

Week 12 – Week 13: Special Topic A (Text Mining & Web Information Retrieval)

Week 14: Special Topic B (Bioinformatics, CRM, Privacy Issue)

Week 15: Project Presentation

15

Page 16: PowerPoint Template

Assessment

Assignment 1
• Type: Class Presentation
• Weight: 10%
• Task Description: Individual 25-minute talks on selected topics

Assignment 2
• Type: Algorithm Experimentation
• Weight: 10%
• Task Description: Coding and testing of selected data mining algorithms

Assignment 3
• Type: Problem Solving
• Weight: 30%
• Task Description: Group project on solving real-world data mining problems

Final Exam
• Type: Closed Book Examination
• Weight: 50%
• Duration: 120 minutes

16

Presentation matters!

Page 18: PowerPoint Template

Learning Resources

18

International Conference on Data Mining

International Conference on Data Engineering

International Conference on Machine Learning

Pacific-Asia Conference on Knowledge Discovery and Data Mining

ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Page 19: PowerPoint Template

Rules & Policies

Plagiarism

Plagiarism is the act of misrepresenting as one's own original work the ideas, interpretations, words or creative works of another.
• Direct copying of paragraphs, sentences, a single sentence or significant parts of a sentence.
• Presenting work done in collaboration with others as independent work.
• Copying ideas, concepts, research results, computer code, statistical tables, designs, images, sounds or text, or any combination of these.
• Paraphrasing, summarizing or simply rearranging another person's words, ideas, etc., without changing the basic structure and/or meaning of the text.
• Copying or adapting another student's original work into a submitted assessment item.

19

Page 21: PowerPoint Template

21

10 Minutes …

Page 22: PowerPoint Template

Data

Definition: “Data are pieces of information that represent the qualitative or quantitative attributes of a variable or set of variables. Data are often viewed as the lowest level of abstraction from which information and knowledge are derived.”

Data Types: Continuous, Binary, Discrete, String, Symbolic

Storage: Physical, Logical

Major Issues: Transformation; Errors and corruption

22

Page 23: PowerPoint Template

Database

Definition: “A database is an integrated collection of logically related records or files that is stored in a computer system which consolidates records previously stored in separate files into a common pool of data records that provides data for many applications.”

“A database is a collection of information that is organized so that it can easily be accessed, managed, and updated.”

Relational Databases

23

Page 24: PowerPoint Template

Relational Model

24

Page 25: PowerPoint Template

First Normal Form (1NF)

There's no top-to-bottom ordering to the rows.

There's no left-to-right ordering to the columns.

There are no duplicate rows.

Every cell contains exactly one value from the applicable domain.

25

Customer

Customer ID | First Name | Surname | Telephone Number
123 | Robert | Ingram | 555-861-2025
456 | Jane | Wright | 555-403-1659
789 | Maria | Fernandez | 555-808-9633

Page 26: PowerPoint Template

First Normal Form (1NF)

26

Customer

Customer ID | First Name | Surname | Telephone Number
123 | Robert | Ingram | 555-861-2025
456 | Jane | Wright | 555-403-1659, 555-776-4100
789 | Maria | Fernandez | 555-808-9633

Customer

Customer ID | First Name | Surname | Tel. No. 1 | Tel. No. 2 | Tel. No. 3
123 | Robert | Ingram | 555-861-2025 | |
456 | Jane | Wright | 555-403-1659 | 555-776-4100 |
789 | Maria | Fernandez | 555-808-9633 | |

Page 27: PowerPoint Template

First Normal Form (1NF)

27

Customer Name

Customer ID | First Name | Surname
123 | Robert | Ingram
456 | Jane | Wright
789 | Maria | Fernandez

Customer Telephone No.

Customer ID | Telephone No.
123 | 555-861-2025
456 | 555-403-1659
456 | 555-776-4100
789 | 555-808-9633
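The two-table design above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table and column names (customer_name, customer_telephone, telephone_no) and the helper phones_for are assumptions made for the example, not part of the lecture material.

```python
import sqlite3

# Illustrative schema; table and column names are assumptions for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_name (
        customer_id INTEGER PRIMARY KEY,
        first_name  TEXT NOT NULL,
        surname     TEXT NOT NULL
    );
    CREATE TABLE customer_telephone (
        customer_id  INTEGER NOT NULL REFERENCES customer_name(customer_id),
        telephone_no TEXT NOT NULL,
        PRIMARY KEY (customer_id, telephone_no)  -- one row per (customer, number)
    );
""")

conn.executemany(
    "INSERT INTO customer_name VALUES (?, ?, ?)",
    [(123, "Robert", "Ingram"), (456, "Jane", "Wright"), (789, "Maria", "Fernandez")],
)
conn.executemany(
    "INSERT INTO customer_telephone VALUES (?, ?)",
    [(123, "555-861-2025"), (456, "555-403-1659"),
     (456, "555-776-4100"), (789, "555-808-9633")],
)


def phones_for(customer_id):
    """Return all telephone numbers of one customer, sorted alphabetically."""
    rows = conn.execute(
        "SELECT telephone_no FROM customer_telephone "
        "WHERE customer_id = ? ORDER BY telephone_no",
        (customer_id,),
    ).fetchall()
    return [row[0] for row in rows]


print(phones_for(456))
```

Every cell now holds exactly one value, and a customer can have any number of telephone numbers without adding columns.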

Page 28: PowerPoint Template

Second Normal Form (2NF)

Definition: A 1NF table is in 2NF if and only if none of its non-prime attributes are functionally dependent on a part (proper subset) of a candidate key.

28

Employees' Skills

Employee | Skill | Current Work Location
Jones | Typing | 114 Main Street
Jones | Shorthand | 114 Main Street
Jones | Whittling | 114 Main Street
Bravo | Light Cleaning | 73 Industrial Way
Ellis | Alchemy | 73 Industrial Way
Ellis | Juggling | 73 Industrial Way
Harrison | Light Cleaning | 73 Industrial Way

Page 29: PowerPoint Template

Second Normal Form (2NF)

29

Employees

Employee | Current Work Location
Jones | 114 Main Street
Bravo | 73 Industrial Way
Ellis | 73 Industrial Way
Harrison | 73 Industrial Way

Employees' Skills

Employee | Skill
Jones | Typing
Jones | Shorthand
Jones | Whittling
Bravo | Light Cleaning
Ellis | Alchemy
Ellis | Juggling
Harrison | Light Cleaning
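A small Python sketch of why the original table is split, assuming {Employee, Skill} is its candidate key. The helper fd_holds is hypothetical; it only checks whether a functional dependency holds in the sample rows, which is evidence for a dependency rather than proof of it.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if, in these rows, the lhs attributes determine the rhs attributes."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen for a key and returns it afterwards
        if seen.setdefault(key, value) != value:
            return False
    return True


# The unnormalized Employees' Skills table from the 2NF slides.
employees_skills = [
    {"employee": "Jones", "skill": "Typing", "location": "114 Main Street"},
    {"employee": "Jones", "skill": "Shorthand", "location": "114 Main Street"},
    {"employee": "Jones", "skill": "Whittling", "location": "114 Main Street"},
    {"employee": "Bravo", "skill": "Light Cleaning", "location": "73 Industrial Way"},
    {"employee": "Ellis", "skill": "Alchemy", "location": "73 Industrial Way"},
    {"employee": "Ellis", "skill": "Juggling", "location": "73 Industrial Way"},
    {"employee": "Harrison", "skill": "Light Cleaning", "location": "73 Industrial Way"},
]

# Location depends on Employee alone, i.e. on only part of the key: a 2NF violation.
partial_dependency = fd_holds(employees_skills, ["employee"], ["location"])

# Decomposition: one table per dependency, duplicates removed via sets.
employees = sorted({(r["employee"], r["location"]) for r in employees_skills})
skills = sorted({(r["employee"], r["skill"]) for r in employees_skills})

print(partial_dependency, len(employees), len(skills))
```

After the split, each employee's work location is stored once, so it cannot become inconsistent across that employee's skill rows.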

Page 30: PowerPoint Template

Third Normal Form (3NF)

Definition: Every non-prime attribute of R is non-transitively dependent (directly dependent) on every key of R.

30

Tournament Winners

Tournament | Year | Winner | Winner Date of Birth
Indiana Invitational | 1998 | Al Fredrickson | 21 July 1975
Cleveland Open | 1999 | Bob Albertson | 28 September 1968
Des Moines Masters | 1999 | Al Fredrickson | 21 July 1975
Indiana Invitational | 1999 | Chip Masterson | 14 March 1977

Page 31: PowerPoint Template

Third Normal Form (3NF)

31

Tournament Winners

Tournament | Year | Winner
Indiana Invitational | 1998 | Al Fredrickson
Cleveland Open | 1999 | Bob Albertson
Des Moines Masters | 1999 | Al Fredrickson
Indiana Invitational | 1999 | Chip Masterson

Player Dates of Birth

Player | Date of Birth
Chip Masterson | 14 March 1977
Al Fredrickson | 21 July 1975
Bob Albertson | 28 September 1968
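The 3NF decomposition above loses no information, which a short Python sketch can check by joining the two tables back together. It assumes {Tournament, Year} is the key of the original table, so that the date of birth depends on the key only through Winner; the variable names are illustrative.

```python
# Original table: (tournament, year, winner, winner date of birth)
tournament_winners = [
    ("Indiana Invitational", 1998, "Al Fredrickson", "21 July 1975"),
    ("Cleveland Open", 1999, "Bob Albertson", "28 September 1968"),
    ("Des Moines Masters", 1999, "Al Fredrickson", "21 July 1975"),
    ("Indiana Invitational", 1999, "Chip Masterson", "14 March 1977"),
]

# Decomposition: drop the transitively dependent column into its own table.
winners = [(t, y, w) for t, y, w, _ in tournament_winners]
birth_dates = {w: dob for _, _, w, dob in tournament_winners}

# Joining on the winner's name rebuilds the original rows.
rejoined = [(t, y, w, birth_dates[w]) for t, y, w in winners]

print(rejoined == tournament_winners, len(birth_dates))
```

Al Fredrickson's date of birth is now stored once instead of once per tournament win.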

Page 32: PowerPoint Template

Data Warehouse

Operational databases are optimized for the preservation of data integrity and speed of recording of business transactions.

Data warehouses are optimized for the speed of data retrieval.

A data warehouse is a repository of an organization's electronically stored data, designed to facilitate reporting and analysis.

W. H. Inmon states that the data warehouse is:
• Subject-oriented
• Time-variant
• Non-volatile
• Integrated

Data Warehousing
• Business Intelligence Tools
• Tools to extract, transform, and load data into the repository
• Tools to manage and retrieve metadata

32

Page 33: PowerPoint Template

Multidimensional Data

33

OLAP Cube

Page 35: PowerPoint Template

To Build a Data Warehouse

Data must be extracted from multiple, heterogeneous sources such as databases or other data feeds.

Data must be formatted for consistency within the data warehouse. Names, meanings and domains of data from unrelated sources must be reconciled.

Data must be cleaned to ensure validity. Data cleaning is an important part of building a data warehouse and is one of the most labor-intensive tasks.

Data must be fitted into the data model of the warehouse. Data may have to be converted from relational, object-oriented, or legacy databases.

Data must be loaded into the warehouse. The sheer volume of data in the warehouse makes loading the data a significant task.
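The five steps above can be sketched as a tiny extract, transform and load pipeline. Everything specific here is invented for illustration: the two source record layouts, the COUNTRY_CODES mapping, and the dim_customer table are assumptions, and a real warehouse load would be far larger.

```python
import sqlite3

# Two hypothetical sources with different field names and coding schemes.
crm_rows = [
    {"cust_id": "1", "Name": " Alice ", "country": "us"},
    {"cust_id": "2", "Name": "Bob", "country": "CN"},
]
billing_rows = [
    {"customer": 3, "full_name": "Carol", "nation": "China"},
    {"customer": 4, "full_name": "", "nation": "USA"},  # missing name
]

COUNTRY_CODES = {"us": "US", "usa": "US", "cn": "CN", "china": "CN"}


def extract():
    """Extract: read each source and reconcile the field names."""
    for r in crm_rows:
        yield {"id": r["cust_id"], "name": r["Name"], "country": r["country"]}
    for r in billing_rows:
        yield {"id": r["customer"], "name": r["full_name"], "country": r["nation"]}


def transform(record):
    """Format: consistent types, trimmed names, one country coding scheme."""
    return {
        "id": int(record["id"]),
        "name": record["name"].strip(),
        "country": COUNTRY_CODES.get(record["country"].strip().lower()),
    }


def is_valid(record):
    """Clean: reject records with a missing name or an unknown country."""
    return bool(record["name"]) and record["country"] is not None


def load(records):
    """Fit the records into the warehouse data model and load them."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO dim_customer VALUES (:id, :name, :country)", records)
    conn.commit()
    return conn


cleaned = [rec for rec in map(transform, extract()) if is_valid(rec)]
warehouse = load(cleaned)
print(warehouse.execute("SELECT * FROM dim_customer ORDER BY id").fetchall())
```

Validation happens before loading, which matches the later point that a warehouse is loaded with consistent, valid data and needs no real time validation.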

35

Page 36: PowerPoint Template

Data Warehouse vs. Database

36

Differences

Data Warehouse | Operational Database
Designed for the analysis of business measures by categories and attributes. | Designed for real time business operations.
Optimized for bulk loads and large, complex, unpredictable queries that access many rows per table. | Optimized for a common set of transactions, usually adding or retrieving a single row at a time per table.
Loaded with consistent, valid data; requires no real time validation. | Optimized for validation of incoming data during transactions; uses validation data tables.
Supports few concurrent users. | Supports thousands of concurrent users.

Page 37: PowerPoint Template

Performance Dashboard

37

Page 38: PowerPoint Template

38

5 Minutes …

Page 39: PowerPoint Template

Data Mining

People have been analysing and investigating data for centuries.
• Statistics: Mean, Variance, Correlation, Distribution …

Today, data are often far beyond human comprehension.
• Diversity
• Volume
• Dimensionality

Definition: Data Mining is the process of automatically extracting interesting and useful hidden patterns from usually massive, incomplete and noisy data.

Not a fully automatic process
• Human intervention is often inevitable.
• Domain Knowledge
• Data Collection and Pre-processing

Synonym: Knowledge Discovery

One Field, Many Techniques, Unlimited Applications

39

Page 40: PowerPoint Template

The Process of Data Mining

40

Page 41: PowerPoint Template

DM Techniques - Classification

“Classification is a procedure in which individual items are placed into groups based on quantitative information on one or more characteristics inherent in the items (referred to as variables, characters, etc.) and based on a training set of previously labeled items.”

Given training data {(x1, y1), …, (xn, yn)}, the task is to produce a classifier that maps any unknown object xi to its true classification label yi defined by some unknown mapping.

Algorithms
• Decision Trees
• K-nearest neighbours
• Neural Networks
• Support Vector Machines

Applications
• Credit Scoring
• Churn Prediction
• Medical Diagnosis
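As one concrete instance of the task above, here is a minimal k-nearest-neighbours sketch: a new object receives the majority label among its k closest training points. The training points, the labels and the function name knn_predict are made up for illustration; the slides do not prescribe this implementation.

```python
import math
from collections import Counter


def knn_predict(training_data, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    nearest = sorted(training_data, key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled training set {(x1, y1), ..., (xn, yn)} with 2-D features.
training_data = [
    ((1.0, 1.0), "low risk"), ((1.5, 2.0), "low risk"), ((2.0, 1.5), "low risk"),
    ((6.0, 6.5), "high risk"), ((7.0, 6.0), "high risk"), ((6.5, 7.5), "high risk"),
]

print(knn_predict(training_data, (1.2, 1.4)))
print(knn_predict(training_data, (6.4, 6.6)))
```

The choice of k and of the distance function (Euclidean here, via math.dist) are parameters of the method rather than fixed parts of it.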

41

X Y

Page 42: PowerPoint Template

Classification Boundaries

42


Page 43: PowerPoint Template

Confusion Matrix

43

Confusion Matrix

             | Actual p       | Actual n       | Total
Predicted p' | True Positive  | False Positive | P'
Predicted n' | False Negative | True Negative  | N'
Total        | P              | N              |

Accuracy = (TP + TN) / (P + N)
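The four cells and the accuracy formula can be computed directly from paired lists of actual and predicted labels. The sample labels below are invented for illustration, and the function names are assumptions of this sketch.

```python
def confusion_counts(actual, predicted, positive="p"):
    """Count (TP, FP, FN, TN) for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Accuracy = (TP + TN) / (P + N), where P = TP + FN and N = FP + TN."""
    return (tp + tn) / (tp + fp + fn + tn)


actual = ["p", "p", "p", "n", "n", "n", "n", "p"]
predicted = ["p", "p", "n", "n", "n", "p", "n", "p"]

counts = confusion_counts(actual, predicted)
print(counts, accuracy(*counts))
```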

Page 44: PowerPoint Template

Receiver Operating Characteristic

44

Page 45: PowerPoint Template

Lift

45

Page 46: PowerPoint Template

DM Techniques - Clustering

Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.

Distance Metrics
• Euclidean distance
• Manhattan distance
• Mahalanobis distance

Algorithms
• K-means
• Leader
• RPCL
• Affinity Propagation

Applications
• Market Research
• Image Segmentation
• Social Network Analysis
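A minimal sketch of two of the distance metrics and of K-means, under simplifying assumptions: the initial centroids are supplied by the caller rather than chosen at random, and the sample points are invented. K-means as written here uses Euclidean distance and cluster means.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def k_means(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

Unlike the classification sketches, no labels are given here: the groups emerge from the distances alone.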

46

What is the difference between classification and clustering?

Page 48: PowerPoint Template

DM Techniques – Association Rule

48

Page 49: PowerPoint Template

Association Rule

49

Example database with 4 items and 5 transactions

Transaction ID | milk | bread | butter | beer
1 | 1 | 1 | 0 | 0
2 | 0 | 1 | 1 | 0
3 | 0 | 0 | 0 | 1
4 | 1 | 1 | 1 | 0
5 | 0 | 1 | 0 | 0
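Using the five transactions above, the two standard rule measures can be computed in a few lines. The slide does not define them, so the definitions here are the usual ones: support is the fraction of transactions containing an itemset, and confidence of a rule A ⇒ B is support(A ∪ B) / support(A).

```python
# The example database, one set of purchased items per transaction.
transactions = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


print(support({"milk", "bread"}))
print(confidence({"milk", "bread"}, {"butter"}))
```

Here {milk, bread} appears in 2 of 5 transactions, and one of those two also contains butter.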

Page 50: PowerPoint Template

DM Techniques – Regression

50

Page 51: PowerPoint Template

Regression

51

Page 52: PowerPoint Template

Overfitting – Regression

52

Page 53: PowerPoint Template

Overfitting – Classification

53

Page 54: PowerPoint Template

Cross Validation

54

[Figure: Data is split into a Training Set and a Test Set; further labels: Generated Models, Evaluation]
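A sketch of k-fold cross validation, one common way to organize the training/test split shown on this slide: the data are divided into k folds, and each fold serves once as the test set while the rest form the training set. The toy data, the one-parameter model and the function names are assumptions made for the example.

```python
def k_fold_splits(data, k):
    """Yield (training_set, test_set) pairs; every item lands in exactly one test set."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test_set = folds[i]
        training_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training_set, test_set


def cross_validate(data, k, train, evaluate):
    """Average the evaluation score of models trained on each split."""
    scores = [evaluate(train(tr), te) for tr, te in k_fold_splits(data, k)]
    return sum(scores) / len(scores)


# Toy regression data with y = 2x (hypothetical).
data = [(x, 2.0 * x) for x in range(1, 11)]


def train(training_set):
    """Fit a line through the origin by least squares; the model is just its slope."""
    return sum(x * y for x, y in training_set) / sum(x * x for x, _ in training_set)


def evaluate(slope, test_set):
    """Mean absolute error of the model on held-out data."""
    return sum(abs(y - slope * x) for x, y in test_set) / len(test_set)


mean_error = cross_validate(data, k=5, train=train, evaluate=evaluate)
print(mean_error)
```

Because each model is scored only on data it never saw during training, the averaged score is a fairer estimate of generalization than the training error.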

Page 55: PowerPoint Template

Seeing is Knowing

55

Page 56: PowerPoint Template

Data Preprocessing

Why data preprocessing? Real data are often surprisingly dirty.
• Incomplete Data
• Inconsistent Data
• Noisy Data

Typical Issues
• Missing Attribute Values
• Different Coding/Naming Schemes
• Infeasible Values
• Outliers

Data Quality
• Accuracy
• Completeness
• Consistency
• Interpretability
• Credibility
• Timeliness

56

Page 57: PowerPoint Template

Data Preprocessing

Data quality is a crucial factor in successful data mining tasks.

Data Cleaning
• Fill in missing values.
• Correct inconsistent data.
• Identify outliers and noisy data.

Data Integration
• Combine data from different sources.

Data Transformation
• Normalization
• Aggregation
• Type Conversion

Data Reduction
• Feature Selection
• Sampling
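Three of the steps above in miniature: filling in missing values, flagging outliers, and normalization. The specific choices (mean imputation, a z-score threshold of 2, min-max scaling to [0, 1]) are common options picked for this sketch, not methods prescribed by the slides, and the sample values are invented.

```python
def fill_missing(values):
    """Data cleaning: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def flag_outliers(values, threshold=2.0):
    """Data cleaning: mark values more than `threshold` standard deviations from the mean."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [std > 0 and abs(v - mean) / std > threshold for v in values]


def min_max_normalize(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [20, None, 40, 30]
filled = fill_missing(ages)
scaled = min_max_normalize(filled)
flags = flag_outliers([10, 11, 9, 10, 12, 10, 11, 9, 10, 100])

print(filled)
print(scaled)
print(flags)
```

Flagged outliers still need a human decision: some are errors to correct, others are the interesting cases.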

57

Page 58: PowerPoint Template

Review

What is data mining?

Why is data mining important?

What are the typical data mining applications?

What is the general procedure of data mining?

What are the major techniques in data mining?

What is the difference between data warehouses and databases?

What to expect in this course?

Where to find relevant information?

How to make the most of this course?

58

Page 59: PowerPoint Template

Just in Case Someone Asks …

59

Page 60: PowerPoint Template

Just in Case Someone Asks …

60