SCCS 453 Data Warehousing and Data Mining
Semester 2, Year 2006
Lecture 8: Overview of Data Mining Techniques

Songsri Tangsripairoj, Ph.D.
[email protected]
Department of Computer Science, Faculty of Science, Mahidol University

This lecture is used in the Data Warehousing and Data Mining course.

Topics
  Data Mining Tasks
  Data Mining Techniques
  Data Mining Models
  Data Mining Functions
  Demonstration Data Sets
  Data Mining Tools
    Shopping Cart Analyzer (SCA)
    Weka by the University of Waikato

Data Mining Tasks
  Descriptive
    Characterize general properties of the data in databases
    Clustering and Summarization
  Predictive
    Perform inference on the current data in order to make predictions
    Classification and Estimation

Data Mining Techniques

  Statistical techniques
    Have strong diagnostic tools
    Can be used for the development of confidence intervals on parameter estimates and for hypothesis testing
  Artificial Intelligence techniques
    Require fewer assumptions about the data
    Are generally more automatic

Data Mining Techniques

  Statistical
    Market-Basket Analysis - find groups of items
    Memory-Based Reasoning - case based
    Cluster Detection - undirected (quantitative MBA)
  Artificial Intelligence
    Link Analysis - MCI’s Friends & Family
    Decision Trees, Rule Induction - production rule
    Neural Networks - automatic pattern detection
    Genetic Algorithms - keep best parameters
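
Market-basket analysis, listed above as finding groups of items, can be illustrated by counting how often pairs of items appear in the same basket. The following is only a minimal sketch of that idea, not the method used in the lecture or in the Shopping Cart Analyzer: the baskets, the function name `pair_support`, and the 0.6 threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets, min_support=0.5):
    """Return {pair: support} for item pairs found in at least
    min_support of the baskets (support = baskets with both items / total)."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: n / total for pair, n in counts.items() if n / total >= min_support}


# Hypothetical shopping baskets (illustration only)
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "beer"],
]
frequent = pair_support(baskets, min_support=0.6)
for pair, support in sorted(frequent.items()):
    print(pair, support)
```

Raising or lowering `min_support` trades off how many item groups are reported against how often each one actually occurs.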

Comparison of Features

Feature          Rules      Neural Net  Case Base  Genetic
Noisy data       Good       Very good   Good       Very good
Missing data     Good       Good        Very good  Good
Large sets       Very good  Poor        Good       Good
Different types  Good       Numerical   Very good  Transform
Accuracy         High       Very high   High       High
Explanation      Very good  Poor        Very good  Good
Integration      Good       Good        Good       Very good
Ease             Easy       Difficult   Easy       Difficult

Data Mining Models

  Regression: Y = a + bX
  Classification: assign new record to class
  Predictive: assign value to new record
  Clustering: groups for data
  Time-series: assign future value
  Links: patterns in data
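
The regression model Y = a + bX can be fitted by ordinary least squares, which picks the intercept a and slope b that minimize the sum of squared errors. The sketch below is an illustration under stated assumptions rather than material from the lecture: the function name `fit_line` and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) minimizing the sum of squared errors for y = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical points lying exactly on y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(xs, ys)
print(f"Y = {a:.2f} + {b:.2f}X")
```

Once a and b are known, predicting a value for a new record is a single evaluation of a + b * x.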

Data Mining Modeling Tools

Tool (Radding algorithm / Peacock function)  Basis       Task
Cluster detection / Cluster analysis         Statistics  Classification
Regression models                            Statistics  Estimation
Logistic regression                          Statistics  Classification
Discriminant analysis                        Statistics  Classification
Neural networks / Neural networks            AI          Classification
Kohonen nets                                 AI          Cluster
Decision trees / Association rules           AI          Classification
Rule induction / Association rules           AI          Description
Link analysis                                            Description
Query tools                                              Description
Descriptive statistics                       Statistics  Description
Visualization tools                          Statistics  Description

Data Mining Functions

  Classification: identify categories in data
  Prediction: formula to predict future observations
  Association: rules using relationships among entities
  Detection: anomalies & irregularities (fraud detection)

Financial Applications

Technique   Application             Problem Type
Neural net  Forecast stock price    Prediction
NN, Rule    Forecast bankruptcy     Prediction
NN, Rule    Fraud detection         Detection
NN, Case    Forecast interest rate  Prediction
NN, visual  Late loan detection     Detection
Rule        Credit assessment       Prediction
Rule        Risk classification     Classification
Rule, Case  Corporate bond rate     Prediction

Telecom Applications

Technique                   Application                Problem Type
Neural net, Rule induction  Forecast network behavior  Prediction
Rule induction              Churn                      Classification
Rule induction              Fraud detection            Detection
Case based                  Call tracking              Classification

Marketing Applications

Technique                        Application            Problem Type
Rule induction                   Market segment         Classification
Rule induction                   Cross-selling          Association
Rule induction, visual           Lifestyle analysis     Classification
Rule induction, visual           Performance analysis   Association
Rule induction, genetic, visual  Reaction to promotion  Prediction
Case based                       Online sales support   Classification

Web Applications

Technique                      Application                        Problem Type
Rule induction, Visualization  User browsing similarity analysis  Classification, Association
Rule-based heuristics          Web page content similarity        Association

Other Applications

Technique                   Application            Problem Type
Neural net                  Software cost          Detection
Neural net, rule induction  Litigation assessment  Prediction
Rule induction              Insurance fraud        Detection
Rule induction              Healthcare exceptions  Detection
Case based                  Insurance claim        Prediction
Case based                  Software quality       Classification
Genetic algorithm           Budget spending        Classification

Demonstration Data Sets

  Loan Application Data: classification
  Job Application Data: classification
  Insurance Fraud Data: detection
  Expenditure Data: prediction

Loan Data

  650 observations
  OUTCOMES (binary):
    On-time: cost of error $300
    Late (default): cost of error $2,000
  Variables: Age, Income, Assets, Debts, Want, Credit
    Credit ordinal
    Transform: Assets, Debts, & Want → Risk
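
Because the two error costs differ, a 50% probability cutoff is generally not the cheapest decision rule. The sketch below assumes a reading the slide does not spell out: that $300 is the cost of wrongly flagging an on-time applicant as late, and $2,000 the cost of wrongly treating a late payer as on-time. Under that assumption, predicting "late" has the lower expected cost whenever p_late × 2000 > (1 − p_late) × 300, that is, when p_late > 300 / 2300 ≈ 0.13. The probability `p_late` is hypothetical and would come from whatever classifier is used.

```python
COST_FALSE_LATE = 300.0    # actual on-time, predicted late (assumed reading)
COST_MISSED_LATE = 2000.0  # actual late, predicted on-time (assumed reading)


def late_threshold(cost_false_late=COST_FALSE_LATE, cost_missed_late=COST_MISSED_LATE):
    """Probability of 'late' above which predicting 'late' is cheaper in expectation."""
    return cost_false_late / (cost_false_late + cost_missed_late)


def decide(p_late):
    """Pick the label with the lower expected misclassification cost."""
    cost_if_predict_on_time = p_late * COST_MISSED_LATE
    cost_if_predict_late = (1.0 - p_late) * COST_FALSE_LATE
    return "late" if cost_if_predict_late < cost_if_predict_on_time else "on-time"


print(round(late_threshold(), 4))
for p in (0.05, 0.20, 0.50):  # hypothetical model outputs
    print(p, decide(p))
```

With these costs, even a 20% chance of default is enough to tip the decision toward "late".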

Example Loan Data

Job Application Data

  500 observations
  OUTCOMES (ordinal): Unacceptable, Minimal, Acceptable, Excellent
  Variables: Age, State, Degree, Major, Experience
    State nominal; degree & major ordinal
    State is superfluous
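
The distinction between ordinal and nominal variables affects how they are coded for a model. An ordinal outcome can be mapped to integer ranks that preserve its order, while a nominal variable such as State has no order and is commonly turned into 0/1 indicators. The following is a minimal sketch of both encodings; the helper names and the state codes are hypothetical, and the lecture does not say which encoding its tools apply.

```python
# Outcome labels from the slide, listed from lowest to highest
OUTCOME_ORDER = ["Unacceptable", "Minimal", "Acceptable", "Excellent"]
OUTCOME_RANK = {label: rank for rank, label in enumerate(OUTCOME_ORDER)}


def encode_outcome(label):
    """Map an ordinal outcome label to its rank (0 = lowest)."""
    return OUTCOME_RANK[label]


def one_hot(value, categories):
    """Encode a nominal value as a 0/1 indicator list; no order is implied."""
    return [1 if value == c else 0 for c in categories]


print(encode_outcome("Acceptable"))
print(encode_outcome("Minimal") < encode_outcome("Excellent"))
print(one_hot("CA", ["CA", "NY", "TX"]))  # hypothetical state codes
```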

Example Job Application Data

Insurance Claim Data

  5,000 observations
  OUTCOMES (binary):
    OK: cost of error $500
    Fraudulent: cost of error $2,500
  Variables: Age, Gender, Claim, Tickets, Prior claims, Attorney
    Gender & attorney nominal; tickets & prior claims categorical
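
With unequal error costs, two models that make the same number of mistakes can differ widely in total cost, so counting errors alone can mislead. The sketch below assumes (the slide does not say so explicitly) that $500 is the cost of flagging an OK claim as fraudulent and $2,500 the cost of passing a fraudulent claim as OK. The error counts for the two models are hypothetical.

```python
COST_OK_FLAGGED = 500     # actual OK, predicted fraudulent (assumed reading)
COST_FRAUD_MISSED = 2500  # actual fraudulent, predicted OK (assumed reading)


def total_error_cost(ok_flagged, fraud_missed):
    """Total cost of a model's errors, given the count of each error type."""
    return ok_flagged * COST_OK_FLAGGED + fraud_missed * COST_FRAUD_MISSED


# Two hypothetical models, each making 100 errors on the 5,000 claims
model_a = total_error_cost(ok_flagged=90, fraud_missed=10)
model_b = total_error_cost(ok_flagged=10, fraud_missed=90)
print(model_a, model_b)
```

Both models have the same error rate, yet the one that misses more fraudulent claims costs more than three times as much.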

Example Insurance Claim Data

Expenditure Data

  10,000 observations
  OUTCOMES:
    Could predict response in a number of categories
    Others
  Variables: Age, Gender, Marital, Dependents, Income, Job years, Town years, Education years, Drivers license, Own home, Number of credit cards
    Churn, proportion of income spent on seven categories

Example Expenditure Data