Overview of Methods Data mining techniques What techniques do, examples, Advantages & disadvantages

Upload: claud-burke

Post on 01-Jan-2016

Overview of Methods

Data mining techniques

What techniques do, examples,

Advantages & disadvantages

McGraw-Hill/Irwin ©2007 The McGraw-Hill Companies, Inc. All rights reserved

History

• Statistics
• AI:
  – genetic algorithms, neural networks
    • analogies with biology
  – memory-based reasoning
  – link analysis from graph theory

Techniques

• Statistical
  – Market-Basket Analysis - find groups of items
  – Memory-Based Reasoning - case based
  – Cluster Detection - undirected (quantitative MBA)
• Artificial Intelligence
  – Link Analysis - MCI’s Friends & Family
  – Decision Trees, Rule Induction - production rule
  – Neural Networks - automatic pattern detection
  – Genetic Algorithms - keep best parameters
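
As a rough illustration of what "find groups of items" can mean in market-basket analysis, the sketch below counts how often each pair of items turns up in the same transaction. The baskets and the `pair_counts` helper are invented for illustration and do not come from the slides.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical transactions (illustrative only)
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]

counts = pair_counts(baskets)
for pair, n in counts.most_common():
    support = n / len(baskets)  # share of baskets that contain both items
    print(pair, n, f"support={support:.2f}")
```

Pairs with high support are candidates for "groups of items"; a fuller analysis would usually also look at measures such as confidence or lift.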

Models

• Regression: Y = a + bX
• Classification: assign new record to class
• Predictive: assign value to new record
• Clustering: groups for data
• Time-series: assign future value
• Links: patterns in data
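
The regression model Y = a + bX can be fitted by ordinary least squares. The sketch below is one minimal way to do that in plain Python; the `fit_line` helper and the sample data are assumptions made for illustration. The last line also shows the "predictive" use: assigning a value to a new record.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = a + bX; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b


# Made-up data lying exactly on y = 1 + 2x
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

a, b = fit_line(xs, ys)
prediction = a + b * 6  # value assigned to a new record with X = 6
print(a, b, prediction)
```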

Fitting

• Underfitting: not enough detail
  – leave out important variables
• Overfitting: too much detail
  – memorizes training set, but doesn’t help with new data
    • data set too small
    • redundancy in data
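
The contrast between underfitting and overfitting can be seen on a small made-up data set. In the sketch below, all data and model names are assumptions for illustration: `underfit` ignores X entirely (an important variable left out), `line` is a least-squares fit, and `memorizer` simply returns the outcome of the nearest training point.

```python
def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up training data: roughly y = 2x with alternating +1 / -1 noise
train_x = [1, 2, 3, 4, 5, 6]
train_y = [3, 3, 7, 7, 11, 11]
# Made-up new data on the underlying relation y = 2x
test_x = [1.4, 2.4, 3.4, 4.4, 5.4]
test_y = [2 * x for x in test_x]

mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
     / sum((x - mean_x) ** 2 for x in train_x))
a = mean_y - b * mean_x


def underfit(x):
    return mean_y  # ignores x: not enough detail


def line(x):
    return a + b * x  # least-squares line


def memorizer(x):
    # returns the outcome of the closest training point: too much detail
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


results = {
    name: (mse(model, train_x, train_y), mse(model, test_x, test_y))
    for name, model in [("underfit", underfit), ("line", line), ("memorizer", memorizer)]
}
for name, (train_err, test_err) in results.items():
    print(f"{name:10s} train={train_err:.3f} test={test_err:.3f}")
```

On this data the memorizer has zero training error but does worse than the line on the new points, while the underfit model does poorly on both.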

Comparison of Features

Feature          Rules      Neural Net  Case Base  Genetic
Noisy data       Good       Very good   Good       Very good
Missing data     Good       Good        Very good  Good
Large sets       Very good  Poor        Good       Good
Different types  Good       Numerical   Very good  Transform
Accuracy         High       Very high   High       High
Explanation      Very good  Poor        Very good  Good
Integration      Good       Good        Good       Very good
Ease             Easy       Difficult   Easy       Difficult

Data Mining Functions

• Classification
  – Identify categories in data
• Prediction
  – Formula to predict future observations
• Association
  – Rules using relationships among entities
• Detection
  – Anomalies & irregularities (fraud detection)

Financial Applications

Technique   Application             Problem Type
Neural net  Forecast stock price    Prediction
NN, Rule    Forecast bankruptcy     Prediction
            Fraud detection         Detection
NN, Case    Forecast interest rate  Prediction
NN, visual  Late loan detection     Detection
Rule        Credit assessment       Prediction
            Risk classification     Classification
Rule, Case  Corporate bond rate     Prediction

Telecom Applications

Technique                Application              Problem Type
Neural net, Rule induct  Forecast network behav.  Prediction
Rule induct              Churn                    Classification
                         Fraud detection          Detection
Case based               Call tracking            Classification

Marketing Applications

Technique                     Application            Problem Type
Rule induct                   Market segment         Classification
                              Cross-selling          Association
Rule induct, visual           Lifestyle analysis     Classification
                              Performance analy.     Association
Rule induct, genetic, visual  Reaction to promotion  Prediction
Case based                    Online sales support   Classification

Web Applications

Technique                   Application                      Problem Type
Rule induct, Visualization  User browsing similarity analy.  Classification, Association
Rule-based heuristics       Web page content similarity      Association

Other Applications

Technique                Application            Problem Type
Neural net               Software cost          Detection
Neural net, rule induct  Litigation assessment  Prediction
Rule induct              Insurance fraud        Detection
                         Healthcare except.     Detection
Case based               Insurance claim        Prediction
                         Software quality       Classification
Genetic algor.           Budget spending        Classification

Data Sets

• Loan Applications
  – classification
• Job Applications
  – classification
• Insurance Fraud
  – detection
• Expenditure Data
  – prediction

Loan Data

• 650 observations
• OUTCOMES (binary):
  – On-time (cost of error: $300)
  – Late (default) (cost of error: $2,000)
• Variables
  – Age, Income, Assets, Debts, Want, Credit
    • Credit ordinal
  – Transform: Assets, Debts, & Want → Risk
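
Because the two kinds of error carry different costs, a model for the loan data can be scored by total error cost rather than by accuracy alone. The sketch below assumes that "cost of error" means the cost of misclassifying a case whose actual outcome is the one listed; the labels and the `total_error_cost` helper are hypothetical.

```python
# Costs from the slide, keyed by the actual outcome of the misclassified case
# (this reading of "cost of error" is an assumption)
COST_OF_ERROR = {"on-time": 300, "late": 2000}


def total_error_cost(actual, predicted, cost_of_error):
    """Sum the cost of every misclassified case, priced by its actual outcome."""
    return sum(cost_of_error[a] for a, p in zip(actual, predicted) if a != p)


# Hypothetical outcomes and predictions for six applicants (illustrative only)
actual = ["on-time", "on-time", "late", "on-time", "late", "late"]
predicted = ["on-time", "late", "late", "on-time", "on-time", "late"]

cost = total_error_cost(actual, predicted, COST_OF_ERROR)
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(cost, round(accuracy, 2))
```

Two models with the same accuracy can differ widely in cost, depending on which outcome they tend to get wrong.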

Job Application Data

• 500 observations
• OUTCOMES (ordinal):
  – Unacceptable
  – Minimal
  – Acceptable
  – Excellent
• Variables
  – Age, State, Degree, Major, Experience
    • State nominal; degree & major ordinal
    • State is superfluous
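
A small preprocessing sketch for a record of this kind: the ordinal outcome is mapped to a rank in the order listed on the slide, and the superfluous State field is dropped. The sample applicant, the field values, and the `prepare` helper are hypothetical.

```python
# Rank follows the order of the outcomes on the slide
OUTCOME_RANK = {"Unacceptable": 0, "Minimal": 1, "Acceptable": 2, "Excellent": 3}


def prepare(record):
    """Return a copy of the record without State and with the outcome as a rank."""
    cleaned = {key: value for key, value in record.items() if key != "State"}
    cleaned["Outcome"] = OUTCOME_RANK[record["Outcome"]]
    return cleaned


# Hypothetical applicant (values invented for illustration)
applicant = {
    "Age": 28,
    "State": "CA",
    "Degree": "BS",
    "Major": "Engineering",
    "Experience": 4,
    "Outcome": "Acceptable",
}

prepared = prepare(applicant)
print(prepared)
```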

Insurance Claim Data

• 5,000 observations
• OUTCOMES (binary):
  – OK (cost of error: $500)
  – Fraudulent (cost of error: $2,500)
• Variables
  – Age, Gender, Claim, Tickets, Prior claims, Attorney
    • Gender & attorney nominal; tickets & prior claims categorical
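
With unequal error costs, the cut-off for flagging a claim need not be a 50% fraud probability. The sketch below assumes the $500 is the cost of treating an OK claim as fraudulent and the $2,500 is the cost of treating a fraudulent claim as OK; the function names and example probabilities are hypothetical.

```python
from fractions import Fraction

COST_FALSE_ALARM = 500     # OK claim treated as fraudulent (slide value, assumed reading)
COST_MISSED_FRAUD = 2500   # fraudulent claim treated as OK (slide value, assumed reading)


def break_even_probability(cost_false_alarm, cost_missed_fraud):
    """Fraud probability at which both decisions have the same expected cost."""
    return Fraction(cost_false_alarm, cost_false_alarm + cost_missed_fraud)


def decide(p_fraud):
    """Pick the label with the lower expected cost of error."""
    expected_if_ok = p_fraud * COST_MISSED_FRAUD            # risk of passing a fraud
    expected_if_flagged = (1 - p_fraud) * COST_FALSE_ALARM  # risk of flagging an OK claim
    return "fraudulent" if expected_if_flagged < expected_if_ok else "OK"


threshold = break_even_probability(COST_FALSE_ALARM, COST_MISSED_FRAUD)
print(threshold, decide(0.10), decide(0.25))
```

Under these assumptions a claim is worth flagging once its estimated fraud probability exceeds 1/6.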

Expenditure Data

• 10,000 observations
• OUTCOMES:
  – Could predict response in a number of categories
  – Others
• Variables:
  – Age, Gender, Marital, Dependents, Income, Job years, Town years, Education years, Driver’s license, Own home, Number of credit cards
  – Churn, proportion of income spent on seven categories