Data Mining Industrial Projects and Case Studies
Kwok-Leung Tsui, Industrial and Systems Engineering, Georgia Institute of Technology


Page 1: Data Mining

Data Mining Industrial Projects and Case

Studies

Kwok-Leung Tsui

Industrial and Systems Engineering

Georgia Institute of Technology

Page 2: Data Mining

1. AT&T business data mining
2. Inventory management in military maintenance
3. Sea cargo demand forecasting
4. SMARTRAQ project in transportation policies
5. Letterbox location problem
6. Home improvement store shrinkage analysis
7. Hotels & resorts chain data mining
8. Used car auction sales data mining
9. Fast food restaurant call center

Industrial Projects

Page 3: Data Mining

Data Mining in Telecom. (Funded AT&T project)

- ~$160 billion per year industry (~$70B long distance & ~$90B local)
- 100 million+ customers/accounts/lines
- >1 billion phone calls per day

- Book closing (estimating this month's price/usage/revenue)
- Budgeting (forecasting next year's price/usage/revenue)
- Segmentation (clustering of usage, growth, ...)
- Cross selling (association rules)
- Churn (disconnect prediction & tracking)
- Fraud (detection of unusual usage time-series behavior)
- Each of these problems is worth hundreds of millions of dollars

Page 4: Data Mining

A contractor manages parts inventory for aircraft maintenance

Characterization and forecasting of demand and lead time distributions

60,000 different parts and 500 bench locations

Data tracked by an automated system

Demand data not available & stockout penalty

Inventory Management in Air Force (Funded project)

Page 5: Data Mining

Sea cargo network optimization

Contract planning & booking control

Characterize & forecast sea cargo demand distribution & cost structure

Improve ocean carrier and terminal operation efficiency

Data Mining in Sea Cargo Application (Funded TLIAP project)

Page 6: Data Mining

Strategies for Metropolitan Atlanta’s Regional Transportation & Air Quality

Five-year project sponsored by Transportation Dept., Federal Highway Admin., EPA, CDC, etc.

Assess air quality, travel behavior, land use & transportation policies

Reduce auto-dependence and vehicle emissions

SMARTRAQ Project for Transportation Policies

Page 7: Data Mining

Improve performance of express mail dropoff letter boxes

50,000 letter boxes & 8 month transaction data

Relate performance with important factors, e.g. regions, demographic, adjacent competition, pick-up schedule

Comparison with direct competitors

Customer demand analysis and forecast

Mining of Letter Box Transaction Data

Page 8: Data Mining

Inventory shrinkage costs US retailers $32 billion

Shrinkage = book inventory – inventory on hand

Working with a home improvement store’s Loss Prevention Group

Develop predictive model to relate shrinkage to important variables

Extract hidden knowledge to reduce loss and improve operation efficiency

Data Mining for Shrinkage Analysis in Retail Industry

Page 9: Data Mining

Manage chain hotels and resorts at different scales

Evaluate impact of promotional programs

Forecasting of customer behavior in frequent stay program

Monitor performance in customer survey

Predict performance with important factors

Data Mining for Hotels and Resorts Chain Business

Page 10: Data Mining

Maintain all used car auction data from the last 20 years

Provide service to customers and dealers on auction price projection

Price depreciation by year

Develop methods for mileage, seasonal, and regional adjustments

Data Mining of Used Car Auction Data

Page 11: Data Mining

Centralized call center for drive through customers of over 50 chain restaurants

Contractor manages call center with constraints on time to answer customers

Scheduling and management of human resources

Simulation and optimization algorithms

Data mining and forecasting on aggregate and individual demand

Fast Food Restaurant Call Center

Page 12: Data Mining

1. A Medical Case Study
2. Profile Monitoring in Telecommunication
3. Letterbox Transaction Data Mining
4. A Market Analysis Case Study
5. Air Force Parts Inventory Data Mining

Data Mining Case Studies

Page 13: Data Mining

1. Telecommunication Data Mining
2. Churn Modeling in Wireless Industry
3. Market Basket Analysis
4. Supermarket Mining I
5. Supermarket Mining II
6. Banking and Finance

More DM Case Studies (Berry & Linoff)

Page 14: Data Mining

A Review & Analysis of MTS

(Technometrics, 2003)

W. H. Woodall and R. Koudelik, Virginia Tech

K.-L. Tsui and S. B. Kim, Georgia Tech

Z. G. Stoumbos, Rutgers University

Christos P. Carvounis, MD, State University of New York at Stony Brook

A Medical Case Study using MTS and DM Methods

Page 15: Data Mining

Primary MTS References
- Taguchi, G., and Rajesh, J. (2000), "New Trends in Multivariate Diagnosis," Sankhya: The Indian Journal of Statistics, 62, 233-248.
- Taguchi, G., Chowdhury, S., and Wu, Y. (2001), The Mahalanobis-Taguchi System, New York: McGraw-Hill.
- Taguchi, G., and Rajesh, J. (2002), a new book on MTS.

Page 16: Data Mining

P.C. Mahalanobis

- Very influential in large-scale sample survey methods
- Founder of the Indian Statistical Institute in 1931
- Architect of India's industrial strategy
- Advisor to Nehru and friend of R. A. Fisher

Page 17: Data Mining

- Deming Prize in Japan: 4 times
- Rockwell Medal (1986) citation: combine engineering & statistical methods to achieve rapid improvements in costs and quality by optimizing product design and manufacturing processes.
- 1978-79: Ford / Bell Labs teams "discover" the method
- 1980: first US experiences (Xerox / Bell Labs)
- 1990- : Taguchi Methods or DOE well recognized by all industries for improving product or manufacturing process design

Genichi Taguchi Japanese Quality Engineer

Page 18: Data Mining

MTS is said to be:
- A groundbreaking new philosophy for data mining from multivariate data
- A process of recognizing patterns and forecasting results
- Used by Fuji, Nissan, Sharp, Xerox, Delphi Automotive Systems, Ford, GE and others
- Beyond theory: intended to create an atmosphere of excitement for management, engineering and academia

Page 19: Data Mining

Applications include the following:
- Patient monitoring
- Medical diagnosis
- Weather and earthquake forecasting
- Fire detection
- Manufacturing inspection
- Clinical trials
- Credit scoring

Page 20: Data Mining

MTS Overview
- Similar to a classification method using a discriminant-type function.
- Based on multivariate observations from a "normal" and an "abnormal" group.
- Used to develop a scale measuring how abnormal an item is while matching a pre-specified or estimated scale.
- The MTS scale is used for variable selection, diagnosis, forecasting, and classification.

Page 21: Data Mining

MTS Procedure: Stage 1
- Identify p variables Vi, i = 1, 2, ..., p, that measure the "normality" of an item.
- Collect multivariate data on the normal group, Xj, j = 1, 2, ..., m.
- Standardize each variable to obtain the Zi vectors.
- Calculate the Mahalanobis distances (MD) for the m observations.

Page 22: Data Mining

MD_i = (1/p) Z_i' S^{-1} Z_i,   i = 1, ..., m,

where S is the sample correlation matrix of the Z's for the normal group.

Page 23: Data Mining

Stage 2
- Collect data on t abnormal items, Xi, i = m+1, m+2, ..., m+t.
- Standardize each variable using the normal group means and standard deviations.
- Calculate the MD values MDi, i = m+1, m+2, ..., m+t (a small sketch of Stages 1-2 follows below).
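The following is a minimal Python sketch of Stages 1-2 under the definitions above, assuming NumPy arrays X_normal of shape (m, p) and X_abnormal of shape (t, p); the function and variable names are illustrative, not from the original slides.

import numpy as np

def mts_scale(X_normal, X_abnormal):
    """Standardize with the normal group's means/SDs, then compute the
    scaled Mahalanobis distances MD = Z' S^{-1} Z / p, where S is the
    correlation matrix of the standardized normal-group data."""
    mu = X_normal.mean(axis=0)
    sigma = X_normal.std(axis=0, ddof=1)
    p = X_normal.shape[1]
    Z_normal = (X_normal - mu) / sigma                 # standardized normal group
    S_inv = np.linalg.inv(np.corrcoef(Z_normal, rowvar=False))

    def md(X):
        Z = (X - mu) / sigma                           # use normal-group statistics
        return np.einsum('ij,jk,ik->i', Z, S_inv, Z) / p

    return md(X_normal), md(X_abnormal)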

Page 24: Data Mining

According to the MTS, the scale is good if the MD values for the abnormal items are higher than those for the normal items (good separation).

Page 25: Data Mining

Stage 3
- Identify the useful variables using orthogonal arrays (OAs) and signal-to-noise (S/N) ratios.
- The MTS uses a design-of-experiments approach as an optimization tool to choose the variables that maximize the average S/N ratio.

Page 26: Data Mining

Use of DOE for Variable Selection
- Design an OA experiment using all variables.
- For each row of the OA (a given set of variables):
  - compute MDi for each observation in the abnormal groups;
  - determine the Mi value (the true severity level or working average) for each abnormal group;
  - compute the S/N ratio based on the MDi and Mi.
- Determine significant variables using main effect analysis with the S/N ratio as the response.

Page 27: Data Mining

An Example of an OA (+ = including variable; - = excluding variable)

Run | V1 | V2 | V3 | ... | V17 | S/N Ratio
1   | +  | +  | +  | ... | +   | SN1
2   | -  | +  | +  | ... | +   | SN2
3   | +  | -  | +  | ... | +   | SN3
4   | -  | -  | +  | ... | +   | SN4
5   | +  | +  | -  | ... | +   | SN5
6   | -  | +  | -  | ... | +   | SN6
... | ...| ...| ...| ... | ... | ...
32  | -  | -  | -  | ... | -   | SN32

Page 28: Data Mining

Dynamic S/N ratio (multiple abnormal groups)

First regress Y_i = sqrt(MD_i) on M_i to obtain the slope estimate β̂, then define the S/N ratio:

S/N = 10 log10[ ((SSR - MSE)/r) / MSE ] = 10 log10[ β̂² / MSE ]

Page 29: Data Mining

Larger-is-better S/N Ratio (single abnormal group)

For t abnormal observations, the larger-is-better S/N ratio is

S/N = -10 log10[ (1/t) Σ_{i=1}^{t} (1/MD_i) ]
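A direct transcription of the larger-is-better formula above into Python (NumPy assumed; names are illustrative):

import numpy as np

def larger_is_better_sn(md_abnormal):
    """-10*log10( (1/t) * sum(1/MD_i) ) over the t abnormal observations."""
    md = np.asarray(md_abnormal, dtype=float)
    return -10.0 * np.log10(np.mean(1.0 / md))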

Page 30: Data Mining

Main Effect Analysis
- Compute the level averages of the S/N ratios (+ and -) for each variable: mean(S/N at +)_i - mean(S/N at -)_i.
- Keep only the variables with positive (significant) estimated main effects.
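A small sketch of this main-effect screening, assuming a +1/-1 coded OA matrix (runs by variables) and one S/N value per run; this is illustrative, not the original analysis code.

import numpy as np

def select_variables(oa, sn):
    """For each variable, compute (mean S/N over runs including it) minus
    (mean S/N over runs excluding it); keep positive effects."""
    oa = np.asarray(oa)
    sn = np.asarray(sn, dtype=float)
    effects = np.array([sn[oa[:, j] == 1].mean() - sn[oa[:, j] == -1].mean()
                        for j in range(oa.shape[1])])
    keep = np.where(effects > 0)[0]        # indices of retained variables
    return effects, keep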

Page 31: Data Mining

Stage 4

- Based on the chosen variables, use the MD scale for diagnosis and forecasting.
- A threshold is given such that the losses due to the two types of classification error are balanced in some sense.

Page 32: Data Mining

A Medical Case Study

- Medical diagnosis of liver disease.
- 200 healthy patients and 17 unhealthy patients (10 with a mild level of the disease and 7 with a medium level).
- Age, gender, and 15 blood test variables.

(Data is made available.)

Page 33: Data Mining

Case Study Blood Test Variables with Normal Ranges

Variable | Symbol | Acronym | Normal Ranges | Taguchi et al. (2001) Normal Ranges
Total Protein in Blood | V3 | TP | 6.0 to 8.3 gm/dL | 6.5-7.5 gm/dL
Albumin in Blood | V4 | Alb | 3.4 to 5.4 gm/dL | 3.5-4.5 gm/dL
Cholinesterase (Pseudocholinesterase) | V5 | ChE | Depends on technique; 8 to 18 U/mL | 0.60-1.00 dpH
Glutamate O Transaminase (Aspartate Aminotransferase) | V6 | GOT | 10 to 34 IU/L | 2-25 Units
Glutamate P Transaminase (Alanine Transaminase) | V7 | GPT | 6 to 59 U/L | 0-22 Units
Lactic Dehydrogenase | V8 | LDH | 105 to 333 IU/L | 130-250 Units
Alkaline Phosphatase | V9 | Alp | 0-250 U/L normal; 250-750 U/L moderate elevation | 2.0-10.0 Units
r-Glutamyl Transpeptidase (gamma-Glutamate Transferase) | V10 | r-GPT | 0 to 51 IU/L | 0-68 Units
Leucine Aminopeptidase | V11 | LAP | Serum: male 80 to 200 U/mL, female 75 to 185 U/mL | —
Total Cholesterol | V12 | TCh | <200 desirable; 200-239 borderline high; 240+ high | —
Triglyceride | V13 | TG | 10 to 190 mg/dL | —
Phospholipid | V14 | PL | Platelet: 150,000 to 400,000/mm3 | —
Creatinine | V15 | Cr | 0.8 to 1.4 mg/dL | —
Blood Urea Nitrogen | V16 | BUN | 7 to 20 mg/dL | —
Uric Acid | V17 | UA | 4.1 to 8.8 mg/dL | —

Page 34: Data Mining

Some results and conclusions

- Largest MD in the healthy group: 2.36
- Lowest MD in the unhealthy group: 7.73

Thus, there is a lot of separation between the healthy and unhealthy groups.

The Mi values are estimated from averages of MD values.

Page 35: Data Mining

OA32 (+ = including variable; - = excluding variable)

Run | V1 | V2 | V3 | ... | V17 | S/N Ratio
1   | +  | +  | +  | ... | +   | SN1
2   | -  | +  | +  | ... | +   | SN2
3   | +  | -  | +  | ... | +   | SN3
4   | -  | -  | +  | ... | +   | SN4
5   | +  | +  | -  | ... | +   | SN5
6   | -  | +  | -  | ... | +   | SN6
... | ...| ...| ...| ... | ... | ...
32  | -  | -  | -  | ... | -   | SN32

Page 36: Data Mining

Average S/N ratio:
- All variables: -6.25
- MTS combination: -4.27
- OA optimal combination: -3.34
- Overall optimal combination: -1.76

Thus, the proposed method does not yield the optimum combination. The MTS average S/N ratio was at about the 95th percentile.

Page 37: Data Mining

Subject | Disease Level | All | MTS | OA Optimal | Optimal
1 | Mild | 7.727 | 13.937 | 8.058 | 13.329
2 | Mild | 8.416 | 14.726 | 7.485 | 8.616
3 | Mild | 10.291 | 17.342 | 9.498 | 8.002
4 | Mild | 7.204 | 10.804 | 4.951 | 12.311
5 | Mild | 10.590 | 18.379 | 9.367 | 12.042
6 | Mild | 10.557 | 8.605 | 6.643 | 6.139
7 | Mild | 13.317 | 13.896 | 7.794 | 6.139
8 | Mild | 14.812 | 27.910 | 8.162 | 22.666
9 | Mild | 15.693 | 28.110 | 10.278 | 26.000
10 | Mild | 18.911 | 35.740 | 20.992 | 14.422
11 | Medium | 12.610 | 20.828 | 16.517 | 20.833
12 | Medium | 12.256 | 18.578 | 14.607 | 19.312
13 | Medium | 19.655 | 34.127 | 35.229 | 44.614
14 | Medium | 43.039 | 85.564 | 13.105 | 32.720
15 | Medium | 78.639 | 74.175 | 9.560 | 28.560
16 | Medium | 97.268 | 104.424 | 29.201 | 31.810
17 | Medium | 135.698 | 123.022 | 44.742 | 57.226

MDs for Unhealthy Group for Various Combinations of Variables

Page 38: Data Mining

Plots of MDs for Unhealthy Group for Various Combinations of Variables
[Dotplots of the MD values for the Mild and Medium subgroups under each variable combination: All, MTS, OA Optimal, Optimal.]

Page 39: Data Mining

Case Study Blood Test Variables with Normal Ranges

Variable | Symbol | Acronym | Normal Ranges | Taguchi et al. (2001) Normal Ranges
Total Protein in Blood | V3 | TP | 6.0 to 8.3 gm/dL | 6.5-7.5 gm/dL
Albumin in Blood | V4 | Alb | 3.4 to 5.4 gm/dL | 3.5-4.5 gm/dL
Cholinesterase (Pseudocholinesterase) | V5 | ChE | Depends on technique; 8 to 18 U/mL | 0.60-1.00 dpH
Glutamate O Transaminase (Aspartate Aminotransferase) | V6 | GOT | 10 to 34 IU/L | 2-25 Units
Glutamate P Transaminase (Alanine Transaminase) | V7 | GPT | 6 to 59 U/L | 0-22 Units
Lactic Dehydrogenase | V8 | LDH | 105 to 333 IU/L | 130-250 Units
Alkaline Phosphatase | V9 | Alp | 0-250 U/L normal; 250-750 U/L moderate elevation | 2.0-10.0 Units
r-Glutamyl Transpeptidase (gamma-Glutamate Transferase) | V10 | r-GPT | 0 to 51 IU/L | 0-68 Units
Leucine Aminopeptidase | V11 | LAP | Serum: male 80 to 200 U/mL, female 75 to 185 U/mL | —
Total Cholesterol | V12 | TCh | <200 desirable; 200-239 borderline high; 240+ high | —
Triglyceride | V13 | TG | 10 to 190 mg/dL | —
Phospholipid | V14 | PL | Platelet: 150,000 to 400,000/mm3 | —
Creatinine | V15 | Cr | 0.8 to 1.4 mg/dL | —
Blood Urea Nitrogen | V16 | BUN | 7 to 20 mg/dL | —
Uric Acid | V17 | UA | 4.1 to 8.8 mg/dL | —

Page 40: Data Mining

Variables for Unhealthy Patients Well Outside Normal Ranges

Subject Number | Variable Number(s)
1 | 12, 13
2 | None
3 | None
4 | 13
5 | 10
6 | 7
7 | 7
8 | 13
9 | 12, 13
10 | 4, 12
11 | 10, 12
12 | 10
13 | 10
14 | 10, 13
15 | 6, 7, 13
16 | 3, 6, 7, 10, 12
17 | 6, 7, 8, 10, 13

Page 41: Data Mining

Medical Analysis

- V4, V6, V7, V9, and V10 are crucial for liver disease diagnosis and classification.
- Medical diagnosis shows that patients 15-17 exhibit some chronic liver disorder.
- Cluster analysis on V4, V6, V7, V9, and V10 yields only two groups; only patients 15-17 are classified as "abnormal". This result is consistent with the medical diagnosis.

Page 42: Data Mining

Dotplot for V4 Alb
[Dotplot of V4 (Alb) by group: Normal, Mild, Medium; patients 15-17 marked.]

Page 43: Data Mining

Dotplot for V6 GOT
[Dotplot of V6 (GOT) by group: Normal, Mild, Medium; patients 15-17 marked.]

Page 44: Data Mining

Dotplot for V7 GPT
[Dotplot of V7 (GPT) by group: Normal, Mild, Medium; patients 15-17 marked.]

Page 45: Data Mining

Dotplot for V9 Alp
[Dotplot of V9 (Alp) by group: Normal, Mild, Medium; patients 15-17 marked.]

Page 46: Data Mining

Dotplot for V10 r-GPT
[Dotplot of V10 (r-GPT) by group: Normal, Mild, Medium; patients 15-17 marked.]

Page 47: Data Mining

Tree Classification Methods

Page 48: Data Mining

Classification Trees

• The CART (Classification And Regression Tree) methodology is known as binary recursive partitioning. For more detailed information on CART, see Breiman, Friedman, Olshen, & Stone (1984), Classification and Regression Trees.

• C4.5 is a decision tree learning system introduced by Quinlan (Quinlan, J. Ross (1993), C4.5: Programs for Machine Learning). The software is available at: http://www2.cs.uregina.ca/~hamilton/courses/831/notes/ml/dtrees/c4.5/tutorial.html
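To make the idea of binary recursive partitioning concrete, here is a generic scikit-learn sketch; it is not the case study's actual Splus or C4.5 run, and X, y, and the parameter choices are placeholders (e.g., 217 rows of blood-test variables V1-V17 with class labels 1 = healthy, 2 = mild, 3 = medium).

from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import confusion_matrix

def fit_tree(X, y):
    # Recursive binary splits on the variables, entropy as the split criterion.
    clf = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                                 min_samples_leaf=2, random_state=0)
    clf.fit(X, y)
    print(export_text(clf, feature_names=[f"V{i}" for i in range(1, 18)]))
    print(confusion_matrix(y, clf.predict(X)))   # learning-sample classification matrix
    return clf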

Page 49: Data Mining

Tree from Splus
[Tree diagram: splits on V5 < 381.5, then V10 < 63 and V6 < 37.5; leaf labels give class (count): 1(196), 1(4), 2(2), 2(8), 3(1), 3(6).]

Page 50: Data Mining

Tree from Splus
- Variables actually used in tree construction: V5, V10, and V6.
- Number of terminal nodes: 4
- Misclassification error rate: 0.01382 = 3 / 217

Classification matrix based on the learning sample (rows = actual class, columns = predicted class):

Actual \ Predicted | 1 | 2 | 3
1 | 200 | 0 | 0
2 | 0 | 8 | 2
3 | 1 | 0 | 6

Page 51: Data Mining

Tree from C4.5
[Tree diagram: splits on V5 <= 364, then V10 <= 63 and V6 <= 26; leaf labels give class (count): 1(200), 3(1), 2(8), 3(6), 2(2).]

Page 52: Data Mining

Tree from C4.5
- Variables actually used in tree construction: V5, V10, and V6.
- Number of terminal nodes: 4
- Misclassification error rate: 0.0046 = 1 / 217

Classification matrix based on the learning sample (rows = actual class, columns = predicted class):

Actual \ Predicted | 1 | 2 | 3
1 | 200 | 0 | 0
2 | 0 | 10 | 0
3 | 1 | 0 | 6

Page 53: Data Mining

Scatter Plot of V5 vs. V10 vs. V6
[Three-dimensional scatter plot of V5 (ChE), V10 (r-GPT), and V6 (GOT) by group (Normal, Mild, Medium); patients 15-17 marked.]

Page 54: Data Mining

Scatter Plot of V5 vs. V6
[Scatter plot of V5 (ChE) vs. V6 (GOT) by group; patients 15-17 marked.]

Page 55: Data Mining

Scatter Plot of V5 vs. V10
[Scatter plot of V5 (ChE) vs. V10 (r-GPT) by group; patients 15-17 marked.]

Page 56: Data Mining

Scatter Plot of V10 vs. V6
[Scatter plot of V10 (r-GPT) vs. V6 (GOT) by group; patients 15-17 marked.]

Page 57: Data Mining

Dotplot for V5 ChE
[Dotplot of V5 (ChE) by group: Normal, Mild, Medium; patients 15-17 marked.]

Page 58: Data Mining

Dotplot for V6 GOT
[Dotplot of V6 (GOT) by group: Normal, Mild, Medium; patients 15-17 marked.]

Page 59: Data Mining

Dotplot for V10 r-GPT
[Dotplot of V10 (r-GPT) by group: Normal, Mild, Medium; patients 15-17 marked.]

Page 60: Data Mining

Comparison with Taguchi Approaches

All variables: V1 – V17

MTS: V4, V5, V10, V12, V13, V14, V15, V17

OA Optimal: V1, V4, V5, V10, V11, V14, V15, V16, V17

Optimal: V3, V5, V10, V11, V12, V13, V17

Classification Trees : V5, V6, V10

Page 61: Data Mining

Disease Level | All | MTS | OA Optimal | Optimal | Trees
Mild | 7.727 | 13.937 | 8.058 | 13.329 | 7.366
Mild | 8.416 | 14.726 | 7.485 | 8.616 | 18.789
Mild | 10.291 | 17.342 | 9.498 | 8.002 | 9.068
Mild | 7.204 | 10.804 | 4.951 | 12.311 | 6.517
Mild | 10.590 | 18.379 | 9.367 | 12.042 | 29.864
Mild | 10.557 | 8.605 | 6.643 | 6.139 | 10.869
Mild | 13.317 | 13.896 | 7.794 | 6.139 | 10.869
Mild | 14.812 | 27.910 | 8.162 | 22.666 | 8.222
Mild | 15.693 | 28.110 | 10.278 | 26.000 | 9.155
Mild | 18.911 | 35.740 | 20.992 | 14.422 | 16.420
Medium | 12.610 | 20.828 | 16.517 | 20.833 | 42.681
Medium | 12.256 | 18.578 | 14.607 | 19.312 | 38.523
Medium | 19.655 | 34.127 | 35.229 | 44.614 | 86.796
Medium | 43.039 | 85.564 | 13.105 | 32.720 | 28.252
Medium | 78.639 | 74.175 | 9.560 | 28.560 | 208.102
Medium | 97.268 | 104.424 | 29.201 | 31.810 | 228.428
Medium | 135.698 | 123.022 | 44.742 | 57.226 | 199.304

MDs for Unhealthy Group for Various Combinations of Variables

Page 62: Data Mining

[Dotplots of the MD values (scale 0 to 250) for the Mild and Medium subgroups under each variable combination: All, MTS, OA Optimal, Optimal, Trees.]

Page 63: Data Mining

Conclusion

The MD values and dotplots show that only the MD scale based on the variables used by the classification trees, i.e., V5, V6, and V10, does a good job of discriminating between patients with mild-level disease and patients with medium-level disease. (Maybe MD is a good measure for multivariate data.)

Page 64: Data Mining

Comparison with Medical Analysis

- V4, V6, V7, V9, and V10 are crucial for liver disease diagnosis and classification.
- Medical diagnosis shows that patients 15-17 exhibit some chronic liver disorder.
- Cluster analysis on V4, V6, V7, V9, and V10 yields only two groups; only patients 15-17 are classified as "abnormal". This result is consistent with the medical diagnosis.

Page 65: Data Mining

Correlations

Correlations between the variables crucial for medical diagnosis (rows) and the variables in the classification trees (columns):

     | V5     | V6     | V10
V4   | 0.501  | -0.505 | -0.184
V6   | -0.370 | 1      | 0.507
V7   | -0.365 | 0.905  | 0.485
V9   | -0.305 | 0.197  | 0.269
V10  | -0.189 | 0.507  | 1

Page 66: Data Mining

Dotplot for V4 Alb
[Dotplot of V4 (Alb) by group: Normal, Mild, Medium; patients 15-17 marked.]

Page 67: Data Mining

Dotplot for V7 GPT
[Dotplot of V7 (GPT) by group: Normal, Mild, Medium; patients 15-17 marked.]

Page 68: Data Mining

Dotplot for V9 Alp
[Dotplot of V9 (Alp) by group: Normal, Mild, Medium; patients 15-17 marked.]

Page 69: Data Mining

- OA and main effect analysis do not give the overall optimum.
- The MTS discriminant function (S/N ratios) does not separate the two unhealthy groups.
- The variables selected by MTS are not appropriate for detecting liver disease based on medical diagnosis.
- Tree methods separate the two unhealthy groups.
- MD may be a good distance measure for multivariate data.
- Results are based on the current data and training error.

Case Study Summary

Page 70: Data Mining

Discussions
- The MTS ignores considerable previous work in application areas such as medical diagnosis and classification methods.
- The MTS ignores sampling variation and discounts variation between units.
- The use of OAs cannot be justified.
- The MTS is not a well-defined approach.
- Traditional statistical approaches may work better in many cases.
- Despite its flaws, we expect the MTS to be used in many companies.

Page 71: Data Mining

[Scatter plot of V6 (GOT) vs. V7 (GPT) by group (Normal, Mild, Medium); patients 15-17 marked.]

Correlation (V6, V7) = 0.905

Page 72: Data Mining

[Scatter plot of V12 (TCh) vs. V14 (PL) by group; patients 15-17 marked.]

Correlation (V12, V14) = 0.807

Page 73: Data Mining

[Scatter plot of V10 (r-GPT) vs. V11 (LAP) by group; patients 15-17 marked.]

Correlation (V10, V11) = 0.646

Page 74: Data Mining

[Scatter plot of V13 (TG) vs. V14 (PL) by group; patients 15-17 marked.]

Correlation (V13, V14) = 0.616

Page 75: Data Mining

[Scatter plot of V3 (TP) vs. V4 (Alb) by group; patients 15-17 marked.]

Correlation (V3, V4) = 0.604

Page 76: Data Mining

A SPC Approach for Business Activity Monitoring

(IIE Transactions, 2006)

W. Jiang, Stevens Institute of Technology
T. Au, AT&T

K.-L. Tsui, Georgia Institute of Technology

A Telecommunication Case Study

Page 77: Data Mining

A General Framework for Modeling & Monitoring of Dynamic Systems

Page 78: Data Mining

Dynamic Monitoring (A General Framework)

[Framework diagram: Problem → Segmentation & Model Selection → Monitoring → Dynamic Update → Actions.]

Problem / Profile
– Time domain profile
– Profile with controllable predictors
– Profile with uncontrollable predictors

Objective
– Detection/Classification
– Interpretation
– Forecasting/Prediction

Segmentation
– Known
– Unknown

Model Selection
– Global without segmentation
– Global with segmentation
– Local within segment

Monitoring
– Phase I: estimating unknown parameters
– Phase II: monitoring and detecting
– Anticipated drifts vs. unanticipated changes

Page 79: Data Mining

Applications

Manufacturing Processes
- Stamping Tonnage Signal Data (functional data)
- Nortel's Antenna Signal Data (functional data)
- Mass Flow Controller (MFC) Calibration (linear profile)
- Vertical Density Profile (VDP) Data (nonlinear profile)

Service Operations
- Used Car Price Mining and Prediction
- Telecom. Customer Usage
- Hotel Performance Monitoring
- Fast food drive-through call center forecasting & scheduling

Page 80: Data Mining

Manufacturing: Stamping Tonnage Signal Data

Figure 2: A Tonnage Signal and Some Possible Faults (Jin and Shi 1999)

Page 81: Data Mining

Stamping Tonnage Signal Data

Problem
- Time domain profile (a tonnage signal represents the stamping force over a process cycle).

Objective
- Fault detection and classification.

Segmentation & Model Selection
- Known segmentation: most process faults occur only in specific working stages; boundaries and sizes of segments are determined by process knowledge (Jin and Shi 1999).
- Global model: wavelet transforms.

Monitoring
- For each segment, use T2 charts based on selected wavelet coefficients to conduct monitoring (Jin and Shi 2001).

Dynamic Update
- Classify a new signal as normal, a known fault, or a new fault, and update the wavelet coefficient selection and parameter estimates (e.g., μ, Σ) using all available data.

Actions
- Identify and remove assignable causes.

Page 82: Data Mining

Service: Telecom. Customer Usage

Problem
- Profile with uncontrollable predictors.

Objective
- Abnormal behavior detection and classification.
- Forecasting/prediction.

Segmentation & Model Selection
- Unknown segmentation: segment customers based on demographic, geographic, psychographic and/or behavioral information.
- Segmental: fit a model for each customer segment, e.g., linear regression.

Monitoring
- Use the model built for each segment to monitor customer behavior, e.g., monitor the linear regression parameter vector β using a T2 chart.

Dynamic Update
- Update customer segmentation, segmental model fitting and/or parameter monitoring, e.g., parameter updates based on a known trend.

Actions
- Service improvement, customer approval, etc.

Page 83: Data Mining

Telecom. Customer Usage

Profile: profile with uncontrollable predictors

Objective
– Abnormal behavior detection and classification
– Forecasting/prediction

Segmentation
– Unknown (segments are defined by customer information)

Model Selection
– Segmental (e.g., linear regression on uncontrollable predictors for each segment)

Monitoring
– Phase I: unknown control chart parameters estimated from data
– Phase II: monitoring by control charts, such as the T2 chart, EWMA chart, etc. (a small T2 sketch follows)

Dynamic Update
– Update segmentation, model selection and/or parameter monitoring

Actions: service improvement, customer approval, etc.
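As a minimal illustration of monitoring a fitted coefficient vector β with a Hotelling T2 statistic (the Phase I mean/covariance estimates and any control limit are assumptions for this sketch, not the paper's exact procedure):

import numpy as np

def t2_statistics(betas):
    """Hotelling T^2 for fitted coefficient vectors (one row per
    customer/period), using Phase I estimates of mean and covariance."""
    B = np.asarray(betas, dtype=float)
    mean = B.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(B, rowvar=False))
    diff = B - mean
    return np.einsum('ij,jk,ik->i', diff, S_inv, diff)   # T^2 per observation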

Page 84: Data Mining

A SPC Approach for Business Activity Monitoring

Jiang, Au, and Tsui (2006), to appear in IIE Transactions

Page 85: Data Mining

Churn Detection via Customer Profiling

Qian, Jiang, and Tsui (2006), to appear in International Journal of Production Research

Page 86: Data Mining

Activity monitoring: monitoring for interesting events that require actions (Fawcett and Provost, 1999).

Examples:
- Credit card or insurance fraud detection
- Churn modeling and detection
- Computer intrusion detection
- Network performance monitoring

Objective: trigger alarms for action accurately and as quickly as possible once the activity occurs.

Activity Monitoring

Page 87: Data Mining

Profiling Approach (SPC & hypothesis testing):
- Characterize populations of key variables that describe normal activity.
- Trigger an alarm on activity that deviates from normal.

Discriminating Approach (classification):
- Establish models & patterns of abnormal activity with respect to normal activity.
- Apply pattern recognition to identify abnormal activity.

Other Approaches:
- Hypothesis testing vs. classification
- Neural networks for SPC problems (Hwarng et al.)
- Applying other classification methods to SPC
- DOE for variable selection in discrimination
- Detecting complex patterns in SPC

Activity Monitoring

Page 88: Data Mining

- The objective of activity monitoring is similar to that of statistical process control (SPC).
- Multivariate control chart methods for continuous and attribute data may be needed.
- More sophisticated tools are needed.

Activity Monitoring

Page 89: Data Mining

STATISTICAL PROCESS CONTROL

Widely used in the manufacturing industry for variation reduction by discriminating between:
- Common causes
- Assignable causes

Evaluation: in-control vs. out-of-control

Performance:
- False alarm rate
- Average run length (ARL)

Techniques:
- Shewhart chart, EWMA chart, CUSUM chart (sketched below)
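For concreteness, here is an illustrative computation of the three charting statistics on a univariate series; the in-control mean mu0, sigma, the EWMA weight lam, and the CUSUM reference value k are assumed chosen (e.g., from Phase 1), not taken from the case study.

import numpy as np

def spc_statistics(x, mu0=0.0, sigma=1.0, lam=0.2, k=0.5):
    """Shewhart out-of-control flags, EWMA statistic, and one-sided upper CUSUM."""
    x = np.asarray(x, dtype=float)
    shewhart_ooc = np.abs(x - mu0) > 3 * sigma          # Shewhart 3-sigma rule

    ewma = np.zeros_like(x)                             # z_t = lam*x_t + (1-lam)*z_{t-1}
    z = mu0
    for t, xt in enumerate(x):
        z = lam * xt + (1 - lam) * z
        ewma[t] = z

    cusum_hi = np.zeros_like(x)                         # C_t = max(0, C_{t-1} + (x_t - mu0) - k*sigma)
    c = 0.0
    for t, xt in enumerate(x):
        c = max(0.0, c + (xt - mu0) - k * sigma)
        cusum_hi[t] = c
    return shewhart_ooc, ewma, cusum_hi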

Page 90: Data Mining

STATISTICAL PROCESS CONTROL

Two stages of implementation:

Phase 1 (retrospective): off-line modeling
- Identify and clear outliers
- Estimate in-control models

Phase 2 (prospective): on-line deployment
- Trigger out-of-control conditions
- Isolate and remove causes of signals

Page 91: Data Mining

AN EXAMPLE

[Example control charts for a simulated series of 100 observations plotted against time: a Shewhart chart, an EWMA chart, and a CUSUM chart.]

Page 92: Data Mining

KEY CHALLENGES TO SPC

Off-line modeling
- Robust models with outliers and change points
- Automatic model building

- Scalability: a single algorithm tracking millions of data streams
- Importance of early signals
- Interpretation is mostly qualitative; sacrificing accuracy for speed is acceptable
- Diagnosis and updating: business rules
- Online fashion: incomplete data, censored and/or truncated

Page 93: Data Mining

SPC Approach for CRM Monitoring

Phase 1: Automatic Modeling & Profiling → Phase 2: Profile Monitoring & Updating → Phase 3: Event Diagnosis

Page 94: Data Mining

CRM MONITORING PROCESS

[Process flow diagram with stages: Business Event Definition, Customer Profiling, Profile Updating, Event Monitoring and Triggering, Small Set of Interesting Customers, Customer Diagnosis.]

Page 95: Data Mining

SPC FOR CRM - PHASE 1

Off-line modeling: building customer profiles robustly (time consuming)

Requirements
- A single, time-variant model capturing most customers' behavior
- Automatic modeling, less human intervention

Techniques
- Robust and efficient estimation methods
- Change-point modeling

Parameter Selection
- MSE/AIC/BIC
- Business requirements / domain knowledge

Page 96: Data Mining

SPC FOR CRM - PHASE 2

On-line customer profile updating and monitoring, in search of interesting events requiring action

Requirements
- Recursive vs. time window
- Signal accurately and as quickly as possible

Techniques
- Markovian-type updating (storage space & time)
- State-space control models

Page 97: Data Mining

SPC FOR CRM - PHASE 3

Diagnosis and re-profiling

Requirements
- Following signals
- Robustness to outliers, trends, ...
- Attribute identification

Techniques
- Bayesian models
- Nonlinear filtering methods

Page 98: Data Mining
Page 99: Data Mining

PHASE 1: CUSTOMER PROFILE

Dynamic Linear Model (West and Harrison, 1997)

The profile for customer i at time t characterizes the observed usage X_t(i) by

P_t(i) = [M_t(i), T_t(i), V_t(i)]',

where M_t(i) is the size/level, T_t(i) the trend, V_t(i) the variability/variance, and optionally S_t(i) a seasonality component.

Page 100: Data Mining

Estimation Methods

Least Square Estimation (LSE)

Least Absolute Deviation (LAD)

Dummy Change Point Model with LSE

Dummy Change Point Model with LAD

Page 101: Data Mining

LSE and LAD

Page 102: Data Mining

A DUMMY CHANGE-POINT MODEL

Page 103: Data Mining

A DUMMY CHANGE-POINT MODEL

- Solve global models assuming dummy change points: for each candidate window length p, fit
  a(p) = arg min_{a0, a1} Σ_{k=0}^{p-1} [X_{t-k} - (a0 + a1 k)]².
- a(p) can be obtained recursively by reversing the DES (double exponential smoothing) method with λ = 1.
- Combine the forecasts with exponential weights w_p: â = Σ_p w_p a(p).
- The local variance can be estimated via bootstrap resampling.

Page 104: Data Mining

A DUMMY CHANGE-POINT MODEL

Page 105: Data Mining

PHASE 2: CUSTOMER PROFILE UPDATING AND MONITORING

History data cleaning and profilingForecasting

Online monitoring

Markovian updating

)()()(ˆ1 iTiMiM ttt +=+

2111

11

111

))()(()()1()(

))()(()()1()())(ˆ)(()()1()(

iMiXiViV

iMiMiTiTiMiXiMiM

ttVtVt

ttTtTt

ttMtMt

+++

++

+++

−+−=

−+−=−+−=

λλ

λλλλ

ttt VKMX >− ++ |ˆ| 11
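A minimal sketch of one recursive update consistent with the forecast and trigger rule on this slide; the Holt-style level update and the smoothing constants are assumptions for illustration, not necessarily the exact recursion in Jiang, Au, and Tsui (2006).

def update_profile(x_new, M, T, V, lam_M=0.2, lam_T=0.1, lam_V=0.1, K=4.0):
    """One Markovian update of a customer profile (level M, trend T,
    variability V) given a new observation x_new."""
    M_hat = M + T                                   # one-step forecast of the level
    signal = abs(x_new - M_hat) > K * V             # trigger rule from the slide
    M_new = lam_M * x_new + (1 - lam_M) * M_hat     # level update (Holt-style, assumed)
    T_new = lam_T * (M_new - M) + (1 - lam_T) * T   # trend update
    V_new = lam_V * (x_new - M_new) ** 2 + (1 - lam_V) * V  # smoothed squared deviation
    return signal, M_new, T_new, V_new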

Page 106: Data Mining

Comparisons

Objectives:
- Robust at Phase 1.
- Sensitive at Phase 2.

Four methods:
1. LSE
2. LAD
3. Dummy change-point model with LSE
4. Dummy change-point model with LAD

Page 107: Data Mining
Page 108: Data Mining
Page 109: Data Mining
Page 110: Data Mining
Page 111: Data Mining

Case Study

Data Mining in Telecommunications Industry

(Source: AT & T, Mastering Data Mining by Berry & Linoff.)

Page 112: Data Mining

Outline

- Background
- Dataflows
- Business problems
- Data
- A voyage of discovery
- Summary

Page 113: Data Mining

Telecommunication Industry

- ~$160 billion per year industry (~$70B long distance & ~$90B local)
- 100 million+ customers/accounts/lines
- >1 billion phone calls per day

- Book closing (estimating this month's price/usage/revenue)
- Budgeting (forecasting next year's price/usage/revenue)
- Segmentation (clustering of usage, growth, ...)
- Cross selling (association rules)
- Churn (disconnect prediction & tracking)
- Fraud (detection of unusual usage time-series behavior)
- Each of these problems is worth hundreds of millions of dollars

Page 114: Data Mining

Information Sources

[Diagram of information sources around the customer: the Ordering System (add a phone; competitive win/loss/new/no-further-use data), the Network (make a call; call details/web access), the Billing System (revenue, price, ...), and external sources such as the FCC's official competitive high-level reports, Census, and Dun & Bradstreet. Update frequencies range from real time and daily to delayed monthly, quarterly, and annual feeds. Altogether, terabytes of interesting information.]

Page 115: Data Mining

Customer Focus

Telecommunication companies want to meet all the needs of their customers:
- Local, long distance, and international voice telephone services
- Wireless voice communications
- Data communications
- Gateways to the Internet
- Data networks between corporations
- Entertainment services, cable and satellite television

Instead of miles of cable and numbers of switches, customers are becoming the biggest asset of a telephone company.

Page 116: Data Mining

Dataflows
- Customer behavior is in the data: over a billion phone calls every day.
- A dataflow is a way of visually representing transformations on data.
- A dataflow graph consists of nodes and edges: data flows along the edges and gets processed at the nodes.
- A basic dataflow reads a compressed input file (in.z), uncompresses it, and writes an uncompressed output file (out.text) (sketched below).
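A minimal Python analogue of that read-uncompress-write dataflow; gzip is assumed purely for illustration (the slide's in.z file may use a different compression format), and the file names are the slide's placeholders.

import gzip
import shutil

# Stream the compressed input to an uncompressed output file:
# the data "flows" from the source node through the uncompress step to the sink.
with gzip.open("in.z", "rb") as src, open("out.text", "wb") as dst:
    shutil.copyfileobj(src, dst)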

Page 117: Data Mining

Why are Dataflows efficient?

- Dataflows dispense with most of the overhead that traditional databases have, such as transaction logging, indexes, and pages.
- Dataflows can be run in parallel, taking advantage of multiple processors and disks.
- Dataflows provide a much richer set of transformations than traditional SQL.

Page 118: Data Mining

Basic Operations in Dataflow

Basic operations:
- Compress and uncompress
- Reformat
- Select
- Sort
- Aggregate and hash aggregate
- Merge/Join

These are very important steps: the data is very, very large and needs supercomputer power.

Page 119: Data Mining

Business Problems
- The telecommunication business has shifted from an infrastructure business to a customer business.
- Understanding customer behavior becomes critical (market segmentation).
- Revenue forecasting, churn prediction, fraud detection, new business customer identification.
- The detailed transaction data contains a wealth of information, but it is unexploited due to its huge volume.

Page 120: Data Mining

Important Marketing Questions

Discussions with business users highlight the areas for analysis:

Understanding the behavior of individual customers

Regional differences in calling patterns

High-margin services

Supporting marketing and new sales initiatives

Page 121: Data Mining

Data

Call detail data

Customer data

Auxiliary files

Page 122: Data Mining

Call Detail Data

Definition: a call detail record is a single record for every call made over the telephone network.

Three sources of call detail data:
- Direct network/switch recordings — switch records: the least clean, but the most informative.
- Inputs into the billing system — billing records: cleaner, but not complete.
- Data warehouse feeds — rather clean, but limited by the needs of the data warehouse.

Page 123: Data Mining

Network Call Details
- Hundreds of millions of calls a day
- >100 bytes per call record (>10 gigabytes per day): originating number, terminating number, day/time of the call, length of the call, type of call, ...
- 2 years of data online??? → statistical compression
- >70 billion records (>7 terabytes), currently on tapes, batch processing
- Real time, low-level details (+++); raw data, massive data processing (---)
- Key applications: book closing, fraud detection, early warning, ...

Page 124: Data Mining

Billing Details
- Millions of customers/accounts
- Tons of other information about the customers/accounts: 100+ services (regular long distance, Digital 1 rate, easylink, Readyline, VTNS, ...), 5 jurisdictions (international, interstate, ...), 50 states, NPA-NXX
- 24-36 months of message, minute, and revenue data; length of call, average revenue per minute
- ~? billion observations
- $, detailed (+++); dirty, delayed (----)
- Key applications: budgeting/forecasting, segmentation/clustering

Page 125: Data Mining

Call Detail Data: Record Format

Important fields in a call detail record include:
- from_number
- to_number
- duration_of_call
- start_time
- band
- service_field

Page 126: Data Mining

Customer Data

Customers can have multiple telephone lines. Customer data is needed to match telephone numbers to information about customers.

Telecommunication companies have made significant investments in building and populating data models for their customers.

Page 127: Data Mining

Customer Ordering Data
- Hundreds of thousands of add/disconnect orders weekly (add a line or disconnect a line, ...)
- Tons of other information about the customers/accounts: 4+ order types (Add, Win, Loss, No Further Use), 100+ services, related carrier
- Requires minute/revenue estimation/prediction: summarizing the historical usage of a loss/NFU into one number; predicting the future usage of a win/new (growth curve)
- 5 years online, a few hundred million records
- Timely, small volume (+++); missing information, massive data integration (---)
- Major applications: customer churn, early warning, predicting disconnects

Page 128: Data Mining

Auxiliary Files
- ISP access numbers: a list of access numbers of Internet Service Providers
- Fax numbers: a list of known fax machines
- Wireless exchanges: a list of exchanges that correspond to mobile carriers
- Exchange geography: a list of geographic areas represented by the phone number exchange
- International: a list of country codes and the names of the corresponding countries

Page 129: Data Mining

Discovery
- Call duration
- Calls by time of day
- Calls by market segment
- International calling patterns
- When are customers at home
- Internet service providers
- Private networks
- Concurrent calls
- Broadband customers

Page 130: Data Mining

Call Duration

Page 131: Data Mining

Call Duration

Page 132: Data Mining

Calls by Time of Day

In the call detail data, the field band is a number representing how the call should be charged. This provides a breakdown:
- local
- regional
- national
- international
- fixed-to-mobile
- other
- unknown

Question: when are different types of calls being made?

Page 133: Data Mining

Calls by Time of Day

Page 134: Data Mining

Calls by Time of Day

Page 135: Data Mining

Calls by Time of Day

Page 136: Data Mining

Calls by Market Segment

The market segment is a broad categorization of customers:
- Residential
- Small business
- Medium business
- Large business
- Global
- Named accounts
- Government

Questions: Are customers within market segments similar to each other? What are the calling patterns between market segments?

Page 137: Data Mining

Calls by Market Segment

[Solution approach diagram: the call detail records (from_number, to_number) are joined with customer data to attach from_market_segment and to_market_segment, from which the results are produced.]

Page 138: Data Mining

Calls by Market Segment

Page 139: Data Mining

Calls by Market Segment

Page 140: Data Mining

Calls by Market Segment

Page 141: Data Mining

International Calling Patterns

International calls are highly profitable, but highly competitive.

Questions:
- Where are calls going to?
- How do calling patterns change over time?
- How do calling patterns change during the day?
- What are the differences between business and consumer usage?
- Which customers primarily call one country?
- Which customers call a wider variety of international numbers?

Page 142: Data Mining

International Calling Patterns

Page 143: Data Mining

When are Customers at Home?

Page 144: Data Mining

Internet Providers

Questions:
- Which customers own modems?
- Which Internet service providers (ISPs) are customers using?
- Do different segments of customers use different ISPs?

Page 145: Data Mining

Internet Providers

Page 146: Data Mining

Private Networks

Special customers:
- Businesses that operate from multiple sites likely make large volumes of phone calls and data transfers between the sites.
- Some businesses must exchange large volumes of data with other businesses.

A virtual private network (VPN) is a telephone product designed for this situation. For large volumes of phone calls, it provides less expensive service than pay-by-call service.

Question: Which customers are good candidates for VPN?

Result: A list of businesses that have multiple offices and make phone calls between them.

Page 147: Data Mining

Concurrent Calls

For businesses having a limited number of outbound lines connected to a large number of extensions, the following questions are of interest:
- When does a customer need an additional outside line?
- When is the right time to offer upgrades to their phone systems?

One measure of a customer's need for new lines is the maximum number of lines that are used concurrently.

Page 148: Data Mining

Concurrent Calls

Page 149: Data Mining

Identify Broad Band Customers

- Objective: identify customers who use their telephone lines for data/computer access (potential broadband customers).
- Collect a sample of 4,000 lines for which voice or data/computer access information is available.
- Divide it into two halves for training and testing.
- Define hundreds of call behavior variables.
- Run neural network, logistic regression, and tree models, as sketched below.
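The following scikit-learn sketch illustrates the train/test comparison described above; the feature matrix X, labels y, and parameter choices are placeholders and this is not the study's actual modeling code.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

# X: call-behavior variables for the 4,000 sampled lines;
# y: 1 if the line is used for data/computer access, else 0.
def compare_models(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
    models = (LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(max_depth=4, random_state=0),
              MLPClassifier(max_iter=1000, random_state=0))
    for model in models:
        model.fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        print(type(model).__name__, round(auc, 3))     # compare on the held-out half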

Page 150: Data Mining

Identify Broad Band Customers

Key predictive drivers:
- Length of call (10+ min.)
- Number of repeat phone calls to the same number (5+)
- Calls by time of day (at night)
- Calls by day of the week (weekend)

The neural network performed the best; the tree is the most intuitive.

Page 151: Data Mining

Summary

Call detail records contain rich information about customers:
- Customer behavior varies from one region of a country to another.
- Thousands of companies place calls to ISPs; they own modems and have the ability to respond to web-based marketing.
- Residential customers indicate when they are home by using the phone. These patterns can be important, both for customer contact and for customer segmentation.
- The market share of ISPs differs by market segment.
- International calls show regional variations; the length of calls varies considerably depending on the destination.
- International calls made during the evening and early morning are longer than international calls made during the day.
- Companies making calls between their different sites are candidates for private networking.

Page 152: Data Mining

Case Study: Churn Modeling in Wireless Communications

This case study took place at the largest mobile telephone company in a newly developed country. The primary data source is the prototype of an ongoing data warehousing effort. (Source: "Mastering Data Mining" by Berry & Linoff)

Page 153: Data Mining

Outline
- The Wireless Telephone Industry
- Three Goals
- Approach to Building the Churn Model
- Churn Model Building
- The Data
- Lessons about Churn Model Building
- Summary

Page 154: Data Mining

The Wireless Telephone Industry

Rapid maturing of the wireless market makes the number of churners and the effect of churn on the customer base grow significantly. The business shifts away from signing on nonusers and focuses on existing customers. (See Figure 11.2 and Figure 11.3.)

The wireless telephone industry differs from other industries in several ways:
- Sole service providers
- Relatively high cost of acquisition
- No direct customer contact
- Little customer mindshare
- The handset

Page 155: Data Mining

Three Goals

- Near-term goal: identify a list of probable churners for a marketing intervention. Discussions with the marketing group defined the near-term goal: by the 24th of the month, provide the marketing department with a list of the 10,000 club members most likely to churn.
- Medium-term goal: build a churn management application (CMA). Besides running churn models, the CMA also needed to:
  - Manage models
  - Provide an environment for data analysis before and after modeling
  - Import data and transform it into the input for churn models
  - Export the churn scores developed by the models
- Long-term goal: complete customer relationship management.

Page 156: Data Mining

Approach to Building the Churn Model

- Define churn. Involuntary churn refers to cancellation of a customer's service due to nonpayment; voluntary churn is everything that is not involuntary churn. The model is for the latter.
- Inventory available data. A basic set of data includes data from the customer information file, the service account file, and the billing system.
- Build models.
- Deploy scores. Churn scores can be used for marketing intervention campaigns, prioritizing customers for different campaigns, and estimating customer longevity when computing estimated lifetime customer value.
- Measure the scores against what really happens:
  - How close are the estimated churn probabilities to the actual churn probabilities?
  - Are the churn scores "relatively" true, i.e., do higher scores imply higher probabilities?

Page 157: Data Mining

Churn Model Building

A churn modeling effort necessitates a number of decisions:
- The choice of data mining tool: SAS Enterprise Miner Version 2 was used for this project.
- Segmenting the model set: three models were built for three segments of customers: club members, non-club members, and recent customers who had joined in the previous eight or nine months.
- The final four models on four different segments: in order to investigate whether customers joining at about the same time have similar reasons for churn, the club model set was split into two segments: customers who joined in the previous two years, and the rest.

Page 158: Data Mining

Churn Model Building (continued)

- Choice of modeling algorithm: decision tree models were used for churn modeling because of their ability to handle hundreds of fields in the data, their explanatory power, and their ease of automation. This project built six trees for each model set (using Gini and entropy as split functions, and allowing 2-, 3-, and 4-way splits) in order to see which performs best and to let the trees verify each other. Three parameters need to be set: minimum size of a leaf node, minimum size of a node to split, and maximum depth of the tree. The resulting tree needs to be pruned.
- The size and churner density of the model set: experiments with different model sets show that a model set with 30% churners and 50k records works best. (Table 11.3)
- The effect of latency (Figure 11.12)
- Translating models in time (Figure 11.13)

Page 159: Data Mining

The Data

- Historical churn rates: the historical churn rate was calculated along different dimensions: handset, demographics, dealer, and ZIP code.
- Data at the customer and account level: SSN, ZIP code of residence, market ID, age and gender, pager indication flag, etc.
- Data at the service level: activation date and reason, features ordered, billing plan, handset, and dealer, etc.
- Billing history data: total amount billed, late charges and amount overdue, all calls, fee-paid services, etc.
- Rejecting some variables: variables that cheat, identifiers, categoricals with too many values, absolute dates, and untrustworthy values, etc.
- Derived variables

Page 160: Data Mining

Lessons about Churn Model Building

- Finding the most significant variables: handset churn rate, other churn rates, number of phones in use by a customer, low usage.
- Listening to the business users to define the goals.
- Listening to the data.
- Including historical churn rates: the past is the best predictor of the future. For churn, the past is historical churn rates: churn rate by handset, by demographics, by area, and by usage pattern. (Figure 11.17)
- Composing the model set: important factors are historical data availability, size, and churner density. (Figure 11.18)
- Building a model for the churn management application.
- Listening to the data to determine model parameters.
- Understanding the algorithm and the tool.

Page 161: Data Mining

Summary

Four critical success factors for building a churn model:
- Defining churn, especially differentiating between interesting churn (such as customers who leave for a competitor) and uninteresting churn (customers whose service has been cut off due to nonpayment).
- Understanding how the churn results will be used.
- Identifying data requirements for the churn model, being sure to include historical predictors of churn, such as churn rate by handset and churn rate by demographics.
- Designing the model set so the resulting models can slide through different time windows and are not obsolete as soon as they are built.

Page 162: Data Mining

Case Study

Market Basket Analysis: Who buys meat at the health food store?

(Source: Mastering Data Mining by Berry & Linoff.)

Page 163: Data Mining

Purpose

Who buys meat at the health food store?

Understand customer behavior.

Page 164: Data Mining

DM Tools

Association Rules of Market Basket Analysis.

Customer clustering.

Decision tree.

Page 165: Data Mining

Customer Analysis: market basket analysis uses the information about what a customer purchases to give us insight into who they are and why they make certain purchases.

Product Analysis: market basket analysis gives us insight into the merchandise by telling us which products tend to be purchased together and which are most amenable to purchase.

Market Basket Analysis

Source: E. Wegman

Page 166: Data Mining

Given:
- A database of transactions.
- Each transaction contains a set of items.

Find all rules X → Y that correlate the presence of one set of items X with another set of items Y.

Example: when a customer buys bread and butter, they buy milk 85% of the time.

Market Basket Analysis

Source: E. Wegman

Page 167: Data Mining

While association rules are easy to understand, they are not always useful.
- Useful: On Fridays, convenience store customers often purchase diapers and beer together.
- Trivial: Customers who purchase maintenance agreements are very likely to purchase large appliances.
- Inexplicable: When a new Super Store opens, one of the most commonly sold items is light bulbs.

Market Basket Analysis

Source: E. Wegman

Page 168: Data Mining

Measures for Market Basket Analysis

- Confidence: the probability that the right-hand product is present given that the left-hand product is in the basket.
- Support: the percentage of baskets that contain both the left-hand side and the right-hand side of the association.
- Lift (correlation): compares the likelihood of finding the right-hand product in a basket known to contain the left-hand product to the likelihood of finding the right-hand product in any random basket (see the sketch below).
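A small sketch computing the three measures for a rule over a list of transaction item-sets; the function and variable names are illustrative, not from the case study.

def rule_measures(transactions, lhs, rhs):
    """Support, confidence and lift for the rule lhs -> rhs, where each
    transaction is a set of items and lhs/rhs are sets of items."""
    n = len(transactions)
    n_lhs = sum(lhs <= t for t in transactions)          # baskets containing the left-hand side
    n_rhs = sum(rhs <= t for t in transactions)          # baskets containing the right-hand side
    n_both = sum((lhs | rhs) <= t for t in transactions) # baskets containing both sides
    support = n_both / n
    confidence = n_both / n_lhs
    lift = confidence / (n_rhs / n)                      # P(rhs | lhs) / P(rhs)
    return support, confidence, lift

# e.g. rule_measures(baskets, {"bread", "butter"}, {"milk"})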

Page 169: Data Mining

Example: "Caviar implies Vodka"
- High confidence: given that we know someone bought caviar, the probability that the person buys vodka is very high.
- Low support: the percentage of baskets that contain both vodka and caviar is very low, since those products are not widely purchased.
- High lift:

  Lift = Pr(finding Vodka | Caviar is already in the basket) / Pr(finding Vodka in any random basket)

Page 170: Data Mining

Association Results

 # | Relation | Lift | Support (%) | Confidence (%) | Rule
 1 | 4 | 2.47 | 3.23 | 33.72 | Red pepper -> Yellow pepper & Bananas & Bakery
 2 | 3 | 2.24 | 4.75 | 49.21 | Red pepper -> Yellow pepper & Bananas
 ... | ... | ... | ... | ... | ...
 50 | 2 | 1.37 | 3.77 | 85.96 | Green peppers -> Bananas
 ... | ... | ... | ... | ... | ...

Rule 50 has low lift but high support and high confidence:

Lift = Pr(finding Bananas | Green peppers already in the basket) / Pr(finding Bananas in any random basket)

Page 171: Data Mining

Clustering

Variables:
- Gender
- Meat buying
- Total spending

Page 172: Data Mining

• The height of the pies: total spending
• Shaded pie slice: the percentage of people in the cluster who buy meat
• Top row: women; bottom row: men

Customer Clusters

Page 173: Data Mining

Decision Tree

The most meat-buying branches:
- Spend the most money
- Buy the largest number of items

Although only about 5% of shoppers buy meat, they are among the most valuable shoppers!

Page 174: Data Mining

Decision Tree for More about Meat

Page 175: Data Mining

Conclusion

Data Mining can be used to improve shelf placement decision.

Data Mining can be used to identify a small, but very profitable group of customers.

Page 176: Data Mining

Case Study

Supermarket Mining: Analyzing Ethnic Purchasing Patterns

(Source: Mastering Data Mining by Berry & Linoff.)

Page 177: Data Mining

Overview

- Describe how the manufacturer learned about ethnic purchasing patterns.
- Aimed at Spanish-speaking shoppers in Texas.
- Collected data from a supermarket chain in Texas.
- Employed data mining tools from MineSet (SGI).

Page 178: Data Mining

Purpose

- Discover whether the data provided revealed any differences between the stores with a high percentage of Spanish-speaking customers and those having fewer.
- Compute a Hispanic percentage for each specific item.
- Identify which products sell well with Hispanic consumers.
- Scatter plot showing variability of Hispanic appeal by category.

Page 179: Data Mining

Data
- Weekly sales figures for products from five basic categories (ready-to-eat cereals, desserts, snacks, main meals, pancake and variety baking mixes).
- Within each category, subcategories were assigned (actual units sold, dollar volume, and equivalent case sales).
- For each store: store size, % of Hispanic shoppers, and % of African-American shoppers.

Page 180: Data Mining

Transformation of Data
- Decode variables that carried more than one piece of information.
- HISPLVL and AALEVEL: % of Hispanic and African-American shoppers. HISPLVL ranges from 1 to 15: 1 = store outside San Antonio with 90% or more Hispanic shoppers; 10 = little or no Hispanic presence.
- Normalize values by the sales volume to compare stores of different sizes.
- Hispanic score = average values for the most Hispanic stores minus average values for the least Hispanic stores. A large positive value indicates a product that sells much better in the heavily Hispanic stores.

Page 181: Data Mining

Transformation of Data

The most valuable part of the project was preparing the data and getting familiar with it, rather than running fancy data mining algorithms.

Page 182: Data Mining

Association rule visualization for Hispanic percentage.

Scatter plot showing which products sell well in Hispanic neighborhoods.

Scatter plot showing variability of Hispanic appeal by category.

DM Tools

Page 183: Data Mining

Case Study

Supermarket Mining: Transactions & Customer Analysis

(Source: Mastering Data Mining by Berry & Linoff.)

Page 184: Data Mining

Overview
- A collaboration between a manufacturer and one of the retailer chains.
- Work in the grocery market that usually belongs to the retailer is actually performed by a supplier.

Page 185: Data Mining

Purpose
- Effectively use sales data to make the category as a whole more profitable for the retailer.
- Identify customer behavior.
- Find clusters of customers.

Page 186: Data Mining

Transaction Detail Fields

FIELD | DESCRIPTION
Date | YYYY-MM-DD
Store | CCCSSSS, where CCC = chain, SSSS = store
Lane | Lane of transaction
Time | The time-stamp of the order start time
Customer ID | The loyalty card number presented by the customer; an ID of 0 means the customer did not present a card
Tender Type | Payment type, i.e. 1 = cash, 2 = check, ...
UPC | The universal product code for the item purchased
Quantity | The total quantity of this item
Dollar Amount | The total $ amount for the quantity of a particular UPC purchased

Page 187: Data Mining

Universal Product Code
- The numbers, encoded as machine-readable bar codes, that identify nearly every product that might be sold in a grocery store.
- Organizations: Uniform Code Council (www.uc-council.org) for the US and Canada; European Article Numbering Association (www.ean.be) for Europe and the rest of the world.
- North America: consists of 12 digits. The code itself fits in 11 digits; the twelfth is a checksum (sketched below).
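A small sketch of the standard UPC-A check-digit rule mentioned above (odd positions weighted 3, even positions weighted 1); the example code is illustrative.

def upc_check_digit(first11: str) -> int:
    """Check digit for a 12-digit UPC-A code given its first 11 digits."""
    digits = [int(c) for c in first11]
    total = 3 * sum(digits[0::2]) + sum(digits[1::2])   # odd positions x3 + even positions
    return (10 - total % 10) % 10

# e.g. upc_check_digit("03600029145") -> 2, so the full code is 036000291452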

Page 188: Data Mining

From the transaction detail fields we can calculate:
- The percentage of each shopper's total spending that went to that category.
- The total number of trips.
- The total dollar amount spent for the year, along with the total number of items purchased and the total number of distinct items purchased.
- The percentage of the items purchased that carried high, medium, and low profit margins for the store.

Page 189: Data Mining

Finding Clusters of Customers
- Finding groups of customers with similar behavior.
- K-means clustering (sketched below):
  - Set a certain number k.
  - k records are selected as candidate cluster centers.
  - Each record is assigned to the cluster whose center it is nearest.
  - The centers of the clusters are recalculated and the records are reassigned based on their proximity to the new cluster centers.
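A plain NumPy sketch mirroring the k-means steps listed above; it is illustrative only, not the MineSet/SAS implementation used in the case study.

import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Pick k candidate centers, assign each record to its nearest center,
    recompute the centers, and repeat."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                        # nearest-center assignment
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers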

Page 190: Data Mining

- To gain insight into customer behavior by understanding what differentiates one cluster from another.
- To build further models within each cluster.
- To use cluster membership as an additional input variable to other models.

Page 191: Data Mining

Case Study

Who Gets What? Building a Best Next Offer Model for an Online Bank

(Source: Mastering Data Mining by Berry & Linoff.)

Page 192: Data Mining

Who Gets What? Building a Best Next Offer Model for an Online Bank

The use of data mining by the online division of a major bank to improve its ability to perform cross-selling.

Cross-selling: the activity of selling additional services to the customers you already have.

Page 193: Data Mining

Outline

Background on the Banking Industry
The Business Problem
The Data
Approach to the Problem
Model Building
Lessons Learned

Page 194: Data Mining

Background on the Banking Industry

The challenge for today's large banks is to shift their focus from market share to wallet share. That is, instead of merely increasing the number of customers, banks need to increase the profitability of the ones they already have.

Page 195: Data Mining

Background on the Banking Industry

Why use data mining? A bank knows much more about its current customers than about external prospects.

The information gathered on customers in the course of normal business operations is much more reliable than data purchased on external prospects.

Page 196: Data Mining

The Business Problem

The project had immediate, short-term, and long-term goals.

Long-term: increase the bank's share of each customer's financial business by cross-selling appropriate products.

Short-term: support a direct e-mail campaign for four selected products (brokerage accounts, money market accounts, home equity loans, and a particular type of savings account).

Immediate: take advantage of a data mining platform on loan from SGI to demonstrate the usefulness of data mining to the marketing of online banking services.

Page 197: Data Mining

The Data

The initial data comprised 1,122,692 account records extracted from the Customer Information System (CIS). Before starting data mining, a SAS data set was created, which contained an enriched version of the extracted data.

Page 198: Data Mining

The Data

From accounts to customers

Defining the products to be offered.

Page 199: Data Mining

The Data

From accounts to customers: The data extracted from the CIS had one row per account, which reflects the usual product-centric organization of a bank, where managers are responsible for the profitability of particular products rather than the profitability of customers or households.

The best next offer project required pivoting the data to build customer-centric models. The account-level records from the CIS were transformed into around a quarter million household-level records.
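A hypothetical sketch of this account-to-household pivot. The column names (household_id, product_code, balance) are illustrative; the source says only that the account rows were rolled up into one row per household.

```python
import pandas as pd

def accounts_to_households(accounts: pd.DataFrame) -> pd.DataFrame:
    # One indicator column per product: does the household hold that product?
    has_product = (
        pd.crosstab(accounts["household_id"], accounts["product_code"])
        .gt(0)
        .astype(int)
        .add_prefix("has_")
    )
    # Household-level summaries such as total balance and number of accounts
    summary = accounts.groupby("household_id").agg(
        total_balance=("balance", "sum"),
        n_accounts=("product_code", "count"),
    )
    return summary.join(has_product)
```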

Page 200: Data Mining

The Data

Defining the products to be offered: 45 product types are used for the best next offer model. Of these, 25 products are ones that may be offered to a customer. Information on the remaining products is used only as input variables when building the models.

Page 201: Data Mining

Approach to the Problem

The approach to the problem:

A propensity-to-buy model is built for each product individually, giving each customer a score for that product. The scores for the four products are then combined to yield the best next offer model: each customer is offered the product for which he or she has the highest score.
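A minimal sketch of this combination step. The product names and scores are made up, and the rule that products a customer already holds score zero anticipates the "comparable scores" requirements on the next slides.

```python
def best_next_offer(scores: dict[str, float], owned: set[str]) -> str | None:
    """Offer the product with the highest propensity score among products not already held."""
    eligible = {product: s for product, s in scores.items() if product not in owned}
    if not eligible:
        return None
    return max(eligible, key=eligible.get)

# Example: home equity has the highest score among products this customer does not hold
print(best_next_offer(
    {"brokerage": 0.12, "money_market": 0.35, "home_equity": 0.41, "savings": 0.08},
    owned={"savings"},
))
```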

Page 202: Data Mining

Approach to the Problem

Comparable scores

How to score?

Pitfalls of this approach

Page 203: Data Mining

Approach to the Problem

Comparable scores: Three requirements are needed to make scores from the various product propensity models comparable:

All scores must fall into the same range: zero to one.

Anyone who already has a product should score zero for it.

The relative popularity of products should be reflected in the scores.

Page 204: Data Mining

Approach to the Problem

How to score? With a product propensity model, prospects are given a score based on the extent to which they look like the existing account holders for that product. This project used a decision-tree-based approach, which uses the percentage of existing customers at a leaf to assign a score for the product (see the sketch below).

This approach can be summed up by the words of Richard C. Cushing: “When I see a bird that walks like a duck and swims like a duck and quacks like a duck, I call that bird a duck.”
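A hypothetical re-creation of the leaf-percentage scoring idea using scikit-learn (the project itself used MineSet). X_train, y_train (1 = has a brokerage account) and X_prospects are assumed, already-prepared feature matrices.

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(min_samples_leaf=50, random_state=0)
tree.fit(X_train, y_train)

# For a decision tree, predict_proba returns the class fractions of the training
# records in the leaf a record falls into, i.e. the percentage of existing
# brokerage customers at that leaf.
propensity = tree.predict_proba(X_prospects)[:, 1]
```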

Page 205: Data Mining

Approach to the Problem

Pitfalls of this approach: Becoming a customer may change people's behavior.

The best approach is to build models based on the way current customers looked just before they became customers, but the data needed for this approach is not easy to get.

Current customers reflect past policy, so the model will also reflect any "past discrimination".

Page 206: Data Mining

Model Building

Build an individual propensity model for each product

Finding important variables
Building a decision tree model
Model performance in a controlled test

Get to a cross-sell model by combining individual propensity models

Page 207: Data Mining

Model Building

Start with brokerage accounts

Page 208: Data Mining

Finding important variables

Using the column importance tool: Find a set of variables which, taken together, do a good job of differentiating the classes (people with brokerage accounts and people without):

Whether they are a private banking customer
The length of time they have been with the bank
The value of certain lifestyle codes assigned to them by Microvision (a marketing statistics company)

Using the evidence classifier: This tool uses the naïve Bayes algorithm to build a predictive model. Naïve Bayes models treat each variable independently and measure its contribution to a prediction; these independent contributions are then combined to make a classification.
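MineSet's evidence classifier is not available here; as a stand-in, a naïve Bayes sketch with scikit-learn. The feature names and the households DataFrame are assumptions, not from the source.

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

cat_cols = ["private_banking", "tenure_band", "microvision_code"]  # illustrative features
enc = OrdinalEncoder()
X = enc.fit_transform(households[cat_cols])
y = households["has_brokerage"]

nb = CategoricalNB()
nb.fit(X, y)

# Each variable contributes independently (per-class, per-category log-likelihoods);
# the classifier combines these contributions with the class prior to make a prediction.
print([table.shape for table in nb.feature_log_prob_])
print(nb.predict_proba(X[:5]))
```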

Page 209: Data Mining

Building a decision tree model for brokerage

MineSet's decision tree tool: Leaves in the tree are either mostly non-brokerage or mostly brokerage. Each path through the tree to a leaf containing mostly brokerage customers can be thought of as a "rule" for predicting an unclassified customer: customers meeting the conditions of the rule are likely to have, or be interested in, a brokerage account.

In the data, only 1.2 percent of customers had brokerage accounts. To improve the model, oversampling is used to increase the percentage of brokerage customers in the model set; the final tree is built on a model set containing about one quarter brokerage accounts.

Page 210: Data Mining

Building a decision tree model for brokerage

Record weights in place of oversampling
Allowing one-off splits
Grouping categories
Influencing the pruning decisions
Backfitting the model for comparable scores

Page 211: Data Mining

Building a decision tree model for brokerage

Record weights in place of oversampling: Record weighting can achieve the effect of oversampling by increasing the relative importance of the rare records.

Splitting decisions are based on the total weight of records in each class rather than the total number of records.

Instead of increasing the weight of records in the rare class, the proper approach is to lower the weight of records in the common class.

Bringing the weight of the rare records up to 20-25% of the total works well (see the sketch below).
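A sketch of this record weighting using scikit-learn's sample_weight (MineSet's own mechanism is not shown in the source). X_train and y_train are assumed as before, and the 25% target is the upper end of the range quoted above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

target_rare_share = 0.25
n_rare = (y_train == 1).sum()      # brokerage customers (the rare class)
n_common = (y_train == 0).sum()    # everyone else (the common class)

# Lower the common-class weight w so that n_rare / (n_rare + w * n_common) = target_rare_share
w_common = n_rare * (1 - target_rare_share) / (target_rare_share * n_common)
weights = np.where(y_train == 1, 1.0, w_common)

tree = DecisionTreeClassifier(min_samples_leaf=50, random_state=0)
tree.fit(X_train, y_train, sample_weight=weights)
```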

Page 212: Data Mining

Building a decision tree model for brokerage

Allowing one-off splits: By default, MineSet's tree-building algorithm splits a categorical variable on every single value, or does not split on it at all. A one-off split is a split based on a single value of a categorical variable; users can control whether one-off splits are considered through a parameter.

Grouping categories: By MineSet's design, the tree-building algorithm is unlikely to make good splits on a categorical variable taking on hundreds of values. Some variables rejected by MineSet nevertheless seem to be very predictive: although they have hundreds of values in the data, only a few of those values appear frequently. The approach is to lump all values below a certain frequency threshold into a catch-all "other" category and make splits on the more populous ones.
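A minimal pandas sketch of lumping rare categorical values into a catch-all "other" category; the threshold of 500 occurrences is an arbitrary illustration.

```python
import pandas as pd

def group_rare_categories(values: pd.Series, min_count: int = 500) -> pd.Series:
    """Keep values that appear at least min_count times; map everything else to 'other'."""
    counts = values.value_counts()
    keep = counts[counts >= min_count].index
    return values.where(values.isin(keep), other="other")

# Example: the infrequent codes all collapse into "other"
codes = pd.Series(["A"] * 600 + ["B"] * 550 + ["Z1", "Z2", "Z3"] * 5)
print(group_rare_categories(codes).value_counts())
```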

Page 213: Data Mining

Building a decision tree model for brokerage

Influencing the pruning decisions: Users have control over the size, depth, and bushiness of the tree. Good settings: a minimum of 50 records in a node, a pruning factor of 0.1, and no explicit limit on the depth.

Backfitting the model for comparable scores: Backfitting runs the original (non-oversampled) data through the tree. The score for each leaf is based on the percentage of brokerage customers at that leaf: the more brokerage customers at a leaf, the higher the scores of the non-brokerage customers at that leaf, and the more likely they are to open a brokerage account.

Page 214: Data Mining

Brokerage model performance in a controlled test

A "high score" is any score higher than the density of brokerage customers in the population, which is not a large number.

Group      Size     Chosen by     E-mail sent   Response rate
Model      10,000   High score    Yes           0.7
Control    10,000   Random        Yes           0.3
Hold-out   10,000   Random        No            0.05

Page 215: Data Mining

Getting to a cross-sell model

The propensity models for the remaining products are built following the same procedure, and the individual propensity models are combined into a cross-sell model to find the best next offer.

[Diagram: a customer's individual propensity scores for products A, B, C, and D (0.72, 0.47, 0.31, and 0.10 in the example) feed into a vote, and the product with the highest score becomes the best next offer.]

Page 216: Data Mining

Summary of the Procedure

Determine whether cross-selling makes sense.

Determine whether sufficient data exists to build a good cross-sell model.

Build propensity models for each product individually.

Combine individual propensity models to construct a cross-sell model.

Page 217: Data Mining

Lessons Learned

Before building customer-centric models, the data needs to be transformed from product-centric to customer-centric.

Having a particular product may change a customer’s behavior. The best way to solve this problem is to build models based on the behavior before buying the product.

The current composition of the customer population is largely a reflection of past marketing policy.

Oversampling and record weighting can be used to consider rare events.

Page 218: Data Mining

References

Berry & Linoff, Mastering Data Mining, Wiley, 2000.

Han & Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.

Hastie, Tibshirani, & Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001.

Taguchi & Jugulum, The Mahalanobis-Taguchi Strategy, Wiley, 2002.