a practical look at data preparation jason brown cognicase inc. data mining irmac: data warehouse...

65
A Practical Look at Data Preparation Jason Brown Jason Brown Cognicase Inc. Cognicase Inc. Data Mining Data Mining IRMAC: Data Warehouse SIG November 5, 2002

Upload: wilfred-mosley

Post on 18-Dec-2015

223 views

Category:

Documents


5 download

TRANSCRIPT

Page 1: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

A Practical Look at Data Preparation

Jason BrownJason BrownCognicase Inc.Cognicase Inc.

Data MiningData Mining

IRMAC: Data Warehouse SIGNovember 5, 2002

Page 2: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

2

Agenda

• Crash Course in Data Mining– What– Why– How

• The virtuous cycle– Data Preparation

• Case Study– Background– Going through the cycle

• Data Preparation

• Q&A

Page 3: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

3

The Crash Course

•What

•Why

•How

Page 4: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

4

Definitions

Data Mining:The process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.

Knowledge Discovery

Data Mining is not Data Warehousing, OLAP etc.

Page 5: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

5

Definitions

Modeling:• Not an ER type Data Model• A data mining model is computational, full of

algorithms• A model can be descriptive or predictive.

– A descriptive model helps in understanding underlying processes or behavior.

– A predictive model uses known values (input) to predict an unknown value (output)

Page 6: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

6

Two Types of Data Mining

• Directed– Know specifically what we are looking for

• Who is likely to respond to our offer?• What our customers going to be worth to us over their lifetime?

– Model is a Black Box

Input Output

Page 7: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

7

Two Types of Data Mining

• Undirected– Not exactly sure what we are looking for

• How should we define our Customer Segments?• What is interesting about all of our point of sale data?

– Model is a Transparent Box

Input Output

Page 8: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

8

Modeling Techniques

Decision Trees

If …….. Then ……..

Rule Induction

Neural Networks

Nearest Neighbour

Clustering

Page 9: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

9

Modeling TechniquesDecision Trees

• The tree is built based on the input of a training data set– Training Data Set is based on historical data– Over sample the data that reflects your question

• Each record of the Model Set is run through the branches of the tree until the record reaches a leaf

Page 10: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

10

Modeling TechniquesDecision Trees

Age < 23

YNIncome < 12 000

N

N

Y

Y

Male ?

28% 37%

Page 11: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

11

Modeling TechniquesNeural Networks

• Neural networks are a nonlinear model -similar to a ‘brain’.

• The network is built based on the input of a training set.

• Model sets run through this network will return accurate results based on the patterns identified in the training set.

• Very Complex

Page 12: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

12

Modeling TechniquesClustering

• Clustering finds groups of records that are similar.

• For example, customers can be clustered by:– income

– age

– ytd revenue

Page 13: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

13

Modeling TechniquesClustering

MaleIncome < 12 000Age < 23Coke Buyers

MaleIncome < 12 000Age < 23Non Coke Buyers

Page 14: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

14

Modeling TechniquesNearest Neighbour

• Model is built based on the input of a training set.

• Classifies a record by calculating the distances between the record criteria and the training data set

• Then it assigns the record to the class that is most common among its nearest neighbours

Page 15: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

15

Modeling TechniquesNearest Neighbour

Records plotted based on:

IncomeGenderAge

Bought Coke

Bought Coke

Bought Coke

Did Not Did Not

Page 16: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

16

Modeling TechniquesRule Induction

• A technique that infers generalizations from the information in the data

IF age < 19 AND purchase is coke THEN 40% purchase chips

• Describes the data, allows us to visualize what is going on

Page 17: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

17

The Crash Course

•What

•Why

•How

Page 18: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

18

The Reasons to Mine Data

Expenses

IncreaseProfit

Revenues

Page 19: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

19

The Reasons to Mine Data

• For Marketing/CRM– Targeting prospects – Predicting future customer behaviour– Costs Revenues

• For Research– Identify drugs likely to be successful– Costs

• For Process Improvement– Identify causes of production failures– Costs

Page 20: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

20

The Crash Course

•What

•Why

•How

Page 21: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

21

Process

• Many different processes for Data Mining– Vendor Driven

• SAS - SEMMA– Sample, Explore, Modify, Model, Assess

• SPSS - 5 A’s– Assess, Access, Analyze, Act, Automate

– Consulting Companies– The Virtuous Cycle

• Michael Berry and Gordon Linoff

Page 22: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

22

Business Problem

Transform Data

Act

Measure

ProcessThe Virtuous Cycle

Page 23: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

23

• Define the business problem• Understand the business and the rules• Determine if Data Mining fits the need• Understand the value to the business of

solving the problem

Business Problem

Page 24: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

24

Data for Data Mining

• Type of Data Values– Categorical

• Defined set of values• Ontario, Quebec, PEI …

– Ranks• High, Medium, Low• 0 – 20 000, 20 001 – 35 000, 35 001 – 50 000

– Intervals• Date• Time • Temperature

– True Numeric• Values that support numeric operations

Page 25: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

25

Transform DataSteps

Identify Data1

Prepare Model Set6

Add Derived Variables5

Transpose to Right Granularity4

Validate & Clean3

Obtain Data2

Conduct Modeling7

Page 26: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

26

Transform Data

Identify Data1

Step 1 - Identify Data

• What data is required to meet the modeling need?

• What data is available?

Page 27: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

27

Transform DataStep2 - Obtain Data

Identify Data

Obtain Data

1

2

• OLTP• Data Warehouse• Data Marts and OLAP• Self Reported• External

Page 28: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

28

Transform DataStep 3: Validate & Clean

Identify Data

Obtain Data

Validate & Clean

1

3

2

• Data Issues:– Missing– Fuzzy– Incorrect– Outliers

• Solutions:– Change Source– Filter Out– Ignore– Integrate– Predict– Derive a New Variable

Page 29: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

29

Transform DataStep 4: Transpose to right granularity

Identify Data

Obtain Data

Transpose to Right Granularity

Validate & Clean

1

4

3

2

• Data sets for Data Mining need one view, one record

• Grain must be consistent throughout– Aggregates can be problematic– Atomic data is often required to build data set

• Training data sets cast from point in time of event looking back

Page 30: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

30

Transform DataStep 5: Add Derived Variables

Identify Data

Obtain Data

Transpose to Right Granularity

Validate & Clean

Add Derived Variables

1

5

4

3

2

• Combined Columns• Summarizations• Features from Columns• Time Series

Page 31: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

31

Transform DataStep 6: Prepare Model Set

Identify Data

Obtain Data

Transpose to Right Granularity

Validate & Clean

Add Derived Variables

Prepare Model Set

1

6

5

4

3

2

• The Actual Input to the modeling

Page 32: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

32

Transform DataStep 7: Conduct Modeling

Identify Data

Obtain Data

Transpose to Right Granularity

Validate & Clean

Add Derived Variables

Prepare Model Set

Conduct Modeling

1

6

5

4

3

2

7

• Get our result– Decision Trees– Neural Networks– Clustering – Nearest Neighbour– Rule Induction

Page 33: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

33

Act

The Business has to actually do something with the results or what was the point?

Marketing or Retention Campaigns

Business Changes

Page 34: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

34

Measure

• Answer 2 Questions– Was the Data Mining effort accurate? – Were the Business Actions successful?

• Use different sets of data to compare real results– Actioned Customers vs. Non Actioned

• Accuracy Types– Absolute

• Our prediction was 80% of Group D would buy Coke and 78% really did

– Relative• Our prediction was 80% of Group D would buy Coke but

57 % really did, however Group C which we predicted had a 60% propensity to buy Coke actually bought Coke 42% of the time

Page 35: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

35

And back around

Business Problem

Transform Data

Act

Measure

Page 36: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

36

The Case Study

Page 37: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

37

The Case Study

• Background– The Business– Data Warehouse Overview– Strengths and Challenges

• The Project– Business Problem– Transform Data– Act– Measure

Page 38: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

38

The Case Study

• Background– The Business– Data Warehouse Overview– Strengths and Challenges

• The Project– Business Problem– Transform Data– Act– Measure

Page 39: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

39

The Business

• One of the top 3 (4?) cellular phone providers in Canada

• Recent Acquisitions– Clearnet– Quebectel

• Important Business Concepts– Handset– Subscriber– Client– Activity - Activations, Deactivations– Churn– Usage

Page 40: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

40

Cubes

Catalogues

Sources

Staging

Integration (3N F)

Reports

Data Marts(Dimensional)

DW Environment

Page 41: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

41

• Commitment to Data Warehousing• Prior Experience in Data Mining• Tools already Established• Strong business support for the outcomes Data

Mining would provide

Strengths at

Page 42: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

42

• Data Warehouse still in midst of major re-architecture effort

• Ongoing billing system integration projects• A data mart for data mining had existed

(Clearnet) but it was a victim of both of the above– Successful at Churn Prediction

Challenges at

Page 43: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

43

The Case Study

• Background– The Business– Data Warehouse Overview– Strengths and Challenges

• The Project– Business Problem– Transform Data– Act– Measure

Page 44: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

44

Using the Virtuous Cycle

Business Problem

Transform Data

Act

Measure

Page 45: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

45

Business Problems

• Churn Modeling– predict which subscriber is likely to leave

• Behavioural Segmentation– clustering subscribers into subgroups based on some

commonality– revenue, usage, demographic

• Client Value Estimation– the present value of all future profits generated throughout

the lifetime of that client

Page 46: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

46

Transform Data Steps

Identify Data

Obtain Data

Transpose to Right Granularity

Validate & Clean

Add Derived Variables

Prepare Model Set

Conduct Modeling

1

6

5

4

3

2

7

Page 47: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

47

Identify Data

EVERY POSSIBLE VARIABLE RELATED TO A SUBSCRIBER!

• Business Wanted:

1

• They provided a detailed list, by subject area, of the variables that they believed were required to conduct the kind of Data Mining Activities desired.

Page 48: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

48

• 19 Subject Areas identified with up to 75 variables each - What is the Priority?

• Avoid the big bang - How much can we actually do?

• Where to source the data from?• Resources - Who is going to do it?

Identify DataIT Challenges

1

Page 49: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

49

• First we asked the business to rate each variable as H, M or L priority– Almost everything was given an H

• Then we asked the business to rank the subject areas in order of importance– Hard to convince them of the value– Hard to find consensus– Necessary for determining a release strategy

Identify DataPrioritizing

1

Page 50: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

50

Obtain Data

• For each Subject Area and each variable we assessed and documented the following:– Where can it be sourced from (and when)?– What are the known issues?– Q&A back and forth on the variables with business– Identified possible additional variables

2

Page 51: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

51

+ =Release Strategy:•1 Release a Quarter

•3 planned releases

•15 of 19 Subject Areas

Obtain Data

2

Page 52: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

52

Obtain Data

• Data Mining Access tool against sources– Data Warehouse– OLTP– External Data

• Creation of flat files to feed to Data Mining team– Multi source– Validate and Clean

• Build a Data Mart for Data Mining– Part of the Data Warehouse Architecture

2

Page 53: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

53

Obtain Data

Sources

Staging

Integration (3N F)

Data Marts(Dimensional)

Data Mart forData Mining

Page 54: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

54

Validate & Clean Data

Validate & Clean3

Staging

Integration (3N F)

Data Marts

Data Mart forData Mining

• Clean?– Not Fuzzy– Correct - mostly– Missing and Outliers

Page 55: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

55

Transpose to Right Granularity

• The Grain is Subscriber by Bill Cycle• Monthly Snapshots of subscriber Data• Fact Data Cast to Bill Cycle

Transpose to Right Granularity4

BILLED_AMOUNTS

SUBSCRIBER_ID (FK)MONTH (FK)

LOTS_OF_BILLED_REVENUE_FACTSUPDATE_DTLOAD_DT

PRODUCT_SKU

PRODUCT_SKU_KEY

PRODUCT_INFOUPDATE_DTLOAD_DT

SUBSCRIBER_BILL_CYCLE_SNAPSHOT

SUBSCRIBER_IDMONTH

CURRENT_HANDSET_KEY (FK)FIRST_HANDSET_KEY (FK)SUBSCRIBER_PERSONAL_INFOSUBSCRIBER_DEMOGRAPHIC_INFOCONTRACT_INFORATE_PLAN_INFOMUCH_MORE_INFOLOAD_DTUPDATE_DT

VALUE_ADDED_SERVICES

VALUE_ADDED_SERVICES_KEY

VALUE_ADDED_SERVICES_DESVALUE_ADDED_SERVICES_CDVALUE_ADDED_SERVICES_IDUPDATE_DTLOAD_DT

VALUE_ADDED_SERVICES_BILLED

MONTH (FK)SUBSCRIBER_ID (FK)VALUE_ADDED_SERVICES_KEY (FK)

VALUE_ADDED_SERVICES_AMTUPDATE_DTLOAD_DT

FACTS_5SUBSCRIBER_ID (FK)MONTH (FK)

LOTS_OF_FACTSUPDATE_DTLOAD_DT

FACTS_4SUBSCRIBER_ID (FK)MONTH (FK)

LOTS_OF_FACTSUPDATE_DTLOAD_DT

FACTS_3SUBSCRIBER_ID (FK)MONTH (FK)

LOTS_OF_FACTSUPDATE_DTLOAD_DT

FACTS_2SUBSCRIBER_ID (FK)MONTH (FK)

LOTS_OF_FACTSUPDATE_DTLOAD_DTSUBSCRIBER_BILL_CYCLE_SNAPSHOT

SUBSCRIBER_IDMONTH

CLIENT_ACCOUNT_ID (FK)CLIENT_SNAPSHOT_MONTH (FK)CURRENT_HANDSET_KEY (FK)FIRST_HANDSET_KEY (FK)SUBSCRIBER_PERSONAL_INFOSUBSCRIBER_DEMOGRAPHIC_INFOCONTRACT_INFORATE_PLAN_INFOMUCH_MORE_INFOLOAD_DTUPDATE_DT

CLIENT_ACCOUNT_BILL_CYCLE_SNAPSHOTCLIENT_ACCOUNT_IDCLIENT_SNAPSHOT_MONTH

INFO_ON_THE_ACCOUNTACCOUNT_NO_OF_CANCELLED_SUBSACCOUNT_NO_OF_SUSPENDED_SUBSACCOUNT_NO_OF_ACTIVE_SUBSLOAD_DTUPDATE_DT

Page 56: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

56

Add Derived Variables

• Combined Columns– PERCENT_BUCKET_USED

• IN_BUCKET_CALLS / RATE_PLAN_BUCKET_MINUTES)

• Summarizations– Sphere of Influence

• UNIQUE_CALLED_NUMBER_CNT• UNIQUE_CALLING_NUMBER_CNT

Add Derived Variables5

Page 57: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

57

Add Derived Variables

• Features from Columns– CONTRACT_INDICATOR (Y/N Flag)

• Was the the date of snapshot for that subscriber between COMMIT_START_DATE and COMMIT_END_DATE

• Time Series

Add Derived Variables5

BLOCKED_USAGE_WORK

SUBSCRIBER_ID: NUMBER(22)USAGE_DT: DATE

BLOCKED_CNT: NUMBER(22)UPDATE_DT: DATELOAD_DT: DATE

BLOCKED_USAGE

MONTH: NUMBER(8)SUBSCRIBER_ID: NUMBER(22)

BLOCKED_CNT: NUMBER(22)UPDATE_DT: DATELOAD_DT: DATE

VALUE_ADDED_SERVICES_BILLED

MONTH: NUMBER(8)SUBSCRIBER_ID: NUMBER(22)VALUE_ADDED_SERVICES_KEY: NUMBER

VALUE_ADDED_SERVICES_AMT: NUMBER(22,2)UPDATE_DT: DATELOAD_DT: DATE

SUBSCRIBER_BILL_CYCLE_SNAPSHOT

SUBSCRIBER_ID: NUMBER(22)MONTH: NUMBER(8)

CLIENT_ACCOUNT_ID: NUMBER(22)CLIENT_SNAPSHOT_MONTH: NUMBER(8)SUBSCRIBER_CD: VARCHAR2(50)TIME_REMAINING_ON_CONTRACT: NUMBERCONTRACT_END_DT: DATECONTRACT_START_DT: DATECONTRACT_TERM: NUMBERCONTRACT_IND: VARCHAR2(1)SUBSCRIBER_STATUS_CD: VARCHAR2(50)ACTIVATION_CHANNEL_ORG_CD: VARCHAR2(50)ACTIVATION_REASON_CD: VARCHAR2(50)ACTIVATION_DT: DATEPOSTAL_CODE: VARCHAR2(20)PROVINCE_STATE_CD: VARCHAR2(50)LAST_NM: VARCHAR2(60)LANGUAGE_CD: VARCHAR2(50)BIRTHDATE: DATEGENDER: VARCHAR2(1)DRIVERS_LICENSE: VARCHAR2(20)FIRST_NM: VARCHAR2(50)SOCIAL_INSURANCE_NO: VARCHAR2(20)MOBILE_PHONE_NO: VARCHAR2(50)DEACTIVATION_REASON_CD: VARCHAR2(50)DEACTIVATION_DT: DATEREVENUE_BAND_CD: VARCHAR2(50)SEGMENT_CD: VARCHAR2(50)SUSPEND_REASON_CD: VARCHAR2(50)SUSPEND_DT: DATECURRENT_HANDSET_KEY: NUMBERFIRST_HANDSET_KEY: NUMBERUPDATE_DT: DATELOAD_DT: DATEAGE_OF_CURRENT_HANDSET: NUMBERPREPAID_IND: VARCHAR2(1)REVENUE_IND: VARCHAR2(1)RATE_PLAN_BUCKET_MINUTES: NUMBER(9,2)RATE_PLAN_GROUP_CD: VARCHAR2(50)RATE_PLAN_CD: VARCHAR2(50)SUBSCRIBER_TENURE_DAYS: NUMBERSUBSCRIPTION_ID: VARCHAR2(50)TECHNOLOGY_TYPE_CD: VARCHAR2(50)AGE: NUMBER(4,1)

PRODUCT_SKU_KD2

PRODUCT_SKU_KEY: NUMBER

PRODUCT_GROUP_DES: VARCHAR2(240)PRODUCT_GROUP_CD: VARCHAR2(50)PRODUCT_GROUP_ID: VARCHAR2(50)ENGLISH_PRODUCT_DES: VARCHAR2(240)PRODUCT_CD: VARCHAR2(50)REFURBISHED_HANDSET_IND: VARCHAR2(1)WEB_READY_IND: VARCHAR2(1)SIM_CARD_IND: VARCHAR2(1)PRODUCT_ID: VARCHAR2(50)UPDATE_DT: DATELOAD_DT: DATE

DROPPED_USAGE_WORK

SUBSCRIBER_ID: NUMBER(22)USAGE_DT: DATE

DROPPED_CNT: NUMBER(22)UPDATE_DT: DATELOAD_DT: DATE

DROPPED_USAGE

MONTH: NUMBER(8)SUBSCRIBER_ID: NUMBER(22)

DROPPED_CNT: NUMBER(22)UPDATE_DT: DATELOAD_DT: DATE

BUCKET_USAGE

MONTH: NUMBER(8)SUBSCRIBER_ID: NUMBER(22)

UPDATE_DT: DATELOAD_DT: DATEPERCENT_BUCKET_USED: NUMBER

BILLED_USAGE_3_MTH_AVG

MONTH: NUMBER(8)SUBSCRIBER_ID: NUMBER(22)

TOTAL_CALLS_MOU_3_MTH_AVG: NUMBER(22,5)LONG_DISTANCE_CALLS_MOU_3_MTH_: NUMBER(22,5)WEEKEND_CALLS_MOU_3_MTH_AVG: NUMBER(22,5)EVENING_CALLS_MOU_3_MTH_AVG: NUMBER(22,5)PEAK_CALLS_MOU_3_MTH_AVG: NUMBER(22,5)UPDATE_DT: DATELOAD_DT: DATE

BILLED_USAGE

MONTH: NUMBER(8)SUBSCRIBER_ID: NUMBER(22)

TOTAL_CALLS_MOU: NUMBER(22,5)TOTAL_CALLS_CNT: NUMBERLONG_DISTANCE_CALLS_MOU: NUMBER(22,5)LONG_DISTANCE_CALLS_CNT: NUMBERWEEKEND_CALLS_MOU: NUMBER(22,5)WEEKEND_CALLS_CNT: NUMBEREVENING_CALLS_MOU: NUMBER(22,5)EVENING_CALLS_CNT: NUMBERPEAK_CALLS_MOU: NUMBER(22,5)PEAK_CALLS_CNT: NUMBERUPDATE_DT: DATELOAD_DT: DATEOUT_BUCKET_CALLS_MOU: NUMBER(22,5)IN_BUCKET_CALLS_MOU: NUMBER(22,5)OUT_BUCKET_CALLS_CNT: NUMBERIN_BUCKET_CALLS_CNT: NUMBERROAMING_CALLS_MOU: NUMBER(22,5)ROAMING_CALLS_CNT: NUMBER

SUBSCRIBER_ID_KD2_TEMP

SUBSCRIBER_ID: NUMBER(22)

ACCOUNT_CYCLE_BILL_DAY: NUMBER(2)

Page 58: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

58

Prepare Model Set

Prepare Model Set6

FACTS_5

SUBSCRIBER_ID (FK)MONTH (FK)

LOTS_OF_FACTSUPDATE_DTLOAD_DT

FACTS_4

SUBSCRIBER_ID (FK)MONTH (FK)

LOTS_OF_FACTSUPDATE_DTLOAD_DT

FACTS_3

SUBSCRIBER_ID (FK)MONTH (FK)

LOTS_OF_FACTSUPDATE_DTLOAD_DT

FACTS_2

SUBSCRIBER_ID (FK)MONTH (FK)

LOTS_OF_FACTSUPDATE_DTLOAD_DT

FACTS_1

SUBSCRIBER_ID (FK)MONTH (FK)

LOTS_OF_FACTSUPDATE_DTLOAD_DT

BILLED_AMOUNTS

SUBSCRIBER_ID (FK)MONTH (FK)

LOTS_OF_BILLED_REVENUE_FACTSUPDATE_DTLOAD_DT

DROPPED_AND_BLOCKED_USAGE

DROPPED_CALLS_CNTUPDATE_DTLOAD_DT

EQUIPMENT_ACTIVITY_SNAPSHOT

SUBSCRIBER_ID (FK)MONTH (FK)

HANDSET_COUNTABLE_DATAUPDATE_DTLOAD_DT

PRODUCT_SKU

PRODUCT_SKU_KEY

PRODUCT_INFOUPDATE_DTLOAD_DT

SUBSCRIBER_BILL_CYCLE_SNAPSHOT

SUBSCRIBER_IDMONTH

CLIENT_ACCOUNT_ID (FK)CLIENT_SNAPSHOT_MONTH (FK)CURRENT_HANDSET_KEY (FK)FIRST_HANDSET_KEY (FK)SUBSCRIBER_PERSONAL_INFOSUBSCRIBER_DEMOGRAPHIC_INFOCONTRACT_INFORATE_PLAN_INFOMUCH_MORE_INFOLOAD_DTUPDATE_DT

VALUE_ADDED_SERVICES

VALUE_ADDED_SERVICES_KEY

VALUE_ADDED_SERVICES_DESVALUE_ADDED_SERVICES_CDVALUE_ADDED_SERVICES_IDUPDATE_DTLOAD_DT

VALUE_ADDED_SERVICES_BILLED

MONTH (FK)SUBSCRIBER_ID (FK)VALUE_ADDED_SERVICES_KEY (FK)

VALUE_ADDED_SERVICES_AMTUPDATE_DTLOAD_DT

BLOCKED_USAGE

BLOCKED_CALLS_CNTUPDATE_DTLOAD_DT

CLIENT_ACCOUNT_BILL_CYCLE_SNAPSHOT

CLIENT_ACCOUNT_IDCLIENT_SNAPSHOT_MONTH

INFO_ON_THE_ACCOUNTACCOUNT_NO_OF_CANCELLED_SUBSACCOUNT_NO_OF_SUSPENDED_SUBSACCOUNT_NO_OF_ACTIVE_SUBSLOAD_DTUPDATE_DT

08/01/2002 1,735,436.26 281,032.21 388,970.91 0.00 516,911.31 462,667.55 749,330.77 6,371.95 500,183.42 08/14/2002

• Select Sample– Population– Data for given

modeling effort

• Denormalize completely

Page 59: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

59

Prepare Model Set

Prepare Model Set6

Denormalize completely

Example of Denormalization for Data Mining

Handset 1 Handset 2 Handset 3 Handset 40 1 0 0

BILLED_USAGE

MONTH (FK)SUBSCRIBER_ID (FK)

TOTAL_CALLS_MOUTOTAL_CALLS_CNTLONG_DISTANCE_CALLS_MOULONG_DISTANCE_CALLS_CNTWEEKEND_CALLS_MOUWEEKEND_CALLS_CNTEVENING_CALLS_MOUEVENING_CALLS_CNTPEAK_CALLS_MOUPEAK_CALLS_CNTUPDATE_DTLOAD_DTOUT_BUCKET_CALLS_MOUIN_BUCKET_CALLS_MOUOUT_BUCKET_CALLS_CNTIN_BUCKET_CALLS_CNTROAMING_CALLS_MOUROAMING_CALLS_CNT

BILLED_USAGE_3_MTH_AVG

MONTH (FK)SUBSCRIBER_ID (FK)

TOTAL_CALLS_MOU_3_MTH_AVGLONG_DISTANCE_CALLS_MOU_3_MTH_WEEKEND_CALLS_MOU_3_MTH_AVGEVENING_CALLS_MOU_3_MTH_AVGPEAK_CALLS_MOU_3_MTH_AVGUPDATE_DTLOAD_DT

BUCKET_USAGE

MONTH (FK)SUBSCRIBER_ID (FK)

UPDATE_DTLOAD_DTPERCENT_BUCKET_USED

PRODUCT_SKU

PRODUCT_SKU_KEY

PRODUCT_INFOUPDATE_DTLOAD_DT

SUBSCRIBER_BILL_CYCLE_SNAPSHOT

SUBSCRIBER_IDMONTH

CLIENT_ACCOUNT_ID (FK)CLIENT_SNAPSHOT_MONTH (FK)CURRENT_HANDSET_KEY (FK)FIRST_HANDSET_KEY (FK)SUBSCRIBER_PERSONAL_INFOSUBSCRIBER_DEMOGRAPHIC_INFOCONTRACT_INFORATE_PLAN_INFOMUCH_MORE_INFOLOAD_DTUPDATE_DT

VALUE_ADDED_SERVICES

VALUE_ADDED_SERVICES_KEY

VALUE_ADDED_SERVICES_DESVALUE_ADDED_SERVICES_CDVALUE_ADDED_SERVICES_IDUPDATE_DTLOAD_DT

VALUE_ADDED_SERVICES_BILLED

MONTH (FK)SUBSCRIBER_ID (FK)VALUE_ADDED_SERVICES_KEY (FK)

VALUE_ADDED_SERVICES_AMTUPDATE_DTLOAD_DT

Page 60: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

60

Prepare Model Set

Prepare Model Set6

Example of Casting a a Test Set for Churn Modeling

• Sliding Windows Concept

O c t N o v D e c J a n F e b M a r A p r M a y J u n J u l

M o d e l S e t F e b 3 2 1 X PM o d e l S e t M a r 3 2 1 X PM o d e l S e t A p r 3 2 1 X P

S c o r e S e t 3 2 1 X P P

Page 61: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

61

Conduct Modeling

7

• Done by business team with help from Vendor (SAS)– Decision Trees and Neural Networks (Churn)– Clustering (Segmentation)

Decision Trees

Neural Networks

Clustering

Page 62: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

62

Various Marketing and CRM Activities

The Action

Page 63: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

63

Measuring

• Not there yet– Intend to compare actioned vs. non-actioned results– 50 % to be actioned

Page 64: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

64

And the cycle continues ...

Business Problem

Transform Data

Act

Measure

Page 65: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002

65

AQ&Q U E S T I O N SQ U E S T I O N SA N S W E R SA N S W E R S

Jason [email protected]