caravan insurance data mining statistical analysis

Upload: muthu-kumaar

Post on 18-Nov-2014

Page 1: Caravan insurance data mining statistical analysis

K6255 – Knowledge Discovery and Data Mining

Statistical Analysis of Caravan Insurance using IBM SPSS

Muthu Kumaar Thangavelu (G1101765E)

[email protected]

1. INTRODUCTION:

The data set contains information on customers of an insurance company. It includes product usage data and socio-demographic data derived from zip area codes, supplied by the Dutch data mining company Sentient Machine Research. Our aim is to identify the group of customers who will be interested in buying caravan insurance and to build a predictive model from the 86 given variables, which represent the socio-demographic, education, insurance interest and income levels of customers.

2. STATISTICAL ANALYSIS

2.1. DATA PREPARATION:

2.1.1. ANALYZING AND CATEGORIZING THE VARIABLES:

We extract and analyze the raw variables with their labels and categorize them based on an understanding of the insurance product and its buyers. We reduce the broad range of 86 variables to the significant predictors below.

CUST_SUB_LIFESTYLE_REFLECTION:

The customer sub type variable MOSTYPE has 41 value types, which can be grouped into two broad classes according to age, social class, lifestyle and attitude towards investing or spending:

- Middle and upper class, middle-aged and senior citizens, high-risk cultured liberal investors (2, 3, 4, 5, 8, 9, 12, 13, 15, 23, 25, 27, 36)

- Distributed age and social class, low-risk cultured conservative investors (1, 6, 7, 10, 11, 14, 16, 17, 18, 19, 20, 21, 22, 24, 26, 28, 29, 30, 31, 32, 33, 34, 35, 37, 38, 39, 40, 41)
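
The recoding of MOSTYPE into the two classes above can be sketched in Python as follows. This is only an illustration: the report performed the recoding in IBM SPSS, and the function name and the 1/0 coding direction (1 for the liberal investor class, 0 for the conservative class, as used in Fig. 1.2) are choices made here for the example.

```python
# MOSTYPE codes that the text assigns to the "liberal investor" class.
LIBERAL_MOSTYPE = {2, 3, 4, 5, 8, 9, 12, 13, 15, 23, 25, 27, 36}


def recode_mostype(mostype: int) -> int:
    """Return 1 for the liberal investor class, 0 for the conservative class.

    The function name is hypothetical; the grouping follows the two lists above.
    """
    if not 1 <= mostype <= 41:
        raise ValueError(f"MOSTYPE must be between 1 and 41, got {mostype}")
    return 1 if mostype in LIBERAL_MOSTYPE else 0


# Every one of the 41 sub types falls into exactly one of the two classes.
recoded = {code: recode_mostype(code) for code in range(1, 42)}
```

Keeping the liberal codes in a set makes the membership test explicit and leaves every remaining code in the conservative class by default, which mirrors how the two lists partition the 41 values.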

CUST_LEVEL_LIFECYCLE:

Average age MGEMLEEF holds 6 types of values which can be categorised into three groups and are

based on family status and age.

- Young, family starters (1)

- Middle aged family men (2, 3, and 4)

- Senior, family men (5, 6)
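
A minimal Python sketch of the three-way grouping of MGEMLEEF is shown below. The function name and the group labels are hypothetical; only the code ranges come from the list above.

```python
def lifecycle_group(mgemleef: int) -> str:
    """Map an MGEMLEEF code (1..6) to one of the three lifecycle groups."""
    if mgemleef == 1:
        return "young_family_starter"
    if 2 <= mgemleef <= 4:
        return "middle_aged_family"
    if 5 <= mgemleef <= 6:
        return "senior_family"
    raise ValueError(f"MGEMLEEF must be between 1 and 6, got {mgemleef}")


groups = [lifecycle_group(code) for code in range(1, 7)]
```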

CUST_MAIN_SPEND_INVEST_ATTITUDE:

Customer main type MOSHOOFD can be classified into two groups based on the attitude of

customers towards buying / spending.

- Liberals (1, 2, 5, 6)

- Conservatives (3, 4, 7, 8, 9, 10)

CUST_MARITAL_STAT:

MRELGE, MRELSA, MRELOV, MFALLEEN describe the relationship status of a person which can be

combined into two categories signifying the marital status

- Married (MRELGE)

- Unmarried (MRELSA, MRELOV, MFALLEEN)

CUST_WORK_CATEGORY_PROFILE:

Variables 19 – 24 describe the profile of work category of a person which can be of 2 types.

- High-profile work categories with high income potential (MBERHOOG, MBERZELF, MBERMIDD)

- Low-profile work categories with relatively lower income potential (MBERBOER, MBERARBG, MBERARBO)

CUST_INCOME_LEVEL:

Variables 37 to 41 represent the income of a person, which can be grouped into three classes:

- Low (MINKM30)

- Middle (MINK3045, MINK4575)

- High (MINK7512, MINK123M)

These can be best represented by a standalone factor depicting the average income (MINKGEM).

CUST_INSURANCE_INTEREST:

Variables 44 to 85, together with 35 and 36, describe the interest of customers in various insurance policies. These range from much needed policies (life, health, disability, family/private accident), through optimal policies for property, small automobiles of individuals (especially where replacing damaged parts costs as much as a new vehicle), company delivery vehicles operated by third party drivers and industrial machines, to the most sophisticated policies offering luxury and high safety, such as private third party insurance, where the insurer pays the third party even if the insured is at fault. Car, fire and social security insurance also represent forms of luxury or high sophistication. Hence the classification below applies to both the number of policies and the contribution to policies of different customers:

- Individuals opting for sophisticated, high safety insurance policies (WAPART, PERSAUT, BRAND, BYSTAND)

- Firms/individuals opting for much needed and optimal safety insurance policies (all others)

2.1.2. MAPPING TARGET VARIABLES AS PREDICTORS OF CARAVAN INSURANCE BUYERS:

These predictions are based on the descriptive statistics of the data set (Appendix-1) together with real-world reasoning.

FACTOR 1: AGE

Middle aged people are more likely to get caravan insurance

FACTOR 2: ATTITUDE TOWARDS SPENDING/ BUYING

People with a liberal attitude predicted by Customer Main type are more likely to get caravan

insurance

FACTOR 3: SOCIAL LIFE STYLE REFLECTOR

People who are modern, professional, middle and upper class and liberal investors of their income, as predicted by Customer Sub type, are likely to get caravan insurance.

FACTOR 4: MARITAL STATUS

Married Family Men are more likely to buy caravan insurance

FACTOR 5: WORK CATEGORY PROFILE

Potential income generating high profile work category people are more likely to get the insurance.

FACTOR 6: INCOME LEVEL

Average, middle scale Income generators are more likely to get caravan insurance

Here the variable MINKGEM acts as a standalone factor to represent the average income of a

person.

FACTOR 7: INSURANCE INTEREST

Individuals opting highly sophisticated high safety Insurance policies are more likely to buy caravan

insurance

FACTOR 8: PURCHASING POWER CLASS

Individuals who can afford to buy high cost products are more likely to buy caravan insurance, as it is not a need but a luxury aimed at average and high income generators.

FACTOR 9: RENTED HOME RESIDENTS

Residents of rented homes might own a house in their native place, might have settled elsewhere in a rented home for work and family convenience, or might not have enough savings to invest in a home. All these individuals are more likely to be interested in caravan insurance, as they are in need of a local asset.

FACTOR 10: CAR OWNERSHIP:

Car ownership signals buying power, average income and an interest in cars and driving, so car owners may be interested in buying a caravan and its insurance scheme.

People who own more than one car are unusual and are likely to be car enthusiasts who look for the best quality and the most fashionable new models; caravans are unlikely to suit their needs.

2.2. DATA TRANSFORMATION

2.2.1. INDEPENDENCE OF DEPENDENT VARIABLES WITH RESPECT TO PREDICTION PARAMETERS:

CUSTOMER SUB TYPE (MOSTYPE) variable represents a combination of the age factor,

spending/buying attitude and social life style. Hence it can be used as a standalone factor for

predicting the potential buyers.

MARRIED PEOPLE are represented by MRELGE and the rest of the variables describing relationship

status can be ignored

2.2.2. INTERACTION VARIABLES DEFINITION FOR INDEPENDENT REPRESENTATION OF A

COMBINATION:

PURCHASING POWER CLASS * AVERAGE INCOME

Work category, income level and purchasing power class can be combined. Average income generators with a high profile work category who belong to the purchasing power class are represented by the interaction of the independent variables Average Income (MINKGEM) and Purchasing Power Class (MKOOPKLA).

PWAPART*PBRAND*PBYSTAND*PPERSAUT

People who are already interested in buying sophisticated insurance policies are most likely to choose caravan insurance. The interaction (cross product) of the contributions to fire, third party, social security and car insurance therefore represents a high probability of getting caravan insurance.
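
The two kinds of interaction terms can be sketched in Python as below. This is an illustration under assumptions, not the SPSS procedure itself: the rows are made-up values (not taken from the data set), and the helper names are hypothetical. For categorical variables, the cross product amounts to one cell per combination of levels; for continuous variables, it is the arithmetic product.

```python
from collections import Counter

# Made-up example rows for illustration only (NOT records from the data set).
rows = [
    {"PBRAND": 1, "PBYSTAND": 0, "PPERSAUT": 6, "PWAPART": 2, "MINKGEM": 4, "MKOOPKLA": 3},
    {"PBRAND": 1, "PBYSTAND": 0, "PPERSAUT": 6, "PWAPART": 2, "MINKGEM": 5, "MKOOPKLA": 6},
    {"PBRAND": 4, "PBYSTAND": 0, "PPERSAUT": 0, "PWAPART": 0, "MINKGEM": 3, "MKOOPKLA": 2},
]


def insurance_interaction_cell(row: dict) -> tuple:
    """Categorical interaction: the combination of levels of the four variables."""
    return (row["PBRAND"], row["PBYSTAND"], row["PPERSAUT"], row["PWAPART"])


def income_power_interaction(row: dict) -> int:
    """Continuous interaction: product of average income and purchasing power class."""
    return row["MINKGEM"] * row["MKOOPKLA"]


# Count how many rows fall into each categorical cell.
cell_counts = Counter(insurance_interaction_cell(r) for r in rows)
products = [income_power_interaction(r) for r in rows]
```

Because every combination of levels becomes its own cell, a four-way categorical interaction can produce many sparsely populated cells, which is worth keeping in mind when reading the coefficient table in Appendix-4.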

2.2.3. DIMENSION REDUCTION BY PRINCIPAL COMPONENT ANALYSIS:

Almost all variables used in the final model are largely independent of one another, each capturing a different aspect of caravan insurance buying behaviour:

CUST_SUB_LIFESTYLE_REFLECTION – Social Lifestyle and Attitude towards Spending/investing

MRELGE – Marital Status

MAUT1 – Single Car Owner

MHHUUR – Rented Home Resident

PBRAND, PPERSAUT, PBYSTAND, PWAPART – Contribution towards different sophisticated and high

safety Insurance policies.

The only two variables with significant correlation are MINKGEM and MKOOPKLA, where a large overlap in the population is logically expected: the potential purchasing class should have a high or middle scaled average income, which forms most of the MINKGEM variable. These two dimensions can therefore be reduced to a single component.

Factor Analysis was carried out and the extracted component was rotated and coded as a regression variable in the data set.

This new variable, PURCHASING_POWER_CLASS_INK, represents the component of MINKGEM and MKOOPKLA extracted through PCA.

The factor analysis results are attached in Appendix-3
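
For two standardized variables, the first principal component can be worked out by hand, which gives a quick check on the figures reported in Appendix-3. The sketch below assumes PCA on the correlation matrix and takes the correlation of 0.452 between MINKGEM and MKOOPKLA from that appendix; the function name is hypothetical.

```python
import math


def first_component_two_variables(r: float) -> tuple:
    """First principal component of the 2x2 correlation matrix [[1, r], [r, 1]], r > 0.

    Returns (eigenvalue, loading, communality, residual_correlation).
    """
    eigenvalue = 1.0 + r                    # larger eigenvalue of the matrix
    loading = math.sqrt(eigenvalue / 2.0)   # same loading for both variables
    communality = loading ** 2              # variance of each variable reproduced
    residual = r - loading * loading        # observed minus reproduced correlation
    return eigenvalue, loading, communality, residual


eigenvalue, loading, communality, residual = first_component_two_variables(0.452)
print(round(loading, 3), round(communality, 3), round(residual, 3))
# prints: 0.852 0.726 -0.274
```

These values agree with the component loadings (.852), reproduced communalities (.726) and residual (-.274) in Appendix-3. The single component carries 1.452 / 2 = 72.6% of the total variance of the two variables.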

2.3. DATA ANALYSIS:

2.3.1. APPLYING LOGISTIC REGRESSION: (WITHOUT INCLUDING THE VARIABLE REDUCED BY PCA)

2.3.1.1. CHOSEN VARIABLES REPRESENTING INDEPENDENT FACTORS TO PREDICT THE CARAVAN

INSURANCE BUYERS:

The predictor variables are represented in 2 blocks of covariates for the dependent variable,

CARAVAN (0- Customers will not buy, 1- will buy)

BLOCK 1:

CUST_SUB_LIFESTYLE_REFLECTION (Social Life Style Reflector)

MRELGE (Marital Status)

MAUT1 (Car Ownership factor - Single Car Indicating potential income generation)

MHHUUR (Rented Home Residents - Potential Earning Factor)

BLOCK 2: (INTERACTION VARIABLES)

PBRAND, PBYSTAND, PPERSAUT, PWAPART (Customer Insurance Interest factor on sophisticated and

high Safety policies)

MKOOPKLA, MINKGEM (Purchasing Power Class with Average Income Level factor)

Method: FORWARD LR

Cut Off Value: 0.5

Probability Entry Criteria: 0.05

Probability Exit Criteria: 0.10

2.3.1.2. CHOOSING THE CATEGORICAL VARIABLES:

Variables that internally represent categories of users are to be marked as categorical in a logistic regression.

In our case, the contributions to various insurance policies (PWAPART, PPERSAUT, PBRAND and PBYSTAND) represent internal categories such as high, average and low. They are not evenly distributed across their base value types, as seen in Fig 1.3, 1.4, 1.5 and 1.6, and hence they are treated as categorical.

Customer Sub type (CUST_SUB_LIFESTYLE_REFLECTION) represents two main categories: middle and upper class, middle aged and senior citizens, high risk cultured liberal investors; and distributed age and social class, low risk cultured conservative investors. These values are not evenly spread, as seen in Fig 1.2, so the variable is treated as categorical.

All other variables are treated as continuous, since each holds values for the single category it stands for: MAUT1 (Owning a Single Car), MRELGE (Married), MHHUUR (Rented Home Residents), MINKGEM (Average Income), MKOOPKLA (Purchasing Power Class).

The regression converged in two steps in block 2 and the prediction model was generated.

The model summary and predictor equation are described in Appendix-2.

2.3.1.3. GENERATED EQUATION BY LOGISTIC REGRESSION FOR PREDICTING POTENTIAL CARAVAN

INSURANCE BUYERS:

logit(P(CARAVAN = 1)) = 0.073 (MAUT1) + 0.069 (MRELGE) - 0.018 (MHHUUR) - 0.376 (CUST_SUB_LIFESTYLE_REFLECTION(1)) + 0.016 (MINKGEM by MKOOPKLA) + (PBRAND by PBYSTAND by PPERSAUT by PWAPART) - 2.924

The fit of the model, as measured by the Nagelkerke R square value, is 19.3%.
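
Nagelkerke R square is a pseudo R square measure of fit rather than a classification accuracy. For reference, the standard definitions of the two measures reported by SPSS are given below. The symbols are notation introduced here and do not appear in the SPSS output: L_0 is the likelihood of the intercept-only model, L_M the likelihood of the fitted model and n the sample size.

```latex
R^2_{\mathrm{CS}} = 1 - \left(\frac{L_0}{L_M}\right)^{2/n},
\qquad
R^2_{\mathrm{N}} = \frac{R^2_{\mathrm{CS}}}{1 - L_0^{2/n}}
```

The Nagelkerke value rescales the Cox & Snell value by its maximum attainable value, so that a perfect model would reach 1.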

2.3.2. APPLYING LOGISTIC REGRESSION: (WITH THE VARIABLE REDUCED BY PCA)

With the new component extracted with PCA, PURCHASING_POWER_CLASS_INK, we can apply

logistic regression along with other variables.

The regression converged in the first step.

The predictor model is almost the same as the one obtained above without the PCA component, and is given by the equation

logit(P(CARAVAN = 1)) = 0.093 (MAUT1) + 0.069 (MRELGE) - 0.024 (MHHUUR) - 0.345 (CUST_SUB_LIFESTYLE_REFLECTION(1)) + 0.237 (PURCHASING_POWER_CLASS_INK) + (PBRAND by PBYSTAND by PPERSAUT by PWAPART) - 2.336

This model has a comparable fit, with a Nagelkerke R square of 19.2%.

The model summary and predictor equation are described in Appendix-4.
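
To turn the equation into a predicted probability, the linear predictor is passed through the logistic function and compared with the cut off value of 0.5. The Python sketch below uses the coefficients quoted in the equation above. The function names are hypothetical, and the contribution of the categorical interaction term is passed in by the caller as an assumption of this example, since its coefficient depends on the category combination (see Appendix-4) and is not a single number.

```python
import math

# Coefficients as quoted in the equation of section 2.3.2.
COEFFICIENTS = {
    "MAUT1": 0.093,
    "MRELGE": 0.069,
    "MHHUUR": -0.024,
    "CUST_SUB_LIFESTYLE_REFLECTION(1)": -0.345,
    "PURCHASING_POWER_CLASS_INK": 0.237,
}
CONSTANT = -2.336


def caravan_probability(values: dict, interaction_term: float = 0.0) -> float:
    """Predicted probability of CARAVAN = 1.

    interaction_term is the coefficient B of the matching
    PBRAND by PBYSTAND by PPERSAUT by PWAPART combination (caller-supplied).
    """
    z = CONSTANT + interaction_term
    z += sum(COEFFICIENTS[name] * values[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-z))


def predict_buyer(probability: float, cutoff: float = 0.5) -> int:
    """Apply the classification cut off used in the report (0.5)."""
    return 1 if probability >= cutoff else 0


# Baseline case: every predictor at zero and no interaction contribution.
baseline = caravan_probability({name: 0.0 for name in COEFFICIENTS})
```

With all predictors at zero the probability depends only on the constant, which gives a convenient sanity check of the sign conventions before plugging in real customer values.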

3. MODEL INSIGHTS AND CONCLUSION:

The initial variables have been carefully interpreted and classified to reflect the socio-demographic, education, lifestyle, income, car and insurance interest properties relevant to the product type. The variables predicted on logical grounds to be significant were then analyzed against the descriptive statistics of the target variables in the data set using IBM SPSS. Dimension reduction, variable recoding and the definition of interaction variables were carried out to obtain accurate and independent predictors. Logistic regression then gives the required predictor model.

The model should be broad in its predictions, with sound real-world reasons behind the categorization and recoding of variables, so that it holds for most possible cases and avoids overfitting.

Appendix -1

DESCRIPTIVE STATISTICS – CROSS TAB RESULTS

Fig 1.0. Rental Home Residents Caravan Insurance Buying Pattern

Fig 1.1. Purchasing Power Class Caravan Insurance Buying Pattern

Fig.1.2. Social Lifestyle based Caravan Insurance Buying Pattern (RECODED VARIABLE)

1 – Middle and Upper Class, middle aged and senior citizens, high risk cultured liberal investors

0 - Distributed age and social class, low risk cultured conservative investors

Fig 1.3. Third Party Insurance Buyers and Caravan Insurance buyers

Fig 1.4. Car Insurance Buyers and Caravan Insurance Buyers

Fig 1.5. Fire Insurance Contribution and Caravan Insurance Interest

Fig 1.6. Social Security Insurance Vs Caravan Insurance Buyers

Appendix -2: (Logistic Regression Summary and Last Convergence Results without PCA Component)

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      2220.272a           .069                   .189
2      2210.325a           .070                   .193

a. Estimation terminated at iteration number 20 because maximum iterations has been reached. Final solution cannot be found.

Converged Predictors and corresponding Coefficients in binary logistic regression (BLOCK 2 - Second Step)

Variables in the Equation

B   S.E.   Wald   df   Sig.   Exp(B)

...

The Cross Product continuing up to (4x4 combinations)

a. Variable(s) entered on step 1: PBRAND * PBYSTAND * PPERSAUT * PWAPART .

b. Variable(s) entered on step 2: MINKGEM * MKOOPKLA .

Appendix -3 (Logistic Regression with reduced component with PCA)

Initial Components (Average Income and Purchasing Power Class) Vs Principal Component Extracted

PRINCIPAL COMPONENT ANALYSIS: FACTOR ANALYSIS:

Correlation Matrix

                             MINKGEM   MKOOPKLA
Correlation       MINKGEM    1.000     .452
                  MKOOPKLA   .452      1.000
Sig. (1-tailed)   MINKGEM              .000
                  MKOOPKLA   .000

After Principal Component Analysis -

Component Matrix (a)

           Component 1
MINKGEM    .852
MKOOPKLA   .852

Extraction Method: Principal Component Analysis.

a. 1 components extracted.

Reproduced Correlations

                                    MINKGEM   MKOOPKLA
Reproduced Correlation   MINKGEM    .726a     .726
                         MKOOPKLA   .726      .726a
Residual (b)             MINKGEM              -.274
                         MKOOPKLA   -.274

Extraction Method: Principal Component Analysis.

a. Reproduced communalities

b. Residuals are computed between observed and reproduced correlations. There are 1 (100.0%) nonredundant residuals with absolute values greater than 0.05.

APPENDIX -4:

After PCA with the Reduced Component – Binary Logistic Regression with other predictor variables

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      2213.728a           .070                   .192

a. Estimation terminated at iteration number 20 because maximum iterations has been reached. Final solution cannot be found.

Variables in the Equation (Step 1a)

Variable                                                B         S.E.       Wald     df   Sig.   Exp(B)
CUST_SUB_LIFESTYLE_REFLECTION(1)                        -.345     .124       7.778    1    .005   .709
PURCHASING_POWER_CLASS_INK                              .237      .068       12.009   1    .001   1.268
MHHUUR                                                  -.024     .024       1.049    1    .306   .976
MAUT1                                                   .093      .040       5.315    1    .021   1.098
PBRAND * PBYSTAND * PPERSAUT * PWAPART                                       207.422  112  .000
PBRAND(1) by PBYSTAND(1) by PPERSAUT(1) by PWAPART(1)   -1.467    .779       3.549    1    .060   .231
PBRAND(1) by PBYSTAND(1) by PPERSAUT(1) by PWAPART(2)   -18.885   7541.184   .000     1    .998   .000
PBRAND(1) by PBYSTAND(1) by PPERSAUT(1) by PWAPART(3)   -1.627    .960       2.874    1    .090   .197
PBRAND(1) by PBYSTAND(1) by PPERSAUT(2) by PWAPART(1)   -19.134   40192.970  .000     1    1.000  .000
PBRAND(1) by PBYSTAND(1) by PPERSAUT(3) by PWAPART(1)   -3.743    1.257      8.862    1    .003   .024
PBRAND(1) by PBYSTAND(1) by PPERSAUT(3) by PWAPART(3)   -.218     1.065      .042     1    .838   .804
...
PBRAND(7) by PBYSTAND(4) by PPERSAUT(4) by PWAPART(1)   -19.341   23141.295  .000     1    .999   .000
PBRAND(8) by PBYSTAND(1) by PPERSAUT(1) by PWAPART(1)   -18.797   28317.506  .000     1    .999   .000
PBRAND(8) by PBYSTAND(1) by PPERSAUT(1) by PWAPART(3)   -19.114   40192.970  .000     1    1.000  .000
PBRAND(8) by PBYSTAND(1) by PPERSAUT(4) by PWAPART(1)   -19.252   28290.099  .000     1    .999   .000
PBRAND(8) by PBYSTAND(1) by PPERSAUT(4) by PWAPART(3)   -18.921   28301.176  .000     1    .999   .000
PBRAND(8) by PBYSTAND(1) by PPERSAUT(5) by PWAPART(3)   -19.476   40192.970  .000     1    1.000  .000
Constant                                                -2.336    .812       8.271    1    .004   .097

a. Variable(s) entered on step 1: PBRAND * PBYSTAND * PPERSAUT * PWAPART .
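
The Exp(B) column is the odds ratio obtained by exponentiating the coefficient B. As a small consistency check, the Python sketch below recomputes it for the non-interaction rows of the table. The B and Exp(B) pairs are copied from the table; the tolerance is an assumption made here to allow for the fact that the printed B values are rounded to three decimals.

```python
import math

# (B, reported Exp(B)) pairs copied from the table above.
REPORTED = {
    "CUST_SUB_LIFESTYLE_REFLECTION(1)": (-0.345, 0.709),
    "PURCHASING_POWER_CLASS_INK": (0.237, 1.268),
    "MHHUUR": (-0.024, 0.976),
    "MAUT1": (0.093, 1.098),
    "Constant": (-2.336, 0.097),
}

TOLERANCE = 0.002  # assumed allowance for rounding of the printed coefficients

# Recompute the odds ratio exp(B) and compare it with the reported value.
checks = {
    name: abs(math.exp(b) - reported) < TOLERANCE
    for name, (b, reported) in REPORTED.items()
}

for name, ok in checks.items():
    print(f"{name}: exp(B) = {math.exp(REPORTED[name][0]):.3f}, matches table: {ok}")
```

An odds ratio above 1 (for example PURCHASING_POWER_CLASS_INK) means the odds of buying caravan insurance rise as the predictor increases, while a value below 1 (for example MHHUUR) means they fall.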