K6255 – Knowledge Discovery and Data Mining
Statistical Analysis of Caravan Insurance using IBM SPSS
Muthu Kumaar Thangavelu (G1101765E)
1. INTRODUCTION:
The data set contains information on customers of an insurance company. It includes product usage
data and socio-demographic data derived from zip area codes, and was supplied by the Dutch data
mining company Sentient Machine Research. Our aim is to identify the group of customers who are
likely to be interested in buying caravan insurance and to build a predictive model from the given 86
variables, which represent the socio-demographic, education, insurance interest and income levels
of customers.
2. STATISTICAL ANALYSIS
2.1. DATA PREPARATION:
2.1.1. ANALYZING AND CATEGORIZING THE VARIABLES:
We extract and analyze the raw variables with their labels and categorize them based on our
understanding of the insurance product and its buyers. We group the broad range of 86 variables
into the significant predictors below:
CUST_SUB_LIFESTYLE_REFLECTION:
The customer sub type variable MOSTYPE has 41 value types, which can be grouped into two broad
classes that relate to age, social class, lifestyle and attitude towards investing or spending,
as follows:
- Middle and Upper Class, middle aged and senior citizens, high risk cultured liberal investors (8, 9,
12, 13, 23, 25, 36, 2, 3, 4, 5, 15, and 27)
- Distributed age and social class, low risk cultured conservative investors
(1,6,7,10,11,14,16,17,18,19,20,21,22,24,26,28,29,30,31,32,33,34,35,37,38,39,40,41)
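The two-class recoding of MOSTYPE listed above can be sketched in Python as follows. This is only a minimal illustration and not the SPSS recode actually used; the function name `recode_mostype` is hypothetical, and the 1/0 coding is assumed to follow the labels given for Fig 1.2 in Appendix-1.

```python
# MOSTYPE codes listed in the text for the class
# "middle and upper class ... high risk cultured liberal investors"
LIBERAL_INVESTOR_SUBTYPES = {2, 3, 4, 5, 8, 9, 12, 13, 15, 23, 25, 27, 36}


def recode_mostype(mostype: int) -> int:
    """Return 1 for the liberal-investor class and 0 for the conservative class.

    The name and the 1/0 coding are illustrative assumptions.
    """
    if not 1 <= mostype <= 41:
        raise ValueError(f"MOSTYPE must be between 1 and 41, got {mostype}")
    return 1 if mostype in LIBERAL_INVESTOR_SUBTYPES else 0


print([recode_mostype(value) for value in (8, 1, 36, 41)])
```

Every one of the 41 codes falls into exactly one class: 13 codes map to 1 and the remaining 28 map to 0.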
CUST_LEVEL_LIFECYCLE:
Average age MGEMLEEF takes 6 values, which can be categorised into three groups based on family
status and age.
- Young, family starters (1)
- Middle aged family men (2, 3, and 4)
- Senior, family men (5, 6)
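The grouping of MGEMLEEF into three lifecycle levels can be expressed as a small lookup. The sketch below assumes the six codes are the integers 1 to 6 as listed above; the function name and the group labels are hypothetical.

```python
def lifecycle_group(mgemleef: int) -> str:
    """Map an MGEMLEEF code (1-6) to one of the three lifecycle groups in the text."""
    if mgemleef == 1:
        return "young_family_starter"
    if mgemleef in (2, 3, 4):
        return "middle_aged_family"
    if mgemleef in (5, 6):
        return "senior_family"
    raise ValueError(f"MGEMLEEF must be between 1 and 6, got {mgemleef}")


print([lifecycle_group(code) for code in range(1, 7)])
```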
CUST_MAIN_SPEND_INVEST_ATTITUDE:
Customer main type MOSHOOFD can be classified into two groups based on the attitude of
customers towards buying / spending.
- Liberals (1, 2, 5, 6)
- Conservatives (3, 4, 7, 8, 9, 10)
CUST_MARITAL_STAT:
MRELGE, MRELSA, MRELOV, MFALLEEN describe the relationship status of a person which can be
combined into two categories signifying the marital status
- Married (MRELGE)
- Unmarried (MRELSA, MRELOV, MFALLEEN)
CUST_WORK_CATEGORY_PROFILE:
Variables 19 – 24 describe the profile of work category of a person which can be of 2 types.
- Potential income generating high profile work category (MBERHOOG, MBERZELF, MBERMIDD)
- Relatively less Potential Income generating low profile work category (MBERBOER, MBERARBG,
MBERARBO)
CUST_INCOME_LEVEL:
Variables 37 to 41 represent the income of a person which can be grouped into three classes:
- Low (MINKM30)
- Middle (MINK3045, MINK4575)
- High (MINK7512, MINK123M)
These can be best represented by a standalone factor depicting the average income (MINKGEM).
CUST_INSURANCE_INTEREST:
Variables 44 to 85, together with 35 and 36, describe the interest of customers in various insurance
policies. These range from much needed policies (life, health, disability, family/private accidents),
through optimal policies for property, for small automobiles of individuals (especially where
replacing damaged parts costs as much as buying a new vehicle), for company delivery vehicles
operated by third party drivers and for industrial machines, to the most sophisticated policies
offering luxury and high safety, such as private third party insurance, where the insurer pays off the
third party even if the insured is at fault. Car, fire and social security insurance also represent
forms of luxury or high sophistication. Hence both the number of policies and the contribution to
policies are classified as follows:
- Individuals opting for sophisticated, high safety insurance policies (WAPART, PERSAUT, BRAND,
BYSTAND)
- Firms/individuals opting for much needed and optimal safety insurance policies (all others)
2.1.2. MAPPING TARGET VARIABLES AS PREDICTORS OF CARAVAN INSURANCE BUYERS:
These predictions are based on the descriptive statistics of the data set (Appendix-1) together with
real-world reasoning.
FACTOR 1: AGE
Middle aged people are more likely to get caravan insurance
FACTOR 2: ATTITUDE TOWARDS SPENDING/ BUYING
People with a liberal attitude predicted by Customer Main type are more likely to get caravan
insurance
FACTOR 3: SOCIAL LIFE STYLE REFLECTOR
People who are modern, professional, middle and upper class and liberal investors of their income,
as predicted by Customer Sub type, are likely to get caravan insurance.
FACTOR 4: MARITAL STATUS
Married Family Men are more likely to buy caravan insurance
FACTOR 5: WORK CATEGORY PROFILE
Potential income generating high profile work category people are more likely to get the insurance.
FACTOR 6: INCOME LEVEL
Average, middle scale Income generators are more likely to get caravan insurance
Here the variable MINKGEM acts as a standalone factor to represent the average income of a
person.
FACTOR 7: INSURANCE INTEREST
Individuals opting highly sophisticated high safety Insurance policies are more likely to buy caravan
insurance
FACTOR 8: PURCHASING POWER CLASS
Individuals who can afford high cost products are more likely to buy caravan insurance, since it is
not a need but a luxury aimed at average and high income earners.
FACTOR 9: RENTED HOME RESIDENTS
Residents of a rented home might own a house in their home town, might have settled elsewhere in a
rented home for work and family convenience, or might not have enough savings to invest in a home.
All these individuals are more likely to be interested in caravan insurance as they are in need of a
local asset.
FACTOR 10: CAR OWNERSHIP:
People who own a car signify their buying power, average income and also their interest in cars and
driving and can be interested in buying a caravan and its insurance scheme.
People who own more than one car are unusual and are likely to be car enthusiasts who look for the
best quality and the newest, most fashionable models; caravans are unlikely to suit their needs.
2.2. DATA TRANSFORMATION
2.2.1. INDEPENDENCE OF DEPENDENT VARIABLES WITH RESPECT TO PREDICTION PARAMETERS:
CUSTOMER SUB TYPE (MOSTYPE) variable represents a combination of the age factor,
spending/buying attitude and social life style. Hence it can be used as a standalone factor for
predicting the potential buyers.
MARRIED PEOPLE are represented by MRELGE and the rest of the variables describing relationship
status can be ignored
2.2.2. INTERACTION VARIABLES DEFINITION FOR INDEPENDENT REPRESENTATION OF A
COMBINATION:
PURCHASING POWER CLASS * AVERAGE INCOME
Work category, income level and purchasing power class can be combined: the target group is
average income earners with a high profile work category who belong to a high purchasing power
class. This group is represented by the interaction of the independent variables Average Income
(MINKGEM) and Purchasing Power Class (MKOOPKLA).
PWAPART*PBRAND*PBYSTAND*PPERSAUT
People who are already interested in buying sophisticated insurance policies are most likely to
choose caravan insurance. The interaction (cross product) of the contributions to fire, third party,
social security and car insurance represents a high probability of getting caravan insurance.
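The two kinds of interaction variable described here can be sketched as follows. For two variables treated as continuous the interaction is their product, while for categorical variables each combination of levels forms its own cell. The function names and the sample records are hypothetical and are not taken from the data set.

```python
from collections import Counter


def income_purchasing_interaction(minkgem: float, mkoopkla: float) -> float:
    """Product term for two variables treated as continuous."""
    return minkgem * mkoopkla


def insurance_interaction_cell(pbrand: int, pbystand: int, ppersaut: int, pwapart: int) -> tuple:
    """Cell label for the four-way interaction of categorical contribution levels."""
    return (pbrand, pbystand, ppersaut, pwapart)


# Hypothetical records: (PBRAND, PBYSTAND, PPERSAUT, PWAPART) per customer
sample = [(3, 0, 6, 2), (3, 0, 6, 2), (0, 0, 0, 0), (4, 0, 5, 0)]
cell_counts = Counter(insurance_interaction_cell(*row) for row in sample)

print(income_purchasing_interaction(4, 3))
print(cell_counts.most_common(1))
```

Because every occupied cell of a categorical interaction receives its own coefficient, sparsely populated cells can produce very large standard errors in the fitted model.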
2.2.3. DIMENSION REDUCTION BY PRINCIPAL COMPONENT ANALYSIS:
Almost all variables used in the final model are largely independent of one another, each predicting
a different aspect of caravan insurance buying:
CUST_SUB_LIFESTYLE_REFLECTION – Social Lifestyle and Attitude towards Spending/investing
MRELGE – Marital Status
MAUT1 – Single Car Owner
MHHUUR – Rented Home Resident
PBRAND, PPERSAUT, PBYSTAND, PWAPART – Contribution towards different sophisticated and high
safety Insurance policies.
The two variables with a significant correlation are MINKGEM and MKOOPKLA, where a large overlap
in the population is logically expected: the potential purchasing class should have a high or middle
average income, which makes up most of the MINKGEM variable. These two dimensions can therefore
be reduced to a single component, which keeps the predictors close to orthogonal.
Factor Analysis was carried out and the extracted component was rotated and coded as a regression
variable in the data set.
This new variable PURCHASING_POWER_CLASS_INK represents the reduced component of
MINKGEM and MKOOPKLA through PCA.
The factor analysis results are attached in Appendix-3
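For two standardized variables, the first principal component has a closed form, which makes it easy to check the reduction. The sketch below assumes the PCA is run on the correlation matrix and uses the correlation of .452 between MINKGEM and MKOOPKLA reported in Appendix-3; the function names are hypothetical, and the score formula assumes standardized (unit variance) component scores.

```python
import math


def first_component_of_pair(r: float) -> dict:
    """First principal component of two standardized variables with correlation r > 0."""
    eigenvalue = 1.0 + r                     # largest eigenvalue of [[1, r], [r, 1]]
    loading = math.sqrt(eigenvalue / 2.0)    # loading of each variable on the component
    reproduced = loading * loading           # reproduced correlation and communality
    return {
        "eigenvalue": eigenvalue,
        "loading": loading,
        "communality": reproduced,
        "residual": r - reproduced,          # observed minus reproduced correlation
        "variance_explained": eigenvalue / 2.0,
    }


def component_score(z1: float, z2: float, r: float) -> float:
    """Standardized component score computed from the two z-scores (assumed form)."""
    return (z1 + z2) / math.sqrt(2.0 * (1.0 + r))


result = first_component_of_pair(0.452)
print({name: round(value, 3) for name, value in result.items()})
```

With r = .452 this gives a loading of about .852, a communality of about .726 and a residual of about -.274, which agrees with the values in Appendix-3.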
2.3. DATA ANALYSIS:
2.3.1. APPLYING LOGISTIC REGRESSION: (WITHOUT INCLUDING THE VARIABLE REDUCED BY PCA)
2.3.1.1. CHOSEN VARIABLES REPRESENTING INDEPENDENT FACTORS TO PREDICT THE CARAVAN
INSURANCE BUYERS:
The predictor variables are represented in 2 blocks of covariates for the dependent variable,
CARAVAN (0- Customers will not buy, 1- will buy)
BLOCK 1:
CUST_SUB_LIFESTYLE_REFLECTION (Social Life Style Reflector)
MRELGE (Marital Status)
MAUT1 (Car Ownership factor – Single Car Indicating potential income generation)
MHHUUR (Rented Home Residents)
BLOCK 2: (INTERACTION VARIABLES)
PBRAND, PBYSTAND, PPERSAUT, PWAPART (Customer Insurance Interest factor on sophisticated and
high Safety policies)
MKOOPKLA, MINKGEM (Purchasing Power Class with Average Income Level factor)
Method: FORWARD LR
Cut Off Value: 0.5
Probability Entry Criteria: 0.05
Probability Exit Criteria: 0.10
2.3.1.2. CHOOSING THE CATEGORICAL VARIABLES:
Variables whose values represent internal categories of customers are to be marked as categorical in
a logistic regression.
In our case, the contributions to the various insurance policies (PWAPART, PPERSAUT, PBRAND and
PBYSTAND) represent internal categories such as high, average and low. They are not evenly
distributed across their base value types, as seen in Fig 1.3, 1.4, 1.5 and 1.6, and hence they are
treated as categorical.
Customer Sub type (CUST_SUB_LIFESTYLE_REFLECTION) represents two main categories: Middle and
Upper Class, middle aged and senior citizens, high risk cultured liberal investors; and Distributed
age and social class, low risk cultured conservative investors. These values are not evenly spread,
as seen in Fig 1.2, so the variable is treated as categorical.
All other variables are treated as continuous, since each contains values corresponding to the single
category it stands for: MAUT1 (Owning a Single Car), MRELGE (Married), MHHUUR (Rented Home
Residents), MINKGEM (Average Income) and MKOOPKLA (Purchasing Power Class).
The forward stepwise selection finished in two steps in Block 2 and the prediction model was
generated. The model summary and predictor equation are given in Appendix-2.
2.3.1.3. GENERATED EQUATION BY LOGISTIC REGRESSION FOR PREDICTING POTENTIAL CARAVAN
INSURANCE BUYERS:
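ln(p / (1 - p)) = 0.073 (MAUT1) + 0.069 (MRELGE) - 0.018 (MHHUUR)
- 0.376 (CUST_SUB_LIFESTYLE_REFLECTION(1)) + 0.016 (MINKGEM by MKOOPKLA)
+ (PBRAND by PBYSTAND by PPERSAUT by PWAPART) - 2.924
where p is the predicted probability that CARAVAN = 1.
The Nagelkerke R square of the model is 0.193, i.e. the model accounts for about 19.3% of the
variation in the outcome. This is a pseudo R square measure of fit, not a classification accuracy.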
2.3.2. APPLYING LOGISTIC REGRESSION: (WITH THE VARIABLE REDUCED BY PCA)
With the new component extracted with PCA, PURCHASING_POWER_CLASS_INK, we can apply
logistic regression along with other variables.
The forward stepwise selection finished in the first step.
The predictor model is almost the same as the one above without the PCA component and is given by
the equation
ln(p / (1 - p)) = 0.093 (MAUT1) + 0.069 (MRELGE) - 0.024 (MHHUUR)
- 0.345 (CUST_SUB_LIFESTYLE_REFLECTION(1)) + 0.237 (PURCHASING_POWER_CLASS_INK)
+ (PBRAND by PBYSTAND by PPERSAUT by PWAPART) - 2.336
where p is the predicted probability that CARAVAN = 1.
The fit is almost unchanged, with a Nagelkerke R square of 0.192 (about 19.2%), which is a pseudo
R square measure of fit rather than a classification accuracy.
The model summary and predictor equation are given in Appendix-4.
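To show how the fitted equation turns into a prediction, the sketch below applies the logistic function to the coefficients quoted above and then applies the 0.5 cut-off. The coefficient of the four-way insurance interaction depends on the customer's combination of categories, so it is passed in as a parameter; the function names and the example customer are hypothetical.

```python
import math

# Coefficients quoted in the equation for the model with the PCA component
COEFFICIENTS = {
    "MAUT1": 0.093,
    "MRELGE": 0.069,
    "MHHUUR": -0.024,
    "CUST_SUB_LIFESTYLE_REFLECTION(1)": -0.345,
    "PURCHASING_POWER_CLASS_INK": 0.237,
}
CONSTANT = -2.336
CUT_OFF = 0.5


def predict_probability(values: dict, interaction_coefficient: float = 0.0) -> float:
    """Predicted probability of CARAVAN = 1 for one customer.

    interaction_coefficient is the B value of the customer's
    PBRAND by PBYSTAND by PPERSAUT by PWAPART cell (0.0 if none applies).
    """
    logit = CONSTANT + interaction_coefficient
    logit += sum(b * values.get(name, 0.0) for name, b in COEFFICIENTS.items())
    return 1.0 / (1.0 + math.exp(-logit))


def predict_buyer(values: dict, interaction_coefficient: float = 0.0) -> int:
    """Classify as buyer (1) or non-buyer (0) using the cut-off value."""
    return 1 if predict_probability(values, interaction_coefficient) >= CUT_OFF else 0


# Hypothetical customer, not taken from the data set
customer = {"MAUT1": 6, "MRELGE": 7, "MHHUUR": 2, "PURCHASING_POWER_CLASS_INK": 1.0}
print(round(predict_probability(customer), 3), predict_buyer(customer))
```

With a constant of -2.336 the baseline probability is below 0.1, so a cut-off of 0.5 classifies a customer as a buyer only when the other terms raise the log-odds above zero.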
3. MODEL INSIGHTS AND CONCLUSION:
The initial variables have been classified to reflect the socio-demographic, education, lifestyle,
income, car and insurance interest properties relevant to the product type. The logically predicted
significant variables have then been analyzed against the descriptive statistics of the target
variable in the data set using IBM SPSS. Dimension reduction, variable recoding and interaction
variable definition have been applied to obtain accurate and independent predictors. Logistic
regression then gives the required predictor model.
The model should be broad in its predictions, with sound real-world reasons for categorizing and
recoding variables, so that it holds for most cases and avoids overfitting.
Appendix -1
DESCRIPTIVE STATISTICS – CROSS TAB RESULTS
Fig 1.0. Rental Home Residents Caravan Insurance Buying Pattern
Fig 1.1. Purchasing Power Class Caravan Insurance Buying Pattern
Fig.1.2. Social Lifestyle based Caravan Insurance Buying Pattern (RECODED VARIABLE)
1 – Middle and Upper Class, middle aged and senior citizens, high risk cultured liberal investors
0 - Distributed age and social class, low risk cultured conservative investors
Fig 1.3. Third Party Insurance Buyers and Caravan Insurance buyers
Fig 1.4. Car Insurance Buyers and Caravan Insurance Buyers
Fig 1.5. Fire Insurance Contribution and Caravan Insurance Interest
Fig 1.6. Social Security Insurance Vs Caravan Insurance Buyers
Appendix -2: (Logistic Regression Summary and Last Convergence Results without PCA Component)
Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       2220.272(a)          .069                    .189
2       2210.325(a)          .070                    .193
a. Estimation terminated at iteration number 20 because maximum iterations has been reached.
Final solution cannot be found.
Converged Predictors and corresponding Coefficients in
binary logistic regression ( BLOCK 2 - Second Step )
Variables in the Equation
B S.E. Wald df Sig. Exp(B)
.
.
.
.
The Cross Product continuing up to (4x4 combinations)
a. Variable(s) entered on step 1: PBRAND * PBYSTAND * PPERSAUT * PWAPART .
b. Variable(s) entered on step 2: MINKGEM * MKOOPKLA .
Appendix -3: (Principal Component Analysis / Factor Analysis: Initial Components (Average Income and Purchasing Power Class) Vs Principal Component Extracted)
Correlation Matrix
                              MINKGEM    MKOOPKLA
Correlation        MINKGEM    1.000      .452
                   MKOOPKLA   .452       1.000
Sig. (1-tailed)    MINKGEM               .000
                   MKOOPKLA   .000
After Principal Component Analysis -
Component Matrix(a)
            Component 1
MINKGEM     .852
MKOOPKLA    .852
Extraction Method: Principal Component Analysis.
a. 1 components extracted.
Reproduced Correlations
                                     MINKGEM    MKOOPKLA
Reproduced Correlation    MINKGEM    .726(a)    .726
                          MKOOPKLA   .726       .726(a)
Residual(b)               MINKGEM               -.274
                          MKOOPKLA   -.274
Extraction Method: Principal Component Analysis.
a. Reproduced communalities
b. Residuals are computed between observed and reproduced correlations. There are 1 (100.0%)
nonredundant residuals with absolute values greater than 0.05.
APPENDIX -4:
After PCA with the Reduced Component – Binary Logistic Regression with other predictor variables
Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       2213.728(a)          .070                    .192
a. Estimation terminated at iteration number 20 because maximum iterations has been reached.
Final solution cannot be found.
Variables in the Equation
Step 1(a)
Variable                                                B         S.E.        Wald      df    Sig.    Exp(B)
CUST_SUB_LIFESTYLE_REFLECTION(1)                        -.345     .124        7.778     1     .005    .709
PURCHASING_POWER_CLASS_INK                              .237      .068        12.009    1     .001    1.268
MHHUUR                                                  -.024     .024        1.049     1     .306    .976
MAUT1                                                   .093      .040        5.315     1     .021    1.098
PBRAND * PBYSTAND * PPERSAUT * PWAPART                                        207.422   112   .000
PBRAND(1) by PBYSTAND(1) by PPERSAUT(1) by PWAPART(1)   -1.467    .779        3.549     1     .060    .231
PBRAND(1) by PBYSTAND(1) by PPERSAUT(1) by PWAPART(2)   -18.885   7541.184    .000      1     .998    .000
PBRAND(1) by PBYSTAND(1) by PPERSAUT(1) by PWAPART(3)   -1.627    .960        2.874     1     .090    .197
PBRAND(1) by PBYSTAND(1) by PPERSAUT(2) by PWAPART(1)   -19.134   40192.970   .000      1     1.000   .000
PBRAND(1) by PBYSTAND(1) by PPERSAUT(3) by PWAPART(1)   -3.743    1.257       8.862     1     .003    .024
PBRAND(1) by PBYSTAND(1) by PPERSAUT(3) by PWAPART(3)   -.218     1.065       .042      1     .838    .804
...
PBRAND(7) by PBYSTAND(4) by PPERSAUT(4) by PWAPART(1)   -19.341   23141.295   .000      1     .999    .000
PBRAND(8) by PBYSTAND(1) by PPERSAUT(1) by PWAPART(1)   -18.797   28317.506   .000      1     .999    .000
PBRAND(8) by PBYSTAND(1) by PPERSAUT(1) by PWAPART(3)   -19.114   40192.970   .000      1     1.000   .000
PBRAND(8) by PBYSTAND(1) by PPERSAUT(4) by PWAPART(1)   -19.252   28290.099   .000      1     .999    .000
PBRAND(8) by PBYSTAND(1) by PPERSAUT(4) by PWAPART(3)   -18.921   28301.176   .000      1     .999    .000
PBRAND(8) by PBYSTAND(1) by PPERSAUT(5) by PWAPART(3)   -19.476   40192.970   .000      1     1.000   .000
Constant                                                -2.336    .812        8.271     1     .004    .097
a. Variable(s) entered on step 1: PBRAND * PBYSTAND * PPERSAUT * PWAPART .
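The Exp(B) column is the exponential of the B column and can be read as an odds ratio: the factor by which the odds of CARAVAN = 1 change for a one-unit increase in the predictor. The short check below recomputes it for the non-interaction rows of the table; small differences arise because the reported B values are rounded to three decimals. The dictionary and function names are hypothetical.

```python
import math

# (B, reported Exp(B)) pairs copied from the Variables in the Equation table
REPORTED = {
    "CUST_SUB_LIFESTYLE_REFLECTION(1)": (-0.345, 0.709),
    "PURCHASING_POWER_CLASS_INK": (0.237, 1.268),
    "MHHUUR": (-0.024, 0.976),
    "MAUT1": (0.093, 1.098),
    "Constant": (-2.336, 0.097),
}


def odds_ratio(b: float) -> float:
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(b)


for name, (b, reported_exp_b) in REPORTED.items():
    print(f"{name}: exp({b}) = {odds_ratio(b):.3f} (reported {reported_exp_b})")
```

For example, a B of .237 for PURCHASING_POWER_CLASS_INK corresponds to an odds ratio of roughly 1.27, meaning the odds of buying rise by about 27% per unit of the component score.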